IABSD.fr/src/sys/netinet

Log

Commit Date Message
148caabb 2025-07-08 00:47:41 remove unneeded includes; ok bluhm@
37d47b61 2025-07-07 00:55:15 remove prototypes for removed functions
ad175204 2025-07-02 05:44:46 have route sourceaddr use RTF_GATEWAY to decide when to kick in. previously it used !RTF_HOST and !RTF_LLINFO. the intention with route sourceaddr was to use it except when a peer was on link. however, it is possible to have host routes (ie, RTF_HOST) via a gateway, which ended up not using the route sourceaddr when it should have. by definition any route with RTF_GATEWAY set is not directly connected, so using it seems to better suit what route sourceaddr is doing. discussed with and ok claudio@ denis@ tested by denis@
3ae74042 2025-06-30 12:43:22 Unlock IPCTL_DEFTTL case of ip_sysctl(). Read-only access at initialization time of interfaces, PCBs and IP packets. ok bluhm
6036672b 2025-06-29 00:33:46 make the argument to ether_ntoa const. this drives me nuts when i want to print something out of what's already const. casting it works, but feels gross. ok guenther@ tb@ deraadt@ enh says this is already in bionic/glibc/musl
2a3a2a0f 2025-06-26 21:46:40 Fix TCP netstat counter. In my previous commit I forgot an ASSIGN() in tcp_sysctl_tcpstat(). Then the counter index was not incremented. As a result, the values in netstat -s were off by one in position relative to their descriptions. reported and OK jan@
b71ef036 2025-06-25 10:33:53 Push netlock down to mrt{,6}_sysctl_mfc(). Move copyout() and sleeping M_WAITOK malloc(9) out of netlock. Keep exclusive netlock instead of shared because the walker callback does `rt_gateway' dereference. Leave locking relaxation to the further diffs. ok bluhm
72282bea 2025-06-24 18:05:51 Unlock IPCTL_IPPORT_* cases of ip_sysctl(). Corresponding variables accessed read-only only within in_pcbpickport(). ok bluhm
1213dbc0 2025-06-23 20:59:25 Unlock IPCTL_MTUDISC case of ip_sysctl(). `ip_mtudisc' is an atomically accessed boolean, so allow only the values 0 and 1 to be set. Also, while `ip_mtudisc' is 0, rt_timer_queue_flush() is triggered every time, even on read access. There is no reason for that, so flush the queue only if this thread successfully assigned the value 0. rt_timer_queue_flush() has to be serialized with the netlock. ok bluhm
251e27b0 2025-06-23 20:56:38 Unlock IPCTL_MRTPROTO case of ip_sysctl(). We do read-only access from sysctl(2) interface. Also `ip_mrtproto' is immutable. ok bluhm
88803725 2025-06-23 12:05:45 Remove ip6_divert.h header file. All redundant code for IPv6 divert has been removed or merged into ip_divert.c. What remains is the ip6_divert.h header file. Sysctl does not exist anymore, the leftovers are #ifdef _KERNEL. The few IPv6 specific declarations can be easily moved into ip_divert.h and the IPv6 header deleted. OK mpi@ mvs@
1df70e8b 2025-06-23 09:16:32 Move IP{,V6}CTL_MULTIPATH cases of ip{,6}_sysctl() out of netlock. Add missing membar_producer() before `rtgeneration' update to invalidate route cache after ipmultipath changes. Feedback and ok from bluhm.
4d34df45 2025-06-21 14:21:17 Move IP{,V6}CTL_MTUDISCTIMEOUT cases of ip{,6}_sysctl() out of netlock. They are identical, so unlock them both. Use the temporary `ip_sysctl_lock' rwlock(9) for value assignment and the following rt_timer_queue_change() serialization. ok bluhm
a89c75a2 2025-06-20 05:08:07 have icmp_reflect use route sourceaddr. this makes it behave like the in_pcb source address selection. ok claudio@ bluhm@
29a15cf4 2025-06-18 17:45:07 Remove sysctl for divert6 recv and send space. UDP has a common sysctl for recv and send space, but divert had a special knob for inet6. Remove net.inet6.divert.recvspace and net.inet6.divert.sendspace sysctl and use the net.inet.divert values instead. OK dlg@ mvs@
eddb43a3 2025-06-18 16:15:46 Handle sockets that are closing in parallel. After unlocking, sockets may close on one CPU while others are processing packets. For TCP the socket lock prevents this. Add a netstat counter for dropped packets if an inpcb has no associated socket after the lock has been acquired. UDP, raw IP, and divert sockets rely only on reference count. They do not take socket lock as they have little state to protect and want to avoid the performance penalty. This means that inp->inp_socket->so_pcb can become NULL anytime. Remove a kassert that checks this value. All other parts of the socket are either read only or protected by mutex. Sockets may still receive data into socket buffer while the socket is closing. After releasing all references, sorele() calls m_purge() and has to deal with it. OK mvs@
154de61c 2025-06-17 03:48:14 white space tweak
c5e46c79 2025-06-17 01:35:07 guard tcp stat updates with splnet now that interrupts can update them. some tcp stack lso/tso statistics are now updated by drivers that implement tso/lro, and this code can run from an interrupt handler. eg, ifq_restart from a transmit completion interrupt can call a drivers start routine, which can preempt the tcp stack running on behalf of a system call. the tcp stats are per cpu counters so they can be updated without coordinating with each other, but, as per the counters manpage, updates on a cpu have to be serialised. this is to prevent corruption of at least the generation number each cpu uses to version the counter updates. since all the tcp stats are a single set of counters, any update has to prevent preemption. ok claudio@ bluhm@
2be43eb3 2025-06-16 07:11:58 Move atomically accessed `udpencap_enable' and `udpencap_port' sysctl(2) variables out of netlock. ok bluhm
939e9288 2025-06-12 20:37:56 Carefully prune sysctl nodes with #ifndef SMALL_KERNEL to recover space lost to other bloaty software. ok bluhm
6bb594b9 2025-06-12 19:10:17 Fix use-after-free of inpcb. In tcp_input.c rev 1.451 a socket leak was fixed which introduced a use-after-free of the inpcb. If syn_cache_get() goes to the resetandabort case, the listen inpcb is stored in listeninp and inp. There the call to in_pcbunref(inp) accidentally frees the listen socket. After copying inp to listeninp, set inp to NULL. Reported-by: syzbot+42a7b662604561ceb05b@syzkaller.appspotmail.com Reported-by: syzbot+05b4b109c890334897af@syzkaller.appspotmail.com OK deraadt@ claudio@
45ccd79b 2025-06-12 07:17:00 put revarprequest() under #ifdef NFSCLIENT. only called from revarpwhoarewe() which is under #ifdef NFSCLIENT
0afa2cf9 2025-06-11 14:30:07 Fix socket leak in TCP SYN cache. My socket reference counting commit tcp_input.c rev 1.450 has introduced a socket leak. This resulted in mbufs lying in the socket buffers not being freed. The TCP SYN cache called soref() to avoid freeing the socket when it was working with it. But the unref got lost when socket reference count moved into the inp. We have to hold the reference over tcp_drop() in the abort case to unlock the socket afterwards. But tcp_drop() removes the inpcb from the table and drops this reference. Call in_pcbref() instead of soref() to have references for both inp and so. After tcp_drop() call in_pcbsounlock() and in_pcbunref(). Then the memory is freed in the final step. While there move m_freem() out of the socket lock. When syn_cache_get() returns the socket successfully, keep the reference count on the inp. Then tcp_input() can work with this inpcb and unref it at the end. OK mvs@; commit it claudio@; tested by job@
32cfac28 2025-06-08 17:06:19 Remove TCP timeout reaper. The TCP timeout reaper is no longer necessary. Idea was to prevent timeout handlers from using TCP sockets that were already closed. But now tcp_close() runs with socket lock and tcp_timer_enter() checks that intotcpcb(inp) is not NULL while holding the socket lock. So timeout cannot run after TCP has been closed. OK mvs@
94371bfe 2025-06-06 13:13:37 Simplify IP divert defines. No one wants to override divert packet defines via compiler options. Simply move them to ip_divert.h header file. OK claudio@ mvs@
a433cb91 2025-06-04 12:37:00 Use struct divstat for both IPv4 and IPv6. Code for divert and divert6 is very similar, reduce the difference further. Struct divstat and div6stat are basically the same, so remove the latter. divert6_sysctl_div6stat() can also go away. For now keep distinct divert counters for IPv4 and IPv6, but they are counting the same things. OK mvs@
46196a20 2025-06-03 16:51:26 Reference count the socket within internet PCB. Instead of protecting inp_socket with special mutex when being freed and reference counting in in_pcbsolock_ref(), better reference count inp_socket. When an inpcb is created, inp_socket is initialized with a socket pointer that is reference counted. When the inpcb is freed, release the socket reference. As inpcb is reference counted itself, we can always access the socket memory when we have a valid inpcb counter. in_pcbsolock() can just lock the socket, no reference counting is needed. This reduces contention a bit. As in_pcbdetach() is protected by socket lock, in_pcbsolock() has to check so_pcb while holding the lock to skip sockets that are being closed. OK mvs@
410349f9 2025-06-03 14:49:05 Move atomically accessed `esp_enable' sysctl(2) variable out of netlock. ok bluhm
bb8746de 2025-05-28 06:27:04 avoid some integer overflow by casting before multiplying.
9dcd3de5 2025-05-27 07:52:49 Convert IP6_EXTHDR_GET() macro to ip6_exthdr_get() inline function. Make the new function static inline so it can stay in the same netinet/ip6.h header file. Returning a void pointer avoids all the type casts. Convert the panic("m_pulldown malfunction") to a kassert and move it into m_pulldown(). Keep the offset and length parameter int as this type is what m_pulldown() expects. OK claudio@
442fb2d5 2025-05-24 12:27:23 Pass mbuf pointer to IP6_EXTHDR_GET() macro. Passing an mbuf pointer instead of mbuf to IP6_EXTHDR_GET() simplifies memory management. m_pulldown() may free the mbuf and the macro sets m to NULL. This is not obvious, it is cleaner to pass a pointer if the value may be modified. Another reason for this change is that I am slowly converting the protocol input functions to deal with mbuf pointer instead of mbuf. m_pullup() and m_pulldown() may modify or free the mbuf, and then dangling pointers are lying around in the callers. We had problems with that before. Prefer to adjust all pointers, which is possible when using *mp instead of m and setting mp to NULL after free. Then either a NULL dereference happens or a double free is ignored. Both are much easier to deal with than a use after free. OK claudio@
403e61e5 2025-05-23 23:39:30 replace timeout_add_tv with timeout_add_usec. this changes the calculation of the interval slightly, but in practice it should not make a difference. this is a step toward deprecating timeout_add_tv. ok mpi@ bluhm@
40adde36 2025-05-22 03:12:33 Fix trailing whitespace.
f4df864d 2025-05-22 03:09:00 Remove redundant NULL check from divert_packet() that is already in in_pcbunref().
fa0a14c8 2025-05-21 09:33:48 Get rid of unused `pr_hardlimit_warning', `pr_hardlimit_ratecap' and `pr_hardlimit_warning_last'. ok dlg tedu
1363fb03 2025-05-20 18:41:06 Unlock TCPCTL_REASS_LIMIT and TCPCTL_SACKHOLE_LIMIT cases of tcp_sysctl(). Use the pool lock to serialize pool_sethardlimit() with the rest pool layer. Also use `sysctl_lock' to serialize pool and sysctl variable modification. Since the whole tcp_sysctl() became mp-safe, move it out of sysctl locks. ok bluhm
b8a5eea1 2025-05-20 18:40:09 Move IPCTL_SOURCEROUTE case of ip_sysctl() out of netlock. It is an atomically accessed integer. sysctl_securelevel_int() is mp-safe. ok bluhm
23f18c79 2025-05-20 05:51:43 Call in_pcbselsrc() and in6_pcbselsrc() with const sockaddr parameter. Functions in_pcbselsrc() and in6_pcbselsrc() use the destination sockaddr to determine a suitable source address. As the destination is only used for lookup, it should never be modified. Use const to let the compiler verify that. On the way down the callstack, also convert a bunch of other lookup functions to const. OK kn@
206922f6 2025-05-19 06:50:00 nd6log is gone, mop up nd6_debug sysctl. input & OK kn, OK bluhm
f4fc7468 2025-05-19 02:27:57 Disable TCP softlro for small kernels. Saves 6.7K object size. OK kn@ jan@
76ae5c94 2025-05-18 03:18:36 ip_mroute: remove unused origin parameter of mfc_find() ok mvs
7f65c2e6 2025-05-14 14:32:15 Move `encdebug' sysctl(2) variable out of netlock. It is widely used in the DPRINTF() macros, but disabled by default. We don't really need to enforce loading `encdebug' value each time, but at least it is consistent that way. ok bluhm
976cf1eb 2025-05-13 20:06:10 Move `ipsec_expire_acquire' sysctl(2) variable out of netlock. It is atomically accessed integer local to ipsp_acquire_sa(). ok bluhm
e5a8ab83 2025-05-13 17:27:53 Move `ipsec_keep_invalid' sysctl(2) variable out of netlock. It is atomically accessed integer local to reserve_spi(). ok bluhm
ee24dc58 2025-05-13 09:17:41 Unlock TCPCTL_ROOTONLY and TCPCTL_BADDYNAMIC cases of tcp_sysctl(). The copy-paste from udp_sysctl(). It is not reasonable to combine this code with UDP and make it more complex. ok bluhm
254506a7 2025-05-13 09:16:33 Move `ipsec_require_pfs' sysctl(2) variable out of netlock. It is atomically accessed integer local to pfkeyv2_acquire(). ok bluhm
39947927 2025-05-12 17:21:21 Unlock UDPCTL_ROOTONLY and UDPCTL_BADDYNAMIC cases of udp_sysctl(). Use temporary buffer to exchange data with the userland. So we need to take exclusive netlock only if we modify values. The shared netlock is still required even if we read values, but it doesn't stop packets processing in the most paths. ok bluhm
89ae69af 2025-05-12 17:20:09 Move bunch of `ipsecctl_vars' variables out of netlock. The `ipsec_soft_allocations', `ipsec_exp_allocations', `ipsec_soft_bytes', `ipsec_exp_bytes', `ipsec_soft_timeout', `ipsec_exp_timeout', `ipsec_soft_first_use' and `ipsec_exp_first_use' are local to pfkeyv2_acquire(). They are loaded together during `sadb_comb' initialization, so it makes sense to unlock them together too. It looks reasonable to drop 'ipsec_' prefix for local variables, otherwise the names are too long, and I think this should not make code reading worse. For consistency I dropped 'ipsec_' prefix for `ipsec_def_*_local' too. ok bluhm
65799e18 2025-05-12 05:07:17 remove unused extern, gcc warned about an unused variable
7fb52fc4 2025-05-09 19:53:41 Move ipsec-enc-alg, ipsec-auth-alg and ipsec-comp-alg sysctl variables out of netlock. Use integers instead of strings for in-kernel representation. So we can update them atomically. Also we remove strcmp() from the hot (?) path. Note that the behavior changed slightly: unlike before, setting incorrect values is now denied. In such a case the EINVAL error will be returned to userland. ok bluhm
64ff7099 2025-05-09 17:40:08 In TCP softlro split compare and concat functions. Split the function that compares two mbuf chains from the actual concatenation. This clarifies where the point of no return is. Now we have tcp_softlro_check() to figure out if a single packet is suitable for LRO. tcp_softlro_compare() takes two packets and returns whether they fit together. tcp_softlro_concat() unconditionally does the concatenation. The main function tcp_softlro_glue() is the wrapper around them and should be called by drivers. OK jan@
b9e56c4e 2025-05-09 14:43:47 mp-safe multicast stats with per cpu counters ok mvs, bluhm
b8e20da2 2025-05-08 20:22:56 Only accept simple TCP timestamp options for softlro. Instead of parsing all TCP options, consider only two cases. No options or only timestamps at aligned position. The latter is the common case which must be fast. Ignoring other packets with complicated options will not decrease speed. In tcp_softlro_check() only packets without options or simple timestamps are considered for LRO. Enforce that timestamps are increasing to detect sequence number wraparounds. When concatenating TCP packets, take the timestamps from the tail which are more recent. OK jan@
d958b4f6 2025-05-07 14:10:19 Cache socket lock during TCP input. Parallel TCP input is running for a few days now and looks quite stable. Final step is to implement caching of the socket lock. Without large receive offloading (LRO) in the driver layer, it is very likely that consecutive TCP segments are in the input queue. This leads to contention of the socket lock between TCP input and socket receive syscall from userland. With this commit, ip_deliver() moves all TCP packets that are in the softnet queue temporarily to a TCP queue. This queue is per softnet thread so no locking is needed. Finally in the same shared netlock context, tcp_input_mlist() processes all TCP packets. It keeps a pointer to the socket lock. tcp_input_solocked() switches the lock only when the TCP stream changes. A bunch of packets are processed and placed into the socket receive buffer under the same lock. Then soreceive() can copy huge chunks to userland. The contention of the socket lock is gone. On a 4 core machine I see between 12% to 22% improvement with 10 parallel TCP streams. When testing only with a single TCP stream, throughput increases between 38% to 100%. tested by Mark Patruck a while ago; OK mvs@
9c5af36d 2025-05-04 23:05:17 Fix race in TCP SYN cache get. Setting the local and foreign address of a newly created socket did not happen atomically. During socket setup there was a small window for an inpcb that had a bound laddr, but faddr was empty. Although both listen and new socket are locked during syn_cache_get(), in_pcblookup_listen() could find the inpcb of the new socket. When a SYN packet of another connection arrived in parallel, it was processed with the socket under construction instead of the listen socket. Setting both faddr and laddr together in in_pcbset_addr() fixes the race. The relevant code has been copied from in_pcbconnect(). The table mutex inpt_mtx guarantees that in_pcblookup_listen() finds the listen socket. bug found and fix tested by Mark Patruck; OK mvs@
4a6e8021 2025-04-29 20:31:42 Remove dead flag from TCP SYN cache. The TCP SYN cache timer uses SCF_DEAD flag to detect closed listen socket. Note that syn_cache_rm() is setting sc_inplisten to NULL in the same atomic section where SCF_DEAD is set. Also syn_cache_timer() uses the SYN cache mutex to check sc_inplisten and SCF_DEAD together. Eliminate SCF_DEAD and rely on existing pointer to listen socket. OK mvs@
dd3d2b7b 2025-04-29 14:53:22 In TCP software LRO remove potential ethernet padding. With some network interfaces the RX mbuf returned from the driver may contain ethernet padding. My tests did not see this behavior with ixl(4) hardware. But inserting ethernet padding into a TCP stream would be completely wrong, so adding a sanity check anyway. Remove content of the mbuf that is behind the TCP payload. OK jan@
642ec8ab 2025-04-28 21:12:35 Comment socket lock in TCP SYN cache fields. TCP SYN cache has been switched from net lock to socket lock a while ago. Adjust forgotten comments for locking in struct syn_cache. OK mvs@
58cc3164 2025-04-26 13:58:08 Run TCP input in parallel on multiple CPUs. Mark the protocol input function tcp_input() as MP-safe. Then it is called directly from the IP deliver loop with shared net lock. Do not enqueue TCP packets to wait for exclusive net lock. This results in more contention on the socket lock. Throughput optimization for that problem could be done later. tested by Mark Patruck; OK mvs@
9094b91d 2025-04-23 17:52:12 Refactor TCP softlro extract header. The header of the tail mbuf has already been extracted. Do all header extraction in tcp_softlro_glue() and pass them down to tcp_softlro(). Check port number before address as IPv6 address comparison is expensive. The empty segment check has already been done in tcp_softlro_check(), remove it from tcp_softlro(). OK jan@
1b07127f 2025-04-22 22:34:27 Fix whitespace introduced in previous commit.
e684669a 2025-04-22 19:59:55 Refactor TCP softlro checks. In TCP software LRO do all checks that affect only a single packet before looping over the enqueued packets. Fragment and TCP protocol checks can be removed, ether_extract_headers() already does that. After the check is successful, store the payload length in ph_mss. This is done later anyway. If this mbuf field contains a positive length, we know that the check has already been done. OK jan@
f427bb86 2025-04-21 09:54:53 Consistent naming for TCP softtso and softlro. If the hardware does not support TSO or LRO, an alternative implementation in software is used. Call these functions tcp_softlro_glue() and tcp_softtso_chop() and use the same order of parameters. OK jan@
d908cbc5 2025-04-16 17:17:06 Add a software implementation of TCP Large Receive Offload. This diff adds a SoftLRO implementation in tcp_input.c as the counterpart of tcp_chopper() (which is a SoftTSO implementation) in tcp_output.c. We just use SoftLRO in ixl(4) for now, but default off, because of some unclear bugs in ixl(4) and our network stack. The mbuf chain length produced by SoftLRO is limited to a maximum of 8. This avoids m_defrag() calls, which lead to races in ixl(4) tx/rx interrupts and mbuf handling. We also use the packet type field in the receive descriptors to differentiate between TCP and other packets. So, we have two mbuf lists and non-TCP traffic is not influenced by SoftLRO. This is not necessary for all drivers, but helps to avoid a decrease of UDP bulk traffic. Thanks to all the testers: Mark Patruck, Yuichiro NAITO, Hrvoje Popovski, jj@, lucas@ and bluhm@. With tweaks from dlg@ and bluhm@. ok bluhm@
96fb1b48 2025-04-16 12:51:11 Take socket lock in TCP input. In preparation to run tcp_input() in parallel, the socket has to be locked while processing incoming TCP packets. After inpcb lookup, in addition take the socket lock and increase the socket reference counter. The function in_pcbsolock_ref() upgrades from a locked inpcb to a locked socket by using mutex and refcount. syn_cache_get() unlocks the listen socket and returns a locked socket, syn_cache_add() relies on socket lock. With this commit, exclusive net lock is still held. TCP throughput gets slower by 6% due to the additional mutex and refcount. But I want to see if locking and refcount works, before switching tcp_input() to shared net lock. Running TCP in parallel will more than compensate the cost of locking. tested as part of parallel TCP input by Mark Patruck, Hrvoje Popovski OK mvs@
c677dfb5 2025-03-12 23:27:17 Set M_BCAST for packets going to 0.0.0.0 or 255.255.255.255, so the upper layers can handle them properly. Found by IIJ. ok bluhm
72aeecac 2025-03-12 01:44:27 Fix the problem that skips the various checks for packets for broadcast mistakenly introduced by the revision 1.103 imported from netbsd 24 years ago. Especially, the problem has allowed one to send broadcast packets without the SO_BROADCAST option. Found by IIJ. ok bluhm
c328ddbe 2025-03-11 15:31:03 Get rid of unused `so' argument in sbappendaddr(). No functional changes. ok bluhm
5eb2beee 2025-03-11 15:29:36 The *.{send,recv}space sysctl values should not exceed the SB_MAX limit. Adjust it for udp(4) and divert(4) sockets. Otherwise it will be impossible to create such sockets regardless of actual resource usage. The UNIX sockets already have the SB_MAX for upper limit. ok claudio bluhm
39a611d6 2025-03-10 15:11:46 Get rid of unused `so' argument in sbappendstream(). No functional changes. ok bluhm
b7808d31 2025-03-04 15:11:30 Pass struct netstack to sec_input(). Kernel crashed in route6_cache() due to bogus netstack. ipsec_common_input_cb() was called with netstack pointer NULL, but in ipv6_input() the pointer was 1. In between lies sec_input() that was called without netstack pointer, but passed an arbitrary value to if_vinput(). There was a parameter missing in its prototype. The buggy code did compile due to a missing include file. crash reported by Mikolaj Kucharski; OK claudio@
1c7441f2 2025-03-02 21:28:31 Cache route per softnet thread with netstack. Introduce struct netstack to pass memory down the network stack. Currently this is only implemented as part of struct softnet serving as thread local storage. Especially the interface input and IP protocol input functions if_input() and pr_input() have been extended by a netstack parameter. if_input_process() selects the netstack pointer of the currently running softnet thread and passes it to the input functions. The first user of this storage is the route cache in ipv4_input() and ipv6_input(). For consecutive packets it can reuse the route to the same destination. Cache invalidation via route generation number has already been implemented before. OK claudio@ dlg@
eafd8c8c 2025-03-01 21:03:19 Fix TCP checksum for IPv6 packets with extension headers. When tcp_input() called in6_cksum() to verify the TCP checksum, the IPv6 header length instead of the TCP header offset was used. Hence TCP packets with IPv6 extension headers were never accepted. from Giovanni Pimpinella; OK sashan@
0f757f3d 2025-02-24 20:16:14 IPsec path MTU uses routing table before pf switches it. If pf(4) switches the rtable, the route for path MTU discovery must be generated in the original routing table. For that ip_output() keeps the original rtableid. Then a local TCP socket uses the correct route. This did not work when IPsec was involved. Pass orig_rtableid also to ip_output_ipsec_send() to use the same logic in ip_output_ipsec_pmtu_update(). A similar change is necessary for ip6_output() and ip6_forward(). OK markus@
49f05fab 2025-02-24 09:40:01 Refactor LRO turn off code. It's easier to turn off LRO via ioctl calls inside of several hardware and pseudo interfaces. Thus, we avoid manipulating internal data structures from the outside and avoid unnecessary reinitializations. Tested by bluhm@ OK bluhm@
4039bfa0 2025-02-17 20:31:25 Handle RTF_GATEWAY route with rt_gwroute NULL. rtrequest_delete() calls rt_putgwroute() to set rt_gwroute to NULL. When another thread holds a reference to such a route, an assertion failed in rtisvalid() and rt_getll(). Handle this case, rt_getll() may return NULL then. OK claudio@
7a23ec91 2025-02-17 12:46:02 Toeplitz hash for UDP and IPv6 TCP output. IPv4 TCP output uses the toeplitz hash as flow id. It is calculated in in_pcbconnect() and can be used for all connected sockets. Add it for UDP and TCP IPv6, too. As pf calculates its own hash, this affects only setups with pf disabled. It gives an improvement in traffic distribution over the queues and 20% performance increase with UDP send on v4/v6 and TCP send on v6 without pf. tested and OK sf@
2f5a0aea 2025-02-17 08:56:33 Get rid of unused `so' argument in sbreserve(). ok bluhm
3411e7e2 2025-02-16 11:39:28 Revert SMR protection of rt_gwroute. Using a smr_barrier() in rt_putgwroute() slows down adding routes. This is the hot path for BGP router. Syncing the FIB is now taking ages and the system is close to unresponsive in that time. found by claudio@
51442b8a 2025-02-14 13:14:13 add tunneldf support to sec(4). sec(4) is a very thin wrapper around the existing ipsec output processing for encapsulating packets, and inherited the behaviour that the DF flag was propagated from the encapsulated packet to the outer ip header. this means if the sec(4) interface has a large mtu and is carrying packets with DF set over a network that can't transport large(r) packets, these packets are effectively dropped. ipsec applied via the SPD copes with this by having SAs figure out the path mtu and using that when applying policy, but sec(4) is an interface, so the network stack uses the interface mtu rather than the associated SA path mtu. rfc4459 discusses this kind of problem and offers a variety of solutions. this implements one of the simpler options, which is to allow the tunnel endpoints to manage the DF regardless of the payload and reassemble the encapsulated packets. to actually do this, ipsec output packet processing has to be able to take an argument that says how you want DF to be handled. in the future we're going to look at how we can use the path mtu determined by the ipsec SA to try and implement one of the other solutions from the RFC, which is to signal the lower mtu to the sources of tunnelled packets. tested by and ok claudio@
dddbedba 2025-02-13 21:01:34 Fix route entry race when accessing rt_gwroute. Kassert in rt_getll() was triggered as rt_gwroute could be NULL. Problem was introduced by shared netlock around tcp_timer_rexmt(). PMTU discovery calls rtrequest_delete() which was missing proper locking around rt_gwroute. As rt_getll() is called by ARP and ND6 resolve in the hot path, use SMR to provide the pointer to rt_gwroute lockless. Reference count of the returned route is incremented, caller has to free it. Modifying rt_gwroute or rt_cachecnt in rt_putgwroute() is protected by per route lock. OK mvs@
bd92615b 2025-02-12 21:28:10 Use socket lock for inpcb notify. The notify and ctlinput functions were not MP safe. They need socket lock which can be acquired by in_pcbsolock_ref(). Of course in_pcbnotifyall() has to be called without holding a socket lock. Rename in_rtchange() to in_pcbrtchange(). This is the correct namespace and the functions take care of the inpcb route. OK mvs@
ba453f40 2025-02-10 15:06:57 Fix TCP maximum segment size with IPsec. When IPsec is used, if_get(m->m_pkthdr.ph_ifidx) returns enc0. Its if_mtu is 0 which results in negative mss. After fixing a signed integer comparison bug with imax(), tcp_mss_adv() used mssdflt, which is 512. So the TCP SYN cache sent packets with a small maximum TCP segment number. The underlying problem is, that SYN cache used the incoming interface m->m_pkthdr.ph_ifidx for the outgoing MTU. The correct way is to use the route of the destination address like tcp_mss() does it. The SYN cache has a struct route which can be used. An additional route lookup does not happen as the route is cached and will be reused by ip_output(). OK mvs@
1a5460a4 2025-02-06 23:53:55 Never install path MTU routes for IPsec transport mode SAs. Prevent installation of PMTU-routes for transport mode ESP-SAs in both cases when ip_output_ipsec_pmtu_update() gets called. from markus@
ea68d214 2025-02-06 13:40:57 Call pru_attach() with shared solock() within socreate(). The internal internet attach functions look MP safe. Do external unlocking and release within error path because sofree() relies on exclusive solock(). ok bluhm
e489f29b 2025-02-06 13:39:31 Get rid of unused `so' argument in sbflush(). No functional changes. ok bluhm
8c2fa8a9 2025-02-05 10:15:10 Fix race in inpcb mutex to socket lock conversion. Testing parallel TCP input revealed a race in in_pcbsolock_ref(). The mutex inp_sofree_mtx is used to reliably get the socket from the inpcb and refcount it. Then the socket lock is used to prevent further calls to in_pcbdetach() or sofree(). But between releasing the inpcb mutex and acquiring the socket lock, the inpcb could detach. So when holding the socket lock, make sure that the inpcb is still associated with the socket. Otherwise locking the socket belonging to the inpcb has failed. OK mvs@
5d1d4693 2025-01-31 11:48:18 Get rid of unused `so' argument in sbdrop(). No functional changes. ok bluhm
1d90c3fb 2025-01-30 14:40:50 Get rid of unused `so' argument in sbspace(). No functional changes. ok bluhm
fd7b854c 2025-01-30 08:52:33 Get rid of unused `so' argument in sb_notify(). No functional changes. ok bluhm
5d680383 2025-01-26 17:21:26 Syn cache calls TCP drop instead of socket abort. Instead of calling socket layer soabort() and then down via tcp_abort() which ends in tcp_drop(), call tcp_drop() directly from the TCP syn cache. The errno is not relevant as the new socket is dropped before it can be reached from userland. OK mvs@
6c6f11bf 2025-01-25 23:55:32 Rename old socket to more specific listen socket in TCP syn cache. OK mvs@
e2bf3321 2025-01-25 22:06:41 Keep socket lock in sonewconn() for new connection. For TCP input unlocking we need a consistent lock of the newly created socket. Instead of releasing the lock in sonewconn() and grabbing it again later, it is better that sonewconn() returns a locked socket. For now only change syn_cache_get() which calls in_pcbsounlock_rele() at the end. Following diffs will push the unlock into tcp_input(). OK mvs@
f168c03c 2025-01-25 02:06:40 Check the source address for the tunneled packets. ok mvs
92546b59 2025-01-23 12:51:51 Fix inpcb leak in divert attach. All other internet socket attach functions first call soreserve() and then in_pcballoc(). This avoids an in_pcbdetach() in the error path. Current divert attach code may leak the inpcb. Reorder calls to allow simple error handling. OK mvs@
cbb583bb 2025-01-22 09:37:06 Convert bcopy() to memcpy() in tcp_respond(). Struct ip, ip6, and th point to locations on m, which is new memory from m_gethdr(). There is no overlapping memory, so use memcpy. from dhill@; OK mvs@
e835bce2 2025-01-16 11:59:20 Remove net lock from TCP sysctl for keep alive. Keep copies in seconds for the sysctl and update timer variables atomically when they change. tcp_maxidle was historically calculated in tcp_slowtimo() as the timers were called from there. Better calculate maxidle when needed. tcp_timer_init() is useless, just initialize data. While there make the names consistent. input sthen@; OK mvs@
680a5d21 2025-01-14 13:49:44 Remove exclusive net lock from TCP timers. TCP timers can run with shared netlock and socket lock. Use common tcp_timer_enter() and tcp_timer_leave() that lock the socket and do reference counting. Then inpcb and socket always exist. input and OK mvs@
5d0c3a6e 2025-01-10 20:19:03 Fix indent.
4e5e13a2 2025-01-09 16:47:24 Run TCP sysctl ident and drop with shared net lock. Convert exclusive net lock for TCPCTL_IDENT and TCPCTL_DROP to shared net lock and push it down into tcp_ident(). Grab the socket lock there with in_pcbsolock_ref(). Move socket release from in_pcbsolock() to in_pcbsounlock_rele() and add _ref and _rele suffix to the inpcb socket lock functions. They both lock and refcount now. in_pcbsounlock_rele() ignores NULL sockets to make the unlock path in error case simpler. Socket lock also protects tcp_drop() and tcp_close() now, so the socket pointer from inpcb may be NULL during unlock. In tcp_ident() improve consistency check of address family. OK mvs@