IABSD.fr/src/sys/net

Log

Commit Date Message
49f05fab 2025-02-24 09:40:01 Refactor LRO turn off code. It's easier to turn off LRO via ioctl calls inside of several hardware and pseudo interfaces. Thus, we avoid manipulating internal data structures from the outside and avoid unnecessary reinitializations. Tested by bluhm@ OK bluhm@
45a54130 2025-02-21 22:21:20 Move kassert from resolve to add case in rtrequest(). In case RTM_RESOLVE there is already an assertion about ifa_ifp != NULL. Move it down after the fallthrough to cover also RTM_ADD. This should give a better hint from syzkaller what is going wrong. Reported-by: syzbot+f77fe03091e5efd9aaf9@syzkaller.appspotmail.com OK claudio@
9a71ea36 2025-02-21 06:20:12 replace "if (!task_del) taskq_barrier" with "taskq_del_barrier". as per src/sys/kern/kern_task.c r1.36, it's possible for a task to be re-added while it's currently running. in this situation the "if (!task_del)" skips the barrier but doesn't do anything about the currently running code, which taskq_del_barrier properly handles.
4039bfa0 2025-02-17 20:31:25 Handle RTF_GATEWAY route with rt_gwroute NULL. rtrequest_delete() calls rt_putgwroute() to set rt_gwroute to NULL. When another thread holds a reference to such a route, an assertion failed in rtisvalid() and rt_getll(). Handle this case, rt_getll() may return NULL then. OK claudio@
3411e7e2 2025-02-16 11:39:28 Revert SMR protection of rt_gwroute. Using a smr_barrier() in rt_putgwroute() slows down adding routes. This is the hot path for BGP router. Syncing the FIB is now taking ages and the system is close to unresponsive in that time. found by claudio@
51442b8a 2025-02-14 13:14:13 add tunneldf support to sec(4). sec(4) is a very thin wrapper around the existing ipsec output processing for encapsulating packets, and inherited the behaviour that the DF flag was propagated from the encapsulated packet to the outer ip header. this means if the sec(4) interface has a large mtu and is carrying packets with DF set over a network that can't transport large(r) packets, these packets are effectively dropped. ipsec applied via the SPD copes with this by having SAs figure out the path mtu and using that when applying policy, but sec(4) is an interface, so the network stack uses the interface mtu rather than the associated SA path mtu. rfc4459 discusses this kind of problem and offers a variety of solutions. this implements one of the simpler options, which is to allow the tunnel endpoints to manage the DF regardless of the payload and reassemble the encapsulated packets. to actually do this, ipsec output packet processing has to be able to take an argument that says how you want DF to be handled. in the future we're going to look at how we can use the path mtu determined by the ipsec SA to try and implement one of the other solutions from the RFC, which is to signal the lower mtu to the sources of tunnelled packets. tested by and ok claudio@
dddbedba 2025-02-13 21:01:34 Fix route entry race when accessing rt_gwroute. Kassert in rt_getll() was triggered as rt_gwroute could be NULL. Problem was introduced by shared netlock around tcp_timer_rexmt(). PMTU discovery calls rtrequest_delete() which was missing proper locking around rt_gwroute. As rt_getll() is called by ARP and ND6 resolve in the hot path, use SMR to provide the pointer to rt_gwroute lockless. Reference count of the returned route is incremented, caller has to free it. Modifying rt_gwroute or rt_cachecnt in rt_putgwroute() is protected by per route lock. OK mvs@
910ed27a 2025-02-05 18:29:17 Limit net.bpf.maxbufsize sysctl(8) to a value that malloc(9) can handle. Introduce MALLOC_MAX definition to keep this value in sync and use it system wide. Reported-by: syzbot+3b7e5274349f7165bf5f@syzkaller.appspotmail.com ok claudio bluhm
e7387209 2025-02-03 09:44:30 The previous commit missed the release of the reference counter of the pipex session. Also, check the source address on PPPoE. ok mvs
bec1a366 2025-02-03 08:58:52 Limit RX queue of loopback interfaces with 8192 packets. Unlimited queues allow to reach mbufs limit and make network unusable on some architectures. Based on diff proposed by dlg@, but limits only loopback interfaces. Tested by bluhm, additional arm64 tests by kirill. ok bluhm
9915416f 2025-02-01 21:10:02 Fix pf fragment hole count. Fragment reassembly finishes when no holes are left in the fragment queue. In certain overlap conditions, the hole counter was wrong and pf(4) created an incomplete IP packet. Before adjusting the length, remove the overlapping fragment from the queue and insert it again afterwards. pf_frent_remove() and pf_frent_insert() adjust the hole counter automatically. bug reported and fix tested by Lucas Aubard with Johan Mazel, Gilles Guette and Pierre Chifflier; OK claudio@
1d90c3fb 2025-01-30 14:40:50 Get rid of unused `so' argument in sbspace(). No functional changes. ok bluhm
f08653c5 2025-01-25 14:51:34 wg(4) logging enhancement. * Updated wg(4) debug logging to use log(9) instead of printf(9) * Logging now includes IP addresses of remote endpoints From Lloyd <ng2d68 at proton dot me> ok sthen kirill
d1df5f10 2025-01-25 10:53:36 Fix if_getgrouplist() mistype made in previous commit. Found and reported by anton@
f168c03c 2025-01-25 02:06:40 Check the source address for the tunneled packets. ok mvs
eb64c487 2025-01-24 09:19:07 Move interface groups copyout(9)s out of netlock within ifioctl_get(). The interface groups use a complicated linking scheme with special data structures allowing to link multiple interfaces with multiple groups. We can't use iterators here, because some paths are netlock covered and we can't sleep in refcnt_finalize(9). We also can't use double locking to protect this linking data because this new lock would cover a very wide area in the kernel. Link desired interface groups or interfaces from group into temporary lists, protected by new dedicated `if_tmplist_lock' rwlock(9). Bump the reference counter to make the concurrent destruction thread wait until the temporary linked data becomes unused. Delivered data are immutable, so netlock is required only while filling temporary lists. ok bluhm
53d74c0c 2025-01-24 09:16:55 Move copyout(9) out of netlock within sysctl_source(). Netlock required only to store data to local variable, the rest could be done lockless. Use union of sockaddr_in and sockaddr_in6 as temporary buffer. ok bluhm
f6224b7e 2025-01-21 17:40:57 Copy if_data stuff to ifnet descriptor. 'if_data' structure contains interface data like type or baudrate mixed with statistics counters. Each interface descriptor 'ifnet' contains the instance of 'if_data' for delivery to userland on request. It is not clear which lock protects this data. Some old drivers rely on kernel lock, some rely on net lock, ifioctl() relies on both. Moreover, a network interface could have per-cpu counters, but in such case both `if_data' counters and per-cpu counters will be used by some pseudo drivers like bridge(4). Copy 'if_data' stuff into 'ifnet' descriptor to separate interface counters from the rest of the data. This separation allows consistent locking to be used for such data. Note, non per-cpu counters are represented as an array and accessed in the per-cpu counters style to unify future usage paths. ok bluhm
129cf7bd 2025-01-19 03:27:27 make BIOCSWTIMEOUT work with kq events. makes sense jmatthew@
8b2d8634 2025-01-16 17:20:23 Move some copyout()s within ifioctl_get() out of shared netlock. UVM releases exclusive netlock while going to swap, but it can't determine whether shared netlock is held or not. ifioctl_get() does read-only access, so it could follow sogetopt() way. The copyout()s under shared netlock are kept for ifgroup stuff, I will fix this separately. ok bluhm
a4cc1f24 2025-01-15 06:15:44 let pppoe data packets go through if_vinput instead of the pppoeinq. provide pppoe_vinput() for ether input to call. if the packet is for data in an established pppoe session, it can push it straight into the stack with if_vinput. otherwise pppoe_vinput returns the packet to ether_input, which can queue it for processing with the existing code. this should improve throughput and reduce jitter for pppoe input, and there's some evidence that it reduces packet loss. tested by maurice janssen and myself.
0e5b9f78 2025-01-09 18:20:29 Replace bcopy() with memcpy() in route_peeraddr(). from dhill@; OK mvs@
91b74865 2025-01-07 05:36:52 Delete ether_frm_control() which just returned EOPNOTSUPP: pru_control() does that automatically when pr_usrreqs.pru_control is NULL and there are no current plans to add ioctls() on this. ok dlg@
476a4da7 2025-01-05 12:36:48 Retire PR_MPSOCKET flag. TCP socket layer is MP safe for more than a week now. That means all protocols with pr_usrreqs have the PR_MPSOCKET flag. Remove PR_MPSOCKET and use the logic that was used when set. OK mvs@
84d9c64a 2025-01-03 21:27:40 Use atomic operations to modify the MTU of route. When unlocking TCP, path MTU discovery will run in parallel. To keep route MTU consistent, make access to rt_mtu atomic. Use compare-and-swap function to detect whether another thread is modifying the MTU field. In this case skip updating rt_mtu. OK mvs@
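The compare-and-swap idea described in 84d9c64a can be sketched in user-space C11 as follows. This is a minimal illustration, not the kernel code: `struct route_sketch`, `route_set_mtu()` and the use of `<stdatomic.h>` are assumptions made for the example, and the kernel uses its own atomic primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a route entry; only the MTU field matters here. */
struct route_sketch {
	_Atomic unsigned int mtu;
};

/*
 * Try to replace the MTU value the caller observed earlier (seen) with a
 * new one.  If another thread changed the field in the meantime, the
 * compare-and-swap fails and the update is skipped.
 * Returns true when the store happened.
 */
static bool
route_set_mtu(struct route_sketch *rt, unsigned int seen, unsigned int mtu)
{
	assert(rt != NULL);
	return atomic_compare_exchange_strong(&rt->mtu, &seen, mtu);
}
```

A caller would load the MTU once, compute the new value, and call `route_set_mtu()` with the loaded value as `seen`; a `false` return means a concurrent update won and nothing was written.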
b9ae17a0 2024-12-30 02:46:00 All the device and file type ioctl routines just ignore FIONBIO, so stop calling down into those layer from fcntl(F_SETFL) or ioctl(FIONBIO) and delete the "do nothing for this" stubs in all the *ioctl routines. ok dlg@
7c6c9ed7 2024-12-27 10:15:09 Unlock ah_sysctl() and ipcomp_sysctl(). Both are atomically accessed `ah_enable' and `ipcomp_enable' booleans and per-CPU counters based statistics. esp_sysctl() is much more system wide, so unlock it separately. ok bluhm
535d4cde 2024-12-26 10:15:27 Make access to tcp_mssdflt atomic. To further unlock TCP sysctl, we need atomic access to tcp_mssdflt. pf(4) is reading the value multiple times. Better read it once and pass mssdflt down the call stack. In pf_calc_mss() there was a potential integer underflow. Use the signed variants imax(9) and imin(9) as has been done in the TCP stack. OK mvs@
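The integer underflow mentioned in 535d4cde is the usual hazard of subtracting header sizes in unsigned arithmetic. The sketch below contrasts an unsigned and a signed computation; it is not the actual pf_calc_mss() code. The function names, parameters and values are hypothetical, and `sketch_imax()`/`sketch_imin()` are local stand-ins for the kernel helpers imax(9) and imin(9).

```c
#include <assert.h>

/* Local stand-ins for the kernel's signed helpers imax(9) and imin(9). */
static int sketch_imax(int a, int b) { return a > b ? a : b; }
static int sketch_imin(int a, int b) { return a < b ? a : b; }

/* Signed variant: a negative intermediate result is raised to the floor. */
static int
calc_mss_signed(int mtu, int hlen, int floor, int ceiling)
{
	int mss = mtu - hlen;

	mss = sketch_imax(mss, floor);
	mss = sketch_imin(mss, ceiling);
	return mss;
}

/* Unsigned variant: mtu - hlen wraps to a huge value when hlen > mtu. */
static unsigned int
calc_mss_unsigned(unsigned int mtu, unsigned int hlen,
    unsigned int floor, unsigned int ceiling)
{
	unsigned int mss = mtu - hlen;	/* wraps around on underflow */

	if (mss < floor)
		mss = floor;
	if (mss > ceiling)
		mss = ceiling;
	return mss;
}
```

With an MTU smaller than the header length, the signed variant yields the floor, while the unsigned variant wraps and ends up at the ceiling.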
e541a7ae 2024-12-18 02:25:30 go back to r1.326, before i fiddled with packet generation and bpf. i've had a couple of reports of redundant firewalls misbehaving since these changes, so until i can figure out what's wrong i'm backing them out. reported by hrvoje popovski and mark patruck
2fbde403 2024-12-18 01:56:05 let LLDP packets fall through to being handled on the port interfaces. 802.1ax says that LLDP packets sent to the multicast groups listed in 802.1ab (the lldp spec) should be treated as "control frames" so they can be processed by an lldp agent on physical interface. in our situation that means we shouldn't aggregate LLDP packets so they appear to enter the system on aggr(4) interfaces, we should let the physical port interfaces handle them. this will allow AF_FRAME sockets listening on aggr port interfaces receive lldp packets. jmatthew@ says it looks good.
6fb93e47 2024-12-15 11:00:05 add an AF_FRAME socket domain and an IFT_ETHER protocol family under it. this allows userland to use sockets to send and receive Ethernet frames. as per the upcoming frame.4 man page: frame protocol family sockets are designed as an alternative to bpf(4) for handling low data and packet rate communication protocols. Rather than filtering every frame entering the system before the network stack like bpf(4), the frame protocol family processing avoids this overhead by running after the built in protocol handlers in the kernel. For this reason, it is not possible to handle IPv4 or IPv6 packets with frame protocol sockets because the kernel network stack consumes them before the receive handling for frame sockets is run. if you've used udp sockets then these should feel much the same. my main motivation is to implement an lldp agent in userland, but without having to have bpf look at every packet when lldp happens every minute or two. the only feedback i had was positive, so i'm putting it in ok claudio@
09380e00 2024-12-11 04:22:41 get rid of code for an extra DLT_LOOP bpf attachment. pfsync doesnt know the source address in IP packets before it calls ip_output, so the extra bpf attachment has a distorted view of what IP packets are being sent anyway. you can tcpdump on the pfsync syncdev if you want to see what will be on the wire.
780061c3 2024-12-11 04:18:52 fix pfsync_encap to cope with pfsync_sendout changes. problem noticed by hrvoje popovski
b9b60940 2024-12-04 18:20:46 Unlock gre_sysctl(). Both `gre_allow' and `gre_wccp' are atomically accessed integers. They could have only '0' and '1' values, so no extra dances around atomic_load_int(9) required. ok bluhm
c6b373c6 2024-11-26 10:42:58 let bpf pick the first attached dlt when attaching to an interface. this is instead of picking the lowest numbered dlt, which was done to make bpf more predictable with interfaces that attached multiple DLTs. i think the real problem was that bpf would keep the list in the reverse order of attachment and would prefer the last dlt. interfaces that attach multiple DLTs attach ethernet first, which is what you want the majority of the time anyway. but letting bpf pick the first one means drivers can control which dlt they want to default to, regardless of the numeric id behind a dlt. ok claudio@
a4f86d2e 2024-11-20 02:18:45 provide ifq_deq_set_oactive. ifq_deq_set_oactive is a variation on ifq_set_oactive that can be called inside an if_deq_begin "transaction". afresh@ found de(4) was calling ifq_set_oactive while holding the ifq mutex via ifq_deq_begin, which led to a panic because ifq_set_oactive also tries to take the ifq mutex. ifq_deq_set_oactive assumes the caller is already holding the mutex. de(4) is confusing, so it seemed simpler to add a small tweak to ifqs than try and do major surgery on such a hairy driver. tested by afresh@
1511e544 2024-11-19 23:26:35 use a tailq for the global list of bpf_if structs. this replaces a hand rolled list that's been here since 1.1. ok claudio@ kn@ tb@
ddee6534 2024-11-19 02:11:03 fix tcpdump on pfsync interfaces. after the last rewrite i was showing bpf ip packets, not the pfsync payload like the PFSYNC DLT expected. this also lets bpf see packets being processed by pfsync input handling, so if you want to see only what's being sent you'll need to filter by direction. reported by Marc Boisis
2e7772af 2024-11-17 23:31:01 bump the "mru" up to MAXMCLBYTES. there's no reason to limit tun/tap to small packets. ok claudio@
05ecc67f 2024-11-17 23:21:45 include tun_hdr in the length reported by FIONREAD and kq if it's enabled.
a921796a 2024-11-17 12:21:48 make sure bpfsdetach is holding a bpf_d ref when invalidating stuff. when bpfsdetach is called by an interface being destroyed, it iterates over the bpf descriptors using the interface and calls vdevgone and klist_invalidate against them. however, i'm not sure the reference the interface holds against the bpf_d is accounted for properly, so vdevgone might drop it to 0 and free it, which makes the klist_invalidate a use after free. avoid this by taking a bpf_d ref before calling vdevgone and klist_invalidate so the memory can't be freed out from under the feet of bpfsdetach. Reported-by: syzbot+b3927f8ad162452a2f39@syzkaller.appspotmail.com i wasn't able to reproduce whatever syzkaller did. it's possible this is a double free, but we'll wait and see if it pops up again. ok mpi@
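The pattern in a921796a, taking a reference before calling something that may drop one, can be sketched as below. This is a single-threaded user-space illustration with hypothetical names (`struct desc_sketch`, `revoke_sketch()`, `detach_sketch()`); the plain `int` counter is an assumption for brevity and stands in for a proper atomic reference count.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical descriptor with a plain (non-atomic) reference count. */
struct desc_sketch {
	int refs;
	int invalidated;
};

static int freed_count;	/* counts frees so the example can be checked */

static void
desc_get(struct desc_sketch *d)
{
	d->refs++;
}

static void
desc_put(struct desc_sketch *d)
{
	if (--d->refs == 0) {
		free(d);
		freed_count++;
	}
}

/* Stand-in for a call that may drop the reference held elsewhere. */
static void
revoke_sketch(struct desc_sketch *d)
{
	desc_put(d);
}

static void
detach_sketch(struct desc_sketch *d)
{
	desc_get(d);		/* keep d alive across both calls */
	revoke_sketch(d);	/* may drop another reference */
	d->invalidated = 1;	/* safe: our own reference still holds d */
	desc_put(d);		/* drop ours; this may free d */
}
```

Without the `desc_get()`/`desc_put()` pair, `revoke_sketch()` could free the descriptor and the following write would touch freed memory.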
1c386161 2024-11-17 00:25:07 provide network offloads between the kernel and userland again. userland can request that network packets that are read from or written to the device special file get prepended with a "tun_hdr" struct. this struct contains bits which say what offloads are requested for the packet, including things like ip/tcp/udp/icmp checksums, tcp segmentation offloads, or ethernet vlan tags. userland can write a packet with any of these offloads requested into the kernel at any time, but has to request which ones it's able to handle coming from the kernel. enabling the tun_hdr struct and which offloads userland can handle is done with a new TUNSCAP ioctl. this is based on the virtio_net_hdr in linux, which jan@ actually implemented and had working with vmd. however, claudio@ and i were strongly opposed to what feels like a layer violation by pulling virtio structures into the tun driver, and then trying to emulate virtio/linux semantics in our network stack, and playing catch up when the "upstream" projects decide to change the shape or meaning of these bits. tun_hdr is specific to the openbsd network stack and its semantics, which simplifies our kernel implementation. jan has been pretty gracious about the extra work on the vmd side of things. tested by and ok jan@ ok claudio@ sthen@ backed this out cos of confusion with the ioctl numbers i picked for controlling this feature. i've picked new numbers that don't conflict this time.
d43f9610 2024-11-14 13:47:38 revert tun(4) changes for now, breaks in kdump build (TUNSCAP/TIOCEXT clash) tb@ agrees
67a47b08 2024-11-14 01:51:57 provide a way to negotiate network offloads between the kernel and userland. userland can request that network packets that are read from or written to the device special file get prepended with a "tun_hdr" struct. this struct contains bits which say what offloads are requested for the packet, including things like ip/tcp/udp/icmp checksums, tcp segmentation offloads, or ethernet vlan tags. userland can write a packet with any of these offloads requested into the kernel at any time, but has to request which ones it's able to handle coming from the kernel. enabling the tun_hdr struct and which offloads userland can handle is done with a new TUNSCAP ioctl. this is based on the virtio_net_hdr in linux, which jan@ actually implemented and had working with vmd. however, claudio@ and i were strongly opposed to what feels like a layer violation by pulling virtio structures into the tun driver, and then trying to emulate virtio/linux semantics in our network stack, and playing catch up when the "upstream" projects decide to change the shape or meaning of these bits. tun_hdr is specific to the openbsd network stack and its semantics, which simplifies our kernel implementation. jan has been pretty gracious about the extra work on the vmd side of things. tested by and ok jan@ ok claudio@
42a2f8b7 2024-11-12 04:14:51 bump the type used to specify traffic queue bandwidth to 64bit. this should let people specify interface and queue bandwidths greater than ~4Gbit. this changes the pf ioctls used to specify queues, so if you want to try this you'll need a new kernel, new headers, and a new pfctl (and systat). or upgrade using a snapshot. the effort and benefit of providing compat isn't worth it. putting it in now so people can kick it around.
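The "~4Gbit" limit in 42a2f8b7 follows from the range of a 32-bit unsigned integer. The sketch below assumes the bandwidth is expressed in bits per second; the helper names are hypothetical and only illustrate the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Does a bandwidth in bits per second fit into a 32-bit field? */
static int
fits_in_32(uint64_t bps)
{
	return bps <= UINT32_MAX;	/* 4294967295, roughly 4.29 Gbit/s */
}

/* What a 32-bit field would actually store for a larger value. */
static uint32_t
truncate_to_32(uint64_t bps)
{
	return (uint32_t)bps;		/* value modulo 2^32 */
}
```

For example, 10 Gbit/s (10000000000) does not fit, and truncation would silently store 1410065408 instead.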
9618b9b7 2024-11-09 04:09:56 remove unused ifq_is_serialized() missed when the prototype was removed in ifq.h rev 1.25 ok dlg@
0d48c46c 2024-11-08 13:22:09 pf(4) when doing af-to translation for ICMP protocol sends packets with TTL field to zero. To fix it function pf_test_state_icmp() must initialize ttl field in pf_pdesc structure for inner packet. feedback from bluhm@ OK bluhm@
67d75188 2024-11-04 00:13:15 remove unused inline function; ok dlg@
f5210b0e 2024-11-01 02:07:14 remove unused local variable
c0619985 2024-10-31 12:33:11 Rewrite mbuf handling in wg(4). . Use m_align() to ensure that mbufs are packed towards the end so that additional headers don't require costly m_prepends. . Stop using m_copyback(), the way it was used there was actually wrong, instead just use memcpy since this is just a single mbuf. . Kill all usage of m_calchdrlen(), again this is not needed or can simply be m->m_pkthdr.len = m->m_len since all this code uses a single buffer. . In wg_encap() remove the min() with t->t_mtu when calculating plaintext_len and out_len. The code does not correctly cope with this min() at all with severe consequences. Initial diff by dhill@ who found the m_prepend() issue. Tested by various people. OK dhill@ mvs@ bluhm@ sthen@
da76ba4d 2024-10-31 11:41:31 Drop forgotten backslashes within vxlan_input(). Seems they are left over from a macro copy-paste. No functional changes. ok mpi dlg
5641e477 2024-10-29 23:57:54 move hfsc to using nanoseconds for keeping times. before it was using 256000000 things per second, so this isn't a huge change, but it can use nsecuptime() to get the time. kjc and cheloa like it ok claudio@
5873c738 2024-10-29 23:25:45 use nsecuptime instead of using nanouptime and doing a bunch of maths. ok claudio@
7bbcf947 2024-10-22 22:05:17 correct argument to klist_free(); ok visa@ mvs@
defbe25c 2024-10-17 05:02:12 remove unneeded if_wg.h and pfsync.h includes
21537d41 2024-10-16 11:12:31 cut tun_init() out, it does pointless work. tun_init turns interface/stack config into a set of flags that tun(4) keeps in tun_softc sc_flags, but never uses. ok miod@ kn@
cca0aa06 2024-10-16 11:03:55 remove SIOCSIFDSTADDR from the network ioctls. netintro says it's deprecated, and most of our other drivers are doing fine without it. ok miod@ kn@ patrick@
ff46e7d6 2024-10-15 00:41:40 remove struct arpreq from net/if_arp.h unused since "rewrite to merge arp and routing tables" in CSRG if_ether.c 7.14 (Berkeley) 06/25/91 used by SIOCSARP, SIOCGARP, SIOCDARP, OSIOCGARP ioctls in Net/2 which were removed before 4.4BSD-Lite ok sthen@ who tested this with a ports build
c5093773 2024-10-13 00:53:21 remove unneeded limits.h and errno.h includes
8a978b4c 2024-10-12 23:31:14 remove unneeded rwlock.h include
e6796ada 2024-10-12 23:18:10 remove unneeded time.h include
2cc786cd 2024-10-12 23:10:07 remove unneeded percpu.h include
2d7d7ba6 2024-10-10 06:50:58 neuter the tun/tap ioctls that try and modify interface flags. historically there was just tun(4) that supported both layer 3 p2p and ethernet modes, but had to be reconfigured at runtime by userland to properly change the interface type and interface flags. this is obviously not a great idea, mostly because a lot of stack behaviour around address management makes assumptions based on these parameters, and changing them at runtime confuses things. splitting tun so ethernet was handled by a specific tap(4) driver was a first step at locking this down. this takes a further step by restricting userlands ability to reconfigure the interface flags, specifically IFF_BROADCAST, IFF_MULTICAST, and IFF_POINTOPOINT. this change lets userland pass those values via the ioctls, but only if they match the current set of flags on the interface. these flags are set appropriate for the type of interface when it's created, but should not be changed afterwards. nothing in base uses these ioctls, so the only fall out will be from ports doing weird things. ok claudio@ kn@
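The rule described in 2d7d7ba6, accepting the three flags only when they match what the interface already has, can be sketched as a simple mask comparison. The constants and the function below are hypothetical stand-ins (not the real IFF_* values or the tun(4) code), and returning EINVAL on mismatch is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag bits for the sketch; not the real IFF_* values. */
#define SK_IFF_UP		0x01u
#define SK_IFF_BROADCAST	0x02u
#define SK_IFF_POINTOPOINT	0x04u
#define SK_IFF_MULTICAST	0x08u

/*
 * Accept a requested flag set only if the bits that are fixed at
 * interface creation are identical to the current ones.  Other bits
 * are not examined here.
 */
static int
check_flags_sketch(unsigned int cur, unsigned int req)
{
	const unsigned int fixed =
	    SK_IFF_BROADCAST | SK_IFF_MULTICAST | SK_IFF_POINTOPOINT;

	if ((req & fixed) != (cur & fixed))
		return EINVAL;	/* assumed error code */
	return 0;
}
```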
b985d824 2024-09-27 00:38:49 Previous pipex.c,v 1.155 was broken if the client was not behind a NAT. ok mvs
479c151d 2024-09-20 02:00:46 remove unneeded semicolons; checked by millert@
efb6c398 2024-09-09 07:37:47 Don't take netlock while setting `if_description'. net/if_pppx.c is the only place where `if_description' accessed outside ifioctl() path and there is no reason to take netlock here. SIOCSIFDESCR case of ifioctl() modifies `if_description' with the only kernel lock. ok bluhm
845086ff 2024-09-07 22:41:55 fix RBT_ENTRY in pf_state and pf_state_key ok sashan@
9593dc34 2024-09-04 07:54:51 Fix some spelling. Input and ok jmc@, jsg@
54fbbda3 2024-09-01 03:08:56 spelling; checked by jmc@, ok miod@ mglocker@ krw@
45a062b0 2024-08-31 04:17:14 add rport(4) for p2p l3 connectivity between route domains. you can basically plug rdomains together and route between them over rport interfaces. people keep asking me if this is so you can leak routes between rdomains, and the answer is yes. this is like pair(4) but cheaper because it avoids all the mucking around with putting an ethernet header on the mbuf just to take it off again later, and is more efficient with address space because it's a p2p ip interface. it has a small tweak from mvs@ ok denis@ claudio@
938c962f 2024-08-27 13:52:41 remove some dead code that wasn't cleaned up ok sashan
02e922b0 2024-08-20 07:47:25 Unlock etherip_sysctl(). - ETHERIPCTL_ALLOW - atomically accessed integer; - ETHERIPCTL_STATS - per-CPU counters ok bluhm
895cab01 2024-08-17 09:52:11 Allow PPP interface to run in an rdomain and get a default route installed in the same routing domain Input and OK claudio@
e88074f0 2024-08-15 12:20:20 add BIOCSETFNR, which is like BIOCSETF but doesnt reset the buffer or stats. from Matthew Luckie <mjl@luckie.org.nz> via tech@ deraadt@ likes it.
34cc435a 2024-08-12 17:02:58 Prepare bpf_sysctl() for upcoming net_sysctl() unlocking. Both NET_BPF_MAXBUFSIZE and NET_BPF_BUFSIZE (`bpf_maxbufsize' and `bpf_bufsize' respectively) are atomically accessed integers. No locks required to modify them. ok bluhm
c80589b8 2024-08-06 16:56:09 Unlock sysctl net.inet.ip.directed-broadcast. ip_directedbcast is read once in either ip_input() or pf_test() during packet processing. So writing the variable does not need net lock. OK mvs@
2293e682 2024-08-05 23:56:10 restrict the maximum wait time you can set via BIOCSWTIMEOUT to 5 minutes. this avoids passing excessively large values to timeout_add_nsec. Reported-by: syzbot+f650785d4f2b3fe28284@syzkaller.appspotmail.com
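A bounds check of the kind described in 2293e682 might look like the sketch below. The names `wtimeout_valid()` and `WTIMEOUT_MAX_SEC` are hypothetical, and the log does not say how the kernel reacts to an oversized value, so the sketch only reports whether a `struct timeval` is within the 5 minute limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/time.h>

#define WTIMEOUT_MAX_SEC	(5 * 60)	/* 5 minutes, as in the log entry */

/* Return true if tv is a well-formed timeout of at most 5 minutes. */
static bool
wtimeout_valid(const struct timeval *tv)
{
	if (tv->tv_sec < 0 || tv->tv_usec < 0 || tv->tv_usec >= 1000000)
		return false;
	if (tv->tv_sec > WTIMEOUT_MAX_SEC)
		return false;
	if (tv->tv_sec == WTIMEOUT_MAX_SEC && tv->tv_usec > 0)
		return false;
	return true;
}
```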
270a6ceb 2024-08-05 17:47:29 Fix bridging IPv6 fragments with pf reassembly. Sending IPv6 fragments over a bridge with pf did not work. During input pf reassembles the packet, and at bridge output it should be refragmented. This is only done for PF_FWD direction, but bridge(4) and veb(4) called pf_test() with PF_OUT argument. OK sashan@
5e1af158 2024-07-30 13:41:15 Exports the statistics when PIPEXDSESSION. Found by ymatsui at iij. ok mvs
8fd4ba52 2024-07-26 15:51:09 Mark ipsecflowinfo immutable. ok mvs
207bd73a 2024-07-26 15:45:31 In pipex_l2tp_input(), check if ipsecflowinfo is not changed instead of updating it blindly. ok mvs
26723e1a 2024-07-23 20:04:51 Accept and ignore SADB_X_EXT_REPLAY and SADB_X_EXT_COUNTER payloads for incoming SADB_ADD and SADB_UPDATE message. Since we send them as part of the SADB_GET reply we must also accept them on SADB_ADD/UPDATE as sasyncd will forward payloads previously received in SADB_GET. Fixes a bug where sasyncd can't restore SAs because pfkey returns EINVAL. From Rafał Ramocki ok bluhm@
862c3389 2024-07-18 14:46:28 In pfattach() pass malloc type instead of flags to cpumem_malloc(). from markus@
1f9e444e 2024-07-14 18:53:39 Unlock IPv6 sysctl net.inet6.ip6.forwarding from net lock. Use atomic operations to read ip6_forwarding while processing packets in the network stack. To make clear where actually the router property is needed, use the i_am_router variable based on ip6_forwarding. It already existed in nd6_nbr. Move i_am_router setting up the call stack until all users are independent. The forwarding decisions in pf_test, pf_refragment6, ip6_input do also not interfere. Use a new array ipv6ctl_vars_unlocked to make transition of all the integer sysctls easier. Adapt IPv4 to the new style. OK mvs@
ac42138b 2024-07-12 17:20:18 Switch `so_snd' of udp(4) sockets to the new locking scheme. udp_send() and following udp{,6}_output() do not append packets to `so_snd' socket buffer. This means the sosend() and sosplice() sending paths are dummy pru_send() and there are no problems to simultaneously run them on the same socket. Push shared solock() deep down to sosend() and take it only around pru_send(), but keep somove() running under exclusive solock(). Since sosend() doesn't modify `so_snd' the unlocked `so_snd' space checks within somove() are safe. Corresponding `sb_state' and `sb_flags' modifications are protected by `sb_mtx' mutex(9). Tested and OK bluhm.
2c9f5b1f 2024-07-12 09:25:27 Run sysctl net.inet.ip.forwarding without net lock. The places in packet processing where ip_forwarding is evaluated have been consolidated. The remaining pieces in pf test, ip input, and icmp input do not need consistent information. If the integer value is changed by another CPU, it is harmless. The sysctl syscall sets the value atomically, so add atomic read in network processing and remove the net lock in sysctl IPCTL_FORWARDING. OK claudio@ mvs@
28c60e63 2024-07-04 12:50:08 Implement IPv6 forwarding IPsec only. IPsec gateways set the forwarding sysctl to 2. While this worked for IPv4 since a long time, adapt this feature for IPv6 now. Set sysctl net.inet6.ip6.forwarding=2 to forward only packets that have been processed by IPsec. Set IPV6_FORWARDING_IPSEC in ip6_input() and pass the flag down to the call stack. This provides consistent view on global variable ip6_forwarding. In ip6_output() or ip6_forward() drop packets that do not match the policy. OK denis@
0e25137a 2024-07-02 18:33:47 Read IPsec forwarding information once. Fix MP race between reading ip_forwarding in ip_input() and checking ip_forwarding == 2 in ip_output(). In theory ip_forwarding could be 2 during ip_input() and later 0 in ip_output(). Then a packet would be forwarded that was never allowed. Currently exclusive netlock in sysctl(2) prevents all races. Introduce IP_FORWARDING_IPSEC and pass it with the flags parameter that was introduced for IP_FORWARDING. Instead of calling m_tag_find(), traversing the list, and comparing with NULL, just check the PACKET_TAG_IPSEC_IN_DONE bit. Reading ipsec_in_use in ip_output() is a performance hack that is not necessary. New code only checks tree bits. OK mvs@
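The read-once approach in 0e25137a can be sketched as follows: the global is loaded a single time on input, turned into flags, and later stages decide from the flags alone. The names prefixed with `SK_` or ending in `_sketch` are hypothetical, C11 atomics stand in for the kernel's primitives, and the exact flag logic of the real code is not reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>

#define SK_FORWARDING		0x1	/* forwarding enabled */
#define SK_FORWARDING_IPSEC	0x2	/* forward IPsec-processed packets only */

/* Stand-in for the sysctl-controlled global (0, 1 or 2). */
static _Atomic int sk_ip_forwarding;

/* Input stage: read the global exactly once and encode it as flags. */
static int
input_flags_sketch(void)
{
	int fwd = atomic_load(&sk_ip_forwarding);
	int flags = 0;

	if (fwd != 0)
		flags |= SK_FORWARDING;
	if (fwd == 2)
		flags |= SK_FORWARDING_IPSEC;
	return flags;
}

/* Output stage: decide from the flags, never from the global. */
static int
output_allowed_sketch(int flags, int ipsec_done)
{
	if (!(flags & SK_FORWARDING))
		return 0;
	if ((flags & SK_FORWARDING_IPSEC) && !ipsec_done)
		return 0;
	return 1;
}
```

Because the output stage never re-reads the global, a sysctl change between input and output cannot let through a packet that the input-time policy would have rejected.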
da5607f6 2024-06-26 01:40:49 return type on a dedicated line when declaring functions ok mglocker@
e7c2d835 2024-06-22 10:22:29 remove space between function names and argument list
e9494cab 2024-06-21 12:51:29 My earlier commit [1.1169 of pf.c (2023/01/05)] makes pf(4) report the wrong rule and anchor number when a packet matches a rule found at anchor depth 2 or more. The issue has been noticed and reported by Giannis Kapetanakis (billias _at_ edu.physics.uoc.gr), who also co-developed and tested the final fix presented in this commit. To fix the issue pf(4) must also remember the anchor where the matching rule belongs while rules are traversed to find a match for a given packet. The information on the anchor is now kept in the anchor stack frame. OK sthen@
ab457133 2024-06-20 19:25:42 Read IPv6 forwarding value only once while processing a packet. IPv4 uses IP_FORWARDING to pass down a consistent value of net.inet.ip.forwarding down the stack. This is needed for unlocking sysctl. Do the same for IPv6. Read ip6_forwarding once in ip6_input_if() and pass down IPV6_FORWARDING as flags to ip6_ours(), ip6_hbhchcheck(), ip6_forward(). Replace the srcrt value with IPV6_REDIRECT flag for consistency with IPv4. To have common syntax with IPv4, use ip6_forwarding == 0 checks instead of !ip6_forwarding. This will also make it easier to implement net.inet6.ip6.forwarding=2 for IPsec only forwarding later. In nd6_ns_input() and nd6_na_input() read ip6_forwarding once and store it in i_am_router. The variable name has been chosen to avoid confusion with is_router, which indicates router flag of the packet. Reading of ip6_forwarding is done independently from ip6_input_if(), consistency does not really matter. One is for ND router behavior the other for forwarding. Again use the ip6_forwarding != 0 check, so when ip6_forwarding IPsec only value 2 gets implemented, it will behave like a router. OK deraadt@ sashan@ florian@ claudio@
dfc54264 2024-06-14 08:32:22 Switch AF_ROUTE sockets to the new locking scheme. At sockets layer only mark buffers as SB_MTXLOCK. At PCB layer only protect `so_rcv' with corresponding `sb_mtx' mutex(9). SS_ISCONNECTED and SS_CANTRCVMORE bits are redundant for AF_ROUTE sockets. Since SS_CANTRCVMORE modifications performed with both solock() and `sb_mtx' held, the 'unlocked' SS_CANTRCVMORE check in rtm_senddesync() is safe. ok bluhm
1a9be797 2024-06-09 16:25:27 Introduce IFCAP_VLAN_HWOFFLOAD for vio(4). Add IFCAP_VLAN_HWOFFLOAD to signal hardware like vio(4) can handle checksum or TSO offloading with inline VLAN tags. tested by Mark Patruck, sf@ and bluhm@ ok sf@ and bluhm@
84b2c343 2024-06-07 18:24:16 Read IP forwarding variables only once. Do not assume that ip_forwarding and ip_directedbcast cannot change while processing one packet. Read it once and pass down its value with a flag. This is necessary for unlocking the sysctl path. There are a few places where a consistent value does not really matter, they are unchanged. Use a proper ip_ prefix for the global variable. OK claudio@
baaf996c 2024-06-07 13:43:21 remove ph_ppp_proto define, unused since rev 1.123
528abb89 2024-05-29 00:48:14 remove prototypes with no matching function
6064c65c 2024-05-24 06:38:41 pfsync must let state progress for the destination peer. The issue has been noticed by matthieu@ when he was chasing the cause of excessive pfsync traffic between firewall boxes. When comparing the content of state tables between primary and backup firewall, the backup firewall showed many states as follows: ESTABLISHED:SYN_SENT FIN_WAIT_2:SYN_SENT * :SYN_SENT this is caused by pfsync_upd_tcp() which fails to update TCP-state for destination connection peer, so it remains stuck in SYN_SENT. matthieu@ confirms diff helps with 'stuck-state'. It also seems to help with excessive pfsync traffic. ok dlg@
150eb84f 2024-05-17 19:02:04 Switch AF_KEY sockets to the new locking scheme. The simplest case. Nothing to change in sockets layer, only set SB_MTXLOCK on socket buffers. ok bluhm
1adf4b76 2024-05-17 18:58:26 Fix uninitialized memory access in pfkeyv2_sysctl(). pfkeyv2_sysctl() reads the SA type from uninitialized memory if it is not provided by the caller of sysctl(2) because of a missing length check. From Carsten Beckmann. ok bluhm