kern

Log

Commit	Date	Message
06881677	2020-10-11 07:11:59	Refactor kqueue_scan() to use a context: a "kqueue_scan_state struct". The struct keeps track of the end point of an event queue scan by persisting the end marker. This will be needed when kqueue_scan() is called repeatedly to complete a scan in a piecewise fashion. Extracted from a previous diff from visa@. ok visa@, anton@
574045aa	2020-10-07 17:53:44	sys_getitimer(), sys_setitimer(): style(9), misc. cleanup - Consolidate variable declarations. - Remove superfluous parentheses from return statements. - Prefer sizeof(variable) to sizeof(type) for copyin(9)/copyout(9). - Remove some intermediate pointers from sys_setitimer(). Using SCARG() directly here makes it more obvious to the reader what you're copying.
d71a0d64	2020-10-07 16:17:25	getitimer(2), setitimer(2): ITIMER_REAL: call getnanouptime(9) once Now that the critical sections are merged we should call getnanouptime(9) once. This makes an ITIMER_REAL timer swap atomic with respect to the clock: the time remaining on the old timer is computed with the same timestamp used to schedule the new timer.
b015073f	2020-10-07 15:45:00	getitimer(2), setitimer(2): merge critical sections Merge the common code from sys_getitimer() and sys_setitimer() into a new kernel subroutine, setitimer(). setitimer() performs all of the error-free work for both system calls within a single critical section. We need a single critical section to make the setitimer(2) timer swap operation atomic relative to realitexpire() and hardclock(9). The downside of the new atomicity is that the behavior of setitimer(2) must change. With a single critical section we can no longer copyout(9) the old timer before installing the new timer. So if SCARG(uap, oitv) points to invalid memory, setitimer(2) now fails with EFAULT but the new timer will be left running. You can see this in action with code like the following: struct itimerval itv, olditv; itv.it_value.tv_sec = 1; itv.it_value.tv_usec = 0; itv.it_interval = itv.it_value; /* This should EFAULT. 0x1 is probably an invalid address. */ if (setitimer(ITIMER_REAL, &itv, (void *)0x1) == -1) warn("setitimer"); /* The timer will be running anyway. */ getitimer(ITIMER_REAL, &olditv); printf("time left: %lld.%06ld\n", olditv.it_value.tv_sec, olditv.it_value.tv_usec); There is no easy way to work around this. Both FreeBSD's and Linux's setitimer(2) implementations have a single critical section and they too fail with EFAULT in this case and leave the new timer running. I imagine their developers decided that fixing this error case was a waste of effort. Without permitting copyout(9) from within a mutex I'm not sure it is even possible to avoid it on OpenBSD without sacrificing atomicity during a setitimer(2) timer swap. Given the rarity of this error case I would rather have an atomic swap. Behavior change discussed with deraadt@.
7406c037	2020-10-07 12:33:03	Document that `a_p' is always curproc by using a KASSERT(). One exception to this rule is VOP_CLOSE() where NULL is used instead of curproc when the garbage collector of unix sockets, which runs in a kernel thread, drops the last reference of a file. This will allow for future simplifications of the VFS interfaces. Previous version ok visa@, anton@. ok kn@
64e0152d	2020-10-05 01:56:17	Fix write hang-up on file system on vnd. ok beck@
c62c648b	2020-10-02 15:45:22	expose timeval/timespec from system calls into ktrace, before determining if they are out of range, making it easier to isolate reason for EINVAL ok cheloha
27257b4a	2020-09-29 11:48:54	Move the solock() call outside of solisten(). The reason is that the so_state and splice checks were done without the proper lock which is incorrect. This is similar to sobind(), soconnect() which also require the callee to hold the socket lock. Found by, with and OK mvs@, OK mpi@
2659a4f5	2020-09-26 15:15:22	Remove the PR_WAITOK flag from the ucred_pool. The pool items are small enough that this pool uses the single page allocator for which PR_WAITOK is a no-op. However, its presence suggests that pool_put(9) may sleep. The single page allocator will never actually do that. This makes it obvious that refreshcreds() will not sleep. ok deraadt@, visa@
d52ff6db	2020-09-25 20:24:32	setpriority(2): don't treat booleans as scalars The variable "found" in sys_setpriority() is used as a boolean. We should set it to 1 to indicate that we found the object we were looking for instead of incrementing it. deraadt@ notes that the current code is not buggy, because OpenBSD cannot support anywhere near 2^32 processes, but agrees that incrementing the variable signals the wrong thing to the reader. ok millert@ deraadt@
1cc52d09	2020-09-22 13:43:28	timeout(9): timeout_run(): read to_process before leaving timeout_mutex to_process is assigned during timeout_add(9) within timeout_mutex. In timeout_run() we need to read to_process before leaving timeout_mutex to ensure that the process pointer given to kcov_remote_enter(9) is the same as the one we set from timeout_add(9) when the candidate timeout was originally scheduled to run.
fffcd96b	2020-09-16 13:50:42	Move duplicated code to send an uncatchable SIGABRT into a function. ok claudio@
8aec63dd	2020-09-16 10:06:56	put HW_PHYSMEM64 case under CTL_HW not CTL_KERN Fixes previous. Problem spotted by kettenis@
3de0412f	2020-09-16 08:02:53	As discovered by kettenis, recent mesa wants sysctl hw.physmem64, and in pledged programs that is unfortable. My snark levels are a bit drained, but I must say I'm always disappointed when programs operating on virtual resources enquire about total physical resource availability; the only reason to ask is so they can act unfair relative to others in the shared environment. SIGH.
97c55bcc	2020-09-16 00:00:40	timecounting: provide a naptime variable for userspace via kvm_read(3) vmstat(8) uses kvm_read(3) to extract the naptime from the kernel. Problem is, I deleted `naptime' from the global namespace when I moved it into the timehands. This patch restores it. It gets updated from tc_windup(). Only userspace should use it, and only when the kernel is dead. We need to tweak a variable in tc_setclock() to avoid shadowing the (once again) global naptime.
d5ec8c1a	2020-09-14 19:02:09	add three static probes for vfs: cleaner, bufcache_take and bufcache_rel. while here, swap two lines in bufcache_release() to put a KASSERT() first following the pattern in bufcache_take() ok beck@ mpi@
4d550072	2020-09-13 13:33:37	Unbreak tree. Instead of passing struct process to siginit() just pass the struct sigacts since that is the only thing that is modified by siginit.
6d73ed73	2020-09-13 09:48:39	Grab the KERNEL_LOCK in ktrpsig() before calling ktrwrite(). Another little step towards moving signal delivery outside of KERNEL_LOCK. OK mpi@
68b35b93	2020-09-13 09:42:31	Initialize sigacts0 before making them visible by setting ps->ps_sigacts. OK mpi@
1a5ca5ba	2020-09-12 11:57:24	Add a NULL check in bufbackoff so we don't die when passed a NULL pmem range. Noticed by, and based on a diff from Mike Small <smallm@sdf.org>.
26c3009c	2020-09-09 16:29:14	Introduce a helper to check if a signal is ignored or masked by a thread. ok claudio@, pirofti@
cfdea7e9	2020-09-01 01:53:50	Remove unused sysctl_int_arr(9)
f09fc09b	2020-08-26 03:16:53	Fix a race in single-thread mode switching Extend the scope of SCHED_LOCK() to better synchronize single_thread_set(), single_thread_clear() and single_thread_check(). This prevents threads from suspending before single_thread_set() has finished. If a thread suspended early, ps_singlecount might get decremented too much, which in turn could make single_thread_wait() get stuck. The race could be triggered for example by trying to stop a multithreaded process with a debugger. When triggered, the race prevents the debugger from finishing a wait4(2) call on the debuggee. This kind of gdb hang was reported by Julian Smith on misc@. Unfortunately, single-thread mode switching still has issues and hangs are still possible. OK mpi@
3018811d	2020-08-23 09:35:32	Remove unused debug_syncprt, improve debug sysctl handling "syncprt" is unused since kern/vfs_syscalls.c r1.147 from 2008. Adding new debug sysctls is a bit opaque and looking at kern/kern_sysctl.c the only visible difference between used and stub ctldebug structs in the debugvars[] array is their extern keyword, indicating that it is defined elsewhere. sys/sysctl.h declares all debugN members as extern upfront, but these declarations are not needed. Remove the unused debug sysctl, rename the only remaining one to something meaningful and remove forward declarations from /sys/sysctl.h; this way, adding new debug sysctls is a matter of adding extern and coming up with a name, which is nicer to read on its own and better to grep for. OK mpi
c38ca5bd	2020-08-22 11:47:22	Move sysctl(2) CTL_DEBUG from DEBUG to new DEBUG_SYSCTL Adding "debug.my-knob" sysctls is really helpful to select different code paths and/or log on demand during runtime without recompile, but as this code is under DEBUG, lots of other noise comes with it which is often undesired, at least when looking at specific subsystems only. Adding globals to the kernel and breaking into DDB to change them helps, but that does not work over SSH, hence the need for debug sysctls. Introduces DEBUG_SYSCTL to make use of the "debug" MIB without the rest of DEBUG; it's DEBUG_SYSCTL and not SYSCTL_DEBUG because it's not a general option for all of sysctl(2). OK gnezdo
726a21b0	2020-08-19 10:10:57	Push KERNEL_LOCK/UNLOCK() dance inside trapsignal(). ok kettenis@, visa@
e025e2bd	2020-08-18 18:19:30	Style fixups from hurried commits Thanks kettenis@ for pointing out. ok kettenis@
b85522ab	2020-08-18 13:41:49	Fix kn_data returned by filt_logread(). Take into account the circular nature of the message buffer when computing the number of available bytes. Move the computation into a separate function and use it with the kevent(2) and ioctl(2) interfaces. OK mpi@
f0e956f0	2020-08-18 13:38:24	Remove an unnecessary field from struct msgbuf. OK mvs@
baf5b8dd	2020-08-18 04:48:11	Add sysctl_bounded_arr as a replacement for sysctl_int_arr Design by deraadt@ ok deraadt@
8d5e87ec	2020-08-12 15:31:27	getitimer(2): delay TIMESPEC_TO_TIMEVAL(9) conversion until copyout(9) setitimer(2) works with timespecs in its critical section. It will be easier to merge the two critical sections if getitimer(2) also works with timespecs. In particular, we currently read the uptime clock twice during a setitimer(2) swap: we call getmicrouptime(9) in sys_getitimer() and then call getnanouptime(9) in sys_setitimer(). This means that swapping one timer in for another is not atomic with respect to the uptime clock. It also means the two operations are working with different time structures and resolutions, which is potentially confusing. If both critical sections work with timespecs we can combine the two getnanouptime(9) calls into a single call at the start of the combined critical section in a future patch, making the swap atomic with respect to the clock. So, in preparation, move the TIMESPEC_TO_TIMEVAL conversions in getitimer(2) after the ITIMER_REAL conversion from absolute to relative time, just before copyout(9). The ITIMER_REAL conversion must then be done with timespec macros and getnanouptime(9), just like in setitimer(2).
7da17545	2020-08-12 14:41:09	setitimer(2): ITIMER_REAL: don't call timeout_del(9) before timeout_add(9) If we're replacing the current ITIMER_REAL timer with a new one we don't need to call timeout_del(9) before calling timeout_add(9). timeout_add(9) does the work of timeout_del(9) implicitly if the timeout in question is already pending. This saves us an extra trip through the timeout_mutex.
22f7ba00	2020-08-12 13:49:24	Reduce stack usage of kqueue_scan() Reuse the kev[] array of sys_kevent() in kqueue_scan() to lower stack usage. The code has reset kevp, but not nkev, whenever the retry branch is taken. However, the resetting is unnecessary because retry should be taken only if no events have been collected. Make this clearer by adding KASSERTs. OK mpi@
efe63c19	2020-08-11 22:00:51	setitimer(2): write new timer value in one place Rearrange the critical section in setitimer(2) to match that of getitimer(2). This will make it easier to merge the two critical sections in a subsequent diff. In particular, we want to write the new timer value in one place in the code, regardless of which timer we're setting. ok millert@
bd0b8360	2020-08-11 18:29:58	setitimer(2): consolidate copyin(9), input validation, input conversion For what are probably historical reasons, setitimer(2) does not validate its input (itv) immediately after copyin(9). Instead, it waits until after (possibly) performing a getitimer(2) to copy out the state of the timer. Consolidating copyin(9), input validation, and input conversion into a single block before the getitimer(2) operation makes setitimer(2) itself easier to read. It will also simplify merging the critical sections of setitimer(2) and getitimer(2) in a subsequent patch. This changes setitimer(2)'s behavior in the EINVAL case. Currently, if your input (itv) is invalid, we return EINVAL after modifying the output (olditv). With the patch we will now return EINVAL before modifying the output. However, any code dependent upon this behavior is broken: the contents of olditv are undefined in all setitimer(2) error cases. ok millert@
74ea3cc8	2020-08-11 15:41:50	getitimer(2): don't enter itimer_mtx to read ITIMER_REAL itimerspec The ITIMER_REAL per-process interval timer is protected by the kernel lock. The ITIMER_REAL timeout (ps_realit_to), setitimer(2), and getitimer(2) all run under the kernel lock. Entering itimer_mtx during getitimer(2) when reading the ITIMER_REAL ps_timer state is superfluous and misleading.
3c86a58e	2020-08-09 19:15:47	hardclock(9): fix race with setitimer(2) for ITIMER_VIRTUAL, ITIMER_PROF The ITIMER_VIRTUAL and ITIMER_PROF per-process interval timers are updated from hardclock(9). If a timer for the parent process is enabled the hardclock(9) thread calls itimerdecr() to update and reload it as needed. However, in itimerdecr(), after entering itimer_mtx, the thread needs to double-check that the timer in question is still enabled. While the hardclock(9) thread is entering itimer_mtx a thread in setitimer(2) can take the mutex and disable the timer. If the timer is disabled, itimerdecr() should return 1 to indicate that the timer has not expired and that no action needs to be taken. ok kettenis@
87e66a9b	2020-08-08 01:01:26	adjtime(2): simplify input validation for new adjustment The current input validation for overflow is more complex than it needs to be. We can flatten the conditional hierarchy into a string of checks just one level deep. The result is easier to read.
03077e54	2020-08-07 14:35:38	sosplice(9): fully validate idle timeout The socket splice idle timeout is a timeval, so we need to check that tv_usec is both non-negative and less than one million. Otherwise it isn't in canonical form. We can check for this with timerisvalid(3). benno@ says this shouldn't break anything in base. ok benno@, bluhm@
a17537c0	2020-08-07 00:45:25	timeout(9): remove unused interfaces: timeout_add_ts(9), timeout_add_bt(9) These two interfaces have been entirely unused since introduction. Remove them and thin the "timeout" namespace a bit. Discussed with mpi@ and ratchov@ almost a year ago, though I blocked the change at that time. Also discussed with visa@. ok visa@, mpi@
d97ba291	2020-08-06 17:54:08	timeout(9): fix miscellaneous remote kcov(4) bugs Commit v1.77 introduced remote kcov support for timeouts. We need to tweak a few things to make our support more correct: - Set to_process for barrier timeouts to the calling thread's parent process. Currently it is uninitialized, so during timeout_run() we are passing stack garbage to kcov_remote_enter(9). - Set to_process to NULL during timeout_set_flags(9). If in the future we forget to properly initialize to_process before reaching timeout_run(), we'll pass NULL to kcov_remote_enter(9). anton@ says this is harmless. I assume it is also preferable to passing stack garbage. - Save a copy of to_process on the stack in timeout_run() before calling to_func to ensure that we pass the same process pointer to kcov_remote_leave(9) upon return. The timeout may be freely modified from to_func, so to_process may have changed when we return. Tested by anton@. ok anton@
41d03808	2020-08-01 23:41:55	Move range check inside sysctl_int_arr Range violations are now consistently reported as EOPNOTSUPP. Previously they were mixed with ENOPROTOOPT. OK kn@
8430bc4b	2020-08-01 08:40:20	Add support for remote coverage to kcov. Remote coverage is collected from threads other than the one currently having kcov enabled. A thread with kcov enabled occasionally delegates work to another thread, collecting coverage from such threads improves the ability of syzkaller to correlate side effects in the kernel caused by issuing a syscall. Remote coverage is divided into subsystems. The only supported subsystem right now collects coverage from scheduled tasks and timeouts on behalf of a kcov enabled thread. In order to make this work `struct task' and `struct timeout' must be extended with a new field keeping track of the process that scheduled the task/timeout. Both aforementioned structures have therefore increased with the size of a pointer on all architectures. The kernel API is documented in a new kcov_remote_register(9) manual. Remote coverage is also supported by kcov on NetBSD and Linux. ok mpi@
409840fe	2020-07-26 13:27:23	Reference unveil(2) in system accounting and daily.8. Reminder that unveil does not kill from brynet and gsoares. Wording tweaks from jmc; feedback from deraadt. ok jmc@, millert@, solene@, "fine with me" deraadt@
97b78b05	2020-07-25 00:48:03	timeout(9): remove TIMEOUT_SCHEDULED flag The TIMEOUT_SCHEDULED flag was added a few months ago to differentiate between wheel timeouts and new timeouts during softclock(). The distinction is useful when incrementing the "rescheduled" stat and the "late" stat. Now that we have an intermediate queue for new timeouts, timeout_new, we don't need the flag. The distinction between wheel timeouts and new timeouts can be made computationally. Suggested by procter@ several months ago.
e7acdf99	2020-07-24 21:01:33	timeout(9): delay processing of timeouts added during softclock() New timeouts are appended to the timeout_todo circq via timeout_add(9). If this is done during softclock(), i.e. a timeout function calls timeout_add(9) to reschedule itself, the newly added timeout will be processed later during the same softclock(). This works, but it is not optimal: 1. If a timeout reschedules itself to run in zero ticks, i.e. timeout_add(..., 0); it will be run again during the current softclock(). This can cause an infinite loop, softlocking the primary CPU. 2. Many timeouts are cancelled before they execute. Processing a timeout during the current softclock() is "eager": if we waited, the timeout might be cancelled and we could spare ourselves the effort. If the timeout is not cancelled before the next softclock() we can bucket it as we normally would with no change in behavior. 3. Many timeouts are scheduled to run after 1 tick, i.e. timeout_add(..., 1); Processing these timeouts during the same softclock means bucketing them for no reason: they will be dumped into the timeout_todo queue during the next hardclock(9) anyway. Processing them is pointless. We can avoid these issues by using an intermediate queue, timeout_new. New timeouts are put onto this queue during timeout_add(9). The queue is concatenated to the end of the timeout_todo queue at the start of each softclock() and then softclock() proceeds. This means the amount of work done during a given softclock() is finite and we avoid doing extra work with eager processing. Any timeouts that depend upon being rerun during the current softclock() will need to be updated, though I doubt any such timeouts exist. Discussed with visa@ last year. No complaints after a month.
c15324e3	2020-07-24 14:27:47	Implement BOOT_QUIET option that suppresses kernel printf output to the console. When the kernel panics, console output is re-enabled so that we see those messages. Use this option for the powerpc64 boot kernel. ok visa@, deraadt@
c0f7b35a	2020-07-24 04:53:04	Make timeout_add_sec(9) add a tick if given zero seconds All other timeout_add_*() functions do so before calling timeout_add(9) as described in the manual, this one did not. OK cheloha
739ba58d	2020-07-22 17:39:50	pstat -t was showing bogus column data on ttys, in modes where newline doesn't occur to rewind to column 0. If OPOST is inactive, simply return 0. ok millert
6e581dd8	2020-07-20 22:40:53	ramdisks got broken by that last diff.
04cecb01	2020-07-20 21:51:34	timecounting: add missing mutex assertion to tc_update_timekeep()
1fb8cdb7	2020-07-20 21:43:02	timecounting: misc. cleanup in tc_setclock() and tc_setrealtimeclock() - Use real variable names like "utc" and "uptime" instead of non-names like "bt" and "bt2" - Move the TIMESPEC_TO_BINTIME(9) conversions out of the critical section - Sprinkle in a little whitespace - Sort automatic variables according to style(9)
610b9993	2020-07-20 18:42:30	cleanup ttrstrt; no functional change; ok dlg
ae4429fb	2020-07-20 17:55:28	fix macro indent
2df268a1	2020-07-20 14:34:16	Sigh. Only the ptyc case should tsleep in ttyretype, since others can arrive in the wrong context. Found by jcs.
1988fbea	2020-07-19 23:58:51	tc_windup(): remove misleading comment about getmicrotime(9) Using getmicrotime(9) or getnanotime(9) is perfectly appropriate in certain contexts. The programmer needs to weigh the overhead savings against the reduced accuracy and decide whether the low-res interfaces are appropriate.
a6409917	2020-07-17 16:28:19	Allow setsockopt SO_RTABLE when pledging "wroute"; soon to be needed by slaacd(8). "wroute" allows changes to the routing table so this is a good fit. Nothing else in base is affected by this. dhclient might use the wroute pledge in the future and might also want SO_RTABLE in a more distant future. OK deraadt
63cc33c4	2020-07-17 01:36:41	Read ogen from the other timehands; fixes tk_generation If th0.th_generation == th1.th_generation when we update the user timekeep page, then tk_generation doesn't change, so libc may calculate the wrong time. Now th0 and th1 share the sequence so th0.th_generation != th1.th_generation. ok kettenis@ cheloha@
fecf25f8	2020-07-16 23:06:43	adjtime(2): distribute skew along arbitrary period on runtime clock The adjtime(2) adjustment is applied at up to 5000ppm/sec from tc_windup(). At the start of each UTC second, ntp_update_second() is called from tc_windup() and up to 5000ppm worth of skew is deducted from the timehands' th_adjtimedelta member and moved to the th_adjustment member. The resulting th_adjustment value is then mixed into the th_scale member and thus the system UTC time is slowly nudged in a particular direction. This works pretty well. The only issues have to do with the use of the edge of the UTC second as the start of the ntp_update_second() period: 1. If the UTC clock jumps forward we can get stuck in a loop calling ntp_update_second() from tc_windup(). We work around this with a magic number, LARGE_STEP. If the UTC clock jumps forward more than LARGE_STEP seconds we truncate the number of iterations to 2. Per the comment in tc_windup(), we do 2 iterations instead of 1 iteration to account for a leap second we may have missed. This is an anachronism: the OpenBSD kernel does not handle leap seconds anymore. Such jumps happen during settimeofday(2), during boot when we jump the clock from zero to the RTC time, and during resume when we jump the clock to the RTC time (again). They are unavoidable. 2. Changes to adjtime(2) are applied asynchronously. For example, if you try to cancel the ongoing adjustment... struct timeval zero = { 0, 0 }; adjtime(&zero, NULL); ... it can take up to one second for the adjustment to be cancelled. In the meantime, the skew continues. This delayed application is not intuitive or documented. 3. Adjustment is deducted from th_adjtimedelta across suspends of fewer than LARGE_STEP seconds, even though we do not skew the clock while we are suspended. This is unintuitive, incorrect, and undocumented. 
We can avoid all of these problems by applying the adjustment along an arbitrary period on the runtime clock instead of the UTC clock. 1. The runtime clock doesn't jump arbitrary amounts, so we never get stuck in a loop and we don't need a magic number to test for this possibility. With the removal of the magic number LARGE_STEP we can also remove the leap second handling from the tc_windup() code. 2. With a new timehands member, th_next_ntp_update, we can track when the next ntp_update_second() call should happen on the runtime clock. This value can be updated during the adjtime(2) system call, so changes to the skew happen immediately instead of up to one second after the adjtime(2) call. 3. The runtime clock does not jump across a suspend: no skew is deducted from th_adjtimedelta for any time we are offline and unable to adjust the clock. otto@ says the use of the runtime clock should not be a problem for ntpd(8) or the NTP algorithm in general.
631c607e	2020-07-15 21:20:08	settimeofday(2): securelevel 2: prevent root from freezing the UTC clock At securelevel 2 we prevent root from rewinding the kernel UTC clock. The rationale given in the comment is that this prevents a compromised root from setting arbitrary timestamps on files. I can't really speak to the efficacy of this mitigation, or to the efficacy of the securelevel concept in general, but the implementation of this mitigation is wrong. We need to check: timespeccmp(ts, &now, <=) instead of timespeccmp(ts, &now, <) like we do now. Time is a continuous value that is always advancing. We must prevent root from setting the kernel UTC clock to its current value in addition to prior values. Setting the UTC clock to its current value amounts to rewinding it even if we cannot actually measure the difference with a timespec. With this change, at securelevel 2, root can no longer completely freeze the UTC clock.
e65f1ede	2020-07-15 02:29:26	Scott Cheloha convinces me the newly added tsleep_nsec should be tsleep, to hint we are doing the minimum scheduler sleep (and as side effect, collecting potential signal status)
056d7610	2020-07-14 18:25:22	Use a rwlock to protect the ttylist, rather than having ttymalloc/ttyfree callers use spltty. ok kettenis
8bdc3b62	2020-07-14 14:33:03	A pty write containing VDISCARD, VREPRINT, or various retyping cases of VERASE would perform (sometimes irrelevant) compute in the kernel which can be heavy (especially with our insufficient tty subsystem locking). Use tsleep_nsec for 1 tick in such circumstances to yield cpu, and also bring interruptability to ptcwrite() https://syzkaller.appspot.com/bug?extid=462539bc18fef8fc26cc ok kettenis millert, discussions with greg and anton
1a9b7fd9	2020-07-14 06:02:50	Do not convert the NOCACHE buffers that come from a vnd strategy routine into more delayed writes if the vnd is mounted from a file on an MNT_ASYNC filesystem. This prevents a situation where the cleaner cannot clean delayed writes out without making more delayed writes, and we end up waiting for the syncer to spit things out occasionally when it runs. Noticed and reported by sven falempin <sven.falempin@gmail.com> on tech, who validated that this fixes his issue. ok krw@
1502f498	2020-07-11 22:59:05	timekeep_sz now already includes the round_page() adjustment; ok kettenis@
69657d9a	2020-07-09 02:17:07	adjfreq(2): limit adjustment to [-500000, +500000] ppm When we recompute the scaling factor during tc_windup() there is an opportunity for arithmetic overflow if the active timecounter's adjfreq(2) adjustment is too large. If we limit the adjustment to [-500000, +500000] ppm the statement in question cannot overflow. In particular, we are concerned with the following bit of code: scale = (u_int64_t)1 << 63; scale += ((th->th_adjustment + th->th_counter->tc_freq_adj) / 1024) * 2199; scale /= th->th_counter->tc_frequency; th->th_scale = scale * 2; where scale is an int64_t. Overflow can occur when we do: scale += (...) / 1024 * 2199; as th->th_counter->tc_freq_adj is currently unbounded. th->th_adjustment is limited to [-5000ppm, 5000ppm]. To see that overflow is prevented with the new bounds, consider the new edge case where th->th_counter->tc_freq_adj is 500000ppm and th->th_adjustment is 5000ppm. Both are of type int64_t. We have: int64_t th_adjustment = (5000 * 1000) << 32; /* 21474836480000000 */ int64_t tc_freq_adj = 500000000LL << 32; /* 2147483648000000000 */ scale = (u_int64_t)1 << 63; /* 9223372036854775808 */ scale += (th_adjustment + tc_freq_adj) / 1024 * 2199; /* scale += 2168958484480000000 / 1024 * 2199; */ /* scale += 4657753620480000000; */ 9223372036854775808 + 4657753620480000000 = 13881125657334775808, which is less than 18446744073709551616, so we don't have overflow. On the opposite end, if th->th_counter->tc_freq_adj is -500000ppm and th->th_adjustment is -5000ppm we would have -4657753620480000000. 9223372036854775808 - 4657753620480000000 = 4565618416374775808. Again, no overflow. 500000ppm and -500000ppm are extreme adjustments. otto@ says ntpd(8) would never arrive at them naturally, so we are not at risk of breaking a working setup by imposing these restrictions. Documentation input from kettenis@. No complaints from otto@.
464b9e49	2020-07-08 21:05:42	Info leaks in semctl SEM_GET: the pads (unknown old contents) and base (a RW page within allocatable space) were leaked. Report from adam@grimm-co. ok millert
b7f13c93	2020-07-07 02:01:43	small typo
6a3d01b2	2020-07-06 21:41:56	Wire down the timekeep page. If we don't do this, the pagedaemon may page it out and bad things will happen when we try to page it back in from within the clock interrupt handler. While there, make sure we set timekeep_object back to NULL if we fail to map the timekeep page into kernel space. ok deraadt@ (who had a very similar diff)
d82e6535	2020-07-06 13:33:05	Add support for timecounting in userland. This diff exposes parts of clock_gettime(2) and gettimeofday(2) to userland via libc, liberating processes from the need for a context switch every time they want to count the passage of time. If a timecounter clock can be exposed to userland then it needs to set its tc_user member to a non-zero value. Tested with one or multiple counters per architecture. The timing data is shared through a pointer found in the new ELF auxiliary vector AUX_openbsd_timekeep containing timehands information that is frequently updated by the kernel. Timing differences between the last kernel update and the current time are adjusted in userland by the tc_get_timecount() function inside the MD usertc.c file. This permits a much more responsive environment, quite visible in browsers, office programs and gaming (apparently one is able to fly in Minecraft now). Tested by robert@, sthen@, naddy@, kmos@, phessler@, and many others! OK from at least kettenis@, cheloha@, naddy@, sthen@
01647961	2020-07-04 08:33:43	Use klist_invalidate() in knote_processexit() This leaves knote_remove() for kqueue's internal use. As a result, knote_remove() is used to drop knotes from the knlist of a single kqueue instance. klist_invalidate() clears knotes from a klist that can contain entries from different kqueue instances. Use FILTEROP_ISFD to control how klist_invalidate() treats knotes, to preserve the current behaviour of knote_processexit(). All the existing callers of klist_invalidate() are fd-based. The existing code rewires and activates knotes to give userspace a clear indication that the state of the fd has changed. In knote_processexit(), any remaining knotes in ps_klist are non-fd-based (EVFILT_SIGNAL). Those are dropped without notifying userspace. OK mpi@
b609c616	2020-07-04 08:06:07	It's been agreed upon that global locks should be expressed using capital letters in locking annotations. Therefore harmonize the existing annotations. Also, if multiple locks are required they should be delimited using commas. ok mpi@
e62bad27	2020-07-02 23:30:38	timecounting: make the dummy counter interrupt- and MP-safe The dummy counter should be deterministic with respect to interrupts and multiple threads of execution.
42294cbd	2020-06-29 18:23:18	Bring back revision 1.122 with a fix preventing a use-after-free by serializing calls to pipe_buffer_free(). Repeating the previous commit message: Instead of performing three distinct allocations per created pipe, reduce it to a single one. Not only should this be more performant, it also solves a kqueue related issue found by visa@ who also requested this change: if you attach an EVFILT_WRITE filter to a pipe fd, the knote gets added to the peer's klist. This is a problem for kqueue because if you close the peer's fd, the knote is left in the list whose head is about to be freed. knote_fdclose() is not able to clear the knote because it is not registered with the peer's fd. FreeBSD also takes a similar approach to pipe allocations. once again ok mpi@ visa@
0d88cff5	2020-06-26 18:48:31	timecounting: deprecate time_second(9), time_uptime(9) time_second(9) has been replaced in the kernel by gettime(9). time_uptime(9) has been replaced in the kernel by getuptime(9). New code should use the replacement interfaces. They do not suffer from the split-read problem inherent to the time_* variables on 32-bit platforms. The variables remain in sys/kern/kern_tc.c for use via kvm(3) when examining kernel core dumps. This commit completes the deprecation process: - Remove the extern'd definitions for time_second and time_uptime from sys/time.h. - Replace manpage cross-references to time_second(9)/time_uptime(9) with references to microtime(9) or a related interface. - Move the time_second.9 manpage to the attic. With input from dlg@, kettenis@, visa@, and tedu@. ok kettenis@
3209772d	2020-06-24 22:03:40	kernel: use gettime(9)/getuptime(9) in lieu of time_second(9)/time_uptime(9) time_second(9) and time_uptime(9) are widely used in the kernel to quickly get the system UTC or system uptime as a time_t. However, time_t is 64-bit everywhere, so it is not generally safe to use them on 32-bit platforms: you have a split-read problem if your hardware cannot perform atomic 64-bit reads. This patch replaces time_second(9) with gettime(9), a safer successor interface, throughout the kernel. Similarly, time_uptime(9) is replaced with getuptime(9). There is a performance cost on 32-bit platforms in exchange for eliminating the split-read problem: instead of two register reads you now have a lockless read loop to pull the values from the timehands. This is really not too bad in the grand scheme of things, but compared to what we were doing before it is several times slower. There is no performance cost on 64-bit (__LP64__) platforms. With input from visa@, dlg@, and tedu@. Several bugs squashed by visa@. ok kettenis@
f90eedca	2020-06-23 01:40:03	add intrmap_one, some temp code to help us write pci_intr_establish_cpu. it means we can do quick hacks to existing drivers to test interrupts on multiple cpus. emphasis on quick and hacks. ok jmatthew@, who will also ok the removal of it at the right time.
8dca5d44	2020-06-22 21:16:07	timecounting: add gettime(9), getuptime(9) time_second and time_uptime are used widely in the tree. This is a problem on 32-bit platforms because time_t is 64-bit, so there is a potential split-read whenever they are used at or below IPL_CLOCK. Here are two replacement interfaces: gettime(9) and getuptime(9). The "get" prefix signifies that they do not read the hardware timecounter, i.e. they are fast and low-res. The lack of a unit (e.g. micro, nano) signifies that they yield a plain time_t. As an optimization on LP64 platforms we can just return time_second or time_uptime, as a single read is atomic. On 32-bit platforms we need to do the lockless read loop and get the values from the timecounter. In a subsequent diff these will be substituted for time_second and time_uptime almost everywhere in the kernel. With input from visa@ and dlg@. ok kettenis@
bf2d84a0	2020-06-22 18:25:57	inittodr(9): introduce dedicated flag to enable writes from resettodr(9) We don't want resettodr(9) to write the RTC until inittodr(9) has actually run. Until inittodr(9) calls tc_setclock() the system UTC clock will contain a meaningless value and there's no sense in overwriting a good value with a value we know is nonsense. This is not an uncommon problem if you're debugging a problem in early boot, e.g. a panic that occurs prior to inittodr(9). Currently we use the following logic in resettodr(9) to inhibit writes: if (time_second == 1) return; ... this is too magical. A better way to accomplish the same thing is to introduce a dedicated flag set from inittodr(9). Hence, "inittodr_done". Suggested by visa@. ok kettenis@
7ab02df9	2020-06-22 13:14:32	Extend kqueue interface with EVFILT_EXCEPT filter. This filter, already implemented in macOS and Dragonfly BSD, returns exceptional conditions like the reception of out-of-band data. The functionality is similar to poll(2)'s POLLPRI & POLLRDBAND and it can be used by the kqfilter-based poll & select implementation. ok millert@ on a previous version, ok visa@
0de7037a	2020-06-22 02:45:18	there's not going to be any whole kernel wide network livelocks soon.
6281916f	2020-06-21 05:37:26	add mq_push. it's like mq_enqueue, but drops from the head, not the tail. from Matt Dunwoodie and Jason A. Donenfeld
8631d7d9	2020-06-19 02:08:48	backout pipe change, it crashes some arch
6a4f7213	2020-06-18 14:05:21	Compare the types of `so' and `sosp' just after obtaining `sosp'. We can't splice sockets from different domains so there is no reason to have locking and memory allocation in this error path. Also in this case only `so' will be locked by solock() so we should avoid modifying `sosp'. ok mpi@
15ad1813	2020-06-17 18:29:28	Instead of performing three distinct allocations per created pipe, reduce it to a single one. Not only should this be more performant, it also solves a kqueue related issue found by visa@ who also requested this change: if you attach an EVFILT_WRITE filter to a pipe fd, the knote gets added to the peer's klist. This is a problem for kqueue because if you close the peer's fd, the knote is left in the list whose head is about to be freed. knote_fdclose() is not able to clear the knote because it is not registered with the peer's fd. FreeBSD also takes a similar approach to pipe allocations. ok mpi@ visa@
d9f5f731	2020-06-17 03:01:26	make intrmap_cpu return a struct cpu_info *, not a "cpuid number" thing. requested by kettenis@ discussed with jmatthew@
7bd9aa6b	2020-06-17 00:27:52	add intrmap, an api that picks cpus for devices to attach interrupts to. there's been discussions for years (and even some diffs!) about how we should let drivers establish interrupts on multiple cpus. the simple approach is to let every driver look at the number of cpus in a box and just pin an interrupt on it, which is what pretty much everyone else started with, but we have never seemed to get past bikeshedding about. from what i can tell, the principal objections to this are: 1. interrupts will tend to land on low numbered cpus. ie, if drivers try to establish n interrupts on m cpus, they'll start at cpu 0 and go to cpu n, which means cpu 0 will end up with more interrupts than cpu m-1. 2. some cpus shouldn't be used for interrupts. why a cpu should or shouldn't be used for interrupts can be pretty arbitrary, but in practical terms i'm going to borrow from the scheduler and say that we shouldn't run work on hyperthreads. 3. making all the drivers make the same decisions about the above is a lot of maintenance overhead. either we will have a bunch of inconsistencies, or we'll have a lot of untested commits to keep everything the same. my proposed solution to the above is this diff to provide the intrmap api. drivers that want to establish multiple interrupts ask the api for a set of cpus it can use, and the api considers the above issues when generating a set of cpus for the driver to use. drivers then establish interrupts on cpus with the info provided by the map. it is based on the if_ringmap api in dragonflybsd, but generalised so it could be used by something like nvme(4) in the future. this version provides numeric ids for CPUs to drivers, but as kettenis@ has been pointing out for a very long time, it makes more sense to use cpu_info pointers. i'll be updating the code to address that shortly. discussed with deraadt@ and jmatthew@ ok claudio@ patrick@ kettenis@
31aecb51	2020-06-16 05:09:28	wire stoeplitz code into the tree.
cd731760	2020-06-15 15:42:11	Implement a simple kqfilter for deadfs matching its poll handler. ok visa@, millert@
2b88cdaf	2020-06-15 15:29:40	Set __EV_HUP when the conditions matching poll(2)'s POLLHUP are found. This is only done in poll-compatibility mode, when __EV_POLL is set. ok visa@, millert@
98a52fa9	2020-06-15 13:18:33	Raise SPL when modifying ps_klist to prevent a race with interrupts. The list can be accessed from interrupt context if a signal is sent from an interrupt handler. OK anton@ cheloha@ mpi@
efd4ef09	2020-06-14 07:22:55	Remove misleading XXX about locking of ps_klist. All of the kqueue subsystem and ps_klist handling still run under the kernel lock.
a8f4946a	2020-06-12 09:34:17	Revert addition of double underbars for filter-specific flag. Port breakages reported by naddy@
48482229	2020-06-11 13:23:18	Move FRELE() outside fdplock in dup(2) code. This avoids a potential lock order issue with the file close path. The FRELE() can trigger the close path during dup(2) if another thread manages to close the file descriptor simultaneously. This race is possible because the file reference is taken before the file descriptor table is locked for write access. Vitaliy Makkoveev agrees OK anton@ mpi@
1c57bd6b	2020-06-11 09:18:43	Rename poll-compatibility flag to better reflect what it is. While here prefix kernel-only EV flags with two underbars. Suggested by kettenis@, ok visa@
dce82204	2020-06-11 09:06:29	Make spec_kqfilter() and cttykqfilter() behave like their corresponding poll handler if the EV_OLDAPI flag is set. ok visa@
c3947ab6	2020-06-11 06:06:55	whitespace and speeling fix in a comment. no functional change.
1330a2b3	2020-06-11 06:03:54	make taskq_barrier wait for pending tasks, not just the running tasks. I wrote taskq_barrier with the behaviour described in the manpage: taskq_barrier() guarantees that any task that was running on the tq taskq when the barrier was called has finished by the time the barrier returns. Note that it talks about the currently running task, not pending tasks. It just so happens that the original implementation just used task_add to put a condvar on the list and waited for it to run. Because task_add uses TAILQ_INSERT_TAIL, you ended up waiting for all pending work to run too, not just the currently running task. The new implementation took advantage of already holding the lock and used TAILQ_INSERT_HEAD to put the barrier work at the front of the queue so it would run next, which is closer to the stated behaviour. Using the tail insert here restores the previous accidental behaviour. jsg@ points out the following: > The linux functions like flush_workqueue() we use this for in drm want > to wait on all scheduled work not just currently running. > > ie a comment from one of the linux functions: > > /** > * flush_workqueue - ensure that any scheduled work has run to completion. > * @wq: workqueue to flush > * > * This function sleeps until all work items which were queued on entry > * have finished execution, but it is not livelocked by new incoming ones. > */ > > our implementation of this in drm is > > void > flush_workqueue(struct workqueue_struct *wq) > { > if (cold) > return; > > taskq_barrier((struct taskq *)wq); > } I don't think it's worth complicating the taskq API, so I'm just going to make taskq_barrier wait for pending work too. tested by tb@ ok jsg@
db7041cd	2020-06-11 00:00:01	get rid of a vestigial bit of the sbartq. i should have removed the sbartq pointer in r1.47 when i removed the sbartq.
3de2ba95	2020-06-10 13:24:57	Move closef() outside fdplock() in sys_socketpair(). This prevents a lock order problem with altered locking of UNIX domain sockets. closef() does not need the file descriptor table lock. From Vitaliy Makkoveev OK mpi@
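
The taskq_barrier commit (1330a2b3) above hinges on where the barrier entry is placed in the task queue: a tail insert means every pending task runs before the barrier, a head insert means none of them do. The following is a minimal userland sketch of that ordering only, not the kernel's taskq code. `struct demo_task`, `struct demo_list` and `tasks_run_before_barrier()` are hypothetical names invented for illustration; only the TAILQ macros from <sys/queue.h> are real.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a queued task; not the kernel's struct task. */
struct demo_task {
	TAILQ_ENTRY(demo_task) entry;
	int is_barrier;
};
TAILQ_HEAD(demo_list, demo_task);

/*
 * Queue npending ordinary tasks, add one barrier entry at the tail or
 * at the head, then drain the queue in order. Returns the number of
 * ordinary tasks that ran before the barrier was reached.
 */
int
tasks_run_before_barrier(int npending, int barrier_at_tail)
{
	struct demo_list list = TAILQ_HEAD_INITIALIZER(list);
	struct demo_task pending[8], barrier = { .is_barrier = 1 };
	struct demo_task *t;
	int i, ran = 0;

	assert(npending >= 0 && npending <= 8);

	/* Pending work is always appended, as described for task_add. */
	for (i = 0; i < npending; i++) {
		pending[i].is_barrier = 0;
		TAILQ_INSERT_TAIL(&list, &pending[i], entry);
	}

	if (barrier_at_tail)
		TAILQ_INSERT_TAIL(&list, &barrier, entry);
	else
		TAILQ_INSERT_HEAD(&list, &barrier, entry);

	/* Drain from the front; stop once the barrier entry is dequeued. */
	while ((t = TAILQ_FIRST(&list)) != NULL) {
		TAILQ_REMOVE(&list, t, entry);
		if (t->is_barrier)
			break;
		ran++;
	}
	return ran;
}
```

With three pending tasks, the tail insert lets all three run before the barrier, while the head insert lets none run. This matches the commit message's account of why the tail insert gives flush_workqueue()-style semantics.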
