@AlexJones0 AlexJones0 commented Feb 10, 2026

See also the relevant signal tests in #93. Ideally that PR is merged and this PR is rebased on top so that we can mark the relevant signal tests as passing, rather than a non-strict xfail. Doing things in this order would also provide confidence that this isn't breaking anything else.

Logging and threading primitives in the scheduler's signal handler can occasionally deadlock, which causes the process to hang on a SIGINT/SIGTERM. We also want the signal to interrupt our poll wait/sleep without busy waiting (for performance). That rules out time.sleep (a signal that arrives before the sleep starts will not interrupt it, and a pre-check could lead to TOCTOU races) and signal.sigtimedwait (which registers its own handlers to handle signals inside the wait, but misses signals outside).

This leaves us with one workable solution: use an OS pipe and define a selector on the read file descriptor, and have the signal handler set a flag with the signal number and write to the write file descriptor. By querying the flag, the main loop always knows whether a signal has arrived, and because the wait is a blocking select on the fd (i.e. not a busy wait), a signal reliably cuts the wait short even if it arrives before the wait starts. The signal handler is then minimal and async-signal-safe, just setting a flag and writing to the pipe. The relevant logging logic is moved to be dispatched by the main loop instead.
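For reviewers unfamiliar with the pattern, here is a minimal sketch of the pipe-plus-selector approach described above. It is not the code in this PR: the class name `SignalWakeup`, its methods, and the 4096-byte drain size are hypothetical, and it assumes a POSIX platform with handlers installed from the main thread.

```python
import os
import selectors
import signal


class SignalWakeup:
    """Hypothetical helper: wake a polling main loop via a self-pipe."""

    def __init__(self, signums=(signal.SIGINT, signal.SIGTERM)):
        self.received = None  # flag: number of the last signal seen
        self._read_fd, self._write_fd = os.pipe()
        # Non-blocking ends, so neither the handler nor the drain can hang.
        os.set_blocking(self._read_fd, False)
        os.set_blocking(self._write_fd, False)
        self._selector = selectors.DefaultSelector()
        self._selector.register(self._read_fd, selectors.EVENT_READ)
        # Remember the previous handlers so close() can restore them.
        self._old = {s: signal.signal(s, self._handler) for s in signums}

    def _handler(self, signum, frame):
        # Deliberately minimal: no logging, no locks, no threading calls.
        self.received = signum
        try:
            os.write(self._write_fd, b"\0")
        except BlockingIOError:
            pass  # pipe is full, so a wakeup is already pending

    def wait(self, timeout):
        """Block for up to `timeout` seconds; return a signal number or None."""
        if self._selector.select(timeout):
            try:
                os.read(self._read_fd, 4096)  # drain pending wakeup bytes
            except BlockingIOError:
                pass
        # The flag is left set: in this sketch a signal means "shut down".
        return self.received

    def close(self):
        for signum, handler in self._old.items():
            signal.signal(signum, handler)
        self._selector.close()
        os.close(self._read_fd)
        os.close(self._write_fd)
```

A main loop would call `wait(poll_interval)` in place of a plain sleep and do its logging only after `wait` returns a signal number. A signal that arrives before `wait` is called has already written a byte to the pipe, so the select returns immediately rather than sleeping for the full timeout.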

Edit: The diff is unfortunately not very nice - it might be easier to view with your Git tooling of preference, or just compare the old and new code side-by-side.

If run on top of #93, this can be tested by running e.g. pytest -k test_signal --count 100 -n auto:

  • Before this PR, I got a result of: 17 xfailed, 586 xpassed in 30.84s
  • With this PR, I get a result of: 600 xpassed in 29.82s

See relevant comments - logging and threading primitives in the signal
handler can deadlock, which causes the process to hang around 5% of the
time on a SIGINT/SIGTERM. We also want the signal to interrupt our poll
wait/sleep without busy waiting (for performance), which means we cannot
use time.sleep (an early signal will not interrupt this, and a pre-check
leads to TOCTOU races), nor signal.sigtimedwait (registers its own
handlers to handle signals inside the wait, but misses signals outside).

This leaves us with one clear solution - use an OS pipe and define a
selector on the read file descriptor, and have the signal handler set a
flag with the signal number and write to the write file descriptor. By
querying the flag we always know if we have handled a signal in our main
loop, and by using a fd we reliably skip the wait on a signal, where the
wait is blocking (i.e. not a busy wait). The signal handler is then
minimal and async-signal-safe, just setting a flag and writing to the
pipe. The relevant logging logic is moved to be dispatched by the main
loop instead.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>