ENT-13720: Fixed daemon hang on SIGTERM during child process wait (3.27.x)#6130
Draft
larsewi wants to merge 2 commits into
ShellCommandReturnsZero retried waitpid() unconditionally on EINTR,
so daemons (cf-serverd, cf-execd, cf-monitord) blocked waiting for a
child process -- such as cf-promises during policy validation --
stayed unresponsive to SIGTERM until the child finished. The signal
handler set PENDING_TERMINATION but the main loop never got control
back to check it.
Now, when waitpid() is interrupted and termination is pending, the
child is stopped via ProcessSignalTerminate (SIGINT -> SIGTERM ->
SIGKILL) and reaped, so the daemon's main loop can exit promptly.
Ticket: ENT-13720
Changelog: Title
Signed-off-by: Lars Erik Wik <lars.erik.wik@northern.tech>
(cherry picked from commit 243a10f)
The previous attempt checked IsPendingTermination() in the EINTR
branch of the blocking waitpid() loop, but that branch is never
reached: signal() on Linux/glibc installs handlers with SA_RESTART,
so the kernel transparently restarts waitpid() after the handler
runs and the userspace EINTR check never fires. The daemon stays
blocked in waitpid() until the child exits on its own, which is the
exact symptom we set out to fix.
Poll the child with waitpid(WNOHANG) instead, so we get control back
between iterations and can react to PENDING_TERMINATION regardless of
whether the signal interrupts the syscall. nanosleep() between polls
keeps the loop from busy-spinning; since it is never restarted across
signals, SIGTERM wakes us up promptly and the 100 ms interval is only
an upper bound on idle wakeup latency.
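For illustration, the polling approach described above could look roughly like the sketch below. It is not the code from this PR: pending_termination, handle_term, install_term_handler and wait_child_interruptible are hypothetical stand-ins for PENDING_TERMINATION, the daemon signal handler and ShellCommandReturnsZero, and the child is stopped with a plain SIGKILL instead of the SIGINT -> SIGTERM -> SIGKILL escalation that ProcessSignalTerminate performs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-in for the daemon's PENDING_TERMINATION flag. */
static volatile sig_atomic_t pending_termination = 0;

static void handle_term(int signum)
{
    (void) signum;
    pending_termination = 1;
}

/* Installs the handler with SA_RESTART, mimicking what glibc's signal()
 * does, to show that the polling loop does not depend on EINTR. */
static int install_term_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handle_term;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGTERM, &sa, NULL);
}

/* Polls the child instead of blocking in waitpid(). Returns true only
 * if the child exited normally with status 0. */
static bool wait_child_interruptible(pid_t pid)
{
    assert(pid > 0); /* pid <= 0 would make waitpid() wait for other children */

    const struct timespec interval = { .tv_sec = 0, .tv_nsec = 100L * 1000L * 1000L };
    int status = 0;

    for (;;)
    {
        const pid_t ret = waitpid(pid, &status, WNOHANG);
        if (ret == pid)
        {
            return WIFEXITED(status) && WEXITSTATUS(status) == 0;
        }
        if (ret == -1 && errno != EINTR)
        {
            return false; /* e.g. ECHILD: nothing left to wait for */
        }
        if (pending_termination)
        {
            /* Simplification: the real change escalates through
             * ProcessSignalTerminate; here the child is killed outright. */
            kill(pid, SIGKILL);
            while (waitpid(pid, &status, 0) == -1 && errno == EINTR)
            {
                /* retry until the child is reaped */
            }
            return false;
        }
        /* Not restarted after a signal, so SIGTERM ends the sleep early. */
        nanosleep(&interval, NULL);
    }
}
```

A caller would run install_term_handler() once at startup and then use wait_child_interruptible(pid) wherever it previously blocked in waitpid(pid, &status, 0). A signal that arrives between the flag check and nanosleep() is still picked up on the next poll, at most one interval later.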
References (Linux man-pages 6.9.1):
signal(2):
"By default, in glibc 2 and later, the signal() wrapper function
does not invoke the kernel system call. Instead, it calls
sigaction(2) using flags that supply BSD semantics. [...] The
BSD semantics are equivalent to calling sigaction(2) with the
following flags: sa.sa_flags = SA_RESTART;"
signal(7), "Interruption of system calls and library functions by
signal handlers":
"If a blocked call to one of the following interfaces is
interrupted by a signal handler, then the call is automatically
restarted after the signal handler returns if the SA_RESTART
flag was used; otherwise the call fails with the error EINTR:
[...] wait(2), wait3(2), wait4(2), waitid(2), and waitpid(2)."
signal(7), same section:
"The following interfaces are never restarted after being
interrupted by a signal handler, regardless of the use of
SA_RESTART; they always fail with the error EINTR when
interrupted by a signal handler: [...] Sleep interfaces:
clock_nanosleep(2), nanosleep(2), and usleep(3)."
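The last signal(7) quote can be checked with a small experiment. The sketch below is an assumption-laden demo, not part of the patch: on_alarm and nanosleep_fails_with_eintr are hypothetical names, and SIGALRM from a one-shot 50 ms timer stands in for the SIGTERM a daemon would receive. The handler is installed with SA_RESTART, yet nanosleep() is expected to fail with EINTR rather than be restarted.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

static volatile sig_atomic_t got_alarm = 0;

static void on_alarm(int signum)
{
    (void) signum;
    got_alarm = 1;
}

/* Returns true if nanosleep() failed with EINTR even though the signal
 * handler was installed with SA_RESTART. */
static bool nanosleep_fails_with_eintr(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART; /* same flag glibc's signal() uses */
    if (sigaction(SIGALRM, &sa, NULL) == -1)
    {
        return false;
    }

    /* One-shot timer: deliver SIGALRM after 50 ms. */
    struct itimerval timer;
    memset(&timer, 0, sizeof(timer));
    timer.it_value.tv_usec = 50000;
    if (setitimer(ITIMER_REAL, &timer, NULL) == -1)
    {
        return false;
    }

    /* Ask for a much longer sleep than the timer interval. */
    const struct timespec request = { .tv_sec = 5, .tv_nsec = 0 };
    struct timespec remaining = { 0, 0 };
    errno = 0;
    const int ret = nanosleep(&request, &remaining);

    return ret == -1 && errno == EINTR && got_alarm == 1;
}
```

Replacing nanosleep() with a blocking waitpid() on a long-running child in the same setup would show the opposite behaviour described in the second quote: the call is restarted and does not return until the child exits.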
Ticket: ENT-13720
Signed-off-by: Lars Erik Wik <lars.erik.wik@northern.tech>
(cherry picked from commit ec2627e)
ShellCommandReturnsZero retried waitpid() unconditionally on EINTR, so daemons blocked waiting for a child, e.g. cf-promises during policy validation, stayed unresponsive to SIGTERM until the child finished. HandleSignalsForDaemon sets PENDING_TERMINATION, but the main loop never got control back to check it.

On interrupted waitpid, we now check IsPendingTermination() and, if set, stop the child via ProcessSignalTerminate and reap it, so the daemon can exit.

Observed in the valgrind-checks CI job, where pkill -f cf-serverd failed to kill the bootstrap daemon and the valgrind-wrapped replacement could not bind to the listening port.

Backported from #6129