ENT-11189: Fix cf-apache restart-loop during hub upgrades #3158
Draft
larsewi wants to merge 3 commits into
cf-apache.service is Type=forking with PIDFile=$(sys.workdir)/httpd/httpd.pid, so systemd waits for the PID file before declaring the service started. apachectl writes the PID file shortly after fork, but on a busy host (e.g. during mission-portal upgrade with concurrent SELinux relabeling, cf-postgres and cf-php-fpm restarts) that gap has been observed to exceed the inherited default TimeoutStartSec of 90 s (see systemd-system.conf(5), DefaultTimeoutStartSec=).

When systemd then SIGKILLs the apache parent, worker children survive holding 0.0.0.0:80, the unit enters a restart loop, and subsequent apachectl invocations from policy fail with "Address already in use".

Raising TimeoutStartSec to 300 s gives apache enough headroom on a loaded host while still bounding startup time, so a genuinely hung httpd will still be terminated by systemd.

Ticket: ENT-11189
ChangeLog: Title
Signed-off-by: Lars Erik Wik <lars.erik.wik@northern.tech>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
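To confirm which start timeout systemd actually applies to the unit, the property can be queried directly. This is a minimal sketch, assuming a POSIX shell; `effective_start_timeout` is a hypothetical helper name that is not part of this PR, and the formatting of the printed time span may differ between systemd versions.

```shell
# Hypothetical helper (not part of the PR): print the start timeout that
# systemd currently applies to a unit.
# `systemctl show -p TimeoutStartUSec UNIT` prints a single line of the form
# "TimeoutStartUSec=<timespan>"; the sed call strips the property name.
effective_start_timeout() {
    systemctl show -p TimeoutStartUSec "$1" | sed 's/^TimeoutStartUSec=//'
}
```

Calling `effective_start_timeout cf-apache` before and after the upgrade is one way to check that the hub is no longer running with the inherited default.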
The mission_portal_apache_from_stage bundle uses the 'systemd_supervised' class to decide whether to manage cf-apache via systemd (services: promise) or by invoking apachectl directly (commands: promise). The class was set from 'systemctl -q is-active cf-apache', which returns non-zero whenever the unit is currently inactive or failed, including transient failures during an upgrade.

In ENT-11189 we observed that this caused the policy to fall back to the direct-apachectl branch while systemd was concurrently retrying cf-apache in its own restart loop, leaving the two racing each other and apachectl failing with "Address already in use".

Switching the probe to 'systemctl cat cf-apache' answers the right question ("does systemd know about this unit?"), which is true regardless of the unit's current active/failed/inactive state.

Ticket: ENT-11189
ChangeLog: Title
Signed-off-by: Lars Erik Wik <lars.erik.wik@northern.tech>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
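The difference between the two probes can be sketched in shell. The function names below are hypothetical and purely illustrative; they are not how the policy invokes the commands, only a way to compare the exit statuses side by side.

```shell
# Hypothetical helpers (not part of the PR) contrasting the two probes.

# Old probe: the exit status reflects the unit's *current* state, so it is
# non-zero while the unit is inactive, failed, or between restart attempts.
unit_is_active() {
    systemctl -q is-active "$1"
}

# New probe: succeeds whenever systemd can find a unit file for the name,
# regardless of whether the unit is active, inactive, or failed.
unit_is_known() {
    systemctl cat "$1" >/dev/null 2>&1
}
```

On a hub caught mid-restart, `unit_is_active cf-apache` would be expected to fail while `unit_is_known cf-apache` still succeeds, which is the property the `systemd_supervised` class needs.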
If cf-apache.service has been failing repeatedly, systemd latches it as 'failed' and refuses subsequent restart requests (StartLimitBurst / StartLimitIntervalSec, see systemd.unit(5)). The service_policy => "restart" below is then a silent no-op and the hub stays down.

Add a methods promise that runs 'systemctl reset-failed cf-apache' via a new cf_apache_reset_failed_state helper, gated on mission_portal_apache_config_repaired so it only fires in the same agent pass that has just rewritten the apache config and is about to issue a restart. On idle runs it does nothing.

Ticket: ENT-11189
ChangeLog: Title
Signed-off-by: Lars Erik Wik <lars.erik.wik@northern.tech>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
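Run by hand, the recovery sequence this commit automates looks roughly like the following. This is a sketch under assumptions: `restart_clearing_failed_state` is a hypothetical name, and in the PR itself the reset is issued by a methods promise and the restart by a services promise rather than by a shell function.

```shell
# Hypothetical helper (not part of the PR): clear the start-limit latch,
# then restart the unit.
restart_clearing_failed_state() {
    # reset-failed clears the unit's 'failed' state and its start rate-limit
    # counter. Ignore its exit status so a unit that cannot be reset does not
    # block the restart attempt.
    systemctl reset-failed "$1" || true
    systemctl restart "$1"
}
```

The ordering matters: without the reset, the restart request can be rejected by the rate limiter before apache is ever started.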
@cf-bottom Jenkins please :)

Alright, I triggered a build:
Jenkins: https://ci.cfengine.com/job/pr-pipeline/13827/
Packages: http://buildcache.cfengine.com/packages/testing-pr/jenkins-pr-pipeline-13827/
craigcomstock approved these changes May 21, 2026
nickanderson approved these changes May 21, 2026
Summary
In sequential-tests build #208 the rhel-8 hub got stuck in a cf-apache restart loop during the 3.24.4 → 3.27.1 → master upgrade, failing with `(98) Address already in use: AH00072: could not bind to address 0.0.0.0:80`. Investigation traced this to three independent issues that compound each other.

Three commits, three independent fixes

1. Raised cf-apache.service start timeout to avoid PID-file race: `cf-apache.service` is `Type=forking` with `PIDFile=`, but `apachectl` writes the PID file a beat after fork. On a busy host (mid-upgrade) that gap exceeded the inherited default `TimeoutStartSec=90s` (systemd-system.conf(5)). systemd then SIGKILLed the apache parent, but worker children survived holding :80, and the unit entered a restart loop. Raised to 300 s: still bounded, just enough headroom for slow startups.

2. Check 'systemctl cat' instead of 'is-active' for cf-apache: `mission_portal_apache_from_stage` uses the `systemd_supervised` class to choose between the systemd code path (`services:` promise) and direct apachectl invocation (`commands:` promise). The probe was `systemctl -q is-active cf-apache`, which returns non-zero whenever the unit is transiently inactive or failed. During the restart loop above, the policy fell back to the apachectl branch and raced systemd. Switching to `systemctl cat cf-apache` answers the right question ("does systemd know this unit?") independent of the unit's current state.

3. Reset cf-apache failed state before restarting it: systemd's start rate limiter (`StartLimitBurst`/`StartLimitIntervalSec`, default 5 failures in 10 s) can latch the unit as `failed`, after which `systemctl restart cf-apache` returns "Start request repeated too quickly" and the policy's `service_policy => "restart"` becomes a silent no-op. Added a `methods:` promise (runs before `services:` in CFEngine's promise-type order) that invokes `systemctl reset-failed cf-apache` via a small helper bundle, gated on `mission_portal_apache_config_repaired` so it only fires when the agent is about to restart anyway. Idle runs do nothing.

Commits 1 and 2 close the known failure path. Commit 3 closes the general recovery path so future failure modes can't permanently wedge the hub through the same latch.
Ticket: ENT-11189