ENT-11189: Fix cf-apache restart-loop during hub upgrades by larsewi · Pull Request #3158 · cfengine/masterfiles

larsewi · 2026-05-21T11:13:00Z

Summary

In sequential-tests build #208 the rhel-8 hub got stuck in a cf-apache restart loop during the 3.24.4 → 3.27.1 → master upgrade, failing with (98) Address already in use: AH00072: could not bind to address 0.0.0.0:80. Investigation traced this to three independent issues that compound each other.

Three commits, three independent fixes

1. Raised cf-apache.service start timeout to avoid PID-file race. cf-apache.service is Type=forking with PIDFile=, but apachectl writes the PID file shortly after fork. On a busy host (mid-upgrade) that gap exceeded the inherited default TimeoutStartSec=90s (systemd-system.conf(5)). systemd then sent SIGKILL to the apache parent, but worker children survived holding :80, and the unit entered a restart loop. The timeout is now 300 s: still bounded, with enough headroom for slow startups.
2. Check 'systemctl cat' instead of 'is-active' for cf-apache. mission_portal_apache_from_stage uses the systemd_supervised class to choose between the systemd code path (services: promise) and direct apachectl invocation (commands: promise). The probe was systemctl -q is-active cf-apache, which returns non-zero whenever the unit is transiently inactive or failed. During the restart loop above, the policy fell back to the apachectl branch and raced systemd. Switching to systemctl cat cf-apache answers the right question, "does systemd know this unit?", independent of the unit's current state.
3. Reset cf-apache failed state before restarting it. systemd's start rate limiter (StartLimitBurst / StartLimitIntervalSec, default 5 starts within 10 s) can latch the unit as failed, after which systemctl restart cf-apache returns "Start request repeated too quickly" and the policy's service_policy => "restart" becomes a silent no-op. Added a methods: promise (runs before services: in CFEngine's promise-type order) that invokes systemctl reset-failed cf-apache via a small helper bundle, gated on mission_portal_apache_config_repaired so it only fires when the agent is about to restart anyway. Idle runs do nothing.

Commits 1 and 2 close the known failure path. Commit 3 closes the general recovery path so future failure modes can't permanently wedge the hub through the same latch.

Ticket: ENT-11189

cf-apache.service is Type=forking with PIDFile=$(sys.workdir)/httpd/httpd.pid, so systemd waits for the PID file before declaring the service started. apachectl writes the PID file shortly after fork, but on a busy host (e.g. during mission-portal upgrade with concurrent SELinux relabeling, cf-postgres and cf-php-fpm restarts) that gap has been observed to exceed the inherited default TimeoutStartSec of 90 s (see systemd-system.conf(5), DefaultTimeoutStartSec=). When systemd then SIGKILLs the apache parent, worker children survive holding 0.0.0.0:80, the unit enters a restart loop, and subsequent apachectl invocations from policy fail with "Address already in use". Raising TimeoutStartSec to 300 s gives apache enough headroom on a loaded host while still bounding startup time, so a genuinely hung httpd will still be terminated by systemd.

Ticket: ENT-11189
ChangeLog: Title
Signed-off-by: Lars Erik Wik <lars.erik.wik@northern.tech>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
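For a hub that has not yet picked up this change, similar headroom could probably be approximated with a systemd drop-in. The sketch below is an illustration only and is not part of this PR: the helper name `write_timeout_dropin` and the file name `timeout.conf` are made up here, while the `<unit>.d/*.conf` drop-in mechanism and the `TimeoutStartSec=` setting are standard systemd.

```shell
# Hypothetical helper (not from this PR): write a drop-in that overrides
# TimeoutStartSec for a unit.
#   $1 = drop-in directory, e.g. /etc/systemd/system/cf-apache.service.d
#   $2 = start timeout in seconds
write_timeout_dropin() {
    dropin_dir="$1"
    timeout_sec="$2"
    mkdir -p "$dropin_dir" || return 1
    # A drop-in only needs the section and the keys it overrides.
    printf '[Service]\nTimeoutStartSec=%s\n' "$timeout_sec" \
        > "$dropin_dir/timeout.conf"
}
```

Running `write_timeout_dropin /etc/systemd/system/cf-apache.service.d 300` followed by `systemctl daemon-reload` would apply the override; `systemctl show -p TimeoutStartUSec cf-apache` should then report the effective value.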

The mission_portal_apache_from_stage bundle uses the 'systemd_supervised' class to decide whether to manage cf-apache via systemd (services: promise) or by invoking apachectl directly (commands: promise). The class was set from 'systemctl -q is-active cf-apache', which returns non-zero whenever the unit is currently inactive or failed, including transient failures during an upgrade. In ENT-11189 we observed that this caused the policy to fall back to the direct-apachectl branch while systemd was concurrently retrying cf-apache in its own restart loop, leaving the two racing each other and apachectl failing with "Address already in use". Switching the probe to 'systemctl cat cf-apache' answers the right question, "does systemd know about this unit?", which is true regardless of the unit's current active/failed/inactive state.

Ticket: ENT-11189
ChangeLog: Title
Signed-off-by: Lars Erik Wik <lars.erik.wik@northern.tech>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
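The difference between the two probes can be seen from a shell. This is a minimal sketch, not the policy code from the PR: the function names are hypothetical, and only `systemctl cat` and `systemctl -q is-active` are taken from the commit message.

```shell
# Exit status 0 if systemd has a unit file for this name,
# regardless of whether the unit is active, inactive, or failed.
unit_is_known() {
    systemctl cat "$1" >/dev/null 2>&1
}

# Exit status 0 only while the unit is currently active.
unit_is_active() {
    systemctl -q is-active "$1"
}

# Print both answers side by side for one unit.
describe_probes() {
    unit="$1"
    if unit_is_known "$unit"; then known=yes; else known=no; fi
    if unit_is_active "$unit"; then active=yes; else active=no; fi
    printf 'known=%s active=%s\n' "$known" "$active"
}
```

On a hub whose cf-apache unit is mid-restart-loop, `describe_probes cf-apache` would be expected to report the unit as known but not active, which is the case where the old probe selected the apachectl branch.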

If cf-apache.service has been failing repeatedly, systemd latches it as 'failed' and refuses subsequent restart requests (StartLimitBurst / StartLimitIntervalSec, see systemd.unit(5)). The service_policy => "restart" below is then a silent no-op and the hub stays down. Add a methods promise that runs 'systemctl reset-failed cf-apache' via a new cf_apache_reset_failed_state helper, gated on mission_portal_apache_config_repaired so it only fires in the same agent pass that has just rewritten the apache config and is about to issue a restart. On idle runs it does nothing.

Ticket: ENT-11189
ChangeLog: Title
Signed-off-by: Lars Erik Wik <lars.erik.wik@northern.tech>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
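Done by hand, the same recovery sequence amounts to clearing the latch and then restarting. The sketch below is a shell equivalent for illustration, not the CFEngine helper bundle added by this commit; the function name `restart_with_reset` is hypothetical.

```shell
# Clear a latched 'failed' state, then restart the unit.
#   $1 = unit name, e.g. cf-apache
restart_with_reset() {
    unit="$1"
    # reset-failed clears the failed state and the start rate-limit counter.
    # Its failure is ignored here (for example when the unit is not loaded),
    # so the restart is still attempted.
    systemctl reset-failed "$unit" 2>/dev/null || true
    systemctl restart "$unit"
}
```

The ordering matters: issuing the restart first would hit the rate limiter described above, which is why the policy places the reset in a methods: promise that is evaluated before services:.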

larsewi · 2026-05-21T11:17:45Z

@cf-bottom Jenkins please :)

cf-bottom · 2026-05-21T12:01:02Z

Alright, I triggered a build:

Jenkins: https://ci.cfengine.com/job/pr-pipeline/13827/

Packages: http://buildcache.cfengine.com/packages/testing-pr/jenkins-pr-pipeline-13827/

larsewi and others added 3 commits May 21, 2026 12:38

larsewi changed the title ~~Fix cf-apache restart-loop during hub upgrades (ENT-11189)~~ ENT-11189: Fix cf-apache restart-loop during hub upgrades May 21, 2026

craigcomstock approved these changes May 21, 2026


nickanderson approved these changes May 21, 2026


larsewi wants to merge 3 commits into cfengine:master from larsewi:cf-apache-restart-loop
