ENT-11189: Fix cf-apache restart-loop during hub upgrades#3158

Draft
larsewi wants to merge 3 commits into cfengine:master from larsewi:cf-apache-restart-loop

larsewi commented May 21, 2026

Summary

In sequential-tests build #208 the rhel-8 hub got stuck in a cf-apache restart loop during the 3.24.4 → 3.27.1 → master upgrade, failing with (98) Address already in use: AH00072: could not bind to address 0.0.0.0:80. Investigation traced this to three independent issues that compound each other.

Three commits, three independent fixes

  1. Raised cf-apache.service start timeout to avoid PID-file race: cf-apache.service is Type=forking with PIDFile=, but apachectl writes the PID file a beat after fork. On a busy host (mid-upgrade) that gap exceeded the inherited default TimeoutStartSec=90s (systemd-system.conf(5)). systemd then SIGKILLed the apache parent, but worker children survived holding :80, and the unit entered a restart loop. Raised to 300 s, which is still bounded but leaves enough headroom for slow startups.

  2. Check 'systemctl cat' instead of 'is-active' for cf-apache: mission_portal_apache_from_stage uses the systemd_supervised class to choose between the systemd code path (services: promise) and direct apachectl invocation (commands: promise). The probe was systemctl -q is-active cf-apache, which returns non-zero whenever the unit is transiently inactive or failed. During the restart loop above, the policy fell back to the apachectl branch and raced systemd. Switching to systemctl cat cf-apache answers the right question, "does systemd know this unit?", independent of the unit's current state.

  3. Reset cf-apache failed state before restarting it: systemd's start rate limiter (StartLimitBurst / StartLimitIntervalSec, default 5 start attempts in 10 s) can latch the unit as failed, after which systemctl restart cf-apache returns "Start request repeated too quickly" and the policy's service_policy => "restart" becomes a silent no-op. Added a methods: promise (runs before services: in CFEngine's promise-type order) that invokes systemctl reset-failed cf-apache via a small helper bundle, gated on mission_portal_apache_config_repaired so it only fires when the agent is about to restart anyway. Idle runs do nothing.

Commits 1 and 2 close the known failure path. Commit 3 closes the general recovery path so future failure modes can't permanently wedge the hub through the same latch.

Ticket: ENT-11189

larsewi and others added 3 commits May 21, 2026 12:38
cf-apache.service is Type=forking with PIDFile=$(sys.workdir)/httpd/httpd.pid,
so systemd waits for the PID file before declaring the service started.
apachectl writes the PID file shortly after fork, but on a busy host (e.g.
during mission-portal upgrade with concurrent SELinux relabeling, cf-postgres
and cf-php-fpm restarts) that gap has been observed to exceed the inherited
default TimeoutStartSec of 90 s (see systemd-system.conf(5),
DefaultTimeoutStartSec=). When systemd then SIGKILLs the apache parent,
worker children survive holding 0.0.0.0:80, the unit enters a restart loop,
and subsequent apachectl invocations from policy fail with "Address already
in use".

Raising TimeoutStartSec to 300 s gives apache enough headroom on a loaded
host while still bounding startup time, so a genuinely hung httpd will still
be terminated by systemd.

Ticket: ENT-11189
ChangeLog: Title

Signed-off-by: Lars Erik Wik <lars.erik.wik@northern.tech>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
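
For readers who want to try the longer start timeout on a hub without this change, here is a minimal sketch of the same setting applied as a systemd drop-in. The helper name `write_timeout_dropin` and the file name `timeout.conf` are hypothetical; the commit itself may well set TimeoutStartSec directly in the shipped unit file.

```shell
# Sketch only: helper name and drop-in file name are illustrative assumptions.
write_timeout_dropin() {
    dropin_dir="$1"   # e.g. /etc/systemd/system/cf-apache.service.d
    mkdir -p "$dropin_dir" || return 1
    # A [Service] drop-in overrides the inherited DefaultTimeoutStartSec.
    printf '[Service]\nTimeoutStartSec=300\n' > "$dropin_dir/timeout.conf"
}
```

After writing the drop-in, `systemctl daemon-reload` makes systemd pick it up, and `systemctl show -p TimeoutStartUSec cf-apache` should report the effective value.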
The mission_portal_apache_from_stage bundle uses the 'systemd_supervised'
class to decide whether to manage cf-apache via systemd (services: promise)
or by invoking apachectl directly (commands: promise). The class was set
from 'systemctl -q is-active cf-apache', which returns non-zero whenever
the unit is currently inactive or failed — including transient failures
during an upgrade.

In ENT-11189 we observed that this caused the policy to fall back to the
direct-apachectl branch while systemd was concurrently retrying cf-apache
in its own restart loop, leaving the two racing each other and apachectl
failing with "Address already in use".

Switching the probe to 'systemctl cat cf-apache' answers the right question
— "does systemd know about this unit?" — which is true regardless of the
unit's current active/failed/inactive state.

Ticket: ENT-11189
ChangeLog: Title

Signed-off-by: Lars Erik Wik <lars.erik.wik@northern.tech>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
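
The difference between the two probes can be sketched at the command level as follows. This is shell rather than the CFEngine policy the commit actually changes, and the function names and the overridable `SYSTEMCTL` variable are assumptions made for illustration.

```shell
# SYSTEMCTL is overridable so the helpers can be exercised without systemd.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Old probe: succeeds only while the unit is currently active.
unit_is_active() {
    "$SYSTEMCTL" -q is-active "$1"
}

# New probe: succeeds whenever systemd can find a unit file for the name,
# regardless of whether the unit is active, inactive or failed.
systemd_knows_unit() {
    "$SYSTEMCTL" cat "$1" >/dev/null 2>&1
}
```

During a restart loop `unit_is_active cf-apache` can fail while `systemd_knows_unit cf-apache` still succeeds, which is the case where the old probe sent the policy down the direct-apachectl branch.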
If cf-apache.service has been failing repeatedly, systemd latches it as
'failed' and refuses subsequent restart requests (StartLimitBurst /
StartLimitIntervalSec, see systemd.unit(5)). The service_policy => "restart"
below is then a silent no-op and the hub stays down.

Add a methods promise that runs 'systemctl reset-failed cf-apache' via a new
cf_apache_reset_failed_state helper, gated on mission_portal_apache_config_repaired
so it only fires in the same agent pass that has just rewritten the apache
config and is about to issue a restart. On idle runs it does nothing.

Ticket: ENT-11189
ChangeLog: Title

Signed-off-by: Lars Erik Wik <lars.erik.wik@northern.tech>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
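
At the command level, the ordering this commit enforces looks roughly like the sketch below. The real change is a CFEngine methods: promise calling the cf_apache_reset_failed_state helper; the shell function `reset_then_restart` and the overridable `SYSTEMCTL` variable are hypothetical and only illustrate the sequence.

```shell
# SYSTEMCTL is overridable so the helper can be exercised without systemd.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

reset_then_restart() {
    unit="$1"
    # Clear a latched 'failed' state and the start rate limit counter;
    # ignore errors so the restart is still attempted.
    "$SYSTEMCTL" reset-failed "$unit" || true
    "$SYSTEMCTL" restart "$unit"
}
```

Calling `reset_then_restart cf-apache` issues the reset before the restart, so a unit that previously hit the start limit gets a fresh attempt instead of "Start request repeated too quickly".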

larsewi commented May 21, 2026

@cf-bottom Jenkins please :)

larsewi changed the title from "Fix cf-apache restart-loop during hub upgrades (ENT-11189)" to "ENT-11189: Fix cf-apache restart-loop during hub upgrades" on May 21, 2026