DAOS-17427 control: Restart excluded rank after suicide (#16279) #18422

Open: tanabarr wants to merge 1 commit into release/2.8 from tanabarr/control-engine-suicide-restart-rel2_6
tanabarr (Contributor) commented Jun 3, 2026

When an engine detects that it has been removed from the system group
map by receiving a CART event, it will now notify its local control
plane with a RAS engine_self_terminated event before terminating its
own process. After receiving this self-termination event, the local
control plane will restart the engine so it can rejoin the system.

The goal of this change is to improve overall system resilience by
automatically recovering engines that are excluded because of
temporary issues such as network instability. Once the engines rejoin,
the rank will still need to be reintegrated into pools as a separate
follow‑up step.

Rate-limiting prevents restart storms: a configurable minimum delay
(default 300 seconds) between restarts per rank ensures system
stability. Two new server config file parameters control behavior:
disable_engine_auto_restart (boolean, default false) completely
disables automatic restarts, while engine_auto_restart_min_delay
(integer seconds) sets the minimum time between consecutive restart
attempts.

Functional tests for the automatic engine restart feature are included,
with cases that verify disabling, rate-limiting and configuration
support.

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr requested review from a team as code owners June 3, 2026 10:53
@tanabarr tanabarr self-assigned this Jun 3, 2026
@tanabarr tanabarr added the clean-cherry-pick label (Cherry-pick from another branch that did not require additional edits) Jun 3, 2026
@tanabarr tanabarr requested review from kjacque and mjmac June 3, 2026 10:54
github-actions Bot commented Jun 3, 2026

Ticket title is 'Handle engine suicides by automatically restarting the engines'
Status is 'Awaiting backport'
Labels: 'request_for_2.8'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-17427

daosbuild3 (Collaborator) commented

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18422/1/execution/node/1388/log

daosbuild3 (Collaborator) commented

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18422/1/testReport/

daosbuild3 (Collaborator) commented

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18422/2/testReport/

tanabarr (Contributor, Author) commented Jun 5, 2026

The failures are expected, known failures in NLT and the dfuse write-through test.

tanabarr (Contributor, Author) commented Jun 5, 2026

@gnailzenh this is ready to be landed to the 2.8 branch whenever merge approval is given

daltonbohning (Contributor) left a comment

ftest LGTM

@kjacque kjacque requested a review from a team June 5, 2026 21:25
@github-actions github-actions Bot added the priority label (Ticket has high priority (automatically managed)) Jun 5, 2026
Labels

  • clean-cherry-pick: Cherry-pick from another branch that did not require additional edits
  • priority: Ticket has high priority (automatically managed)

5 participants