DAOS-17427 control: Restart excluded rank after suicide (#16279) #18422
tanabarr wants to merge 1 commit into
When an engine detects, by receiving a CART event, that it has been removed from the system group map, it now notifies its local control plane with a RAS engine_self_terminated event before terminating its own process. On receiving this self-termination event, the local control plane restarts the engine so that it can rejoin the system.

The goal of this change is to improve overall system resilience by automatically recovering engines that are excluded because of temporary issues such as network instability. Once an engine rejoins, its rank still needs to be reintegrated into pools as a separate follow-up step.

Rate-limiting prevents restart storms: a configurable minimum delay (default 300 seconds) is enforced between restarts of the same rank. Two new server config file parameters control the behavior: disable_engine_auto_restart (boolean, default false) completely disables automatic restarts, while engine_auto_restart_min_delay (integer seconds) sets the minimum time between consecutive restart attempts.

Functional tests for the automatic engine restart feature are included, with cases to verify disabling, rate-limiting and configuration support.

Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>

Ticket title is 'Handle engine suicides by automatically restarting the engines'

Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18422/1/testReport/

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18422/1/execution/node/1388/log

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18422/1/testReport/

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18422/2/testReport/

Expected known failures on NLT and dfuse write through test

@gnailzenh this is ready to be landed to the 2.8 branch whenever merge approval is given |