metal/mtce/debian
Eric MacDonald 9b2fb85c30 Run maintenance failsafe reboot using systemd-run
The current method used by Maintenance to launch its failsafe reboot
function causes systemd shutdown to stall due to recent changes in
cgroup containment behavior. Specifically, mtcAgent runs within the
'sm.service' cgroup, while mtcClient is now launched in its own
transient systemd-run cgroup.

When a grandchild process is created using a double-fork and execv,
it inherits the parent's cgroup unless explicitly detached. As a
result, the long-lived reboot sleeper remains part of the parent's
cgroup. During shutdown, systemd waits for all processes in a
service's cgroup to exit before completing the stop operation.
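
The inheritance described above can be observed with a small sketch. It is illustrative only and is not taken from the Maintenance code: it assumes a Linux host that exposes `/proc/<pid>/cgroup`, and the helper name `cgroup_of` is hypothetical.

```shell
#!/bin/sh
# Print the cgroup membership of a process (empty output if unavailable).
cgroup_of() {
    cat "/proc/$1/cgroup" 2>/dev/null
}

# Membership of this shell.
parent_cgroup=$(cgroup_of $$)

# Membership as seen by a grandchild: the backgrounded subshell is the
# child, and the process started by `sh -c` is the grandchild.
grandchild_cgroup=$( ( sh -c 'cat /proc/self/cgroup' 2>/dev/null & ) )

if [ "$parent_cgroup" = "$grandchild_cgroup" ]; then
    echo "grandchild shares the parent's cgroup"
else
    echo "grandchild is in a different cgroup"
fi
```

Forking again, or backgrounding the process, changes its parentage but not its cgroup membership, which is why the sleeper stays attached to the service that launched it.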

In the case of a failsafe reboot, the grandchild sleeps for a period
before issuing a SysRq-triggered reboot. This delay often exceeds
systemd’s default 90-second shutdown timeout, causing unnecessary
delays during node reboot even though the system could otherwise
shut down cleanly in less time.

This update resolves the issue by switching the failsafe reboot logic
to use systemd-run. This ensures the reboot script runs in its own
isolated transient unit and cgroup, fully detached from the parent
service.
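
As a rough sketch of what such a launch could look like, the helper below only prints a systemd-run command line and does not execute it. The script path, the helper name `failsafe_reboot_cmd` and the choice of options are assumptions for illustration; the exact arguments used by this change are not stated here.

```shell
#!/bin/sh
# Hypothetical install path; this commit message does not say where the
# script is installed.
REBOOT_SCRIPT="${REBOOT_SCRIPT:-/usr/local/sbin/delayed_sysrq_reboot.sh}"

# Print (do not run) a systemd-run command that would start the reboot
# script in its own transient service unit.
#   --no-block : return without waiting for the unit to finish starting
#   --collect  : garbage-collect the unit after it exits, even on failure
#   --quiet    : suppress systemd-run's informational output
failsafe_reboot_cmd() {
    delay_secs="$1"
    printf 'systemd-run --quiet --no-block --collect %s %s\n' \
        "$REBOOT_SCRIPT" "$delay_secs"
}

failsafe_reboot_cmd 120
```

Because the transient unit has its own cgroup, stopping the parent service no longer has to wait for the sleeping reboot process.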

A new `delayed_sysrq_reboot.sh` script is introduced to implement
the reboot logic. With this change, failsafe reboots now work as
expected without stalling systemd shutdown, whether triggered from
mtcAgent or mtcClient.
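
The following is a minimal sketch of the kind of checks the test plan below describes; it is not the contents of `delayed_sysrq_reboot.sh`. The delay bounds and the function names are assumptions, and the sysrq file paths are passed in as arguments so the sketch can be exercised against ordinary files.

```shell
#!/bin/sh
# Assumed bounds; the commit says the delay is range-checked but does not
# state the range.
MIN_DELAY=1
MAX_DELAY=3600

# Succeed only when exactly one argument is given and it is a decimal
# integer within the assumed bounds.
valid_delay() {
    [ "$#" -eq 1 ] || return 1
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge "$MIN_DELAY" ] && [ "$1" -le "$MAX_DELAY" ]
}

# Enable SysRq, then request an immediate reboot.
#   $1: sysrq enable file  (on a real host: /proc/sys/kernel/sysrq)
#   $2: sysrq trigger file (on a real host: /proc/sysrq-trigger)
# Fails without writing anything if the trigger file is absent.
sysrq_reboot() {
    enable_file="$1"
    trigger_file="$2"
    [ -w "$trigger_file" ] || return 1
    echo 1 > "$enable_file" || return 1   # mirrors the "auto enable" step
    echo b > "$trigger_file"              # 'b' requests an immediate reboot
}
```

A complete script would presumably call `valid_delay "$@"`, confirm `id -u` reports 0, `sleep` for the delay, and only then call `sysrq_reboot` with the real `/proc` paths.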

Test Plan:

PASS: Verify build, install and unlock/lock/unlock of each node in
PASS: ... AIO SX (hw) and AIO DX (hw)
PASS: ... AIO DX with SX subcloud (dc-libvirt)
PASS: ... Standard 2+1+1 storage (vbox)

Unit testing of new delayed_sysrq_reboot.sh

PASS: Verify reset after specified delay (success path)
PASS: Verify kernel sysrq auto enable feature (recovery path)
PASS: Verify sysrq reset is rejected (failure path) when
PASS: ... delay value is not specified
PASS: ... delay is out of range
PASS: ... called as non-root user
PASS: ... too few or many arguments
PASS: ... the /proc/sysrq-trigger file is not present (fit)

PASS: Verify file is owned as root:root and has root only permissions
PASS: Verify behavior if /proc/sysrq-trigger does not cause a reset
PASS: Verify execution logging
PASS: Verify shellcheck static analysis
PASS: Verify handling of /var/run/.node_reset flag file detection

Updated delayed failsafe sysrq function

PASS: Verify systemd-run command arguments
PASS: Verify sysrq reset occurs over unlock of local or remote system
      node after the specified delay.

General:

PASS: Verify no shutdown delay due to failsafe reboot launch
      over self unlock
PASS: Verify no unexpected shutdown kernel tracebacks
PASS: Verify kernel and console logging
PASS: Verify no coredumps or crashdumps during feature update testing
PASS: Verify mtcClient doesn't stall shutdown over 10 lock/unlocks of
      ... standby controller-1 (AIO DX hw)
PASS: ... worker (vbox)
PASS: ... storage (vbox)

Closes-Bug: 2111280
Change-Id: I86e0191548f8f13f61960a91e4e0bbe83134cca6
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2025-05-20 13:23:01 +00:00