When a node is shut down without being detected by the kubelet's Node
Shutdown Manager, the shutdown is non-graceful. A non-graceful node
shutdown is usually not a problem for stateless apps, but it is a
problem for stateful apps. For this scenario, Kubernetes 1.28 promoted
the "out-of-service" taint to GA. When added to a node, this taint
triggers the forceful deletion of the pods on that node that have no
matching toleration [1].
However, the mariadb and rabbitmq charts of the stx-openstack app
include a toleration for any "NoExecute" taint, which prevents the
non-graceful node shutdown feature that reached GA in k8s v1.28 from
working as expected.
Since there is no evidence that stx-openstack needs these "NoExecute"
tolerations, this change removes them to enable stx-openstack support
for non-graceful node shutdown.
[1] https://kubernetes.io/blog/2023/08/16/kubernetes-1-28-non-graceful-node-shutdown-ga
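
One possible way to confirm that the blanket toleration is gone after
the app is applied is to print the tolerations of the affected pods.
This is only a sketch: the "openstack" namespace and the
"application=<name>" label selectors are assumptions and are not taken
from this change. Kubernetes typically adds NoExecute tolerations for
the not-ready and unreachable taints by default, so those entries are
expected. What should no longer appear is a toleration that matches any
NoExecute taint.

```shell
#!/bin/sh
# Assumed namespace; override with NAMESPACE=<ns> if it differs.
NAMESPACE="${NAMESPACE:-openstack}"

# Print "<pod name><TAB><tolerations>" for every pod matching a label
# selector. $1 is the selector, e.g. "application=mariadb" (assumed label).
list_tolerations() {
    kubectl get pods -n "$NAMESPACE" -l "$1" \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.tolerations}{"\n"}{end}'
}

# Example usage (requires access to the cluster):
#   list_tolerations application=mariadb
#   list_tolerations application=rabbitmq
```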
Test Plan:
[PASS] build stx-openstack packages and tarball
[PASS] upload stx-openstack tarball and apply the app
Non-graceful node shutdown on AIO-DX system:
[PASS] non-graceful shutdown of standby controller
       (ipmitool -I <interface> -H <hostname> chassis power off)
[PASS] mariadb and openstack CLI NOT working*
[PASS] trigger pod evacuation from standby controller
       (kubectl taint nodes controller-1 \
       node.kubernetes.io/out-of-service=nodeshutdown:NoExecute)
[PASS] reapply stx-openstack
[PASS] mariadb and openstack CLI recovered
[PASS] restart standby controller
[PASS] remove "out-of-service" taint
       (kubectl taint nodes controller-1 \
       node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-)
[PASS] restart rabbitmq pods**
[PASS] reapply stx-openstack
Non-graceful node shutdown on standard system:
[PASS] upload and apply stx-openstack
[PASS] non-graceful shutdown of standby controller
       (ipmitool -I <interface> -H <hostname> chassis power off)
[PASS] mariadb and openstack CLI are still working
[PASS] restart standby controller
[PASS] stx-openstack automatically reapplied
[PASS] non-graceful shutdown of garbd compute node
[PASS] mariadb and openstack CLI are still working
[PASS] restart garbd compute node
[PASS] stx-openstack automatically reapplied
* Since AIO-DX is limited to 2 nodes (non-HA architecture), it can't
recover automatically from a non-graceful node shutdown. To recover
the application, the operator needs to execute the manual procedure
described in the test plan.
** Due to a known limitation of the manual recovery procedure, the
rabbitmq pods need to be manually restarted after the shut-down node is
restarted. Otherwise, the second rabbitmq pod won't be recognized as
part of the rabbitmq cluster and the app reapply will fail.
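
The manual AIO-DX recovery steps from the test plan could be scripted
roughly as below. This is a sketch under stated assumptions, not a
supported tool: the node name default, the "openstack" namespace and
the "application=rabbitmq" label selector are assumptions, and
reapply_app assumes the StarlingX "system" CLI is available with
credentials already loaded. Restarting rabbitmq is done here by deleting
the pods, which assumes their controller recreates them.

```shell
#!/bin/sh
# Assumed defaults; override via the environment if they differ.
NODE="${NODE:-controller-1}"
NAMESPACE="${NAMESPACE:-openstack}"
TAINT="node.kubernetes.io/out-of-service=nodeshutdown:NoExecute"

# Mark the powered-off node as out of service so that its pods are
# forcefully deleted and can be rescheduled.
evacuate_node() {
    kubectl taint nodes "$NODE" "$TAINT"
}

# Run only after the node has been powered back on.
restore_node() {
    # The trailing "-" removes the taint.
    kubectl taint nodes "$NODE" "${TAINT}-"
    # Restart rabbitmq so both pods rejoin the cluster
    # (label selector is an assumption).
    kubectl delete pods -n "$NAMESPACE" -l application=rabbitmq
}

# Reapply the application through the StarlingX CLI.
reapply_app() {
    system application-apply stx-openstack
}
```

A possible order of use is: evacuate_node, reapply_app, power the node
back on, restore_node, then reapply_app again.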
Closes-Bug: #2126578
Change-Id: I7b660ae88d3614484bd01cd6d5d7fb37ac382c56
Signed-off-by: Alex Figueiredo <alex.fernandesfigueiredo@windriver.com>