Fix ceph services stopped when controller-1 is down in DX-Direct

In a DX-Direct environment, when controller-1 was not yet initialized
or was down, the storage network service detected no carrier on the
network interfaces and interpreted this as a network failure. Because
of that, it stopped all Ceph services during installation,
which is not the intended behavior.

The solution was to align the logic with what already existed in the
ceph-init-wrapper. A new function was introduced to check whether all
interfaces (oam, mgmt, and cluster_host) had no carrier.

If this condition is true and the system is running in duplex-direct
mode, it means the peer controller is down or not yet initialized,
so Ceph services should remain running instead of being stopped.
If the condition is not met, the behavior remains unchanged and Ceph
services are stopped when the network is not functional.

Additionally, the log function itself was not changed,
but it was moved earlier in the script so it is available for cases
where the sanity check of the SM_CEPH_MON/OSD_CURRENT_STATE variable
requires logging.

Test Plan:

    DX-Direct:
    PASS: Power down controller-1 and verify that no services
          are stopped.
    PASS: Simulate a network outage (e.g., oam, mgmt, cluster_host,
          etc.) and verify which services are stopped.
    PASS: Start and stop services (osd, mon, mds)
          using ceph-init-wrapper.

    DX:
    PASS: Power down controller-1 and verify that no services
          are stopped.
    PASS: Simulate a network outage (e.g., oam, mgmt, cluster_host,
          etc.) and verify which services are stopped.
    PASS: Start and stop services (osd, mon, mds)
          using ceph-init-wrapper.

Closes-bug: 2125670

Change-Id: Ia8a056f45f64a2ecaf2eeb018abd0d85f4a3f9db
Signed-off-by: Gabriel Przybysz Gonçalves Júnior <gabriel.przybyszgoncalvesjunior@windriver.com>
This commit is contained in:
Gabriel Przybysz Gonçalves Júnior
2025-09-23 15:47:40 -03:00
parent 4827557376
commit 084671b8a2
2 changed files with 64 additions and 25 deletions

View File

@@ -99,6 +99,28 @@ else
IFS=" " read -r -a args <<< "$@"
fi
# Log Management
# Adding PID and PPID informations
log () {
local name=""
local log_level="$1"
# Checking if the first parameter is not a log level
if grep -q -v ${log_level} <<< "INFO DEBUG WARN ERROR"; then
name=" ($1)";
log_level="$2"
shift
fi
shift
local message="$@"
# prefix = <pid_subshell> <ppid_name>[<ppid>] <name|optional>
local prefix="${BASHPID} $(cat /proc/${PPID}/comm)[${PPID}]${name}"
# yyyy-MM-dd HH:mm:ss.SSSSSS /etc/init.d/ceph-init-wrapper <prefix> <log_level>: <message>
wlog "${prefix}" "${log_level}" "${message}"
return 0
}
is_ppid_sm()
{
local ppid_name
@@ -156,28 +178,6 @@ if is_ppid_sm && [ "${SM_CEPH_OSD_CURRENT_STATE}" != "${STATE_RUNNING}" ]; then
fi
fi
# Log Management
# Adding PID and PPID informations
log () {
local name=""
local log_level="$1"
# Checking if the first parameter is not a log level
if grep -q -v ${log_level} <<< "INFO DEBUG WARN ERROR"; then
name=" ($1)";
log_level="$2"
shift
fi
shift
local message="$@"
# prefix = <pid_subshell> <ppid_name>[<ppid>] <name|optional>
local prefix="${BASHPID} $(cat /proc/${PPID}/comm)[${PPID}]${name}"
# yyyy-MM-dd HH:mm:ss.SSSSSS /etc/init.d/ceph-init-wrapper <prefix> <log_level>: <message>
wlog "${prefix}" "${log_level}" "${message}"
return 0
}
# Identify the ceph network interface from /etc/platform/platform.conf file
# The network interface will be set to the 'ceph_network_interface' variable
# Return 0 if found the variable, and 1 if not.

View File

@@ -118,6 +118,29 @@ has_ceph_network_carrier()
return 0
}
# Verify if oam, cluster host and mgmt networks have carrier.
# This is a special condition for AIO-DX Direct setup.
# If all networks have no carrier, then the other host is down.
# When the other host is down, ceph must start on this host.
# Return 0 if no carrier is detected on all network interfaces.
# Return 1 of carrier has been detected in at lease one network interface.
has_all_network_no_carrier()
{
ip link show "${oam_interface}" | grep NO-CARRIER
local oam_carrier=$?
ip link show "${cluster_host_interface}" | grep NO-CARRIER
local cluster_host_carrier=$?
ip link show "${management_interface}" | grep NO-CARRIER
local mgmt_carrier=$?
# Check if all networks have no carrier, meaning the other host is down
if [ "${oam_carrier}" -eq 0 ] && [ "${cluster_host_carrier}" -eq 0 ] && [ "${mgmt_carrier}" -eq 0 ]; then
log INFO "No carrier detected from all network interfaces"
return 0
fi
return 1
}
status()
{
has_ceph_network_carrier
@@ -127,9 +150,25 @@ status()
# Service is "running" and has carrier.
RETVAL=0
else
# Force stop services only if carrier is not detected.
[ ${HAS_CARRIER} -ne 0 ] && stop
RETVAL=1
if [ ${HAS_CARRIER} -ne 0 ]; then
if [ "${system_mode}" == "duplex-direct" ]; then
has_all_network_no_carrier
if [ $? -eq 0 ]; then
log INFO "All network interfaces are not functional, considering the other host is down. Keep Ceph running."
RETVAL=0
else
log INFO "Ceph network interface is not functional in duplex-direct, stopping Ceph."
stop
RETVAL=1
fi
else
log INFO "Ceph network interface is not functional, stopping Ceph."
stop
RETVAL=1
fi
else
RETVAL=1
fi
fi
# NOTE: The Status return is only used in the Start method to validate that there