kernel/kernel-rt/debian/patches/0013-Revert-sched-idle-Move-quiet_vmstate-into-the-NOHZ-c.patch
Li Zhou 79e7b8794c kernel-rt: upgrade to 6.6.7
Update kernel source to 6.6.7 from linux-yocto upstream.
Update "debian" folder source to 6.1.27-1~bpo11+1 from debian upstream,
because kernel 6.6.7 is ported to our bullseye platform now and
the newest "debian" folder from debian upstream for bullseye platform
is for 6.1.

Add an optimization for the StarlingX debian kernel building framework:
We used to maintain the kernel with patches against the "debian" folder,
kept in the kernel/kernel-std(rt)/debian/deb_patches dir.
Most of these patches are about "changelog" (debian/changelog) and
"config" (debian/config/amd64/none/config), and the number of patches in
the "deb_patches" dir increased rapidly.
From now on, we will put "changelog" and "config" in the dir
kernel/kernel-std(rt)/debian/source and use them to replace
debian/changelog and debian/config/amd64/none/config
after the upstream "debian" folder is extracted. This not only
keeps the "deb_patches" folder clean, but also avoids using a big patch
to remove the "changelog" file in the upstream "debian" folder before
any kernel build.
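
The replacement step described above can be sketched roughly as follows. This is only an illustration under assumptions: the function name `overlay_debian_sources` and the `OVERRIDES` mapping are hypothetical, and the actual StarlingX build tooling that performs this step is not shown in this commit message.

```python
import shutil
from pathlib import Path

# Hypothetical mapping: file kept under the "source" dir -> path inside the
# extracted upstream "debian" folder that it replaces.
OVERRIDES = {
    "changelog": "changelog",
    "config": "config/amd64/none/config",
}


def overlay_debian_sources(source_dir, debian_dir):
    """Overwrite files in the extracted debian folder with the maintained copies.

    Returns the list of destination paths that were written.
    """
    replaced = []
    for name, rel_target in OVERRIDES.items():
        src = Path(source_dir) / name
        dst = Path(debian_dir) / rel_target
        # Create the target directory in case upstream did not ship it.
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
        replaced.append(str(dst))
    return replaced
```

Called once after the upstream "debian" folder is extracted, this keeps changelog and config edits as plain files instead of a growing series of patches.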

Below are changes about "deb_patches"/"patches" for kernel-rt:
(We use the patches' serial numbers in their names to represent them
because so many patches are involved here.)

(1)about "deb_patches" folder:

(1.1)Because of the optimization, all the patches about changelog
and config for 5.10 can be abandoned and they will be changed directly
in the files under "source" dir for 6.6.
Patches for 5.10 that are abandoned because they are about config:
0003/0005/0006/0007/0008/0010/0011/0016/0018/0022/0026/0028
Patches for 5.10 that are abandoned because they are about changelog:
0001/0002/0007/0013/0020/0023/0024/0025/0027/0029/0030/0032/0033/0034

The "changelog" and "config" under "source" dir for 6.6 are verified
to be aligned with those for 5.10 build.
CONFIG_FANOTIFY is enabled in "config" as a new request.

(1.2)Patch 0017 for 5.10 is abandoned because the commit
<Use parallel XZ for source tar generation> is available in the new
version of the "debian" folder and does the same work.
Refer to: https://salsa.debian.org/kernel-team/linux/-/commit/
50b61a14e6dbc50b19dfe938c4679ecda50b83ee

(1.3)Below patches for 5.10 are ported to 6.6:
0004/0009/0015 (0009/0015 are merged into 0004) compose patch 0001
for 6.6;
0014 is ported to 0002 for 6.6;
0021 is ported to 0003 for 6.6;
0005/0019 (0005 is merged into 0019) compose patch 0004 for 6.6;
0012 is ported to 0005 for 6.6;
0031 is ported to 0006 for 6.6.

List the new patches for 6.6:
New patches 0001-0006 are ported from "deb_patches" for 5.10;
New patches 0007-0010 are added for building kernel 6.6.7 with
6.1.27-1~bpo11+1 "debian" folder.

(2)about "patches" folder:

(2.1)Patches for 5.10 that are abandoned because they are already in
6.6.7 include:
0017-0027/0031-0038/0041-0056/0058-0071/0073-0081/0083

(2.2)Patch 0011 for 5.10 is abandoned for new upstream commit
<scsi: smartpqi: Expose SAS address for SATA drives> in 6.6.
Refer to: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/
linux.git/commit/?id=00598b056aa6d46c7a6819efa850ec9d0d690d76
The new upstream commit has done what 0011 does.

(2.3)Patch 0039 for 5.10 is abandoned for new upstream commit
<samples/bpf: replace broken overhead microbenchmark with
fib_table_lookup> in 6.6.
Refer to: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/
linux.git/commit/?id=58e975d014e1e31bd1586be7d2be6d61bd3e3ada
0039 isn't needed any more with the new commit merged.

(2.4)Patch 0030 for 5.10 is abandoned because the related code has
been changed in 6.6 and the issue was verified to disappear.

(2.5)Patch 0010 for 5.10 is abandoned, and the issue will be fixed by
setting /config/target/iscsi/cpus_allowed_list to the same value as the
kernel parameter "kthread_cpus". This is possible because the new patch
<scsi: target: Add iscsi/cpus_allowed_list in configfs>
adds iscsi/cpus_allowed_list in configfs. The available CPU set of
iSCSI connection RX/TX threads is allowed_cpus & online_cpus.
This does the same thing as patch 0010 so long as we set
cpus_allowed_list properly.
Refer to: <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/
linux.git/commit/?id=d72d827f2f2636d8d72f0f3ebe5b661c9a24d343>
This issue will be addressed by later patches on stx framework part.
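
As a rough illustration of the "allowed_cpus & online_cpus" rule in (2.5), the sketch below extracts a "kthread_cpus" value from a kernel command line string and intersects two CPU lists. It assumes that "kthread_cpus" uses the usual kernel CPU list syntax (for example "0-1,4"); all function names are hypothetical, and stride forms such as "0-7:2" are not handled.

```python
def parse_cpulist(text):
    """Parse a kernel-style CPU list such as "0-3,8" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def kthread_cpus_from_cmdline(cmdline):
    """Return the raw kthread_cpus= value from a command line, or None."""
    for token in cmdline.split():
        if token.startswith("kthread_cpus="):
            return token.split("=", 1)[1]
    return None


def effective_iscsi_cpus(allowed, online):
    """CPUs usable by iSCSI RX/TX threads: allowed_cpus & online_cpus."""
    return parse_cpulist(allowed) & parse_cpulist(online)
```

For example, with "kthread_cpus=0-1" on the command line and CPUs 0-7 online, writing "0-1" to cpus_allowed_list would confine the iSCSI threads to CPUs 0 and 1.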

(2.6)Patch 0015-0016 are abandoned because the issue has been fixed
from the user space side by using the stable /dev/disk/by-path/...
symbolic links instead of names like /dev/sda that can change
(confirmed by M. Vefa Bicakci).

(2.7)Below patches for 5.10 are ported to 6.6:
0001-0009/0012/0028-0029/0040/0057/0082

(3)about kernel config:
(3.1) Enable CONFIG_GNSS for the ice driver.

Test plan:
 The out-of-tree (OOT) kernel modules for 6.6 aren't ready yet,
 so many tests can't be done because the related test environments
 need those OOT drivers. Listed here are the tests which have been done
 with a test patch that temporarily removes the OOT drivers from the ISO.
 Two workaround patches were also used to solve 2 issues met when
 installing the lab in the jenkins job.
 PASS: Build linux/linux-rt OK.
 PASS: Build ISO OK.
 PASS: Install and boot up OK on an AIO-SX lab with std/rt kernel.
 PASS: The 12 hours cyclictest result for rt kernel is:
       samples:  259199998	avg:   1658	max:   5455
       99.9999th percentile: 3725	overflows: 0

Story: 2011000
Task: 49365

Signed-off-by: Li Zhou <li.zhou@windriver.com>
Change-Id: I6601fd2d7be4fc314ef2bc03b46f851eabebe3ea
(cherry picked from commit 06f53ed8e2)
Signed-off-by: Jiping Ma <jiping.ma2@windriver.com>
2024-07-09 23:59:50 +00:00

From ed051d788e0f7d177bec80d7b594e7b889b975bd Mon Sep 17 00:00:00 2001
From: "M. Vefa Bicakci" <vefa.bicakci@windriver.com>
Date: Wed, 4 Jan 2023 20:36:53 -0500
Subject: [PATCH] Revert "sched/idle: Move quiet_vmstate() into the NOHZ code"
This reverts commit 62cb1188ed86a9cf082fd2f757d4dd9b54741f24.
We received a bug report indicating that the "Dirty" field in
/proc/meminfo was increasing without bounds, to the point that the
number of dirty file pages would eventually reach what is enforced by
the vm.dirty_bytes threshold (which is set to 800_000_000 bytes in
StarlingX) and cause any task attempting to carry out disk I/O to get
blocked.
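
A minimal sketch of how the symptom above could be watched for from user space: it parses the "Dirty" field out of /proc/meminfo text (reported in kB, i.e. units of 1024 bytes) and compares it with the 800_000_000-byte vm.dirty_bytes value mentioned above. The function names and the 90% warning fraction are illustrative assumptions, not part of the bug report.

```python
def dirty_bytes_from_meminfo(meminfo_text):
    """Return the Dirty value from /proc/meminfo content, in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("Dirty:"):
            # Line format: "Dirty:   <number> kB"
            fields = line.split()
            return int(fields[1]) * 1024
    raise ValueError("no Dirty field found")


def near_dirty_limit(meminfo_text, dirty_bytes_limit=800_000_000, fraction=0.9):
    """True when Dirty has reached the given fraction of the limit."""
    return dirty_bytes_from_meminfo(meminfo_text) >= fraction * dirty_bytes_limit
```

On a live system the text would come from reading /proc/meminfo; sampling it periodically shows whether "Dirty" keeps growing toward the threshold.
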
Upon further debugging, we noticed that this issue occurred on nohz_full
CPUs where a user application was carrying out disk I/O by writing to
and rotating log files, with a usleep(0) call between every log file
rotation. The issue was reproducible with the preempt-rt patch set very
reliably.
The reverted commit moved the quiet_vmstat() call from the entry of the
idle loop (do_idle function) to the tick_nohz_stop_tick function.
However, the tick_nohz_stop_tick function is called very frequently from
hard IRQ context via the following call chain:
  irq_exit_rcu
    tick_irq_exit (has a condition to check for nohz_full CPUs)
      tick_nohz_irq_exit
        tick_nohz_full_update_tick
          tick_nohz_stop_sched_tick
            tick_nohz_stop_tick
              quiet_vmstat
The check for nohz_full CPUs in tick_irq_exit() explains why this issue
occurred with nohz_full CPUs more reliably.
Calling quiet_vmstat from hard IRQ context is problematic.
quiet_vmstat() makes the following calls to update vm_node_stat as well
as other statistics such as vm_zone_stat and vm_numa_stat. Recall that
an element in the vm_node_stat array tracks the number of dirty file
pages:
  quiet_vmstat
    refresh_cpu_vm_stats
      fold_diff (Updates vm_node_stat and other statistics)
However, __mod_node_page_state() (and fellow functions) also update
vm_node_stat, and although it is called with interrupts disabled in most
contexts (via spin_lock_irqsave), there are instances where it is called
with interrupts enabled (as evidenced by instrumenting the function with
counters that count the number of times the function was called with and
without interrupts disabled). Also, the fact that __mod_node_page_state
and its sibling function __mod_zone_page_state should not be called with
interrupts enabled is evidenced by the following comment in mm/vmstat.c
above __mod_zone_page_state():
  For use when we know that interrupts are disabled, or when we know
  that preemption is disabled and that particular counter cannot be
  updated from interrupt context.
Furthermore, recall that the preempt-rt patch set makes most spinlocks
sleeping locks and changes the implementation of spin_lock_irqsave in
such a way that IRQs are *not* disabled by spin_lock_irqsave. With the
preempt-rt patch set, this corresponds to a significant increase in the
number of calls to __mod_node_page_state() with interrupts *enabled*.
This in turn significantly increases the possibility of incorrectly
modifying global statistics variables such as the ones in the
vm_node_stat array.
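
The kind of corruption described above can be illustrated with a deliberately simplified, single-threaded model. It is an assumption-laden sketch, not kernel code: `NodeStat`, `mod`, `fold` and `run` are hypothetical stand-ins for a per-CPU delta, __mod_node_page_state(), the fold done under quiet_vmstat(), and a test driver. The real functions also apply thresholds, which are omitted here.

```python
class NodeStat:
    """Toy model of one statistic: a global counter plus one per-CPU delta."""

    def __init__(self):
        self.global_count = 0  # stands in for one vm_node_stat element
        self.cpu_diff = 0      # stands in for this CPU's pending delta

    def fold(self):
        """Move the per-CPU delta into the global counter and clear it."""
        self.global_count += self.cpu_diff
        self.cpu_diff = 0

    def mod(self, delta, irq_hook=None):
        """Non-atomic read-modify-write of the per-CPU delta.

        irq_hook, if given, runs between the read and the write, modelling
        a hard IRQ that arrives while interrupts are enabled.
        """
        new_value = self.cpu_diff + delta  # read
        if irq_hook is not None:
            irq_hook()                     # interrupt fires here
        self.cpu_diff = new_value          # write back a stale value

    def total(self):
        return self.global_count + self.cpu_diff


def run(interrupted):
    """Account 5 pages, then 1 more; optionally fold in the middle of the update."""
    stat = NodeStat()
    stat.mod(5)
    stat.mod(1, irq_hook=stat.fold if interrupted else None)
    return stat.total()
```

In this model six pages are accounted in both cases, but when the fold lands between the read and the write, the stale delta is written back after it was already folded, so the earlier five pages are counted twice and the total drifts upward.
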
To avoid this issue, we revert commit 62cb1188ed86 ("sched/idle: Move
quiet_vmstate() into the NOHZ code") and therefore move the quiet_vmstat
call back into the idle loop's entry point, where it is *not* called
from a hard IRQ context. With this revert applied, the issue is no
longer reproducible.
I would like to acknowledge the extensive help and guidance provided by
Jim Somerville <jim.somerville@windriver.com> during the debugging and
investigation of this issue.
Signed-off-by: M. Vefa Bicakci <vefa.bicakci@windriver.com>
---
kernel/sched/idle.c | 1 +
kernel/time/tick-sched.c | 2 --
2 files changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f26ab2675..9298330c5 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -274,6 +274,7 @@ static void do_idle(void)
*/
__current_set_polling();
+ quiet_vmstat();
tick_nohz_idle_enter();
while (!need_resched()) {
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 1ad89eec2..468e756f1 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -25,7 +25,6 @@
#include <linux/irq_work.h>
#include <linux/posix-timers.h>
#include <linux/context_tracking.h>
-#include <linux/mm.h>
#include <asm/irq_regs.h>
@@ -932,7 +931,6 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
*/
if (!ts->tick_stopped) {
calc_load_nohz_start();
- quiet_vmstat();
ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
ts->tick_stopped = 1;
--
2.17.1