kernel/kernel-rt/debian/patches/0070-perf-x86-rapl-Fix-psys-energy-event-on-Intel-SPR-pla.patch
M. Vefa Bicakci 0ee7813d66 kernel: Improve Sapphire Rapids CPU support
This commit cherry-picks commits from the mainline kernel to improve
Sapphire Rapids CPU support in the following components of the StarlingX
kernel: intel_idle, perf/x86/RAPL and powercap, and perf/x86/cstate.
(RAPL stands for "Running Average Power Limit", which is a CPU feature
for measuring and limiting power consumption.)

These improvements are required to support a new power metrics
application in StarlingX, which is intended to work with Sapphire Rapids
CPUs: https://opendev.org/starlingx/app-power-metrics

The following commits are cherry-picked as part of this effort, in
chronological order, organized by component:

=> intel_idle
* commit 9edf3c0ffef0
  ("intel_idle: add SPR support")
  (v5.18-rc1~203^2~3^3~5)
* commit da0e58c038e6
  ("intel_idle: add 'preferred_cstates' module argument")
  (v5.18-rc1~203^2~3^3~4)
* commit 3a9cf77b60dc
  ("intel_idle: add core C6 optimization for SPR")
  (v5.18-rc1~203^2~3^3~3)
* commit 39c184a6a9a7
  ("intel_idle: Fix the 'preferred_cstates' module parameter")
  (v5.18-rc5~22^2^2~1)
* commit 7eac3bd38d18
  ("intel_idle: Fix SPR C6 optimization")
  (v5.18-rc5~22^2^2)
* commit 1548fac47a11
  ("intel_idle: make SPR C1 and C1E be independent")
  (v6.0-rc1~184^2~2^2^2)

=> perf/x86/rapl + powercap
* commit ffb20c2e52e8
  ("perf/x86/rapl: Add msr mask support")
  (v5.12-rc1~146^2~3)
* commit b6f78d3fba7f
  ("perf/x86/rapl: Only check lower 32bits for RAPL energy counters")
  (v5.12-rc1~146^2~2)
* commit 838342a6d6b7
  ("perf/x86/rapl: Fix psys-energy event on Intel SPR platform")
  (v5.12-rc1~146^2~1)
* commit 931da6a0de5d
  ("powercap: intel_rapl: support new layout of Psys PowerLimit Register
    on SPR")
  (v5.17-rc1~167^2^4^2~1)
* commit 80275ca9e525
  ("perf/x86/rapl: Use standard Energy Unit for SPR Dram RAPL domain")
  (v6.1-rc4~3^2~3)

=> perf/x86/cstate
* commit 87bf399f86ec
  ("perf/x86/cstate: Add ICELAKE_X and ICELAKE_D support")
  (v5.14-rc1~7^2~1)
* commit 528c9f1daf20
  ("perf/x86/cstate: Add SAPPHIRERAPIDS_X CPU support")
  (v5.18-rc4~3^2)

The commits listed above are a subset of a slightly larger set of
commits that we had originally considered for cherry-picking. We opted
for this subset to limit the potential impact on the StarlingX kernel by
focusing on Sapphire Rapids support and its direct dependencies only.

We should note that we encountered a number of merge conflicts while
cherry-picking these commits; however, none of the merge conflict
resolutions required significantly altering the modifications made by
the original commits. The individual patch files denote the nature of
the merge conflicts.

Verification:

* The kernel recipes and all kernel modules were built from scratch with
  this commit, using the following command in a StarlingX build
  environment:

  $ build-pkgs -c -p linux,linux-rt,bnxt-en,i40e,i40e-cvl-2.54,\
  i40e-cvl-4.10,iavf,iavf-cvl-2.54,iavf-cvl-4.10,ice,ice-cvl-2.54,\
  ice-cvl-4.10,igb-uio,iqvlinux,kmod-opae-fpga-driver,mlnx-ofed-kernel,\
  octeon-ep,qat1.7.l

  These packages were further packaged into a StarlingX (ostree) patch
  for easier deployment.

* An Ansible-bootstrapped low-latency All-in-One simplex StarlingX
  set-up was prepared on a server with a Sapphire Rapids CPU.

* The ostree patch was installed onto the server to start testing our
  changes. The kernel was confirmed to boot up as expected.

* We enabled RAPL Psys domain reporting in the server's BIOS (originally
  disabled), and we also disabled the BIOS-enforced limit on the CPU
  *package* C-states (originally set to C0/C1).

* We forcibly removed the "intel_idle.max_cstate=0" kernel command line
  argument by modifying the sysinv daemon's Python source code on the
  server (with a systemd service that bind-mounts a replacement *.py
  file, to avoid another ostree patch). This was required to prevent the
  intel_idle driver from disabling itself, so that we could confirm the
  sanity of the cherry-picked commits.

* The following tests were carried out, first with the patched
  preempt-rt kernel, and next with the original unpatched preempt-rt
  kernel:

  * Confirm that the intel_idle CPU idling driver is active:
    $ cat /sys/devices/system/cpu/cpuidle/current_driver
  * Confirm the CPU idling state names and parameters:
    $ grep -s '^' \
    /sys/devices/system/cpu/cpu0/cpuidle/state[0-9]*/\
    {name,desc,time,latency,residency}
  * Confirm that the RAPL/powercap and C-state related performance
    monitor unit (PMU) counters are usable by the kernel and with perf:
    $ sudo perf list
  * Confirm that the CPU and package C-state residency counters are
    working:
    $ perf stat -a \
      -e cstate_core/c1-residency/ -e cstate_core/c6-residency/ \
      -e cstate_pkg/c2-residency/ -e cstate_pkg/c6-residency/ \
      -- sleep 5
  * Confirm that RAPL/powercap-related performance counters are working:
    $ perf stat -a \
      -e power/energy-pkg/ -e power/energy-ram/ -e power/energy-psys/ \
      -- sleep 5

  With the unpatched kernel, we observed that the intel_idle driver used
  CPU idling information exposed by the ACPI tables, with the following
  idle state names: POLL, C1_ACPI, C2_ACPI. With the patched kernel the
  C-state tables embedded in the intel_idle driver were used as
  expected, with the following idle state names: POLL, C1, C1E, C6.

  With the unpatched kernel, we observed that the CPU/package C-state
  residency counters were not detected, whereas they were detected with
  the patched kernel, as expected.

  With both the unpatched and the patched kernels, the RAPL/powercap
  related performance counters were detected. We observed that the units
  for the DRAM domain were incorrect for the unpatched kernel, which was
  expected due to the lack of commit 80275ca9e525 ("perf/x86/rapl: Use
  standard Energy Unit for SPR Dram RAPL domain").
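
The DRAM unit mismatch can be illustrated with a little arithmetic. A raw
RAPL energy count is converted to joules by dividing it by 2^exponent, where
the exponent is the energy unit in use. The sketch below assumes, based on our
reading of the upstream change rather than anything stated in this commit
message, that the unpatched kernel applied a fixed 2^-16 J unit to the SPR
DRAM domain while the hardware reports a different unit. The function name
rapl_raw_to_joules and the exponent value 14 are hypothetical and chosen only
for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of this patch set): convert a raw RAPL
# energy count to joules. One count equals 1 / 2^exponent joules.
rapl_raw_to_joules() {
    raw=$1
    exponent=$2
    awk -v raw="$raw" -v e="$exponent" 'BEGIN { printf "%.6f\n", raw / (2 ^ e) }'
}

# The same raw count interpreted with two different units.
# Exponent 16 is the assumed fixed DRAM unit; exponent 14 is an assumed
# example of a unit reported by the hardware. Real values may differ.
rapl_raw_to_joules 65536 16
rapl_raw_to_joules 65536 14
```

With these assumed exponents, the same raw count yields results that differ by
a factor of four, which is the kind of discrepancy a wrong unit would produce
in the power/energy-ram/ readings.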

* To confirm the sanity of our results acquired with the patched kernel
  in the previous step, we also carried out the following experiment
  with the v6.4.3-rt6 kernel available in the linux-yocto repository as
  commit 917d160a84f6 ("Merge branch 'v6.4/standard/base' into
  v6.4/standard/preempt-rt/base") in the "v6.4/standard/preempt-rt/base"
  branch.

  The "notification of death" StarlingX kernel patch was forward-ported
  to the v6.4.3-rt6 kernel and the "kernel.sched_nr_migrate" sysctl was
  reintroduced to make this kernel work with the aforementioned
  Ansible-bootstrapped StarlingX system.

  Furthermore, to ensure that the RAPL/powercap features are aligned to
  the most recent mainline kernel version, we cherry-picked the
  following commits from v6.5-rc1 onto the v6.4.3-rt6 kernel:
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?qt=range&q=44c026a73be8..49776c712eb6

  Afterwards, this v6.4.3-rt6-based test kernel was built and installed
  onto the test server, and test procedures discussed in the previous
  step were repeated.

  Compared to the patched StarlingX v5.10 kernel, we observed that the
  RAPL/powercap measurements were similar, and the CPU and package
  C-state residency counters were not extremely different with the
  v6.4.3-rt6-based test kernel.

* We should note that we also repeated the tests with the standard
  (non-low-latency) build of the patched StarlingX v5.10 kernel, but
  without reinstalling the system as a standard set-up. Instead, we ran
  the following command, rebooted the system into the standard kernel,
  and repeated the test procedures, with similar results.

  sudo grub-editenv /boot/1/kernel.env set kernel=vmlinuz-5.10.0-6-amd64
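
The idle state comparison described in the verification steps (POLL, C1_ACPI,
C2_ACPI versus POLL, C1, C1E, C6) could be captured with a small helper when
switching between kernels. This is a minimal sketch and not something used in
the original testing. The function name list_idle_states is hypothetical, and
the optional directory argument exists only so that the helper can be pointed
at a copy of the sysfs tree.

```shell
#!/bin/sh
# Hypothetical helper: print "<state directory> <state name>" for every
# idle state found under a cpuidle directory (cpu0's sysfs directory by
# default).
list_idle_states() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for state in "$dir"/state[0-9]*; do
        # Skip entries without a readable name file. This also covers the
        # case where the glob did not match anything.
        [ -r "$state/name" ] || continue
        printf '%s %s\n' "$(basename "$state")" "$(cat "$state/name")"
    done
}

list_idle_states
```

Saving the output once per kernel and comparing the two files with diff would
show whether the ACPI-provided or the built-in intel_idle tables are in use.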

Acknowledgements:

* Thanks to Alyson Deives Pereira for his extensive help in pruning the
  commits that we had originally thought of cherry-picking with this
  commit.

* Thanks to Mark Asselstine for his advice on the second phase of the
  commit pruning activity.

Story: 2010773
Task: 48449

Change-Id: Ibe6bff65e8a415ac027a5d493a0e65fe58c9e344
Signed-off-by: M. Vefa Bicakci <vefa.bicakci@windriver.com>
2023-07-24 13:44:38 +00:00

From f7d7c1c60866dc2d4e7c79f10a520bbbccfd7ceb Mon Sep 17 00:00:00 2001
From: Zhang Rui <rui.zhang@intel.com>
Date: Fri, 5 Feb 2021 00:18:16 +0800
Subject: [PATCH] perf/x86/rapl: Fix psys-energy event on Intel SPR platform

There are several things special for the RAPL Psys energy counter, on
Intel Sapphire Rapids platform.
1. It contains one Psys master package, and only CPUs on the master
   package can read valid value of the Psys energy counter, reading the
   MSR on CPUs in the slave package returns 0.
2. The master package does not have to be Physical package 0. And when
   all the CPUs on the Psys master package are offlined, we lose the Psys
   energy counter, at runtime.
3. The Psys energy counter can be disabled by BIOS, while all the other
   energy counters are not affected.

It is not easy to handle all of these in the current RAPL PMU design
because
a) perf_msr_probe() validates the MSR on some random CPU, which may either
   be in the Psys master package or in the Psys slave package.
b) all the RAPL events share the same PMU, and there is no API to remove
   the psys-energy event cleanly, without affecting the other events in
   the same PMU.

This patch addresses the problems in a simple way.

First, by setting .no_check bit for RAPL Psys MSR, the psys-energy event
is always added, so we don't have to check the Psys ENERGY_STATUS MSR on
master package.

Then, by removing rapl_not_visible(), the psys-energy event is always
available in sysfs. This does not affect the previous code because, for
the RAPL MSRs with .no_check cleared, the .is_visible() callback is always
overridden in the perf_msr_probe() function.

Note, although RAPL PMU is die-based, and the Psys energy counter MSR on
Intel SPR is package scope, this is not a problem because there is only
one die in each package on SPR.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/20210204161816.12649-3-rui.zhang@intel.com
(cherry picked from commit 838342a6d6b7ecc475dc052d4a405c4ffb3ad1b5)
Signed-off-by: M. Vefa Bicakci <vefa.bicakci@windriver.com>
---
 arch/x86/events/rapl.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/arch/x86/events/rapl.c b/arch/x86/events/rapl.c
index 7ed25b2ba05f..f42a70496a24 100644
--- a/arch/x86/events/rapl.c
+++ b/arch/x86/events/rapl.c
@@ -454,16 +454,9 @@ static struct attribute *rapl_events_cores[] = {
 	NULL,
 };
 
-static umode_t
-rapl_not_visible(struct kobject *kobj, struct attribute *attr, int i)
-{
-	return 0;
-}
-
 static struct attribute_group rapl_events_cores_group = {
 	.name = "events",
 	.attrs = rapl_events_cores,
-	.is_visible = rapl_not_visible,
 };
 
 static struct attribute *rapl_events_pkg[] = {
@@ -476,7 +469,6 @@ static struct attribute *rapl_events_pkg[] = {
 static struct attribute_group rapl_events_pkg_group = {
 	.name = "events",
 	.attrs = rapl_events_pkg,
-	.is_visible = rapl_not_visible,
 };
 
 static struct attribute *rapl_events_ram[] = {
@@ -489,7 +481,6 @@ static struct attribute *rapl_events_ram[] = {
 static struct attribute_group rapl_events_ram_group = {
 	.name = "events",
 	.attrs = rapl_events_ram,
-	.is_visible = rapl_not_visible,
 };
 
 static struct attribute *rapl_events_gpu[] = {
@@ -502,7 +493,6 @@ static struct attribute *rapl_events_gpu[] = {
 static struct attribute_group rapl_events_gpu_group = {
 	.name = "events",
 	.attrs = rapl_events_gpu,
-	.is_visible = rapl_not_visible,
 };
 
 static struct attribute *rapl_events_psys[] = {
@@ -515,7 +505,6 @@ static struct attribute *rapl_events_psys[] = {
 static struct attribute_group rapl_events_psys_group = {
 	.name = "events",
 	.attrs = rapl_events_psys,
-	.is_visible = rapl_not_visible,
 };
 
 static bool test_msr(int idx, void *data)
@@ -534,6 +523,14 @@ static struct perf_msr intel_rapl_msrs[] = {
 	[PERF_RAPL_PSYS] = { MSR_PLATFORM_ENERGY_STATUS, &rapl_events_psys_group, test_msr, false, RAPL_MSR_MASK },
 };
 
+static struct perf_msr intel_rapl_spr_msrs[] = {
+	[PERF_RAPL_PP0] = { MSR_PP0_ENERGY_STATUS, &rapl_events_cores_group, test_msr, false, RAPL_MSR_MASK },
+	[PERF_RAPL_PKG] = { MSR_PKG_ENERGY_STATUS, &rapl_events_pkg_group, test_msr, false, RAPL_MSR_MASK },
+	[PERF_RAPL_RAM] = { MSR_DRAM_ENERGY_STATUS, &rapl_events_ram_group, test_msr, false, RAPL_MSR_MASK },
+	[PERF_RAPL_PP1] = { MSR_PP1_ENERGY_STATUS, &rapl_events_gpu_group, test_msr, false, RAPL_MSR_MASK },
+	[PERF_RAPL_PSYS] = { MSR_PLATFORM_ENERGY_STATUS, &rapl_events_psys_group, test_msr, true, RAPL_MSR_MASK },
+};
+
 /*
  * Force to PERF_RAPL_MAX size due to:
  * - perf_msr_probe(PERF_RAPL_MAX)
@@ -764,7 +761,7 @@ static struct rapl_model model_spr = {
 			  BIT(PERF_RAPL_PSYS),
 	.unit_quirk = RAPL_UNIT_QUIRK_INTEL_SPR,
 	.msr_power_unit = MSR_RAPL_POWER_UNIT,
-	.rapl_msrs = intel_rapl_msrs,
+	.rapl_msrs = intel_rapl_spr_msrs,
 };
 
 static struct rapl_model model_amd_fam17h = {