[24.04_linux-nvidia-6.17-next] linux-nvidia-6.17: SMT-aware asymmetric CPU capacity idle selection by arighi · Pull Request #395 · NVIDIA/NV-Kernels

arighi · 2026-04-28T06:24:44Z

On Vera Rubin, the firmware exposes CPUs with different capacities through ACPI/CPPC. Unlike Grace systems, Vera Rubin also supports SMT. As a result, the Linux scheduler enables the asymmetric CPU capacity idle selection policy, but the current implementation is not SMT-aware. This can lead to suboptimal task placement, where tasks are scheduled on both SMT siblings of the same core even when fully idle SMT cores are available elsewhere in the system.

In CPU-intensive workloads, this behavior can significantly reduce performance, with slowdowns of up to 2x observed in some cases.

This series is a backport of the upstream patch series available at:
https://lore.kernel.org/all/20260428051720.3180182-1-arighi@nvidia.com

NOTE: the original series includes additional patches that are not needed in linux-nvidia-6.17:

PATCH 1/6 is a refactoring that is valid only in kernel >= 7.0, because it requires 71fedc41c23b ("sched/fair: Switch to rcu_dereference_all()"), which is not worth backporting.
PATCH 6/6 is incorrect and will be dropped, so it is not backported.

The series is currently under review on the mailing list, but consensus has been reached with the scheduler maintainers and the changes are expected to be merged for v7.2.

Given the potential impact on Vera Rubin performance, it seems reasonable to backport and apply these patches to the linux-nvidia kernel and carry them as NVIDIA SAUCE for now, until the upstream solution becomes available.

The patch series has been validated on both Vera and Grace, running DCPerf Mediawiki and benchblas (NVBLAS).

LP: https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-bos/+bug/2150671

github-actions · 2026-04-28T06:34:28Z

PR Validation Report

Patchscan ✅ No Missing Fixes

All cherry-picked commits checked — no missing upstream fixes found.

PR Lint ❌ Errors found

Details

Checking 4 commits...

Cherry-pick digest:
E: 9a87c418e16e ("NVIDIA: SAUCE: sched/fair: Attach sched_"): diff MISMATCH with lore patch (add [Author: reason] annotation if intentional)
┌──────────────┬───────────────────────────────────────────────┬────────────┬─────────┬───────────────────────────┐
│ Local        │ Referenced upstream / Patch subject           │ Patch-ID   │ Subject │ SoB chain                 │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ b3bf56e2a412 │ [SAUCE] sched/fair: add sis_util support to s │ N/A        │ N/A     │ arighi, nayak, arighi     │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 842b1195ebce │ [SAUCE] sched/fair: reject misfit pulls onto  │ N/A        │ N/A     │ arighi, arighi            │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 20f85dc0e99e │ [SAUCE] sched/fair: prefer fully-idle smt cor │ N/A        │ N/A     │ arighi, arighi            │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 9a87c418e16e │ sched/fair: attach sched_domain_shared to sd_ │ MISMATCH   │ found   │ ok, backporter: arighi    │
└──────────────┴───────────────────────────────────────────────┴────────────┴─────────┴───────────────────────────┘

Lint: all checks passed.

clsotog · 2026-04-28T15:33:52Z

@arighi
The first commit
NVIDIA: SAUCE: sched/fair: Host has_idle_cores/nr_busy_cpus on sd_asym_cpucapacity
Which patch in the linked upstream series does this one correspond to? Is it patch 2: https://lore.kernel.org/all/bea8880a-7273-4257-b733-dcb3f1e28ed6@linux.ibm.com/?

jamieNguyenNVIDIA · 2026-04-28T15:55:41Z

The upstream v4 series has 6 patches but this PR only carries 4 (patches 2/6–5/6). Are the other two patches in the series -- [PATCH 1/6] "sched/fair: Use guard(rcu) for sched_domain RCU sections" and [PATCH 6/6] "sched/topology: Remove SMT/asym capacity warning" -- not needed here?

arighi · 2026-04-29T12:32:01Z

@clsotog oops I forgot to update the subject of the patch, and yes, it is PATCH 2/6. Fixed now. Thanks!

arighi · 2026-04-29T12:33:39Z

@jamieNguyenNVIDIA right, I should have mentioned in the PR description, sorry. The first patch and the last one are not needed. Updated the PR description. Thanks!

jamieNguyenNVIDIA · 2026-04-29T15:15:13Z

Acked-by: Jamie Nguyen <jamien@nvidia.com>

Be sure to send this to 26.04_linux-nvidia and 26.04_linux-nvidia-bos. Thanks!

clsotog · 2026-04-29T15:26:30Z

 	per_cpu(sd_llc_size, cpu) = size;
 	per_cpu(sd_llc_id, cpu) = id;
+
+	/* TODO: Rename sd_llc_shared to fit the new role. */


Do we still have a TODO here?

I was a bit conflicted on this one, because the patch that renames sd_llc_shared to sd_balance_shared includes other changes that shouldn't be backported to 6.17, so I left that comment there, but I guess we can remove it and just keep sd_llc_shared.

@clsotog thinking more about it, I just removed the TODO, considering that we're not backporting the patch that renames the struct, the TODO doesn't really add any useful information and it's just confusing. PR updated.

nvmochs · 2026-04-29T16:56:43Z

@arighi Couple of questions...

4ef9f0b NVIDIA: SAUCE: sched/fair: Add SIS_UTIL support to select_idle_capacity()
81d78e1 NVIDIA: SAUCE: sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection

Nit: These are not clean picks from LKML (looks like the delta is just in the comments). Are these from an older version of the series?

467bd61 NVIDIA: SAUCE: sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity

Similar: The commit message differs from the LKML version.

arighi · 2026-04-29T17:10:24Z

@nvmochs ah yes, these patches are based on an older series, I just updated the code without updating the comments. Better to keep them aligned. I've updated the PR, now they should match with the upstream patches. Thanks!

…apacity

On asymmetric CPU capacity systems, the wakeup path uses select_idle_capacity(), which scans the span of sd_asym_cpucapacity rather than sd_llc.

The has_idle_cores hint however lives on sd_llc->shared, so the wakeup-time read of has_idle_cores operates on an LLC-scoped blob while the actual scan/decision spans the wider asym domain; nr_busy_cpus also lives in the same shared sched_domain data, but it's never used in the asym CPU capacity scenario.

Therefore, move the sched_domain_shared object to sd_asym_cpucapacity whenever the CPU has a SD_ASYM_CPUCAPACITY_FULL ancestor and that ancestor is non-overlapping (i.e., not built from SD_NUMA). In that case the scope of has_idle_cores matches the scope of the wakeup scan.

Fall back to attaching the shared object to sd_llc in three cases:

1) plain symmetric systems (no SD_ASYM_CPUCAPACITY_FULL anywhere);

2) CPUs in an exclusive cpuset that carves out a symmetric capacity island: has_asym is system-wide but those CPUs have no SD_ASYM_CPUCAPACITY_FULL ancestor in their hierarchy and follow the symmetric LLC path in select_idle_sibling();

3) exotic topologies where SD_ASYM_CPUCAPACITY_FULL lands on an SD_NUMA-built domain. init_sched_domain_shared() keys the shared blob off cpumask_first(span), which on overlapping NUMA domains would alias unrelated spans onto the same blob. Keep the shared object on the LLC there; select_idle_capacity() gracefully skips the has_idle_cores preference when sd->shared is NULL.

While at it, also rename the per-CPU sd_llc_shared to sd_balance_shared, as it is no longer strictly tied to the LLC.

Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
(backported from https://lore.kernel.org/all/20260428051720.3180182-1-arighi@nvidia.com)
[ arighi:
  - backport full logic to attach sd->shared in build_sched_domains()
  - do not rename sd_llc_shared to reduce the risk of conflicts ]
Signed-off-by: Andrea Righi <arighi@nvidia.com>
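The attach rule described in this commit message can be summarized as a small decision function. The following is only a user-space sketch of that rule, not the code in build_sched_domains(): the enum, the function name, and the two boolean inputs are hypothetical stand-ins for what the kernel derives from the sched_domain hierarchy.

```c
#include <stdbool.h>

/* Hypothetical labels for where the shared object ends up. */
enum shared_anchor {
	ANCHOR_LLC,	/* keep sched_domain_shared on sd_llc */
	ANCHOR_ASYM,	/* move it to sd_asym_cpucapacity */
};

/*
 * has_asym_full_ancestor: this CPU has an SD_ASYM_CPUCAPACITY_FULL
 *                         ancestor in its own hierarchy (false for
 *                         symmetric systems and for symmetric islands
 *                         carved out by an exclusive cpuset).
 * ancestor_from_numa:     that ancestor was built from SD_NUMA, so its
 *                         span may overlap with other domains.
 */
static enum shared_anchor pick_shared_anchor(bool has_asym_full_ancestor,
					     bool ancestor_from_numa)
{
	if (has_asym_full_ancestor && !ancestor_from_numa)
		return ANCHOR_ASYM;

	/* Fallback cases 1), 2) and 3) from the commit message. */
	return ANCHOR_LLC;
}
```

All three fallback cases collapse into the same branch here, which is why the sketch needs only two inputs.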

…ty idle selection

On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting different per-core frequencies), the wakeup path uses select_idle_capacity() and prioritizes idle CPUs with higher capacity for better task placement. However, when those CPUs belong to SMT cores, their effective capacity can be much lower than the nominal capacity when the sibling thread is busy: SMT siblings compete for shared resources, so a "high capacity" CPU that is idle but whose sibling is busy does not deliver its full capacity. This effective capacity reduction cannot be modeled by the static capacity value alone.

Introduce SMT awareness in the asym-capacity idle selection policy: when SMT is active, always prefer fully-idle SMT cores over partially-idle ones.

Prioritizing fully-idle SMT cores yields better task placement because the effective capacity of partially-idle SMT cores is reduced; always preferring them when available leads to more accurate capacity usage on task wakeup.

On an SMT system with asymmetric CPU capacities, SMT-aware idle selection has been shown to improve throughput by around 15-18% for CPU-bound workloads, running an amount of tasks equal to the amount of SMT cores.

Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
(cherry picked from https://lore.kernel.org/all/20260428051720.3180182-1-arighi@nvidia.com)
Signed-off-by: Andrea Righi <arighi@nvidia.com>
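To make the preference order concrete, here is a minimal user-space model of the policy described above. It is a sketch under stated assumptions, not the kernel's select_idle_capacity(): struct model_cpu, core_fully_idle() and pick_idle_cpu() are hypothetical names, and the model ignores utilization clamping and fit checks entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of a CPU; not a kernel structure. */
struct model_cpu {
	int core;		/* SMT core this CPU belongs to */
	unsigned long capacity;	/* nominal capacity, e.g. from ACPI/CPPC */
	bool idle;
};

/* A core is fully idle when every CPU sharing its core id is idle. */
static bool core_fully_idle(const struct model_cpu *cpus, size_t n, int core)
{
	for (size_t i = 0; i < n; i++)
		if (cpus[i].core == core && !cpus[i].idle)
			return false;
	return true;
}

/*
 * Return the index of the preferred idle CPU, or -1 if none is idle.
 * Idle CPUs on fully idle cores always beat idle CPUs with a busy
 * sibling; within each class the higher nominal capacity wins.
 */
static int pick_idle_cpu(const struct model_cpu *cpus, size_t n)
{
	int best = -1;
	bool best_full = false;

	for (size_t i = 0; i < n; i++) {
		bool full;

		if (!cpus[i].idle)
			continue;

		full = core_fully_idle(cpus, n, cpus[i].core);
		if (best < 0 ||
		    (full && !best_full) ||
		    (full == best_full &&
		     cpus[i].capacity > cpus[best].capacity)) {
			best = (int)i;
			best_full = full;
		}
	}
	return best;
}
```

In this model, an idle CPU with capacity 900 on a fully idle core is chosen over an idle CPU with capacity 1024 whose sibling is busy, which is the behavior change the commit message argues for.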

… on asym-capacity

When SD_ASYM_CPUCAPACITY load balancing considers pulling a misfit task, capacity_of(dst_cpu) can overstate available compute if the SMT sibling is busy: the core does not deliver its full nominal capacity.

If SMT is active and dst_cpu is not on a fully idle core, skip this destination so we do not migrate a misfit expecting a capacity upgrade we cannot actually provide.

Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Koba Ko <kobak@nvidia.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Reported-by: Felix Abecassis <fabecassis@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
(cherry picked from https://lore.kernel.org/all/20260428051720.3180182-1-arighi@nvidia.com)
Signed-off-by: Andrea Righi <arighi@nvidia.com>
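The rejection rule can be sketched as a single predicate. This is an illustrative model only: accept_misfit_pull() and its parameters are hypothetical, and the final comparison is a deliberate simplification of the capacity fit check the scheduler actually applies.

```c
#include <stdbool.h>

/*
 * Decide whether a misfit task may be pulled to a destination CPU.
 *
 * smt_active:          SMT is enabled on the system
 * dst_core_fully_idle: every SMT sibling on the destination core is idle
 * task_util:           utilization of the misfit task
 * dst_capacity:        nominal capacity of the destination CPU
 */
static bool accept_misfit_pull(bool smt_active, bool dst_core_fully_idle,
			       unsigned long task_util,
			       unsigned long dst_capacity)
{
	/*
	 * A busy sibling means the nominal capacity overstates what the
	 * destination can deliver, so refuse the pull outright.
	 */
	if (smt_active && !dst_core_fully_idle)
		return false;

	/* Simplified fit check (assumption): plain comparison, no margin. */
	return task_util < dst_capacity;
}
```

Without SMT the sibling state is irrelevant, so the predicate falls through to the capacity comparison alone.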

…ty()

Add to select_idle_capacity() the same SIS_UTIL-controlled idle-scan mechanism, already used by select_idle_cpu(): when sched_feat(SIS_UTIL) is enabled and the LLC domain has sched_domain_shared data, derive the per-attempt scan limit from sd->shared->nr_idle_scan.

That bounds the walk on large LLCs and allows an early return once the scan limit is reached, if we already picked a sufficiently strong idle-core candidate (best_fits == ASYM_IDLE_CORE_UCLAMP_MISFIT).

Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
(cherry picked from https://lore.kernel.org/all/20260428051720.3180182-1-arighi@nvidia.com)
Signed-off-by: Andrea Righi <arighi@nvidia.com>
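A bounded scan with this kind of early return might look roughly like the following user-space model. It is a sketch, not the patch: struct scan_cpu and scan_with_limit() are hypothetical, the "strong candidate" test is reduced to "idle CPU on a fully idle core", and the nr_idle_scan argument simply stands in for the value the kernel reads from sd->shared->nr_idle_scan.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU view used by the model. */
struct scan_cpu {
	unsigned long capacity;	/* nominal capacity */
	bool idle;		/* this CPU is idle */
	bool core_idle;		/* all SMT siblings on its core are idle */
};

/*
 * Walk the CPUs in order and return the index of the best idle CPU,
 * or -1 if none is idle. Once nr_idle_scan CPUs have been visited,
 * stop early, but only if a strong (idle-core) candidate is in hand.
 * *visited reports how many CPUs were examined.
 */
static int scan_with_limit(const struct scan_cpu *cpus, size_t n,
			   unsigned int nr_idle_scan, size_t *visited)
{
	int best = -1;
	bool best_strong = false;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct scan_cpu *c = &cpus[i];
		bool strong = c->idle && c->core_idle;

		if (c->idle &&
		    (best < 0 || (strong && !best_strong) ||
		     (strong == best_strong &&
		      c->capacity > cpus[best].capacity))) {
			best = (int)i;
			best_strong = strong;
		}

		/* Budget exhausted: stop only with a strong candidate. */
		if (i + 1 >= nr_idle_scan && best_strong) {
			i++;
			break;
		}
	}

	*visited = i;
	return best;
}
```

The trade-off the model exposes is that a small scan budget may settle for the first idle core it finds, even if a higher-capacity idle core sits further along the span.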

nvmochs · 2026-04-29T17:49:59Z

No further issues from me.

Acked-by: Matthew R. Ochs <mochs@nvidia.com>

FYI, I went ahead and created a Launchpad bug (required for merging) and added the link in the PR description.

Also, as Jamie mentioned, we'll want this backported to 26.04 7.0-LTS and 7.0-bos. However, I recommend we wait until we can pick from -next to avoid carrying this as SAUCE in those kernels. It sounds like the series is getting close to being accepted, so this shouldn't be too far in the future. I'll open Jiras to track getting this into those kernels.

clsotog

Acked-by: Carol L Soto <csoto@nvidia.com>

nvmochs · 2026-04-30T22:30:41Z

Merged, closing PR.

cc0874baf17d (nnoble/nvidia-6.17-next) NVIDIA: SAUCE: sched/fair: Add SIS_UTIL support to select_idle_capacity()
0500b0b58f16 NVIDIA: SAUCE: sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity
f3ac33c0c4fe NVIDIA: SAUCE: sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
49cc5de67c8c NVIDIA: SAUCE: sched/fair: Attach sched_domain_shared to sd_asym_cpucapacity

arighi requested review from ianm-nv and nvidia-bfigg April 28, 2026 06:24

arighi changed the title ~~linux-nvidia-6.17: SMT-aware asymmetric CPU capacity idle selection~~ [24.04_linux-nvidia-6.17-next]: SMT-aware asymmetric CPU capacity idle selection Apr 28, 2026

arighi changed the title ~~[24.04_linux-nvidia-6.17-next]: SMT-aware asymmetric CPU capacity idle selection~~ [24.04_linux-nvidia-6.17-next] linux-nvidia-6.17: SMT-aware asymmetric CPU capacity idle selection Apr 28, 2026

arighi force-pushed the linux-nvidia-6.17-sched branch from c6377a1 to 4ef9f0b Compare April 29, 2026 12:30

clsotog reviewed Apr 29, 2026

arighi force-pushed the linux-nvidia-6.17-sched branch from 4ef9f0b to 9ec2612 Compare April 29, 2026 17:07

arighi and others added 4 commits April 29, 2026 19:23

arighi force-pushed the linux-nvidia-6.17-sched branch from 9ec2612 to b3bf56e Compare April 29, 2026 17:23

clsotog approved these changes Apr 29, 2026

nvmochs closed this Apr 30, 2026

This was referenced May 5, 2026

[26.04_linux-nvidia] linux-nvidia-7.0: SMT-aware asymmetric CPU capacity idle selection #405

Closed

[26.04_linux-nvidia-bos] linux-nvidia-7.0-bos: SMT-aware asymmetric CPU capacity idle selection #406

Closed

arighi mentioned this pull request May 27, 2026

[linux-nvidia-6.18-next] linux-nvidia-6.18: SMT-aware asymmetric CPU capacity idle selection #441

Closed
