[26.04_linux-nvidia-bos] workqueue: Introduce a sharded cache affinity scope by clsotog · Pull Request #394 · NVIDIA/NV-Kernels

clsotog · 2026-04-27T15:57:01Z

This feature from v7.1 has shown benefit on Grace: https://www.phoronix.com/news/Linux-7.1-WQ

LKML (series): https://lore.kernel.org/all/20260401-workqueue_sharded-v3-0-ab0b9336bf0b@debian.org/
LKML (follow-up fix): https://lore.kernel.org/all/20260413-workqueue_fix_nios-v2-1-2cf6a61b6bb3@debian.org/
LKML (related fix taken from series RFC): https://lore.kernel.org/all/2eef24999c6eeef8e8ea8daf54990e76@kernel.org/
LKML (additional follow-up fix): https://lore.kernel.org/all/20260402205913.1953402-1-arnd@kernel.org/

Upstream patches:
9dc42c9 workqueue: fix typo in WQ_AFFN_SMT comment
5920d04 workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
4cdc8a7 workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
738390a tools/workqueue: add CACHE_SHARD support to wq_dump.py
24b2e73 workqueue: add test_workqueue benchmark module
41e3ccc docs: workqueue: document WQ_AFFN_CACHE_SHARD affinity scope
76af546 workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id()
1abaae9 workqueue: fix parse_affn_scope() prefix matching bug
c6890f3 workqueue: avoid unguarded 64-bit division

LP: https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-6.17/+bug/2150467

Validation:

./test_cache_shard.sh
TAP version 13
ok 1 - WQ_AFFN_CACHE_SHARD support present
ok 2 - default cache_shard_size is 8
ok 3 - default affinity scope is cache_shard (via module param)
./test_cache_shard.sh: line 92: /sys/module/workqueue/parameters/cache_shard_size: Permission denied
ok 4 - cache_shard_size is read-only (write correctly rejected)
ok 5 - writeback: all affinity scopes accepted and read back correctly
ok 6 - writeback: cannot restore scope (write permission denied) # SKIP
ok 7 - writeback: cache_shard scope accepted
ok 8 - LLC shard sizes all <= wq_cache_shard_size=8 (2 LLC(s) checked)
ok 9 - wq_affinity_test passed
1..9

Passed: 8 Failed: 0 Skipped: 1

./test-workqueue-cache-shard.sh
== Environment ==
kernel: 7.0.0
machine: aarch64

== Boot cmdline (workqueue-related) ==
INFO: No workqueue.* tokens in /proc/cmdline (defaults apply).

== Built-in module parameters: workqueue ==
cache_shard_size=8
cpu_intensive_thresh_us=20000
cpu_intensive_warning_thresh=4
debug_force_rr_cpu=N
default_affinity_scope=cache_shard
power_efficient=Y

== Virtual workqueue sysfs: affinity_scope ==
blkcg_punt_bio: affinity_scope=default (cache_shard)
ib-comp-unb-wq: affinity_scope=default (cache_shard)
nvme-auth-wq: affinity_scope=default (cache_shard)
nvme-delete-wq: affinity_scope=default (cache_shard)
nvme-reset-wq: affinity_scope=default (cache_shard)
nvme-wq: affinity_scope=default (cache_shard)
raid5wq: affinity_scope=default (cache_shard)
writeback: affinity_scope=cache_shard

== Try writing affinity_scope on first writable wq (best-effort) ==
INFO: Using writeback (current=cache_shard)
set affinity_scope=cache_shard -> read_back=cache_shard
set affinity_scope=cache -> read_back=cache
set affinity_scope=smt -> read_back=smt
set affinity_scope=cpu -> read_back=cpu
set affinity_scope=system -> read_back=system
set affinity_scope=numa -> read_back=numa

== Optional benchmark module (name varies by patch revision) ==
INFO: Found kernel module: test_workqueue
INFO: Running: modprobe test_workqueue
modprobe: ERROR: could not insert 'test_workqueue': Resource temporarily unavailable
PASS: modprobe test_workqueue: init returned -EAGAIN (userspace exit 1) — Return -EAGAIN so the module does not stay loaded after the benchmark; treating as success.
INFO: Recent kernel log lines:
[ 308.429626] test_workqueue: running 144 threads, 50000 items/thread
[ 310.762202] test_workqueue: cpu 8052070 items/sec p50=14400 p90=15168 p95=15392 ns
[ 312.914305] test_workqueue: smt 8067805 items/sec p50=14432 p90=15200 p95=15520 ns
[ 316.089652] test_workqueue: cache_shard 4377330 items/sec p50=8448 p90=13664 p95=15872 ns
[ 330.482312] test_workqueue: cache 639225 items/sec p50=41504 p90=185728 p95=197567 ns
[ 344.235586] test_workqueue: numa 646112 items/sec p50=43008 p90=180351 p95=190624 ns
[ 373.915019] test_workqueue: system 269122 items/sec p50=209247 p90=269824 p95=277023 ns
PASS: Finished. Review WARN lines; missing sysfs/module often means config differs from the patch series.

./wq_stress -d 60
TAP version 13

wq_stress: duration=60s verbose=0 phase=all
ok 1 - phase1: shard layout math correct (32768 cases, LLC sizes 1..512, shard_sizes 1..64)
ok 2 - phase2: stress completed without deadlock (cache=1343910246 cache_shard=13941963556 iterations)

phase2: throughput ratio cache_shard/cache = 10.37x
ok 3 - phase2: cache_shard throughput >= 90% of cache (ratio=10.37x)
ok 4 - phase3: captured 3429 workqueue_execute_start events
ok 5 - phase3: all 3429 events on CPUs with valid shard assignments
ok 6 - phase4: 25 scope switches under load completed, scope restored to 'cache_shard'
1..6

Passed: 6 Failed: 0 Skipped: 0
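
For anyone comparing scopes from the test_workqueue lines above, here is a minimal Python sketch that parses them and computes the cache_shard vs. cache ratios. The regular expression and the helper name `parse_results` are illustrative assumptions based only on the log format shown in this description; they are not part of the test scripts.

```python
import re

# Assumed line format, taken from the dmesg excerpt in this PR description.
LINE_RE = re.compile(
    r"test_workqueue:\s+(\w+)\s+(\d+) items/sec p50=(\d+) p90=(\d+) p95=(\d+) ns"
)

def parse_results(log_text):
    """Return {scope: {"items_per_sec", "p50", "p90", "p95"}} from dmesg text."""
    results = {}
    for match in LINE_RE.finditer(log_text):
        scope, rate, p50, p90, p95 = match.groups()
        results[scope] = {
            "items_per_sec": int(rate),
            "p50": int(p50),
            "p90": int(p90),
            "p95": int(p95),
        }
    return results

# Lines copied from the validation run above (Grace, 144 threads).
LOG = """
[ 308.429626] test_workqueue: running 144 threads, 50000 items/thread
[ 310.762202] test_workqueue: cpu 8052070 items/sec p50=14400 p90=15168 p95=15392 ns
[ 312.914305] test_workqueue: smt 8067805 items/sec p50=14432 p90=15200 p95=15520 ns
[ 316.089652] test_workqueue: cache_shard 4377330 items/sec p50=8448 p90=13664 p95=15872 ns
[ 330.482312] test_workqueue: cache 639225 items/sec p50=41504 p90=185728 p95=197567 ns
[ 344.235586] test_workqueue: numa 646112 items/sec p50=43008 p90=180351 p95=190624 ns
[ 373.915019] test_workqueue: system 269122 items/sec p50=209247 p90=269824 p95=277023 ns
"""

results = parse_results(LOG)
throughput_ratio = results["cache_shard"]["items_per_sec"] / results["cache"]["items_per_sec"]
p50_ratio = results["cache"]["p50"] / results["cache_shard"]["p50"]
print(f"throughput cache_shard/cache: {throughput_ratio:.2f}x")
print(f"p50 latency cache/cache_shard: {p50_ratio:.2f}x")
```

The "running ... threads" header line is skipped because it does not match the per-scope pattern.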

github-actions · 2026-04-27T16:10:01Z

PR Validation Report

Patchscan ✅ No Missing Fixes

All cherry-picked commits checked — no missing upstream fixes found.

PR Lint ⚠️ Warnings

Details

Checking 9 commits...

Cherry-pick digest:
┌──────────────┬───────────────────────────────────────────────┬────────────┬─────────┬───────────────────────────┐
│ Local        │ Referenced upstream / Patch subject           │ Patch-ID   │ Subject │ SoB chain                 │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 911b93df5e04 │ 76af54648899                                  │ match      │ match   │ preserved + csoto added   │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ d2e0836fb559 │ c6890f36fc49                                  │ match      │ match   │ preserved + csoto added   │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 9a7c8b624f71 │ 41e3ccca00b3                                  │ match      │ match   │ preserved + csoto added   │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 5e368f829353 │ 24b2e73f9700                                  │ match      │ match   │ preserved + csoto added   │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 11f929c67404 │ 738390a5321c                                  │ match      │ match   │ preserved + csoto added   │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 5241fff6384a │ 4cdc8a7389d5                                  │ match      │ match   │ preserved + csoto added   │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ bde0170b8d61 │ 5920d046f7ae                                  │ match      │ match   │ preserved + csoto added   │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 441c719f4915 │ 9dc42c907028                                  │ match      │ match   │ preserved + csoto added   │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 2ff78cba0bc8 │ 1abaae9b38a8                                  │ match      │ match   │ preserved + csoto added   │
└──────────────┴───────────────────────────────────────────────┴────────────┴─────────┴───────────────────────────┘

Lint results:
W: 911b93df5e04 ("workqueue: validate cpumask_first() result in llc_"): subject 73 chars (>72)

parse_affn_scope() uses strncasecmp() with the length of the candidate name, which means it only checks whether the input *starts with* a known scope name.

Given that the upcoming diff will create the "cache_shard" affinity scope, writing "cache_shard" to a workqueue's affinity_scope sysfs attribute always matches "cache" first, making it impossible to select "cache_shard" via sysfs. This fix enables it to distinguish "cache" from "cache_shard".

Fix by replacing the hand-rolled prefix matching loop with sysfs_match_string(), which uses sysfs_streq() for exact matching (modulo trailing newlines). Also add the missing const qualifier to the wq_affn_names[] array declaration.

Note that sysfs_streq() is case-sensitive, unlike the previous strncasecmp() approach. This is intentional and consistent with how other sysfs attributes handle string matching in the kernel.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 1abaae9)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
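
The prefix-matching problem described in this commit message can be modelled in a few lines of Python. This is only an illustration of the two matching strategies, not the kernel code: the real implementation is C, the ordering of `SCOPE_NAMES` is an assumption, and `-1` stands in for the kernel's error return.

```python
# Assumed ordering of scope names; what matters is that "cache" precedes "cache_shard".
SCOPE_NAMES = ["default", "cpu", "smt", "cache", "cache_shard", "numa", "system"]

def parse_prefix(buf, names=SCOPE_NAMES):
    """Model of the old loop: case-insensitive compare of only len(name) chars."""
    for index, name in enumerate(names):
        if buf[:len(name)].lower() == name.lower():
            return index
    return -1  # stand-in for an error code

def parse_exact(buf, names=SCOPE_NAMES):
    """Model of exact matching: case-sensitive, ignoring one trailing newline."""
    if buf.endswith("\n"):
        buf = buf[:-1]
    for index, name in enumerate(names):
        if buf == name:
            return index
    return -1  # stand-in for an error code

# "cache_shard" starts with "cache", so the prefix matcher stops too early.
print(SCOPE_NAMES[parse_prefix("cache_shard\n")])
print(SCOPE_NAMES[parse_exact("cache_shard\n")])
```

The same model shows the case-sensitivity change: `parse_prefix("CACHE")` finds "cache", while `parse_exact("CACHE")` rejects the input.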

Fix "poer" -> "per" in the WQ_AFFN_SMT enum comment.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 9dc42c9)
Signed-off-by: Carol L Soto <csoto@nvidia.com>

On systems where many CPUs share one LLC, unbound workqueues using WQ_AFFN_CACHE collapse to a single worker pool, causing heavy spinlock contention on pool->lock. For example, Chuck Lever measured 39% of cycles lost to native_queued_spin_lock_slowpath on a 12-core shared-L3 NFS-over-RDMA system.

The existing affinity hierarchy (cpu, smt, cache, numa, system) offers no intermediate option between per-LLC and per-SMT-core granularity.

Add WQ_AFFN_CACHE_SHARD, which subdivides each LLC into groups of at most wq_cache_shard_size cores (default 8, tunable via boot parameter). Shards are always split on core (SMT group) boundaries so that Hyper-Threading siblings are never placed in different pods. Cores are distributed across shards as evenly as possible -- for example, 36 cores in a single LLC with max shard size 8 produces 5 shards of 8+7+7+7+7 cores.

The implementation follows the same comparator pattern as other affinity scopes: precompute_cache_shard_ids() pre-fills the cpu_shard_id[] array from the already-initialized WQ_AFFN_CACHE and WQ_AFFN_SMT topology, and cpus_share_cache_shard() is passed to init_pod_type().

Benchmarks on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread) show that cache_shard delivers ~5x the throughput and ~6.5x lower p50 latency compared to the cache scope.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 5920d04)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
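
The "as evenly as possible" split can be sketched as follows. This is a hedged model that reproduces the 36-core example from the commit message; the function name `shard_sizes` and the exact arithmetic are assumptions and may not match how the kernel computes shard boundaries. It counts cores, not CPUs, since SMT siblings always stay in the same shard.

```python
def shard_sizes(nr_cores, max_shard_size=8):
    """Split nr_cores into the fewest shards of at most max_shard_size cores,
    keeping shard sizes within one core of each other (larger shards first)."""
    if nr_cores <= 0 or max_shard_size <= 0:
        raise ValueError("nr_cores and max_shard_size must be positive")
    nr_shards = -(-nr_cores // max_shard_size)  # ceiling division
    base, extra = divmod(nr_cores, nr_shards)
    # The first `extra` shards take one additional core each.
    return [base + 1] * extra + [base] * (nr_shards - extra)

print(shard_sizes(36))  # the example from the commit message
print(shard_sizes(72))
print(shard_sizes(8))
```

In this model an LLC with 8 or fewer cores yields a single shard, so the pod layout is the same as with the plain cache scope.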

Set WQ_AFFN_CACHE_SHARD as the default affinity scope for unbound workqueues. On systems where many CPUs share one LLC, the previous default (WQ_AFFN_CACHE) collapses all CPUs to a single worker pool, causing heavy spinlock contention on pool->lock.

WQ_AFFN_CACHE_SHARD subdivides each LLC into smaller groups, providing a better balance between locality and contention. Users can revert to the previous behavior with workqueue.default_affinity_scope=cache.

On systems with 8 or fewer cores per LLC, CACHE_SHARD produces a single shard covering the entire LLC, making it functionally identical to the previous CACHE default. The sharding only activates when an LLC has more than 8 cores.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 4cdc8a7)
Signed-off-by: Carol L Soto <csoto@nvidia.com>

The WQ_AFFN_CACHE_SHARD affinity scope was added to the kernel but wq_dump.py was not updated to enumerate it. Add the missing constant lookup and include it in the affinity scopes iteration so that drgn output shows the CACHE_SHARD pod topology alongside the other scopes.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 738390a)
Signed-off-by: Carol L Soto <csoto@nvidia.com>

Add a kernel module that benchmarks queue_work() throughput on an unbound workqueue to measure pool->lock contention under different affinity scope configurations (cache vs cache_shard).

The module spawns N kthreads (default: num_online_cpus()), each bound to a different CPU. All threads start simultaneously and queue work items, measuring the latency of each queue_work() call. Results are reported as p50/p90/p95 latencies for each affinity scope.

The affinity scope is switched between runs via the workqueue's sysfs affinity_scope attribute (WQ_SYSFS), avoiding the need for any new exported symbols.

The module runs as __init-only, returning -EAGAIN to auto-unload, and can be re-run via insmod.

Example of the output:

  running 50 threads, 50000 items/thread

  cpu          6806017 items/sec p50=2574  p90=5068  p95=5818 ns
  smt          6821040 items/sec p50=2624  p90=5168  p95=5949 ns
  cache_shard  1633653 items/sec p50=5337  p90=9694  p95=11207 ns
  cache         286069 items/sec p50=72509 p90=82304 p95=85009 ns
  numa          319403 items/sec p50=63745 p90=73480 p95=76505 ns
  system        308461 items/sec p50=66561 p90=75714 p95=78048 ns

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 24b2e73)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
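
To make the reported figures concrete, here is a small Python sketch of how items/sec and p50/p90/p95 could be derived from per-call latencies. The index-based percentile rule and the names `percentile` and `summarize` are assumptions for illustration; the module's actual C implementation may sort, sample, or round differently.

```python
def percentile(sorted_ns, pct):
    """Pick the element at position len * pct / 100, clamped to the last index."""
    index = min(len(sorted_ns) - 1, len(sorted_ns) * pct // 100)
    return sorted_ns[index]

def summarize(latencies_ns, elapsed_ns):
    """Summarize one run: throughput plus p50/p90/p95 latency in nanoseconds."""
    ordered = sorted(latencies_ns)
    return {
        "items_per_sec": len(ordered) * 1_000_000_000 // elapsed_ns,
        "p50": percentile(ordered, 50),
        "p90": percentile(ordered, 90),
        "p95": percentile(ordered, 95),
    }

# Synthetic data: 100 calls with latencies 1..100 ns, finished in 1 ms.
summary = summarize(list(range(100, 0, -1)), elapsed_ns=1_000_000)
print(summary)
```

The synthetic input is deliberately unsorted to show that the summary does not depend on the order in which latencies were recorded.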

Update kernel-parameters.txt and workqueue.rst to reflect the new cache_shard affinity scope and the default change from cache to cache_shard.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 41e3ccc)
Signed-off-by: Carol L Soto <csoto@nvidia.com>

The printk() requires a division that is not allowed on 32-bit architectures:

  x86_64-linux-ld: lib/test_workqueue.o: in function `test_workqueue_init':
  test_workqueue.c:(.init.text+0x36f): undefined reference to `__udivdi3'

Use div_u64() to print the resulting elapsed microseconds.

Fixes: 24b2e73 ("workqueue: add test_workqueue benchmark module")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit c6890f3)
Signed-off-by: Carol L Soto <csoto@nvidia.com>

…id()

On uniprocessor (UP) configs such as nios2, NR_CPUS is 1, so cpu_shard_id[] is a single-element array (int[1]). In llc_populate_cpu_shard_id(), cpumask_first(sibling_cpus) returns an unsigned int that the compiler cannot prove is always 0, triggering a -Warray-bounds warning when the result is used to index cpu_shard_id[]:

  kernel/workqueue.c:8321:55: warning: array subscript 1 is above array bounds of 'int[1]' [-Warray-bounds]
   8321 |   cpu_shard_id[c] = cpu_shard_id[cpumask_first(sibling_cpus)];
        |                     ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a false positive: sibling_cpus can never be empty here because 'c' itself is always set in it, so cpumask_first() will always return a valid CPU. However, the compiler cannot prove this statically, and the warning only manifests on UP configs where the array size is 1.

Add a bounds check with WARN_ON_ONCE to silence the warning, and store the result in a local variable to make the code clearer and avoid calling cpumask_first() twice.

Fixes: 5920d04 ("workqueue: add WQ_AFFN_CACHE_SHARD affinity scope")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604022343.GQtkF2vO-lkp@intel.com/
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 76af546)
Signed-off-by: Carol L Soto <csoto@nvidia.com>

jamieNguyenNVIDIA · 2026-04-28T00:01:47Z

Acked-by: Jamie Nguyen <jamien@nvidia.com>

nvmochs · 2026-04-28T01:49:40Z

LGTM!

Acked-by: Matthew R. Ochs <mochs@nvidia.com>

nvmochs · 2026-04-28T17:16:16Z

Merged, closing PR.

945602e4d4c2 workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id()
93a1d393dc86 workqueue: avoid unguarded 64-bit division
d9a0944605cf docs: workqueue: document WQ_AFFN_CACHE_SHARD affinity scope
1e0bcbcb6c8f workqueue: add test_workqueue benchmark module
78dfce3e243e tools/workqueue: add CACHE_SHARD support to wq_dump.py
a8b46e7d7adc workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
9536794c2fd7 workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
c9fe0bd9ff5c workqueue: fix typo in WQ_AFFN_SMT comment
b98c9681040c workqueue: fix parse_affn_scope() prefix matching bug

clsotog force-pushed the clsotog/workqueue-sharded-26.04-bos branch from afdadad to 6acf03f Compare April 27, 2026 18:42

leitao and others added 9 commits April 27, 2026 12:18

workqueue: fix typo in WQ_AFFN_SMT comment

441c719

clsotog force-pushed the clsotog/workqueue-sharded-26.04-bos branch from 6acf03f to 911b93d Compare April 27, 2026 19:41

nvmochs self-requested a review April 28, 2026 01:49

nvmochs approved these changes Apr 28, 2026

nvmochs closed this Apr 28, 2026

clsotog wants to merge 9 commits into NVIDIA:26.04_linux-nvidia-bos from clsotog:clsotog/workqueue-sharded-26.04-bos
