[26.04_linux-nvidia-bos] workqueue: Introduce a sharded cache affinity scope#394
Closed
clsotog wants to merge 9 commits into
PR Validation Report
Patchscan: ✅ No Missing Fixes. All cherry-picked commits checked; no missing upstream fixes found.
PR Lint
afdadad to 6acf03f
parse_affn_scope() uses strncasecmp() with the length of the candidate name, which means it only checks whether the input *starts with* a known scope name.

An upcoming change adds a "cache_shard" affinity scope. With prefix matching, writing "cache_shard" to a workqueue's affinity_scope sysfs attribute always matches "cache" first, making it impossible to select "cache_shard" via sysfs. This fix lets the parser distinguish "cache" from "cache_shard".

Fix by replacing the hand-rolled prefix matching loop with sysfs_match_string(), which uses sysfs_streq() for exact matching (modulo trailing newlines). Also add the missing const qualifier to the wq_affn_names[] array declaration.

Note that sysfs_streq() is case-sensitive, unlike the previous strncasecmp() approach. This is intentional and consistent with how other sysfs attributes handle string matching in the kernel.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 1abaae9)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
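The difference between the two matching strategies can be sketched in userspace. This is a Python model, not the kernel code: the scope table and its order are assumptions (chosen so that "cache" is tried before "cache_shard"), and `parse_prefix()` / `parse_exact()` are hypothetical names standing in for the old strncasecmp() loop and for sysfs_match_string().

```python
# Hypothetical scope table. The order is an assumption made for this sketch:
# "cache" comes before "cache_shard" so the prefix problem is visible.
SCOPES = ["default", "cpu", "smt", "cache", "cache_shard", "numa", "system"]


def parse_prefix(value):
    """Model of the old loop: strncasecmp(value, name, strlen(name)) == 0."""
    for index, name in enumerate(SCOPES):
        # Only the first len(name) characters are compared, ignoring case.
        if value[:len(name)].lower() == name.lower():
            return index
    return -1  # stands in for -EINVAL


def parse_exact(value):
    """Model of sysfs_match_string(): exact, case-sensitive comparison,
    with one trailing newline ignored (as sysfs_streq() does)."""
    if value.endswith("\n"):
        value = value[:-1]
    for index, name in enumerate(SCOPES):
        if value == name:
            return index
    return -1  # stands in for -EINVAL


if __name__ == "__main__":
    written = "cache_shard\n"  # what `echo cache_shard > affinity_scope` sends
    print("prefix match ->", SCOPES[parse_prefix(written)])
    print("exact match  ->", SCOPES[parse_exact(written)])
```

In this model the prefix matcher resolves "cache_shard" to "cache", while the exact matcher selects "cache_shard" and rejects uppercase input such as "CACHE".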
Fix "poer" -> "per" in the WQ_AFFN_SMT enum comment. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from commit 9dc42c9) Signed-off-by: Carol L Soto <csoto@nvidia.com>
On systems where many CPUs share one LLC, unbound workqueues using WQ_AFFN_CACHE collapse to a single worker pool, causing heavy spinlock contention on pool->lock. For example, Chuck Lever measured 39% of cycles lost to native_queued_spin_lock_slowpath on a 12-core shared-L3 NFS-over-RDMA system.

The existing affinity hierarchy (cpu, smt, cache, numa, system) offers no intermediate option between per-LLC and per-SMT-core granularity.

Add WQ_AFFN_CACHE_SHARD, which subdivides each LLC into groups of at most wq_cache_shard_size cores (default 8, tunable via boot parameter). Shards are always split on core (SMT group) boundaries so that Hyper-Threading siblings are never placed in different pods. Cores are distributed across shards as evenly as possible -- for example, 36 cores in a single LLC with max shard size 8 produces 5 shards of 8+7+7+7+7 cores.

The implementation follows the same comparator pattern as other affinity scopes: precompute_cache_shard_ids() pre-fills the cpu_shard_id[] array from the already-initialized WQ_AFFN_CACHE and WQ_AFFN_SMT topology, and cpus_share_cache_shard() is passed to init_pod_type().

A benchmark on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread) shows that cache_shard delivers ~5x the throughput and ~6.5x lower p50 latency compared to the cache scope.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 5920d04)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
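The shard arithmetic described above can be checked with a small model. This is a sketch under assumptions, not the kernel's precompute_cache_shard_ids(): `shard_sizes()` is a hypothetical helper that counts cores (SMT groups) only, and placing the larger shards first is a guess that happens to match the 8+7+7+7+7 example.

```python
def shard_sizes(nr_cores, max_shard_size=8):
    """Split nr_cores cores of one LLC into the fewest shards of at most
    max_shard_size cores each, with sizes as even as possible."""
    if nr_cores <= 0:
        return []
    if max_shard_size <= 0:
        raise ValueError("max_shard_size must be positive")
    # Fewest shards that respect the size cap (ceiling division).
    nr_shards = -(-nr_cores // max_shard_size)
    # Every shard gets `base` cores; the first `extra` shards get one more.
    base, extra = divmod(nr_cores, nr_shards)
    return [base + 1] * extra + [base] * (nr_shards - extra)


if __name__ == "__main__":
    for cores in (8, 12, 36, 72):
        print(cores, "cores ->", shard_sizes(cores))
```

With the default cap of 8, an LLC of 8 or fewer cores yields a single shard in this model, so the layout matches the plain cache scope.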
Set WQ_AFFN_CACHE_SHARD as the default affinity scope for unbound workqueues. On systems where many CPUs share one LLC, the previous default (WQ_AFFN_CACHE) collapses all CPUs to a single worker pool, causing heavy spinlock contention on pool->lock.

WQ_AFFN_CACHE_SHARD subdivides each LLC into smaller groups, providing a better balance between locality and contention. Users can revert to the previous behavior with workqueue.default_affinity_scope=cache.

On systems with 8 or fewer cores per LLC, CACHE_SHARD produces a single shard covering the entire LLC, making it functionally identical to the previous CACHE default. The sharding only activates when an LLC has more than 8 cores.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 4cdc8a7)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
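To see which default a running kernel actually uses, the module parameters can be read from sysfs. The sketch below assumes the parameters appear under /sys/module/workqueue/parameters, as in the validation logs of this PR; `read_wq_params()` and `default_is_cache_shard()` are hypothetical helper names.

```python
from pathlib import Path

# Directory taken from the validation output in this PR; it may be absent
# on kernels without the series.
PARAM_DIR = "/sys/module/workqueue/parameters"


def read_wq_params(names=("default_affinity_scope", "cache_shard_size"),
                   base=PARAM_DIR):
    """Return {name: value or None} for the requested workqueue parameters."""
    params = {}
    for name in names:
        try:
            params[name] = (Path(base) / name).read_text().strip()
        except OSError:
            # Missing file or no permission: report the value as unknown.
            params[name] = None
    return params


def default_is_cache_shard(params):
    """True when the kernel reports cache_shard as the default scope."""
    return params.get("default_affinity_scope") == "cache_shard"


if __name__ == "__main__":
    current = read_wq_params()
    print(current)
    print("cache_shard default:", default_is_cache_shard(current))
```

Booting with workqueue.default_affinity_scope=cache should make the first value read back as "cache".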
The WQ_AFFN_CACHE_SHARD affinity scope was added to the kernel but wq_dump.py was not updated to enumerate it. Add the missing constant lookup and include it in the affinity scopes iteration so that drgn output shows the CACHE_SHARD pod topology alongside the other scopes. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from commit 738390a) Signed-off-by: Carol L Soto <csoto@nvidia.com>
Add a kernel module that benchmarks queue_work() throughput on an unbound workqueue to measure pool->lock contention under different affinity scope configurations (cache vs cache_shard).

The module spawns N kthreads (default: num_online_cpus()), each bound to a different CPU. All threads start simultaneously and queue work items, measuring the latency of each queue_work() call. Results are reported as p50/p90/p95 latencies for each affinity scope.

The affinity scope is switched between runs via the workqueue's sysfs affinity_scope attribute (WQ_SYSFS), avoiding the need for any new exported symbols.

The module runs as __init-only, returning -EAGAIN to auto-unload, and can be re-run via insmod.

Example of the output:

  running 50 threads, 50000 items/thread
  cpu          6806017 items/sec p50=2574  p90=5068  p95=5818 ns
  smt          6821040 items/sec p50=2624  p90=5168  p95=5949 ns
  cache_shard  1633653 items/sec p50=5337  p90=9694  p95=11207 ns
  cache         286069 items/sec p50=72509 p90=82304 p95=85009 ns
  numa          319403 items/sec p50=63745 p90=73480 p95=76505 ns
  system        308461 items/sec p50=66561 p90=75714 p95=78048 ns

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 24b2e73)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
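For readers unfamiliar with the reported columns, the following userspace sketch shows one way to derive items/sec and p50/p90/p95 from per-call latency samples. The commit message does not say how the module computes its percentiles, so the nearest-rank method here is an assumption, and `percentile()` and `summarize()` are hypothetical names.

```python
def percentile(sorted_ns, pct):
    """Nearest-rank percentile of an ascending list of latencies (ns)."""
    if not sorted_ns:
        raise ValueError("no samples")
    # Integer ceiling of pct * n / 100, which avoids float rounding surprises.
    rank = max(1, (pct * len(sorted_ns) + 99) // 100)
    return sorted_ns[rank - 1]


def summarize(latencies_ns, elapsed_ns):
    """Return throughput and p50/p90/p95 for one benchmark run."""
    samples = sorted(latencies_ns)
    return {
        "items_per_sec": len(samples) * 1_000_000_000 // elapsed_ns,
        "p50": percentile(samples, 50),
        "p90": percentile(samples, 90),
        "p95": percentile(samples, 95),
    }


if __name__ == "__main__":
    # Synthetic latencies of 1..100 ns collected over one second.
    print(summarize(range(1, 101), elapsed_ns=1_000_000_000))
```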
Update kernel-parameters.txt and workqueue.rst to reflect the new cache_shard affinity scope and the default change from cache to cache_shard. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org> (cherry picked from commit 41e3ccc) Signed-off-by: Carol L Soto <csoto@nvidia.com>
The printk() requires a 64-bit division that is not allowed on 32-bit architectures:

  x86_64-linux-ld: lib/test_workqueue.o: in function `test_workqueue_init':
  test_workqueue.c:(.init.text+0x36f): undefined reference to `__udivdi3'

Use div_u64() to print the resulting elapsed microseconds.

Fixes: 24b2e73 ("workqueue: add test_workqueue benchmark module")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit c6890f3)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
…id()
On uniprocessor (UP) configs such as nios2, NR_CPUS is 1, so
cpu_shard_id[] is a single-element array (int[1]). In
llc_populate_cpu_shard_id(), cpumask_first(sibling_cpus) returns an
unsigned int that the compiler cannot prove is always 0, triggering
a -Warray-bounds warning when the result is used to index
cpu_shard_id[]:
kernel/workqueue.c:8321:55: warning: array subscript 1 is above
array bounds of 'int[1]' [-Warray-bounds]
8321 | cpu_shard_id[c] = cpu_shard_id[cpumask_first(sibling_cpus)];
| ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is a false positive: sibling_cpus can never be empty here because
'c' itself is always set in it, so cpumask_first() will always return a
valid CPU. However, the compiler cannot prove this statically, and the
warning only manifests on UP configs where the array size is 1.
Add a bounds check with WARN_ON_ONCE to silence the warning, and store
the result in a local variable to make the code clearer and avoid calling
cpumask_first() twice.
Fixes: 5920d04 ("workqueue: add WQ_AFFN_CACHE_SHARD affinity scope")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604022343.GQtkF2vO-lkp@intel.com/
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 76af546)
Signed-off-by: Carol L Soto <csoto@nvidia.com>
6acf03f to 911b93d
LGTM!
nvmochs
approved these changes
Apr 28, 2026
Merged, closing PR.
This feature from v7.1 has shown benefit on Grace: https://www.phoronix.com/news/Linux-7.1-WQ
LKML (series): https://lore.kernel.org/all/20260401-workqueue_sharded-v3-0-ab0b9336bf0b@debian.org/
LKML (follow-up fix): https://lore.kernel.org/all/20260413-workqueue_fix_nios-v2-1-2cf6a61b6bb3@debian.org/
LKML (related fix taken from series RFC): https://lore.kernel.org/all/2eef24999c6eeef8e8ea8daf54990e76@kernel.org/
LKML (additional follow-up fix): https://lore.kernel.org/all/20260402205913.1953402-1-arnd@kernel.org/
Upstream patches:
9dc42c9 workqueue: fix typo in WQ_AFFN_SMT comment
5920d04 workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
4cdc8a7 workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
738390a tools/workqueue: add CACHE_SHARD support to wq_dump.py
24b2e73 workqueue: add test_workqueue benchmark module
41e3ccc docs: workqueue: document WQ_AFFN_CACHE_SHARD affinity scope
76af546 workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id()
1abaae9 workqueue: fix parse_affn_scope() prefix matching bug
c6890f3 workqueue: avoid unguarded 64-bit division
LP: https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-6.17/+bug/2150467
Validation:
./test_cache_shard.sh
TAP version 13
ok 1 - WQ_AFFN_CACHE_SHARD support present
ok 2 - default cache_shard_size is 8
ok 3 - default affinity scope is cache_shard (via module param)
./test_cache_shard.sh: line 92: /sys/module/workqueue/parameters/cache_shard_size: Permission denied
ok 4 - cache_shard_size is read-only (write correctly rejected)
ok 5 - writeback: all affinity scopes accepted and read back correctly
ok 6 - writeback: cannot restore scope (write permission denied) # SKIP
ok 7 - writeback: cache_shard scope accepted
ok 8 - LLC shard sizes all <= wq_cache_shard_size=8 (2 LLC(s) checked)
ok 9 - wq_affinity_test passed
1..9
Passed: 8 Failed: 0 Skipped: 1
./test-workqueue-cache-shard.sh
== Environment ==
kernel: 7.0.0
machine: aarch64
== Boot cmdline (workqueue-related) ==
INFO: No workqueue.* tokens in /proc/cmdline (defaults apply).
== Built-in module parameters: workqueue ==
cache_shard_size=8
cpu_intensive_thresh_us=20000
cpu_intensive_warning_thresh=4
debug_force_rr_cpu=N
default_affinity_scope=cache_shard
power_efficient=Y
== Virtual workqueue sysfs: affinity_scope ==
blkcg_punt_bio: affinity_scope=default (cache_shard)
ib-comp-unb-wq: affinity_scope=default (cache_shard)
nvme-auth-wq: affinity_scope=default (cache_shard)
nvme-delete-wq: affinity_scope=default (cache_shard)
nvme-reset-wq: affinity_scope=default (cache_shard)
nvme-wq: affinity_scope=default (cache_shard)
raid5wq: affinity_scope=default (cache_shard)
writeback: affinity_scope=cache_shard
== Try writing affinity_scope on first writable wq (best-effort) ==
INFO: Using writeback (current=cache_shard)
set affinity_scope=cache_shard -> read_back=cache_shard
set affinity_scope=cache -> read_back=cache
set affinity_scope=smt -> read_back=smt
set affinity_scope=cpu -> read_back=cpu
set affinity_scope=system -> read_back=system
set affinity_scope=numa -> read_back=numa
== Optional benchmark module (name varies by patch revision) ==
INFO: Found kernel module: test_workqueue
INFO: Running: modprobe test_workqueue
modprobe: ERROR: could not insert 'test_workqueue': Resource temporarily unavailable
PASS: modprobe test_workqueue: init returned -EAGAIN (userspace exit 1) — Return -EAGAIN so the module does not stay loaded after the benchmark; treating as success.
INFO: Recent kernel log lines:
[ 308.429626] test_workqueue: running 144 threads, 50000 items/thread
[ 310.762202] test_workqueue: cpu 8052070 items/sec p50=14400 p90=15168 p95=15392 ns
[ 312.914305] test_workqueue: smt 8067805 items/sec p50=14432 p90=15200 p95=15520 ns
[ 316.089652] test_workqueue: cache_shard 4377330 items/sec p50=8448 p90=13664 p95=15872 ns
[ 330.482312] test_workqueue: cache 639225 items/sec p50=41504 p90=185728 p95=197567 ns
[ 344.235586] test_workqueue: numa 646112 items/sec p50=43008 p90=180351 p95=190624 ns
[ 373.915019] test_workqueue: system 269122 items/sec p50=209247 p90=269824 p95=277023 ns
PASS: Finished. Review WARN lines; missing sysfs/module often means config differs from the patch series.
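The cache_shard vs cache comparison in the log above can be extracted mechanically. The sketch below is a hypothetical parser (`parse_results()` is not part of the test scripts) that assumes the line format printed by test_workqueue in this run, and feeds it two lines copied from that log.

```python
import re

# Matches result lines such as:
#   test_workqueue: cache 639225 items/sec p50=41504 p90=185728 p95=197567 ns
LINE_RE = re.compile(
    r"test_workqueue:\s+(?P<scope>\w+)\s+(?P<rate>\d+) items/sec\s+"
    r"p50=(?P<p50>\d+)\s+p90=(?P<p90>\d+)\s+p95=(?P<p95>\d+) ns"
)


def parse_results(log_text):
    """Return {scope: {"rate", "p50", "p90", "p95"}} from kernel log text."""
    results = {}
    for match in LINE_RE.finditer(log_text):
        results[match["scope"]] = {
            key: int(match[key]) for key in ("rate", "p50", "p90", "p95")
        }
    return results


# Two lines copied from the kernel log in this validation run.
LOG = """\
[ 316.089652] test_workqueue: cache_shard 4377330 items/sec p50=8448 p90=13664 p95=15872 ns
[ 330.482312] test_workqueue: cache 639225 items/sec p50=41504 p90=185728 p95=197567 ns
"""

results = parse_results(LOG)
throughput_ratio = results["cache_shard"]["rate"] / results["cache"]["rate"]
p50_ratio = results["cache"]["p50"] / results["cache_shard"]["p50"]

if __name__ == "__main__":
    print(f"throughput cache_shard/cache: {throughput_ratio:.2f}x")
    print(f"p50 latency cache/cache_shard: {p50_ratio:.2f}x")
```

For this 144-thread run the model reports a throughput ratio between 6.8x and 6.9x and a p50 latency ratio just under 5x in favor of cache_shard.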
./wq_stress -d 60
TAP version 13
wq_stress: duration=60s verbose=0 phase=all
ok 1 - phase1: shard layout math correct (32768 cases, LLC sizes 1..512, shard_sizes 1..64)
ok 2 - phase2: stress completed without deadlock (cache=1343910246 cache_shard=13941963556 iterations)
phase2: throughput ratio cache_shard/cache = 10.37x
ok 3 - phase2: cache_shard throughput >= 90% of cache (ratio=10.37x)
ok 4 - phase3: captured 3429 workqueue_execute_start events
ok 5 - phase3: all 3429 events on CPUs with valid shard assignments
ok 6 - phase4: 25 scope switches under load completed, scope restored to 'cache_shard'
1..6
Passed: 6 Failed: 0 Skipped: 0