[26.04_linux-nvidia-bos] workqueue: Introduce a sharded cache affinity scope #394

Closed
clsotog wants to merge 9 commits into NVIDIA:26.04_linux-nvidia-bos from clsotog:clsotog/workqueue-sharded-26.04-bos

Conversation

Collaborator

clsotog commented Apr 27, 2026

This feature from v7.1 has shown benefit on Grace: https://www.phoronix.com/news/Linux-7.1-WQ

LKML (series): https://lore.kernel.org/all/20260401-workqueue_sharded-v3-0-ab0b9336bf0b@debian.org/
LKML (follow-up fix): https://lore.kernel.org/all/20260413-workqueue_fix_nios-v2-1-2cf6a61b6bb3@debian.org/
LKML (related fix taken from series RFC): https://lore.kernel.org/all/2eef24999c6eeef8e8ea8daf54990e76@kernel.org/
LKML (additional follow-up fix): https://lore.kernel.org/all/20260402205913.1953402-1-arnd@kernel.org/

Upstream patches:
9dc42c9 workqueue: fix typo in WQ_AFFN_SMT comment
5920d04 workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
4cdc8a7 workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
738390a tools/workqueue: add CACHE_SHARD support to wq_dump.py
24b2e73 workqueue: add test_workqueue benchmark module
41e3ccc docs: workqueue: document WQ_AFFN_CACHE_SHARD affinity scope
76af546 workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id()
1abaae9 workqueue: fix parse_affn_scope() prefix matching bug
c6890f3 workqueue: avoid unguarded 64-bit division

LP: https://bugs.launchpad.net/ubuntu/+source/linux-nvidia-6.17/+bug/2150467

Validation:

./test_cache_shard.sh
TAP version 13
ok 1 - WQ_AFFN_CACHE_SHARD support present
ok 2 - default cache_shard_size is 8
ok 3 - default affinity scope is cache_shard (via module param)
./test_cache_shard.sh: line 92: /sys/module/workqueue/parameters/cache_shard_size: Permission denied
ok 4 - cache_shard_size is read-only (write correctly rejected)
ok 5 - writeback: all affinity scopes accepted and read back correctly
ok 6 - writeback: cannot restore scope (write permission denied) # SKIP
ok 7 - writeback: cache_shard scope accepted
ok 8 - LLC shard sizes all <= wq_cache_shard_size=8 (2 LLC(s) checked)
ok 9 - wq_affinity_test passed
1..9

Passed: 8 Failed: 0 Skipped: 1

./test-workqueue-cache-shard.sh
== Environment ==
kernel: 7.0.0
machine: aarch64

== Boot cmdline (workqueue-related) ==
INFO: No workqueue.* tokens in /proc/cmdline (defaults apply).

== Built-in module parameters: workqueue ==
cache_shard_size=8
cpu_intensive_thresh_us=20000
cpu_intensive_warning_thresh=4
debug_force_rr_cpu=N
default_affinity_scope=cache_shard
power_efficient=Y

== Virtual workqueue sysfs: affinity_scope ==
blkcg_punt_bio: affinity_scope=default (cache_shard)
ib-comp-unb-wq: affinity_scope=default (cache_shard)
nvme-auth-wq: affinity_scope=default (cache_shard)
nvme-delete-wq: affinity_scope=default (cache_shard)
nvme-reset-wq: affinity_scope=default (cache_shard)
nvme-wq: affinity_scope=default (cache_shard)
raid5wq: affinity_scope=default (cache_shard)
writeback: affinity_scope=cache_shard

== Try writing affinity_scope on first writable wq (best-effort) ==
INFO: Using writeback (current=cache_shard)
set affinity_scope=cache_shard -> read_back=cache_shard
set affinity_scope=cache -> read_back=cache
set affinity_scope=smt -> read_back=smt
set affinity_scope=cpu -> read_back=cpu
set affinity_scope=system -> read_back=system
set affinity_scope=numa -> read_back=numa

== Optional benchmark module (name varies by patch revision) ==
INFO: Found kernel module: test_workqueue
INFO: Running: modprobe test_workqueue
modprobe: ERROR: could not insert 'test_workqueue': Resource temporarily unavailable
PASS: modprobe test_workqueue: init returned -EAGAIN (userspace exit 1) — Return -EAGAIN so the module does not stay loaded after the benchmark; treating as success.
INFO: Recent kernel log lines:
[ 308.429626] test_workqueue: running 144 threads, 50000 items/thread
[ 310.762202] test_workqueue: cpu 8052070 items/sec p50=14400 p90=15168 p95=15392 ns
[ 312.914305] test_workqueue: smt 8067805 items/sec p50=14432 p90=15200 p95=15520 ns
[ 316.089652] test_workqueue: cache_shard 4377330 items/sec p50=8448 p90=13664 p95=15872 ns
[ 330.482312] test_workqueue: cache 639225 items/sec p50=41504 p90=185728 p95=197567 ns
[ 344.235586] test_workqueue: numa 646112 items/sec p50=43008 p90=180351 p95=190624 ns
[ 373.915019] test_workqueue: system 269122 items/sec p50=209247 p90=269824 p95=277023 ns
PASS: Finished. Review WARN lines; missing sysfs/module often means config differs from the patch series.

./wq_stress -d 60
TAP version 13

wq_stress: duration=60s verbose=0 phase=all
ok 1 - phase1: shard layout math correct (32768 cases, LLC sizes 1..512, shard_sizes 1..64)
ok 2 - phase2: stress completed without deadlock (cache=1343910246 cache_shard=13941963556 iterations)

phase2: throughput ratio cache_shard/cache = 10.37x
ok 3 - phase2: cache_shard throughput >= 90% of cache (ratio=10.37x)
ok 4 - phase3: captured 3429 workqueue_execute_start events
ok 5 - phase3: all 3429 events on CPUs with valid shard assignments
ok 6 - phase4: 25 scope switches under load completed, scope restored to 'cache_shard'
1..6

Passed: 6 Failed: 0 Skipped: 0

Contributor

github-actions Bot commented Apr 27, 2026

PR Validation Report

Patchscan ✅ No Missing Fixes

All cherry-picked commits checked — no missing upstream fixes found.

PR Lint ⚠️ Warnings

Details
Checking 9 commits...

Cherry-pick digest:
┌──────────────┬───────────────────────────────────────────────┬────────────┬─────────┬───────────────────────────┐
│ Local        │ Referenced upstream / Patch subject           │ Patch-ID   │ Subject │ SoB chain                 │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 911b93df5e04 │ 76af54648899                                  │ match      │ match   │ preserved + csoto added   │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ d2e0836fb559 │ c6890f36fc49                                  │ match      │ match   │ preserved + csoto added   │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 9a7c8b624f71 │ 41e3ccca00b3                                  │ match      │ match   │ preserved + csoto added   │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 5e368f829353 │ 24b2e73f9700                                  │ match      │ match   │ preserved + csoto added   │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 11f929c67404 │ 738390a5321c                                  │ match      │ match   │ preserved + csoto added   │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 5241fff6384a │ 4cdc8a7389d5                                  │ match      │ match   │ preserved + csoto added   │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ bde0170b8d61 │ 5920d046f7ae                                  │ match      │ match   │ preserved + csoto added   │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 441c719f4915 │ 9dc42c907028                                  │ match      │ match   │ preserved + csoto added   │
├──────────────┼───────────────────────────────────────────────┼────────────┼─────────┼───────────────────────────┤
│ 2ff78cba0bc8 │ 1abaae9b38a8                                  │ match      │ match   │ preserved + csoto added   │
└──────────────┴───────────────────────────────────────────────┴────────────┴─────────┴───────────────────────────┘

Lint results:
W: 911b93df5e04 ("workqueue: validate cpumask_first() result in llc_"): subject 73 chars (>72)

@clsotog clsotog force-pushed the clsotog/workqueue-sharded-26.04-bos branch from afdadad to 6acf03f Compare April 27, 2026 18:42

leitao and others added 9 commits April 27, 2026 12:18

parse_affn_scope() uses strncasecmp() with the length of the candidate
name, which means it only checks if the input *starts with* a known
scope name.

Given that an upcoming patch adds the "cache_shard" affinity scope,
writing "cache_shard" to a workqueue's affinity_scope sysfs attribute
would always match "cache" first, making it impossible to select
"cache_shard" via sysfs. This fix lets the parser distinguish "cache"
from "cache_shard".

Fix by replacing the hand-rolled prefix matching loop with
sysfs_match_string(), which uses sysfs_streq() for exact matching
(modulo trailing newlines). Also add the missing const qualifier to
the wq_affn_names[] array declaration.

Note that sysfs_streq() is case-sensitive, unlike the previous
strncasecmp() approach. This is intentional and consistent with
how other sysfs attributes handle string matching in the kernel.
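
To make the failure mode concrete, here is a small Python model of the two matching strategies. It is only a sketch: `AFFN_NAMES`, `parse_prefix` and `parse_exact` are hypothetical names, the position of "cache_shard" in the table is assumed, and -1 stands in for the kernel's error return.

```python
# Hypothetical model of the scope-name table; the ordering is an assumption.
AFFN_NAMES = ["default", "cpu", "smt", "cache", "cache_shard", "numa", "system"]


def parse_prefix(val):
    """Model of the old loop: strncasecmp(val, name, strlen(name)) == 0."""
    for i, name in enumerate(AFFN_NAMES):
        # Only the first len(name) characters are compared, ignoring case.
        if val[:len(name)].lower() == name.lower():
            return i
    return -1  # stand-in for an error return


def parse_exact(val):
    """Model of exact matching: case-sensitive, one trailing newline ignored."""
    if val.endswith("\n"):
        val = val[:-1]
    for i, name in enumerate(AFFN_NAMES):
        if val == name:
            return i
    return -1  # stand-in for an error return
```

In this model, `parse_prefix("cache_shard")` returns the index of "cache" because "cache" is tested first and matches as a prefix, while `parse_exact("cache_shard")` returns the index of "cache_shard". Uppercase input such as "CACHE" is accepted by the prefix model and rejected by the exact one, which mirrors the case-sensitivity change noted above.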

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 1abaae9)
Signed-off-by: Carol L Soto <csoto@nvidia.com>

Fix "poer" -> "per" in the WQ_AFFN_SMT enum comment.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 9dc42c9)
Signed-off-by: Carol L Soto <csoto@nvidia.com>

On systems where many CPUs share one LLC, unbound workqueues using
WQ_AFFN_CACHE collapse to a single worker pool, causing heavy spinlock
contention on pool->lock. For example, Chuck Lever measured 39% of
cycles lost to native_queued_spin_lock_slowpath on a 12-core shared-L3
NFS-over-RDMA system.

The existing affinity hierarchy (cpu, smt, cache, numa, system) offers
no intermediate option between per-LLC and per-SMT-core granularity.

Add WQ_AFFN_CACHE_SHARD, which subdivides each LLC into groups of at
most wq_cache_shard_size cores (default 8, tunable via boot parameter).
Shards are always split on core (SMT group) boundaries so that
Hyper-Threading siblings are never placed in different pods. Cores are
distributed across shards as evenly as possible -- for example, 36 cores
in a single LLC with max shard size 8 produces 5 shards of 8+7+7+7+7
cores.
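
The even split described above can be illustrated with a short Python sketch. This is not the kernel code: `shard_sizes` is a hypothetical helper, and putting the larger shards first is an assumption chosen to match the 8+7+7+7+7 example.

```python
def shard_sizes(nr_cores, max_shard_size=8):
    """Split nr_cores (>= 1) into the fewest shards of at most max_shard_size
    cores, keeping the shard sizes within one core of each other."""
    # Ceiling division: smallest shard count that respects the size limit.
    nr_shards = -(-nr_cores // max_shard_size)
    # Every shard gets `base` cores; the first `extra` shards get one more.
    base, extra = divmod(nr_cores, nr_shards)
    return [base + 1] * extra + [base] * (nr_shards - extra)
```

For 36 cores and a maximum shard size of 8 this gives five shards, one of 8 cores and four of 7. An LLC with 8 or fewer cores yields a single shard covering the whole LLC.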

The implementation follows the same comparator pattern as other affinity
scopes: precompute_cache_shard_ids() pre-fills the cpu_shard_id[] array
from the already-initialized WQ_AFFN_CACHE and WQ_AFFN_SMT topology,
and cpus_share_cache_shard() is passed to init_pod_type().

Benchmarks on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread) show
that cache_shard delivers ~5x the throughput and ~6.5x lower p50 latency
than the cache scope.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 5920d04)
Signed-off-by: Carol L Soto <csoto@nvidia.com>

Set WQ_AFFN_CACHE_SHARD as the default affinity scope for unbound
workqueues. On systems where many CPUs share one LLC, the previous
default (WQ_AFFN_CACHE) collapses all CPUs to a single worker pool,
causing heavy spinlock contention on pool->lock.

WQ_AFFN_CACHE_SHARD subdivides each LLC into smaller groups, providing
a better balance between locality and contention. Users can revert to
the previous behavior with workqueue.default_affinity_scope=cache.

On systems with 8 or fewer cores per LLC, CACHE_SHARD produces a single
shard covering the entire LLC, making it functionally identical to the
previous CACHE default. The sharding only activates when an LLC has more
than 8 cores.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 4cdc8a7)
Signed-off-by: Carol L Soto <csoto@nvidia.com>

The WQ_AFFN_CACHE_SHARD affinity scope was added to the kernel but
wq_dump.py was not updated to enumerate it. Add the missing constant
lookup and include it in the affinity scopes iteration so that drgn
output shows the CACHE_SHARD pod topology alongside the other scopes.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 738390a)
Signed-off-by: Carol L Soto <csoto@nvidia.com>

Add a kernel module that benchmarks queue_work() throughput on an
unbound workqueue to measure pool->lock contention under different
affinity scope configurations (cache vs cache_shard).

The module spawns N kthreads (default: num_online_cpus()), each bound
to a different CPU. All threads start simultaneously and queue work
items, measuring the latency of each queue_work() call. Results are
reported as p50/p90/p95 latencies for each affinity scope.
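
As a rough illustration of how such a summary line can be derived from per-call latency samples, the following Python sketch sorts the samples and indexes into them. How the module itself computes percentiles is not described here, so the index rule, `percentile` and `summarize` are all assumptions.

```python
def percentile(sorted_ns, pct):
    """Look up the pct-th percentile in an already sorted, non-empty list."""
    # Assumed index rule: n * pct / 100, clamped to the last element.
    idx = min(len(sorted_ns) - 1, len(sorted_ns) * pct // 100)
    return sorted_ns[idx]


def summarize(latencies_ns, elapsed_ns):
    """Return throughput and p50/p90/p95 for one benchmark run."""
    ordered = sorted(latencies_ns)
    return {
        # Items completed per second over the whole run (integer math).
        "items_per_sec": len(ordered) * 1_000_000_000 // elapsed_ns,
        "p50": percentile(ordered, 50),
        "p90": percentile(ordered, 90),
        "p95": percentile(ordered, 95),
    }
```

Calling `summarize()` once per affinity scope would give one row of throughput and latency figures per scope.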

The affinity scope is switched between runs via the workqueue's sysfs
affinity_scope attribute (WQ_SYSFS), avoiding the need for any new
exported symbols.

The module runs as __init-only, returning -EAGAIN to auto-unload,
and can be re-run via insmod.

Example of the output:

 running 50 threads, 50000 items/thread

   cpu              6806017 items/sec p50=2574    p90=5068    p95=5818 ns
   smt              6821040 items/sec p50=2624    p90=5168    p95=5949 ns
   cache_shard      1633653 items/sec p50=5337    p90=9694    p95=11207 ns
   cache            286069 items/sec p50=72509    p90=82304   p95=85009 ns
   numa             319403 items/sec p50=63745    p90=73480   p95=76505 ns
   system           308461 items/sec p50=66561    p90=75714   p95=78048 ns

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 24b2e73)
Signed-off-by: Carol L Soto <csoto@nvidia.com>

Update kernel-parameters.txt and workqueue.rst to reflect the new
cache_shard affinity scope and the default change from cache to
cache_shard.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 41e3ccc)
Signed-off-by: Carol L Soto <csoto@nvidia.com>

The printk() requires a division that is not allowed on 32-bit architectures:

x86_64-linux-ld: lib/test_workqueue.o: in function `test_workqueue_init':
test_workqueue.c:(.init.text+0x36f): undefined reference to `__udivdi3'

Use div_u64() to print the resulting elapsed microseconds.

Fixes: 24b2e73 ("workqueue: add test_workqueue benchmark module")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit c6890f3)
Signed-off-by: Carol L Soto <csoto@nvidia.com>

…id()

On uniprocessor (UP) configs such as nios2, NR_CPUS is 1, so
cpu_shard_id[] is a single-element array (int[1]). In
llc_populate_cpu_shard_id(), cpumask_first(sibling_cpus) returns an
unsigned int that the compiler cannot prove is always 0, triggering
a -Warray-bounds warning when the result is used to index
cpu_shard_id[]:

  kernel/workqueue.c:8321:55: warning: array subscript 1 is above
  array bounds of 'int[1]' [-Warray-bounds]
   8321 |  cpu_shard_id[c] = cpu_shard_id[cpumask_first(sibling_cpus)];
        |                    ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a false positive: sibling_cpus can never be empty here because
'c' itself is always set in it, so cpumask_first() will always return a
valid CPU. However, the compiler cannot prove this statically, and the
warning only manifests on UP configs where the array size is 1.

Add a bounds check with WARN_ON_ONCE to silence the warning, and store
the result in a local variable to make the code clearer and avoid calling
cpumask_first() twice.

Fixes: 5920d04 ("workqueue: add WQ_AFFN_CACHE_SHARD affinity scope")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604022343.GQtkF2vO-lkp@intel.com/
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 76af546)
Signed-off-by: Carol L Soto <csoto@nvidia.com>

@clsotog clsotog force-pushed the clsotog/workqueue-sharded-26.04-bos branch from 6acf03f to 911b93d Compare April 27, 2026 19:41

@jamieNguyenNVIDIA
Collaborator

Acked-by: Jamie Nguyen <jamien@nvidia.com>

Collaborator

nvmochs commented Apr 28, 2026

LGTM!

Acked-by: Matthew R. Ochs <mochs@nvidia.com>

@nvmochs nvmochs self-requested a review April 28, 2026 01:49
Collaborator

nvmochs commented Apr 28, 2026

Merged, closing PR.

945602e4d4c2 workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id()
93a1d393dc86 workqueue: avoid unguarded 64-bit division
d9a0944605cf docs: workqueue: document WQ_AFFN_CACHE_SHARD affinity scope
1e0bcbcb6c8f workqueue: add test_workqueue benchmark module
78dfce3e243e tools/workqueue: add CACHE_SHARD support to wq_dump.py
a8b46e7d7adc workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
9536794c2fd7 workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
c9fe0bd9ff5c workqueue: fix typo in WQ_AFFN_SMT comment
b98c9681040c workqueue: fix parse_affn_scope() prefix matching bug

@nvmochs nvmochs closed this Apr 28, 2026