Wasm: process-persistent shared Web Worker pool (fix heavy-test starvation)
The Web Worker pool was per-accelerator: created on first dispatch and
terminate()'d on Dispose. Worker.terminate() is an asynchronous browser
signal, so a fresh-accelerator-per-test pattern (PMT's ~531-test Wasm lane)
spun up a new hardwareConcurrency pool while the previous pool's threads were
still winding down -> transient worker oversubscription that compounded across
the lane -> the pure-spin barrier couldn't schedule all workers in its window
-> compute-heavy tests starved and timed out late while light tests stayed fast
(Tuvok's ML full-sweep report).
Make the pool process-static (s_sharedWorkerPool), created once per tab and
reused across every default-WorkerCount accelerator. Removes both the terminate
churn and the per-test re-create cost. Safe across accelerators: the worker-side
module-cache key is the process-static monotonic _nextKernelId (no cross-
accelerator collision), a memory-buffer change invalidates a reused worker's
cached instances, and each accelerator detaches its own per-worker handlers on
Dispose instead of relying on terminate.
Bounded at ~hardwareConcurrency-2:
- an explicit WorkerCount (oversubscription stress tests at 16/48/3xcores) keeps
a PRIVATE pool (old create/terminate lifecycle) so its non-default count can't
inflate the shared pool; routing is by resolved-count == machine-default.
- a worker still CHECKED OUT at an abnormal/fire-and-forget Dispose (may be
running an orphaned parked-barrier fiber) is terminated + removed from the pool
rather than stranded, which would drain the available queue and let the
shortfall-grow path balloon the pool.
New WorkerPool.Remove(worker); WasmAccelerator.SharedWorkerPoolSize /
DisposeSharedWorkerPool() diagnostics. Regression guard
WasmTests.Wasm_SharedWorkerPool_PersistsAndStaysBoundedAcrossAccelerators
(persisted + bounded + correct-on-reuse + explicit-count isolation).
Gate: PMT_FILTER=WasmTests 516/0/17. Version 4.12.1-local.3 (forks stay 2.0.16).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CHANGELOG.md: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ Wrapper-only (forks stay **2.0.16**). Adds a new selection-gate capability flag:

- **`AcceleratorRequirements.RequiresScatterStores`** (rules out WebGL) - declare it when a kernel writes a computed/arbitrary output index (`out[someIndex] = ...`) or more than one element of one buffer per thread that isn't the consecutive `v*storeCount+slot` layout. WebGL Transform-Feedback captures one output record per vertex at the thread's own slot (gather-only), so in-kernel scatter can't run there; the flag filters WebGL at `EnumerateCompatibleDevices` / `CreatePreferredAccelerator` / `Satisfies` time. (WebGL still scatters at the host/algorithm layer - e.g. RadixSort via render-to-texture.)
- A compile-time fail-loud guard for this class (mirroring the atomics/barriers/Scan throws) was prototyped and backed out - the blunt criterion false-positived on legitimate positional multi-store + grid-stride-loop kernels. The correct codegen-level criterion is a tracked open item (`Plans/webgl-multistore-fail-loud-guard-plan-2026-06-13.md`). For now use the selection flag.
+- **Wasm: process-persistent shared Web Worker pool.** The Web Worker pool is now process-static (`WasmAccelerator.s_sharedWorkerPool`) - created once per tab and reused across every default-WorkerCount accelerator - instead of being created and `terminate()`d per accelerator. `Worker.terminate()` is an asynchronous browser signal, so a fresh-accelerator-per-test pattern (PMT's ~531-test Wasm lane) spun up a new `hardwareConcurrency` pool while the previous pool's threads were still winding down -> transient worker oversubscription that compounded across the lane -> the pure-spin barrier couldn't schedule all workers in its window -> compute-heavy tests starved and timed out late while light tests stayed fast. The shared pool removes both the terminate churn and the per-test re-create cost. Safe across accelerators: the worker-side module-cache key is a process-static monotonic id (no cross-accelerator collision), a memory-buffer change invalidates a reused worker's cached instances, and each accelerator detaches its handlers on Dispose. Bounded at ~`hardwareConcurrency-2`: an explicit `WorkerCount` (oversubscription stress tests) keeps a private pool, and a worker still checked out at an abnormal Dispose is terminated+removed rather than stranded. Locked by `WasmTests.Wasm_SharedWorkerPool_PersistsAndStaysBoundedAcrossAccelerators`. Gate: `PMT_FILTER=WasmTests` 516/0/17.

## 4.12.0 (2026-06-13) - Sync/async contract: async-only where it waits/observes, sync for fire-and-forget
SpawnDev.ILGPU/SpawnDev.ILGPU.csproj: 2 additions & 2 deletions
@@ -4,9 +4,9 @@
<TargetFramework>net10.0</TargetFramework>
<ImplicitUsings>enable</ImplicitUsings>
<Nullable>enable</Nullable>
-<Version>4.12.1-local.2</Version>
+<Version>4.12.1-local.3</Version>
<!-- Brief current-version highlights only. Full per-version history with code samples lives in CHANGELOG.md (linked from the README). -->
-
<PackageReleaseNotes>4.12.1: (a) WebGPU GEMV grid-stride fix — a cooperative GEMV's inner K-tile loop (k=tid;k<K;k+=G) was conflated with a synthetic group grid-stride counter (~K/G x too-small accumulation); fixed with a two-pass uniform-break (decide the break from the EMITTED body's barriers — barrier-free loops keep their natural break, barrier loops keep the uniform break unchanged). Re-enables the fast cooperative M=1 GEMV on WebGPU. (b) fixes special-float (±inf/NaN) scalar kernel params on WebGL + Wasm - a kernel float scalar holding ±inf/NaN (e.g. ConstantOfShape(-inf)) silently failed at the .NET->JS dispatch boundary (WebGL: System.Text.Json in postMessage rejected ±inf/NaN; Wasm: culture-sensitive ToString emitted invalid JS tokens). WebGL now sends float scalars as int32 bit patterns (glWorker reconstructs); Wasm uses InvariantCulture. Also adds the AcceleratorRequirements.RequiresScatterStores capability flag (rules out WebGL for kernels doing in-kernel scatter / >1 non-positional output element per thread - WebGL Transform-Feedback is gather-only). Forks stay 2.0.16. --- 4.12.0 Sync/async contract (bundles forks 2.0.16): an operation that WAITS for completion or OBSERVES a result is async-only on the browser backends (WebGPU/WebGL/Wasm) - its synchronous form throws NotSupportedException; fire-and-forget operations (dispatch, alloc, host->device upload, flush-submit) stay synchronous everywhere. Synchronize() now THROWS on browser (was a silent non-waiting flush) - use await SynchronizeAsync() to wait. New Flush() submits batched work without waiting (valid synchronously on browser - submit is sync on every backend, so there is no async Flush twin). CopyFromCPU/Allocate1D(data) work on every browser backend again (routed through the new EnsureHostCopyConsumed hook). Sync CreateScan/CreateRadixSort builders run on the browser backends. Gate: full cross-backend PMT 3384/0/218. Full contract: Docs/async.md. 
Full per-version history: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
+
<PackageReleaseNotes>4.12.1: (a) WebGPU GEMV grid-stride fix — a cooperative GEMV's inner K-tile loop (k=tid;k<K;k+=G) was conflated with a synthetic group grid-stride counter (~K/G x too-small accumulation); fixed with a two-pass uniform-break (decide the break from the EMITTED body's barriers — barrier-free loops keep their natural break, barrier loops keep the uniform break unchanged). Re-enables the fast cooperative M=1 GEMV on WebGPU. (b) fixes special-float (±inf/NaN) scalar kernel params on WebGL + Wasm - a kernel float scalar holding ±inf/NaN (e.g. ConstantOfShape(-inf)) silently failed at the .NET->JS dispatch boundary (WebGL: System.Text.Json in postMessage rejected ±inf/NaN; Wasm: culture-sensitive ToString emitted invalid JS tokens). WebGL now sends float scalars as int32 bit patterns (glWorker reconstructs); Wasm uses InvariantCulture. Also adds the AcceleratorRequirements.RequiresScatterStores capability flag (rules out WebGL for kernels doing in-kernel scatter / >1 non-positional output element per thread - WebGL Transform-Feedback is gather-only). (c) Wasm: the Web Worker pool is now process-persistent (shared across accelerators) instead of created + terminated per accelerator - removes the per-accelerator worker create/async-terminate churn that transiently oversubscribed cores and starved compute-heavy tests late in a long sequential lane. Default-WorkerCount accelerators share the pool (bounded at ~hardwareConcurrency-2); an explicit WorkerCount (oversubscription stress tests) keeps a private pool. Forks stay 2.0.16. --- 4.12.0 Sync/async contract (bundles forks 2.0.16): an operation that WAITS for completion or OBSERVES a result is async-only on the browser backends (WebGPU/WebGL/Wasm) - its synchronous form throws NotSupportedException; fire-and-forget operations (dispatch, alloc, host->device upload, flush-submit) stay synchronous everywhere. 
Synchronize() now THROWS on browser (was a silent non-waiting flush) - use await SynchronizeAsync() to wait. New Flush() submits batched work without waiting (valid synchronously on browser - submit is sync on every backend, so there is no async Flush twin). CopyFromCPU/Allocate1D(data) work on every browser backend again (routed through the new EnsureHostCopyConsumed hook). Sync CreateScan/CreateRadixSort builders run on the browser backends. Gate: full cross-backend PMT 3384/0/218. Full contract: Docs/async.md. Full per-version history: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
- `WasmILGPUDevice.cs`: device config (`MaxNumThreadsPerGroup = 256`, `MaxGroupSize = (256,1,1)`). NOTE: an earlier version of this line said 64, which was stale; the device has set 256 (verified `WasmILGPUDevice.cs:68-69` + offline compile dump 2026-06-09). RadixSortKernel1's `scanMemory` is `int[groupSize*UnrollFactor]` = `int[1024]` at groupSize 256, UnrollFactor 4.

+## Persistent Shared Worker Pool (2026-06-13) — `s_sharedWorkerPool`
+
+The Web Worker pool is **process-static** (`WasmAccelerator.s_sharedWorkerPool`), created ONCE
+per tab and reused across every **default-WorkerCount** `WasmAccelerator` (the PMT/production
+common case). It is **not** recreated/terminated per accelerator. An accelerator created with an
+explicit `WasmBackendOptions.WorkerCount` (the oversubscription stress tests at 16/48/3×cores)
+uses a **private** pool (`_ownPool`, old create-on-first-dispatch / terminate-on-Dispose
+lifecycle) so its large worker count can't permanently inflate the shared pool for the rest of
+the lane. Two leak guards keep the shared pool bounded at ≈`hardwareConcurrency-2`: (a) only
+default accelerators touch it, and (b) on Dispose any worker still **checked out** (dispatch in
+flight at an abnormal/fire-and-forget dispose) is **terminated + removed** (not stranded: a
+stranded worker drains the available queue and the shortfall-grow path balloons the pool; this
+was caught 32-vs-10 by the regression test below before the guards were added).
+
+**Why:** PMT creates a fresh accelerator per test (~569 in the Wasm lane). The old
+per-accelerator pool called `Worker.terminate()` on its whole pool at Dispose, but
+`terminate()` is an **asynchronous** browser signal: the OS thread + its
+SharedArrayBuffer/Wasm-memory references wind down *after* Dispose returns. So the next test
+immediately spun up a fresh
+`hardwareConcurrency` pool while the previous pool's threads were still dying → transient worker
+**oversubscription** that compounds across a long sequential lane → the pure-spin barrier can't get
+all workers scheduled in its spin window → compute-**heavy** tests starve and time out late in
+Phase B (light tests still squeak in: a throughput tax, not a flat tax). Tuvok's full-sweep
+report: `TurboQuant_*` pass 128/128 in ~4s scoped, time out at 30s late in the 569-test lane.
+
+**Why cross-accelerator reuse is SAFE:**
+- the worker-side module-cache key (`KernelCacheEntry.KernelId`) is the **process-static** monotonic
+`_nextKernelId`, so two different accelerators can never hand a worker the same id → never a
+stale/stomped module (same reason the GC-hash kernelId bug was fixed; see "kernelId MUST be a
+monotonic unique id");
+- a memory-buffer change invalidates the worker's cached instances (`_instancesById = {}` in
+`WorkerPool.cs` when `d.memory.buffer` differs), so a worker reused by a new accelerator
+re-instantiates against the new accelerator's `WebAssembly.Memory`;
+- each accelerator installs its OWN per-worker message handlers on adoption
+(`EnsurePersistentHandlers`) and **DETACHES** them on Dispose (`DisposeAccelerator_SyncRoot`:
+`worker.OnMessage -= state.MsgHandler`). Detach (not terminate) is what frees the accelerator:
+without it, each disposed accelerator's handler closures stay attached to the persistent workers
+(a managed leak that keeps the disposed accelerator alive) and pile up as dead no-op listeners.
+So a reused worker carries exactly one accelerator's handlers at a time.
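The first two safety conditions in the list above (a process-wide monotonic kernel id, and cached instances dropped when the memory buffer changes) can be sketched in TypeScript as follows. This is a minimal model under stated assumptions, not the worker script embedded in `WorkerPool.cs`: `WorkerInstanceCache`, `allocateKernelId`, `DispatchMessage`, and the string standing in for an instantiated module are hypothetical names introduced only for illustration.

```typescript
// Hypothetical model of a worker-side instance cache; not the real worker script.
interface DispatchMessage {
  kernelId: number;     // unique id handed out by a process-wide counter
  memoryBuffer: object; // stands in for d.memory.buffer
}

// Process-wide monotonic counter (models the role of _nextKernelId):
// every caller draws from the same counter, so ids never repeat.
let nextKernelId = 0;
function allocateKernelId(): number {
  nextKernelId += 1;
  return nextKernelId;
}

class WorkerInstanceCache {
  private instancesById: Record<number, string> = {};
  private lastBuffer: object | null = null;
  instantiations = 0; // counts how often a kernel had to be (re)instantiated

  getInstance(msg: DispatchMessage): string {
    // A different memory buffer means a different accelerator's memory:
    // drop every cached instance so nothing stays bound to the old buffer.
    if (this.lastBuffer !== msg.memoryBuffer) {
      this.instancesById = {};
      this.lastBuffer = msg.memoryBuffer;
    }
    let instance: string | undefined = this.instancesById[msg.kernelId];
    if (instance === undefined) {
      this.instantiations += 1;
      instance = "instance-of-kernel-" + msg.kernelId;
      this.instancesById[msg.kernelId] = instance;
    }
    return instance;
  }
}
```

In this model, dispatching the same kernel id twice against one buffer instantiates once, while dispatching it against a second buffer forces a fresh instantiation, which is the behavior the list attributes to a worker reused by a new accelerator.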
+
+The pool is **never** disposed on accelerator Dispose (it outlives accelerators by design); tear it
+down only via `WasmAccelerator.DisposeSharedWorkerPool()` (test/shutdown) or tab teardown.
+`WasmAccelerator.SharedWorkerPoolSize` is a diagnostic: a correct pool stays bounded
+(≈ `hardwareConcurrency - 2`) no matter how many accelerators come and go; it does NOT grow per
+accelerator. Locked by `WasmTests.Wasm_SharedWorkerPool_PersistsAndStaysBoundedAcrossAccelerators`
+(bounded-size + correct-on-reuse).
+
+**Caveat (not the PMT case):** if two accelerators are *concurrently*
+alive and both adopt the same worker, both install handlers; detach-on-dispose only cleans the
+dominant sequential case (PMT disposes the previous accelerator before creating the next).
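To illustrate why a worker still checked out at Dispose is terminated and removed rather than stranded, here is a small TypeScript simulation of a pool that grows only on a shortfall. It is a sketch under assumptions, not the `WorkerPool` implementation: `SharedPoolSketch`, `FakeWorker`, `runLane`, and the "every other accelerator is disposed mid-dispatch" pattern are all hypothetical, and real workers are replaced by plain objects.

```typescript
// Hypothetical stand-in for a Web Worker.
interface FakeWorker {
  id: number;
  terminated: boolean;
}

class SharedPoolSketch {
  private all: FakeWorker[] = [];
  private available: FakeWorker[] = [];
  private nextId = 0;

  // Models a pool-size diagnostic: total workers the pool still tracks.
  get size(): number {
    return this.all.length;
  }

  // Check out `count` workers; create new ones only when too few are available.
  checkOut(count: number): FakeWorker[] {
    while (this.available.length < count) {
      const worker: FakeWorker = { id: this.nextId, terminated: false };
      this.nextId += 1;
      this.all.push(worker);
      this.available.push(worker);
    }
    return this.available.splice(0, count);
  }

  // Normal completion: workers go back to the available queue.
  giveBack(workers: FakeWorker[]): void {
    this.available = this.available.concat(workers);
  }

  // Abnormal dispose: terminate the worker and forget it entirely.
  remove(worker: FakeWorker): void {
    worker.terminated = true;
    this.all = this.all.filter((w) => w !== worker);
    this.available = this.available.filter((w) => w !== worker);
  }
}

// Run a sequence of accelerators; every other one is disposed mid-dispatch.
function runLane(accelerators: number, workersPerDispatch: number, removeOnAbnormalDispose: boolean): number {
  const pool = new SharedPoolSketch();
  for (let i = 0; i < accelerators; i++) {
    const checkedOut = pool.checkOut(workersPerDispatch);
    const abnormalDispose = i % 2 === 0;
    if (!abnormalDispose) {
      pool.giveBack(checkedOut);
    } else if (removeOnAbnormalDispose) {
      checkedOut.forEach((w) => pool.remove(w));
    }
    // Otherwise the workers are stranded: still counted, never available again.
  }
  return pool.size;
}
```

With six accelerators dispatching on four workers each, `runLane(6, 4, true)` ends with a pool of 4, while `runLane(6, 4, false)` ends with 16 because every stranded batch forces the shortfall path to create four more workers.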
+
## Offline compile dump (desktop, no browser) — `wasm-dump`
`SpawnDev.ILGPU.DemoConsole -- wasm-dump` compiles RadixSort kernels on the DESKTOP and prints the emitted shared-memory alloca table + flags any `GenerateCode(Alloca)` type+size fallback aliasing or offset overlap. Works because `WasmAccelerator.Create` wraps the `BlazorJSRuntime.JS` lookup in try/catch (defaults to 4 cores) and `CreateRadixSort*` compiles its kernels eagerly via `LoadKernel` BEFORE any dispatch — so the IL→wasm compile path runs fully offline (no workers, no Chromium, no dispatch). Reusable for any shared-memory layout audit. Source: `SpawnDev.ILGPU.DemoConsole/WasmCompileDump.cs`.