Wasm: process-persistent shared Web Worker pool (fix heavy-test starvation)

LostBeard · claude · LostBeard · commit 4f9383052df8 · 2026-06-13T23:53:24.000-04:00
The Web Worker pool was per-accelerator: created on first dispatch and
terminate()'d on Dispose. Worker.terminate() is an asynchronous browser
signal, so a fresh-accelerator-per-test pattern (PMT's ~531-test Wasm lane)
spun up a new hardwareConcurrency pool while the previous pool's threads were
still winding down -> transient worker oversubscription that compounded across
the lane -> the pure-spin barrier couldn't schedule all workers in its window
-> compute-heavy tests starved and timed out late while light tests stayed fast
(Tuvok's ML full-sweep report).

Make the pool process-static (s_sharedWorkerPool), created once per tab and
reused across every default-WorkerCount accelerator. Removes both the terminate
churn and the per-test re-create cost. Safe across accelerators: the worker-side
module-cache key is the process-static monotonic _nextKernelId (no cross-
accelerator collision), a memory-buffer change invalidates a reused worker's
cached instances, and each accelerator detaches its own per-worker handlers on
Dispose instead of relying on terminate.
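
The three safety properties above can be sketched roughly as follows. This is
an illustrative model only, not the WasmAccelerator/WorkerPool code: every type
and member name (SketchKernelIds, SketchPersistentWorker, SketchAdopter,
GetInstance, HandlerCount) is hypothetical, and the worker is a plain object
rather than a real Web Worker.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical process-wide id source (stands in for the monotonic kernel id).
public static class SketchKernelIds
{
    private static int s_next;

    // Interlocked keeps ids unique no matter which accelerator asks.
    public static int Next() => Interlocked.Increment(ref s_next);
}

// Hypothetical persistent worker: it outlives the accelerators that adopt it.
public sealed class SketchPersistentWorker
{
    private object? _memory;
    private readonly Dictionary<int, object> _instancesById = new();

    public event Action<string>? OnMessage;

    // Number of handlers currently attached to this worker.
    public int HandlerCount => OnMessage?.GetInvocationList().Length ?? 0;

    public void Post(string message) => OnMessage?.Invoke(message);

    // Returns the cached instance for kernelId, dropping the whole cache
    // first when the backing memory object differs from the last call.
    public object GetInstance(int kernelId, object memory)
    {
        if (!ReferenceEquals(_memory, memory))
        {
            _instancesById.Clear();
            _memory = memory;
        }
        if (!_instancesById.TryGetValue(kernelId, out var instance))
        {
            instance = new object();
            _instancesById[kernelId] = instance;
        }
        return instance;
    }
}

// Hypothetical accelerator side: attach on adoption, detach on Dispose.
public sealed class SketchAdopter : IDisposable
{
    private readonly SketchPersistentWorker _worker;
    private readonly Action<string> _handler;

    public int Received { get; private set; }

    public SketchAdopter(SketchPersistentWorker worker)
    {
        _worker = worker;
        _handler = _ => Received++;
        _worker.OnMessage += _handler;
    }

    // Detach (not terminate): the worker stays alive for the next adopter.
    public void Dispose() => _worker.OnMessage -= _handler;
}
```

In this model a worker adopted by one SketchAdopter after another carries
exactly one handler at a time, and an instance cached under an old memory
object is never handed back once the memory changes.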

Bounded at ~hardwareConcurrency-2:
- an explicit WorkerCount (oversubscription stress tests at 16/48/3xcores) keeps
  a PRIVATE pool (old create/terminate lifecycle) so its non-default count can't
  inflate the shared pool; routing is by resolved-count == machine-default.
- a worker still CHECKED OUT at an abnormal/fire-and-forget Dispose (may be
  running an orphaned parked-barrier fiber) is terminated + removed from the pool
  rather than stranded, which would drain the available queue and let the
  shortfall-grow path balloon the pool.
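
A minimal sketch of the two bounding rules above, under stated assumptions:
SketchWorker, SketchPool and SketchAccelerator are hypothetical names, workers
are plain objects, and Terminate() only sets a flag where the real
Worker.terminate() is asynchronous. It is not the shipped implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a Web Worker.
public sealed class SketchWorker
{
    public bool Terminated { get; private set; }
    public void Terminate() => Terminated = true;
}

// Hypothetical pool: every live worker, plus a queue of the idle ones.
public sealed class SketchPool
{
    private readonly List<SketchWorker> _all = new();
    private readonly Queue<SketchWorker> _available = new();

    public int Size => _all.Count;

    // Reuse an idle worker; create a new one only on a shortfall.
    public SketchWorker Acquire()
    {
        if (_available.Count > 0) return _available.Dequeue();
        var worker = new SketchWorker();
        _all.Add(worker);
        return worker;
    }

    // Normal path: the worker returns to the idle queue.
    public void Release(SketchWorker worker) => _available.Enqueue(worker);

    // Abnormal path: terminate and forget, so the pool size stays honest.
    public void Remove(SketchWorker worker)
    {
        worker.Terminate();
        _all.Remove(worker);
    }

    // Private-pool lifecycle: everything goes away with the owner.
    public void TerminateAll()
    {
        foreach (var worker in _all) worker.Terminate();
        _all.Clear();
        _available.Clear();
    }
}

public sealed class SketchAccelerator : IDisposable
{
    // One pool per process, touched only by default-count accelerators.
    public static readonly SketchPool Shared = new();

    private readonly SketchPool _pool;
    private readonly List<SketchWorker> _checkedOut = new();

    public bool UsesSharedPool { get; }

    public SketchAccelerator(int resolvedCount, int machineDefault)
    {
        // Routing rule: resolved count == machine default -> shared pool.
        UsesSharedPool = resolvedCount == machineDefault;
        _pool = UsesSharedPool ? Shared : new SketchPool();
    }

    public SketchWorker BeginDispatch()
    {
        var worker = _pool.Acquire();
        _checkedOut.Add(worker);
        return worker;
    }

    public void EndDispatch(SketchWorker worker)
    {
        _checkedOut.Remove(worker);
        _pool.Release(worker);
    }

    public void Dispose()
    {
        // A worker still checked out is removed rather than left stranded.
        foreach (var worker in _checkedOut) _pool.Remove(worker);
        _checkedOut.Clear();
        if (!UsesSharedPool) _pool.TerminateAll();
    }
}
```

Without the Remove step in Dispose, a stranded worker would still count toward
Size but never re-enter the idle queue, so every later Acquire would take the
grow path.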

New WorkerPool.Remove(worker); WasmAccelerator.SharedWorkerPoolSize /
DisposeSharedWorkerPool() diagnostics. Regression guard
WasmTests.Wasm_SharedWorkerPool_PersistsAndStaysBoundedAcrossAccelerators
(persisted + bounded + correct-on-reuse + explicit-count isolation).

Gate: PMT_FILTER=WasmTests 516/0/17. Version 4.12.1-local.3 (forks stay 2.0.16).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,6 +8,7 @@ Wrapper-only (forks stay **2.0.16**). Adds a new selection-gate capability flag:
 
 - **`AcceleratorRequirements.RequiresScatterStores`** (rules out WebGL) - declare it when a kernel writes a computed/arbitrary output index (`out[someIndex] = ...`) or more than one element of one buffer per thread that isn't the consecutive `v*storeCount+slot` layout. WebGL Transform-Feedback captures one output record per vertex at the thread's own slot (gather-only), so in-kernel scatter can't run there; the flag filters WebGL at `EnumerateCompatibleDevices` / `CreatePreferredAccelerator` / `Satisfies` time. (WebGL still scatters at the host/algorithm layer - e.g. RadixSort via render-to-texture.)
 - A compile-time fail-loud guard for this class (mirroring the atomics/barriers/Scan throws) was prototyped and backed out - the blunt criterion false-positived on legitimate positional multi-store + grid-stride-loop kernels. The correct codegen-level criterion is a tracked open item (`Plans/webgl-multistore-fail-loud-guard-plan-2026-06-13.md`). For now use the selection flag.
+- **Wasm: process-persistent shared Web Worker pool.** The Web Worker pool is now process-static (`WasmAccelerator.s_sharedWorkerPool`) - created once per tab and reused across every default-WorkerCount accelerator - instead of being created and `terminate()`d per accelerator. `Worker.terminate()` is an asynchronous browser signal, so a fresh-accelerator-per-test pattern (PMT's ~531-test Wasm lane) spun up a new `hardwareConcurrency` pool while the previous pool's threads were still winding down -> transient worker oversubscription that compounded across the lane -> the pure-spin barrier couldn't schedule all workers in its window -> compute-heavy tests starved and timed out late while light tests stayed fast. The shared pool removes both the terminate churn and the per-test re-create cost. Safe across accelerators: the worker-side module-cache key is a process-static monotonic id (no cross-accelerator collision), a memory-buffer change invalidates a reused worker's cached instances, and each accelerator detaches its handlers on Dispose. Bounded at ~`hardwareConcurrency-2`: an explicit `WorkerCount` (oversubscription stress tests) keeps a private pool, and a worker still checked out at an abnormal Dispose is terminated+removed rather than stranded. Locked by `WasmTests.Wasm_SharedWorkerPool_PersistsAndStaysBoundedAcrossAccelerators`. Gate: `PMT_FILTER=WasmTests` 516/0/17.
 
 ## 4.12.0 (2026-06-13) - Sync/async contract: async-only where it waits/observes, sync for fire-and-forget
 
diff --git a/SpawnDev.ILGPU.Demo/UnitTests/WasmTests.cs b/SpawnDev.ILGPU.Demo/UnitTests/WasmTests.cs
@@ -1522,5 +1522,120 @@ public async Task WasmGroupBarrierOversubscriptionTest()
                 context.Dispose();
             }
         }
+
+        // Persistent shared worker pool (2026-06-13, Geordi). The Wasm Web Worker pool is
+        // process-static and reused across EVERY accelerator instead of being recreated +
+        // terminated per accelerator. PMT creates a fresh accelerator per test (~569 in the
+        // Wasm lane); the old per-accelerator pool terminated its whole worker pool on Dispose,
+        // but Worker.terminate() is an ASYNC browser signal — so the next test spun up a fresh
+        // hardwareConcurrency pool while the previous pool's threads were still dying → transient
+        // worker oversubscription that starved compute-heavy tests late in the lane (Tuvok's
+        // full-sweep report: heavy tests pass scoped in ~4s, time out at 30s in-lane).
+        //
+        // This test locks BOTH halves of the fix:
+        //  (1) BOUNDED: creating + dispatching + disposing K accelerators leaves the shared pool
+        //      at ~one accelerator's worth of workers, NOT K× (the old design would have churned
+        //      K separate pools through create/terminate).
+        //  (2) CORRECT-ON-REUSE: a dispatch on accelerators 2..K (which adopt the SAME persistent
+        //      workers freed by the disposed earlier accelerators) still matches the CPU oracle —
+        //      proving the worker-side module cache (keyed by the process-static monotonic
+        //      kernelId) and the memory-buffer-change instance invalidation handle cross-
+        //      accelerator reuse with no stale module / stale memory.
+        [TestMethod(Timeout = 120000)]
+        public async Task Wasm_SharedWorkerPool_PersistsAndStaysBoundedAcrossAccelerators()
+        {
+            const int count = 4096;
+            const int accelerators = 5;
+
+            // A single accelerator's worker request — the pool must never exceed this regardless
+            // of how many accelerators come and go.
+            int oneAccWorkerCount;
+            {
+                using var probeCtx = Context.Create().Wasm().ToContext();
+                using var probeAcc = await probeCtx.CreateWasmAcceleratorAsync();
+                oneAccWorkerCount = ((WasmAccelerator)probeAcc).WorkerCount;
+            }
+            if (oneAccWorkerCount < 1)
+                throw new Exception($"Unexpected WorkerCount {oneAccWorkerCount} (expected >= 1).");
+
+            var oracle = new int[count];
+            for (int i = 0; i < count; i++) oracle[i] = i * 3 + 7;
+
+            int maxSizeSeen = 0;
+            for (int a = 0; a < accelerators; a++)
+            {
+                var context = Context.Create().Wasm().ToContext();
+                var accelerator = await context.CreateWasmAcceleratorAsync();
+                try
+                {
+                    using var buf = accelerator.Allocate1D<int>(count);
+                    var fill = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>>(
+                        (i, v) => v[i] = i * 3 + 7);
+                    fill((Index1D)count, buf.View);
+                    await accelerator.SynchronizeAsync();
+                    var result = await buf.CopyToHostAsync<int>();
+
+                    // Correctness on a (re)used worker pool — iterations 1..K-1 adopt workers freed
+                    // by the previous disposed accelerator.
+                    for (int i = 0; i < count; i++)
+                        if (result[i] != oracle[i])
+                            throw new Exception(
+                                $"Accelerator #{a}: result[{i}]={result[i]} expected {oracle[i]} — " +
+                                $"reused worker produced wrong output (stale module or stale memory?).");
+
+                    int sizeNow = WasmAccelerator.SharedWorkerPoolSize;
+                    if (sizeNow > maxSizeSeen) maxSizeSeen = sizeNow;
+                }
+                finally
+                {
+                    accelerator.Dispose();
+                    context.Dispose();
+                }
+            }
+
+            // BOUNDED invariant: the shared pool settled at one accelerator's worth, not K×.
+            // (The pool is process-global so other lane tests may have already grown it to
+            // oneAccWorkerCount before this test ran — that's the steady state we expect.)
+            if (maxSizeSeen > oneAccWorkerCount)
+                throw new Exception(
+                    $"Shared worker pool grew to {maxSizeSeen} across {accelerators} accelerators, " +
+                    $"exceeding a single accelerator's {oneAccWorkerCount} workers — it is accumulating " +
+                    $"per-accelerator instead of persisting (the per-accelerator-pool regression).");
+            if (maxSizeSeen < 1)
+                throw new Exception(
+                    "Shared worker pool size never registered >= 1 after dispatching on " +
+                    $"{accelerators} accelerators — the persistent pool was not used.");
+
+            // ISOLATION invariant (order-independent): an accelerator with an EXPLICIT non-default
+            // WorkerCount must use a PRIVATE pool and leave the shared pool untouched — otherwise a
+            // single oversubscription stress test (16/48/3×cores) would permanently inflate the
+            // shared pool for the rest of the lane (the original 32-vs-10 ballooning).
+            int sizeBeforeExplicit = WasmAccelerator.SharedWorkerPoolSize;
+            {
+                var exCtx = Context.Create().Wasm().ToContext();
+                // +6 guarantees a count distinct from the default so it routes to a private pool.
+                var exAcc = await exCtx.CreateWasmAcceleratorAsync(
+                    new WasmBackendOptions { WorkerCount = oneAccWorkerCount + 6 });
+                try
+                {
+                    using var b = exAcc.Allocate1D<int>(count);
+                    var k = exAcc.LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>>(
+                        (i, v) => v[i] = i * 3 + 7);
+                    k((Index1D)count, b.View);
+                    await exAcc.SynchronizeAsync();
+                    var r = await b.CopyToHostAsync<int>();
+                    for (int i = 0; i < count; i++)
+                        if (r[i] != oracle[i])
+                            throw new Exception(
+                                $"Explicit-WorkerCount accelerator: result[{i}]={r[i]} expected {oracle[i]}.");
+                }
+                finally { exAcc.Dispose(); exCtx.Dispose(); }
+            }
+            int sizeAfterExplicit = WasmAccelerator.SharedWorkerPoolSize;
+            if (sizeAfterExplicit > sizeBeforeExplicit)
+                throw new Exception(
+                    $"An explicit-WorkerCount accelerator grew the SHARED pool ({sizeBeforeExplicit} -> " +
+                    $"{sizeAfterExplicit}) — it must use a private pool and leave the shared pool untouched.");
+        }
     }
 }
diff --git a/SpawnDev.ILGPU/SpawnDev.ILGPU.csproj b/SpawnDev.ILGPU/SpawnDev.ILGPU.csproj
@@ -4,9 +4,9 @@
 		<TargetFramework>net10.0</TargetFramework>
 		<ImplicitUsings>enable</ImplicitUsings>
 		<Nullable>enable</Nullable>
-		<Version>4.12.1-local.2</Version>
+		<Version>4.12.1-local.3</Version>
 		<!-- Brief current-version highlights only. Full per-version history with code samples lives in CHANGELOG.md (linked from the README). -->
-		<PackageReleaseNotes>4.12.1: (a) WebGPU GEMV grid-stride fix — a cooperative GEMV's inner K-tile loop (k=tid;k&lt;K;k+=G) was conflated with a synthetic group grid-stride counter (~K/G x too-small accumulation); fixed with a two-pass uniform-break (decide the break from the EMITTED body's barriers — barrier-free loops keep their natural break, barrier loops keep the uniform break unchanged). Re-enables the fast cooperative M=1 GEMV on WebGPU. (b) fixes special-float (±inf/NaN) scalar kernel params on WebGL + Wasm - a kernel float scalar holding ±inf/NaN (e.g. ConstantOfShape(-inf)) silently failed at the .NET-&gt;JS dispatch boundary (WebGL: System.Text.Json in postMessage rejected ±inf/NaN; Wasm: culture-sensitive ToString emitted invalid JS tokens). WebGL now sends float scalars as int32 bit patterns (glWorker reconstructs); Wasm uses InvariantCulture. Also adds the AcceleratorRequirements.RequiresScatterStores capability flag (rules out WebGL for kernels doing in-kernel scatter / &gt;1 non-positional output element per thread - WebGL Transform-Feedback is gather-only). Forks stay 2.0.16. --- 4.12.0 Sync/async contract (bundles forks 2.0.16): an operation that WAITS for completion or OBSERVES a result is async-only on the browser backends (WebGPU/WebGL/Wasm) - its synchronous form throws NotSupportedException; fire-and-forget operations (dispatch, alloc, host-&gt;device upload, flush-submit) stay synchronous everywhere. Synchronize() now THROWS on browser (was a silent non-waiting flush) - use await SynchronizeAsync() to wait. New Flush() submits batched work without waiting (valid synchronously on browser - submit is sync on every backend, so there is no async Flush twin). CopyFromCPU/Allocate1D(data) work on every browser backend again (routed through the new EnsureHostCopyConsumed hook). Sync CreateScan/CreateRadixSort builders run on the browser backends. Gate: full cross-backend PMT 3384/0/218. Full contract: Docs/async.md. 
Full per-version history: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
+		<PackageReleaseNotes>4.12.1: (a) WebGPU GEMV grid-stride fix — a cooperative GEMV's inner K-tile loop (k=tid;k&lt;K;k+=G) was conflated with a synthetic group grid-stride counter (~K/G x too-small accumulation); fixed with a two-pass uniform-break (decide the break from the EMITTED body's barriers — barrier-free loops keep their natural break, barrier loops keep the uniform break unchanged). Re-enables the fast cooperative M=1 GEMV on WebGPU. (b) fixes special-float (±inf/NaN) scalar kernel params on WebGL + Wasm - a kernel float scalar holding ±inf/NaN (e.g. ConstantOfShape(-inf)) silently failed at the .NET-&gt;JS dispatch boundary (WebGL: System.Text.Json in postMessage rejected ±inf/NaN; Wasm: culture-sensitive ToString emitted invalid JS tokens). WebGL now sends float scalars as int32 bit patterns (glWorker reconstructs); Wasm uses InvariantCulture. Also adds the AcceleratorRequirements.RequiresScatterStores capability flag (rules out WebGL for kernels doing in-kernel scatter / &gt;1 non-positional output element per thread - WebGL Transform-Feedback is gather-only). (c) Wasm: the Web Worker pool is now process-persistent (shared across accelerators) instead of created + terminated per accelerator - removes the per-accelerator worker create/async-terminate churn that transiently oversubscribed cores and starved compute-heavy tests late in a long sequential lane. Default-WorkerCount accelerators share the pool (bounded at ~hardwareConcurrency-2); an explicit WorkerCount (oversubscription stress tests) keeps a private pool. Forks stay 2.0.16. --- 4.12.0 Sync/async contract (bundles forks 2.0.16): an operation that WAITS for completion or OBSERVES a result is async-only on the browser backends (WebGPU/WebGL/Wasm) - its synchronous form throws NotSupportedException; fire-and-forget operations (dispatch, alloc, host-&gt;device upload, flush-submit) stay synchronous everywhere. 
Synchronize() now THROWS on browser (was a silent non-waiting flush) - use await SynchronizeAsync() to wait. New Flush() submits batched work without waiting (valid synchronously on browser - submit is sync on every backend, so there is no async Flush twin). CopyFromCPU/Allocate1D(data) work on every browser backend again (routed through the new EnsureHostCopyConsumed hook). Sync CreateScan/CreateRadixSort builders run on the browser backends. Gate: full cross-backend PMT 3384/0/218. Full contract: Docs/async.md. Full per-version history: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
 		<GeneratePackageOnBuild>True</GeneratePackageOnBuild>
 		<GenerateDocumentationFile>true</GenerateDocumentationFile>
 		<EmbedAllSources>true</EmbedAllSources>
diff --git a/SpawnDev.ILGPU/Wasm/CLAUDE.md b/SpawnDev.ILGPU/Wasm/CLAUDE.md
@@ -52,6 +52,52 @@ Compiles ILGPU IR → WebAssembly binary. Dispatches via Web Workers with Shared
 - `WasmMemoryBuffer.cs` — SharedArrayBuffer-backed memory, zero-copy sharing
 - `WasmILGPUDevice.cs` — device config (`MaxNumThreadsPerGroup = 256`, `MaxGroupSize = (256,1,1)`). NOTE: an earlier version of this line said 64 — that was stale; the device has set 256 (verified `WasmILGPUDevice.cs:68-69` + offline compile dump 2026-06-09). RadixSortKernel1's `scanMemory` is `int[groupSize*UnrollFactor]` = `int[1024]` at groupSize 256, UnrollFactor 4.
 
+## Persistent Shared Worker Pool (2026-06-13) — `s_sharedWorkerPool`
+
+The Web Worker pool is **process-static** (`WasmAccelerator.s_sharedWorkerPool`), created ONCE
+per tab and reused across every **default-WorkerCount** `WasmAccelerator` (the PMT/production
+common case). It is **not** recreated/terminated per accelerator. An accelerator created with an
+explicit `WasmBackendOptions.WorkerCount` (the oversubscription stress tests at 16/48/3×cores)
+uses a **private** pool (`_ownPool`, old create-on-first-dispatch / terminate-on-Dispose
+lifecycle) so its large worker count can't permanently inflate the shared pool for the rest of
+the lane. Two leak guards keep the shared pool bounded at ≈`hardwareConcurrency-2`: (a) only
+default accelerators touch it, and (b) on Dispose any worker still **checked out** (dispatch in
+flight at an abnormal/fire-and-forget dispose) is **terminated + removed** (not stranded — a
+stranded worker drains the available queue and the shortfall-grow path balloons the pool; this
+was caught 32-vs-10 by the regression test below before the guards were added). **Why:** PMT creates a fresh accelerator per test (~569 in the Wasm lane). The old
+per-accelerator pool called `Worker.terminate()` on its whole pool at Dispose, but
+`terminate()` is an **asynchronous** browser signal — the OS thread + its SharedArrayBuffer/Wasm-
+memory references wind down *after* Dispose returns. So the next test immediately spun up a fresh
+`hardwareConcurrency` pool while the previous pool's threads were still dying → transient worker
+**oversubscription** that compounds across a long sequential lane → the pure-spin barrier can't get
+all workers scheduled in its spin window → compute-**heavy** tests starve and time out late in
+Phase B (light tests still squeak in — a throughput tax, not a flat tax). Tuvok's full-sweep
+report: `TurboQuant_*` pass 128/128 in ~4s scoped, time out at 30s late in the 569-test lane.
+
+**Why cross-accelerator reuse is SAFE:**
+- the worker-side module-cache key (`KernelCacheEntry.KernelId`) is the **process-static** monotonic
+  `_nextKernelId`, so two different accelerators can never hand a worker the same id → never a
+  stale/stomped module (same reason the GC-hash kernelId bug was fixed — see "kernelId MUST be a
+  monotonic unique id");
+- a memory-buffer change invalidates the worker's cached instances (`_instancesById = {}` in
+  `WorkerPool.cs` when `d.memory.buffer` differs), so a worker reused by a new accelerator
+  re-instantiates against the new accelerator's `WebAssembly.Memory`;
+- each accelerator installs its OWN per-worker message handlers on adoption
+  (`EnsurePersistentHandlers`) and **DETACHES** them on Dispose (`DisposeAccelerator_SyncRoot`:
+  `worker.OnMessage -= state.MsgHandler`). Detach (not terminate) is what frees the accelerator —
+  without it, each disposed accelerator's handler closures stay attached to the persistent workers
+  (a managed leak that keeps the disposed accelerator alive) and pile up as dead no-op listeners.
+  So a reused worker carries exactly one accelerator's handlers at a time.
+
+The pool is **never** disposed on accelerator Dispose (it outlives accelerators by design); tear it
+down only via `WasmAccelerator.DisposeSharedWorkerPool()` (test/shutdown) or tab teardown.
+`WasmAccelerator.SharedWorkerPoolSize` is a diagnostic: a correct pool stays bounded
+(≈ `hardwareConcurrency - 2`) no matter how many accelerators come and go — it does NOT grow per
+accelerator. Locked by `WasmTests.Wasm_SharedWorkerPool_PersistsAndStaysBoundedAcrossAccelerators`
+(bounded-size + correct-on-reuse). **Caveat (not the PMT case):** if two accelerators are *concurrently*
+alive and both adopt the same worker, both install handlers — detach-on-dispose only cleans the
+dominant sequential case (PMT disposes the previous accelerator before creating the next).
+
 ## Offline compile dump (desktop, no browser) — `wasm-dump`
 
 `SpawnDev.ILGPU.DemoConsole -- wasm-dump` compiles RadixSort kernels on the DESKTOP and prints the emitted shared-memory alloca table + flags any `GenerateCode(Alloca)` type+size fallback aliasing or offset overlap. Works because `WasmAccelerator.Create` wraps the `BlazorJSRuntime.JS` lookup in try/catch (defaults to 4 cores) and `CreateRadixSort*` compiles its kernels eagerly via `LoadKernel` BEFORE any dispatch — so the IL→wasm compile path runs fully offline (no workers, no Chromium, no dispatch). Reusable for any shared-memory layout audit. Source: `SpawnDev.ILGPU.DemoConsole/WasmCompileDump.cs`.
diff --git a/SpawnDev.ILGPU/Wasm/WasmAccelerator.cs b/SpawnDev.ILGPU/Wasm/WasmAccelerator.cs
diff --git a/SpawnDev.ILGPU/WorkerPool.cs b/SpawnDev.ILGPU/WorkerPool.cs