Commit 4f93830

LostBeard and claude committed
Wasm: process-persistent shared Web Worker pool (fix heavy-test starvation)
The Web Worker pool was per-accelerator: created on first dispatch and terminate()'d on Dispose. Worker.terminate() is an asynchronous browser signal, so a fresh-accelerator-per-test pattern (PMT's ~531-test Wasm lane) spun up a new hardwareConcurrency pool while the previous pool's threads were still winding down -> transient worker oversubscription that compounded across the lane -> the pure-spin barrier couldn't schedule all workers in its window -> compute-heavy tests starved and timed out late while light tests stayed fast (Tuvok's ML full-sweep report).

Make the pool process-static (s_sharedWorkerPool), created once per tab and reused across every default-WorkerCount accelerator. Removes both the terminate churn and the per-test re-create cost.

Safe across accelerators: the worker-side module-cache key is the process-static monotonic _nextKernelId (no cross-accelerator collision), a memory-buffer change invalidates a reused worker's cached instances, and each accelerator detaches its own per-worker handlers on Dispose instead of relying on terminate.

Bounded at ~hardwareConcurrency-2:
- an explicit WorkerCount (oversubscription stress tests at 16/48/3xcores) keeps a PRIVATE pool (old create/terminate lifecycle) so its non-default count can't inflate the shared pool; routing is by resolved-count == machine-default.
- a worker still CHECKED OUT at an abnormal/fire-and-forget Dispose (may be running an orphaned parked-barrier fiber) is terminated + removed from the pool rather than stranded, which would drain the available queue and let the shortfall-grow path balloon the pool.

New WorkerPool.Remove(worker); WasmAccelerator.SharedWorkerPoolSize / DisposeSharedWorkerPool() diagnostics. Regression guard WasmTests.Wasm_SharedWorkerPool_PersistsAndStaysBoundedAcrossAccelerators (persisted + bounded + correct-on-reuse + explicit-count isolation).

Gate: PMT_FILTER=WasmTests 516/0/17. Version 4.12.1-local.3 (forks stay 2.0.16).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 0190202 commit 4f93830

6 files changed

Lines changed: 387 additions & 26 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ Wrapper-only (forks stay **2.0.16**). Adds a new selection-gate capability flag:

 - **`AcceleratorRequirements.RequiresScatterStores`** (rules out WebGL) - declare it when a kernel writes a computed/arbitrary output index (`out[someIndex] = ...`) or more than one element of one buffer per thread that isn't the consecutive `v*storeCount+slot` layout. WebGL Transform-Feedback captures one output record per vertex at the thread's own slot (gather-only), so in-kernel scatter can't run there; the flag filters WebGL at `EnumerateCompatibleDevices` / `CreatePreferredAccelerator` / `Satisfies` time. (WebGL still scatters at the host/algorithm layer - e.g. RadixSort via render-to-texture.)
 - A compile-time fail-loud guard for this class (mirroring the atomics/barriers/Scan throws) was prototyped and backed out - the blunt criterion false-positived on legitimate positional multi-store + grid-stride-loop kernels. The correct codegen-level criterion is a tracked open item (`Plans/webgl-multistore-fail-loud-guard-plan-2026-06-13.md`). For now use the selection flag.
+- **Wasm: process-persistent shared Web Worker pool.** The Web Worker pool is now process-static (`WasmAccelerator.s_sharedWorkerPool`) - created once per tab and reused across every default-WorkerCount accelerator - instead of being created and `terminate()`d per accelerator. `Worker.terminate()` is an asynchronous browser signal, so a fresh-accelerator-per-test pattern (PMT's ~531-test Wasm lane) spun up a new `hardwareConcurrency` pool while the previous pool's threads were still winding down -> transient worker oversubscription that compounded across the lane -> the pure-spin barrier couldn't schedule all workers in its window -> compute-heavy tests starved and timed out late while light tests stayed fast. The shared pool removes both the terminate churn and the per-test re-create cost. Safe across accelerators: the worker-side module-cache key is a process-static monotonic id (no cross-accelerator collision), a memory-buffer change invalidates a reused worker's cached instances, and each accelerator detaches its handlers on Dispose. Bounded at ~`hardwareConcurrency-2`: an explicit `WorkerCount` (oversubscription stress tests) keeps a private pool, and a worker still checked out at an abnormal Dispose is terminated+removed rather than stranded. Locked by `WasmTests.Wasm_SharedWorkerPool_PersistsAndStaysBoundedAcrossAccelerators`. Gate: `PMT_FILTER=WasmTests` 516/0/17.

 ## 4.12.0 (2026-06-13) - Sync/async contract: async-only where it waits/observes, sync for fire-and-forget

SpawnDev.ILGPU.Demo/UnitTests/WasmTests.cs

Lines changed: 115 additions & 0 deletions
@@ -1522,5 +1522,120 @@ public async Task WasmGroupBarrierOversubscriptionTest()
                 context.Dispose();
             }
         }
+
+        // Persistent shared worker pool (2026-06-13, Geordi). The Wasm Web Worker pool is
+        // process-static and reused across EVERY accelerator instead of being recreated +
+        // terminated per accelerator. PMT creates a fresh accelerator per test (~569 in the
+        // Wasm lane); the old per-accelerator pool terminated its whole worker pool on Dispose,
+        // but Worker.terminate() is an ASYNC browser signal — so the next test spun up a fresh
+        // hardwareConcurrency pool while the previous pool's threads were still dying → transient
+        // worker oversubscription that starved compute-heavy tests late in the lane (Tuvok's
+        // full-sweep report: heavy tests pass scoped in ~4s, time out at 30s in-lane).
+        //
+        // This test locks BOTH halves of the fix:
+        //   (1) BOUNDED: creating + dispatching + disposing K accelerators leaves the shared pool
+        //       at ~one accelerator's worth of workers, NOT K× (the old design would have churned
+        //       K separate pools through create/terminate).
+        //   (2) CORRECT-ON-REUSE: a dispatch on accelerators 2..K (which adopt the SAME persistent
+        //       workers freed by the disposed earlier accelerators) still matches the CPU oracle —
+        //       proving the worker-side module cache (keyed by the process-static monotonic
+        //       kernelId) and the memory-buffer-change instance invalidation handle cross-
+        //       accelerator reuse with no stale module / stale memory.
+        [TestMethod(Timeout = 120000)]
+        public async Task Wasm_SharedWorkerPool_PersistsAndStaysBoundedAcrossAccelerators()
+        {
+            const int count = 4096;
+            const int accelerators = 5;
+
+            // A single accelerator's worker request — the pool must never exceed this regardless
+            // of how many accelerators come and go.
+            int oneAccWorkerCount;
+            {
+                using var probeCtx = Context.Create().Wasm().ToContext();
+                using var probeAcc = await probeCtx.CreateWasmAcceleratorAsync();
+                oneAccWorkerCount = ((WasmAccelerator)probeAcc).WorkerCount;
+            }
+            if (oneAccWorkerCount < 1)
+                throw new Exception($"Unexpected WorkerCount {oneAccWorkerCount} (expected >= 1).");
+
+            var oracle = new int[count];
+            for (int i = 0; i < count; i++) oracle[i] = i * 3 + 7;
+
+            int maxSizeSeen = 0;
+            for (int a = 0; a < accelerators; a++)
+            {
+                var context = Context.Create().Wasm().ToContext();
+                var accelerator = await context.CreateWasmAcceleratorAsync();
+                try
+                {
+                    using var buf = accelerator.Allocate1D<int>(count);
+                    var fill = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>>(
+                        (i, v) => v[i] = i * 3 + 7);
+                    fill((Index1D)count, buf.View);
+                    await accelerator.SynchronizeAsync();
+                    var result = await buf.CopyToHostAsync<int>();
+
+                    // Correctness on a (re)used worker pool — iterations 1..K-1 adopt workers freed
+                    // by the previous disposed accelerator.
+                    for (int i = 0; i < count; i++)
+                        if (result[i] != oracle[i])
+                            throw new Exception(
+                                $"Accelerator #{a}: result[{i}]={result[i]} expected {oracle[i]} — " +
+                                $"reused worker produced wrong output (stale module or stale memory?).");
+
+                    int sizeNow = WasmAccelerator.SharedWorkerPoolSize;
+                    if (sizeNow > maxSizeSeen) maxSizeSeen = sizeNow;
+                }
+                finally
+                {
+                    accelerator.Dispose();
+                    context.Dispose();
+                }
+            }
+
+            // BOUNDED invariant: the shared pool settled at one accelerator's worth, not K×.
+            // (The pool is process-global so other lane tests may have already grown it to
+            // oneAccWorkerCount before this test ran — that's the steady state we expect.)
+            if (maxSizeSeen > oneAccWorkerCount)
+                throw new Exception(
+                    $"Shared worker pool grew to {maxSizeSeen} across {accelerators} accelerators, " +
+                    $"exceeding a single accelerator's {oneAccWorkerCount} workers — it is accumulating " +
+                    $"per-accelerator instead of persisting (the per-accelerator-pool regression).");
+            if (maxSizeSeen < 1)
+                throw new Exception(
+                    "Shared worker pool size never registered >= 1 after dispatching on " +
+                    $"{accelerators} accelerators — the persistent pool was not used.");
+
+            // ISOLATION invariant (order-independent): an accelerator with an EXPLICIT non-default
+            // WorkerCount must use a PRIVATE pool and leave the shared pool untouched — otherwise a
+            // single oversubscription stress test (16/48/3×cores) would permanently inflate the
+            // shared pool for the rest of the lane (the original 32-vs-10 ballooning).
+            int sizeBeforeExplicit = WasmAccelerator.SharedWorkerPoolSize;
+            {
+                var exCtx = Context.Create().Wasm().ToContext();
+                // +6 guarantees a count distinct from the default so it routes to a private pool.
+                var exAcc = await exCtx.CreateWasmAcceleratorAsync(
+                    new WasmBackendOptions { WorkerCount = oneAccWorkerCount + 6 });
+                try
+                {
+                    using var b = exAcc.Allocate1D<int>(count);
+                    var k = exAcc.LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>>(
+                        (i, v) => v[i] = i * 3 + 7);
+                    k((Index1D)count, b.View);
+                    await exAcc.SynchronizeAsync();
+                    var r = await b.CopyToHostAsync<int>();
+                    for (int i = 0; i < count; i++)
+                        if (r[i] != oracle[i])
+                            throw new Exception(
+                                $"Explicit-WorkerCount accelerator: result[{i}]={r[i]} expected {oracle[i]}.");
+                }
+                finally { exAcc.Dispose(); exCtx.Dispose(); }
+            }
+            int sizeAfterExplicit = WasmAccelerator.SharedWorkerPoolSize;
+            if (sizeAfterExplicit > sizeBeforeExplicit)
+                throw new Exception(
+                    $"An explicit-WorkerCount accelerator grew the SHARED pool ({sizeBeforeExplicit} -> " +
+                    $"{sizeAfterExplicit}) — it must use a private pool and leave the shared pool untouched.");
+        }
     }
 }
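The isolation invariant in the test above rests on the routing rule stated in the commit message (resolved count == machine default goes to the shared pool, anything else gets a private pool). The following is a minimal JavaScript sketch of that rule, not the library's code: `defaultWorkerCount`, `choosePool`, and the pool objects are hypothetical names, and the default of `hardwareConcurrency - 2` is an assumption taken from the "~hardwareConcurrency-2" bound.

```javascript
// Hypothetical model of the routing rule; not the SpawnDev.ILGPU implementation.
// Assumption: the machine-default worker count is hardwareConcurrency - 2, floor 1.
function defaultWorkerCount(hardwareConcurrency) {
  return Math.max(1, hardwareConcurrency - 2);
}

// One shared pool per process (per tab), modeled as a plain object.
const sharedPool = { kind: "shared", workers: [] };

// Route an accelerator to the shared pool only when its resolved count
// equals the machine default; any other count gets a fresh private pool.
function choosePool(requestedCount, hardwareConcurrency) {
  const machineDefault = defaultWorkerCount(hardwareConcurrency);
  const resolved = requestedCount ?? machineDefault;
  if (resolved === machineDefault) {
    // Grow the shared pool up to the default, never past it.
    while (sharedPool.workers.length < machineDefault) {
      sharedPool.workers.push({ id: sharedPool.workers.length });
    }
    return sharedPool;
  }
  const workers = Array.from({ length: resolved }, (_, id) => ({ id }));
  return { kind: "private", workers };
}

const a = choosePool(undefined, 12); // default count -> shared pool
const b = choosePool(16, 12);        // explicit count -> private pool
const c = choosePool(undefined, 12); // reuses the same shared pool
```

In this model the explicit-count accelerator never touches `sharedPool`, which is the property the test checks by comparing `SharedWorkerPoolSize` before and after.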

SpawnDev.ILGPU/SpawnDev.ILGPU.csproj

Lines changed: 2 additions & 2 deletions
@@ -4,9 +4,9 @@
     <TargetFramework>net10.0</TargetFramework>
     <ImplicitUsings>enable</ImplicitUsings>
     <Nullable>enable</Nullable>
-    <Version>4.12.1-local.2</Version>
+    <Version>4.12.1-local.3</Version>
     <!-- Brief current-version highlights only. Full per-version history with code samples lives in CHANGELOG.md (linked from the README). -->
-    <PackageReleaseNotes>4.12.1: (a) WebGPU GEMV grid-stride fix — a cooperative GEMV's inner K-tile loop (k=tid;k&lt;K;k+=G) was conflated with a synthetic group grid-stride counter (~K/G x too-small accumulation); fixed with a two-pass uniform-break (decide the break from the EMITTED body's barriers — barrier-free loops keep their natural break, barrier loops keep the uniform break unchanged). Re-enables the fast cooperative M=1 GEMV on WebGPU. (b) fixes special-float (±inf/NaN) scalar kernel params on WebGL + Wasm - a kernel float scalar holding ±inf/NaN (e.g. ConstantOfShape(-inf)) silently failed at the .NET-&gt;JS dispatch boundary (WebGL: System.Text.Json in postMessage rejected ±inf/NaN; Wasm: culture-sensitive ToString emitted invalid JS tokens). WebGL now sends float scalars as int32 bit patterns (glWorker reconstructs); Wasm uses InvariantCulture. Also adds the AcceleratorRequirements.RequiresScatterStores capability flag (rules out WebGL for kernels doing in-kernel scatter / &gt;1 non-positional output element per thread - WebGL Transform-Feedback is gather-only). Forks stay 2.0.16. --- 4.12.0 Sync/async contract (bundles forks 2.0.16): an operation that WAITS for completion or OBSERVES a result is async-only on the browser backends (WebGPU/WebGL/Wasm) - its synchronous form throws NotSupportedException; fire-and-forget operations (dispatch, alloc, host-&gt;device upload, flush-submit) stay synchronous everywhere. Synchronize() now THROWS on browser (was a silent non-waiting flush) - use await SynchronizeAsync() to wait. New Flush() submits batched work without waiting (valid synchronously on browser - submit is sync on every backend, so there is no async Flush twin). CopyFromCPU/Allocate1D(data) work on every browser backend again (routed through the new EnsureHostCopyConsumed hook). Sync CreateScan/CreateRadixSort builders run on the browser backends. Gate: full cross-backend PMT 3384/0/218. Full contract: Docs/async.md. Full per-version history: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
+    <PackageReleaseNotes>4.12.1: (a) WebGPU GEMV grid-stride fix — a cooperative GEMV's inner K-tile loop (k=tid;k&lt;K;k+=G) was conflated with a synthetic group grid-stride counter (~K/G x too-small accumulation); fixed with a two-pass uniform-break (decide the break from the EMITTED body's barriers — barrier-free loops keep their natural break, barrier loops keep the uniform break unchanged). Re-enables the fast cooperative M=1 GEMV on WebGPU. (b) fixes special-float (±inf/NaN) scalar kernel params on WebGL + Wasm - a kernel float scalar holding ±inf/NaN (e.g. ConstantOfShape(-inf)) silently failed at the .NET-&gt;JS dispatch boundary (WebGL: System.Text.Json in postMessage rejected ±inf/NaN; Wasm: culture-sensitive ToString emitted invalid JS tokens). WebGL now sends float scalars as int32 bit patterns (glWorker reconstructs); Wasm uses InvariantCulture. Also adds the AcceleratorRequirements.RequiresScatterStores capability flag (rules out WebGL for kernels doing in-kernel scatter / &gt;1 non-positional output element per thread - WebGL Transform-Feedback is gather-only). (c) Wasm: the Web Worker pool is now process-persistent (shared across accelerators) instead of created + terminated per accelerator - removes the per-accelerator worker create/async-terminate churn that transiently oversubscribed cores and starved compute-heavy tests late in a long sequential lane. Default-WorkerCount accelerators share the pool (bounded at ~hardwareConcurrency-2); an explicit WorkerCount (oversubscription stress tests) keeps a private pool. Forks stay 2.0.16. --- 4.12.0 Sync/async contract (bundles forks 2.0.16): an operation that WAITS for completion or OBSERVES a result is async-only on the browser backends (WebGPU/WebGL/Wasm) - its synchronous form throws NotSupportedException; fire-and-forget operations (dispatch, alloc, host-&gt;device upload, flush-submit) stay synchronous everywhere. Synchronize() now THROWS on browser (was a silent non-waiting flush) - use await SynchronizeAsync() to wait. New Flush() submits batched work without waiting (valid synchronously on browser - submit is sync on every backend, so there is no async Flush twin). CopyFromCPU/Allocate1D(data) work on every browser backend again (routed through the new EnsureHostCopyConsumed hook). Sync CreateScan/CreateRadixSort builders run on the browser backends. Gate: full cross-backend PMT 3384/0/218. Full contract: Docs/async.md. Full per-version history: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
     <GeneratePackageOnBuild>True</GeneratePackageOnBuild>
     <GenerateDocumentationFile>true</GenerateDocumentationFile>
     <EmbedAllSources>true</EmbedAllSources>

SpawnDev.ILGPU/Wasm/CLAUDE.md

Lines changed: 46 additions & 0 deletions
@@ -52,6 +52,52 @@ Compiles ILGPU IR → WebAssembly binary. Dispatches via Web Workers with Shared
 - `WasmMemoryBuffer.cs` — SharedArrayBuffer-backed memory, zero-copy sharing
 - `WasmILGPUDevice.cs` — device config (`MaxNumThreadsPerGroup = 256`, `MaxGroupSize = (256,1,1)`). NOTE: an earlier version of this line said 64 — that was stale; the device has set 256 (verified `WasmILGPUDevice.cs:68-69` + offline compile dump 2026-06-09). RadixSortKernel1's `scanMemory` is `int[groupSize*UnrollFactor]` = `int[1024]` at groupSize 256, UnrollFactor 4.

+## Persistent Shared Worker Pool (2026-06-13) — `s_sharedWorkerPool`
+
+The Web Worker pool is **process-static** (`WasmAccelerator.s_sharedWorkerPool`), created ONCE
+per tab and reused across every **default-WorkerCount** `WasmAccelerator` (the PMT/production
+common case). It is **not** recreated/terminated per accelerator. An accelerator created with an
+explicit `WasmBackendOptions.WorkerCount` (the oversubscription stress tests at 16/48/3×cores)
+uses a **private** pool (`_ownPool`, old create-on-first-dispatch / terminate-on-Dispose
+lifecycle) so its large worker count can't permanently inflate the shared pool for the rest of
+the lane. Two leak guards keep the shared pool bounded at ≈`hardwareConcurrency-2`: (a) only
+default accelerators touch it, and (b) on Dispose any worker still **checked out** (dispatch in
+flight at an abnormal/fire-and-forget dispose) is **terminated + removed** (not stranded — a
+stranded worker drains the available queue and the shortfall-grow path balloons the pool; this
+was caught 32-vs-10 by the regression test below before the guards were added). **Why:** PMT creates a fresh accelerator per test (~569 in the Wasm lane). The old
+per-accelerator pool called `Worker.terminate()` on its whole pool at Dispose, but
+`terminate()` is an **asynchronous** browser signal — the OS thread + its SharedArrayBuffer/Wasm-
+memory references wind down *after* Dispose returns. So the next test immediately spun up a fresh
+`hardwareConcurrency` pool while the previous pool's threads were still dying → transient worker
+**oversubscription** that compounds across a long sequential lane → the pure-spin barrier can't get
+all workers scheduled in its spin window → compute-**heavy** tests starve and time out late in
+Phase B (light tests still squeak in — a throughput tax, not a flat tax). Tuvok's full-sweep
+report: `TurboQuant_*` pass 128/128 in ~4s scoped, time out at 30s late in the 569-test lane.
+
+**Why cross-accelerator reuse is SAFE:**
+- the worker-side module-cache key (`KernelCacheEntry.KernelId`) is the **process-static** monotonic
+  `_nextKernelId`, so two different accelerators can never hand a worker the same id → never a
+  stale/stomped module (same reason the GC-hash kernelId bug was fixed — see "kernelId MUST be a
+  monotonic unique id");
+- a memory-buffer change invalidates the worker's cached instances (`_instancesById = {}` in
+  `WorkerPool.cs` when `d.memory.buffer` differs), so a worker reused by a new accelerator
+  re-instantiates against the new accelerator's `WebAssembly.Memory`;
+- each accelerator installs its OWN per-worker message handlers on adoption
+  (`EnsurePersistentHandlers`) and **DETACHES** them on Dispose (`DisposeAccelerator_SyncRoot`:
+  `worker.OnMessage -= state.MsgHandler`). Detach (not terminate) is what frees the accelerator —
+  without it, each disposed accelerator's handler closures stay attached to the persistent workers
+  (a managed leak that keeps the disposed accelerator alive) and pile up as dead no-op listeners.
+  So a reused worker carries exactly one accelerator's handlers at a time.
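The first two safety points in the list above can be sketched on the worker side in JavaScript. This is an illustrative model under stated assumptions, not the contents of `WorkerPool.cs`: only the monotonic id and the reset-on-buffer-change rule come from the text, while `newKernelId`, `createWorkerCache`, `getInstance`, and the plain-object stand-ins for `WebAssembly.Memory` are hypothetical. It models the instance cache only, not the compiled-module cache.

```javascript
// Hypothetical worker-side cache; only the reset-on-buffer-change rule and the
// monotonic id come from the text above, every name here is illustrative.
let nextKernelId = 0;
function newKernelId() {
  // Monotonic and process-wide, so two accelerators never share an id.
  nextKernelId += 1;
  return nextKernelId;
}

function createWorkerCache() {
  let lastBuffer = null;
  let instancesById = {};
  return {
    // Returns the cached instance for kernelId, re-instantiating when the
    // backing memory buffer differs from the one seen on the previous call.
    getInstance(kernelId, memory, instantiate) {
      if (memory.buffer !== lastBuffer) {
        instancesById = {}; // drop instances bound to the old memory
        lastBuffer = memory.buffer;
      }
      if (!(kernelId in instancesById)) {
        instancesById[kernelId] = instantiate(kernelId, memory);
      }
      return instancesById[kernelId];
    },
    cachedCount() {
      return Object.keys(instancesById).length;
    },
  };
}

// Stand-ins for two accelerators' WebAssembly.Memory objects.
const memoryA = { buffer: {} };
const memoryB = { buffer: {} };
const cache = createWorkerCache();
const instantiate = (id, memory) => ({ id, boundTo: memory.buffer });

const k1 = newKernelId();
const first = cache.getInstance(k1, memoryA, instantiate);
const again = cache.getInstance(k1, memoryA, instantiate);   // cache hit
const k2 = newKernelId();
const rebound = cache.getInstance(k2, memoryB, instantiate); // cache was reset
```

Keying on a counter rather than a hash is what rules out collisions in this model: an id handed out once is never handed out again, so a reused worker cannot mistake a new accelerator's kernel for a cached one.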
+
+The pool is **never** disposed on accelerator Dispose (it outlives accelerators by design); tear it
+down only via `WasmAccelerator.DisposeSharedWorkerPool()` (test/shutdown) or tab teardown.
+`WasmAccelerator.SharedWorkerPoolSize` is a diagnostic: a correct pool stays bounded
+(≈ `hardwareConcurrency - 2`) no matter how many accelerators come and go — it does NOT grow per
+accelerator. Locked by `WasmTests.Wasm_SharedWorkerPool_PersistsAndStaysBoundedAcrossAccelerators`
+(bounded-size + correct-on-reuse). **Caveat (not the PMT case):** if two accelerators are *concurrently*
+alive and both adopt the same worker, both install handlers — detach-on-dispose only cleans the
+dominant sequential case (PMT disposes the previous accelerator before creating the next).
+
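The bounded-size claim and the checked-out leak guard described in this section can be modeled with a small JavaScript simulation. Everything here is a hypothetical stand-in (`FakeWorker`, `ModelPool`, `ModelAccelerator`, and their methods are not the library's types); it assumes sequential accelerators that each request four workers, with one of them disposed while its dispatch is still checked out.

```javascript
// Hypothetical model of the shared pool's leak guard; not the library's code.
class FakeWorker {
  constructor(id) { this.id = id; this.handlers = new Set(); this.terminated = false; }
  terminate() { this.terminated = true; }
}

class ModelPool {
  constructor() { this.all = []; this.available = []; this.nextId = 0; }
  get size() { return this.all.length; }
  // Hand out `count` workers, creating new ones only on a shortfall.
  checkOut(count) {
    while (this.available.length < count) {
      const w = new FakeWorker(this.nextId++);
      this.all.push(w);
      this.available.push(w);
    }
    return this.available.splice(0, count);
  }
  giveBack(worker) { this.available.push(worker); }
  // Terminate a worker and forget it, so it cannot strand a pool slot.
  remove(worker) {
    worker.terminate();
    this.all = this.all.filter((w) => w !== worker);
    this.available = this.available.filter((w) => w !== worker);
  }
}

class ModelAccelerator {
  constructor(pool) { this.pool = pool; this.checkedOut = []; this.adopted = new Map(); }
  beginDispatch(count) {
    this.checkedOut = this.pool.checkOut(count);
    for (const w of this.checkedOut) {
      if (!this.adopted.has(w)) {
        const handler = () => {};
        w.handlers.add(handler); // install this accelerator's own handler
        this.adopted.set(w, handler);
      }
    }
  }
  endDispatch() {
    for (const w of this.checkedOut) this.pool.giveBack(w);
    this.checkedOut = [];
  }
  dispose() {
    // Detach this accelerator's handlers from every worker it adopted.
    for (const [w, handler] of this.adopted) w.handlers.delete(handler);
    // Anything still checked out is terminated and removed, not stranded.
    for (const w of this.checkedOut) this.pool.remove(w);
    this.checkedOut = [];
    this.adopted.clear();
  }
}

const pool = new ModelPool();
const sizes = [];
for (let i = 0; i < 5; i++) {
  const acc = new ModelAccelerator(pool);
  acc.beginDispatch(4);
  if (i !== 2) acc.endDispatch(); // iteration 2 models an abnormal dispose
  acc.dispose();
  sizes.push(pool.size);
}
```

In this model the pool never exceeds four workers. If `dispose()` skipped the `remove` step, the four workers from iteration 2 would stay in `all` but never return to `available`, so the next `checkOut` would create four more: the ballooning the section describes.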
 ## Offline compile dump (desktop, no browser) — `wasm-dump`

 `SpawnDev.ILGPU.DemoConsole -- wasm-dump` compiles RadixSort kernels on the DESKTOP and prints the emitted shared-memory alloca table + flags any `GenerateCode(Alloca)` type+size fallback aliasing or offset overlap. Works because `WasmAccelerator.Create` wraps the `BlazorJSRuntime.JS` lookup in try/catch (defaults to 4 cores) and `CreateRadixSort*` compiles its kernels eagerly via `LoadKernel` BEFORE any dispatch — so the IL→wasm compile path runs fully offline (no workers, no Chromium, no dispatch). Reusable for any shared-memory layout audit. Source: `SpawnDev.ILGPU.DemoConsole/WasmCompileDump.cs`.
