Commit 665a704

LostBeard and claude committed
Wasm: bound persistent-worker module cache via flush-at-accelerator-boundary
Tuvok's local.6 memory-pressure trace (2026-06-14) proved the driver of the ML heavy-test late-lane timeouts: the persistent worker pool's per-kernel module cache (_modulesById) accumulates UNBOUNDED (TotalKernelsCompiled 2 -> 1057 monotonic across a ~570-test lane), while the committed shared linear memory is flat (~96 MiB) -- the cache, not the working set, is the pressure. Causal proof: TurboQuant tests pass 128/128 isolated, time out only late in the lane where kernels ~ 1100.

Fix: flush the worker module caches at an accelerator boundary. When cumulative kernels compiled since the last flush cross WasmBackend.ModuleCacheFlushThreshold (default 256; 0 disables), the host sets clearModuleCache=true on the worker messages of the NEXT fresh default-pool accelerator's FIRST dispatch. The worker drops _modulesById/_instancesById (+ nulls _lastMemoryBuffer) then recompiles from the re-sent bytes. SAFE only at a first dispatch: the accelerator's worker-init tracking is empty there so it re-sends its own kernels; the cleared modules are disposed accelerators' dead weight. NEVER mid-accelerator (would orphan modules it already told workers it had -> "module not cached"). Bounds peak modules to ~threshold. Short workloads never cross it -> never flush -> kernels stay fully warm (ILGPU library Wasm lane / RadixSort record times unaffected). Sequential-accelerator assumption, like the rest of the pool.

Chose flush over per-kernel LRU: LRU needs process-static per-worker module tracking + cross-accelerator eviction coordination in the kernelId-collision-sensitive area (delicate); flush is simple, race-free, and a guaranteed bound.

WorkerPool.cs (worker clearModuleCache handling), WasmDispatchMessages.cs (field on both message types), WasmAccelerator.cs (first-dispatch flush trigger + s_lastFlushKernelId), WasmBackend.ModuleCacheFlushThreshold.

Guard WasmTests.Wasm_ModuleCacheFlush_DoesNotBreakCorrectness (threshold=1 -> flush every accelerator, 6 accels x 2 distinct kernels, CPU-oracle -> catches any "module not cached"/stale).

Gate: PMT_FILTER=WasmTests 520/0/17. Version 4.12.1-local.7 (forks stay 2.0.16).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
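The "bounds peak modules to ~threshold" claim in the commit message can be illustrated with a small standalone simulation. This is a sketch only, not code from the commit: `ModuleCacheFlushModel` and `PeakCachedModules` are hypothetical names, and the model assumes strictly sequential accelerators that each compile the same number of fresh kernels, with the flush check running before an accelerator's own kernels are counted.

```csharp
using System;

// Hypothetical stand-in for the host-side bookkeeping; not the commit's code.
public static class ModuleCacheFlushModel
{
    // Simulates `accelerators` sequential accelerators, each compiling
    // `kernelsPerAccelerator` fresh kernels, and returns the peak number of
    // modules a persistent worker would hold. threshold == 0 disables flushing.
    public static int PeakCachedModules(int accelerators, int kernelsPerAccelerator, int threshold)
    {
        int nextKernelId = 0;      // monotonic counter (plays the role of TotalKernelsCompiled)
        int lastFlushKernelId = 0; // counter value at the last flush
        int cached = 0;            // modules currently held by the worker
        int peak = 0;

        for (int a = 0; a < accelerators; a++)
        {
            // The flush check runs once per accelerator, at its first dispatch.
            if (threshold > 0 && nextKernelId - lastFlushKernelId >= threshold)
            {
                cached = 0;                       // worker drops its module cache
                lastFlushKernelId = nextKernelId; // restart the count from here
            }

            // The accelerator then compiles and sends its own kernels.
            nextKernelId += kernelsPerAccelerator;
            cached += kernelsPerAccelerator;
            peak = Math.Max(peak, cached);
        }

        return peak;
    }
}
```

Under these assumptions, 570 accelerators with 2 kernels each peak at 256 cached modules with a threshold of 256, versus 1140 with flushing disabled (threshold 0). When the kernels per accelerator do not divide the threshold evenly, the peak can overshoot by less than one accelerator's worth of kernels, which is why the bound is approximate.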
1 parent f25be18 commit 665a704

8 files changed

Lines changed: 148 additions & 1 deletion

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ Wrapper-only (forks stay **2.0.16**). Adds a new selection-gate capability flag:
- **Wasm: process-persistent shared Web Worker pool.** The Web Worker pool is now process-static (`WasmAccelerator.s_sharedWorkerPool`) - created once per tab and reused across every default-WorkerCount accelerator - instead of being created and `terminate()`d per accelerator. `Worker.terminate()` is an asynchronous browser signal, so a fresh-accelerator-per-test pattern (PMT's ~531-test Wasm lane) spun up a new `hardwareConcurrency` pool while the previous pool's threads were still winding down -> transient worker oversubscription that compounded across the lane -> the pure-spin barrier couldn't schedule all workers in its window -> compute-heavy tests starved and timed out late while light tests stayed fast. The shared pool removes both the terminate churn and the per-test re-create cost. Safe across accelerators: the worker-side module-cache key is a process-static monotonic id (no cross-accelerator collision), a memory-buffer change invalidates a reused worker's cached instances, and each accelerator detaches its handlers on Dispose. Bounded at ~`hardwareConcurrency-2`: an explicit `WorkerCount` (oversubscription stress tests) keeps a private pool, and a worker still checked out at an abnormal Dispose is terminated+removed rather than stranded. Locked by `WasmTests.Wasm_SharedWorkerPool_PersistsAndStaysBoundedAcrossAccelerators`. Gate: `PMT_FILTER=WasmTests` 516/0/17.
- **Wasm: process-static shared linear memory** (the second half the persistent worker pool unmasked). A `new WebAssembly.Memory({ shared: true })` reserves its full `maximum` (default 16384 pages = 1 GiB) of virtual address space at construction and can never relocate, so each default accelerator that built its own memory burned a full 1 GiB reservation. Before the persistent pool, `Worker.terminate()` per accelerator Dispose dropped the workers' references so the old reservation was freed/GC'd per test; with the persistent pool the workers pin the last memory they instantiated against (until they next swap), so across a ~569-test lane the per-accelerator memories accumulated up to `workerCount` live 1 GiB reservations until V8's address-space cap was hit and the `new WebAssembly.Memory(...)` constructor threw `could not allocate memory`. Default-WorkerCount accelerators now share a process-static `WebAssembly.Memory` keyed by their `MaxLinearMemoryPages` (`WasmAccelerator.s_sharedByMaxPages`) - ONE shared memory per distinct max value, grown to the lane high-water and never re-created -> a single reservation per max-group. Safe because the linear memory is per-dispatch transient working/staging memory (zero region -> copy-in -> run -> copy-out; no cross-accelerator state); a per-group `SemaphoreSlim` serializes that group's dispatch window across concurrently-alive accelerators (zero-cost in the sequential case). Keyed by max because the kernel module declares its memory-import maximum = its own `MaxLinearMemoryPages` and the spec requires the supplied memory's maximum equal the module's declared maximum - so all 16384 accelerators share a 16384 memory, all 32768 (e.g. ML's DA3-Small at 2 GiB) share a 32768 memory, etc. An explicit-`WorkerCount` accelerator (oversubscription stress tests, which want worker isolation) keeps a private memory. 
Bonus: with persistent workers and a persistent memory the buffer only changes on `grow()`, so after high-water the workers stop re-instantiating kernels entirely (the per-test new-memory churn is gone). (Originally only the default 16384 was shared, which missed the ML test lane's ~569 accelerators at a custom 32768 max - they re-accumulated the leak at 2 GiB each; generalized to per-max.) Diagnostics `WasmAccelerator.SharedWasmMemoryCreateCount` / `SharedWasmMemoryPages` (summed across groups); locked by `WasmTests.Wasm_SharedLinearMemory_PersistsAndStaysBoundedAcrossAccelerators` (default max) + `Wasm_SharedLinearMemory_CustomMaxPages_AlsoBounded` (custom max).
- **Wasm SIMD128 emitter foundation (Phase 1 of the SIMD port).** Additive groundwork only - no production kernel emits v128 yet, so the scalar path is byte-identical. Adds the v128 value type and the 0xFD-prefixed SIMD opcode set to `WasmOpCodes` (spec-verified; sub-opcodes are u32-LEB128 after the prefix, so multi-byte ones like `f32x4.add`=228 encode correctly), v128 emit helpers in `WasmModuleBuilder` (`EmitSimd`/`EmitSimdMem`/`EmitSimdLane`/`EmitV128Const`/`EmitI8x16Shuffle`), and the runtime SIMD capability surface: `WasmBackend.RuntimeSupportsWasmSimd` (via `System.Runtime.Intrinsics.Wasm.PackedSimd.IsSupported` - if the running Blazor WASM build has SIMD enabled, the browser/workers accept v128), `ForceScalar`/`ForceSimd` test overrides, `EffectiveWasmSimd`, `WasmCapabilityContext.WasmSimd`, and `WasmAccelerator.SupportsSimd`. **Non-SIMD devices stay first-class forever** (the scalar path is a supported mode, not a deprecated fallback - real hardware/browsers without wasm SIMD are common; see the dual-build technique in `BlazorWASMSIMDDetectExample`). Verified by the offline `DemoConsole -- wasm-simd-probe`: a hand-built v128 module is `wasm-validate`-clean and `wasm2wat`-decodes to the intended instructions.
+- **Wasm: bound the persistent-worker module cache (late-lane memory-pressure fix).** The process-persistent worker pool keeps every distinct kernel's compiled `WebAssembly.Module` in a per-worker cache (`_modulesById`) for the tab's life. Across a long test lane each per-test accelerator's kernels get fresh ids, so the cache accumulated unbounded (measured 2 -> 1057 across a ~570-test lane) until late, heavy tests hit process-memory pressure and timed out (the committed shared linear memory was flat/small - the module cache was the driver). Fix: when cumulative kernels compiled since the last flush cross `WasmBackend.ModuleCacheFlushThreshold` (default 256; 0 disables), the host instructs the workers to drop their module/instance caches at the next fresh accelerator's FIRST dispatch (safe - that accelerator re-sends its own kernels; the cleared modules are disposed accelerators' dead weight). Bounds peak modules to ~the threshold. Short workloads never reach it -> never flush -> kernels stay fully warm. Diagnostics `WasmAccelerator.TotalKernelsCompiled` / `SharedWasmMemoryPages`; guard `WasmTests.Wasm_ModuleCacheFlush_DoesNotBreakCorrectness` (flushes every accelerator, asserts CPU-oracle).

## 4.12.0 (2026-06-13) - Sync/async contract: async-only where it waits/observes, sync for fire-and-forget

SpawnDev.ILGPU.Demo/UnitTests/WasmTests.cs

Lines changed: 55 additions & 0 deletions
@@ -1792,6 +1792,61 @@ public async Task Wasm_SharedLinearMemory_CustomMaxPages_AlsoBounded()
                 $"accelerators — the custom-max reservation leak (Tuvok's ML-lane 88->91) has regressed (expected <= 1).");
         }
 
+        // Module-cache flush correctness (2026-06-14, Geordi). The persistent worker pool's per-kernel
+        // module cache (_modulesById) accumulates across a long lane (Tuvok's ML trace: 2->1057 kernels →
+        // late-heavy-test memory-pressure timeouts). The fix flushes the worker caches at a fresh
+        // accelerator's first dispatch once cumulative kernels cross WasmBackend.ModuleCacheFlushThreshold.
+        // The RISK is the flush orphaning a module the host still thinks a worker has → "module not cached"
+        // / wrong output. This test forces flushes EVERY accelerator (threshold=1) across many accelerators,
+        // each running TWO distinct kernels (so the within-accelerator repopulation after a flush is
+        // exercised), and asserts CPU-oracle correctness throughout. If the flush coordination were wrong,
+        // this fails loudly. (A green run with aggressive flushing proves the dispatch-boundary flush is safe.)
+        [TestMethod(Timeout = 120000)]
+        public async Task Wasm_ModuleCacheFlush_DoesNotBreakCorrectness()
+        {
+            const int count = 2048;
+            const int accelerators = 6;
+            int savedThreshold = WasmBackend.ModuleCacheFlushThreshold;
+            WasmBackend.ModuleCacheFlushThreshold = 1; // flush on essentially every fresh accelerator
+            try
+            {
+                for (int a = 0; a < accelerators; a++)
+                {
+                    var context = Context.Create().Wasm().ToContext();
+                    var accelerator = await context.CreateWasmAcceleratorAsync();
+                    try
+                    {
+                        using var inBuf = accelerator.Allocate1D<int>(count);
+                        using var outA = accelerator.Allocate1D<int>(count);
+                        using var outB = accelerator.Allocate1D<int>(count);
+                        var seed = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>>(
+                            (i, v) => v[i] = i);
+                        seed((Index1D)count, inBuf.View);
+                        // Two DISTINCT kernels in this accelerator → 2 module compiles → repopulation after
+                        // the flush that fires at this accelerator's first dispatch.
+                        var kA = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>, ArrayView<int>>(
+                            (i, src, o) => o[i] = src[i] * 2 + 1);
+                        var kB = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>, ArrayView<int>>(
+                            (i, src, o) => o[i] = src[i] + 100);
+                        kA((Index1D)count, inBuf.View, outA.View);
+                        kB((Index1D)count, inBuf.View, outB.View);
+                        await accelerator.SynchronizeAsync();
+                        var rA = await outA.CopyToHostAsync<int>();
+                        var rB = await outB.CopyToHostAsync<int>();
+                        for (int i = 0; i < count; i++)
+                        {
+                            if (rA[i] != i * 2 + 1)
+                                throw new Exception($"Accelerator #{a} kA[{i}]={rA[i]} expected {i * 2 + 1} — flush broke kernel A (module not cached / stale?).");
+                            if (rB[i] != i + 100)
+                                throw new Exception($"Accelerator #{a} kB[{i}]={rB[i]} expected {i + 100} — flush broke kernel B.");
+                        }
+                    }
+                    finally { accelerator.Dispose(); context.Dispose(); }
+                }
+            }
+            finally { WasmBackend.ModuleCacheFlushThreshold = savedThreshold; }
+        }
+
         // Wasm SIMD128 emitter foundation (Phase 1, 2026-06-14, Geordi). Pure-CPU regression guard on
         // the v128 encoding — NO browser/GPU needed, just byte assertions. Locks the part most likely
         // to silently break: SIMD sub-opcodes are u32-LEB128 after the 0xFD prefix (NOT single bytes

SpawnDev.ILGPU/SpawnDev.ILGPU.csproj

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@
44
<TargetFramework>net10.0</TargetFramework>
55
<ImplicitUsings>enable</ImplicitUsings>
66
<Nullable>enable</Nullable>
7-
<Version>4.12.1-local.6</Version>
7+
<Version>4.12.1-local.7</Version>
88
<!-- Brief current-version highlights only. Full per-version history with code samples lives in CHANGELOG.md (linked from the README). -->
99
<PackageReleaseNotes>4.12.1: WebGPU cooperative GEMV grid-stride fix; ±inf/NaN scalar kernel params on WebGL+Wasm; AcceleratorRequirements.RequiresScatterStores flag; Wasm process-persistent shared Web Worker pool AND shared linear memory keyed per MaxLinearMemoryPages (default-WorkerCount accelerators share one pool + one WebAssembly.Memory per distinct max per tab, fixing worker-churn starvation and the WebAssembly.Memory-reservation accumulation across long test lanes — at both the default 1 GiB and custom maxes like 2 GiB); Wasm SIMD128 emitter foundation (additive groundwork, scalar path unchanged). Forks stay 2.0.16. Full per-version history with details: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
1010
<GeneratePackageOnBuild>True</GeneratePackageOnBuild>

SpawnDev.ILGPU/Wasm/Backend/WasmBackend.cs

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -165,6 +165,21 @@ public static bool RuntimeSupportsWasmSimd
165165
/// </summary>
166166
public static bool EffectiveWasmSimd => !ForceScalar && (ForceSimd || RuntimeSupportsWasmSimd);
167167

168+
/// <summary>
169+
/// Bounds the persistent-worker module cache (`_modulesById`). The shared worker pool keeps every
170+
/// distinct kernel's compiled `WebAssembly.Module` for the tab's life; across a long test lane
171+
/// (Tuvok's ML trace 2026-06-14: 2→1057 kernels, monotonic) this accumulates until late, heavy
172+
/// tests hit memory pressure and time out. When the cumulative kernels compiled since the last
173+
/// flush crosses this threshold, the host instructs the workers to drop their module/instance
174+
/// caches at the NEXT fresh accelerator's first dispatch (safe: that accelerator re-sends its own
175+
/// kernels; older disposed accelerators' modules are the dead weight cleared). Only default-pool
176+
/// (shared-worker) accelerators trigger it. Short workloads never reach the threshold → never
177+
/// flush → kernels stay fully warm (e.g. the ILGPU library Wasm lane is unaffected). Set 0 to
178+
/// disable. Default 256 (≈ one flush per ~140 ML tests at ~1.8 new kernels/test, keeping peak
179+
/// modules well under the ~1057 that caused pressure).
180+
/// </summary>
181+
public static int ModuleCacheFlushThreshold { get; set; } = 256;
182+
168183
/// <summary>
169184
/// Stage-3a SIMD uniformity analysis result for the most recently compiled kernel
170185
/// (<see cref="WasmSimdAnalysis"/>). DIAGNOSTIC ONLY — computed read-only during codegen; it
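The default of 256 in the doc comment above can be sanity-checked against the trace figures quoted elsewhere in this commit (2 -> 1057 kernels over a ~570-test lane). The arithmetic below is a rough worked check, treating the lane length as exactly 570 and using the doc comment's rounded rate of 1.8 kernels per test:

```latex
\[
\frac{1057 - 2}{570} \approx 1.85 \ \text{kernels per test},
\qquad
\frac{256}{1.8} \approx 142 \ \text{tests between flushes}
\]
```

Both values agree with the "~1.8 new kernels/test" and "~140 ML tests" stated in the doc comment.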

SpawnDev.ILGPU/Wasm/CLAUDE.md

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -98,6 +98,22 @@ accelerator. Locked by `WasmTests.Wasm_SharedWorkerPool_PersistsAndStaysBoundedA
9898
alive and both adopt the same worker, both install handlers — detach-on-dispose only cleans the
9999
dominant sequential case (PMT disposes the previous accelerator before creating the next).
100100

101+
**Module-cache flush — bounds `_modulesById` accumulation (2026-06-14).** Persistent workers keep every
102+
distinct kernel's compiled `WebAssembly.Module` (`_modulesById`) for the tab's life. Each per-test
103+
accelerator's kernels get fresh ids, so across a long lane the cache grows UNBOUNDED (Tuvok's ML trace:
104+
`TotalKernelsCompiled` 2→1057 monotonic; committed shared linear memory was flat@96 MiB — the module
105+
cache, not the working set, drove late-heavy-test memory-pressure timeouts). Fix: when
106+
`_nextKernelId - s_lastFlushKernelId >= WasmBackend.ModuleCacheFlushThreshold` (default 256, 0=off), the
107+
host sets `clearModuleCache=true` on the worker messages of the **next fresh accelerator's FIRST
108+
dispatch**; the worker drops `_modulesById`/`_instancesById` (+ nulls `_lastMemoryBuffer`) then recompiles
109+
from the re-sent bytes. **SAFE only at a first dispatch** — the accelerator's worker-init tracking is empty
110+
there so it re-sends its own kernels; the cleared modules are disposed accelerators' dead weight. NEVER
111+
flush mid-accelerator (would orphan modules it already told workers it had → "module not cached").
112+
Sequential-accelerator assumption (like the rest of the shared pool — concurrent live accelerators could
113+
see a peer's flush). Short workloads never cross the threshold → never flush → stay fully warm (the ILGPU
114+
library Wasm lane is unaffected). Guard: `WasmTests.Wasm_ModuleCacheFlush_DoesNotBreakCorrectness`
115+
(threshold=1, CPU-oracle). Diagnostic: `WasmAccelerator.TotalKernelsCompiled`.
116+
101117
## Process-static SHARED linear memory (2026-06-14) — `s_sharedWasmMemory`
102118

103119
The memory analog of the worker pool, and the **second half** the pool fix unmasked. A
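The "SAFE only at a first dispatch" rule in the module-cache flush note can be made concrete with a toy model of one worker's cache. This is a sketch under stated assumptions, not the worker code from `WorkerPool.cs` (which this page does not show): `WorkerCacheModel` and `Dispatch` are hypothetical names, a byte array stands in for a compiled module, and the host is assumed to send `wasmBytes` only the first time an accelerator uses a kernel on the worker.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical, simplified model of one persistent worker's module cache.
public sealed class WorkerCacheModel
{
    private readonly Dictionary<int, byte[]> _modulesById = new();

    public int CachedModuleCount => _modulesById.Count;

    // Mirrors the shape of a dispatch message: kernelId, optional wasmBytes,
    // and the clearModuleCache flag described in the note.
    public void Dispatch(int kernelId, byte[]? wasmBytes, bool clearModuleCache)
    {
        // Drop everything first when the host asks for a flush.
        if (clearModuleCache) _modulesById.Clear();

        // "Compile" and cache the module when the host (re-)sends its bytes.
        if (wasmBytes != null) _modulesById[kernelId] = wasmBytes;

        // Without bytes the worker must already hold the module.
        if (!_modulesById.ContainsKey(kernelId))
            throw new InvalidOperationException("module not cached");
    }
}
```

In this model a flush on a dispatch that also carries bytes always succeeds, which corresponds to the first-dispatch case. A flush on a later dispatch, where the host believes the module is cached and sends no bytes, throws "module not cached", which is the orphaning failure the note warns about.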

SpawnDev.ILGPU/Wasm/WasmAccelerator.cs

Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -313,6 +313,17 @@ public static void DisposeSharedWasmMemory()
313313
/// </summary>
314314
public static int TotalKernelsCompiled => _nextKernelId;
315315

316+
/// <summary><see cref="_nextKernelId"/> value at the last module-cache flush. The host flushes the
317+
/// persistent workers' module caches when <c>_nextKernelId - s_lastFlushKernelId</c> exceeds
318+
/// <see cref="WasmBackend.ModuleCacheFlushThreshold"/> — see that flag + the flush in RunKernelAsync.</summary>
319+
private static int s_lastFlushKernelId = 0;
320+
321+
/// <summary>Per-accelerator: has this accelerator dispatched yet? The module-cache flush check runs
322+
/// ONLY on a fresh accelerator's FIRST dispatch (where its own worker-init tracking is still empty,
323+
/// so dropping all cached modules is safe — it re-sends its kernels; mid-accelerator a flush would
324+
/// orphan modules this accelerator already told workers it had → "module not cached").</summary>
325+
private bool _firstDispatchDone = false;
326+
316327
/// <summary>
317328
/// Worker-init tracking for one distinct kernel (keyed by its <c>wasmBytes</c> reference in
318329
/// <see cref="_initializedWorkersByKernel"/>). Carries a stable, unique <see cref="KernelId"/>
@@ -2031,6 +2042,32 @@ private async Task DispatchToWorkers(
20312042
}
20322043
int kernelId = kernelCacheEntry.KernelId;
20332044

2045+
// Module-cache flush decision (bounds persistent-worker _modulesById accumulation; Tuvok
2046+
// trace 2026-06-14). ONLY on a fresh default-pool accelerator's FIRST dispatch (tracking
2047+
// empty ⇒ safe to drop all cached modules; this dispatch re-sends its kernel). When set,
2048+
// every worker message below carries clearModuleCache=true; the worker drops its caches then
2049+
// recompiles from the wasmBytes it's re-sent here. Sequential-accelerator assumption (like the
2050+
// rest of the shared pool). Short workloads never cross the threshold ⇒ never flush ⇒ stay warm.
2051+
bool clearCacheThisDispatch = false;
2052+
if (_useSharedPool && !_firstDispatchDone)
2053+
{
2054+
_firstDispatchDone = true;
2055+
int flushThreshold = WasmBackend.ModuleCacheFlushThreshold;
2056+
if (flushThreshold > 0)
2057+
{
2058+
lock (s_sharedMemoryLock)
2059+
{
2060+
if (_nextKernelId - s_lastFlushKernelId >= flushThreshold)
2061+
{
2062+
clearCacheThisDispatch = true;
2063+
s_lastFlushKernelId = _nextKernelId;
2064+
}
2065+
}
2066+
if (clearCacheThisDispatch && WasmBackend.VerboseLogging)
2067+
WasmBackend.Log($"[Wasm-MODFLUSH] disp={dispNum} clearing worker module caches at kernels={_nextKernelId} (threshold={flushThreshold})");
2068+
}
2069+
}
2070+
20342071
var tasks = new List<Task>();
20352072

20362073
if (hasBarriers)
@@ -2080,6 +2117,7 @@ private async Task DispatchToWorkers(
20802117
script = workerScript,
20812118
wasmBytes = firstTimeOnWorker ? wasmBytes : null,
20822119
kernelId = kernelId,
2120+
clearModuleCache = clearCacheThisDispatch,
20832121
memory = wasmMemory,
20842122
threadStart = threadStart,
20852123
threadEnd = threadEnd,
@@ -2127,6 +2165,7 @@ private async Task DispatchToWorkers(
21272165
script = workerScript,
21282166
wasmBytes = firstTimeOnWorker ? wasmBytes : null,
21292167
kernelId = kernelId,
2168+
clearModuleCache = clearCacheThisDispatch,
21302169
memory = wasmMemory,
21312170
startIdx = startIdx,
21322171
endIdx = endIdx,
