Wasm: bound persistent-worker module cache via flush-at-accelerator-boundary

LostBeard · claude · LostBeard · commit 665a704eaf4f · 2026-06-14T18:27:18.000-04:00
Tuvok's local.6 memory-pressure trace (2026-06-14) proved the driver of the ML heavy-test late-lane
timeouts: the persistent worker pool's per-kernel module cache (_modulesById) accumulates UNBOUNDED
(TotalKernelsCompiled 2 -> 1057 monotonic across a ~570-test lane), while the committed shared linear
memory is flat (~96 MiB) -- the cache, not the working set, is the pressure. Causal proof: TurboQuant
tests pass 128/128 isolated, time out only late in the lane where kernels ~ 1100.

Fix: flush the worker module caches at an accelerator boundary. When cumulative kernels compiled since
the last flush cross WasmBackend.ModuleCacheFlushThreshold (default 256; 0 disables), the host sets
clearModuleCache=true on the worker messages of the NEXT fresh default-pool accelerator's FIRST dispatch.
The worker drops _modulesById/_instancesById (+ nulls _lastMemoryBuffer) then recompiles from the
re-sent bytes. SAFE only at a first dispatch: the accelerator's worker-init tracking is empty there so it
re-sends its own kernels; the cleared modules are disposed accelerators' dead weight. NEVER mid-accelerator
(would orphan modules it already told workers it had -> "module not cached"). Bounds peak modules to
~threshold. Short workloads never cross it -> never flush -> kernels stay fully warm (ILGPU library Wasm
lane / RadixSort record times unaffected). Sequential-accelerator assumption, like the rest of the pool.
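
Illustrative sketch of the worker side of this protocol. It is not the code in WorkerPool.cs, whose hunk is
not shown here. It is in JavaScript because the workers are Web Workers. createWorkerState, handleDispatch
and the compile parameter are hypothetical names; only the message fields kernelId, wasmBytes and
clearModuleCache and the cache names come from this commit. Instantiation against the shared memory is
omitted.

```javascript
// Hypothetical model of one worker's caches. The field names follow the commit
// text; the shapes and this helper are assumptions for illustration.
function createWorkerState() {
  return {
    modulesById: new Map(),   // kernelId -> compiled module
    instancesById: new Map(), // kernelId -> instance bound to the current memory
    lastMemoryBuffer: null,   // buffer the cached instances were created against
  };
}

// Handles the cache-related part of one dispatch message and returns the module
// to run. `compile` defaults to the real WebAssembly compiler; a stub can be
// passed instead.
function handleDispatch(state, msg, compile = (bytes) => new WebAssembly.Module(bytes)) {
  if (msg.clearModuleCache) {
    // Flush: drop everything cached for earlier (disposed) accelerators.
    state.modulesById.clear();
    state.instancesById.clear();
    state.lastMemoryBuffer = null;
  }
  if (msg.wasmBytes) {
    // First time this kernel reaches this worker, or first dispatch after a flush.
    state.modulesById.set(msg.kernelId, compile(msg.wasmBytes));
  }
  const mod = state.modulesById.get(msg.kernelId);
  if (!mod) {
    // The host believed this worker already had the module.
    throw new Error("module not cached");
  }
  return mod;
}
```

In this model a flush on a message that carries no wasmBytes fails with "module not cached", which is the
mid-accelerator case the host avoids by setting the flag only on a first dispatch.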

Chose flush over per-kernel LRU: LRU needs process-static per-worker module tracking + cross-accelerator
eviction coordination in the kernelId-collision-sensitive area (delicate); flush is simple, race-free,
and a guaranteed bound.

WorkerPool.cs (worker clearModuleCache handling), WasmDispatchMessages.cs (field on both message types),
WasmAccelerator.cs (first-dispatch flush trigger + s_lastFlushKernelId), WasmBackend.ModuleCacheFlushThreshold.
Guard WasmTests.Wasm_ModuleCacheFlush_DoesNotBreakCorrectness (threshold=1 -> flush every accelerator,
6 accels x 2 distinct kernels, CPU-oracle -&gt; catches any "module not cached"/stale). Gate:
PMT_FILTER=WasmTests 520/0/17. Version 4.12.1-local.7 (forks stay 2.0.16).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -11,6 +11,7 @@ Wrapper-only (forks stay **2.0.16**). Adds a new selection-gate capability flag:
 - **Wasm: process-persistent shared Web Worker pool.** The Web Worker pool is now process-static (`WasmAccelerator.s_sharedWorkerPool`) - created once per tab and reused across every default-WorkerCount accelerator - instead of being created and `terminate()`d per accelerator. `Worker.terminate()` is an asynchronous browser signal, so a fresh-accelerator-per-test pattern (PMT's ~531-test Wasm lane) spun up a new `hardwareConcurrency` pool while the previous pool's threads were still winding down -> transient worker oversubscription that compounded across the lane -> the pure-spin barrier couldn't schedule all workers in its window -> compute-heavy tests starved and timed out late while light tests stayed fast. The shared pool removes both the terminate churn and the per-test re-create cost. Safe across accelerators: the worker-side module-cache key is a process-static monotonic id (no cross-accelerator collision), a memory-buffer change invalidates a reused worker's cached instances, and each accelerator detaches its handlers on Dispose. Bounded at ~`hardwareConcurrency-2`: an explicit `WorkerCount` (oversubscription stress tests) keeps a private pool, and a worker still checked out at an abnormal Dispose is terminated+removed rather than stranded. Locked by `WasmTests.Wasm_SharedWorkerPool_PersistsAndStaysBoundedAcrossAccelerators`. Gate: `PMT_FILTER=WasmTests` 516/0/17.
 - **Wasm: process-static shared linear memory** (the second half the persistent worker pool unmasked). A `new WebAssembly.Memory({ shared: true })` reserves its full `maximum` (default 16384 pages = 1 GiB) of virtual address space at construction and can never relocate, so each default accelerator that built its own memory burned a full 1 GiB reservation. Before the persistent pool, `Worker.terminate()` per accelerator Dispose dropped the workers' references so the old reservation was freed/GC'd per test; with the persistent pool the workers pin the last memory they instantiated against (until they next swap), so across a ~569-test lane the per-accelerator memories accumulated up to `workerCount` live 1 GiB reservations until V8's address-space cap was hit and the `new WebAssembly.Memory(...)` constructor threw `could not allocate memory`. Default-WorkerCount accelerators now share a process-static `WebAssembly.Memory` keyed by their `MaxLinearMemoryPages` (`WasmAccelerator.s_sharedByMaxPages`) - ONE shared memory per distinct max value, grown to the lane high-water and never re-created -> a single reservation per max-group. Safe because the linear memory is per-dispatch transient working/staging memory (zero region -> copy-in -> run -> copy-out; no cross-accelerator state); a per-group `SemaphoreSlim` serializes that group's dispatch window across concurrently-alive accelerators (zero-cost in the sequential case). Keyed by max because the kernel module declares its memory-import maximum = its own `MaxLinearMemoryPages` and the spec requires the supplied memory's maximum equal the module's declared maximum - so all 16384 accelerators share a 16384 memory, all 32768 (e.g. ML's DA3-Small at 2 GiB) share a 32768 memory, etc. An explicit-`WorkerCount` accelerator (oversubscription stress tests, which want worker isolation) keeps a private memory. Bonus: with persistent workers and a persistent memory the buffer only changes on `grow()`, so after high-water the workers stop re-instantiating kernels entirely (the per-test new-memory churn is gone). (Originally only the default 16384 was shared, which missed the ML test lane's ~569 accelerators at a custom 32768 max - they re-accumulated the leak at 2 GiB each; generalized to per-max.) Diagnostics `WasmAccelerator.SharedWasmMemoryCreateCount` / `SharedWasmMemoryPages` (summed across groups); locked by `WasmTests.Wasm_SharedLinearMemory_PersistsAndStaysBoundedAcrossAccelerators` (default max) + `Wasm_SharedLinearMemory_CustomMaxPages_AlsoBounded` (custom max).
 - **Wasm SIMD128 emitter foundation (Phase 1 of the SIMD port).** Additive groundwork only - no production kernel emits v128 yet, so the scalar path is byte-identical. Adds the v128 value type and the 0xFD-prefixed SIMD opcode set to `WasmOpCodes` (spec-verified; sub-opcodes are u32-LEB128 after the prefix, so multi-byte ones like `f32x4.add`=228 encode correctly), v128 emit helpers in `WasmModuleBuilder` (`EmitSimd`/`EmitSimdMem`/`EmitSimdLane`/`EmitV128Const`/`EmitI8x16Shuffle`), and the runtime SIMD capability surface: `WasmBackend.RuntimeSupportsWasmSimd` (via `System.Runtime.Intrinsics.Wasm.PackedSimd.IsSupported` - if the running Blazor WASM build has SIMD enabled, the browser/workers accept v128), `ForceScalar`/`ForceSimd` test overrides, `EffectiveWasmSimd`, `WasmCapabilityContext.WasmSimd`, and `WasmAccelerator.SupportsSimd`. **Non-SIMD devices stay first-class forever** (the scalar path is a supported mode, not a deprecated fallback - real hardware/browsers without wasm SIMD are common; see the dual-build technique in `BlazorWASMSIMDDetectExample`). Verified by the offline `DemoConsole -- wasm-simd-probe`: a hand-built v128 module is `wasm-validate`-clean and `wasm2wat`-decodes to the intended instructions.
+- **Wasm: bound the persistent-worker module cache (late-lane memory-pressure fix).** The process-persistent worker pool keeps every distinct kernel's compiled `WebAssembly.Module` in a per-worker cache (`_modulesById`) for the tab's life. Across a long test lane each per-test accelerator's kernels get fresh ids, so the cache accumulated unbounded (measured 2 -> 1057 across a ~570-test lane) until late, heavy tests hit process-memory pressure and timed out (the committed shared linear memory was flat/small - the module cache was the driver). Fix: when cumulative kernels compiled since the last flush cross `WasmBackend.ModuleCacheFlushThreshold` (default 256; 0 disables), the host instructs the workers to drop their module/instance caches at the next fresh accelerator's FIRST dispatch (safe - that accelerator re-sends its own kernels; the cleared modules are disposed accelerators' dead weight). Bounds peak modules to ~the threshold. Short workloads never reach it -> never flush -> kernels stay fully warm. Diagnostics `WasmAccelerator.TotalKernelsCompiled` / `SharedWasmMemoryPages`; guard `WasmTests.Wasm_ModuleCacheFlush_DoesNotBreakCorrectness` (flushes every accelerator, asserts CPU-oracle).
 
 ## 4.12.0 (2026-06-13) - Sync/async contract: async-only where it waits/observes, sync for fire-and-forget
 
diff --git a/SpawnDev.ILGPU.Demo/UnitTests/WasmTests.cs b/SpawnDev.ILGPU.Demo/UnitTests/WasmTests.cs
@@ -1792,6 +1792,61 @@ public async Task Wasm_SharedLinearMemory_CustomMaxPages_AlsoBounded()
                     $"accelerators — the custom-max reservation leak (Tuvok's ML-lane 88->91) has regressed (expected <= 1).");
         }
 
+        // Module-cache flush correctness (2026-06-14, Geordi). The persistent worker pool's per-kernel
+        // module cache (_modulesById) accumulates across a long lane (Tuvok's ML trace: 2->1057 kernels →
+        // late-heavy-test memory-pressure timeouts). The fix flushes the worker caches at a fresh
+        // accelerator's first dispatch once cumulative kernels cross WasmBackend.ModuleCacheFlushThreshold.
+        // The RISK is the flush orphaning a module the host still thinks a worker has → "module not cached"
+        // / wrong output. This test forces flushes EVERY accelerator (threshold=1) across many accelerators,
+        // each running TWO distinct kernels (so the within-accelerator repopulation after a flush is
+        // exercised), and asserts CPU-oracle correctness throughout. If the flush coordination were wrong,
+        // this fails loudly. (A green run with aggressive flushing proves the dispatch-boundary flush is safe.)
+        [TestMethod(Timeout = 120000)]
+        public async Task Wasm_ModuleCacheFlush_DoesNotBreakCorrectness()
+        {
+            const int count = 2048;
+            const int accelerators = 6;
+            int savedThreshold = WasmBackend.ModuleCacheFlushThreshold;
+            WasmBackend.ModuleCacheFlushThreshold = 1; // flush on essentially every fresh accelerator
+            try
+            {
+                for (int a = 0; a < accelerators; a++)
+                {
+                    var context = Context.Create().Wasm().ToContext();
+                    var accelerator = await context.CreateWasmAcceleratorAsync();
+                    try
+                    {
+                        using var inBuf = accelerator.Allocate1D<int>(count);
+                        using var outA = accelerator.Allocate1D<int>(count);
+                        using var outB = accelerator.Allocate1D<int>(count);
+                        var seed = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>>(
+                            (i, v) => v[i] = i);
+                        seed((Index1D)count, inBuf.View);
+                        // Two DISTINCT kernels in this accelerator → 2 module compiles → repopulation after
+                        // the flush that fires at this accelerator's first dispatch.
+                        var kA = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>, ArrayView<int>>(
+                            (i, src, o) => o[i] = src[i] * 2 + 1);
+                        var kB = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>, ArrayView<int>>(
+                            (i, src, o) => o[i] = src[i] + 100);
+                        kA((Index1D)count, inBuf.View, outA.View);
+                        kB((Index1D)count, inBuf.View, outB.View);
+                        await accelerator.SynchronizeAsync();
+                        var rA = await outA.CopyToHostAsync<int>();
+                        var rB = await outB.CopyToHostAsync<int>();
+                        for (int i = 0; i < count; i++)
+                        {
+                            if (rA[i] != i * 2 + 1)
+                                throw new Exception($"Accelerator #{a} kA[{i}]={rA[i]} expected {i * 2 + 1} — flush broke kernel A (module not cached / stale?).");
+                            if (rB[i] != i + 100)
+                                throw new Exception($"Accelerator #{a} kB[{i}]={rB[i]} expected {i + 100} — flush broke kernel B.");
+                        }
+                    }
+                    finally { accelerator.Dispose(); context.Dispose(); }
+                }
+            }
+            finally { WasmBackend.ModuleCacheFlushThreshold = savedThreshold; }
+        }
+
         // Wasm SIMD128 emitter foundation (Phase 1, 2026-06-14, Geordi). Pure-CPU regression guard on
         // the v128 encoding — NO browser/GPU needed, just byte assertions. Locks the part most likely
         // to silently break: SIMD sub-opcodes are u32-LEB128 after the 0xFD prefix (NOT single bytes
diff --git a/SpawnDev.ILGPU/SpawnDev.ILGPU.csproj b/SpawnDev.ILGPU/SpawnDev.ILGPU.csproj
@@ -4,7 +4,7 @@
 		<TargetFramework>net10.0</TargetFramework>
 		<ImplicitUsings>enable</ImplicitUsings>
 		<Nullable>enable</Nullable>
-		<Version>4.12.1-local.6</Version>
+		<Version>4.12.1-local.7</Version>
 		<!-- Brief current-version highlights only. Full per-version history with code samples lives in CHANGELOG.md (linked from the README). -->
 		<PackageReleaseNotes>4.12.1: WebGPU cooperative GEMV grid-stride fix; ±inf/NaN scalar kernel params on WebGL+Wasm; AcceleratorRequirements.RequiresScatterStores flag; Wasm process-persistent shared Web Worker pool AND shared linear memory keyed per MaxLinearMemoryPages (default-WorkerCount accelerators share one pool + one WebAssembly.Memory per distinct max per tab, fixing worker-churn starvation and the WebAssembly.Memory-reservation accumulation across long test lanes — at both the default 1 GiB and custom maxes like 2 GiB); Wasm SIMD128 emitter foundation (additive groundwork, scalar path unchanged). Forks stay 2.0.16. Full per-version history with details: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
 		<GeneratePackageOnBuild>True</GeneratePackageOnBuild>
diff --git a/SpawnDev.ILGPU/Wasm/Backend/WasmBackend.cs b/SpawnDev.ILGPU/Wasm/Backend/WasmBackend.cs
@@ -165,6 +165,21 @@ public static bool RuntimeSupportsWasmSimd
         /// </summary>
         public static bool EffectiveWasmSimd => !ForceScalar && (ForceSimd || RuntimeSupportsWasmSimd);
 
+        /// <summary>
+        /// Bounds the persistent-worker module cache (`_modulesById`). The shared worker pool keeps every
+        /// distinct kernel's compiled `WebAssembly.Module` for the tab's life; across a long test lane
+        /// (Tuvok's ML trace 2026-06-14: 2→1057 kernels, monotonic) this accumulates until late, heavy
+        /// tests hit memory pressure and time out. When the number of kernels compiled since the last
+        /// flush reaches this threshold, the host instructs the workers to drop their module/instance
+        /// caches at the NEXT fresh accelerator's first dispatch (safe: that accelerator re-sends its own
+        /// kernels; older disposed accelerators' modules are the dead weight cleared). Only default-pool
+        /// (shared-worker) accelerators trigger it. Short workloads never reach the threshold → never
+        /// flush → kernels stay fully warm (e.g. the ILGPU library Wasm lane is unaffected). Set 0 to
+        /// disable. Default 256 (≈ one flush per ~140 ML tests at ~1.8 new kernels/test, keeping peak
+        /// modules well under the ~1057 that caused pressure).
+        /// </summary>
+        public static int ModuleCacheFlushThreshold { get; set; } = 256;
+
         /// <summary>
         /// Stage-3a SIMD uniformity analysis result for the most recently compiled kernel
         /// (<see cref="WasmSimdAnalysis"/>). DIAGNOSTIC ONLY — computed read-only during codegen; it
diff --git a/SpawnDev.ILGPU/Wasm/CLAUDE.md b/SpawnDev.ILGPU/Wasm/CLAUDE.md
@@ -98,6 +98,22 @@ accelerator. Locked by `WasmTests.Wasm_SharedWorkerPool_PersistsAndStaysBoundedA
 alive and both adopt the same worker, both install handlers — detach-on-dispose only cleans the
 dominant sequential case (PMT disposes the previous accelerator before creating the next).
 
+**Module-cache flush — bounds `_modulesById` accumulation (2026-06-14).** Persistent workers keep every
+distinct kernel's compiled `WebAssembly.Module` (`_modulesById`) for the tab's life. Each per-test
+accelerator's kernels get fresh ids, so across a long lane the cache grows UNBOUNDED (Tuvok's ML trace:
+`TotalKernelsCompiled` 2→1057 monotonic; committed shared linear memory was flat@96 MiB — the module
+cache, not the working set, drove late-heavy-test memory-pressure timeouts). Fix: when
+`_nextKernelId - s_lastFlushKernelId >= WasmBackend.ModuleCacheFlushThreshold` (default 256, 0=off), the
+host sets `clearModuleCache=true` on the worker messages of the **next fresh accelerator's FIRST
+dispatch**; the worker drops `_modulesById`/`_instancesById` (+ nulls `_lastMemoryBuffer`) then recompiles
+from the re-sent bytes. **SAFE only at a first dispatch** — the accelerator's worker-init tracking is empty
+there so it re-sends its own kernels; the cleared modules are disposed accelerators' dead weight. NEVER
+flush mid-accelerator (would orphan modules it already told workers it had → "module not cached").
+Sequential-accelerator assumption (like the rest of the shared pool — concurrent live accelerators could
+see a peer's flush). Short workloads never cross the threshold → never flush → stay fully warm (the ILGPU
+library Wasm lane is unaffected). Guard: `WasmTests.Wasm_ModuleCacheFlush_DoesNotBreakCorrectness`
+(threshold=1, CPU-oracle). Diagnostic: `WasmAccelerator.TotalKernelsCompiled`.
+
 ## Process-static SHARED linear memory (2026-06-14) — `s_sharedWasmMemory`
 
 The memory analog of the worker pool, and the **second half** the pool fix unmasked. A
diff --git a/SpawnDev.ILGPU/Wasm/WasmAccelerator.cs b/SpawnDev.ILGPU/Wasm/WasmAccelerator.cs
@@ -313,6 +313,17 @@ public static void DisposeSharedWasmMemory()
         /// </summary>
         public static int TotalKernelsCompiled => _nextKernelId;
 
+        /// <summary><see cref="_nextKernelId"/> value at the last module-cache flush. The host flushes the
+        /// persistent workers' module caches when <c>_nextKernelId - s_lastFlushKernelId</c> reaches or exceeds
+        /// <see cref="WasmBackend.ModuleCacheFlushThreshold"/> — see that flag + the flush in RunKernelAsync.</summary>
+        private static int s_lastFlushKernelId = 0;
+
+        /// <summary>Per-accelerator: has this accelerator dispatched yet? The module-cache flush check runs
+        /// ONLY on a fresh accelerator's FIRST dispatch (where its own worker-init tracking is still empty,
+        /// so dropping all cached modules is safe — it re-sends its kernels; mid-accelerator a flush would
+        /// orphan modules this accelerator already told workers it had → "module not cached").</summary>
+        private bool _firstDispatchDone = false;
+
         /// <summary>
         /// Worker-init tracking for one distinct kernel (keyed by its <c>wasmBytes</c> reference in
         /// <see cref="_initializedWorkersByKernel"/>). Carries a stable, unique <see cref="KernelId"/>
@@ -2031,6 +2042,32 @@ private async Task DispatchToWorkers(
             }
             int kernelId = kernelCacheEntry.KernelId;
 
+            // Module-cache flush decision (bounds persistent-worker _modulesById accumulation; Tuvok
+            // trace 2026-06-14). ONLY on a fresh default-pool accelerator's FIRST dispatch (tracking
+            // empty ⇒ safe to drop all cached modules; this dispatch re-sends its kernel). When set,
+            // every worker message below carries clearModuleCache=true; the worker drops its caches then
+            // recompiles from the wasmBytes it's re-sent here. Sequential-accelerator assumption (like the
+            // rest of the shared pool). Short workloads never cross the threshold ⇒ never flush ⇒ stay warm.
+            bool clearCacheThisDispatch = false;
+            if (_useSharedPool && !_firstDispatchDone)
+            {
+                _firstDispatchDone = true;
+                int flushThreshold = WasmBackend.ModuleCacheFlushThreshold;
+                if (flushThreshold > 0)
+                {
+                    lock (s_sharedMemoryLock)
+                    {
+                        if (_nextKernelId - s_lastFlushKernelId >= flushThreshold)
+                        {
+                            clearCacheThisDispatch = true;
+                            s_lastFlushKernelId = _nextKernelId;
+                        }
+                    }
+                    if (clearCacheThisDispatch && WasmBackend.VerboseLogging)
+                        WasmBackend.Log($"[Wasm-MODFLUSH] disp={dispNum} clearing worker module caches at kernels={_nextKernelId} (threshold={flushThreshold})");
+                }
+            }
+
             var tasks = new List<Task>();
 
             if (hasBarriers)
@@ -2080,6 +2117,7 @@ private async Task DispatchToWorkers(
                         script = workerScript,
                         wasmBytes = firstTimeOnWorker ? wasmBytes : null,
                         kernelId = kernelId,
+                        clearModuleCache = clearCacheThisDispatch,
                         memory = wasmMemory,
                         threadStart = threadStart,
                         threadEnd = threadEnd,
@@ -2127,6 +2165,7 @@ private async Task DispatchToWorkers(
                         script = workerScript,
                         wasmBytes = firstTimeOnWorker ? wasmBytes : null,
                         kernelId = kernelId,
+                        clearModuleCache = clearCacheThisDispatch,
                         memory = wasmMemory,
                         startIdx = startIdx,
                         endIdx = endIdx,
diff --git a/SpawnDev.ILGPU/Wasm/WasmDispatchMessages.cs b/SpawnDev.ILGPU/Wasm/WasmDispatchMessages.cs
diff --git a/SpawnDev.ILGPU/WorkerPool.cs b/SpawnDev.ILGPU/WorkerPool.cs