Wasm: fix host-write snapshot SharedArrayBuffer leak (the ML-lane heavy-test memory leak)

LostBeard · claude · LostBeard · commit 3784cfbbcc4e · 2026-06-14T22:14:15.000-04:00
Root-caused via a RESIDENT-memory trace (Tuvok): the ML Wasm heavy-test late-lane timeouts are a JS-heap
leak (usedJSHeapSize 154->1644 MiB across the lane; worker pool flat, linear memory flat, module cache
flat by magnitude). Reading the code found a SECOND SharedArrayBuffer path my earlier byte counter never
saw: WasmMemoryBuffer.PrepareHostWrite (WasmMemoryBuffer.cs:87) allocates a FULL-buffer-size
`new SharedArrayBuffer` snapshot when a host write lands while a dispatch is in flight on that buffer (the
lazy copy-out-race defense). CompleteDispatchIntent Remove()'d the snapshot from _snapshotsByHWC but NEVER
Disposed the SharedArrayBuffer -- despite the method's own doc claiming "that tier's SAB is freed" -- and
the all-intents-complete path nulled the dict without disposing. So every materialized snapshot leaked a
full-buffer JS SAB; ML's CopyFromCPU+dispatch pattern materializes them constantly -> ~1.5 GiB.

Fix: Dispose the snapshot SAB on release (the free the doc always promised) + DisposeAllSnapshots() on the
intents==0 path and on buffer Dispose (buffer torn down with pending snapshots). New diagnostic
WasmMemoryBuffer.LiveSnapshotBytes (the previously-invisible snapshot-SAB resident bytes). Regression guard
WasmTests.Wasm_HostWriteSnapshot_DoesNotLeakSAB deterministically materializes snapshots (launch registers
the dispatch intent synchronously, then host-write mid-flight) and asserts LiveSnapshotBytes returns to
baseline after dispatch-complete + buffer dispose. Also keeps the local.8 resident-count diagnostics
(LiveBufferCount/Bytes, LiveAcceleratorCount).
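
For readers without the source open, the release pattern described above can be sketched roughly as
follows. This is a minimal standalone illustration, not the WasmMemoryBuffer code: `FakeSab`,
`SnapshotOwner`, and `SnapshotDemo` are hypothetical names, and `FakeSab` only stands in for a JS-backed
SharedArrayBuffer wrapper by counting resident bytes. The point it shows is that removing a dictionary
entry is not a free: the value must be Disposed on the refcount-zero path and again on owner teardown.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a JS-backed SharedArrayBuffer wrapper (NOT the SpawnDev type).
sealed class FakeSab : IDisposable
{
    public static long LiveBytes;            // resident bytes across all live FakeSabs
    private readonly long _bytes;
    private bool _disposed;
    public FakeSab(long bytes) { _bytes = bytes; LiveBytes += bytes; }
    public void Dispose()
    {
        if (_disposed) return;               // idempotent: a double Dispose must not double-count
        _disposed = true;
        LiveBytes -= _bytes;
    }
}

// Hypothetical owner that tracks snapshots by key, with one refcount per snapshot.
sealed class SnapshotOwner : IDisposable
{
    private Dictionary<int, FakeSab>? _snapshots;
    private Dictionary<int, int>? _refCounts;

    public void Add(int key, long bytes, int refs)
    {
        _snapshots ??= new Dictionary<int, FakeSab>();
        _refCounts ??= new Dictionary<int, int>();
        if (_snapshots.TryGetValue(key, out var old)) old.Dispose(); // never orphan a replaced entry
        _snapshots[key] = new FakeSab(bytes);
        _refCounts[key] = refs;
    }

    public void Release(int key)
    {
        if (_refCounts == null || !_refCounts.TryGetValue(key, out var rc)) return;
        if (--rc > 0) { _refCounts[key] = rc; return; }              // other holders remain
        _refCounts.Remove(key);
        // Remove AND Dispose: dropping only the dictionary entry leaves the buffer resident.
        if (_snapshots != null && _snapshots.Remove(key, out var snap)) snap.Dispose();
    }

    public void Dispose()
    {
        // Teardown with snapshots still pending: free them before dropping the dictionaries.
        if (_snapshots != null)
            foreach (var snap in _snapshots.Values) snap.Dispose();
        _snapshots = null;
        _refCounts = null;
    }
}

static class SnapshotDemo
{
    // Returns the resident bytes left over after exercising both exit paths.
    public static long Run()
    {
        var owner = new SnapshotOwner();
        owner.Add(1, 1024, refs: 2);
        owner.Release(1);                    // one holder left, snapshot stays resident
        owner.Release(1);                    // last holder released, snapshot disposed
        owner.Add(2, 2048, refs: 1);
        owner.Dispose();                     // pending snapshot freed on teardown
        return FakeSab.LiveBytes;
    }
}
```

Calling `SnapshotDemo.Run()` should leave `FakeSab.LiveBytes` at its starting value, which is the same
return-to-baseline property the regression guard asserts through LiveSnapshotBytes.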

Gate: PMT_FILTER=WasmTests 521/0/17 (guard green = the dispose works on the deterministic trigger; large
RadixSorts faster: 1.4M 11.7s, 4M 33s). Version 4.12.1-local.9 (forks stay 2.0.16). Re-verify pending on
Tuvok's full ML lane (expect LiveSnapshotBytes flat-near-0, heap flat, the 8 timeouts gone).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -12,6 +12,7 @@ Wrapper-only (forks stay **2.0.16**). Adds a new selection-gate capability flag:
 - **Wasm: process-static shared linear memory** (the second half the persistent worker pool unmasked). A `new WebAssembly.Memory({ shared: true })` reserves its full `maximum` (default 16384 pages = 1 GiB) of virtual address space at construction and can never relocate, so each default accelerator that built its own memory burned a full 1 GiB reservation. Before the persistent pool, `Worker.terminate()` per accelerator Dispose dropped the workers' references so the old reservation was freed/GC'd per test; with the persistent pool the workers pin the last memory they instantiated against (until they next swap), so across a ~569-test lane the per-accelerator memories accumulated up to `workerCount` live 1 GiB reservations until V8's address-space cap was hit and the `new WebAssembly.Memory(...)` constructor threw `could not allocate memory`. Default-WorkerCount accelerators now share a process-static `WebAssembly.Memory` keyed by their `MaxLinearMemoryPages` (`WasmAccelerator.s_sharedByMaxPages`) - ONE shared memory per distinct max value, grown to the lane high-water and never re-created -> a single reservation per max-group. Safe because the linear memory is per-dispatch transient working/staging memory (zero region -> copy-in -> run -> copy-out; no cross-accelerator state); a per-group `SemaphoreSlim` serializes that group's dispatch window across concurrently-alive accelerators (zero-cost in the sequential case). Keyed by max because the kernel module declares its memory-import maximum = its own `MaxLinearMemoryPages` and the spec requires the supplied memory's maximum equal the module's declared maximum - so all 16384 accelerators share a 16384 memory, all 32768 (e.g. ML's DA3-Small at 2 GiB) share a 32768 memory, etc. An explicit-`WorkerCount` accelerator (oversubscription stress tests, which want worker isolation) keeps a private memory. Bonus: with persistent workers and a persistent memory the buffer only changes on `grow()`, so after high-water the workers stop re-instantiating kernels entirely (the per-test new-memory churn is gone). (Originally only the default 16384 was shared, which missed the ML test lane's ~569 accelerators at a custom 32768 max - they re-accumulated the leak at 2 GiB each; generalized to per-max.) Diagnostics `WasmAccelerator.SharedWasmMemoryCreateCount` / `SharedWasmMemoryPages` (summed across groups); locked by `WasmTests.Wasm_SharedLinearMemory_PersistsAndStaysBoundedAcrossAccelerators` (default max) + `Wasm_SharedLinearMemory_CustomMaxPages_AlsoBounded` (custom max).
 - **Wasm SIMD128 emitter foundation (Phase 1 of the SIMD port).** Additive groundwork only - no production kernel emits v128 yet, so the scalar path is byte-identical. Adds the v128 value type and the 0xFD-prefixed SIMD opcode set to `WasmOpCodes` (spec-verified; sub-opcodes are u32-LEB128 after the prefix, so multi-byte ones like `f32x4.add`=228 encode correctly), v128 emit helpers in `WasmModuleBuilder` (`EmitSimd`/`EmitSimdMem`/`EmitSimdLane`/`EmitV128Const`/`EmitI8x16Shuffle`), and the runtime SIMD capability surface: `WasmBackend.RuntimeSupportsWasmSimd` (via `System.Runtime.Intrinsics.Wasm.PackedSimd.IsSupported` - if the running Blazor WASM build has SIMD enabled, the browser/workers accept v128), `ForceScalar`/`ForceSimd` test overrides, `EffectiveWasmSimd`, `WasmCapabilityContext.WasmSimd`, and `WasmAccelerator.SupportsSimd`. **Non-SIMD devices stay first-class forever** (the scalar path is a supported mode, not a deprecated fallback - real hardware/browsers without wasm SIMD are common; see the dual-build technique in `BlazorWASMSIMDDetectExample`). Verified by the offline `DemoConsole -- wasm-simd-probe`: a hand-built v128 module is `wasm-validate`-clean and `wasm2wat`-decodes to the intended instructions.
 - **Wasm: bound the persistent-worker module cache (late-lane memory-pressure fix).** The process-persistent worker pool keeps every distinct kernel's compiled `WebAssembly.Module` in a per-worker cache (`_modulesById`) for the tab's life. Across a long test lane each per-test accelerator's kernels get fresh ids, so the cache accumulated unbounded (measured 2 -> 1057 across a ~570-test lane) until late, heavy tests hit process-memory pressure and timed out (the committed shared linear memory was flat/small - the module cache was the driver). Fix: when cumulative kernels compiled since the last flush cross `WasmBackend.ModuleCacheFlushThreshold` (default 256; 0 disables), the host instructs the workers to drop their module/instance caches at the next fresh accelerator's FIRST dispatch (safe - that accelerator re-sends its own kernels; the cleared modules are disposed accelerators' dead weight). Bounds peak modules to ~the threshold. Short workloads never reach it -> never flush -> kernels stay fully warm. Diagnostics `WasmAccelerator.TotalKernelsCompiled` / `SharedWasmMemoryPages`; guard `WasmTests.Wasm_ModuleCacheFlush_DoesNotBreakCorrectness` (flushes every accelerator, asserts CPU-oracle).
+- **Wasm: fixed a host-write SNAPSHOT SharedArrayBuffer leak (the real ML-lane heavy-test memory leak).** `WasmMemoryBuffer.PrepareHostWrite` allocates a full-buffer-size SharedArrayBuffer when a host write lands while a dispatch is in flight on that buffer (the lazy copy-out race defense). `CompleteDispatchIntent` removed the snapshot from its tracking dict but **never `Dispose()`d the SharedArrayBuffer** (despite its own doc claiming "that tier's SAB is freed"), and the all-intents-complete path dropped the dict without disposing either - so every materialized snapshot leaked a full-buffer-size JS SharedArrayBuffer. Under a long heavy-workload lane (ML's CopyFromCPU+dispatch pattern) this accumulated to ~1.5 GiB of JS heap, slowing late tests into timeouts (root-caused via a resident-memory trace: heap 154->1644 MiB; worker pool flat, linear memory flat, module cache flat by magnitude). Fix: dispose the snapshot SAB on release + on buffer dispose (`DisposeAllSnapshots`). New diagnostic `WasmMemoryBuffer.LiveSnapshotBytes`; guard `WasmTests.Wasm_HostWriteSnapshot_DoesNotLeakSAB` (deterministically materializes snapshots, asserts the resident bytes return to baseline). Also adds resident-count diagnostics `WasmMemoryBuffer.LiveBufferCount`/`LiveBufferBytes` + `WasmAccelerator.LiveAcceleratorCount`.
 
 ## 4.12.0 (2026-06-13) - Sync/async contract: async-only where it waits/observes, sync for fire-and-forget
 
diff --git a/SpawnDev.ILGPU.Demo/UnitTests/WasmTests.cs b/SpawnDev.ILGPU.Demo/UnitTests/WasmTests.cs
@@ -1847,6 +1847,43 @@ public async Task Wasm_ModuleCacheFlush_DoesNotBreakCorrectness()
             finally { WasmBackend.ModuleCacheFlushThreshold = savedThreshold; }
         }
 
+        // Host-write snapshot SAB leak guard (2026-06-14, Geordi). The lazy host-write snapshot
+        // (WasmMemoryBuffer.PrepareHostWrite) allocates a FULL-buffer-size SharedArrayBuffer when a host
+        // write lands while a dispatch is in flight on that buffer. CompleteDispatchIntent used to
+        // Remove() the snapshot from its dict but NEVER Dispose() the SAB (despite its doc claiming it
+        // did) → every snapshot leaked a full-buffer SAB → the ~1.5 GiB ML-lane late-test JS-heap leak
+        // (Tuvok trio trace). This guard deterministically materializes snapshots (launch a dispatch —
+        // which registers the intent synchronously — then host-write the buffer mid-flight) and asserts
+        // LiveSnapshotBytes returns to baseline after the dispatch completes + buffers dispose.
+        [TestMethod(Timeout = 120000)]
+        public async Task Wasm_HostWriteSnapshot_DoesNotLeakSAB()
+        {
+            const int count = 8192;
+            var context = Context.Create().Wasm().ToContext();
+            var accelerator = await context.CreateWasmAcceleratorAsync();
+            try
+            {
+                long baseline = SpawnDev.ILGPU.Wasm.WasmMemoryBuffer.LiveSnapshotBytes;
+                var data = new int[count];
+                for (int r = 0; r < 8; r++)
+                {
+                    using var buf = accelerator.Allocate1D<int>(count);
+                    var k = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>>(
+                        (i, v) => v[i] = i * 3);
+                    k((Index1D)count, buf.View);          // launch → registers the dispatch intent (synchronous)
+                    buf.View.CopyFromCPU(data);            // host write while in-flight → materializes a snapshot
+                    await accelerator.SynchronizeAsync();  // dispatch completes → snapshot must be Disposed
+                }
+                long leaked = SpawnDev.ILGPU.Wasm.WasmMemoryBuffer.LiveSnapshotBytes - baseline;
+                if (leaked != 0)
+                    throw new Exception(
+                        $"Host-write snapshot SABs leaked {leaked} bytes after dispatch-complete + buffer dispose — " +
+                        $"CompleteDispatchIntent/DisposeAcceleratorObject must Dispose the snapshot SharedArrayBuffers " +
+                        $"(the ML-lane ~1.5 GiB JS-heap leak has regressed).");
+            }
+            finally { accelerator.Dispose(); context.Dispose(); }
+        }
+
         // Wasm SIMD128 emitter foundation (Phase 1, 2026-06-14, Geordi). Pure-CPU regression guard on
         // the v128 encoding — NO browser/GPU needed, just byte assertions. Locks the part most likely
         // to silently break: SIMD sub-opcodes are u32-LEB128 after the 0xFD prefix (NOT single bytes
diff --git a/SpawnDev.ILGPU/SpawnDev.ILGPU.csproj b/SpawnDev.ILGPU/SpawnDev.ILGPU.csproj
@@ -4,7 +4,7 @@
 		<TargetFramework>net10.0</TargetFramework>
 		<ImplicitUsings>enable</ImplicitUsings>
 		<Nullable>enable</Nullable>
-		<Version>4.12.1-local.8</Version>
+		<Version>4.12.1-local.9</Version>
 		<!-- Brief current-version highlights only. Full per-version history with code samples lives in CHANGELOG.md (linked from the README). -->
 		<PackageReleaseNotes>4.12.1: WebGPU cooperative GEMV grid-stride fix; ±inf/NaN scalar kernel params on WebGL+Wasm; AcceleratorRequirements.RequiresScatterStores flag; Wasm process-persistent shared Web Worker pool AND shared linear memory keyed per MaxLinearMemoryPages (default-WorkerCount accelerators share one pool + one WebAssembly.Memory per distinct max per tab, fixing worker-churn starvation and the WebAssembly.Memory-reservation accumulation across long test lanes — at both the default 1 GiB and custom maxes like 2 GiB); Wasm SIMD128 emitter foundation (additive groundwork, scalar path unchanged). Forks stay 2.0.16. Full per-version history with details: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
 		<GeneratePackageOnBuild>True</GeneratePackageOnBuild>
diff --git a/SpawnDev.ILGPU/Wasm/CLAUDE.md b/SpawnDev.ILGPU/Wasm/CLAUDE.md
@@ -114,6 +114,19 @@ see a peer's flush). Short workloads never cross the threshold → never flush 
 library Wasm lane is unaffected). Guard: `WasmTests.Wasm_ModuleCacheFlush_DoesNotBreakCorrectness`
 (threshold=1, CPU-oracle). Diagnostic: `WasmAccelerator.TotalKernelsCompiled`.
 
+**Host-write snapshot SABs MUST be Disposed (2026-06-14 leak fix).** `WasmMemoryBuffer.PrepareHostWrite`
+allocates a FULL-buffer-size `new SharedArrayBuffer` (`WasmMemoryBuffer.cs:87`) when a host write lands
+while a dispatch is in flight on that buffer (the lazy copy-out-race snapshot). This is a SECOND SAB path
+distinct from the buffer's primary `SharedBuffer` — easy to miss. The original `CompleteDispatchIntent`
+Remove()'d the snapshot from `_snapshotsByHWC` but NEVER `Dispose()`d the SAB (its doc lied: "that tier's
+SAB is freed"), and the intents==0 path nulled the dict without disposing → every snapshot leaked a
+full-buffer JS SharedArrayBuffer → ~1.5 GiB across the ML lane → late-test timeouts (root-caused by a
+resident-memory trace; it was invisible to a primary-SAB byte counter). RULE: any `new SharedArrayBuffer`
+(or JSObject) created here must be `.Dispose()`d on EVERY exit path (release + buffer dispose) —
+`DisposeAllSnapshots()` does it now. Diagnostic `WasmMemoryBuffer.LiveSnapshotBytes`; guard
+`WasmTests.Wasm_HostWriteSnapshot_DoesNotLeakSAB`. Lesson: a monotonic counter is not a memory proxy, and
+a buffer's primary-SAB counter won't see a SECOND SAB path — measure the actual resident bytes per path.
+
 ## Process-static SHARED linear memory (2026-06-14) — `s_sharedWasmMemory`
 
 The memory analog of the worker pool, and the **second half** the pool fix unmasked. A
diff --git a/SpawnDev.ILGPU/Wasm/WasmMemoryBuffer.cs b/SpawnDev.ILGPU/Wasm/WasmMemoryBuffer.cs
@@ -90,6 +90,7 @@ internal void PrepareHostWrite()
             dst.JSRef!.CallVoid("set", src);
             _snapshotsByHWC[hwcKey] = fresh;
             _snapshotRefCounts[hwcKey] = _pendingSnapshotIntents; // every pending intent shares this tier
+            s_liveSnapshotBytes += LengthInBytes; // resident-SAB diagnostic (see LiveSnapshotBytes)
         }
 
         /// <summary>
@@ -198,7 +199,18 @@ internal void CompleteDispatchIntent(int queueTimeHostWriteCounter)
                 if (rc <= 0)
                 {
                     _snapshotRefCounts.Remove(queueTimeHostWriteCounter);
-                    _snapshotsByHWC?.Remove(queueTimeHostWriteCounter);
+                    // BUGFIX 2026-06-14: actually free the snapshot SAB. The previous code Remove()'d
+                    // the dict entry but never Disposed the SharedArrayBuffer (despite this method's
+                    // doc claiming "that tier's SAB is freed"), so every full-buffer-size host-write
+                    // snapshot leaked in the JS heap — the ~1.5 GiB ML-lane late-test leak (Tuvok trace
+                    // 2026-06-14; invisible to LiveBufferBytes because it's a separate SAB path).
+                    if (_snapshotsByHWC != null
+                        && _snapshotsByHWC.TryGetValue(queueTimeHostWriteCounter, out var doneSnap))
+                    {
+                        _snapshotsByHWC.Remove(queueTimeHostWriteCounter);
+                        doneSnap.Dispose();
+                        s_liveSnapshotBytes -= LengthInBytes;
+                    }
                 }
                 else
                 {
@@ -207,11 +219,27 @@ internal void CompleteDispatchIntent(int queueTimeHostWriteCounter)
             }
             if (_pendingSnapshotIntents == 0)
             {
+                // Dispose any snapshot SABs still resident (tiers not released by refcount) before
+                // dropping the dicts — otherwise their JS SharedArrayBuffers leak (same root bug).
+                DisposeAllSnapshots();
                 _snapshotsByHWC = null;
                 _snapshotRefCounts = null;
             }
         }
 
+        /// <summary>Disposes every snapshot SAB still in <see cref="_snapshotsByHWC"/> and clears the
+        /// resident-bytes accounting. Idempotent. Called on the last-intent-completes path and on Dispose.</summary>
+        private void DisposeAllSnapshots()
+        {
+            if (_snapshotsByHWC == null) return;
+            foreach (var kv in _snapshotsByHWC)
+            {
+                kv.Value.Dispose();
+                s_liveSnapshotBytes -= LengthInBytes;
+            }
+            _snapshotsByHWC.Clear();
+        }
+
         /// <summary>
         /// Returns the snapshot tier matching the dispatch's queue-time HWC, or
         /// null if no host write has clobbered SharedBuffer since the dispatch
@@ -531,12 +559,18 @@ protected override void CopyFromBuffer(
         // (TotalKernelsCompiled) is NOT a memory proxy; these are the real resident measure.
         private static int s_liveBufferCount;
         private static long s_liveBufferBytes;
+        private static long s_liveSnapshotBytes;
         private readonly int _liveBytes;
         private bool _liveCounted;
         /// <summary>Number of WasmMemoryBuffers currently alive (constructed, not yet disposed).</summary>
         public static int LiveBufferCount => s_liveBufferCount;
-        /// <summary>Total resident bytes across all live WasmMemoryBuffers (≈ SharedArrayBuffer bytes held).</summary>
+        /// <summary>Total resident bytes across all live WasmMemoryBuffers' primary SharedArrayBuffers.</summary>
         public static long LiveBufferBytes => s_liveBufferBytes;
+        /// <summary>Total resident bytes of host-write SNAPSHOT SharedArrayBuffers (the SEPARATE SAB path —
+        /// full-buffer-size copies materialized by <see cref="PrepareHostWrite"/>). This was the invisible
+        /// ML-lane leak: snapshots were removed from the dict but never Disposed. If this climbs across a
+        /// lane, snapshot SABs are leaking; it should now return to ~0 between tests.</summary>
+        public static long LiveSnapshotBytes => s_liveSnapshotBytes;
 
         /// <inheritdoc/>
         protected override void DisposeAcceleratorObject(bool disposing)
@@ -545,6 +579,11 @@ protected override void DisposeAcceleratorObject(bool disposing)
             {
                 TypedArrayView?.Dispose();
                 SharedBuffer?.Dispose();
+                // Free any snapshot SABs still resident (buffer disposed with pending host-write
+                // snapshots) — otherwise their full-buffer-size JS SharedArrayBuffers leak.
+                DisposeAllSnapshots();
+                _snapshotsByHWC = null;
+                _snapshotRefCounts = null;
             }
             // Decrement the resident counters exactly once (dispose may run on the finalizer path too).
             if (_liveCounted)