Commit 3784cfb

LostBeard and claude committed
Wasm: fix host-write snapshot SharedArrayBuffer leak (the ML-lane heavy-test memory leak)
Root-caused via a RESIDENT-memory trace (Tuvok): the ML Wasm heavy-test late-lane timeouts are a JS-heap leak (usedJSHeapSize 154->1644 MiB across the lane; worker pool flat, linear memory flat, module cache flat by magnitude).

Reading the code found a SECOND SharedArrayBuffer path my earlier byte counter never saw: WasmMemoryBuffer.PrepareHostWrite (WasmMemoryBuffer.cs:87) allocates a FULL-buffer-size `new SharedArrayBuffer` snapshot when a host write lands while a dispatch is in flight on that buffer (the lazy copy-out-race defense). CompleteDispatchIntent Remove()'d the snapshot from _snapshotsByHWC but NEVER Disposed the SharedArrayBuffer -- despite the method's own doc claiming "that tier's SAB is freed" -- and the all-intents-complete path nulled the dict without disposing. So every materialized snapshot leaked a full-buffer JS SAB; ML's CopyFromCPU+dispatch pattern materializes them constantly -> ~1.5 GiB.

Fix: Dispose the snapshot SAB on release (the free the doc always promised) + DisposeAllSnapshots() on the intents==0 path and on buffer Dispose (buffer torn down with pending snapshots).

New diagnostic WasmMemoryBuffer.LiveSnapshotBytes (the previously-invisible snapshot-SAB resident bytes). Regression guard WasmTests.Wasm_HostWriteSnapshot_DoesNotLeakSAB deterministically materializes snapshots (launch registers the dispatch intent synchronously, then host-write mid-flight) and asserts LiveSnapshotBytes returns to baseline after dispatch-complete + buffer dispose. Also keeps the local.8 resident-count diagnostics (LiveBufferCount/Bytes, LiveAcceleratorCount).

Gate: PMT_FILTER=WasmTests 521/0/17 (guard green = the dispose works on the deterministic trigger; large RadixSorts faster: 1.4M 11.7s, 4M 33s). Version 4.12.1-local.9 (forks stay 2.0.16). Re-verify pending on Tuvok's full ML lane (expect LiveSnapshotBytes flat-near-0, heap flat, the 8 timeouts gone).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 196ca43 commit 3784cfb
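
The release logic described in the commit message can be sketched in isolation. This is a minimal model under stated assumptions, not the SpawnDev.ILGPU code: `FakeSab` and `SnapshotTable` are hypothetical names, a plain static counter stands in for the JS SharedArrayBuffer, and only the "drop the last reference, then dispose" shape mirrors the fix.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a JS-backed SharedArrayBuffer handle (not the SpawnDev type).
public sealed class FakeSab : IDisposable
{
    public static long LiveBytes;   // resident bytes across all undisposed instances
    private readonly long _bytes;
    private bool _disposed;

    public FakeSab(long bytes) { _bytes = bytes; LiveBytes += bytes; }

    public void Dispose()
    {
        if (_disposed) return;      // idempotent: a double dispose must not double-count
        _disposed = true;
        LiveBytes -= _bytes;
    }
}

// Hypothetical, simplified model of ref-counted snapshot tiers keyed by a host-write counter.
public sealed class SnapshotTable : IDisposable
{
    private readonly Dictionary<int, FakeSab> _snapshots = new();
    private readonly Dictionary<int, int> _refCounts = new();

    // A host write during an in-flight dispatch materializes a tier shared by all pending intents.
    public void Materialize(int key, long bytes, int pendingIntents)
    {
        _snapshots[key] = new FakeSab(bytes);
        _refCounts[key] = pendingIntents;
    }

    // One dispatch intent finished; the tier is freed when its last reference goes away.
    public void Release(int key)
    {
        if (!_refCounts.TryGetValue(key, out var rc)) return;
        rc--;
        if (rc > 0) { _refCounts[key] = rc; return; }
        _refCounts.Remove(key);
        // Removing the dictionary entry alone would leak the handle; dispose what was removed.
        if (_snapshots.Remove(key, out var snap)) snap.Dispose();
    }

    // Teardown with tiers still pending: dispose everything that is left.
    public void Dispose()
    {
        foreach (var snap in _snapshots.Values) snap.Dispose();
        _snapshots.Clear();
        _refCounts.Clear();
    }
}
```

In this sketch `Dictionary.Remove(key, out value)` returns the removed handle so that removal and disposal cannot drift apart, which is the failure mode the commit describes.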

5 files changed

Lines changed: 93 additions & 3 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ Wrapper-only (forks stay **2.0.16**). Adds a new selection-gate capability flag:
 - **Wasm: process-static shared linear memory** (the second half the persistent worker pool unmasked). A `new WebAssembly.Memory({ shared: true })` reserves its full `maximum` (default 16384 pages = 1 GiB) of virtual address space at construction and can never relocate, so each default accelerator that built its own memory burned a full 1 GiB reservation. Before the persistent pool, `Worker.terminate()` per accelerator Dispose dropped the workers' references so the old reservation was freed/GC'd per test; with the persistent pool the workers pin the last memory they instantiated against (until they next swap), so across a ~569-test lane the per-accelerator memories accumulated up to `workerCount` live 1 GiB reservations until V8's address-space cap was hit and the `new WebAssembly.Memory(...)` constructor threw `could not allocate memory`. Default-WorkerCount accelerators now share a process-static `WebAssembly.Memory` keyed by their `MaxLinearMemoryPages` (`WasmAccelerator.s_sharedByMaxPages`) - ONE shared memory per distinct max value, grown to the lane high-water and never re-created -> a single reservation per max-group. Safe because the linear memory is per-dispatch transient working/staging memory (zero region -> copy-in -> run -> copy-out; no cross-accelerator state); a per-group `SemaphoreSlim` serializes that group's dispatch window across concurrently-alive accelerators (zero-cost in the sequential case). Keyed by max because the kernel module declares its memory-import maximum = its own `MaxLinearMemoryPages` and the spec requires the supplied memory's maximum equal the module's declared maximum - so all 16384 accelerators share a 16384 memory, all 32768 (e.g. ML's DA3-Small at 2 GiB) share a 32768 memory, etc. An explicit-`WorkerCount` accelerator (oversubscription stress tests, which want worker isolation) keeps a private memory. Bonus: with persistent workers and a persistent memory the buffer only changes on `grow()`, so after high-water the workers stop re-instantiating kernels entirely (the per-test new-memory churn is gone). (Originally only the default 16384 was shared, which missed the ML test lane's ~569 accelerators at a custom 32768 max - they re-accumulated the leak at 2 GiB each; generalized to per-max.) Diagnostics `WasmAccelerator.SharedWasmMemoryCreateCount` / `SharedWasmMemoryPages` (summed across groups); locked by `WasmTests.Wasm_SharedLinearMemory_PersistsAndStaysBoundedAcrossAccelerators` (default max) + `Wasm_SharedLinearMemory_CustomMaxPages_AlsoBounded` (custom max).
 - **Wasm SIMD128 emitter foundation (Phase 1 of the SIMD port).** Additive groundwork only - no production kernel emits v128 yet, so the scalar path is byte-identical. Adds the v128 value type and the 0xFD-prefixed SIMD opcode set to `WasmOpCodes` (spec-verified; sub-opcodes are u32-LEB128 after the prefix, so multi-byte ones like `f32x4.add`=228 encode correctly), v128 emit helpers in `WasmModuleBuilder` (`EmitSimd`/`EmitSimdMem`/`EmitSimdLane`/`EmitV128Const`/`EmitI8x16Shuffle`), and the runtime SIMD capability surface: `WasmBackend.RuntimeSupportsWasmSimd` (via `System.Runtime.Intrinsics.Wasm.PackedSimd.IsSupported` - if the running Blazor WASM build has SIMD enabled, the browser/workers accept v128), `ForceScalar`/`ForceSimd` test overrides, `EffectiveWasmSimd`, `WasmCapabilityContext.WasmSimd`, and `WasmAccelerator.SupportsSimd`. **Non-SIMD devices stay first-class forever** (the scalar path is a supported mode, not a deprecated fallback - real hardware/browsers without wasm SIMD are common; see the dual-build technique in `BlazorWASMSIMDDetectExample`). Verified by the offline `DemoConsole -- wasm-simd-probe`: a hand-built v128 module is `wasm-validate`-clean and `wasm2wat`-decodes to the intended instructions.
 - **Wasm: bound the persistent-worker module cache (late-lane memory-pressure fix).** The process-persistent worker pool keeps every distinct kernel's compiled `WebAssembly.Module` in a per-worker cache (`_modulesById`) for the tab's life. Across a long test lane each per-test accelerator's kernels get fresh ids, so the cache accumulated unbounded (measured 2 -> 1057 across a ~570-test lane) until late, heavy tests hit process-memory pressure and timed out (the committed shared linear memory was flat/small - the module cache was the driver). Fix: when cumulative kernels compiled since the last flush cross `WasmBackend.ModuleCacheFlushThreshold` (default 256; 0 disables), the host instructs the workers to drop their module/instance caches at the next fresh accelerator's FIRST dispatch (safe - that accelerator re-sends its own kernels; the cleared modules are disposed accelerators' dead weight). Bounds peak modules to ~the threshold. Short workloads never reach it -> never flush -> kernels stay fully warm. Diagnostics `WasmAccelerator.TotalKernelsCompiled` / `SharedWasmMemoryPages`; guard `WasmTests.Wasm_ModuleCacheFlush_DoesNotBreakCorrectness` (flushes every accelerator, asserts CPU-oracle).
+- **Wasm: fixed a host-write SNAPSHOT SharedArrayBuffer leak (the real ML-lane heavy-test memory leak).** `WasmMemoryBuffer.PrepareHostWrite` allocates a full-buffer-size SharedArrayBuffer when a host write lands while a dispatch is in flight on that buffer (the lazy copy-out race defense). `CompleteDispatchIntent` removed the snapshot from its tracking dict but **never `Dispose()`d the SharedArrayBuffer** (despite its own doc claiming "that tier's SAB is freed"), and the all-intents-complete path dropped the dict without disposing either - so every materialized snapshot leaked a full-buffer-size JS SharedArrayBuffer. Under a long heavy-workload lane (ML's CopyFromCPU+dispatch pattern) this accumulated to ~1.5 GiB of JS heap, slowing late tests into timeouts (root-caused via a resident-memory trace: heap 154->1644 MiB; worker pool flat, linear memory flat, module cache flat by magnitude). Fix: dispose the snapshot SAB on release + on buffer dispose (`DisposeAllSnapshots`). New diagnostic `WasmMemoryBuffer.LiveSnapshotBytes`; guard `WasmTests.Wasm_HostWriteSnapshot_DoesNotLeakSAB` (deterministically materializes snapshots, asserts the resident bytes return to baseline). Also adds resident-count diagnostics `WasmMemoryBuffer.LiveBufferCount`/`LiveBufferBytes` + `WasmAccelerator.LiveAcceleratorCount`.

 ## 4.12.0 (2026-06-13) - Sync/async contract: async-only where it waits/observes, sync for fire-and-forget

SpawnDev.ILGPU.Demo/UnitTests/WasmTests.cs

Lines changed: 37 additions & 0 deletions
@@ -1847,6 +1847,43 @@ public async Task Wasm_ModuleCacheFlush_DoesNotBreakCorrectness()
         finally { WasmBackend.ModuleCacheFlushThreshold = savedThreshold; }
     }

+    // Host-write snapshot SAB leak guard (2026-06-14, Geordi). The lazy host-write snapshot
+    // (WasmMemoryBuffer.PrepareHostWrite) allocates a FULL-buffer-size SharedArrayBuffer when a host
+    // write lands while a dispatch is in flight on that buffer. CompleteDispatchIntent used to
+    // Remove() the snapshot from its dict but NEVER Dispose() the SAB (despite its doc claiming it
+    // did) → every snapshot leaked a full-buffer SAB → the ~1.5 GiB ML-lane late-test JS-heap leak
+    // (Tuvok trio trace). This guard deterministically materializes snapshots (launch a dispatch —
+    // which registers the intent synchronously — then host-write the buffer mid-flight) and asserts
+    // LiveSnapshotBytes returns to baseline after the dispatch completes + buffers dispose.
+    [TestMethod(Timeout = 120000)]
+    public async Task Wasm_HostWriteSnapshot_DoesNotLeakSAB()
+    {
+        const int count = 8192;
+        var context = Context.Create().Wasm().ToContext();
+        var accelerator = await context.CreateWasmAcceleratorAsync();
+        try
+        {
+            long baseline = SpawnDev.ILGPU.Wasm.WasmMemoryBuffer.LiveSnapshotBytes;
+            var data = new int[count];
+            for (int r = 0; r < 8; r++)
+            {
+                using var buf = accelerator.Allocate1D<int>(count);
+                var k = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>>(
+                    (i, v) => v[i] = i * 3);
+                k((Index1D)count, buf.View); // launch → registers the dispatch intent (synchronous)
+                buf.View.CopyFromCPU(data); // host write while in-flight → materializes a snapshot
+                await accelerator.SynchronizeAsync(); // dispatch completes → snapshot must be Disposed
+            }
+            long leaked = SpawnDev.ILGPU.Wasm.WasmMemoryBuffer.LiveSnapshotBytes - baseline;
+            if (leaked != 0)
+                throw new Exception(
+                    $"Host-write snapshot SABs leaked {leaked} bytes after dispatch-complete + buffer dispose — " +
+                    $"CompleteDispatchIntent/DisposeAcceleratorObject must Dispose the snapshot SharedArrayBuffers " +
+                    $"(the ML-lane ~1.5 GiB JS-heap leak has regressed).");
+        }
+        finally { accelerator.Dispose(); context.Dispose(); }
+    }
+
     // Wasm SIMD128 emitter foundation (Phase 1, 2026-06-14, Geordi). Pure-CPU regression guard on
     // the v128 encoding — NO browser/GPU needed, just byte assertions. Locks the part most likely
     // to silently break: SIMD sub-opcodes are u32-LEB128 after the 0xFD prefix (NOT single bytes

SpawnDev.ILGPU/SpawnDev.ILGPU.csproj

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
     <TargetFramework>net10.0</TargetFramework>
     <ImplicitUsings>enable</ImplicitUsings>
     <Nullable>enable</Nullable>
-    <Version>4.12.1-local.8</Version>
+    <Version>4.12.1-local.9</Version>
     <!-- Brief current-version highlights only. Full per-version history with code samples lives in CHANGELOG.md (linked from the README). -->
     <PackageReleaseNotes>4.12.1: WebGPU cooperative GEMV grid-stride fix; ±inf/NaN scalar kernel params on WebGL+Wasm; AcceleratorRequirements.RequiresScatterStores flag; Wasm process-persistent shared Web Worker pool AND shared linear memory keyed per MaxLinearMemoryPages (default-WorkerCount accelerators share one pool + one WebAssembly.Memory per distinct max per tab, fixing worker-churn starvation and the WebAssembly.Memory-reservation accumulation across long test lanes — at both the default 1 GiB and custom maxes like 2 GiB); Wasm SIMD128 emitter foundation (additive groundwork, scalar path unchanged). Forks stay 2.0.16. Full per-version history with details: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
     <GeneratePackageOnBuild>True</GeneratePackageOnBuild>

SpawnDev.ILGPU/Wasm/CLAUDE.md

Lines changed: 13 additions & 0 deletions
@@ -114,6 +114,19 @@ see a peer's flush). Short workloads never cross the threshold → never flush
 library Wasm lane is unaffected). Guard: `WasmTests.Wasm_ModuleCacheFlush_DoesNotBreakCorrectness`
 (threshold=1, CPU-oracle). Diagnostic: `WasmAccelerator.TotalKernelsCompiled`.

+**Host-write snapshot SABs MUST be Disposed (2026-06-14 leak fix).** `WasmMemoryBuffer.PrepareHostWrite`
+allocates a FULL-buffer-size `new SharedArrayBuffer` (`WasmMemoryBuffer.cs:87`) when a host write lands
+while a dispatch is in flight on that buffer (the lazy copy-out-race snapshot). This is a SECOND SAB path
+distinct from the buffer's primary `SharedBuffer` — easy to miss. The original `CompleteDispatchIntent`
+Remove()'d the snapshot from `_snapshotsByHWC` but NEVER `Dispose()`d the SAB (its doc lied: "that tier's
+SAB is freed"), and the intents==0 path nulled the dict without disposing → every snapshot leaked a
+full-buffer JS SharedArrayBuffer → ~1.5 GiB across the ML lane → late-test timeouts (root-caused by a
+resident-memory trace; it was invisible to a primary-SAB byte counter). RULE: any `new SharedArrayBuffer`
+(or JSObject) created here must be `.Dispose()`d on EVERY exit path (release + buffer dispose) —
+`DisposeAllSnapshots()` does it now. Diagnostic `WasmMemoryBuffer.LiveSnapshotBytes`; guard
+`WasmTests.Wasm_HostWriteSnapshot_DoesNotLeakSAB`. Lesson: a monotonic counter is not a memory proxy, and
+a buffer's primary-SAB counter won't see a SECOND SAB path — measure the actual resident bytes per path.
+
 ## Process-static SHARED linear memory (2026-06-14) — `s_sharedWasmMemory`

 The memory analog of the worker pool, and the **second half** the pool fix unmasked. A

SpawnDev.ILGPU/Wasm/WasmMemoryBuffer.cs

Lines changed: 41 additions & 2 deletions
@@ -90,6 +90,7 @@ internal void PrepareHostWrite()
         dst.JSRef!.CallVoid("set", src);
         _snapshotsByHWC[hwcKey] = fresh;
         _snapshotRefCounts[hwcKey] = _pendingSnapshotIntents; // every pending intent shares this tier
+        s_liveSnapshotBytes += LengthInBytes; // resident-SAB diagnostic (see LiveSnapshotBytes)
     }

     /// <summary>
@@ -198,7 +199,18 @@ internal void CompleteDispatchIntent(int queueTimeHostWriteCounter)
             if (rc <= 0)
             {
                 _snapshotRefCounts.Remove(queueTimeHostWriteCounter);
-                _snapshotsByHWC?.Remove(queueTimeHostWriteCounter);
+                // BUGFIX 2026-06-14: actually free the snapshot SAB. The previous code Remove()'d
+                // the dict entry but never Disposed the SharedArrayBuffer (despite this method's
+                // doc claiming "that tier's SAB is freed"), so every full-buffer-size host-write
+                // snapshot leaked in the JS heap — the ~1.5 GiB ML-lane late-test leak (Tuvok trace
+                // 2026-06-14; invisible to LiveBufferBytes because it's a separate SAB path).
+                if (_snapshotsByHWC != null
+                    && _snapshotsByHWC.TryGetValue(queueTimeHostWriteCounter, out var doneSnap))
+                {
+                    _snapshotsByHWC.Remove(queueTimeHostWriteCounter);
+                    doneSnap.Dispose();
+                    s_liveSnapshotBytes -= LengthInBytes;
+                }
             }
             else
             {
@@ -207,11 +219,27 @@ internal void CompleteDispatchIntent(int queueTimeHostWriteCounter)
         }
         if (_pendingSnapshotIntents == 0)
         {
+            // Dispose any snapshot SABs still resident (tiers not released by refcount) before
+            // dropping the dicts — otherwise their JS SharedArrayBuffers leak (same root bug).
+            DisposeAllSnapshots();
             _snapshotsByHWC = null;
             _snapshotRefCounts = null;
         }
     }

+    /// <summary>Disposes every snapshot SAB still in <see cref="_snapshotsByHWC"/> and clears the
+    /// resident-bytes accounting. Idempotent. Called on the last-intent-completes path and on Dispose.</summary>
+    private void DisposeAllSnapshots()
+    {
+        if (_snapshotsByHWC == null) return;
+        foreach (var kv in _snapshotsByHWC)
+        {
+            kv.Value.Dispose();
+            s_liveSnapshotBytes -= LengthInBytes;
+        }
+        _snapshotsByHWC.Clear();
+    }
+
     /// <summary>
     /// Returns the snapshot tier matching the dispatch's queue-time HWC, or
     /// null if no host write has clobbered SharedBuffer since the dispatch
@@ -531,12 +559,18 @@ protected override void CopyFromBuffer(
     // (TotalKernelsCompiled) is NOT a memory proxy; these are the real resident measure.
     private static int s_liveBufferCount;
     private static long s_liveBufferBytes;
+    private static long s_liveSnapshotBytes;
     private readonly int _liveBytes;
     private bool _liveCounted;
     /// <summary>Number of WasmMemoryBuffers currently alive (constructed, not yet disposed).</summary>
     public static int LiveBufferCount => s_liveBufferCount;
-    /// <summary>Total resident bytes across all live WasmMemoryBuffers (≈ SharedArrayBuffer bytes held).</summary>
+    /// <summary>Total resident bytes across all live WasmMemoryBuffers' primary SharedArrayBuffers.</summary>
     public static long LiveBufferBytes => s_liveBufferBytes;
+    /// <summary>Total resident bytes of host-write SNAPSHOT SharedArrayBuffers (the SEPARATE SAB path —
+    /// full-buffer-size copies materialized by <see cref="PrepareHostWrite"/>). This was the invisible
+    /// ML-lane leak: snapshots were removed from the dict but never Disposed. If this climbs across a
+    /// lane, snapshot SABs are leaking; it should now return to ~0 between tests.</summary>
+    public static long LiveSnapshotBytes => s_liveSnapshotBytes;

     /// <inheritdoc/>
     protected override void DisposeAcceleratorObject(bool disposing)
@@ -545,6 +579,11 @@ protected override void DisposeAcceleratorObject(bool disposing)
         {
             TypedArrayView?.Dispose();
             SharedBuffer?.Dispose();
+            // Free any snapshot SABs still resident (buffer disposed with pending host-write
+            // snapshots) — otherwise their full-buffer-size JS SharedArrayBuffers leak.
+            DisposeAllSnapshots();
+            _snapshotsByHWC = null;
+            _snapshotRefCounts = null;
         }
         // Decrement the resident counters exactly once (dispose may run on the finalizer path too).
         if (_liveCounted)
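
The regression guard in this commit follows a baseline-then-delta pattern: read a resident-bytes counter, run the workload, and fail if the counter did not return to where it started. A generic version of that pattern might look like the sketch below. `ResidentGuard` and its parameters are hypothetical names for illustration and are not part of SpawnDev.ILGPU; the counter is passed in as a delegate so the helper makes no assumption about which diagnostic it reads.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: asserts that a resident-bytes counter returns to its baseline.
public static class ResidentGuard
{
    public static async Task AssertReturnsToBaselineAsync(
        Func<long> readCounter,   // e.g. a lambda that reads a static diagnostic property
        Func<Task> workload,      // allocates and is expected to release everything it allocated
        string what)              // label used in the failure message
    {
        long baseline = readCounter();      // capture before, so earlier residue is not blamed here
        await workload();
        long delta = readCounter() - baseline;
        if (delta != 0)
            throw new InvalidOperationException(
                $"{what} did not return to baseline: {delta} bytes still resident.");
    }
}
```

Comparing against a captured baseline rather than against zero keeps the check valid when other live objects already contribute to the counter, which is why the test in this commit subtracts `baseline` instead of asserting `LiveSnapshotBytes == 0`.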
