Wasm: process-static shared linear memory (fix reservation accumulation)

LostBeard · claude · LostBeard · commit c8d043f9f1e8 · 2026-06-14T08:54:46.000-04:00
The second half the persistent worker pool (4f93830) unmasked. Each default WasmAccelerator built its own `new WebAssembly.Memory({maximum:16384, shared:true})`. A shared WebAssembly.Memory reserves its FULL maximum (1 GiB) of virtual address space at construction and can never relocate. Before the persistent pool, Worker.terminate() per accelerator Dispose dropped the workers' references so the old reservation was freed/GC'd per test; with persistent workers each worker now PINS the last memory it instantiated against (_lastMemoryBuffer + _instancesById) until it next swaps. Across the ~569-test PMT Wasm lane the per-accelerator memories accumulated up to workerCount live 1 GiB reservations (plus JS-GC lag) until V8's address-space cap was hit and the `new WebAssembly.Memory()` CONSTRUCTOR threw "could not allocate memory" (Tuvok's 88 RangeErrors on the ML full Wasm lane). The constructor failing (not grow() -> our OutOfMemoryException) is the tell: reservation accumulation, not single-memory high-water.

Fix (memory analog of the worker-pool fix): default accelerators share ONE process-static s_sharedWasmMemory per tab, grown to the lane high-water and never re-created -> a single reservation. The linear memory is per-dispatch transient working/staging memory (zero region -> copy-IN -> run -> copy-OUT; no cross-accelerator state), so sharing is correct. Bonus perf: with persistent workers AND one persistent memory the buffer only changes on grow(), so after high-water the workers stop re-instantiating kernels entirely (the per-test new-memory -> instance-cache-clear churn is gone).

- Routing: UsesSharedMemory = _useSharedPool && _maxLinearMemoryPages == 16384. The create/grow/reuse block + Dispose read/write through CachedWasmMemory / CachedMemoryBuffer / CachedWasmPages properties that pick static-vs-instance backing. A custom-MaxLinearMemoryPages accelerator (e.g. ML DA3 at 32768) or explicit-WorkerCount accelerator keeps a PRIVATE memory -- required, since the kernel module declares its import max = its own MaxLinearMemoryPages and the spec needs supplied-max <= module-declared-max (a 16384 memory can't back 32768 modules or vice-versa); these are rare/long-lived so no accumulation.
- Concurrency: s_sharedMemoryGate (SemaphoreSlim(1,1)) serializes the shared-memory dispatch window (acquire -> zero -> copy-IN -> exec -> copy-OUT) across concurrently-alive default accelerators (overlapping region-[0..) writes on one memory would corrupt). Within one accelerator dispatches already serialize via _pendingWork; uncontended/zero-cost in the sequential PMT/production case.
- Lifetime: never disposed on accelerator Dispose (shared accelerators leave the instance handles null); torn down only via DisposeSharedWasmMemory() (also called by DisposeSharedWorkerPool()) or tab teardown.

Diagnostics: SharedWasmMemoryCreateCount (stays 1 across the lane), SharedWasmMemoryPages.
Regression guard Wasm_SharedLinearMemory_PersistsAndStaysBoundedAcrossAccelerators (<=1 construction across K accelerators + correct-on-reuse + explicit-count isolation).
Gate: PMT_FILTER=WasmTests 517/0/17.
Version 4.12.1-local.4 (forks stay 2.0.16).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
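
The core of the fix described in the message above (one process-wide memory, constructed once, grown to the high-water mark and never re-created) can be sketched in plain JavaScript against the `WebAssembly.Memory` API. This is a rough illustration only, not the C# code in `WasmAccelerator.cs`: `acquireSharedMemory`, `sharedMemoryPages`, `sharedMemory`, `createCount` and `MAX_PAGES` are hypothetical names standing in for the create/grow/reuse block, `s_sharedWasmMemory` and `SharedWasmMemoryCreateCount`, and the maximum is set to 64 pages here (rather than the commit's 16384) so the sketch does not itself reserve 1 GiB.

```javascript
// Sketch under stated assumptions: every name here is hypothetical.
const PAGE_BYTES = 65536; // WebAssembly page size (64 KiB)
const MAX_PAGES = 64;     // small on purpose; the commit's default maximum is 16384 pages

let sharedMemory = null;  // stand-in for the process-static s_sharedWasmMemory
let createCount = 0;      // stand-in for SharedWasmMemoryCreateCount

// Returns the one shared memory, constructing it on first use and growing it
// (never replacing it) when a caller needs more pages than it currently has.
function acquireSharedMemory(pagesNeeded) {
  if (pagesNeeded > MAX_PAGES) {
    throw new RangeError(`requested ${pagesNeeded} pages, maximum is ${MAX_PAGES}`);
  }
  if (sharedMemory === null) {
    sharedMemory = new WebAssembly.Memory({
      initial: pagesNeeded,
      maximum: MAX_PAGES,
      shared: true,
    });
    createCount++;
    return sharedMemory;
  }
  const currentPages = sharedMemory.buffer.byteLength / PAGE_BYTES;
  if (pagesNeeded > currentPages) {
    // grow() takes a page delta, not a target size.
    sharedMemory.grow(pagesNeeded - currentPages);
  }
  return sharedMemory;
}

// Current size in pages; re-reads .buffer because the length is only
// up to date on a buffer obtained after the last grow().
function sharedMemoryPages() {
  return sharedMemory === null ? 0 : sharedMemory.buffer.byteLength / PAGE_BYTES;
}

// Five simulated "accelerators" with different per-dispatch needs.
for (const pages of [2, 8, 4, 8, 1]) {
  acquireSharedMemory(pages);
}
console.log(createCount, sharedMemoryPages());
```

In this sketch the five simulated accelerators trigger a single construction and the memory ends at the largest request (8 pages), which mirrors the "at most one construction across K accelerators" invariant the regression test checks. The sketch omits the dispatch gate and the private-memory path for custom-max or explicit-`WorkerCount` accelerators. In a browser, shared memories additionally require a cross-origin-isolated page.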
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,6 +9,7 @@ Wrapper-only (forks stay **2.0.16**). Adds a new selection-gate capability flag:
 - **`AcceleratorRequirements.RequiresScatterStores`** (rules out WebGL) - declare it when a kernel writes a computed/arbitrary output index (`out[someIndex] = ...`) or more than one element of one buffer per thread that isn't the consecutive `v*storeCount+slot` layout. WebGL Transform-Feedback captures one output record per vertex at the thread's own slot (gather-only), so in-kernel scatter can't run there; the flag filters WebGL at `EnumerateCompatibleDevices` / `CreatePreferredAccelerator` / `Satisfies` time. (WebGL still scatters at the host/algorithm layer - e.g. RadixSort via render-to-texture.)
 - A compile-time fail-loud guard for this class (mirroring the atomics/barriers/Scan throws) was prototyped and backed out - the blunt criterion false-positived on legitimate positional multi-store + grid-stride-loop kernels. The correct codegen-level criterion is a tracked open item (`Plans/webgl-multistore-fail-loud-guard-plan-2026-06-13.md`). For now use the selection flag.
 - **Wasm: process-persistent shared Web Worker pool.** The Web Worker pool is now process-static (`WasmAccelerator.s_sharedWorkerPool`) - created once per tab and reused across every default-WorkerCount accelerator - instead of being created and `terminate()`d per accelerator. `Worker.terminate()` is an asynchronous browser signal, so a fresh-accelerator-per-test pattern (PMT's ~531-test Wasm lane) spun up a new `hardwareConcurrency` pool while the previous pool's threads were still winding down -> transient worker oversubscription that compounded across the lane -> the pure-spin barrier couldn't schedule all workers in its window -> compute-heavy tests starved and timed out late while light tests stayed fast. The shared pool removes both the terminate churn and the per-test re-create cost. Safe across accelerators: the worker-side module-cache key is a process-static monotonic id (no cross-accelerator collision), a memory-buffer change invalidates a reused worker's cached instances, and each accelerator detaches its handlers on Dispose. Bounded at ~`hardwareConcurrency-2`: an explicit `WorkerCount` (oversubscription stress tests) keeps a private pool, and a worker still checked out at an abnormal Dispose is terminated+removed rather than stranded. Locked by `WasmTests.Wasm_SharedWorkerPool_PersistsAndStaysBoundedAcrossAccelerators`. Gate: `PMT_FILTER=WasmTests` 516/0/17.
+- **Wasm: process-static shared linear memory** (the second half the persistent worker pool unmasked). A `new WebAssembly.Memory({ shared: true })` reserves its full `maximum` (default 16384 pages = 1 GiB) of virtual address space at construction and can never relocate, so each default accelerator that built its own memory burned a full 1 GiB reservation. Before the persistent pool, `Worker.terminate()` per accelerator Dispose dropped the workers' references so the old reservation was freed/GC'd per test; with the persistent pool the workers pin the last memory they instantiated against (until they next swap), so across a ~569-test lane the per-accelerator memories accumulated up to `workerCount` live 1 GiB reservations until V8's address-space cap was hit and the `new WebAssembly.Memory(...)` constructor threw `could not allocate memory`. Default-WorkerCount accelerators now share ONE process-static `WebAssembly.Memory` per tab (`WasmAccelerator.s_sharedWasmMemory`), grown to the lane high-water and never re-created -> a single reservation. Safe because the linear memory is per-dispatch transient working/staging memory (zero region -> copy-in -> run -> copy-out; no cross-accelerator state); a process-wide `SemaphoreSlim` serializes the shared-memory dispatch window across concurrently-alive accelerators (zero-cost in the sequential case). Scoped to default `MaxLinearMemoryPages` (16384): a custom-max accelerator (e.g. ML at 32768) or explicit-`WorkerCount` accelerator keeps a private memory, because the kernel module declares its memory-import maximum = its own `MaxLinearMemoryPages` and the spec requires the supplied memory's maximum be <= the module's declared maximum. Bonus: with persistent workers and one persistent memory the buffer only changes on `grow()`, so after high-water the workers stop re-instantiating kernels entirely (the per-test new-memory churn is gone). Diagnostics `WasmAccelerator.SharedWasmMemoryCreateCount` / `SharedWasmMemoryPages`; locked by `WasmTests.Wasm_SharedLinearMemory_PersistsAndStaysBoundedAcrossAccelerators`.
 
 ## 4.12.0 (2026-06-13) - Sync/async contract: async-only where it waits/observes, sync for fire-and-forget
 
diff --git a/SpawnDev.ILGPU.Demo/UnitTests/WasmTests.cs b/SpawnDev.ILGPU.Demo/UnitTests/WasmTests.cs
@@ -1637,5 +1637,107 @@ public async Task Wasm_SharedWorkerPool_PersistsAndStaysBoundedAcrossAccelerator
                     $"An explicit-WorkerCount accelerator grew the SHARED pool ({sizeBeforeExplicit} -> " +
                     $"{sizeAfterExplicit}) — it must use a private pool and leave the shared pool untouched.");
         }
+
+        // Process-static SHARED linear memory (2026-06-14, Geordi). A `WebAssembly.Memory({shared:true})`
+        // reserves its full `maximum` (default 1 GiB) of virtual address space at construction and can
+        // never relocate. Before the persistent worker pool, Worker.terminate() per accelerator dropped
+        // the workers' references each test so the old reservation was freed; with persistent workers the
+        // workers PIN the last memory they instantiated against, so per-accelerator memories accumulated
+        // up to workerCount live 1 GiB reservations across the ~569-test Wasm lane until V8's address-
+        // space cap was hit and `new WebAssembly.Memory()` threw "could not allocate memory" (Tuvok's 88
+        // RangeErrors — the memory half the pool fix unmasked). The fix: default accelerators share ONE
+        // process-static linear memory (grown to the lane high-water, never re-created).
+        //
+        // This locks BOTH halves:
+        //  (1) BOUNDED: creating + dispatching + disposing K default accelerators constructs AT MOST ONE
+        //      new shared memory (then reuses it), NOT K — directly the reservation-accumulation fix.
+        //  (2) CORRECT-ON-REUSE: every accelerator's dispatch into the shared memory matches the CPU
+        //      oracle, proving the shared linear memory carries no stale cross-accelerator state.
+        // Plus ISOLATION: an explicit-WorkerCount accelerator (private pool → private memory) must NOT
+        // construct a shared memory.
+        [TestMethod(Timeout = 120000)]
+        public async Task Wasm_SharedLinearMemory_PersistsAndStaysBoundedAcrossAccelerators()
+        {
+            const int count = 4096;
+            const int accelerators = 5;
+
+            var oracle = new int[count];
+            for (int i = 0; i < count; i++) oracle[i] = i * 5 + 3;
+
+            int createCountBefore = WasmAccelerator.SharedWasmMemoryCreateCount;
+            for (int a = 0; a < accelerators; a++)
+            {
+                var context = Context.Create().Wasm().ToContext();
+                var accelerator = await context.CreateWasmAcceleratorAsync();
+                try
+                {
+                    using var buf = accelerator.Allocate1D<int>(count);
+                    var fill = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>>(
+                        (i, v) => v[i] = i * 5 + 3);
+                    fill((Index1D)count, buf.View);
+                    await accelerator.SynchronizeAsync();
+                    var result = await buf.CopyToHostAsync<int>();
+
+                    for (int i = 0; i < count; i++)
+                        if (result[i] != oracle[i])
+                            throw new Exception(
+                                $"Accelerator #{a}: result[{i}]={result[i]} expected {oracle[i]} — " +
+                                $"shared linear memory produced wrong output (stale cross-accelerator state?).");
+                }
+                finally
+                {
+                    accelerator.Dispose();
+                    context.Dispose();
+                }
+            }
+            int createCountAfter = WasmAccelerator.SharedWasmMemoryCreateCount;
+
+            // BOUNDED invariant: across K default accelerators, at most ONE shared memory was
+            // constructed (0 if a prior lane test already built it; 1 if this test was first). The
+            // pre-fix per-accelerator design would have constructed K (one per accelerator).
+            int created = createCountAfter - createCountBefore;
+            if (created > 1)
+                throw new Exception(
+                    $"{created} shared WebAssembly.Memory objects were constructed across {accelerators} " +
+                    $"accelerators — the per-accelerator-memory reservation leak has regressed (expected <= 1).");
+            if (WasmAccelerator.SharedWasmMemoryPages < 1)
+                throw new Exception(
+                    "Shared linear memory page count never registered >= 1 after dispatching on " +
+                    $"{accelerators} default accelerators — the shared memory was not used.");
+
+            // ISOLATION invariant: an explicit-WorkerCount accelerator uses a private pool AND a private
+            // memory, so it must NOT construct a shared memory.
+            int createBeforeExplicit = WasmAccelerator.SharedWasmMemoryCreateCount;
+            {
+                int defaultWorkerCount;
+                {
+                    using var probeCtx = Context.Create().Wasm().ToContext();
+                    using var probeAcc = await probeCtx.CreateWasmAcceleratorAsync();
+                    defaultWorkerCount = ((WasmAccelerator)probeAcc).WorkerCount;
+                }
+                var exCtx = Context.Create().Wasm().ToContext();
+                var exAcc = await exCtx.CreateWasmAcceleratorAsync(
+                    new WasmBackendOptions { WorkerCount = defaultWorkerCount + 6 });
+                try
+                {
+                    using var b = exAcc.Allocate1D<int>(count);
+                    var k = exAcc.LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>>(
+                        (i, v) => v[i] = i * 5 + 3);
+                    k((Index1D)count, b.View);
+                    await exAcc.SynchronizeAsync();
+                    var r = await b.CopyToHostAsync<int>();
+                    for (int i = 0; i < count; i++)
+                        if (r[i] != oracle[i])
+                            throw new Exception(
+                                $"Explicit-WorkerCount accelerator: result[{i}]={r[i]} expected {oracle[i]}.");
+                }
+                finally { exAcc.Dispose(); exCtx.Dispose(); }
+            }
+            int createAfterExplicit = WasmAccelerator.SharedWasmMemoryCreateCount;
+            if (createAfterExplicit > createBeforeExplicit)
+                throw new Exception(
+                    $"An explicit-WorkerCount accelerator constructed a SHARED memory ({createBeforeExplicit} " +
+                    $"-> {createAfterExplicit}) — it must use a private memory and leave the shared one untouched.");
+        }
     }
 }
diff --git a/SpawnDev.ILGPU/SpawnDev.ILGPU.csproj b/SpawnDev.ILGPU/SpawnDev.ILGPU.csproj
@@ -4,9 +4,9 @@
 		<TargetFramework>net10.0</TargetFramework>
 		<ImplicitUsings>enable</ImplicitUsings>
 		<Nullable>enable</Nullable>
-		<Version>4.12.1-local.3</Version>
+		<Version>4.12.1-local.4</Version>
 		<!-- Brief current-version highlights only. Full per-version history with code samples lives in CHANGELOG.md (linked from the README). -->
-		<PackageReleaseNotes>4.12.1: (a) WebGPU GEMV grid-stride fix — a cooperative GEMV's inner K-tile loop (k=tid;k&lt;K;k+=G) was conflated with a synthetic group grid-stride counter (~K/G x too-small accumulation); fixed with a two-pass uniform-break (decide the break from the EMITTED body's barriers — barrier-free loops keep their natural break, barrier loops keep the uniform break unchanged). Re-enables the fast cooperative M=1 GEMV on WebGPU. (b) fixes special-float (±inf/NaN) scalar kernel params on WebGL + Wasm - a kernel float scalar holding ±inf/NaN (e.g. ConstantOfShape(-inf)) silently failed at the .NET-&gt;JS dispatch boundary (WebGL: System.Text.Json in postMessage rejected ±inf/NaN; Wasm: culture-sensitive ToString emitted invalid JS tokens). WebGL now sends float scalars as int32 bit patterns (glWorker reconstructs); Wasm uses InvariantCulture. Also adds the AcceleratorRequirements.RequiresScatterStores capability flag (rules out WebGL for kernels doing in-kernel scatter / &gt;1 non-positional output element per thread - WebGL Transform-Feedback is gather-only). (c) Wasm: the Web Worker pool is now process-persistent (shared across accelerators) instead of created + terminated per accelerator - removes the per-accelerator worker create/async-terminate churn that transiently oversubscribed cores and starved compute-heavy tests late in a long sequential lane. Default-WorkerCount accelerators share the pool (bounded at ~hardwareConcurrency-2); an explicit WorkerCount (oversubscription stress tests) keeps a private pool. Forks stay 2.0.16. --- 4.12.0 Sync/async contract (bundles forks 2.0.16): an operation that WAITS for completion or OBSERVES a result is async-only on the browser backends (WebGPU/WebGL/Wasm) - its synchronous form throws NotSupportedException; fire-and-forget operations (dispatch, alloc, host-&gt;device upload, flush-submit) stay synchronous everywhere. 
Synchronize() now THROWS on browser (was a silent non-waiting flush) - use await SynchronizeAsync() to wait. New Flush() submits batched work without waiting (valid synchronously on browser - submit is sync on every backend, so there is no async Flush twin). CopyFromCPU/Allocate1D(data) work on every browser backend again (routed through the new EnsureHostCopyConsumed hook). Sync CreateScan/CreateRadixSort builders run on the browser backends. Gate: full cross-backend PMT 3384/0/218. Full contract: Docs/async.md. Full per-version history: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
+		<PackageReleaseNotes>4.12.1: WebGPU cooperative GEMV grid-stride fix; ±inf/NaN scalar kernel params on WebGL+Wasm; AcceleratorRequirements.RequiresScatterStores flag; Wasm process-persistent shared Web Worker pool AND shared linear memory (default accelerators share one pool + one WebAssembly.Memory per tab, fixing per-accelerator worker-churn starvation and 1 GiB-reservation accumulation across long test lanes). Forks stay 2.0.16. Full per-version history with details: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
 		<GeneratePackageOnBuild>True</GeneratePackageOnBuild>
 		<GenerateDocumentationFile>true</GenerateDocumentationFile>
 		<EmbedAllSources>true</EmbedAllSources>
diff --git a/SpawnDev.ILGPU/Wasm/CLAUDE.md b/SpawnDev.ILGPU/Wasm/CLAUDE.md
@@ -98,6 +98,46 @@ accelerator. Locked by `WasmTests.Wasm_SharedWorkerPool_PersistsAndStaysBoundedA
 alive and both adopt the same worker, both install handlers — detach-on-dispose only cleans the
 dominant sequential case (PMT disposes the previous accelerator before creating the next).
 
+## Process-static SHARED linear memory (2026-06-14) — `s_sharedWasmMemory`
+
+The memory analog of the worker pool, and the **second half** the pool fix unmasked. A
+`new WebAssembly.Memory({ shared: true })` reserves its FULL `maximum` (default 16384 pages = 1 GiB)
+of virtual address space at construction and can **never relocate**, so each accelerator that built
+its own memory burned a full 1 GiB reservation. Before the persistent pool, `Worker.terminate()` per
+accelerator Dispose dropped the workers' references so the old reservation was freed/GC'd per test;
+with the persistent pool the workers **pin** the last memory they instantiated against
+(`_lastMemoryBuffer` + `_instancesById`, `WorkerPool.cs`) until they next swap. Across the ~569-test
+lane, per-accelerator memories accumulated up to `workerCount` live 1 GiB reservations (plus JS-GC lag)
+until V8's address-space cap was hit and the **`new WebAssembly.Memory(...)` CONSTRUCTOR** threw
+`could not allocate memory` (Tuvok's 88 `RangeError`s, the memory half). NB: the constructor failing
+(not `grow()` → our `OutOfMemoryException`) is the tell — accumulation, not single-memory high-water.
+
+**Fix:** default accelerators share **ONE** process-static `s_sharedWasmMemory` per tab (grown to the
+lane high-water, never re-created → exactly one reservation). The linear memory is per-dispatch
+**transient** working/staging memory (zero region → copy-IN → run → copy-OUT; no cross-accelerator
+state), so sharing is correct. Bonus: with persistent workers AND one persistent memory the buffer
+only changes on `grow()`, so after high-water the workers **stop re-instantiating kernels entirely**
+(per-test new-memory→instance-cache-clear churn gone).
+
+- **Routing:** `UsesSharedMemory` = `_useSharedPool && _maxLinearMemoryPages == 16384`. The whole
+  create/grow/reuse block + Dispose read/write through `CachedWasmMemory`/`CachedMemoryBuffer`/
+  `CachedWasmPages` properties that pick static-vs-instance backing. A custom-`MaxLinearMemoryPages`
+  accelerator (e.g. ML DA3 at 32768) or explicit-`WorkerCount` accelerator keeps a PRIVATE
+  `_cachedWasmMemory` — required, since the kernel module declares its import max = its own
+  MaxLinearMemoryPages and the spec needs supplied-max ≤ module-declared-max (a 16384 memory can't
+  back 32768 modules or vice-versa), and these are rare/long-lived so no accumulation.
+- **Concurrency:** `s_sharedMemoryGate` (a `SemaphoreSlim(1,1)`) serializes the shared-memory dispatch
+  window (acquire→zero→copy-IN→exec→copy-OUT) across accelerators — within one accelerator dispatches
+  already serialize via `_pendingWork`; this extends that to two concurrently-alive default
+  accelerators sharing the one memory (overlapping region-[0..) writes would corrupt). Uncontended /
+  zero-cost in the sequential PMT/production case.
+- **Lifetime:** never disposed on accelerator Dispose (shared accelerators leave the instance handles
+  null; private accelerators dispose their own); torn down only via `WasmAccelerator.DisposeSharedWasmMemory()`
+  (also called by `DisposeSharedWorkerPool()`) or tab teardown. Diagnostics: `SharedWasmMemoryCreateCount`
+  (stays 1 across the lane), `SharedWasmMemoryPages` (high-water). Locked by
+  `WasmTests.Wasm_SharedLinearMemory_PersistsAndStaysBoundedAcrossAccelerators` (≤1 construction across
+  K accelerators + correct-on-reuse + explicit-count isolation).
+
 ## Offline compile dump (desktop, no browser) — `wasm-dump`
 
 `SpawnDev.ILGPU.DemoConsole -- wasm-dump` compiles RadixSort kernels on the DESKTOP and prints the emitted shared-memory alloca table + flags any `GenerateCode(Alloca)` type+size fallback aliasing or offset overlap. Works because `WasmAccelerator.Create` wraps the `BlazorJSRuntime.JS` lookup in try/catch (defaults to 4 cores) and `CreateRadixSort*` compiles its kernels eagerly via `LoadKernel` BEFORE any dispatch — so the IL→wasm compile path runs fully offline (no workers, no Chromium, no dispatch). Reusable for any shared-memory layout audit. Source: `SpawnDev.ILGPU.DemoConsole/WasmCompileDump.cs`.
diff --git a/SpawnDev.ILGPU/Wasm/WasmAccelerator.cs b/SpawnDev.ILGPU/Wasm/WasmAccelerator.cs