Wasm: process-static shared linear memory (fix reservation accumulation)
The second half of the problem that the persistent worker pool (4f93830) unmasked. Each default
WasmAccelerator built its own `new WebAssembly.Memory({maximum:16384, shared:true})`.
A shared WebAssembly.Memory reserves its FULL maximum (1 GiB) of virtual address
space at construction and can never relocate. Before the persistent pool,
Worker.terminate() per accelerator Dispose dropped the workers' references so the
old reservation was freed/GC'd per test; with persistent workers each worker now
PINS the last memory it instantiated against (_lastMemoryBuffer + _instancesById)
until it next swaps. Across the ~569-test PMT Wasm lane the per-accelerator
memories accumulated up to workerCount live 1 GiB reservations (plus JS-GC lag)
until V8's address-space cap was hit and the `new WebAssembly.Memory()`
CONSTRUCTOR threw "could not allocate memory" (Tuvok's 88 RangeErrors on the ML
full Wasm lane). The constructor failing (not grow() -> our OutOfMemoryException)
is the tell: reservation accumulation, not single-memory high-water.
Fix (memory analog of the worker-pool fix): default accelerators share ONE
process-static s_sharedWasmMemory per tab, grown to the lane high-water and never
re-created -> a single reservation. The linear memory is per-dispatch transient
working/staging memory (zero region -> copy-IN -> run -> copy-OUT; no
cross-accelerator state), so sharing is correct. Bonus perf: with persistent
workers AND one persistent memory the buffer only changes on grow(), so after
high-water the workers stop re-instantiating kernels entirely (the per-test
new-memory -> instance-cache-clear churn is gone).
- Routing: UsesSharedMemory = _useSharedPool && _maxLinearMemoryPages == 16384.
The create/grow/reuse block + Dispose read/write through CachedWasmMemory /
CachedMemoryBuffer / CachedWasmPages properties that pick static-vs-instance
backing. A custom-MaxLinearMemoryPages accelerator (e.g. ML DA3 at 32768) or
explicit-WorkerCount accelerator keeps a PRIVATE memory -- required, since the
kernel module declares its import max = its own MaxLinearMemoryPages and the
spec needs supplied-max <= module-declared-max (a 16384 memory can't back 32768
modules or vice-versa); these are rare/long-lived so no accumulation.
- Concurrency: s_sharedMemoryGate (SemaphoreSlim(1,1)) serializes the shared-memory
dispatch window (acquire -> zero -> copy-IN -> exec -> copy-OUT) across
concurrently-alive default accelerators (overlapping region-[0..) writes on one
memory would corrupt). Within one accelerator dispatches already serialize via
_pendingWork; uncontended/zero-cost in the sequential PMT/production case.
- Lifetime: never disposed on accelerator Dispose (shared accelerators leave the
instance handles null); torn down only via DisposeSharedWasmMemory() (also called
by DisposeSharedWorkerPool()) or tab teardown. Diagnostics:
SharedWasmMemoryCreateCount (stays 1 across the lane), SharedWasmMemoryPages.
Regression guard Wasm_SharedLinearMemory_PersistsAndStaysBoundedAcrossAccelerators
(<=1 construction across K accelerators + correct-on-reuse + explicit-count
isolation). Gate: PMT_FILTER=WasmTests 517/0/17. Version 4.12.1-local.4 (forks
stay 2.0.16).
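
The routing, concurrency and lifetime points above can be sketched in JavaScript. This is a minimal illustration under stated assumptions, not the library's C# implementation: `getSharedMemory`, `withGate`, `dispatch` and `sharedCreateCount` are hypothetical names, a promise chain stands in for the `SemaphoreSlim(1,1)`, the "kernel" simply doubles each element, and the maximum is 256 pages rather than 16384 to keep the demo light.

```javascript
// Hypothetical sketch: every name here is illustrative, not a SpawnDev.ILGPU API.
const PAGE_BYTES = 65536;      // one WebAssembly page is 64 KiB

let sharedMemory = null;       // one memory per process/tab, never re-created
let sharedCreateCount = 0;     // diagnostic: expected to stay at 1
let gate = Promise.resolve();  // tail of a promise chain acting as a 1-permit lock

// Create the memory once, then only grow it toward the high-water mark.
function getSharedMemory(neededPages, maxPages) {
  if (sharedMemory === null) {
    sharedMemory = new WebAssembly.Memory({ initial: neededPages, maximum: maxPages, shared: true });
    sharedCreateCount++;
  } else {
    const currentPages = sharedMemory.buffer.byteLength / PAGE_BYTES;
    if (neededPages > currentPages) sharedMemory.grow(neededPages - currentPages);
  }
  return sharedMemory;
}

// Serialize the whole dispatch window; a failed dispatch must not block later ones.
function withGate(work) {
  const run = gate.then(work);
  gate = run.catch(() => {});
  return run;
}

function dispatch(input, maxPages = 256) {
  return withGate(async () => {
    const pages = Math.max(1, Math.ceil((input.length * 4) / PAGE_BYTES));
    const memory = getSharedMemory(pages, maxPages);
    const view = new Int32Array(memory.buffer, 0, input.length);
    view.fill(0);                                       // zero region
    view.set(input);                                    // copy-in
    await Promise.resolve();                            // stand-in for asynchronous kernel execution
    for (let i = 0; i < view.length; i++) view[i] *= 2; // "run"
    return Array.from(view);                            // copy-out
  });
}

// Two concurrently issued dispatches reuse region [0..) of the same memory without interleaving.
Promise.all([dispatch([1, 2, 3]), dispatch([10, 20])]).then((results) => {
  console.log(results, sharedCreateCount);
});
```

Without the gate, the second dispatch could zero and overwrite region [0..) while the first is still awaiting its kernel, which is the corruption the concurrency bullet describes.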
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CHANGELOG.md (1 addition & 0 deletions)
@@ -9,6 +9,7 @@ Wrapper-only (forks stay **2.0.16**). Adds a new selection-gate capability flag:
-**`AcceleratorRequirements.RequiresScatterStores`** (rules out WebGL) - declare it when a kernel writes a computed/arbitrary output index (`out[someIndex] = ...`) or more than one element of one buffer per thread that isn't the consecutive `v*storeCount+slot` layout. WebGL Transform-Feedback captures one output record per vertex at the thread's own slot (gather-only), so in-kernel scatter can't run there; the flag filters WebGL at `EnumerateCompatibleDevices` / `CreatePreferredAccelerator` / `Satisfies` time. (WebGL still scatters at the host/algorithm layer - e.g. RadixSort via render-to-texture.)
- A compile-time fail-loud guard for this class (mirroring the atomics/barriers/Scan throws) was prototyped and backed out - the blunt criterion false-positived on legitimate positional multi-store + grid-stride-loop kernels. The correct codegen-level criterion is a tracked open item (`Plans/webgl-multistore-fail-loud-guard-plan-2026-06-13.md`). For now use the selection flag.
- **Wasm: process-persistent shared Web Worker pool.** The Web Worker pool is now process-static (`WasmAccelerator.s_sharedWorkerPool`) - created once per tab and reused across every default-WorkerCount accelerator - instead of being created and `terminate()`d per accelerator. `Worker.terminate()` is an asynchronous browser signal, so a fresh-accelerator-per-test pattern (PMT's ~531-test Wasm lane) spun up a new `hardwareConcurrency` pool while the previous pool's threads were still winding down -> transient worker oversubscription that compounded across the lane -> the pure-spin barrier couldn't schedule all workers in its window -> compute-heavy tests starved and timed out late while light tests stayed fast. The shared pool removes both the terminate churn and the per-test re-create cost. Safe across accelerators: the worker-side module-cache key is a process-static monotonic id (no cross-accelerator collision), a memory-buffer change invalidates a reused worker's cached instances, and each accelerator detaches its handlers on Dispose. Bounded at ~`hardwareConcurrency-2`: an explicit `WorkerCount` (oversubscription stress tests) keeps a private pool, and a worker still checked out at an abnormal Dispose is terminated+removed rather than stranded. Locked by `WasmTests.Wasm_SharedWorkerPool_PersistsAndStaysBoundedAcrossAccelerators`. Gate: `PMT_FILTER=WasmTests` 516/0/17.
+- **Wasm: process-static shared linear memory** (the second half of the problem that the persistent worker pool unmasked). A `new WebAssembly.Memory({ shared: true })` reserves its full `maximum` (default 16384 pages = 1 GiB) of virtual address space at construction and can never relocate, so each default accelerator that built its own memory burned a full 1 GiB reservation. Before the persistent pool, `Worker.terminate()` per accelerator Dispose dropped the workers' references so the old reservation was freed/GC'd per test; with the persistent pool the workers pin the last memory they instantiated against (until they next swap), so across a ~569-test lane the per-accelerator memories accumulated up to `workerCount` live 1 GiB reservations until V8's address-space cap was hit and the `new WebAssembly.Memory(...)` constructor threw `could not allocate memory`. Default-WorkerCount accelerators now share ONE process-static `WebAssembly.Memory` per tab (`WasmAccelerator.s_sharedWasmMemory`), grown to the lane high-water and never re-created -> a single reservation. Safe because the linear memory is per-dispatch transient working/staging memory (zero region -> copy-in -> run -> copy-out; no cross-accelerator state); a process-wide `SemaphoreSlim` serializes the shared-memory dispatch window across concurrently-alive accelerators (zero-cost in the sequential case). Scoped to default `MaxLinearMemoryPages` (16384): a custom-max accelerator (e.g. ML at 32768) or explicit-`WorkerCount` accelerator keeps a private memory, because the kernel module declares its memory-import maximum = its own `MaxLinearMemoryPages` and the spec requires the supplied memory's maximum be <= the module's declared maximum. Bonus: with persistent workers and one persistent memory the buffer only changes on `grow()`, so after high-water the workers stop re-instantiating kernels entirely (the per-test new-memory churn is gone). Diagnostics `WasmAccelerator.SharedWasmMemoryCreateCount` / `SharedWasmMemoryPages`; locked by `WasmTests.Wasm_SharedLinearMemory_PersistsAndStaysBoundedAcrossAccelerators`.
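
As a rough illustration of the `WebAssembly.Memory` behavior this entry relies on, the JavaScript sketch below creates a shared memory and grows it. It is not the library's code: `pagesOf` is a hypothetical helper, and the maximum is 16 pages instead of the default 16384 so the demo stays small.

```javascript
// Illustrative sketch only; `pagesOf` is a hypothetical helper, not a library API.
const PAGE_BYTES = 65536; // one WebAssembly page is 64 KiB

// 16384 pages * 64 KiB = 2^30 bytes, i.e. the 1 GiB reservation per default memory.
const defaultReservationBytes = 16384 * PAGE_BYTES;

// A shared memory has to declare `maximum` when it is constructed.
const memory = new WebAssembly.Memory({ initial: 1, maximum: 16, shared: true });

function pagesOf(mem) {
  return mem.buffer.byteLength / PAGE_BYTES;
}

const before = memory.buffer;         // SharedArrayBuffer covering the current pages
const previousPages = memory.grow(3); // grow() returns the page count before growing
const after = memory.buffer;          // re-read the buffer to see the grown length

console.log(defaultReservationBytes === 2 ** 30);  // true
console.log(before instanceof SharedArrayBuffer);  // true
console.log(previousPages, pagesOf(memory));       // 1 4
```

Holders of the memory need to re-read `memory.buffer` after `grow()` to see the new length, which matches the entry's point that the buffer changes only on `grow()`.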
## 4.12.0 (2026-06-13) - Sync/async contract: async-only where it waits/observes, sync for fire-and-forget
SpawnDev.ILGPU/SpawnDev.ILGPU.csproj (2 additions & 2 deletions)
@@ -4,9 +4,9 @@
<TargetFramework>net10.0</TargetFramework>
<ImplicitUsings>enable</ImplicitUsings>
<Nullable>enable</Nullable>
-<Version>4.12.1-local.3</Version>
+<Version>4.12.1-local.4</Version>
<!-- Brief current-version highlights only. Full per-version history with code samples lives in CHANGELOG.md (linked from the README). -->
-
<PackageReleaseNotes>4.12.1: (a) WebGPU GEMV grid-stride fix — a cooperative GEMV's inner K-tile loop (k=tid;k<K;k+=G) was conflated with a synthetic group grid-stride counter (~K/G x too-small accumulation); fixed with a two-pass uniform-break (decide the break from the EMITTED body's barriers — barrier-free loops keep their natural break, barrier loops keep the uniform break unchanged). Re-enables the fast cooperative M=1 GEMV on WebGPU. (b) fixes special-float (±inf/NaN) scalar kernel params on WebGL + Wasm - a kernel float scalar holding ±inf/NaN (e.g. ConstantOfShape(-inf)) silently failed at the .NET->JS dispatch boundary (WebGL: System.Text.Json in postMessage rejected ±inf/NaN; Wasm: culture-sensitive ToString emitted invalid JS tokens). WebGL now sends float scalars as int32 bit patterns (glWorker reconstructs); Wasm uses InvariantCulture. Also adds the AcceleratorRequirements.RequiresScatterStores capability flag (rules out WebGL for kernels doing in-kernel scatter / >1 non-positional output element per thread - WebGL Transform-Feedback is gather-only). (c) Wasm: the Web Worker pool is now process-persistent (shared across accelerators) instead of created + terminated per accelerator - removes the per-accelerator worker create/async-terminate churn that transiently oversubscribed cores and starved compute-heavy tests late in a long sequential lane. Default-WorkerCount accelerators share the pool (bounded at ~hardwareConcurrency-2); an explicit WorkerCount (oversubscription stress tests) keeps a private pool. Forks stay 2.0.16. --- 4.12.0 Sync/async contract (bundles forks 2.0.16): an operation that WAITS for completion or OBSERVES a result is async-only on the browser backends (WebGPU/WebGL/Wasm) - its synchronous form throws NotSupportedException; fire-and-forget operations (dispatch, alloc, host->device upload, flush-submit) stay synchronous everywhere. 
Synchronize() now THROWS on browser (was a silent non-waiting flush) - use await SynchronizeAsync() to wait. New Flush() submits batched work without waiting (valid synchronously on browser - submit is sync on every backend, so there is no async Flush twin). CopyFromCPU/Allocate1D(data) work on every browser backend again (routed through the new EnsureHostCopyConsumed hook). Sync CreateScan/CreateRadixSort builders run on the browser backends. Gate: full cross-backend PMT 3384/0/218. Full contract: Docs/async.md. Full per-version history: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
+
<PackageReleaseNotes>4.12.1: WebGPU cooperative GEMV grid-stride fix; ±inf/NaN scalar kernel params on WebGL+Wasm; AcceleratorRequirements.RequiresScatterStores flag; Wasm process-persistent shared Web Worker pool AND shared linear memory (default accelerators share one pool + one WebAssembly.Memory per tab, fixing per-accelerator worker-churn starvation and 1 GiB-reservation accumulation across long test lanes). Forks stay 2.0.16. Full per-version history with details: CHANGELOG.md at https://github.com/LostBeard/SpawnDev.ILGPU/blob/master/CHANGELOG.md</PackageReleaseNotes>
+- **Routing:** `UsesSharedMemory` = `_useSharedPool && _maxLinearMemoryPages == 16384`. The whole
+  create/grow/reuse block + Dispose read/write through `CachedWasmMemory`/`CachedMemoryBuffer`/
+  `CachedWasmPages` properties that pick static-vs-instance backing. A custom-`MaxLinearMemoryPages`
+  accelerator (e.g. ML DA3 at 32768) or explicit-`WorkerCount` accelerator keeps a PRIVATE
+  `_cachedWasmMemory`. This is required, since the kernel module declares its import max = its own
+  `MaxLinearMemoryPages` and the spec needs supplied-max ≤ module-declared-max (a 16384 memory can't
+  back 32768 modules or vice-versa), and these are rare/long-lived so no accumulation.
+- **Concurrency:** `s_sharedMemoryGate` (a `SemaphoreSlim(1,1)`) serializes the shared-memory dispatch
+  window (acquire→zero→copy-IN→exec→copy-OUT) across accelerators. Within one accelerator dispatches
+  already serialize via `_pendingWork`; this extends that to two concurrently-alive default
+  accelerators sharing the one memory (overlapping region-[0..) writes would corrupt). Uncontended /
+  zero-cost in the sequential PMT/production case.
+- **Lifetime:** never disposed on accelerator Dispose (shared accelerators leave the instance handles
+  null; private accelerators dispose their own); torn down only via `WasmAccelerator.DisposeSharedWasmMemory()`
+  (also called by `DisposeSharedWorkerPool()`) or tab teardown. Diagnostics: `SharedWasmMemoryCreateCount`
+  (stays 1 across the lane), `SharedWasmMemoryPages` (high-water). Locked by
+  `WasmTests.Wasm_SharedLinearMemory_PersistsAndStaysBoundedAcrossAccelerators` (≤1 construction across
+  K accelerators + correct-on-reuse + explicit-count isolation).
+
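
The import-maximum rule cited under Routing can be checked directly in JavaScript. The sketch below is an assumption-laden illustration rather than the library's kernel module: `moduleBytes` is a hand-assembled module that only imports a shared memory `env.mem` declared with min 1 and max 2 pages (standing in for 16384), and `canLink` is a hypothetical helper.

```javascript
// Hand-assembled module: a single import of a shared memory "env"."mem", min=1, max=2 pages.
const moduleBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, // magic "\0asm"
  0x01, 0x00, 0x00, 0x00, // binary version 1
  0x02, 0x0d,             // import section, 13 bytes of content
  0x01,                   // one import
  0x03, 0x65, 0x6e, 0x76, // module name "env"
  0x03, 0x6d, 0x65, 0x6d, // field name "mem"
  0x02,                   // import kind: memory
  0x03, 0x01, 0x02,       // limits: shared + has-max flag, min=1, max=2
]);
const kernelModule = new WebAssembly.Module(moduleBytes);

// Hypothetical helper: true when the memory satisfies the module's import, false on LinkError.
function canLink(memory) {
  try {
    new WebAssembly.Instance(kernelModule, { env: { mem: memory } });
    return true;
  } catch (e) {
    if (e instanceof WebAssembly.LinkError) return false;
    throw e; // anything else is unexpected
  }
}

const matching = new WebAssembly.Memory({ initial: 1, maximum: 2, shared: true });
const tooLarge = new WebAssembly.Memory({ initial: 1, maximum: 4, shared: true });

console.log(canLink(matching)); // true: supplied max equals the declared max
console.log(canLink(tooLarge)); // false: supplied max exceeds the declared max
```

This only demonstrates the supplied-max > declared-max direction of the rule; the sizes are scaled down so that no large reservation is needed to run it.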
## Offline compile dump (desktop, no browser) — `wasm-dump`
`SpawnDev.ILGPU.DemoConsole -- wasm-dump` compiles RadixSort kernels on the DESKTOP and prints the emitted shared-memory alloca table + flags any `GenerateCode(Alloca)` type+size fallback aliasing or offset overlap. Works because `WasmAccelerator.Create` wraps the `BlazorJSRuntime.JS` lookup in try/catch (defaults to 4 cores) and `CreateRadixSort*` compiles its kernels eagerly via `LoadKernel` BEFORE any dispatch — so the IL→wasm compile path runs fully offline (no workers, no Chromium, no dispatch). Reusable for any shared-memory layout audit. Source: `SpawnDev.ILGPU.DemoConsole/WasmCompileDump.cs`.