SpawnDev.ILGPU transpiles .NET IL into GPU shader languages (WGSL, GLSL) and WebAssembly at runtime. Precompiled shaders let you move that transpilation off the runtime hot path and generate/inspect shader code on any machine without the target device. It is the classic AOT shader / pipeline-cache pattern (Unity shader variants, Vulkan VkPipelineCache, D3D precompiled bytecode) adapted to this fork's runtime transpiler.
The system has three layers, each independently useful:
| Layer | What | Status |
|---|---|---|
| 1 - Offline codegen | Generate a kernel's WGSL/GLSL/Wasm for a capability profile with NO device, on any host OS. Powers cross-backend debugging. | Shipped |
| 3 - Runtime cache | At kernel-load time, use a ready artifact for the active device profile instead of transpiling; fall back to runtime generation on a miss. | Shipped (WebGPU) |
| 2 - Build-time precompile | A `.targets` (gated by `<SpawnDevPrecompileShaders>`) spawns a worker that reflects the built assembly, runs Layer 1 per `[PrecompiledKernel]` (kernel x profile), and writes `wwwroot/_shaders/{manifest.json + sidecars}` that the runtime loader consumes. | Mechanism complete + build-cycle verified; nupkg tool-shipping is the remaining deployment wiring (see Layer 2 - Packaging). |
The whole pipeline is verified end-to-end with no device/browser: `dotnet run --project SpawnDev.ILGPU.DemoConsole -- precompile-e2e` (build-time precompile -> emit -> runtime load -> cache hit), and a real `dotnet build` of a consumer with `<SpawnDevPrecompileShaders>true</...>` emits the artifacts via the `.targets`.
Precompilation is a pure optimization. It is opt-in, profile-matched, and falls back to runtime generation on any miss, so it can never change results (global Rule 1).
```csharp
using SpawnDev.ILGPU;

// No accelerator, no device - runs on a CI box with no GPU.
GeneratedKernel gen = ShaderCompiler.Generate(
    (Index1D i, ArrayView<float> data) => data[i] = i,
    CapabilityProfiles.WebGPUFull);          // a named profile

Console.WriteLine(gen.Source);               // the WGSL text (or gen.Binary for Wasm)
Console.WriteLine(gen.Metadata.GroupSize);
foreach (var d in gen.Diagnostics)           // pre-validation results, if any
    Console.WriteLine($"{d.Severity}: {d.Message}");
```

Or dump from the command line:

```
dotnet run --project SpawnDev.ILGPU.DemoConsole -- shader-gen
```

This replaces the older ad-hoc `WGSLDumpPath` / wasm-dump side channels with one uniform "generate any kernel for any backend/profile" path. It is JS-free and deterministic: `(IL, profile)` always produces identical bytes.
```csharp
[PrecompiledKernel(AcceleratorType.WebGPU, Profile = "WebGPU-Dekker-Subgroups-NativeF16")]
[PrecompiledKernel(AcceleratorType.Wasm, Profile = "Wasm-NativeF64-Subgroups")]
static void MyKernel(Index1D i, ArrayView<float> data) { /* ... */ }
```

```xml
<!-- in your .csproj -->
<PropertyGroup>
  <SpawnDevPrecompileShaders>true</SpawnDevPrecompileShaders>
</PropertyGroup>
```

When the flag is on, the build emits the precompiled artifacts (see Layer 2); at startup the runtime registers them so the first load of `MyKernel` is a cache hit instead of a transpile. When the flag is off, behavior is exactly today's (runtime transpile) - precompilation is never a hidden default.
A CapabilityProfile is a serializable description of the codegen-relevant capabilities of a target device - only the things the transpilers actually branch on (native vs emulated f16/f64/i64, subgroups, max threads per group, storage-buffer binding limit, warp size). It is the single source of capability truth: the code generators read only from the profile, both offline and at runtime, so an offline artifact is byte-identical to what the live backend emits for a matching device.
f16 (and f64, i64) work on every backend. Where the hardware lacks them, SpawnDev.ILGPU emulates them losslessly (f16 via `_f16_to_f32`/`_f32_to_f16` bit conversion, f64 via Dekker/Ozaki, i64 via paired u32). A profile's `Float16Native`/`Float64Native`/`Int64Native` flags do not gate whether the type is supported - they only select which codegen path the transpiler emits (native `f16` type vs emulation helpers + packed storage), because that changes the emitted shader and therefore the artifact. The `f16` in a preset NAME refers to native `shader-f16`; an artifact without it still runs f16 kernels, just via the emulated path.
WebGPU presets are keyed by capability, not by browser (WGSL is a W3C standard; what differs between browsers is the feature/limit set):
| Preset (`CapabilityProfiles.*`) | Name string | f64 | f16 codegen | subgroups | Notes |
|---|---|---|---|---|---|
| `WebGPUFull` | `WebGPU-Dekker-Subgroups-NativeF16` | Dekker (emul.) | native `shader-f16` | yes | modern Chrome-class; 10 bindings |
| `WebGPUNoSubgroups` | `WebGPU-Dekker-NativeF16` | Dekker (emul.) | native `shader-f16` | no (shared-mem fallback) | typical Firefox-class today |
| `WebGPUBaseline` | `WebGPU-Dekker` | Dekker (emul.) | emulated f16 | no | broadest device match (all paths emulated) |
| `WebGL2Baseline` | `WebGL2-Dekker` | Dekker (emul.) | emulated f16 | no | i64 also emulated; no shared memory/atomics/barriers |
| `WasmDefault` | `Wasm-NativeF64-Subgroups` | native | emulated f16 | yes (8-wide emulated warps) | native f64 + i64; 256 threads/group |
Profile names are ordered by codegen impact: f64 strategy (highest - changes the shader and the result; Dekker/Ozaki/noF64 when emulated, NativeF64 when native), then Subgroups, then NativeF16. The absence of NativeF16 means emulated f16, not "no f16" - every profile runs f16 kernels. The name is a human label only; the cache key hashes the full field set, so it never needs to encode every field (i64-native and numeric limits are intentionally not in the name). There is room to add Ozaki/noF64 f64 variants when first needed.
Every profile above supports f16 kernels; the column says which f16 codegen path the artifact uses. The Profile = "..." string on [PrecompiledKernel] (and the .csproj profile list) resolves against these names via CapabilityProfiles.Resolve(name).
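To make the split between the human-readable name and the cache key concrete, here is a minimal sketch. `SketchProfile`, its field names, and the SHA-256 canonicalisation are assumptions made for illustration; they are not the library's actual `CapabilityProfile` type or hashing scheme, and the numeric values other than the 10-binding limit are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Illustrative values only; the thread-group size here is a placeholder.
var full = new SketchProfile("WebGPU", "Dekker", Subgroups: true, Float16Native: true,
    Int64Native: false, MaxThreadsPerGroup: 256, MaxStorageBindings: 10);
var fewerBindings = full with { MaxStorageBindings = 8 };

Console.WriteLine(full.Name);                       // WebGPU-Dekker-Subgroups-NativeF16
Console.WriteLine(full.Name == fewerBindings.Name); // True: the limit is not part of the name
Console.WriteLine(full.Key == fewerBindings.Key);   // False: the key hashes every field

// HYPOTHETICAL stand-in for a capability profile (not the library's type).
public sealed record SketchProfile(
    string Backend,          // "WebGPU", "WebGL2", "Wasm"
    string F64Strategy,      // "Dekker", "Ozaki", "noF64", or "NativeF64"
    bool Subgroups,
    bool Float16Native,
    bool Int64Native,
    int MaxThreadsPerGroup,
    int MaxStorageBindings)
{
    // Human label, ordered by codegen impact: f64 strategy, then Subgroups, then NativeF16.
    public string Name
    {
        get
        {
            var parts = new List<string> { Backend, F64Strategy };
            if (Subgroups) parts.Add("Subgroups");
            if (Float16Native) parts.Add("NativeF16");
            return string.Join("-", parts);
        }
    }

    // Cache key: a hash over the full field set, including fields absent from the name.
    public string Key
    {
        get
        {
            string canonical = FormattableString.Invariant(
                $"{Backend}|{F64Strategy}|{Subgroups}|{Float16Native}|{Int64Native}|{MaxThreadsPerGroup}|{MaxStorageBindings}");
            return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(canonical)));
        }
    }
}
```

Two profiles that differ only in a numeric limit share a label but not a key, which is why the name never needs to encode every field.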
```csharp
// Register a project-defined profile (resolvable by its Name).
CapabilityProfiles.Register(myProfile);

// Snapshot a LIVE device's effective capabilities - the only sanctioned place to read
// caps off a real device. Run this ON the target hardware to capture an exact profile
// for build-time precompile of that hardware.
CapabilityProfile p = CapabilityProfiles.FromAccelerator(accelerator);
```

`ShaderCompiler.Generate(Delegate kernel, CapabilityProfile profile, KernelSpecialization? spec = null)` is the canonical, device-independent entry point. It drives the SAME `WGSLCodeGenerator` / `WasmKernelFunctionGenerator` / GLSL generator used at runtime, fed by the profile instead of a live adapter, and stops short of resource/pipeline creation. The result:
```csharp
public sealed record GeneratedKernel
{
    AcceleratorType Backend;            // WebGPU | WebGL | Wasm
    CapabilityProfile Profile;          // what it was generated for
    string? Source;                     // WGSL/GLSL text (text backends)
    byte[]? Binary;                     // Wasm bytes (binary backend)
    GeneratedKernelMetadata Metadata;   // group size, bindings, shared mem, barriers, emulation flags
    object? CodegenMetadata;            // backend-specific dispatch metadata (see Layer 3)
    IReadOnlyList<GeneratedKernelDiagnostic> Diagnostics; // pre-validation results
    bool HasErrors;
}
```

Guarantees (enforced by guard tests): generation is deterministic (no `GetHashCode`-derived naming, no timestamps in the artifact), JS-runtime-free (runs on CI with no `BlazorJSRuntime`), and byte-identical to what the live backend emits for a matching profile.
Use it for: debugging generated code on a machine without that backend, diffing codegen changes, and (with a pre-validator wired in) reporting a shader-validation error - e.g. a WGSL `cannot assign 'f32' to 'i32'` - with no device or browser.
At kernel-load time the accelerator computes a cache key of (stable kernel identity, active device profile) and looks it up in ShaderArtifactCache:
- Hit: rebuild the compiled kernel from the cached shader + its `CodegenMetadata` (the dispatch info - scalar packing, binding count, i64-spinlock indices, coalesce manifest, dynamic-shared overrides - that is NOT recoverable from the shader text), skipping the transpiler.
- Miss: run the normal codegen and warm-register the result, so later loads in the same session hit even without build-time precompilation.

```csharp
ShaderArtifactCache.Hits;            // counters
ShaderArtifactCache.Misses;
ShaderArtifactCache.Count;           // registered artifacts
ShaderArtifactCache.Enabled = false; // kill switch: force pure runtime generation (A/B, debugging)
```

Kernel identity is the method's full signature (declaring type + name + parameter types) - stable across builds, identical between the build-time task and the runtime, and never `Object.GetHashCode()` (a non-unique heuristic - see the kernelId-collision lesson in `Wasm/CLAUDE.md`).
Currently implemented for the WebGPU backend. The Wasm artifact is self-contained bytes (dispatch-parameter-independent); a Wasm load-hook is a follow-up.
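The hit/miss flow and the signature-based kernel identity described above can be sketched as follows. `SketchArtifactCache`, `KernelIdentity`, and the string-typed profile key are hypothetical names for this illustration only; the real `ShaderArtifactCache` also stores `CodegenMetadata` and is not reproduced here.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Reflection;
using System.Threading;

MethodInfo kernel = typeof(Kernels).GetMethod(nameof(Kernels.Scale))!;
var cache = new SketchArtifactCache();
int transpiles = 0;
string Transpile() { transpiles++; return "/* generated shader text */"; }

cache.GetOrGenerate(kernel, "profile-A", Transpile); // miss: generates and warm-registers
cache.GetOrGenerate(kernel, "profile-A", Transpile); // hit: generator is skipped
cache.GetOrGenerate(kernel, "profile-B", Transpile); // miss: different profile key
Console.WriteLine($"hits={cache.Hits} misses={cache.Misses} transpiles={transpiles}");
// prints: hits=1 misses=2 transpiles=2

public static class Kernels
{
    // Plain stand-in method; a real kernel would take Index1D / ArrayView<T>.
    public static void Scale(int index, float[] data, float factor) => data[index] *= factor;
}

// HYPOTHETICAL cache sketch (not the library's ShaderArtifactCache).
public sealed class SketchArtifactCache
{
    private readonly ConcurrentDictionary<(string Kernel, string Profile), string> _artifacts = new();
    private int _hits, _misses;

    public int Hits => _hits;
    public int Misses => _misses;
    public bool Enabled { get; set; } = true; // kill switch: always generate when false

    // Identity = declaring type + name + parameter types; never GetHashCode().
    public static string KernelIdentity(MethodInfo m) =>
        $"{m.DeclaringType?.FullName}.{m.Name}(" +
        string.Join(",", m.GetParameters().Select(p => p.ParameterType.FullName)) + ")";

    public string GetOrGenerate(MethodInfo kernel, string profileKey, Func<string> generate)
    {
        if (!Enabled) return generate();

        var key = (KernelIdentity(kernel), profileKey);
        if (_artifacts.TryGetValue(key, out var cached))
        {
            Interlocked.Increment(ref _hits);
            return cached;
        }

        Interlocked.Increment(ref _misses);
        string fresh = generate();   // fall back to runtime generation
        _artifacts[key] = fresh;     // warm-register so later loads hit
        return fresh;
    }
}
```

Because the key is built from the method signature rather than a hash code, the same string can be computed by a build-time tool and by the runtime without coordination.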
Status: mechanism complete + build-cycle verified. The `[PrecompiledKernel]` attribute, the `ShaderPrecompiler` worker, the spawnable `SpawnDev.ILGPU.Precompiler` tool, the `SpawnDev.ILGPU.targets` automation, and the runtime loader all work and are verified end-to-end (a real `dotnet build` of a consumer emits artifacts; the runtime loads them). The remaining piece is shipping the worker tool inside the NuGet package - see Packaging.
Method attributes say what to precompile:
```csharp
[PrecompiledKernel(AcceleratorType.WebGPU, Profile = "WebGPU-Dekker-Subgroups-NativeF16")]
[PrecompiledKernel(AcceleratorType.Wasm, Profile = "Wasm-NativeF64-Subgroups")]
static void MyKernel(Index1D i, ArrayView<float> data) { /* ... */ }
```

`.csproj` flags say whether and the global set:

```xml
<PropertyGroup>
  <SpawnDevPrecompileShaders>true</SpawnDevPrecompileShaders>  <!-- master toggle -->
  <SpawnDevPrecompileProfiles>WebGPU-Dekker-Subgroups-NativeF16;Wasm-NativeF64-Subgroups</SpawnDevPrecompileProfiles>  <!-- optional global set -->
  <SpawnDevPrecompilePackaging>Content</SpawnDevPrecompilePackaging>  <!-- Content (default) | Embedded | Both -->
</PropertyGroup>
```

By default (`Content` packaging), blobs are written as `wwwroot` content files so they are lazily fetched on demand, independently cacheable (browser cache + service worker + OPFS), and trivially inspectable:
```
wwwroot/_shaders/
  manifest.json                                 # kernel-key -> [{backend, profile, artifact path, content hash, codegen version}]
  {profile}/{kernelName}.{shorthash}.wgsl       # WebGPU shader text
  {profile}/{kernelName}.{shorthash}.meta.json  # CodegenMetadata sidecar (dispatch info)
  {profile}/{kernelName}.{shorthash}.wasm       # Wasm module bytes
```
The content hash in the filename gives immutable browser caching. Only the tiny manifest.json is consulted at startup; the heavy blobs are fetched on demand. Embedded packaging instead bakes blobs into the assembly (for non-browser / single-DLL distribution); Both does both.
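As a rough illustration of why a content hash in the filename makes the blobs safe to cache immutably, the sketch below derives an artifact path from the artifact bytes. `ArtifactNaming` is a hypothetical helper, and both the choice of SHA-256 and the 8-character short hash are assumptions; the source does not state which hash or length the precompiler uses.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

byte[] wgsl = Encoding.UTF8.GetBytes("@compute @workgroup_size(64) fn main() {}");
Console.WriteLine(ArtifactNaming.PathFor("WebGPU-Dekker", "MyKernel", wgsl, "wgsl"));

// HYPOTHETICAL helper (not part of SpawnDev.ILGPU).
public static class ArtifactNaming
{
    // ASSUMPTION: short hash = first 8 hex characters of the SHA-256 of the content.
    public static string ShortHash(byte[] content) =>
        Convert.ToHexString(SHA256.HashData(content)).Substring(0, 8).ToLowerInvariant();

    // Mirrors the {profile}/{kernelName}.{shorthash}.{ext} layout shown above.
    public static string PathFor(string profile, string kernelName, byte[] content, string extension) =>
        $"{profile}/{kernelName}.{ShortHash(content)}.{extension}";
}
```

Any change to the emitted bytes changes the path, so a cached response for an old path can never be served for new content; only the manifest needs revalidating.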
- A post-build MSBuild task (gated by `SpawnDevPrecompileShaders`) spawns an isolated `dotnet` worker that loads the just-built assembly by reflection.
- The worker scans for `[PrecompiledKernel]` (and the global profile set), runs `ShaderCompiler.Generate` for each (kernel x profile), and writes the sidecars + manifest.
- The build fails (or warns, per a strictness flag) if a declared kernel does not transpile for a declared profile - precompile errors surface at build time, not at the user's runtime.
- At app startup the runtime fetches `manifest.json` + the referenced sidecars and registers each artifact into `ShaderArtifactCache`, so the first load of a precompiled kernel is a Layer 3 hit.
This mirrors the established SpawnDev.BlazorJS.WebWorkers pattern (a build-time MSBuild task gated by a .csproj flag).
Why a build tool, not a Roslyn source generator: the transpiler consumes compiled IL (it reflects over real `MethodInfo`/IL), which a source generator does not have. The correct mechanism is a tool that reflects the BUILT assembly - the same model `WasmCompileDump` and the WebWorkers patcher use. The worker logic lives in `ShaderPrecompiler.Run` (in the library, so it is unit-testable in-process); `SpawnDev.ILGPU.Precompiler` is the thin spawnable host the `.targets` runs.
Nothing fetches at startup automatically. The consumer opts in by configuring the loader with the manifest URL and a fetch delegate, then warms kernels on demand:
```csharp
// Once, when wiring GPU work. A SpawnDev.BlazorJS.WebWorkers worker that runs no kernels
// never calls this, so it never touches the network.
ShaderArtifactManifestLoader.Configure(
    "_shaders/manifest.json",
    async url => await http.GetByteArrayAsync(url)); // BlazorJS fetch in-browser / HttpClient on desktop

// Lazy, per-kernel, right before a kernel is loaded (fetches just that one artifact):
await ShaderArtifactManifestLoader.TryWarmAsync(myKernelMethodInfo, profile);
// -> the next LoadKernel is a Layer 3 cache HIT (no transpile).

// Or preload the whole matching-profile set up front:
await ShaderArtifactManifestLoader.WarmAllAsync(profile);
```

Not configuring it = pure runtime transpile (today's behavior). `fetch` is a `Func<string,Task<byte[]>>`, so the library core takes no SpawnDev.BlazorJS dependency and the loader is unit-testable with an in-memory fetch.
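Because the fetch is just a `Func<string, Task<byte[]>>`, a test can supply one backed by a dictionary. The sketch below shows only such a delegate; `InMemoryFetch` is a hypothetical helper, the `{}` manifest body is a placeholder, and the snippet deliberately does not call the loader itself.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;

var files = new Dictionary<string, byte[]>
{
    ["_shaders/manifest.json"] = Encoding.UTF8.GetBytes("{}"), // placeholder manifest body
};
var fetched = new List<string>();
Func<string, Task<byte[]>> fetch = InMemoryFetch.Create(files, fetched);

byte[] manifest = await fetch("_shaders/manifest.json");
Console.WriteLine($"{manifest.Length} bytes, fetched: {string.Join(", ", fetched)}");
// prints: 2 bytes, fetched: _shaders/manifest.json

// HYPOTHETICAL test helper (not part of SpawnDev.ILGPU).
public static class InMemoryFetch
{
    // Returns a fetch delegate that serves bytes from 'files' and records each URL in 'log'.
    public static Func<string, Task<byte[]>> Create(
        IReadOnlyDictionary<string, byte[]> files, ICollection<string> log) =>
        url =>
        {
            log.Add(url);
            return files.TryGetValue(url, out var bytes)
                ? Task.FromResult(bytes)
                : Task.FromException<byte[]>(
                    new KeyNotFoundException($"No in-memory file for '{url}'"));
        };
}
```

Recording the requested URLs lets a test assert that a lazy warm-up touched only the one artifact it needed.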
The .targets is auto-imported from the package build/ folder and is gated OFF (no opt-in = unaffected). It locates the worker via $(SpawnDevPrecompilerToolPath), which defaults to the package tools/ folder and is overridable for a source/repo build:
```xml
<!-- source/repo build: point at the tool's bin output -->
<SpawnDevPrecompilerToolPath>$(RepoRoot)\SpawnDev.ILGPU.Precompiler\bin\$(Configuration)\net10.0\SpawnDev.ILGPU.Precompiler.dll</SpawnDevPrecompilerToolPath>
```

The remaining deployment wiring is bundling the `SpawnDev.ILGPU.Precompiler` tool (+ its dependency closure) into the package `tools/` folder so a NuGet consumer gets it with no override. (Options: bundle in `tools/`, ship a sibling build-tool package, or a `dotnet tool` - a packaging-design choice; the mechanism above is independent of it.)
- Profile-keyed, profile-matched, fallback-on-miss. An artifact is valid only for devices whose profile matches exactly (v1 uses exact profile equality). A `shader-f16` artifact is wrong on a non-f16 device, so the runtime matches profiles and falls back to runtime generation on any mismatch.
- Byte-identical to runtime. Offline generation for profile P equals what the live backend emits for a device matching P. The generators read capabilities only through `CapabilityProfile` (structural guard), so this holds by construction.
- Deterministic + versioned. `(IL, profile)` produces identical bytes; artifacts carry a profile schema + codegen version, and a version mismatch is treated as a miss (never trusted).
- Opt-in only. Master `.csproj` flag off = today's exact behavior.
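The matching rules in the list above (exact profile equality, and any version mismatch counting as a miss) can be sketched as a small check. `ArtifactStamp`, `MatchResult`, and `ArtifactMatcher` are hypothetical names, and the version fields are assumed shapes rather than the manifest's real schema.

```csharp
using System;

var runtime = new ArtifactStamp("profile-key-A", ProfileSchemaVersion: 1, CodegenVersion: "1.0.0");
var matching = runtime with { };
var stale = runtime with { CodegenVersion = "0.9.0" };
var otherDevice = runtime with { ProfileKey = "profile-key-B" };

Console.WriteLine(ArtifactMatcher.Check(matching, runtime));    // Hit
Console.WriteLine(ArtifactMatcher.Check(stale, runtime));       // VersionMismatch
Console.WriteLine(ArtifactMatcher.Check(otherDevice, runtime)); // ProfileMismatch

public enum MatchResult { Hit, ProfileMismatch, VersionMismatch }

// HYPOTHETICAL: what an artifact (or the running library) declares about itself.
public sealed record ArtifactStamp(string ProfileKey, int ProfileSchemaVersion, string CodegenVersion);

public static class ArtifactMatcher
{
    public static MatchResult Check(ArtifactStamp artifact, ArtifactStamp runtime)
    {
        // A version mismatch is never trusted, regardless of the profile.
        if (artifact.ProfileSchemaVersion != runtime.ProfileSchemaVersion ||
            artifact.CodegenVersion != runtime.CodegenVersion)
            return MatchResult.VersionMismatch;

        // v1 rule: exact profile equality, no "compatible superset" matching.
        if (!string.Equals(artifact.ProfileKey, runtime.ProfileKey, StringComparison.Ordinal))
            return MatchResult.ProfileMismatch;

        return MatchResult.Hit;
    }

    // Anything other than Hit means: fall back to runtime generation.
    public static bool IsUsable(ArtifactStamp artifact, ArtifactStamp runtime) =>
        Check(artifact, runtime) == MatchResult.Hit;
}
```

Both non-hit outcomes lead to the same action (runtime generation); distinguishing them is only useful for diagnostics.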
SpawnDev.ILGPU generates kernels dynamically: Lambda Kernels (captured scalars), DelegateSpecialization (one kernel, many ops), and above all the ML layer transpiling ONNX graphs into kernels at runtime (you cannot precompile a kernel that does not exist until a graph is loaded). So this fork is AOT + runtime fallback, never AOT-only: precompile the static/hot kernels for the determinism win, and keep the runtime IL-to-shader transpiler alive for the dynamic long tail and for unknown/un-profiled devices. The fallback is a load-bearing capability, not a stopgap. (This is where we deliberately diverge from upstream ILGPU's AOT-only direction; see Plans/precompiled-shaders.md 6.1.)
- Design + locked decisions: `Plans/precompiled-shaders.md`
- Capability gating for backend selection: `Docs/capabilities-and-backend-selection.md`
- WebGPU codegen internals: `SpawnDev.ILGPU/WebGPU/CLAUDE.md`
- Cache-key lesson (never `GetHashCode` for identity): `SpawnDev.ILGPU/Wasm/CLAUDE.md`