Skip to content

Commit 450bb01

Browse files
vanceingallsVai
andcommitted
perf(producer): native WebGPU shader-blend via Dawn (opt-in, Mac-Metal)
EXPERIMENTAL spike. Adds a Node-side WebGPU compositor backed by the `webgpu` npm package (Dawn) and wires it into the existing shader-transition worker as an opt-in alternative to the CPU blend. Default OFF — set `HF_DAWN_WEBGPU=1` or pass `--gpu-shader-blend` to the CLI to engage. The hf#677 chain (PRs #756-#760) gets us 1.95x on Mac via a CPU shader worker pool, ring-buffered pipelining, and a hybrid layered/parallel capture path. The remaining ceiling is the per-pixel JS blend itself — scalar f64 math in v8. On any host with a real GPU (Mac/Metal, Linux/Vulkan, Windows/D3D), moving the blend to the GPU via Dawn should drop blend wall time on a 854x480 rgb48le frame from ~150-910 ms (depending on shader complexity) to a few ms, projecting 3-5x end-to-end on top of the existing cascade — see the `reference_5x_shader_perf_alternatives.md` memo (option B). beginFrame is structurally Mac-unavailable (crbug.com/40656275); native WebGPU is the next-best Mac-viable lever. The Dawn binding (`webgpu@0.4.0`, dawn-gpu/node-webgpu, maintained by Dawn upstream — Corentin Wallez, Kai Ninomiya) ships a `darwin-universal` prebuilt that targets Apple Metal directly, plus Linux x64/arm64 Vulkan and Windows x64 D3D. Verified locally that the package loads on Linux x64; adapter init returns null in the sandbox (no Vulkan driver) and the worker transparently falls back to the existing CPU path. Mac numbers must be measured on Vance's laptop — there's no GPU host in CI. - `packages/producer/src/services/shaderTransitionGpu.ts` (NEW) Node-side compositor: dynamic-imports `webgpu`, requests an adapter + device, compiles a WGSL compute shader per supported transition, manages per-size GPU resources, dispatches blends, reads back to `rgb48le`. Returns structured init failures rather than throwing — the caller can always fall back to CPU. `HF_DAWN_FORCE_FAIL=1` short-circuits init for testability of the fallback path. - `packages/producer/src/services/shaderTransitionWorker.ts` Augmented: on the first message, if `HF_DAWN_WEBGPU=1`, dynamic-imports the GPU module and probes once. On success, supported shaders run on the GPU. Unsupported shaders (`glitch`, `domain-warp`, `swirl-vortex`, the rest) and any host without a GPU adapter fall through to the existing `TRANSITIONS[shader] ?? crossfade` CPU path with zero behavior change. Mid-render GPU failure disables the GPU path for the rest of the worker's life and falls back — the frame still completes. - `packages/cli/src/commands/render.ts` New `--gpu-shader-blend` boolean flag (default false). When set, the CLI exports `HF_DAWN_WEBGPU=1` into `process.env` before any worker pool spawns. Env-var plumbing chosen over threading the flag through the orchestrator -> stage -> pool -> worker chain because env vars cross the `worker_threads` boundary unchanged. - `packages/producer/build.mjs` + `packages/cli/tsup.config.ts` `webgpu` added to the `external` list in both bundlers. The Dawn binding ships a 70+ MB native `.dawn.node` binary per platform — it must be loaded from the user's node_modules at runtime, not inlined into the bundle. - `packages/producer/package.json` + `packages/cli/package.json` `webgpu@^0.4.0` added to `optionalDependencies` on both. Optional because (a) it's a native binary that may fail to install on some hosts and (b) the feature is opt-in. - `packages/producer/src/services/shaderTransitionGpu.test.ts` (NEW) 3 vitest tests: `HF_DAWN_FORCE_FAIL` short-circuit, init never throws, and PSNR-vs-CPU >= 50 dB on hosts where the adapter is available (auto-skipped on Linux sandbox with a log line). One representative shader (`crossfade`) is ported to WGSL end-to-end as proof of correctness. The harness is shader-agnostic; porting more is a mechanical add to `SHADERS_WGSL` in `shaderTransitionGpu.ts`. Unsupported shaders transparently fall through to CPU even when the flag is on. GPU compute uses f32 + u16 storage; CPU canonical uses f64. Bit-exact equality is not realistic. The default-off path keeps CI byte-equality pins intact (the existing fixtures hit the CPU path unchanged). Fixtures that exercise the GPU path must use a PSNR pin (>= 50 dB documented in the new test). The fallback CPU path is bit-exact with the canonical CPU implementation. Bundled CLI worker smoke (854x480 single crossfade blend, see the investigation log for the harness): | config | wall | notes | |---|---:|---| | CPU baseline (`HF_DAWN_WEBGPU` unset) | 67 ms | reference | | `HF_DAWN_WEBGPU=1` (fell back, no GPU) | 70 ms | +3 ms init+probe, then CPU | | `HF_DAWN_WEBGPU=1` + `HF_DAWN_FORCE_FAIL=1` | 66 ms | fallback verified | Both fallback runs produced byte-identical output to the CPU baseline checksum — fallback fidelity confirmed. **Mac numbers: Vance, please fill in.** Pull the branch, run a shader-transition fixture both with and without `--gpu-shader-blend`, and drop the wall-time pair in a comment. Anywhere the GPU path can't run, the existing CPU path runs unchanged: - `webgpu` package not installed (optional dep skipped on install) -> CPU - `webgpu` loads but `requestAdapter()` returns null (no GPU) -> CPU - WGSL pipeline compile fails -> CPU - `--gpu-shader-blend` set but shader not in `SHADERS_WGSL` -> CPU - Mid-render GPU failure -> log + disable GPU + finish on CPU Failure reason is logged once per worker (not per frame). No crash path. - Port the remaining 12 shaders to WGSL (mechanical: `flashThroughWhite`, `chromaticSplit`, `sdfIris`, `glitch`, `lightLeak`, `crossWarpMorph`, `whipPan`, `cinematicZoom`, `gravitationalLens`, `rippleWaves`, `swirlVortex`, `thermalDistortion`, `domainWarp`, `ridgedBurn`). - Once Mac wall-time confirms the lever, decide on flip-default-on or keep opt-in. - Update fixture pins for any fixture that needs to exercise the GPU path under CI (PSNR-only). - A persistent device + pipeline cache shared across workers (currently per-worker init). _PR drafted by Vai_ Co-Authored-By: Vai <vai@heygen.com>
1 parent e1f63a8 commit 450bb01

9 files changed

Lines changed: 801 additions & 41 deletions

File tree

bun.lock

Lines changed: 13 additions & 7 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

packages/cli/package.json

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -57,7 +57,8 @@
5757
"vitest": "^3.2.4"
5858
},
5959
"optionalDependencies": {
60-
"@google/genai": "^1.50.1"
60+
"@google/genai": "^1.50.1",
61+
"webgpu": "^0.4.0"
6162
},
6263
"engines": {
6364
"node": ">=22"

packages/cli/src/commands/render.ts

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -182,6 +182,15 @@ export default defineCommand({
182182
description:
183183
"Force host GPU acceleration for Chrome/WebGL capture. Default: auto (probe on first launch; fall back to software if no GPU). Use --no-browser-gpu to force software (SwiftShader).",
184184
},
185+
"gpu-shader-blend": {
186+
type: "boolean",
187+
default: false,
188+
description:
189+
"EXPERIMENTAL. Use the native WebGPU (Dawn) compositor for shader-transition blends when a GPU is available. " +
190+
"Falls back to CPU when Dawn isn't installed or no GPU adapter is present. " +
191+
"Determinism: PSNR ≥ 50dB vs the CPU canonical path, not byte-equal. " +
192+
"Currently ports a subset of shaders (crossfade); unsupported shaders transparently fall back to CPU.",
193+
},
185194
quiet: {
186195
type: "boolean",
187196
description: "Suppress verbose output",
@@ -293,6 +302,17 @@ export default defineCommand({
293302
workers = parsed;
294303
}
295304

305+
// ── GPU shader-blend (Dawn/WebGPU, EXPERIMENTAL) ────────────────────
306+
// The flag flips an env var that the shader-blend worker reads on first
307+
// message. We pipe through an env var (rather than threading the flag
308+
// through render orchestrator → captureHdrStage → captureHdrHybridLoop
309+
// → pool → worker) because env vars survive the worker_threads boundary
310+
// unchanged and require zero plumbing. The worker logs once whether it
311+
// could acquire a GPU; if not, the existing CPU path runs as before.
312+
if (args["gpu-shader-blend"] === true) {
313+
process.env.HF_DAWN_WEBGPU = "1";
314+
}
315+
296316
// ── Validate max-concurrent-renders ─────────────────────────────────
297317
if (args["max-concurrent-renders"] != null) {
298318
const parsed = parseInt(args["max-concurrent-renders"], 10);

packages/cli/tsup.config.ts

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -52,6 +52,12 @@ var __dirname = __hf_dirname(__filename);`,
5252
"esbuild",
5353
"giget",
5454
"postcss",
55+
// `webgpu` (Dawn) ships a 70+ MB native .dawn.node binary per
56+
// platform. Keeping it external means tsup won't try to inline it
57+
// and the CLI install resolves it (or doesn't, if the optionalDep
58+
// skipped) from the user's node_modules. The shader-blend worker
59+
// dynamically `import("webgpu")` and falls back to CPU on absence.
60+
"webgpu",
5561
],
5662
noExternal: [
5763
"@hyperframes/core",

packages/producer/build.mjs

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -42,7 +42,7 @@ await Promise.all([
4242
platform: "node",
4343
target: "node22",
4444
format: "esm",
45-
external: ["puppeteer", "esbuild", "postcss"],
45+
external: ["puppeteer", "esbuild", "postcss", "webgpu"],
4646
plugins: [workspaceAliasPlugin],
4747
minify: false,
4848
sourcemap: true,
@@ -54,7 +54,7 @@ await Promise.all([
5454
platform: "node",
5555
target: "node22",
5656
format: "esm",
57-
external: ["puppeteer", "esbuild", "postcss"],
57+
external: ["puppeteer", "esbuild", "postcss", "webgpu"],
5858
plugins: [workspaceAliasPlugin],
5959
minify: false,
6060
sourcemap: true,
@@ -70,7 +70,7 @@ await Promise.all([
7070
platform: "node",
7171
target: "node22",
7272
format: "esm",
73-
external: ["puppeteer", "esbuild", "postcss"],
73+
external: ["puppeteer", "esbuild", "postcss", "webgpu"],
7474
plugins: [workspaceAliasPlugin],
7575
minify: false,
7676
sourcemap: true,
@@ -86,7 +86,7 @@ await Promise.all([
8686
platform: "node",
8787
target: "node22",
8888
format: "esm",
89-
external: ["puppeteer", "esbuild", "postcss"],
89+
external: ["puppeteer", "esbuild", "postcss", "webgpu"],
9090
plugins: [workspaceAliasPlugin],
9191
minify: false,
9292
sourcemap: true,

packages/producer/package.json

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -81,6 +81,9 @@
8181
"tsx": "^4.21.0",
8282
"typescript": "^5.7.2"
8383
},
84+
"optionalDependencies": {
85+
"webgpu": "^0.4.0"
86+
},
8487
"engines": {
8588
"node": ">=22"
8689
}
Lines changed: 149 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,149 @@
1+
/**
2+
* Tests for the Dawn/WebGPU shader-blend compositor.
3+
*
4+
* We can't depend on a working GPU adapter in CI — the Linux sandbox has
5+
* no Vulkan driver. So these tests focus on the surface that must work
6+
* regardless of host:
7+
*
8+
* 1. `HF_DAWN_FORCE_FAIL=1` short-circuits init to a clean failure (the
9+
* env hook the CLI / worker rely on for fallback testability).
10+
* 2. `initGpuCompositor()` never throws. On a no-GPU host it returns
11+
* `{ ok: false, reason }` and the caller can fall back without
12+
* try/catch.
13+
* 3. When a GPU IS available (Vance's Mac, Linux+GPU), the compositor's
14+
* crossfade output matches the CPU canonical path within PSNR ≥ 50dB.
15+
* This branch is skipped when init fails — the test logs the reason
16+
* instead so a regression on Mac surfaces cleanly without breaking CI
17+
* elsewhere.
18+
*
19+
* Determinism note: we deliberately do NOT pin byte-equality with the CPU
20+
* shader. The whole point of the new path is f32 GPU math + u16 storage,
21+
* which differs from f64 CPU math at the LSB. PSNR is the right pin.
22+
*/
23+
24+
import { afterEach, beforeEach, describe, expect, it } from "vitest";
25+
import { crossfade } from "@hyperframes/engine/shader-transitions";
26+
import { initGpuCompositor } from "./shaderTransitionGpu.js";
27+
28+
const WIDTH = 32;
29+
const HEIGHT = 16;
30+
const PX = WIDTH * HEIGHT;
31+
const BYTES = PX * 6;
32+
33+
function fillGradient(): Buffer {
34+
const buf = Buffer.alloc(BYTES);
35+
for (let i = 0; i < PX; i++) {
36+
const o = i * 6;
37+
buf.writeUInt16LE((i * 1024) & 0xffff, o);
38+
buf.writeUInt16LE(((i * 2048) & 0xffff) ^ 0xa5a5, o + 2);
39+
buf.writeUInt16LE(((i * 4096) & 0xffff) ^ 0x5a5a, o + 4);
40+
}
41+
return buf;
42+
}
43+
44+
function fillSolid(r: number, g: number, b: number): Buffer {
45+
const buf = Buffer.alloc(BYTES);
46+
for (let i = 0; i < PX; i++) {
47+
const o = i * 6;
48+
buf.writeUInt16LE(r, o);
49+
buf.writeUInt16LE(g, o + 2);
50+
buf.writeUInt16LE(b, o + 4);
51+
}
52+
return buf;
53+
}
54+
55+
/**
56+
* Peak signal-to-noise ratio in dB between two rgb48le buffers (16-bit
57+
* channel depth → MAX = 65535). >= 50 dB is the acceptance bar for the
58+
* GPU path (still visually indistinguishable from f64 canonical; passes
59+
* the eye / objective metric for transition rendering).
60+
*/
61+
function psnrDb(a: Buffer, b: Buffer): number {
62+
if (a.length !== b.length) throw new Error("buffer length mismatch");
63+
const samples = a.length / 2;
64+
let sse = 0;
65+
for (let i = 0; i < samples; i++) {
66+
const av = a.readUInt16LE(i * 2);
67+
const bv = b.readUInt16LE(i * 2);
68+
const d = av - bv;
69+
sse += d * d;
70+
}
71+
if (sse === 0) return Infinity;
72+
const mse = sse / samples;
73+
const MAX = 65535;
74+
return 10 * Math.log10((MAX * MAX) / mse);
75+
}
76+
77+
describe("shaderTransitionGpu", () => {
78+
const originalForceFail = process.env.HF_DAWN_FORCE_FAIL;
79+
80+
beforeEach(() => {
81+
// Each test below sets its own value; reset between tests so they don't
82+
// bleed state. The module caches the loadWebgpu() promise, but each
83+
// suite-level test runs in a fresh vitest worker file so the cache is
84+
// only shared within a single `describe` — fine for these tests.
85+
delete process.env.HF_DAWN_FORCE_FAIL;
86+
});
87+
88+
afterEach(() => {
89+
if (originalForceFail === undefined) {
90+
delete process.env.HF_DAWN_FORCE_FAIL;
91+
} else {
92+
process.env.HF_DAWN_FORCE_FAIL = originalForceFail;
93+
}
94+
});
95+
96+
it("HF_DAWN_FORCE_FAIL short-circuits to a clean failure", async () => {
97+
process.env.HF_DAWN_FORCE_FAIL = "1";
98+
const result = await initGpuCompositor();
99+
expect(result.ok).toBe(false);
100+
if (!result.ok) {
101+
expect(result.reason).toMatch(/HF_DAWN_FORCE_FAIL/);
102+
}
103+
});
104+
105+
it("returns ok:false (never throws) on hosts without a GPU adapter", async () => {
106+
// No assertion on which branch we hit — we just assert the call never
107+
// throws and returns a structured result. On Vance's Mac this will
108+
// typically be `ok: true`; on the Linux sandbox it'll be
109+
// `{ ok: false, reason: "no GPU adapter..." }` or the
110+
// module-not-installed branch. Both are correct.
111+
const result = await initGpuCompositor();
112+
expect(typeof result).toBe("object");
113+
if (result.ok) {
114+
expect(typeof result.compositor.supportsShader).toBe("function");
115+
expect(result.compositor.supportsShader("crossfade")).toBe(true);
116+
expect(result.compositor.supportsShader("not-a-real-shader")).toBe(false);
117+
await result.compositor.dispose();
118+
} else {
119+
expect(typeof result.reason).toBe("string");
120+
expect(result.reason.length).toBeGreaterThan(0);
121+
}
122+
});
123+
124+
it("crossfade output matches CPU canonical within PSNR >= 50dB when a GPU is available", async () => {
125+
const result = await initGpuCompositor();
126+
if (!result.ok) {
127+
// Skipped — host has no GPU. Log so a regression on Mac (where the
128+
// adapter SHOULD be available) is visible in the test output.
129+
// eslint-disable-next-line no-console
130+
console.log(`[shaderTransitionGpu.test] GPU branch skipped: ${result.reason}`);
131+
return;
132+
}
133+
const compositor = result.compositor;
134+
try {
135+
const from = fillGradient();
136+
const to = fillSolid(40000, 5000, 25000);
137+
const outGpu = Buffer.alloc(BYTES);
138+
const outCpu = Buffer.alloc(BYTES);
139+
await compositor.blend("crossfade", from, to, outGpu, WIDTH, HEIGHT, 0.5);
140+
crossfade(from, to, outCpu, WIDTH, HEIGHT, 0.5);
141+
const psnr = psnrDb(outGpu, outCpu);
142+
// eslint-disable-next-line no-console
143+
console.log(`[shaderTransitionGpu.test] crossfade PSNR vs CPU: ${psnr.toFixed(2)} dB`);
144+
expect(psnr).toBeGreaterThanOrEqual(50);
145+
} finally {
146+
await compositor.dispose();
147+
}
148+
});
149+
});

0 commit comments

Comments
 (0)