Reusable llama.cpp web bridge runtime (JS + WASM).
This repository provides:
src/llama_webgpu_core.cpp(native bridge core)js/src/llama_webgpu_bridge.js(JS runtime wrapper source)js/llama_webgpu_bridge.js(generated bundled browser ESM wrapper)js/llama_webgpu_bridge.d.ts(public TypeScript declaration asset)docs/api.md(public JavaScript API reference)CMakeLists.txtfor Emscripten builds
Applications that consume the browser assets directly should start with the
LlamaWebGpuBridge public API reference. The generated
llama_webgpu_bridge.d.ts file is published with the JS/WASM assets; the API
reference explains the runtime behavior for model loading, generation,
tokenization, embeddings, multimodal projector support, state persistence,
metadata, cancellation, disposal, and worker-host bootstrap.
Requirements:
- Emscripten SDK (
emcmake,emcc) inPATH - Node.js/npm for the JS bridge build pipeline (
npm ci,npm run check:js) - llama.cpp source checkout matching
llama_cpp.versionor a compatible checkout exposingllama_state_save_file/llama_state_load_filewith the signatures used bysrc/llama_webgpu_core.cpp
Build command:
npm ci
npm run check:js
./scripts/build_bridge.sh./scripts/build_bridge.sh also runs the JS bridge build before copying wrapper
assets, but running npm run check:js explicitly is useful before PRs because it
performs TypeScript checkJs, regenerates the checked-in browser ESM wrapper and
declaration files with esbuild, and syntax-checks the generated bridge files.
Useful environment variables:
LLAMA_CPP_DIR(path to llama.cpp source)BUILD_DIR(cmake build dir)OUT_DIR(output directory; defaults todist/)WEBGPU_BRIDGE_BUILD_MEM64(1to also build optional wasm64 core assets)WEBGPU_BRIDGE_MEM64_MAX_MEMORY(optional wasm64 max linear memory bytes)WEBGPU_BRIDGE_PTHREADS(1/0, defaults to1)WEBGPU_BRIDGE_PTHREAD_POOL_SIZE(defaults to4)WEBGPU_BRIDGE_PTHREAD_POOL_SIZE_STRICT(defaults to0)
Notes:
- wasm64 builds default to
WEBGPU_BRIDGE_MEM64_MAX_MEMORY=12884901888(12 GiB). - Large single-file remote model loading requires a cross-origin isolated page
(
COOP/COEP) so worker-thread runtime paths are available. - pthread builds preallocate
WEBGPU_BRIDGE_PTHREAD_POOL_SIZEworkers and cap bridge-selected thread counts to that compiled pool size.WEBGPU_BRIDGE_PTHREAD_POOL_SIZE_STRICTdefaults to0so an unexpected over-pool request does not hard-abort the wasm runtime, but it can be overridden for stricter local diagnostics.
Build outputs:
dist/llama_webgpu_bridge.jsdist/llama_webgpu_bridge_worker.jsdist/llama_webgpu_bridge.d.tsdist/llama_webgpu_core.jsdist/llama_webgpu_core.wasm
Optional outputs (when WEBGPU_BRIDGE_BUILD_MEM64=1):
dist/llama_webgpu_core_mem64.jsdist/llama_webgpu_core_mem64.wasm
The bridge exposes llama.cpp session/state persistence through both direct runtime
and worker-backed LlamaWebGpuBridge instances.
API:
await bridge.stateSaveFile(path, tokens = []) -> trueawait bridge.stateLoadFile(path, tokenCapacity = bridge.getContextSize()) -> { tokens }await bridge.stateSaveBytes(tokens = []) -> Uint8Arrayawait bridge.stateLoadBytes(bytes, tokenCapacity = bridge.getContextSize()) -> { tokens }
stateSave* snapshots the current llama.cpp context; it does not tokenize or
evaluate the supplied tokens. Save only after the prompt/prefix you want to
restore has already been evaluated by the bridge, then pass the exact token
sequence for that evaluated prompt/prefix:
// After loadModelFromUrl(...) and after prompt/prefix evaluation:
const prefixTokens = await bridge.tokenize(prefixText, true);
await bridge.stateSaveFile('/prompt-state.bin', prefixTokens);
const restored = await bridge.stateLoadFile(
'/prompt-state.bin',
bridge.getContextSize(),
);
console.log(restored.tokens);
const bytes = await bridge.stateSaveBytes(prefixTokens);
await bridge.stateLoadBytes(bytes, bridge.getContextSize());State files are opaque llama.cpp state/session files. They are tied to the same model, llama.cpp build, and compatible runtime/model-load parameters. Loading a state file from a different model/build can fail.
The tokens argument is stored in the llama.cpp state/session file and is
returned by stateLoad*; it is not evaluated by stateSave* and is not validated
against the KV cache. Passing the wrong token list can make later prompt-prefix
reuse incorrect. Passing [] is allowed, but gives the bridge no restored
prefix-token metadata to reuse.
stateLoad* requires tokenCapacity to be positive, large enough for the stored
token list, and no larger than the active context size. If omitted, the JS API
uses bridge.getContextSize(). Empty stateLoadBytes input is rejected. All
four state methods require a loaded model.
stateSaveFile and stateLoadFile operate on the active WASMFS instance. In a
browser this filesystem is virtual and not durable by default, and worker-mode
paths live inside the worker runtime. Use stateSaveBytes and stateLoadBytes
when the application needs to persist snapshots in IndexedDB, OPFS, Cache API, or
another app-managed durable store.
State save/load is rejected while generation is active. On successful load the
bridge restores the prompt token list returned as { tokens }, so reissuing the
same prompt can reuse the loaded KV state via the existing prompt-prefix reuse
path.
This repo includes a wasm build gate in:
.github/workflows/ci.yml
It builds against the pinned llama.cpp tag in llama_cpp.version, runs the JS
bridge build/type-check gate, uploads build artifacts, and runs the static CI
reliability contract:
python3 scripts/verify_ci_reliability.pyThe reliability contract protects the browser smoke and workflow invariants that are easy to regress during agent-driven maintenance:
- both CI and publish workflows run
npm run check:js, which TypeScript-checks the JS source, regenerates the readable browser ESM wrapper/declaration files with esbuild, and then fails on any stale checked-in generated output viagit diff --exit-code; - both CI and publish workflows resolve the default llama.cpp tag from
llama_cpp.version, so tag-triggered asset publishes cannot silently rebuild against a stale workflow default; .github/workflows/auto_llama_cpp_update.ymlopens or updates one stableautomation/bump-llama-cppPR when a newer upstream release exists, with the upstream release notes, compare link, and commit range in the PR body, then dispatches the CI workflow on the automation branch soGITHUB_TOKENbranch updates still get a head-SHA validation run;- both CI and publish workflows opt into
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24to catch action-runtime deprecation issues early; - the state-persistence browser smoke supports an integrity-checked tiny GGUF model round trip;
- the CI model cache path expands
~before resolving so it matches theactions/cachedirectory; - browser smoke failures upload
state-persistence-smoke-artifactswith console logs, result JSON, and screenshots when available.
Run the model-backed smoke locally after building the bridge if a change touches state persistence, workers, browser smoke, or workflow diagnostics:
python3 scripts/state_persistence_browser_smoke.py \
--dist-dir /path/to/webgpu_bridge_dist \
--model-url https://huggingface.co/aladar/llama-2-tiny-random-GGUF/resolve/main/llama-2-tiny-random.gguf \
--model-sha256 81f226c62d28ed4a1a9b9fa080fcd9f0cc40e0f9d5680036583ff98fbcd035cb \
--model-cache-dir ~/.cache/llama-web-bridge/state-smoke-models \
--artifacts-dir /tmp/llama-web-bridge-state-smokeDo not commit downloaded GGUFs, Playwright screenshots, console logs, generated
dist/ assets, or Emscripten build/cache directories.
Published, versioned artifacts are consumed from:
leehack/llama-web-bridge-assets
Publish workflow:
.github/workflows/publish_assets.yml
Trigger modes:
- Automatic: push a
v*tag in this repo (for examplev0.1.5) - Manual: run workflow dispatch with explicit inputs
Required repository secret:
WEBGPU_BRIDGE_ASSETS_PAT(token with write access toleehack/llama-web-bridge-assets)
The publish workflow carries the resolved llama.cpp tag from the build job to
the release job as an explicit job output, so the asset release notes match the
manifest.json llama_cpp_tag value.
Example publish:
- Create/push a release tag in this repo (for example
v0.1.5) Publish Bridge Assetsruns automatically and publishes the same tag toleehack/llama-web-bridge-assets- Workflow also creates/updates the matching GitHub Release in
leehack/llama-web-bridge-assets
Manual override example:
- Run
Publish Bridge Assetsworkflow - Inputs:
assets_tag:v0.1.5assets_repo:leehack/llama-web-bridge-assetsllama_cpp_tag: leave empty to usellama_cpp.version, or set an explicit temporary override such asb9165
After publish, assets are CDN-available at:
https://cdn.jsdelivr.net/gh/leehack/llama-web-bridge-assets@<tag>/llama_webgpu_bridge.jshttps://cdn.jsdelivr.net/gh/leehack/llama-web-bridge-assets@<tag>/llama_webgpu_bridge_worker.jshttps://cdn.jsdelivr.net/gh/leehack/llama-web-bridge-assets@<tag>/llama_webgpu_bridge.d.tshttps://cdn.jsdelivr.net/gh/leehack/llama-web-bridge-assets@<tag>/llama_webgpu_core.jshttps://cdn.jsdelivr.net/gh/leehack/llama-web-bridge-assets@<tag>/llama_webgpu_core.wasm
When WEBGPU_BRIDGE_BUILD_MEM64=1, the assets repo also publishes optional
memory64 core artifacts:
https://cdn.jsdelivr.net/gh/leehack/llama-web-bridge-assets@<tag>/llama_webgpu_core_mem64.jshttps://cdn.jsdelivr.net/gh/leehack/llama-web-bridge-assets@<tag>/llama_webgpu_core_mem64.wasm
Note: CDN pinning fundamentally relies on git tags in the assets repo.
AGENTS.md: agent workflow and cross-repo handoffCONTRIBUTING.md: contributor setup/build/publish steps