Commit e9d9908

Merge pull request #117 from leehack/sync-webbridge-b9016
Sync WebGPU bridge assets to llama.cpp b9016
2 parents 5cf9be8 + b4258b2 commit e9d9908

14 files changed

Lines changed: 649 additions & 41 deletions

AGENTS.md

Lines changed: 24 additions & 0 deletions
@@ -25,6 +25,30 @@ dart pub global run coverage:format_coverage --lcov --in=coverage/test --out=cov
 dart run tool/testing/check_lcov_threshold.dart coverage/lcov.info 70
 ```
 
+### Local Chat App Web E2E
+Use the real chat app path for WebGPU bridge validation after bridge/runtime
+updates. This catches issues that direct bridge probes miss.
+
+```bash
+cd example/chat_app
+flutter build web --base-href=/example/chat_app/build/web/
+cd ../..
+python3 tool/testing/serve_static_with_headers.py --directory . --port 7358
+
+.venv-playwright/bin/python tool/testing/playwright_chat_app_real_model_smoke.py \
+  http://127.0.0.1:7358/example/chat_app/build/web/ \
+  --model-url http://127.0.0.1:7358/example/llamadart_server/models/Qwen3.5-0.8B-Q4_K_M.gguf \
+  --expect 4
+```
+
+When serving `build/web` under a repo-root path, build with the matching
+`--base-href`; otherwise Flutter resolves `flutter_bootstrap.js` and
+`webgpu_bridge/*` from `/`. On macOS headless Chromium, use the smoke script's
+default `--browser-angle auto` or pass `--browser-angle metal`; without Metal
+ANGLE the adapter may lack `shader-f16` and llama.cpp can abort in
+`ggml-webgpu` even for CPU/gpuLayers=0 runs. For larger models such as Gemma 4,
+pass `--mem64` and a smaller `--context-size` to keep the smoke bounded.
+
 ### CI Standards
 - `dart format --output=none --set-exit-if-changed .` checks formatting
 - `dart analyze` runs the linter
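
Review note: the `--base-href` requirement in the added AGENTS.md text comes down to how relative URLs resolve against the document base. The following is a minimal JavaScript sketch using the standard `URL` constructor. The `resolveAsset` helper is hypothetical and not part of this repository; the origin and paths are taken from the commands in the hunk above, and `/` is assumed as the base when the flag is omitted, as the added text states.

```javascript
// Hypothetical helper (not in the repo): resolve an asset path the way a
// browser would against the page's <base href>.
function resolveAsset(relativePath, baseHref, origin = 'http://127.0.0.1:7358') {
  const documentBase = new URL(baseHref, origin);
  return new URL(relativePath, documentBase).href;
}

// Built with --base-href=/example/chat_app/build/web/
const withFlag = resolveAsset('flutter_bootstrap.js', '/example/chat_app/build/web/');

// Built without the flag: the base is assumed to be "/".
const withoutFlag = resolveAsset('flutter_bootstrap.js', '/');

console.log(withFlag);
console.log(withoutFlag);
```

With the repo root as the served directory, only the first URL points inside `example/chat_app/build/web/`; the second points at the server root, where no `flutter_bootstrap.js` is expected to exist.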

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -3,6 +3,12 @@
 * **Native runtime sync**:
   * Updated native hook pinning to `leehack/llamadart-native@b9016`,
     picking up the CUDA 12.8 Blackwell-capable native bundles.
+  * Updated default web bridge asset pinning to
+    `leehack/llama-web-bridge-assets@v0.1.13` (llama.cpp `b9016`) so
+    native and web runtimes track the same upstream revision.
+  * Picked up the bridge-side Qwen UTF-8 streaming stabilization and
+    multimodal fallback narrowing, while preserving control-token output for
+    parser consumers.
 * **Load-time tuning knobs**:
   * Added `ModelParams.useMmap` (default `true`) and
     `ModelParams.useMlock` (default `false`), wired to
@@ -28,6 +34,9 @@
     `ModelParams.ropeFrequencyScale` (both nullable) for
     context-extension overrides on `llama_context_params.rope_freq_base` /
     `rope_freq_scale`. `null` keeps the model's trained values.
+  * Forwarded native-compatible `ModelParams` load tuning knobs through the
+    WebGPU bridge path, including `maxParallelSequences`, flash attention,
+    KV-cache type, KV-unified, RoPE, split-mode, and main-GPU options.
 * **GPU device selection API**:
   * Added `ModelParams.mainGpu` and wired it to llama.cpp
     `llama_model_params.main_gpu`.

doc/webgpu_bridge.md

Lines changed: 4 additions & 4 deletions
@@ -19,7 +19,7 @@ pipelines.
    `https://cdn.jsdelivr.net/gh/leehack/llama-web-bridge-assets@<tag>/llama_webgpu_bridge.js`
 2. Local fallback: `./webgpu_bridge/llama_webgpu_bridge.js`
 
-Default pinned tag in the example is `v0.1.10`.
+Default pinned tag in the example is `v0.1.13`.
 
 For broader browser coverage in this repository, fetched/local assets are patched
 to a universal Safari-compatible gate by default (`MIN_SAFARI_VERSION=170400`).
@@ -32,7 +32,7 @@ model bytes.
 To vendor pinned assets into local app web files:
 
 ```bash
-WEBGPU_BRIDGE_ASSETS_TAG=v0.1.10 ./scripts/fetch_webgpu_bridge_assets.sh
+WEBGPU_BRIDGE_ASSETS_TAG=v0.1.13 ./scripts/fetch_webgpu_bridge_assets.sh
 ```
 
 Optional compatibility env vars:
@@ -108,7 +108,7 @@ You can override CDN source/version before the bridge loader runs:
 ```html
 <script>
   window.__llamadartBridgeAssetsRepo = 'leehack/llama-web-bridge-assets';
-  window.__llamadartBridgeAssetsTag = 'v0.1.10';
+  window.__llamadartBridgeAssetsTag = 'v0.1.13';
 </script>
 ```
 
@@ -124,7 +124,7 @@ window.LlamaWebGpuBridge = class LlamaWebGpuBridge {
 
 `WebGpuLlamaBackend` can use these methods if present:
 
-- `loadModelFromUrl(url, { nCtx, nThreads, nGpuLayers, useCache, progressCallback })`
+- `loadModelFromUrl(url, { nCtx, nThreads, nThreadsBatch, nBatch, nUbatch, nGpuLayers, nSeqMax, flashAttention, cacheTypeK, cacheTypeV, kvUnified, ropeFrequencyBase, ropeFrequencyScale, splitMode, mainGpu, useCache, forceRemoteFetchBackend, remoteFetchChunkBytes, progressCallback })`
 - `prefetchModelToCache(url, { useCache, force, cacheName, progressCallback })`
 - `evictModelFromCache(url, { cacheName })`
 - `loadMultimodalProjector(url)`
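
Review note: "can use these methods if present" implies that a bridge build may implement only part of this surface. One way to picture that contract is plain feature detection, sketched below in JavaScript. `PartialBridge` and `detectBridgeMethods` are hypothetical names made up for this illustration; the actual check in `WebGpuLlamaBackend` goes through Dart JS interop and may look different.

```javascript
// Hypothetical stand-in for a bridge build that implements only some methods.
class PartialBridge {
  async loadModelFromUrl(url, options) {
    return { url, options };
  }

  async evictModelFromCache(url, options) {
    return true;
  }
}

// Method names are the ones listed in the doc hunk above.
const OPTIONAL_METHODS = [
  'loadModelFromUrl',
  'prefetchModelToCache',
  'evictModelFromCache',
  'loadMultimodalProjector',
];

// Illustrative helper: report which optional methods a bridge instance exposes.
function detectBridgeMethods(bridge) {
  return Object.fromEntries(
    OPTIONAL_METHODS.map((name) => [name, typeof bridge[name] === 'function'])
  );
}

const supported = detectBridgeMethods(new PartialBridge());
console.log(supported);
```

A caller following this pattern would skip prefetching and multimodal projector loading for `PartialBridge` rather than throwing on a missing method.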

example/chat_app/web/index.html

Lines changed: 2 additions & 2 deletions
@@ -215,8 +215,8 @@
       const bridgeAssetsTag =
         typeof configuredTag === 'string' && configuredTag.length > 0
           ? configuredTag
-          : 'v0.1.10';
-      const localBridgeVersion = 'v0.1.10-local-20260308a';
+          : 'v0.1.13';
+      const localBridgeVersion = 'v0.1.13-local-b9016';
       window.__llamadartBridgeLocalVersion = localBridgeVersion;
 
       const localBridgeUrl = `./webgpu_bridge/llama_webgpu_bridge.js?v=${localBridgeVersion}`;

lib/src/backends/llama_cpp/load_param_helpers.dart

Lines changed: 7 additions & 30 deletions
@@ -4,41 +4,18 @@
 
 import '../../core/models/config/flash_attention.dart';
 import '../../core/models/config/kv_cache_type.dart';
+import '../../core/models/config/llama_cpp_param_values.dart'
+    as llama_cpp_values;
 import '../../core/models/inference/model_params.dart';
 import 'bindings.dart';
 
+export '../../core/models/config/llama_cpp_param_values.dart'
+    show resolveFlashAttention;
+
 /// Maps llamadart's [KvCacheType] enum to llama.cpp's `ggml_type`. Pure
 /// switch, no side effects.
 ggml_type ggmlTypeFor(KvCacheType type) {
-  switch (type) {
-    case KvCacheType.f16:
-      return ggml_type.GGML_TYPE_F16;
-    case KvCacheType.q8_0:
-      return ggml_type.GGML_TYPE_Q8_0;
-    case KvCacheType.q4_0:
-      return ggml_type.GGML_TYPE_Q4_0;
-  }
-}
-
-/// Resolves the user-requested [FlashAttention] given the requested KV
-/// cache types. llama.cpp refuses non-F16 KV without flash attention, so
-/// `auto` is auto-promoted to `enabled` when either KV type isn't F16.
-/// Explicit `enabled` / `disabled` are passed through unchanged.
-///
-/// Pairing this with [ModelParams]'s constructor-side ArgumentError on
-/// `(non-F16 KV, FA disabled)` ensures the only ambiguous case (`auto`)
-/// gets resolved deterministically here.
-FlashAttention resolveFlashAttention({
-  required FlashAttention requested,
-  required KvCacheType cacheTypeK,
-  required KvCacheType cacheTypeV,
-}) {
-  final wantsKvQuantization =
-      cacheTypeK != KvCacheType.f16 || cacheTypeV != KvCacheType.f16;
-  if (requested == FlashAttention.auto && wantsKvQuantization) {
-    return FlashAttention.enabled;
-  }
-  return requested;
+  return ggml_type.fromValue(llama_cpp_values.ggmlTypeValueFor(type));
 }
 
 /// Applies the user-controlled fields of [params] to a freshly-defaulted
@@ -57,7 +34,7 @@ FlashAttention applyContextParams(
   llama_context_params ctxParams,
   ModelParams params,
 ) {
-  final resolvedFlashAttn = resolveFlashAttention(
+  final resolvedFlashAttn = llama_cpp_values.resolveFlashAttention(
     requested: params.flashAttention,
     cacheTypeK: params.cacheTypeK,
     cacheTypeV: params.cacheTypeV,

lib/src/backends/webgpu/interop.dart

Lines changed: 9 additions & 0 deletions
@@ -125,6 +125,15 @@ extension type WebGpuLoadModelOptions._(JSObject _) implements JSObject {
     @JS('nBatch') int? nBatch,
     @JS('nUbatch') int? nUbatch,
     @JS('nGpuLayers') int? nGpuLayers,
+    @JS('nSeqMax') int? nSeqMax,
+    @JS('flashAttention') int? flashAttention,
+    @JS('cacheTypeK') int? cacheTypeK,
+    @JS('cacheTypeV') int? cacheTypeV,
+    @JS('kvUnified') bool? kvUnified,
+    @JS('ropeFrequencyBase') double? ropeFrequencyBase,
+    @JS('ropeFrequencyScale') double? ropeFrequencyScale,
+    @JS('splitMode') int? splitMode,
+    @JS('mainGpu') int? mainGpu,
     @JS('useCache') bool? useCache,
     @JS('forceRemoteFetchBackend') bool? forceRemoteFetchBackend,
     @JS('remoteFetchThresholdBytes') int? remoteFetchThresholdBytes,

lib/src/backends/webgpu/webgpu_backend.dart

Lines changed: 25 additions & 0 deletions
@@ -9,6 +9,7 @@ import 'package:web/web.dart';
 
 import '../../core/models/chat/content_part.dart';
 import '../../core/models/config/gpu_backend.dart';
+import '../../core/models/config/llama_cpp_param_values.dart';
 import '../../core/models/config/log_level.dart';
 import '../../core/models/inference/generation_params.dart';
 import '../../core/models/inference/model_params.dart';
@@ -615,6 +616,20 @@ class WebGpuLlamaBackend
     return (nBatch: tunedBatch, nUbatch: tunedUbatch);
   }
 
+  int _webGpuFlashAttentionValue(ModelParams params) {
+    return llamaFlashAttentionTypeValueFor(
+      resolveFlashAttention(
+        requested: params.flashAttention,
+        cacheTypeK: params.cacheTypeK,
+        cacheTypeV: params.cacheTypeV,
+      ),
+    );
+  }
+
+  bool? _webGpuKvUnifiedValue(ModelParams params) {
+    return params.kvUnified ?? (params.maxParallelSequences > 1 ? true : null);
+  }
+
   int _resolveSafeRequestedGpuLayers({
     required String url,
     required ModelParams params,
@@ -815,6 +830,7 @@ class WebGpuLlamaBackend
     ModelParams params, {
     Function(double progress)? onProgress,
   }) async {
+    params.validate();
     _preferMemory64Override = null;
     _forceRemoteFetchBackendOverride = null;
 
@@ -931,6 +947,15 @@ class WebGpuLlamaBackend
                 ? params.microBatchSize
                 : batchTuning.nUbatch,
             nGpuLayers: attempt.gpuLayers,
+            nSeqMax: math.max(1, params.maxParallelSequences),
+            flashAttention: _webGpuFlashAttentionValue(params),
+            cacheTypeK: ggmlTypeValueFor(params.cacheTypeK),
+            cacheTypeV: ggmlTypeValueFor(params.cacheTypeV),
+            kvUnified: _webGpuKvUnifiedValue(params),
+            ropeFrequencyBase: params.ropeFrequencyBase,
+            ropeFrequencyScale: params.ropeFrequencyScale,
+            splitMode: params.splitMode.llamaCppValue,
+            mainGpu: params.mainGpu,
             useCache: true,
             forceRemoteFetchBackend: forceRemoteFetchBackend,
             remoteFetchChunkBytes: remoteFetchChunkBytesOverride,
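
Review note: the `nSeqMax` clamp and the `_webGpuKvUnifiedValue` fallback decide what the bridge receives for parallel-sequence loads. The JavaScript sketch below mirrors those two Dart expressions for illustration only. `toBridgeSequenceOptions` is a hypothetical name, and the plain-object input stands in for `ModelParams`.

```javascript
// Illustrative JavaScript mirror of the Dart expressions in the hunk above.
// `toBridgeSequenceOptions` is a made-up name, not part of the bridge API.
function toBridgeSequenceOptions({ maxParallelSequences, kvUnified = null }) {
  return {
    // Mirrors math.max(1, params.maxParallelSequences).
    nSeqMax: Math.max(1, maxParallelSequences),
    // Mirrors params.kvUnified ?? (params.maxParallelSequences > 1 ? true : null):
    // an explicit value wins; otherwise unified KV is requested only when
    // more than one sequence is configured.
    kvUnified: kvUnified ?? (maxParallelSequences > 1 ? true : null),
  };
}

const single = toBridgeSequenceOptions({ maxParallelSequences: 1 });
const parallel = toBridgeSequenceOptions({ maxParallelSequences: 4 });
const optedOut = toBridgeSequenceOptions({ maxParallelSequences: 4, kvUnified: false });

console.log(single, parallel, optedOut);
```

In this sketch a single-sequence load leaves `kvUnified` as `null`, so the bridge default applies, while an explicit `false` survives even with four sequences because `??` only falls through on `null` or `undefined`.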
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+import 'flash_attention.dart';
+import 'kv_cache_type.dart';
+
+/// Maps llamadart's [KvCacheType] enum to llama.cpp's `ggml_type` value.
+int ggmlTypeValueFor(KvCacheType type) {
+  return switch (type) {
+    KvCacheType.f16 => 1,
+    KvCacheType.q4_0 => 2,
+    KvCacheType.q8_0 => 8,
+  };
+}
+
+/// Maps llamadart's [FlashAttention] enum to llama.cpp's flash-attention
+/// option value.
+int llamaFlashAttentionTypeValueFor(FlashAttention type) {
+  return switch (type) {
+    FlashAttention.auto => -1,
+    FlashAttention.disabled => 0,
+    FlashAttention.enabled => 1,
+  };
+}
+
+/// Resolves the user-requested [FlashAttention] given the requested KV cache
+/// types. llama.cpp refuses non-F16 KV without flash attention, so `auto` is
+/// auto-promoted to `enabled` when either KV type isn't F16. Explicit `enabled`
+/// or `disabled` values pass through unchanged.
+FlashAttention resolveFlashAttention({
+  required FlashAttention requested,
+  required KvCacheType cacheTypeK,
+  required KvCacheType cacheTypeV,
+}) {
+  final wantsKvQuantization =
+      cacheTypeK != KvCacheType.f16 || cacheTypeV != KvCacheType.f16;
+  if (requested == FlashAttention.auto && wantsKvQuantization) {
+    return FlashAttention.enabled;
+  }
+  return requested;
+}
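
Review note: because the bridge side receives plain integers, it can help to trace a few inputs through the promotion rule and the two value tables. The JavaScript sketch below mirrors the Dart file above for that purpose. The string keys (`'auto'`, `'q8_0'`, and so on) and the `flashAttentionValue` helper are assumptions made for this illustration; the numeric values are copied from the hunk.

```javascript
// Numeric values copied from the Dart switch expressions above.
const GGML_TYPE = { f16: 1, q4_0: 2, q8_0: 8 };
const FLASH_ATTENTION = { auto: -1, disabled: 0, enabled: 1 };

// Mirror of resolveFlashAttention: `auto` is promoted to `enabled` when
// either KV cache type is not f16; explicit choices pass through unchanged.
function resolveFlashAttention({ requested, cacheTypeK, cacheTypeV }) {
  const wantsKvQuantization = cacheTypeK !== 'f16' || cacheTypeV !== 'f16';
  if (requested === 'auto' && wantsKvQuantization) {
    return 'enabled';
  }
  return requested;
}

// Hypothetical helper: the integer that would be forwarded as `flashAttention`.
function flashAttentionValue(params) {
  return FLASH_ATTENTION[resolveFlashAttention(params)];
}

const plain = flashAttentionValue({ requested: 'auto', cacheTypeK: 'f16', cacheTypeV: 'f16' });
const quantized = flashAttentionValue({ requested: 'auto', cacheTypeK: 'q8_0', cacheTypeV: 'f16' });
const explicitOff = flashAttentionValue({ requested: 'disabled', cacheTypeK: 'f16', cacheTypeV: 'f16' });

console.log({ plain, quantized, explicitOff, cacheTypeK: GGML_TYPE.q8_0 });
```

Note that the combination of a quantized KV cache with `disabled` is not handled here; in the Dart code that case is expected to be rejected earlier (the WebGPU path now calls `params.validate()` before loading).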

scripts/fetch_webgpu_bridge_assets.sh

Lines changed: 3 additions & 3 deletions
@@ -4,7 +4,7 @@ set -euo pipefail
 ROOT_DIR="$(git rev-parse --show-toplevel)"
 OUT_DIR="${WEBGPU_BRIDGE_OUT_DIR:-$ROOT_DIR/example/chat_app/web/webgpu_bridge}"
 ASSETS_REPO="${WEBGPU_BRIDGE_ASSETS_REPO:-leehack/llama-web-bridge-assets}"
-ASSETS_TAG="${WEBGPU_BRIDGE_ASSETS_TAG:-v0.1.10}"
+ASSETS_TAG="${WEBGPU_BRIDGE_ASSETS_TAG:-v0.1.13}"
 CDN_BASE="${WEBGPU_BRIDGE_CDN_BASE:-https://cdn.jsdelivr.net/gh/${ASSETS_REPO}@${ASSETS_TAG}}"
 PATCH_SAFARI_COMPAT="${WEBGPU_BRIDGE_PATCH_SAFARI_COMPAT:-1}"
 MIN_SAFARI_VERSION="${WEBGPU_BRIDGE_MIN_SAFARI_VERSION:-170400}"
@@ -14,7 +14,7 @@ if [[ "${1:-}" == "--help" || "${1:-}" == "-h" ]]; then
 Downloads prebuilt WebGPU bridge assets into the chat_app web directory.
 
 Default source:
-  https://cdn.jsdelivr.net/gh/leehack/llama-web-bridge-assets@v0.1.10
+  https://cdn.jsdelivr.net/gh/leehack/llama-web-bridge-assets@v0.1.13
 
 Environment variables:
   WEBGPU_BRIDGE_ASSETS_REPO  Asset repo in owner/repo format
@@ -28,7 +28,7 @@ Usage:
   ./scripts/fetch_webgpu_bridge_assets.sh
 
 Examples:
-  WEBGPU_BRIDGE_ASSETS_TAG=v0.1.10 ./scripts/fetch_webgpu_bridge_assets.sh
+  WEBGPU_BRIDGE_ASSETS_TAG=v0.1.13 ./scripts/fetch_webgpu_bridge_assets.sh
   WEBGPU_BRIDGE_ASSETS_REPO=acme/llama-web-bridge-assets WEBGPU_BRIDGE_ASSETS_TAG=v2 ./scripts/fetch_webgpu_bridge_assets.sh
 USAGE
   exit 0
