§-FLAG (PFL-1): Non-Qwen scorer paths COMPILE but are UNVALIDATED. Qwen3.x is the only validated scorer architecture. Gemma3, Llama, Qwen2, and Mistral code paths were generalized in
500046b0b but have not had a live end-to-end PPL or acceptance-rate gate run. Do not deploy non-Qwen PFlash scorers in production; treat them as experimental pending sub-2 validation.
| | Value |
|---|---|
| What it is | Prompt compression via a small scorer model; reduces the token budget fed to the target model's prefill |
| Upstream | buun, branch experiment/SD-089-pflash (see Provenance) |
| Flag | --pflash-scorer <scorer-model-dir> (minimum required flag) |
| Validated arch | Qwen3.x scorer (Qwen3.5-0.8B recommended) |
| Key tunable | --pflash-keep-ratio 0.05 (default: keep 5% of tokens above --pflash-min-tokens) |
| Backend | CPU (scorer); CUDA/HIP GPU via pflash-graph.cpp (abe0bb81a) — Vulkan builds fall back to CPU |
| CLI commits | 92c37266f (CLI wire), 076f8c069 (server path) |
| Scorer generalization | 500046b0b (non-Qwen arch support — §-FLAG unvalidated) |
PFlash is ported from the buun fork
(spiritbuun/buun-llama-cpp,
remote buun), branch experiment/SD-089-pflash — verified 2026-06-22
(common/pflash*.cpp/.h present at synced ref 2aeee7d3f). The scorer is a
standard small model directory (e.g. Qwen3.5-0.8B), supplied at runtime via
--pflash-scorer; there is no separate converter or shipped weight artifact.
On top of buun's CPU baseline, this fork adds the HIP/CUDA GPU scorer
(pflash-graph.cpp, ~24× speedup), the Vulkan iGPU fallback, the LRU scorer
cache, on-disk persistence, and the non-Qwen scorer generalization (the last
of these is still §-FLAG unvalidated; see PFL-1 above). See the canonical
PROVENANCE.md.
PFlash runs a small scorer model over the prompt tokens before the main decode.
The scorer assigns an importance weight to each token; a top-K selection retains
the most important tokens up to keep_ratio × prompt_length (floored at
--pflash-min-tokens). The compressed token list replaces the original prompt
for the target model's prefill. The scorer is a standard causal LM loaded via
the fork's loader; only its logit scores (not sampled tokens) are used.
The scorer is unrelated to speculative decode — PFlash runs during prefill, before any draft model is involved.
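
The selection step described above can be sketched in a few lines. This is an illustrative sketch only, not the code in common/pflash*.cpp: the function names `pflash_budget` and `select_tokens` are hypothetical, and the budget rule (`keep_ratio × prompt_length`, floored at `min_tokens`, with short prompts passed through untouched) is one reading of the description above, stated here as an assumption.

```python
import math


def pflash_budget(n_prompt: int, keep_ratio: float = 0.05, min_tokens: int = 8192) -> int:
    """Number of tokens to keep (ASSUMED rule, not taken from the fork's source).

    Prompts at or below min_tokens are left alone; longer prompts keep
    keep_ratio * n_prompt tokens, but never fewer than min_tokens.
    """
    if n_prompt <= min_tokens:
        return n_prompt  # short prompt: compression is a no-op
    return min(n_prompt, max(min_tokens, math.floor(keep_ratio * n_prompt)))


def select_tokens(tokens, scores, keep_ratio=0.05, min_tokens=8192):
    """Keep the highest-scoring tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    budget = pflash_budget(len(tokens), keep_ratio, min_tokens)
    if budget >= len(tokens):
        return list(tokens)
    # Rank positions by descending score; ties go to the earlier position.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    kept = sorted(ranked[:budget])  # restore prompt order
    return [tokens[i] for i in kept]
```

Under this assumed rule and the default flags, a 200,000-token prompt would be cut to 10,000 tokens, while a 100,000-token prompt would be held at the 8,192-token floor rather than dropping to 5,000.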
| Flag | Default | Description |
|---|---|---|
| --pflash-scorer PATH | (none) | Path to the scorer model (GGUF or directory). Required to enable PFlash. |
| --pflash-keep-ratio F | 0.05 | Fraction of tokens to retain above the --pflash-min-tokens floor |
| --pflash-alpha F | 0.12 | Scorer weighting exponent; controls sharpness of importance distribution |
| --pflash-min-tokens N | 8192 | PFlash only activates if the prompt exceeds this length (no-op on short prompts) |
| --pflash-scorer-cache-size N | 64 | LRU cache entries for scorer state (0 = disabled) |
| --pflash-cache-dir PATH | (none) | On-disk scorer cache directory; cache is auto-invalidated when scorer path or mtime changes |
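
To make the two cache flags concrete, here is a hedged sketch of how an LRU scorer cache with path/mtime invalidation could behave. Everything in it is hypothetical: the names `scorer_cache_key` and `ScorerLRU`, the choice of SHA-256, and the idea of keying entries on the prompt tokens are assumptions for illustration, not a description of the fork's actual cache layout.

```python
import hashlib
import os
from collections import OrderedDict


def scorer_cache_key(scorer_path: str, mtime_ns: int, token_ids) -> str:
    """Hypothetical cache key: changes whenever the scorer path, its mtime,
    or the prompt tokens change, so stale entries are simply never hit."""
    h = hashlib.sha256()
    h.update(os.path.abspath(scorer_path).encode("utf-8"))
    h.update(b"\0")
    h.update(str(mtime_ns).encode("ascii"))
    h.update(b"\0")
    h.update(",".join(str(t) for t in token_ids).encode("ascii"))
    return h.hexdigest()


class ScorerLRU:
    """Tiny LRU map; capacity 0 disables caching (mirrors the flag's 0 = disabled)."""

    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        if self.capacity <= 0:
            return  # caching disabled
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
```

In a real caller the `mtime_ns` argument would typically come from `os.stat(scorer_path).st_mtime_ns`; it is passed in explicitly here so the key function stays pure and easy to test.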
Minimal invocation (CLI):

```bash
llama-speculative-simple \
  -m target.gguf \
  --pflash-scorer scorer-qwen3.5-0.8b-q8_0.gguf \
  --pflash-keep-ratio 0.05 \
  --pflash-min-tokens 8192 \
  -fa on -ngl 999 --no-mmap \
  -p "$(cat long-prompt.txt)"
```

Minimal invocation (server):

```bash
llama-server \
  -m target.gguf \
  --pflash-scorer scorer-qwen3.5-0.8b-q8_0.gguf \
  --pflash-keep-ratio 0.05 \
  -fa on -ngl 999 --no-mmap
```

PFlash is active only when --pflash-scorer is set and the prompt exceeds
--pflash-min-tokens. Below that threshold the path is a no-op; no scorer
inference is done.
| Phase | Commits | What it added | Status |
|---|---|---|---|
| Phase 1: base port | buun master | Scorer loader (pflash-loader.cpp), graph builder (pflash-graph.cpp), CPU scorer path | ✅ Ported |
| Phase 2A: CPU baseline | (none listed) | Scorer model weight storage on CPU; 9.89s → ~0.41s per scorer pass | ✅ Shipped |
| Phase 3: HIP GPU scorer | abe0bb81a | GPU scorer compute via ggml_backend_dev_by_type(GPU) in pflash-loader.cpp + pflash-graph.cpp; CPU fallback retained for Vulkan-only builds | ✅ Shipped 2026-05-19 |
| CLI wire | 92c37266f | --pflash-* flags wired in common/arg.cpp; compression logged as pflash: N -> M tokens (X% kept) | ✅ Shipped 2026-05-31 |
| Server path | 076f8c069 | PFlash wired into llama-server prefill path | ✅ Shipped |
| Scorer generalization | 500046b0b | Non-Qwen arch support (llama_model_arch switch covering qwen3/qwen35/qwen2/llama/mistral3/mistral4/gemma3/gemma4); NULL-deref guards | ✅ Shipped 2026-05-31; §-FLAG: unvalidated (see top of doc) |
| Phase 4b / 4c | (none listed) | Follow-up scorer-path refinements | 🔄 Deferred |
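
The compression log line noted in the CLI wire row (pflash: N -> M tokens (X% kept)) is easy to scrape when checking that PFlash actually fired. The sketch below is a hypothetical helper, not part of the fork; the exact number formatting of the real log line (spacing, decimal places) is an assumption, so the pattern is deliberately lenient about whitespace and accepts integer or decimal percentages.

```python
import re

# Lenient pattern for "pflash: N -> M tokens (X% kept)"; exact formatting is assumed.
_PFLASH_LOG = re.compile(
    r"pflash:\s*(\d+)\s*->\s*(\d+)\s*tokens\s*\(\s*(\d+(?:\.\d+)?)\s*%\s*kept\s*\)"
)


def parse_pflash_log(line: str):
    """Return (tokens_in, tokens_out, percent_kept), or None if the line does not match."""
    m = _PFLASH_LOG.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), float(m.group(3))
```

A line that does not match (for example, any log line from a run where the prompt stayed below --pflash-min-tokens and no compression was logged) yields `None`, which is a cheap signal that the no-op path was taken.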
The scorer generalization in 500046b0b adds dispatch paths for
non-Qwen architectures by extending the llama_model_arch switch in
common/pflash-score.cpp (and common/pflash.cpp). The Qwen3.x regression test
output remains byte-identical after the change. However, no live scorer gate
(PPL measurement or compression quality check) has been run for Gemma3, Llama,
Qwen2, or Mistral scorer models.
Known risks:
- G4/G5 architecture field lookups have NULL-deref guards added but are not live-tested.
- Token-importance calibration constants are tuned on Qwen3.5; non-Qwen models
  may require different --pflash-alpha values.
Tracking: PFlash non-Qwen live-scorer validation.
- CUDA/HIP: Full GPU scorer path (abe0bb81a). Recommended for performance.
- Vulkan-only builds: CPU scorer fallback. Scorer throughput is lower; --pflash-scorer-cache-size helps amortize cost across requests.
- ROCm: Tested on gfx1150 (Strix Point); no known issues.
- Upstream source: buun master (spiritbuun/buun-llama-cpp)
- Scorer sources: common/pflash-loader.cpp, common/pflash-graph.cpp, common/pflash-score.cpp, common/pflash.cpp
- Feature index: docs/features/README.md
- Related docs (this repo):
  - DFlash speculative decode: buun-sourced spec-decode companion
  - InnerQ KV cache: complementary KV compression