PFlash Prompt Compression

§-FLAG (PFL-1): Non-Qwen scorer paths COMPILE but are UNVALIDATED. Qwen3.x is the only validated scorer architecture. The Gemma3, Llama, Qwen2, and Mistral code paths were generalized in 500046b0b but have not been through a live end-to-end PPL or acceptance-rate gate. Do not deploy non-Qwen PFlash scorers in production; treat them as experimental pending sub-2 validation.

At a glance

| Property | Value |
|---|---|
| What it is | Prompt compression via a small scorer model; reduces the token budget fed to the target model's prefill |
| Upstream | buun, branch `experiment/SD-089-pflash` (see Provenance) |
| Flag | `--pflash-scorer <scorer-model-dir>` (minimum required flag) |
| Validated arch | Qwen3.x scorer (Qwen3.5-0.8B recommended) |
| Key tunable | `--pflash-keep-ratio 0.05` (default: keep 5% of tokens above `--pflash-min-tokens`) |
| Backend | CPU (scorer); CUDA/HIP GPU via `pflash-graph.cpp` (`abe0bb81a`); Vulkan builds fall back to CPU |
| CLI commits | `92c37266f` (CLI wire), `076f8c069` (server path) |
| Scorer generalization | `500046b0b` (non-Qwen arch support; §-FLAG unvalidated) |

Provenance

PFlash is ported from the buun fork (spiritbuun/buun-llama-cpp, remote buun), branch experiment/SD-089-pflash — verified 2026-06-22 (common/pflash*.cpp/.h present at synced ref 2aeee7d3f). The scorer is a standard small model directory (e.g. Qwen3.5-0.8B), supplied at runtime via --pflash-scorer; there is no separate converter or shipped weight artifact.

This fork's additions on top of buun's CPU baseline: the HIP/CUDA GPU scorer (pflash-graph.cpp, ~24× speedup), the Vulkan iGPU fallback, the LRU scorer cache, on-disk persistence, and the non-Qwen scorer generalization. The last of these is still §-FLAG unvalidated (see PFL-1 above). See the canonical PROVENANCE.md for details.

How PFlash works

PFlash runs a small scorer model over the prompt tokens before the target model's prefill. The scorer assigns an importance weight to each token, and a top-K selection retains the highest-weighted tokens up to keep_ratio × prompt_length (floored at --pflash-min-tokens). The compressed token list replaces the original prompt for the target model's prefill. The scorer is a standard causal LM loaded via the fork's loader; only its logit scores (not sampled tokens) are used.
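The selection step described above can be sketched in a few lines of Python. This is an illustration only, not the fork's implementation: the function name `select_tokens` is hypothetical, and the sketch assumes one reading of the text, namely that prompts at or below the minimum length pass through unchanged, that the budget is the larger of `min_tokens` and `keep_ratio × prompt_length`, and that retained tokens keep their original order.

```python
import math

def select_tokens(scores, keep_ratio=0.05, min_tokens=8192):
    """Return the sorted indices of tokens to keep (hypothetical helper).

    Assumptions, not confirmed by the source:
      - prompts with at most min_tokens tokens are left untouched;
      - the budget is max(min_tokens, ceil(keep_ratio * n)), capped at n;
      - kept tokens are returned in their original prompt order.
    """
    n = len(scores)
    if n <= min_tokens:
        return list(range(n))  # no-op on short prompts
    budget = min(n, max(min_tokens, math.ceil(keep_ratio * n)))
    # Rank token positions by importance, highest first.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    # Keep the top `budget` positions, restored to prompt order.
    return sorted(ranked[:budget])

# Toy example with small numbers so the budget is easy to follow.
kept = select_tokens([0.1, 0.9, 0.3, 0.8, 0.2, 0.7], keep_ratio=0.5, min_tokens=2)
print(kept)
```

Under these assumptions and the default flags, the ratio only exceeds the floor once the prompt is longer than 8192 / 0.05 = 163,840 tokens; below that the budget equals the floor.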

The scorer is unrelated to speculative decode — PFlash runs during prefill, before any draft model is involved.

CLI flags

| Flag | Default | Description |
|---|---|---|
| `--pflash-scorer PATH` | (none) | Path to the scorer model (GGUF or directory). Required to enable PFlash. |
| `--pflash-keep-ratio F` | `0.05` | Fraction of tokens to retain above the `--pflash-min-tokens` floor |
| `--pflash-alpha F` | `0.12` | Scorer weighting exponent; controls sharpness of importance distribution |
| `--pflash-min-tokens N` | `8192` | PFlash only activates if the prompt exceeds this length (no-op on short prompts) |
| `--pflash-scorer-cache-size N` | `64` | LRU cache entries for scorer state (0 = disabled) |
| `--pflash-cache-dir PATH` | (none) | On-disk scorer cache directory; cache is auto-invalidated when scorer path or mtime changes |

Minimal invocation (CLI):

llama-speculative-simple \
    -m target.gguf \
    --pflash-scorer scorer-qwen3.5-0.8b-q8_0.gguf \
    --pflash-keep-ratio 0.05 \
    --pflash-min-tokens 8192 \
    -fa on -ngl 999 --no-mmap \
    -p "$(cat long-prompt.txt)"

Minimal invocation (server):

llama-server \
    -m target.gguf \
    --pflash-scorer scorer-qwen3.5-0.8b-q8_0.gguf \
    --pflash-keep-ratio 0.05 \
    -fa on -ngl 999 --no-mmap

PFlash is active only when --pflash-scorer is set and the prompt exceeds --pflash-min-tokens. Below that threshold PFlash is a no-op and no scorer inference runs.
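The flag table notes that the on-disk cache under `--pflash-cache-dir` is invalidated when the scorer path or mtime changes. One way such a rule could work is to fold both values into the cache key, so that a changed scorer simply misses the old entries. The sketch below is a guess at the idea, not the fork's code: `scorer_cache_key` and the hashing scheme are hypothetical.

```python
import hashlib
import os

def scorer_cache_key(scorer_path):
    """Hypothetical cache key that changes with the scorer path or its mtime."""
    resolved = os.path.realpath(scorer_path)
    mtime_ns = os.stat(resolved).st_mtime_ns
    # Hash the (path, mtime) pair; any change to either yields a new key.
    payload = repr((resolved, mtime_ns)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

With a key like this, replacing or re-quantizing the scorer file needs no explicit cleanup step: entries written under the old key are never looked up again.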

Phase status

| Phase | Commits | What it added | Status |
|---|---|---|---|
| Phase 1: base port | buun `master` | Scorer loader (`pflash-loader.cpp`), graph builder (`pflash-graph.cpp`), CPU scorer path | ✅ Ported |
| Phase 2A: CPU baseline | - | Scorer model weight storage on CPU; 9.89s → ~0.41s per scorer pass | ✅ Shipped |
| Phase 3: HIP GPU scorer | `abe0bb81a` | GPU scorer compute via `ggml_backend_dev_by_type(GPU)` in `pflash-loader.cpp` + `pflash-graph.cpp`; CPU fallback retained for Vulkan-only builds | ✅ Shipped 2026-05-19 |
| CLI wire | `92c37266f` | `--pflash-*` flags wired in `common/arg.cpp`; compression logged as `pflash: N -> M tokens (X% kept)` | ✅ Shipped 2026-05-31 |
| Server path | `076f8c069` | PFlash wired into `llama-server` prefill path | ✅ Shipped |
| Scorer generalization | `500046b0b` | Non-Qwen arch support (`llama_model_arch` switch covering qwen3/qwen35/qwen2/llama/mistral3/mistral4/gemma3/gemma4); NULL-deref guards | ✅ Shipped 2026-05-31; §-FLAG: unvalidated (see top of doc) |
| Phase 4b / 4c | - | Follow-up scorer-path refinements | 🔄 Deferred |
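The CLI wire row gives the compression log format as `pflash: N -> M tokens (X% kept)`. When checking that PFlash actually fired on a request, a small parser over captured stderr can help. The sketch below assumes that N and M are plain integers and X is a decimal number; the exact numeric formatting is not specified in this document, and `parse_pflash_log` is a hypothetical helper.

```python
import re

# Pattern for the documented format "pflash: N -> M tokens (X% kept)".
# Assumption: N and M are integers, X is an integer or decimal.
PFLASH_LOG = re.compile(
    r"pflash:\s*(\d+)\s*->\s*(\d+)\s+tokens\s+\((\d+(?:\.\d+)?)% kept\)"
)

def parse_pflash_log(line):
    """Return token counts and kept percentage, or None if the line does not match."""
    m = PFLASH_LOG.search(line)
    if m is None:
        return None
    return {
        "before": int(m.group(1)),
        "after": int(m.group(2)),
        "kept_pct": float(m.group(3)),
    }

# The sample line is made up for illustration, not captured from a real run.
print(parse_pflash_log("pflash: 20000 -> 1000 tokens (5.0% kept)"))
```

A `None` result on a long prompt suggests that the scorer was not configured or that the prompt stayed under `--pflash-min-tokens`.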

§-FLAG detail (PFL-1)

The scorer generalization in 500046b0b adds dispatch paths for non-Qwen architectures by extending the llama_model_arch switch in common/pflash-score.cpp (and common/pflash.cpp). The Qwen3.x regression test is byte-identical after the change. However, no live scorer gate (PPL measurement or compression quality check) has been run for Gemma3, Llama, Qwen2, or Mistral scorer models.

Known risks:

- G4/G5 architecture field lookups have NULL-deref guards added but are not live-tested.
- Token-importance calibration constants are tuned on Qwen3.5; non-Qwen models may require different --pflash-alpha values.

Tracking: PFlash non-Qwen live-scorer validation.

Backend notes

- CUDA/HIP: Full GPU scorer path (abe0bb81a). Recommended for performance.
- Vulkan-only builds: CPU scorer fallback. Scorer throughput is lower; --pflash-scorer-cache-size helps amortize cost across requests.
- ROCm: Tested on gfx1150 (Strix Point); no known issues.
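To see why `--pflash-scorer-cache-size` amortizes scorer cost on slower backends, it helps to picture the cache as a bounded least-recently-used map: a repeated prompt hits the cache and skips the scorer pass, and the oldest entry is evicted when the cache is full. The sketch below is a generic LRU in Python, not the fork's data structure. The class name `ScorerLRU` is hypothetical, and what exactly is keyed and stored ("scorer state") is an assumption.

```python
from collections import OrderedDict

class ScorerLRU:
    """Generic LRU cache sketch; capacity 0 disables caching, as the flag does."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        if self.capacity <= 0:
            return  # caching disabled
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
```

On a Vulkan-only build, where every miss costs a CPU scorer pass, workloads that resend the same long prompt benefit most from a cache like this.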
