This is the practical BeeLlama tuning reference. It covers the arguments that change model loading, GPU placement, context behavior, prompt caching, KV-cache precision, DFlash/speculative decoding, multimodal behavior, sampling, reasoning, server endpoints, and logs. It is not a byte-for-byte replacement for --help; it is the single page to read before tuning a real BeeLlama server.
For the feature overview and public-repo comparison, read beellama-features.md.
BeeLlama tuning is easiest if you choose settings in this order:
- Load the target model and optional multimodal projector.
- Decide GPU/offload placement for the target model and draft model.
- Set context length, batch sizes, KV precision, and prompt-cache behavior.
- Enable DFlash or another speculative backend.
- Tune sampling, chat/reasoning, server endpoints, and logs.
The DFlash controls have several similar names. These are the common confusion points:
| Setting | It controls | It does not control |
|---|---|---|
--ctx-size |
Target model context length | DFlash cross-attention window |
--spec-draft-ctx-size |
Draft model context allocation | Target context length |
--spec-dflash-cross-ctx |
Recent target hidden-state tokens visible to DFlash | Server KV context length |
--spec-draft-n-max |
Main-path draft-token ceiling | Extra tree branches |
--spec-draft-n-min |
Minimum draft size required before using a draft | Minimum response length |
--spec-branch-budget |
Extra DDTree branch nodes beyond the main path | Main-path draft depth |
--spec-draft-top-k |
Candidates per position for tree mode | Sampling --top-k |
--spec-draft-temp |
Drafter sampling temperature | Target sampler temperature unless auto is used |
--spec-dm-* |
Runtime controller that can lower/raise active draft depth | Static parser default for --spec-draft-n-max |
This example mirrors the important shape of a typical DFlash launch script while using placeholder paths:
llama-server \
-m "path/to/target.gguf" \
--mmproj "path/to/mmproj.gguf" \
--no-mmproj-offload \
--spec-draft-model "path/to/drafter.gguf" \
--spec-type dflash \
--spec-dflash-cross-ctx 1024 \
--port 8082 \
-np 1 \
--kv-unified \
-ngl all \
--spec-draft-ngl all \
-b 2048 -ub 512 \
--ctx-size 102400 \
--cache-type-k q5_0 --cache-type-v q4_0 \
--flash-attn on \
--cache-ram 0 \
--jinja \
--no-mmap --mlock \
--no-host --metrics \
--log-timestamps --log-prefix --log-colors off \
--reasoning on \
--chat-template-kwargs '{"preserve_thinking":true}' \
--temp 0.6 --top-k 20 --min-p 0.0What this shape means:
| Block | Purpose |
|---|---|
-m, --mmproj, --no-mmproj-offload |
Load the target model and projector, while keeping the projector off GPU. |
--spec-type dflash, --spec-draft-model, --spec-draft-n-max, --spec-dflash-cross-ctx |
Enable flat DFlash drafting and choose draft depth/window. |
-np 1, --kv-unified, --ctx-size |
Run one server slot with explicit unified KV and a large target context. |
-ngl all, --spec-draft-ngl all |
Fully offload target and draft models when devices can hold them. |
-b, -ub |
Set target prompt prefill batching. Bee keeps upstream defaults unless you override them. |
--cache-type-k, --cache-type-v |
Use asymmetric KV precision. |
--cache-ram 0 |
Disable the server prompt-cache RAM subsystem. Live-slot prefix reuse still works. |
--no-mmap, --mlock, --no-host |
Prefer locked model memory and direct backend buffer behavior over filesystem cache behavior. |
--metrics, --log-* |
Expose metrics and make logs easier to capture. |
--reasoning on, --chat-template-kwargs, --temp, --top-k, --min-p |
Control thinking/template behavior and target sampling. |
| Arg | Code default | When to use |
|---|---|---|
-m, --model FNAME |
Required for normal local runs | Target model GGUF path. |
-mu, --model-url URL |
Unset | Download/load a target model from URL. |
-hf, -hfr, --hf-repo <user>/<model>[:quant] |
Unset | Resolve target model from Hugging Face. |
-hff, --hf-file FILE |
Unset | Override the file chosen from --hf-repo. |
--models-dir PATH |
Unset | Server router/model directory mode. |
--models-preset PATH |
Unset | Server model preset file. |
--models-max N |
4 |
Maximum loaded models in router mode. |
--models-autoload, --no-models-autoload |
Enabled | Whether router models autoload. |
Draft model loading uses parallel draft-specific names:
| Arg | Code default | When to use |
|---|---|---|
--spec-draft-model, -md, --model-draft FNAME |
Unset | Draft model GGUF path. Required for DFlash unless loaded through draft HF args. |
--spec-draft-hf, -hfd, -hfrd, --hf-repo-draft <user>/<model>[:quant] |
Unset | Resolve draft model from Hugging Face. |
Bee auto-detects DFlash when a loaded draft model reports a DFlash block size and --spec-type was not already dflash.
| Arg | Code default | When to use |
|---|---|---|
-ngl, --gpu-layers, `--n-gpu-layers N |
auto | all` |
--spec-draft-ngl, -ngld, --gpu-layers-draft, `--n-gpu-layers-draft N |
auto | all` |
-dev, --device dev1,dev2,... |
All usable devices | Restrict target model devices. |
--spec-draft-device, -devd, --device-draft dev1,dev2,... |
Inherited/available devices | Restrict draft model devices. For DFlash, a single explicit GPU also pins the target output tensor to that GPU before target load, because the drafter shares the target output tensor. |
-sm, `--split-mode none |
layer | row |
-mg, --main-gpu INDEX |
0 |
Main GPU for scratch/small tensors. |
-ts, --tensor-split N0,N1,... |
Unset | Manual tensor split across devices. |
-fit, `--fit on |
off` | Enabled |
-fitp, `--fit-print on |
off` | Disabled |
-fitc, --fit-ctx N |
4096 |
Minimum context size that --fit may choose. |
-fitt, --fit-target MiB0,MiB1,... |
1024 MiB margin per device |
Free-memory margin per device for --fit. |
-nkvo, --no-kv-offload |
Off | Keep KV cache off GPU. Usually hurts GPU-serving speed, but can be needed under VRAM pressure. |
--no-host |
Off | Bypass host buffer so extra backend buffers can be used. Used in the launch script. |
--no-fused-gdn |
Off | Disable fused Gated Delta Net kernels. Use only to isolate backend/kernel issues. |
CPU/thread controls:
| Arg | Code default | When to use |
|---|---|---|
-t, --threads N |
-1 auto |
CPU threads for generation. The launch script uses -t 8. |
-tb, --threads-batch N |
Same as --threads |
CPU threads for prompt/batch processing. |
--spec-draft-threads, -td, --threads-draft N |
Same as target threads | Draft model CPU generation threads. |
--spec-draft-threads-batch, -tbd, --threads-batch-draft N |
Same as draft threads | Draft model prompt/batch threads. |
--cpu-mask, --cpu-range, --cpu-strict |
No strict placement | Pin target CPU work. |
| Draft CPU affinity variants | Inherit target CPU settings | Use --spec-draft-cpu-mask, --spec-draft-cpu-range, --spec-draft-cpu-strict, and batch variants when the drafter needs separate CPU placement. |
--prio, --prio-batch, draft priority variants |
Normal | Change scheduler priority. Avoid realtime unless you know the host can tolerate it. |
--poll, --poll-batch, draft poll variants |
50 for base poll |
Busy-wait level. Higher can reduce latency at CPU cost. |
Advanced draft placement:
| Arg | Use |
|---|---|
--spec-draft-override-tensor, -otd, --override-tensor-draft |
Override draft tensor buffer placement by tensor-name pattern. |
--spec-draft-cpu-moe, -cmoed, --cpu-moe-draft |
Keep draft MoE tensors on CPU. |
--spec-draft-n-cpu-moe, --spec-draft-ncmoe, -ncmoed, --n-cpu-moe-draft N |
Keep only some draft MoE layers on CPU. |
| Arg | Code default | When to use |
|---|---|---|
-c, --ctx-size N |
0 means model-trained context |
Target model context allocation. Large values need KV memory. |
-n, --predict, --n-predict N |
-1 no limit |
Max new tokens for CLI/server defaults where applicable. |
-b, --batch-size N |
2048 |
Logical prompt batch size. Larger improves prefill until memory/backend limits. |
-ub, --ubatch-size N |
512 |
Physical microbatch size. Lower when VRAM spikes or DFlash graph memory grows. |
--keep N |
0 |
Tokens to keep from the initial prompt when context is managed. |
-np, --parallel N |
1 |
Server slots. In server help, -1 means auto. More slots need more KV and DFlash state. |
-cb, --cont-batching; -nocb, --no-cont-batching |
Enabled | Continuous batching for server decode. |
--context-shift, --no-context-shift |
Disabled | Infinite generation via KV shifting when supported. Disabled under multimodal. |
--swa-full |
Disabled | Full-size SWA cache for models where it is supported. |
-ctxcp, --ctx-checkpoints, --swa-checkpoints N |
32 |
Max prompt context checkpoints per slot. Important for long prompt reuse and contexts where speculative decoding needs checkpoints. |
-cms, --checkpoint-min-step N |
256 |
Minimum spacing between prompt context checkpoints. 0 disables the spacing limit. |
-cram, --cache-ram N |
8192 MiB |
Server prompt-cache RAM limit. -1 no limit, 0 disables the prompt cache subsystem. |
--cache-prompt, --no-cache-prompt |
Enabled | Whether request prompts can use prompt caching. |
--cache-reuse N |
0 disabled |
Minimum chunk size to reuse from cache through KV shifting. Disabled under multimodal and unsupported contexts. |
-kvu, --kv-unified; -no-kvu, --no-kv-unified |
Raw default false; help notes enabled if slots are auto | Single unified KV buffer shared across sequences. Required by idle-slot cache. |
--cache-idle-slots, --no-cache-idle-slots |
Enabled in params, but requires unified KV and cache RAM | Save and clear idle slots when a new task starts. Disabled by the server if requirements are not met. |
DFlash-specific parser defaults:
| Situation | Effective behavior |
|---|---|
--spec-type dflash and no --spec-draft-ctx-size |
Bee sets draft context to 256. |
--spec-type dflash and no explicit -b |
Bee keeps the upstream target batch default, currently 2048. |
--spec-type dflash and no explicit -ub |
Bee keeps the upstream target microbatch default, currently 512. |
You pass explicit -b or -ub |
Bee keeps your values. |
DFlash no longer lowers the target batch defaults at parse time. If memory is tight, lower -b or -ub explicitly after measuring the model, context, KV type, and draft placement.
K and V cache precision are independent:
--cache-type-k turbo4 --cache-type-v turbo3_tcq
--cache-type-k q8_0 --cache-type-v turbo3_tcq
--cache-type-k f16 --cache-type-v f16Accepted KV cache type names in the current parser include:
| Family | Values |
|---|---|
| Floating | f32, f16, bf16 |
| Upstream quantized | q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1 |
| Bee/Turbo lineage | turbo2, turbo3, turbo4, turbo2_tcq, turbo3_tcq |
Bee fork cache storage:
| Type | Storage | Best first use |
|---|---|---|
turbo4 |
4.125 bpv |
Strong first compressed K-cache candidate. |
turbo3 |
3.125 bpv |
More compression than turbo4; uses QK_TURBO3 = 128 in Bee. |
turbo2 |
2.125 bpv |
Stronger compression, higher quality risk. |
turbo3_tcq |
3.25 bpv |
TCQ path, commonly useful for V in Bee experiments. |
turbo2_tcq |
2.25 bpv |
Most compressed TCQ path; verify carefully. |
For current long-context Qwen and Gemma DFlash serving, the common asymmetric choice is:
--cache-type-k q5_0 --cache-type-v q4_0Do not assume enum compatibility with TheTom's public TurboQuant fork. Bee uses the buun enum order for Turbo/TCQ cache types while keeping a 128-value turbo3 block. Bee keeps TCQ cache IDs at 45 and 46; Tom's TQ3_1S and TQ4_1S model weight formats use new GGML type IDs 47 and 48.
Tom's TQ3_1S and TQ4_1S are model weight formats, not KV-cache types. Use them through llama-quantize:
llama-quantize model.f16.gguf model.tq3_1s.gguf TQ3_1S
llama-quantize model.f16.gguf model.tq4_1s.gguf TQ4_1SThe CLI file-type IDs are 43 for TQ3_1S and 44 for TQ4_1S; the serialized GGML type IDs are 47 and 48. This avoids interpreting existing Bee/Buun TCQ cache tensors as Tom weight tensors. The current port includes CPU quantize/dequantize and CPU fallback matmul support; backend acceleration is not claimed unless that backend has been built and tested separately.
| Arg | Default | When to use |
|---|---|---|
--mmap, --no-mmap |
mmap enabled | --no-mmap avoids relying on filesystem cache behavior; the launch script uses it. |
--mlock |
Disabled | Ask the OS to keep model memory resident. Useful when paging would ruin latency. |
-dio, --direct-io |
Disabled | Read model data without normal buffered I/O. |
--no-mmap --mlock is a deliberate "keep the model resident" style. It can fail or be limited by OS permissions and available RAM.
| Arg | Default | When to use |
|---|---|---|
-fa, `--flash-attn on |
off | auto` |
--no-fused-gdn |
Off | Disable fused GDN kernels for debugging or backend compatibility. |
For Turbo/TCQ cache types, verify the backend path you use. A cache type being accepted by the parser does not prove every backend/kernel combination is fast or supported.
Select the speculative backend:
| Arg | Code default | Values |
|---|---|---|
--spec-type MODE |
none |
none, ngram-cache, ngram-simple, ngram-map-k, ngram-map-k4v, ngram-mod, suffix, copyspec, recycle, dflash |
DFlash core args:
| Arg | Code default | Use |
|---|---|---|
--spec-draft-model, -md, --model-draft |
Unset | Draft GGUF path. DFlash needs a DFlash-compatible draft model. |
--spec-draft-ctx-size, -cd, --ctx-size-draft |
Raw 0; DFlash effective 256 if omitted |
Draft context allocation. Not the same as target --ctx-size. |
--spec-draft-ngl, -ngld |
auto |
Draft model GPU layers. |
--spec-draft-type-k, -ctkd, --cache-type-k-draft |
f16 |
Draft model K cache precision. |
--spec-draft-type-v, -ctvd, --cache-type-v-draft |
f16 |
Draft model V cache precision. |
--spec-draft-n-max |
Raw upstream 3; DFlash effective 16 if omitted |
Base max main-path draft tokens. |
--spec-draft-n-min |
0 |
Minimum draft length required before speculative verification is used. |
--spec-branch-budget |
0 |
DDTree branch nodes beyond the main draft path. 0 means flat DFlash. |
--spec-draft-top-k |
1 |
Candidates per draft position for tree mode. Forced to 1 when branch budget is 0. |
--spec-draft-p-split, --draft-p-split |
0.10 |
Probability threshold for creating tree branches. |
--spec-draft-p-min, --draft-p-min |
0.0 |
Minimum draft probability gate. |
--spec-draft-temp T |
0.0 |
DFlash drafter temperature. 0 greedy, positive uses sampled/Gumbel path, auto mirrors target temp. |
--spec-dflash-cross-ctx N |
512 |
Recent target hidden-state tokens visible to the DFlash drafter. |
--spec-dflash-max-slots N |
Match -np |
Max server slots with DFlash state; higher slots fall back to non-speculative decoding. Use this to cap DFlash below -np. |
DFlash-only args such as --spec-branch-budget, --spec-draft-top-k, --spec-draft-temp, --spec-dflash-*, and --spec-dm-* do not change MTP or other non-DFlash speculative modes. If they are passed without --spec-type dflash, Bee preserves them while a supplied draft model can still auto-detect as DFlash; otherwise it emits a warning and ignores them.
Flat DFlash:
llama-server -m target.gguf \
--spec-type dflash \
--spec-draft-model draft-dflash.gguf \
--spec-draft-n-max 16 \
--spec-branch-budget 0 \
--spec-dflash-cross-ctx 512 \
-ngl all \
--spec-draft-ngl allTree DFlash:
llama-server -m target.gguf \
--spec-type dflash \
--spec-draft-model draft-dflash.gguf \
--spec-draft-n-max 16 \
--spec-branch-budget 4 \
--spec-draft-top-k 4 \
--spec-draft-p-split 0.10Sampled DFlash:
llama-server -m target.gguf \
--spec-type dflash \
--spec-draft-model draft-dflash.gguf \
--spec-draft-temp auto \
--temp 0.6Sampled DFlash activates the rejection-sampling path when both drafter and target temperature exceed zero. Draft log-probabilities must be available for rejection sampling to produce correct output. Without those conditions, Bee uses the normal greedy speculative path.
Multi-GPU DFlash:
In v0.3.0, DFlash GPU cross rings, graph hidden capture, recurrent tape buffers, conv replay, and direct GDN replay are attempted by default on split CUDA/ROCm target placement. The runtime allocates capture and tape buffers on each layer's backend device and synchronizes every backend touched by a DFlash capture or replay. Unsupported placement, host recurrent state, missing backend helpers, or pointer-device validation failure falls back to the CPU/eval-callback path instead of forcing one GPU path.
Use GGML_DFLASH_MULTI_GPU_TAPE=0 to disable this default-on multi-GPU path. GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=0 is accepted as a compatibility spelling from PR #32 review/testing notes. Both variables are kill switches only; unset means enabled.
Adaptive Draft-Max is enabled by default for DFlash. It can reduce the active draft depth below --spec-draft-n-max, turn speculation off after consecutive below-threshold cycles, and probe again periodically.
| Arg | Code default | Use |
|---|---|---|
--spec-dm-adaptive, --no-spec-dm-adaptive |
Enabled | Enable or disable adaptive depth. Disable for fixed-depth benchmarks. |
--spec-dm-controller MODE |
profit |
profit or fringe. |
--spec-dm-probe-interval N |
16 |
Minimum cycles to wait before trying a speculative cycle when speculation is off; profit backs off after failed wake probes. |
--spec-dm-probe-fraction F |
0.25 |
Fraction of base_n_max to use as the probe depth when speculation is off. |
--spec-dm-explore-interval N |
12 |
Draft at a higher depth every N cycles to collect timing data beyond the current n_max. |
--spec-dm-off-dwell N |
8 |
Consecutive cycles below the profit/fringe threshold before speculation is disabled. |
--spec-dm-fringe-min F |
0.30 |
Fringe rate below which n_max drops toward 0 (after off-dwell). |
--spec-dm-fringe-max F |
0.50 |
Fringe rate at or above which full base_n_max is used. |
--spec-dm-min-reach N |
3 |
Minimum samples at a new draft position before fringe can promote to it. |
--spec-dm-profit-min F |
0.05 |
Minimum relative throughput improvement over no-spec baseline before dwell clears. |
--spec-dm-profit-raise-margin F |
0.05 |
Relative margin a higher depth must exceed to replace the current depth. |
--spec-dm-profit-lower-margin F |
0.05 |
Relative margin a lower depth must exceed to replace the current depth. |
--spec-dm-profit-ewma-alpha F |
0.15 |
Smoothing factor for acceptance and timing running averages. |
--spec-dm-profit-min-samples N |
3 |
Minimum observations per position/depth before scoring that depth as ready. |
--spec-dm-profit-warmup N |
0 |
Minimum measured samples for each initial positive-depth profit probe after the no-spec baseline is seeded (0 = use --spec-dm-profit-min-samples). |
--spec-dm-profit-baseline-interval N |
1024 |
Active speculative cycles between no-spec baseline reprobes (0 = disabled). |
Use profit for normal serving. Use fringe when you want behavior tied more directly to observed draft acceptance near the active tail. Use --no-spec-dm-adaptive --spec-draft-n-max N when comparing fixed draft depths.
| Mode | Key args | Code defaults | When to use |
|---|---|---|---|
ngram-cache |
--lookup-cache-static, --lookup-cache-dynamic |
Paths unset | You have reusable lookup cache files. |
ngram-mod |
--spec-ngram-mod-n-min, --spec-ngram-mod-n-max, --spec-ngram-mod-n-match |
48, 64, 24 |
Repeated-context workloads with controllable draft length. |
ngram-simple |
--spec-ngram-simple-size-n, --spec-ngram-simple-size-m, --spec-ngram-simple-min-hits |
12, 48, 1 |
Simple no-draft repeated-context baseline. |
ngram-map-k |
--spec-ngram-map-k-size-n, --spec-ngram-map-k-size-m, --spec-ngram-map-k-min-hits |
12, 48, 1 |
Map-backed ngram proposal tuning. |
ngram-map-k4v |
--spec-ngram-map-k4v-size-n, --spec-ngram-map-k4v-size-m, --spec-ngram-map-k4v-min-hits |
12, 48, 1 |
Alternative map-backed ngram proposal tuning. |
copyspec |
No public CLI gamma knob in current parser | Gamma 6 in params |
Copy repeated prompt/context spans without a model drafter. |
recycle |
No public CLI k knob in current parser | k=8 successors |
Repeated token-successor patterns. |
suffix |
No public CLI suffix knobs in current parser | depth 64, factor 2.0, offset 0.0, min prob 0.1 |
Repeated suffix continuations. |
Presets:
| Arg | Behavior |
|---|---|
--spec-default |
Sets ngram-mod with match 24, min 48, max 64. |
Removed generic ngram args:
| Removed arg | Use instead |
|---|---|
--spec-ngram-size-n |
The backend-specific --spec-ngram-*-size-n or --spec-ngram-mod-n-match. |
--spec-ngram-size-m |
The backend-specific --spec-ngram-*-size-m. |
--spec-ngram-min-hits |
The backend-specific --spec-ngram-*-min-hits. |
Accepted aliases:
| Alias | Current target |
|---|---|
--draft-p-split |
--spec-draft-p-split |
--draft-p-min |
--spec-draft-p-min |
Removed in v0.3.0:
| Removed arg | Use instead |
|---|---|
--draft, --draft-n, --draft-max |
--spec-draft-n-max |
--draft-min, --draft-n-min |
--spec-draft-n-min |
--draft-topk |
--spec-draft-top-k |
--draft-model |
--spec-draft-model, -md, or upstream --model-draft |
--dflash-max-slots |
--spec-dflash-max-slots |
--tree-budget TOTAL |
--spec-branch-budget N branch nodes beyond the main draft path |
--spec-dflash-default |
--spec-type dflash --spec-draft-model ... with explicit DFlash args |
--spec-draft-replace, --spec-replace |
No replacement; the parsed replacement list was unused |
Names from older buun-era experiments that are not accepted by the current Bee parser include:
--draft-temp
--dflash-cross-ctx
--dm-adaptive
--dm-ar-up
--dm-ar-down
--dm-ar-window
--dm-probe-interval
--dm-probe-fraction
--spec-dm-ar-up
--spec-dm-ar-down
--spec-dm-ar-window
Use the canonical --spec-* names in new commands.
| Arg | Default | Use |
|---|---|---|
-mm, --mmproj FILE |
Unset | Multimodal projector path. |
-mmu, --mmproj-url URL |
Unset | Projector URL. |
--mmproj-auto, --no-mmproj, --no-mmproj-auto |
Auto enabled | Use projector file if available, especially with HF model loading. |
--mmproj-offload, --no-mmproj-offload |
Offload enabled | Put projector on GPU or keep it off GPU. The launch script uses --no-mmproj-offload. |
--image, --audio FILE |
Unset | CLI multimodal input files. |
Runtime compatibility rules when --mmproj is loaded:
- context shift is disabled
- cache reuse is disabled
- flat DFlash can remain active
--spec-branch-budgetis forced to0under DFlash- non-DFlash speculative types are set to none
This means DFlash plus multimodal is supported only as flat DFlash in the current server path.
Chat/template args:
| Arg | Default | Use |
|---|---|---|
--jinja, --no-jinja |
Enabled | Jinja chat template engine. |
--chat-template TEMPLATE |
Model metadata | Inline/custom template name or content depending on parser path. |
--chat-template-file FILE |
Unset | Load a custom chat template from file. |
--chat-template-kwargs JSON |
Unset | Template kwargs. The launch script uses {"preserve_thinking":true}. |
| `--reasoning on | off | auto, -rea` |
| `--reasoning-format none | deepseek | deepseek-legacy` |
--reasoning-budget N |
-1 unrestricted |
0 ends thinking immediately, positive values cap thinking tokens. |
--reasoning-budget-message MESSAGE |
None | Message injected when the reasoning budget is exhausted. |
Sampling args people commonly tune:
| Arg | Code default | Use |
|---|---|---|
--temp, --temperature |
0.80 |
Target sampler temperature. 0 means greedy. |
--top-k N |
40 |
Keep top K target tokens before later samplers. This is not DFlash --spec-draft-top-k. |
--top-p N |
0.95 |
Nucleus sampling; 1.0 disables. |
--min-p N |
0.05 |
Minimum probability relative to best token; 0.0 disables. |
--top-nsigma, --top-n-sigma N |
-1.0 |
Sigma-based filter; negative disables. |
--xtc-probability N |
0.0 |
XTC probability; 0.0 disables. |
--xtc-threshold N |
0.10 |
XTC threshold. |
--typical, --typical-p N |
1.0 |
Locally typical sampling; 1.0 disables. |
--repeat-last-n N |
64 |
Repetition penalty window; -1 means context size. |
--repeat-penalty N |
1.0 |
Repetition penalty; 1.0 disables. |
--presence-penalty N |
0.0 |
Presence penalty. |
--frequency-penalty N |
0.0 |
Frequency penalty. |
--dry-multiplier N |
0.0 |
DRY repetition penalty multiplier; 0.0 disables. |
--dry-base N |
1.75 |
DRY base. |
--dry-allowed-length N |
2 |
DRY allows this repetition length before penalty. |
--dry-penalty-last-n N |
-1 |
DRY scan window; -1 means context size. |
--adaptive-target N |
-1.0 |
Adaptive-p target probability; negative disables. |
--adaptive-decay N |
0.90 |
Adaptive-p decay. |
--dynatemp-range N |
0.0 |
Dynamic temperature range; 0.0 disables. |
--dynatemp-exp N |
1.0 |
Dynamic temperature exponent. |
--mirostat N |
0 |
0 disabled, 1 Mirostat, 2 Mirostat 2.0. |
--mirostat-lr N |
0.10 |
Mirostat learning rate. |
--mirostat-ent N |
5.00 |
Mirostat target entropy. |
If you use --spec-draft-temp auto, changing target --temp also changes DFlash drafter temperature. If --spec-draft-temp is a number, target --temp and drafter temperature are independent.
Bee-specific guard args:
| CLI arg | JSON field | Code default | Use |
|---|---|---|---|
--reasoning-loop-guard MODE |
reasoning_loop_guard |
force-close |
off, force-close, or stop. |
--reasoning-loop-min-tokens N |
reasoning_loop_min_tokens |
1024 |
Do not check before this many hidden reasoning tokens. |
--reasoning-loop-window N |
reasoning_loop_window |
2048 |
Tail window scanned for repeated loops. |
--reasoning-loop-max-period N |
reasoning_loop_max_period |
512 |
Largest periodic loop length checked. |
--reasoning-loop-min-coverage N |
reasoning_loop_min_coverage |
768 |
Minimum repeated coverage needed to trigger. |
--reasoning-loop-check-interval N |
reasoning_loop_check_interval |
32 |
Accepted-token interval between checks. |
--reasoning-loop-interventions N |
reasoning_loop_interventions |
1 |
Force-close interventions before stop behavior. |
Validation rules in the server require positive windows/periods/coverage/intervals, window >= min_coverage, max_period <= window / 3, and min_tokens >= min_coverage.
| Arg | Code default | Use |
|---|---|---|
--host HOST |
127.0.0.1 |
Server bind host. |
--port PORT |
8080 |
Server port. |
-to, --timeout N |
600 |
Read timeout. |
--threads-http N |
-1 |
HTTP worker threads. |
--api-key KEY |
Unset | Require API key. Do not put secrets in committed scripts. |
--api-key-file FNAME |
Unset | Read API key from file. Prefer this over command-line secrets. |
--ssl-key-file FNAME |
Unset | TLS key file. |
--ssl-cert-file FNAME |
Unset | TLS cert file. |
--metrics |
Disabled | Enable Prometheus-compatible metrics endpoint. |
--props |
Disabled | Enable changing global properties via POST /props. |
--slots, --no-slots |
Enabled | Expose or hide slot monitoring endpoint. |
--slot-save-path PATH |
Unset | Enable slot save/restore actions. |
--media-path PATH |
Unset | Path for served/uploaded media handling. |
--webui, --no-webui |
Enabled | Enable built-in Web UI. |
--webui-config JSON, --webui-config-file PATH |
Unset | Web UI settings. |
--webui-mcp-proxy, --no-webui-mcp-proxy |
Disabled | Experimental MCP CORS proxy; do not enable for untrusted environments. |
| Arg | Use |
|---|---|
--log-timestamps |
Include timestamps in logs. |
--log-prefix |
Include log prefixes. |
| `--log-colors on | off |
-lv, --verbosity N, --log-verbosity N |
Log threshold. 0 generic, 1 error, 2 warning, 3 info, 4 debug. Also available as LLAMA_LOG_VERBOSITY. |
-v, --verbose, --log-verbose |
Set verbosity to the maximum debug level. |
The launch script uses all three so captured logs are stable:
--log-timestamps --log-prefix --log-colors offRoutine per-ubatch decode ubatch timing and the non-profile spec cycle summary are debug-level logs. Use --verbosity 4 or --verbose when you want those lines.
DFlash diagnostic environment variables:
| Env var | Use |
|---|---|
GGML_DFLASH_PROFILE=default,prefill |
Enables summary, replay, copy, verify, and prefill diagnostics without trace logging. 1, on, true, and default mean summary,replay,copy,verify. |
GGML_DFLASH_PROFILE_SYNC_SPLIT=1 |
Forces a diagnostic scheduler sync after verifier graph compute so decode wait time is separated from compact logits copy time. This changes timing shape and is for profiling only. |
GGML_DFLASH_DEBUG=1 |
Enables DFlash debug logs such as prefill route/capture decisions. |
GGML_DFLASH_CRASH_TRACE=1 |
Enables high-volume crash breadcrumbs around recurrent backup and decode sync points. |
GGML_DFLASH_INPUT_DEBUG=1 |
Dumps DFlash drafter input metadata for input-shape debugging. |
GGML_DFLASH_VERBOSE_CONTRACT=1 |
Logs extra drafter/target contract details during DFlash setup. |
GGML_DFLASH_FORCE_CPU_CROSS=1 |
Force the CPU hidden-state cross path even when the GPU ring is available. |
GGML_DFLASH_VERIFY_PAD=1 |
Re-enable diagnostic verifier padding to the active draft depth. Default is off because padded rows consume target verify time but are not sampled or accepted. |
GGML_DFLASH_SHARED_DRAFT_BATCH=0 |
Disable shared multi-slot DFlash drafter batching and fall back to per-slot cached drafting. By default, flat multi-slot DFlash uses a composite batched K/V projection cache when available. |
GGML_DFLASH_GPU_RING=0 |
Disable the GPU cross-attention ring and force the CPU ring path. |
GGML_DFLASH_MULTI_GPU_TAPE=0 |
Disable default-on multi-GPU DFlash GPU ring, hidden capture, tape, and replay. Use to force the CPU/eval-callback fallback for split target placement. |
GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=0 |
Compatibility spelling for the same multi-GPU DFlash kill switch. |
GGML_DFLASH_MAX_CTX=N |
Cap the DFlash cross-attention context length. 0 removes the cap. |
| `GGML_DFLASH_KV_CACHE_MODE=k | v |
Server requests can override several defaults without restarting the process.
Speculative fields:
| JSON field | Meaning |
|---|---|
speculative.n_min |
Per-request minimum draft length. |
speculative.n_max |
Per-request max main-path draft length. |
speculative.branch_budget |
Per-request branch budget. |
speculative.p_min |
Per-request draft probability gate. |
Prompt/cache fields:
| JSON field | Meaning |
|---|---|
cache_prompt |
Whether the request can use prompt cache. |
n_cache_reuse |
Minimum chunk size for KV-shift cache reuse. |
Reasoning loop fields are listed in the loop-guard section. Sampling fields such as temperature, top_k, top_p, min_p, mirostat, adaptive_target, and adaptive_decay are also read from request JSON in the server task parser.
llama-server -m target.gguf \
--spec-type dflash \
--spec-draft-model draft-dflash.gguf \
--spec-draft-n-max 16 \
--spec-dflash-cross-ctx 1024 \
--spec-branch-budget 0 \
--ctx-size 51200 \
-ngl all \
--spec-draft-ngl all \
-b 2048 \
-ub 512 \
--kv-unified \
--cache-type-k turbo4 \
--cache-type-v turbo3_tcq \
--flash-attn onllama-server -m target.gguf \
--spec-type dflash \
--spec-draft-model draft-dflash.gguf \
--spec-draft-n-max 16 \
--spec-branch-budget 4 \
--spec-draft-top-k 4 \
--spec-draft-p-split 0.10 \
--spec-dm-controller profitMeasure tree mode against flat DFlash and no-spec baselines. Tree nodes consume batch/microbatch capacity and extra verification work.
llama-server -m target.gguf \
--mmproj mmproj.gguf \
--spec-type dflash \
--spec-draft-model draft-dflash.gguf \
--spec-branch-budget 0Do not expect tree DFlash, context shift, or cache reuse to survive multimodal initialization in the current server path.
llama-server -m model.gguf \
--spec-type ngram-mod \
--spec-ngram-mod-n-match 24 \
--spec-ngram-mod-n-min 48 \
--spec-ngram-mod-n-max 64llama-server -m model.gguf \
--reasoning on \
--reasoning-loop-guard force-close \
--reasoning-loop-min-tokens 1024 \
--reasoning-loop-window 2048 \
--reasoning-loop-interventions 1When changing DFlash, cache precision, context length, or batch size, compare against a no-spec baseline and keep all other inputs fixed:
- target model file
- draft model file, if any
- exact command line
- commit ID and dirty-worktree status
- prompt or prompt hash
- context length, cache types, batch sizes, and slot count
- target sampling settings
- prompt TPS, generation TPS, wall time, draft count, accepted count, and peak memory
Do not carry performance numbers between machines, model files, backends, or dirty worktrees without rerunning the measurement.