BeeLlama Args Reference

This is the practical BeeLlama tuning reference. It covers the arguments that change model loading, GPU placement, context behavior, prompt caching, KV-cache precision, DFlash/speculative decoding, multimodal behavior, sampling, reasoning, server endpoints, and logs. It is not a byte-for-byte replacement for --help; it is the single page to read before tuning a real BeeLlama server.

For the feature overview and public-repo comparison, read beellama-features.md.

Start With This Mental Model

BeeLlama tuning is easiest if you choose settings in this order:

1. Load the target model and optional multimodal projector.
2. Decide GPU/offload placement for the target model and draft model.
3. Set context length, batch sizes, KV precision, and prompt-cache behavior.
4. Enable DFlash or another speculative backend.
5. Tune sampling, chat/reasoning, server endpoints, and logs.

The DFlash controls have several similar names. These are the common confusion points:

Setting	It controls	It does not control
`--ctx-size`	Target model context length	DFlash cross-attention window
`--spec-draft-ctx-size`	Draft model context allocation	Target context length
`--spec-dflash-cross-ctx`	Recent target hidden-state tokens visible to DFlash	Server KV context length
`--spec-draft-n-max`	Main-path draft-token ceiling	Extra tree branches
`--spec-draft-n-min`	Minimum draft size required before using a draft	Minimum response length
`--spec-branch-budget`	Extra DDTree branch nodes beyond the main path	Main-path draft depth
`--spec-draft-top-k`	Candidates per position for tree mode	Sampling `--top-k`
`--spec-draft-temp`	Drafter sampling temperature	Target sampler temperature. With `auto` the drafter mirrors `--temp`, not the other way around.
`--spec-dm-*`	Runtime controller that can lower/raise active draft depth	Static parser default for `--spec-draft-n-max`

Launch Shape

This example mirrors the important shape of a typical DFlash launch script while using placeholder paths:

llama-server \
  -m "path/to/target.gguf" \
  --mmproj "path/to/mmproj.gguf" \
  --no-mmproj-offload \
  --spec-draft-model "path/to/drafter.gguf" \
  --spec-type dflash \
  --spec-dflash-cross-ctx 1024 \
  --port 8082 \
  -np 1 \
  --kv-unified \
  -ngl all \
  --spec-draft-ngl all \
  -b 2048 -ub 512 \
  --ctx-size 102400 \
  --cache-type-k q5_0 --cache-type-v q4_0 \
  --flash-attn on \
  --cache-ram 0 \
  --jinja \
  --no-mmap --mlock \
  --no-host --metrics \
  --log-timestamps --log-prefix --log-colors off \
  --reasoning on \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --temp 0.6 --top-k 20 --min-p 0.0

What this shape means:

Block	Purpose
`-m`, `--mmproj`, `--no-mmproj-offload`	Load the target model and projector, while keeping the projector off GPU.
`--spec-type dflash`, `--spec-draft-model`, `--spec-dflash-cross-ctx`	Enable flat DFlash drafting and choose the cross-attention window. `--spec-draft-n-max` is not passed in this example, so the DFlash effective default of `16` applies.
`-np 1`, `--kv-unified`, `--ctx-size`	Run one server slot with explicit unified KV and a large target context.
`-ngl all`, `--spec-draft-ngl all`	Fully offload target and draft models when devices can hold them.
`-b`, `-ub`	Set target prompt prefill batching. Bee keeps upstream defaults unless you override them.
`--cache-type-k`, `--cache-type-v`	Use asymmetric KV precision.
`--cache-ram 0`	Disable the server prompt-cache RAM subsystem. Live-slot prefix reuse still works.
`--no-mmap`, `--mlock`, `--no-host`	Prefer locked model memory and direct backend buffer behavior over filesystem cache behavior.
`--metrics`, `--log-*`	Expose metrics and make logs easier to capture.
`--reasoning on`, `--chat-template-kwargs`, `--temp`, `--top-k`, `--min-p`	Control thinking/template behavior and target sampling.

Model Loading

Arg	Code default	When to use
`-m`, `--model FNAME`	Required for normal local runs	Target model GGUF path.
`-mu`, `--model-url URL`	Unset	Download/load a target model from URL.
`-hf`, `-hfr`, `--hf-repo <user>/<model>[:quant]`	Unset	Resolve target model from Hugging Face.
`-hff`, `--hf-file FILE`	Unset	Override the file chosen from `--hf-repo`.
`--models-dir PATH`	Unset	Server router/model directory mode.
`--models-preset PATH`	Unset	Server model preset file.
`--models-max N`	`4`	Maximum loaded models in router mode.
`--models-autoload`, `--no-models-autoload`	Enabled	Whether router models autoload.

Draft model loading uses parallel draft-specific names:

Arg	Code default	When to use
`--spec-draft-model`, `-md`, `--model-draft FNAME`	Unset	Draft model GGUF path. Required for DFlash unless loaded through draft HF args.
`--spec-draft-hf`, `-hfd`, `-hfrd`, `--hf-repo-draft <user>/<model>[:quant]`	Unset	Resolve draft model from Hugging Face.

Bee auto-detects DFlash when a loaded draft model reports a DFlash block size and --spec-type was not already dflash.

GPU, CPU, And Offload Placement

Arg	Code default	When to use
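`-ngl`, `--gpu-layers`, `--n-gpu-layers N|auto|all`		Target model GPU layers. `all` offloads the whole model when the devices can hold it.
`--spec-draft-ngl`, `-ngld`, `--gpu-layers-draft`, `--n-gpu-layers-draft N|auto|all`	`auto`	Draft model GPU layers.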
`-dev`, `--device dev1,dev2,...`	All usable devices	Restrict target model devices.
`--spec-draft-device`, `-devd`, `--device-draft dev1,dev2,...`	Inherited/available devices	Restrict draft model devices. For DFlash, a single explicit GPU also pins the target output tensor to that GPU before target load, because the drafter shares the target output tensor.
`-sm`, `--split-mode none|layer|row`		How the target model is split across devices.
`-mg`, `--main-gpu INDEX`	`0`	Main GPU for scratch/small tensors.
`-ts`, `--tensor-split N0,N1,...`	Unset	Manual tensor split across devices.
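`-fit`, `--fit on|off`	Enabled	Fit memory-dependent settings such as context size to free device memory; see `--fit-ctx` and `--fit-target`.
`-fitp`, `--fit-print on|off`	Disabled	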
`-fitc`, `--fit-ctx N`	`4096`	Minimum context size that `--fit` may choose.
`-fitt`, `--fit-target MiB0,MiB1,...`	`1024` MiB margin per device	Free-memory margin per device for `--fit`.
`-nkvo`, `--no-kv-offload`	Off	Keep KV cache off GPU. Usually hurts GPU-serving speed, but can be needed under VRAM pressure.
`--no-host`	Off	Bypass host buffer so extra backend buffers can be used. Used in the launch script.
`--no-fused-gdn`	Off	Disable fused Gated Delta Net kernels. Use only to isolate backend/kernel issues.

CPU/thread controls:

Arg	Code default	When to use
`-t`, `--threads N`	`-1` auto	CPU threads for generation. The reference launch script uses `-t 8`; the abbreviated example under Launch Shape omits it.
`-tb`, `--threads-batch N`	Same as `--threads`	CPU threads for prompt/batch processing.
`--spec-draft-threads`, `-td`, `--threads-draft N`	Same as target threads	Draft model CPU generation threads.
`--spec-draft-threads-batch`, `-tbd`, `--threads-batch-draft N`	Same as draft threads	Draft model prompt/batch threads.
`--cpu-mask`, `--cpu-range`, `--cpu-strict`	No strict placement	Pin target CPU work.
Draft CPU affinity variants	Inherit target CPU settings	Use `--spec-draft-cpu-mask`, `--spec-draft-cpu-range`, `--spec-draft-cpu-strict`, and batch variants when the drafter needs separate CPU placement.
`--prio`, `--prio-batch`, draft priority variants	Normal	Change scheduler priority. Avoid realtime unless you know the host can tolerate it.
`--poll`, `--poll-batch`, draft poll variants	`50` for base poll	Busy-wait level. Higher can reduce latency at CPU cost.

Advanced draft placement:

Arg	Use
`--spec-draft-override-tensor`, `-otd`, `--override-tensor-draft`	Override draft tensor buffer placement by tensor-name pattern.
`--spec-draft-cpu-moe`, `-cmoed`, `--cpu-moe-draft`	Keep draft MoE tensors on CPU.
`--spec-draft-n-cpu-moe`, `--spec-draft-ncmoe`, `-ncmoed`, `--n-cpu-moe-draft N`	Keep only some draft MoE layers on CPU.

Context, Batch, And Prompt State

Arg	Code default	When to use
`-c`, `--ctx-size N`	`0` means model-trained context	Target model context allocation. Large values need KV memory.
`-n`, `--predict`, `--n-predict N`	`-1` no limit	Max new tokens for CLI/server defaults where applicable.
`-b`, `--batch-size N`	`2048`	Logical prompt batch size. Larger improves prefill until memory/backend limits.
`-ub`, `--ubatch-size N`	`512`	Physical microbatch size. Lower when VRAM spikes or DFlash graph memory grows.
`--keep N`	`0`	Tokens to keep from the initial prompt when context is managed.
`-np`, `--parallel N`	`1`	Server slots. In server help, `-1` means auto. More slots need more KV and DFlash state.
`-cb`, `--cont-batching`; `-nocb`, `--no-cont-batching`	Enabled	Continuous batching for server decode.
`--context-shift`, `--no-context-shift`	Disabled	Infinite generation via KV shifting when supported. Disabled under multimodal.
`--swa-full`	Disabled	Full-size SWA cache for models where it is supported.
`-ctxcp`, `--ctx-checkpoints`, `--swa-checkpoints N`	`32`	Max prompt context checkpoints per slot. Important for long prompt reuse and contexts where speculative decoding needs checkpoints.
`-cms`, `--checkpoint-min-step N`	`256`	Minimum spacing between prompt context checkpoints. `0` disables the spacing limit.
`-cram`, `--cache-ram N`	`8192` MiB	Server prompt-cache RAM limit. `-1` no limit, `0` disables the prompt cache subsystem.
`--cache-prompt`, `--no-cache-prompt`	Enabled	Whether request prompts can use prompt caching.
`--cache-reuse N`	`0` disabled	Minimum chunk size to reuse from cache through KV shifting. Disabled under multimodal and unsupported contexts.
`-kvu`, `--kv-unified`; `-no-kvu`, `--no-kv-unified`	Raw default false; help notes enabled if slots are auto	Single unified KV buffer shared across sequences. Required by idle-slot cache.
`--cache-idle-slots`, `--no-cache-idle-slots`	Enabled in params, but requires unified KV and cache RAM	Save and clear idle slots when a new task starts. Disabled by the server if requirements are not met.

DFlash-specific parser defaults:

Situation	Effective behavior
`--spec-type dflash` and no `--spec-draft-ctx-size`	Bee sets draft context to `256`.
`--spec-type dflash` and no explicit `-b`	Bee keeps the upstream target batch default, currently `2048`.
`--spec-type dflash` and no explicit `-ub`	Bee keeps the upstream target microbatch default, currently `512`.
You pass explicit `-b` or `-ub`	Bee keeps your values.

DFlash no longer lowers the target batch defaults at parse time. If memory is tight, lower -b or -ub explicitly after measuring the model, context, KV type, and draft placement.
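The parse-time behavior above can be summarized as a small lookup. This is only a sketch of the documented outcome, not Bee's parser: the function name `effective_value` and the setting labels (`draft-ctx`, `draft-n-max`, `batch`, `ubatch`) are hypothetical, and the `--spec-draft-n-max` values are taken from the DFlash core args table.

```shell
# effective_value SETTING SPEC_TYPE [EXPLICIT]
# Prints the value that ends up in effect. An explicit value always wins.
# SETTING labels are hypothetical shorthand for this sketch.
effective_value() {
  if [ -n "${3:-}" ]; then
    echo "$3"
    return 0
  fi
  case "$1:$2" in
    draft-ctx:dflash)   echo 256 ;;   # DFlash effective default
    draft-ctx:*)        echo 0 ;;     # raw default
    draft-n-max:dflash) echo 16 ;;    # DFlash effective default
    draft-n-max:*)      echo 3 ;;     # raw upstream default
    batch:*)            echo 2048 ;;  # unchanged by DFlash
    ubatch:*)           echo 512 ;;   # unchanged by DFlash
    *) echo "unknown setting: $1" >&2; return 1 ;;
  esac
}

effective_value draft-ctx dflash       # omitted under DFlash
effective_value ubatch dflash 256      # explicit -ub 256 is kept
```

The point of the sketch is that only the draft-side values change when DFlash is selected; the target batch sizes stay at the upstream defaults unless you pass them.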

KV Cache Precision

K and V cache precision are independent:

--cache-type-k turbo4 --cache-type-v turbo3_tcq
--cache-type-k q8_0   --cache-type-v turbo3_tcq
--cache-type-k f16    --cache-type-v f16

Accepted KV cache type names in the current parser include:

Family	Values
Floating	`f32`, `f16`, `bf16`
Upstream quantized	`q8_0`, `q4_0`, `q4_1`, `iq4_nl`, `q5_0`, `q5_1`
Bee/Turbo lineage	`turbo2`, `turbo3`, `turbo4`, `turbo2_tcq`, `turbo3_tcq`

Bee fork cache storage:

Type	Storage	Best first use
`turbo4`	`4.125` bpv	Strong first compressed K-cache candidate.
`turbo3`	`3.125` bpv	More compression than `turbo4`; uses `QK_TURBO3 = 128` in Bee.
`turbo2`	`2.125` bpv	Stronger compression, higher quality risk.
`turbo3_tcq`	`3.25` bpv	TCQ path, commonly useful for V in Bee experiments.
`turbo2_tcq`	`2.25` bpv	Most compressed TCQ path; verify carefully.

For current long-context Qwen and Gemma DFlash serving, the common asymmetric choice is:

--cache-type-k q5_0 --cache-type-v q4_0

Do not assume enum compatibility with TheTom's public TurboQuant fork. Bee uses the buun enum order for Turbo/TCQ cache types while keeping a 128-value turbo3 block. Bee keeps TCQ cache IDs at 45 and 46; Tom's TQ3_1S and TQ4_1S model weight formats use new GGML type IDs 47 and 48.

Model Weight Quantization

Tom's TQ3_1S and TQ4_1S are model weight formats, not KV-cache types. Use them through llama-quantize:

llama-quantize model.f16.gguf model.tq3_1s.gguf TQ3_1S
llama-quantize model.f16.gguf model.tq4_1s.gguf TQ4_1S

The CLI file-type IDs are 43 for TQ3_1S and 44 for TQ4_1S; the serialized GGML type IDs are 47 and 48. This avoids interpreting existing Bee/Buun TCQ cache tensors as Tom weight tensors. The current port includes CPU quantize/dequantize and CPU fallback matmul support; backend acceleration is not claimed unless that backend has been built and tested separately.

Model Memory Files And OS Cache

Arg	Default	When to use
`--mmap`, `--no-mmap`	mmap enabled	`--no-mmap` avoids relying on filesystem cache behavior; the launch script uses it.
`--mlock`	Disabled	Ask the OS to keep model memory resident. Useful when paging would ruin latency.
`-dio`, `--direct-io`	Disabled	Read model data without normal buffered I/O.

--no-mmap --mlock is a deliberate "keep the model resident" style. It can fail or be limited by OS permissions and available RAM.
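One common OS-side limit is the locked-memory resource limit. The sketch below compares a model size against the value reported by `ulimit -l`, assuming a shell (such as bash on Linux) where that option exists and reports KiB. The function name `mlock_fits` and the 20 GiB model size are hypothetical.

```shell
# mlock_fits MODEL_KIB LIMIT
# LIMIT is what `ulimit -l` prints: a KiB count or the word "unlimited".
mlock_fits() {
  [ "$2" = "unlimited" ] || [ "$1" -le "$2" ]
}

model_kib=$((20 * 1024 * 1024))   # hypothetical 20 GiB model
if mlock_fits "$model_kib" "$(ulimit -l)"; then
  echo "memlock limit allows locking the model"
else
  echo "memlock limit is too low for --mlock; raise it or drop --mlock"
fi
```

This only checks the limit. Available RAM still has to cover the model and the rest of the process.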

Flash Attention And Backend Behavior

Arg	Default	When to use
`-fa`, `--flash-attn on|off|auto`		Flash Attention mode. The launch example forces `on`.
`--no-fused-gdn`	Off	Disable fused GDN kernels for debugging or backend compatibility.

For Turbo/TCQ cache types, verify the backend path you use. A cache type being accepted by the parser does not prove every backend/kernel combination is fast or supported.

DFlash And Speculative Decoding

Select the speculative backend:

Arg	Code default	Values
`--spec-type MODE`	`none`	`none`, `ngram-cache`, `ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, `ngram-mod`, `suffix`, `copyspec`, `recycle`, `dflash`

DFlash core args:

Arg	Code default	Use
`--spec-draft-model`, `-md`, `--model-draft`	Unset	Draft GGUF path. DFlash needs a DFlash-compatible draft model.
`--spec-draft-ctx-size`, `-cd`, `--ctx-size-draft`	Raw `0`; DFlash effective `256` if omitted	Draft context allocation. Not the same as target `--ctx-size`.
`--spec-draft-ngl`, `-ngld`	`auto`	Draft model GPU layers.
`--spec-draft-type-k`, `-ctkd`, `--cache-type-k-draft`	`f16`	Draft model K cache precision.
`--spec-draft-type-v`, `-ctvd`, `--cache-type-v-draft`	`f16`	Draft model V cache precision.
`--spec-draft-n-max`	Raw upstream `3`; DFlash effective `16` if omitted	Base max main-path draft tokens.
`--spec-draft-n-min`	`0`	Minimum draft length required before speculative verification is used.
`--spec-branch-budget`	`0`	DDTree branch nodes beyond the main draft path. `0` means flat DFlash.
`--spec-draft-top-k`	`1`	Candidates per draft position for tree mode. Forced to `1` when branch budget is `0`.
`--spec-draft-p-split`, `--draft-p-split`	`0.10`	Probability threshold for creating tree branches.
`--spec-draft-p-min`, `--draft-p-min`	`0.0`	Minimum draft probability gate.
`--spec-draft-temp T`	`0.0`	DFlash drafter temperature. `0` greedy, positive uses sampled/Gumbel path, `auto` mirrors target temp.
`--spec-dflash-cross-ctx N`	`512`	Recent target hidden-state tokens visible to the DFlash drafter.
`--spec-dflash-max-slots N`	Match `-np`	Max server slots with DFlash state; higher slots fall back to non-speculative decoding. Use this to cap DFlash below `-np`.

DFlash-only args such as --spec-branch-budget, --spec-draft-top-k, --spec-draft-temp, --spec-dflash-*, and --spec-dm-* do not change MTP or other non-DFlash speculative modes. If they are passed without --spec-type dflash, Bee keeps them as long as a supplied draft model can still be auto-detected as DFlash; otherwise it emits a warning and ignores them.

Flat DFlash:

llama-server -m target.gguf \
  --spec-type dflash \
  --spec-draft-model draft-dflash.gguf \
  --spec-draft-n-max 16 \
  --spec-branch-budget 0 \
  --spec-dflash-cross-ctx 512 \
  -ngl all \
  --spec-draft-ngl all

Tree DFlash:

llama-server -m target.gguf \
  --spec-type dflash \
  --spec-draft-model draft-dflash.gguf \
  --spec-draft-n-max 16 \
  --spec-branch-budget 4 \
  --spec-draft-top-k 4 \
  --spec-draft-p-split 0.10

Sampled DFlash:

llama-server -m target.gguf \
  --spec-type dflash \
  --spec-draft-model draft-dflash.gguf \
  --spec-draft-temp auto \
  --temp 0.6

Sampled DFlash activates the rejection-sampling path when both drafter and target temperature exceed zero. Draft log-probabilities must be available for rejection sampling to produce correct output. Without those conditions, Bee uses the normal greedy speculative path.
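The temperature conditions can be written as a small decision helper. This is a sketch of the rule as stated here, not Bee's code: the function name `draft_path` is hypothetical, and it does not model the requirement that draft log-probabilities are available.

```shell
# draft_path DRAFT_TEMP TARGET_TEMP
# DRAFT_TEMP may be a number or "auto"; "auto" mirrors the target temperature.
# Does not check draft log-probability availability.
draft_path() {
  draft_temp=$1
  if [ "$draft_temp" = "auto" ]; then
    draft_temp=$2
  fi
  awk -v d="$draft_temp" -v t="$2" \
    'BEGIN { if (d > 0 && t > 0) { print "rejection-sampling" } else { print "greedy" } }'
}

draft_path auto 0.6   # prints rejection-sampling
draft_path 0 0.6      # prints greedy
```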

Multi-GPU DFlash:

In v0.3.0, DFlash GPU cross rings, graph hidden capture, recurrent tape buffers, conv replay, and direct GDN replay are attempted by default on split CUDA/ROCm target placement. The runtime allocates capture and tape buffers on each layer's backend device and synchronizes every backend touched by a DFlash capture or replay. Unsupported placement, host recurrent state, missing backend helpers, or pointer-device validation failure falls back to the CPU/eval-callback path instead of forcing one GPU path.

Use GGML_DFLASH_MULTI_GPU_TAPE=0 to disable this default-on multi-GPU path. GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=0 is accepted as a compatibility spelling from PR #32 review/testing notes. Both variables are kill switches only; unset means enabled.

Adaptive Draft-Max

Adaptive Draft-Max is enabled by default for DFlash. It can reduce the active draft depth below --spec-draft-n-max, turn speculation off after consecutive below-threshold cycles, and probe again periodically.

Arg	Code default	Use
`--spec-dm-adaptive`, `--no-spec-dm-adaptive`	Enabled	Enable or disable adaptive depth. Disable for fixed-depth benchmarks.
`--spec-dm-controller MODE`	`profit`	`profit` or `fringe`.
`--spec-dm-probe-interval N`	`16`	Minimum cycles to wait before trying a speculative cycle when speculation is off; `profit` backs off after failed wake probes.
`--spec-dm-probe-fraction F`	`0.25`	Fraction of base_n_max to use as the probe depth when speculation is off.
`--spec-dm-explore-interval N`	`12`	Draft at a higher depth every N cycles to collect timing data beyond the current n_max.
`--spec-dm-off-dwell N`	`8`	Consecutive cycles below the profit/fringe threshold before speculation is disabled.
`--spec-dm-fringe-min F`	`0.30`	Fringe rate below which n_max drops toward 0 (after off-dwell).
`--spec-dm-fringe-max F`	`0.50`	Fringe rate at or above which full base_n_max is used.
`--spec-dm-min-reach N`	`3`	Minimum samples at a new draft position before fringe can promote to it.
`--spec-dm-profit-min F`	`0.05`	Minimum relative throughput improvement over no-spec baseline before dwell clears.
`--spec-dm-profit-raise-margin F`	`0.05`	Relative margin a higher depth must exceed to replace the current depth.
`--spec-dm-profit-lower-margin F`	`0.05`	Relative margin a lower depth must exceed to replace the current depth.
`--spec-dm-profit-ewma-alpha F`	`0.15`	Smoothing factor for acceptance and timing running averages.
`--spec-dm-profit-min-samples N`	`3`	Minimum observations per position/depth before scoring that depth as ready.
`--spec-dm-profit-warmup N`	`0`	Minimum measured samples for each initial positive-depth profit probe after the no-spec baseline is seeded (0 = use --spec-dm-profit-min-samples).
`--spec-dm-profit-baseline-interval N`	`1024`	Active speculative cycles between no-spec baseline reprobes (0 = disabled).

Use profit for normal serving. Use fringe when you want behavior tied more directly to observed draft acceptance near the active tail. Use --no-spec-dm-adaptive --spec-draft-n-max N when comparing fixed draft depths.
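For a fixed-depth comparison, it helps to generate the command lines from one template so that only the depth changes. The sketch below prints the commands instead of running them. The function name `sweep_commands`, the depth list, and the model file names are placeholders.

```shell
# sweep_commands DEPTH...
# Prints one fixed-depth launch command per depth, with the adaptive
# controller disabled so the depth stays constant during the run.
sweep_commands() {
  for depth in "$@"; do
    printf '%s\n' "llama-server -m target.gguf --spec-type dflash --spec-draft-model draft-dflash.gguf --no-spec-dm-adaptive --spec-draft-n-max $depth"
  done
}

sweep_commands 4 8 16
```

Feed each printed command to your own benchmark harness and keep every other flag identical between runs.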

Other Speculative Backends

Mode	Key args	Code defaults	When to use
`ngram-cache`	`--lookup-cache-static`, `--lookup-cache-dynamic`	Paths unset	You have reusable lookup cache files.
`ngram-mod`	`--spec-ngram-mod-n-min`, `--spec-ngram-mod-n-max`, `--spec-ngram-mod-n-match`	`48`, `64`, `24`	Repeated-context workloads with controllable draft length.
`ngram-simple`	`--spec-ngram-simple-size-n`, `--spec-ngram-simple-size-m`, `--spec-ngram-simple-min-hits`	`12`, `48`, `1`	Simple no-draft repeated-context baseline.
`ngram-map-k`	`--spec-ngram-map-k-size-n`, `--spec-ngram-map-k-size-m`, `--spec-ngram-map-k-min-hits`	`12`, `48`, `1`	Map-backed ngram proposal tuning.
`ngram-map-k4v`	`--spec-ngram-map-k4v-size-n`, `--spec-ngram-map-k4v-size-m`, `--spec-ngram-map-k4v-min-hits`	`12`, `48`, `1`	Alternative map-backed ngram proposal tuning.
`copyspec`	No public CLI gamma knob in current parser	Gamma `6` in params	Copy repeated prompt/context spans without a model drafter.
`recycle`	No public CLI k knob in current parser	`k=8` successors	Repeated token-successor patterns.
`suffix`	No public CLI suffix knobs in current parser	depth `64`, factor `2.0`, offset `0.0`, min prob `0.1`	Repeated suffix continuations.

Presets:

Arg	Behavior
`--spec-default`	Sets `ngram-mod` with match `24`, min `48`, max `64`.

Removed generic ngram args:

Removed arg	Use instead
`--spec-ngram-size-n`	The backend-specific `--spec-ngram-*-size-n` or `--spec-ngram-mod-n-match`.
`--spec-ngram-size-m`	The backend-specific `--spec-ngram-*-size-m`.
`--spec-ngram-min-hits`	The backend-specific `--spec-ngram-*-min-hits`.

DFlash Compatibility Spellings

Accepted aliases:

Alias	Current target
`--draft-p-split`	`--spec-draft-p-split`
`--draft-p-min`	`--spec-draft-p-min`

Removed in v0.3.0:

Removed arg	Use instead
`--draft`, `--draft-n`, `--draft-max`	`--spec-draft-n-max`
`--draft-min`, `--draft-n-min`	`--spec-draft-n-min`
`--draft-topk`	`--spec-draft-top-k`
`--draft-model`	`--spec-draft-model`, `-md`, or upstream `--model-draft`
`--dflash-max-slots`	`--spec-dflash-max-slots`
`--tree-budget TOTAL`	`--spec-branch-budget N` branch nodes beyond the main draft path
`--spec-dflash-default`	`--spec-type dflash --spec-draft-model ...` with explicit DFlash args
`--spec-draft-replace`, `--spec-replace`	No replacement; the parsed replacement list was unused

Names from older buun-era experiments that are not accepted by the current Bee parser include:

--draft-temp
--dflash-cross-ctx
--dm-adaptive
--dm-ar-up
--dm-ar-down
--dm-ar-window
--dm-probe-interval
--dm-probe-fraction
--spec-dm-ar-up
--spec-dm-ar-down
--spec-dm-ar-window

Use the canonical --spec-* names in new commands.
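When migrating older launch scripts, a small lookup built from the removed-args table above can flag stale names. This is a sketch: the function name `replacement_for` is hypothetical, and it only covers flags that have a direct replacement in that table.

```shell
# replacement_for OLD_FLAG
# Prints the v0.3.0 replacement, or returns non-zero if the flag is not in
# the removed-args table. Note: --tree-budget took a TOTAL, while
# --spec-branch-budget counts only nodes beyond the main path, so the value
# must be recomputed, not copied.
replacement_for() {
  case "$1" in
    --draft|--draft-n|--draft-max) echo "--spec-draft-n-max" ;;
    --draft-min|--draft-n-min)     echo "--spec-draft-n-min" ;;
    --draft-topk)                  echo "--spec-draft-top-k" ;;
    --draft-model)                 echo "--spec-draft-model" ;;
    --dflash-max-slots)            echo "--spec-dflash-max-slots" ;;
    --tree-budget)                 echo "--spec-branch-budget" ;;
    *) return 1 ;;
  esac
}

for flag in --draft-max --draft-topk --ctx-size; do
  if new=$(replacement_for "$flag"); then
    echo "$flag was removed; use $new"
  fi
done
```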

Multimodal

Arg	Default	Use
`-mm`, `--mmproj FILE`	Unset	Multimodal projector path.
`-mmu`, `--mmproj-url URL`	Unset	Projector URL.
`--mmproj-auto`, `--no-mmproj`, `--no-mmproj-auto`	Auto enabled	Use projector file if available, especially with HF model loading.
`--mmproj-offload`, `--no-mmproj-offload`	Offload enabled	Put projector on GPU or keep it off GPU. The launch script uses `--no-mmproj-offload`.
`--image`, `--audio FILE`	Unset	CLI multimodal input files.

Runtime compatibility rules when --mmproj is loaded:

context shift is disabled
cache reuse is disabled
flat DFlash can remain active
--spec-branch-budget is forced to 0 under DFlash
non-DFlash speculative types are set to none

This means DFlash plus multimodal is supported only as flat DFlash in the current server path.
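The speculative part of these rules can be sketched as a function that maps what you requested to what stays active once a projector is loaded. The function name `effective_spec_with_mmproj` and its output format are hypothetical; the server applies these rules internally.

```shell
# effective_spec_with_mmproj SPEC_TYPE BRANCH_BUDGET
# Prints the speculative setup that remains once --mmproj is loaded.
effective_spec_with_mmproj() {
  if [ "$1" = "dflash" ]; then
    if [ "$2" -ne 0 ]; then
      echo "note: --spec-branch-budget $2 is forced to 0 with --mmproj" >&2
    fi
    echo "dflash branch-budget=0"
  else
    echo "none"
  fi
}

effective_spec_with_mmproj dflash 4      # tree request falls back to flat
effective_spec_with_mmproj ngram-mod 0   # non-DFlash types are set to none
```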

Chat, Reasoning, And Sampling

Chat/template args:

Arg	Default	Use
`--jinja`, `--no-jinja`	Enabled	Jinja chat template engine.
`--chat-template TEMPLATE`	Model metadata	Inline/custom template name or content depending on parser path.
`--chat-template-file FILE`	Unset	Load a custom chat template from file.
`--chat-template-kwargs JSON`	Unset	Template kwargs. The launch script uses `{"preserve_thinking":true}`.
`--reasoning on|off|auto`, `-rea`		Thinking/reasoning mode. The launch example uses `on`.
`--reasoning-format none|deepseek|deepseek-legacy`		Reasoning output format.
`--reasoning-budget N`	`-1` unrestricted	`0` ends thinking immediately, positive values cap thinking tokens.
`--reasoning-budget-message MESSAGE`	None	Message injected when the reasoning budget is exhausted.

Sampling args people commonly tune:

Arg	Code default	Use
`--temp`, `--temperature`	`0.80`	Target sampler temperature. `0` means greedy.
`--top-k N`	`40`	Keep top K target tokens before later samplers. This is not DFlash `--spec-draft-top-k`.
`--top-p N`	`0.95`	Nucleus sampling; `1.0` disables.
`--min-p N`	`0.05`	Minimum probability relative to best token; `0.0` disables.
`--top-nsigma`, `--top-n-sigma N`	`-1.0`	Sigma-based filter; negative disables.
`--xtc-probability N`	`0.0`	XTC probability; `0.0` disables.
`--xtc-threshold N`	`0.10`	XTC threshold.
`--typical`, `--typical-p N`	`1.0`	Locally typical sampling; `1.0` disables.
`--repeat-last-n N`	`64`	Repetition penalty window; `-1` means context size.
`--repeat-penalty N`	`1.0`	Repetition penalty; `1.0` disables.
`--presence-penalty N`	`0.0`	Presence penalty.
`--frequency-penalty N`	`0.0`	Frequency penalty.
`--dry-multiplier N`	`0.0`	DRY repetition penalty multiplier; `0.0` disables.
`--dry-base N`	`1.75`	DRY base.
`--dry-allowed-length N`	`2`	DRY allows this repetition length before penalty.
`--dry-penalty-last-n N`	`-1`	DRY scan window; `-1` means context size.
`--adaptive-target N`	`-1.0`	Adaptive-p target probability; negative disables.
`--adaptive-decay N`	`0.90`	Adaptive-p decay.
`--dynatemp-range N`	`0.0`	Dynamic temperature range; `0.0` disables.
`--dynatemp-exp N`	`1.0`	Dynamic temperature exponent.
`--mirostat N`	`0`	`0` disabled, `1` Mirostat, `2` Mirostat 2.0.
`--mirostat-lr N`	`0.10`	Mirostat learning rate.
`--mirostat-ent N`	`5.00`	Mirostat target entropy.

If you use --spec-draft-temp auto, changing target --temp also changes DFlash drafter temperature. If --spec-draft-temp is a number, target --temp and drafter temperature are independent.

Reasoning Loop Guard

Bee-specific guard args:

CLI arg	JSON field	Code default	Use
`--reasoning-loop-guard MODE`	`reasoning_loop_guard`	`force-close`	`off`, `force-close`, or `stop`.
`--reasoning-loop-min-tokens N`	`reasoning_loop_min_tokens`	`1024`	Do not check before this many hidden reasoning tokens.
`--reasoning-loop-window N`	`reasoning_loop_window`	`2048`	Tail window scanned for repeated loops.
`--reasoning-loop-max-period N`	`reasoning_loop_max_period`	`512`	Largest periodic loop length checked.
`--reasoning-loop-min-coverage N`	`reasoning_loop_min_coverage`	`768`	Minimum repeated coverage needed to trigger.
`--reasoning-loop-check-interval N`	`reasoning_loop_check_interval`	`32`	Accepted-token interval between checks.
`--reasoning-loop-interventions N`	`reasoning_loop_interventions`	`1`	Force-close interventions before stop behavior.

Validation rules in the server require positive windows/periods/coverage/intervals, window >= min_coverage, max_period <= window / 3, and min_tokens >= min_coverage.
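A pre-flight check of these rules can catch a rejected combination before the server starts. The sketch below restates the listed constraints with integer arithmetic. The function name `validate_loop_guard` and its argument order are hypothetical, and the server's own checks remain authoritative.

```shell
# validate_loop_guard MIN_TOKENS WINDOW MAX_PERIOD MIN_COVERAGE CHECK_INTERVAL
# Returns 0 when the values satisfy the rules listed above.
validate_loop_guard() {
  min_tokens=$1; window=$2; max_period=$3; min_coverage=$4; interval=$5
  for value in "$window" "$max_period" "$min_coverage" "$interval"; do
    [ "$value" -gt 0 ] || return 1
  done
  [ "$window" -ge "$min_coverage" ] || return 1
  # max_period <= window / 3, written without division
  [ $((max_period * 3)) -le "$window" ] || return 1
  [ "$min_tokens" -ge "$min_coverage" ] || return 1
}

if validate_loop_guard 1024 2048 512 768 32; then
  echo "code defaults satisfy the validation rules"
fi
```

For integers, `max_period * 3 <= window` accepts the same values whether `window / 3` is read as exact or truncating division.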

Server, Endpoints, And Network

Arg	Code default	Use
`--host HOST`	`127.0.0.1`	Server bind host.
`--port PORT`	`8080`	Server port.
`-to`, `--timeout N`	`600`	Read timeout.
`--threads-http N`	`-1`	HTTP worker threads.
`--api-key KEY`	Unset	Require API key. Do not put secrets in committed scripts.
`--api-key-file FNAME`	Unset	Read API key from file. Prefer this over command-line secrets.
`--ssl-key-file FNAME`	Unset	TLS key file.
`--ssl-cert-file FNAME`	Unset	TLS cert file.
`--metrics`	Disabled	Enable Prometheus-compatible metrics endpoint.
`--props`	Disabled	Enable changing global properties via `POST /props`.
`--slots`, `--no-slots`	Enabled	Expose or hide slot monitoring endpoint.
`--slot-save-path PATH`	Unset	Enable slot save/restore actions.
`--media-path PATH`	Unset	Path for served/uploaded media handling.
`--webui`, `--no-webui`	Enabled	Enable built-in Web UI.
`--webui-config JSON`, `--webui-config-file PATH`	Unset	Web UI settings.
`--webui-mcp-proxy`, `--no-webui-mcp-proxy`	Disabled	Experimental MCP CORS proxy; do not enable for untrusted environments.

Logs

Arg	Use
`--log-timestamps`	Include timestamps in logs.
`--log-prefix`	Include log prefixes.
`--log-colors on|off`	Colored log output. The launch example uses `off`.
`-lv`, `--verbosity N`, `--log-verbosity N`	Log threshold. `0` generic, `1` error, `2` warning, `3` info, `4` debug. Also available as `LLAMA_LOG_VERBOSITY`.
`-v`, `--verbose`, `--log-verbose`	Set verbosity to the maximum debug level.

The launch script uses all three so captured logs are stable:

--log-timestamps --log-prefix --log-colors off

Routine per-ubatch decode timing and the non-profile spec cycle summary are debug-level logs. Use --verbosity 4 or --verbose when you want those lines.

DFlash diagnostic environment variables:

Env var	Use
`GGML_DFLASH_PROFILE=default,prefill`	Enables summary, replay, copy, verify, and prefill diagnostics without trace logging. `1`, `on`, `true`, and `default` mean `summary,replay,copy,verify`.
`GGML_DFLASH_PROFILE_SYNC_SPLIT=1`	Forces a diagnostic scheduler sync after verifier graph compute so decode wait time is separated from compact logits copy time. This changes timing shape and is for profiling only.
`GGML_DFLASH_DEBUG=1`	Enables DFlash debug logs such as prefill route/capture decisions.
`GGML_DFLASH_CRASH_TRACE=1`	Enables high-volume crash breadcrumbs around recurrent backup and decode sync points.
`GGML_DFLASH_INPUT_DEBUG=1`	Dumps DFlash drafter input metadata for input-shape debugging.
`GGML_DFLASH_VERBOSE_CONTRACT=1`	Logs extra drafter/target contract details during DFlash setup.
`GGML_DFLASH_FORCE_CPU_CROSS=1`	Force the CPU hidden-state cross path even when the GPU ring is available.
`GGML_DFLASH_VERIFY_PAD=1`	Re-enable diagnostic verifier padding to the active draft depth. Default is off because padded rows consume target verify time but are not sampled or accepted.
`GGML_DFLASH_SHARED_DRAFT_BATCH=0`	Disable shared multi-slot DFlash drafter batching and fall back to per-slot cached drafting. By default, flat multi-slot DFlash uses a composite batched K/V projection cache when available.
`GGML_DFLASH_GPU_RING=0`	Disable the GPU cross-attention ring and force the CPU ring path.
`GGML_DFLASH_MULTI_GPU_TAPE=0`	Disable default-on multi-GPU DFlash GPU ring, hidden capture, tape, and replay. Use to force the CPU/eval-callback fallback for split target placement.
`GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=0`	Compatibility spelling for the same multi-GPU DFlash kill switch.
`GGML_DFLASH_MAX_CTX=N`	Cap the DFlash cross-attention context length. `0` removes the cap.
`GGML_DFLASH_KV_CACHE_MODE=k|v`	

Request JSON Overrides

Server requests can override several defaults without restarting the process.

Speculative fields:

JSON field	Meaning
`speculative.n_min`	Per-request minimum draft length.
`speculative.n_max`	Per-request max main-path draft length.
`speculative.branch_budget`	Per-request branch budget.
`speculative.p_min`	Per-request draft probability gate.

Prompt/cache fields:

JSON field	Meaning
`cache_prompt`	Whether the request can use prompt cache.
`n_cache_reuse`	Minimum chunk size for KV-shift cache reuse.

Reasoning loop fields are listed in the loop-guard section. Sampling fields such as temperature, top_k, top_p, min_p, mirostat, adaptive_target, and adaptive_decay are also read from request JSON in the server task parser.
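As an illustration, the speculative overrides could be assembled into a JSON fragment like the one below. This sketch assumes the field names are sent as flat dotted keys exactly as listed in the table; the function name `spec_overrides` is hypothetical, and the fragment still has to be merged into a full request body for whichever endpoint you call.

```shell
# spec_overrides N_MIN N_MAX BRANCH_BUDGET P_MIN
# Prints a JSON object with the per-request speculative fields.
# Assumption: flat dotted keys, as spelled in the table above.
spec_overrides() {
  printf '{"speculative.n_min": %s, "speculative.n_max": %s, "speculative.branch_budget": %s, "speculative.p_min": %s}\n' \
    "$1" "$2" "$3" "$4"
}

spec_overrides 0 8 0 0.0
```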

Recipes

Flat DFlash, Long Context, Compressed KV

llama-server -m target.gguf \
  --spec-type dflash \
  --spec-draft-model draft-dflash.gguf \
  --spec-draft-n-max 16 \
  --spec-dflash-cross-ctx 1024 \
  --spec-branch-budget 0 \
  --ctx-size 51200 \
  -ngl all \
  --spec-draft-ngl all \
  -b 2048 \
  -ub 512 \
  --kv-unified \
  --cache-type-k turbo4 \
  --cache-type-v turbo3_tcq \
  --flash-attn on

Tree DFlash Experiment

llama-server -m target.gguf \
  --spec-type dflash \
  --spec-draft-model draft-dflash.gguf \
  --spec-draft-n-max 16 \
  --spec-branch-budget 4 \
  --spec-draft-top-k 4 \
  --spec-draft-p-split 0.10 \
  --spec-dm-controller profit

Measure tree mode against flat DFlash and no-spec baselines. Tree nodes consume batch/microbatch capacity and extra verification work.

Multimodal Flat DFlash

llama-server -m target.gguf \
  --mmproj mmproj.gguf \
  --spec-type dflash \
  --spec-draft-model draft-dflash.gguf \
  --spec-branch-budget 0

Do not expect tree DFlash, context shift, or cache reuse to survive multimodal initialization in the current server path.

No-Draft Repeated-Context Baseline

llama-server -m model.gguf \
  --spec-type ngram-mod \
  --spec-ngram-mod-n-match 24 \
  --spec-ngram-mod-n-min 48 \
  --spec-ngram-mod-n-max 64

Reasoning Guard For Thinking Models

llama-server -m model.gguf \
  --reasoning on \
  --reasoning-loop-guard force-close \
  --reasoning-loop-min-tokens 1024 \
  --reasoning-loop-window 2048 \
  --reasoning-loop-interventions 1

Verification Discipline

When changing DFlash, cache precision, context length, or batch size, compare against a no-spec baseline, keep every other input fixed, and record the following for each run:

target model file
draft model file, if any
exact command line
commit ID and dirty-worktree status
prompt or prompt hash
context length, cache types, batch sizes, and slot count
target sampling settings
prompt TPS, generation TPS, wall time, draft count, accepted count, and peak memory

Do not carry performance numbers between machines, model files, backends, or dirty worktrees without rerunning the measurement.
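Part of that record can be captured mechanically. The sketch below assumes `git` and GNU `sha256sum` are available; the function name `run_record` and the output layout are hypothetical. Timing, draft and accepted counts, and peak memory still have to come from your own measurements.

```shell
# run_record PROMPT_FILE COMMAND...
# Prints commit ID, dirty-worktree status, prompt hash, and the exact command.
# Outside a git checkout this reports commit=unknown and dirty=no.
run_record() {
  prompt_file=$1
  shift
  commit=$(git rev-parse --verify HEAD 2>/dev/null || echo unknown)
  if [ -n "$(git status --porcelain 2>/dev/null)" ]; then
    dirty=yes
  else
    dirty=no
  fi
  prompt_hash=$(sha256sum "$prompt_file" | cut -d' ' -f1)
  printf 'commit=%s dirty=%s prompt_sha256=%s\n' "$commit" "$dirty" "$prompt_hash"
  printf 'command=%s\n' "$*"
}

# Example: run_record prompt.txt llama-server -m target.gguf --spec-type dflash
```

Store the output next to the measured numbers so a later rerun can confirm that the inputs really match.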
