Status: Preview — functional; architectural accept ceiling at 1/n_draft for n_draft ≥ 2.
At --spec-draft-n-max 1, EAGLE3 accepts every drafted token (100 %). This is the recommended operating mode. For n_draft ≥ 2 the EAGLE3-3.0 autoregressive recurrence imposes a hard ceiling of 1/n_draft regardless of draft model quality; see §2 for the full explanation. The upstream fix (EAGLE 3.1, not yet in this fork) is tracked in §5 (Future).
| Item | Value |
|---|---|
| What it is | Separate-model speculative decode; the drafter consumes auxiliary hidden states extracted from 3 target layers |
| Flag | --spec-type draft-eagle3 + -md <eagle3-draft.gguf> |
| Recommended depth | --spec-draft-n-max 1 (100 % accept; 1 free token per cycle) |
| Ceiling at n_draft=3 | 33.333 % (65/195 drafted — 1/n_draft architectural ceiling; see §2) |
| Server support | Single-slot only — --parallel 1 required |
| Correctness baseline | 380c93384 — 3 B1 correctness fixes + KV-position anchor fix |
| Measurement hardware | AMD gfx1150 Vulkan |
Architectural limit. At n_draft ≥ 2 the 1/n_draft ceiling is not a draft model quality issue: a better drafter will not lift it. Use --spec-draft-n-max 1 for practical deployment. See §2.
Quick start (CLI):
llama-speculative-simple \
-m target.gguf \
-md eagle3-draft.gguf \
--spec-type draft-eagle3 \
--spec-draft-n-max 1 \
-fa on -ngl 99 --no-mmap

Quick start (server; read the --parallel 1 requirement in §2 first):
llama-server \
-m target.gguf \
-md eagle3-draft.gguf \
--spec-type draft-eagle3 \
--spec-draft-n-max 1 \
--parallel 1 \
-fa on -ngl 99 --no-mmap

EAGLE3 is a speculative-decode architecture originally ported from the carlosfundora / 1-bit-turbo
upstream. The drafter is a small separate transformer that consumes hidden-state features
extracted from 3 auxiliary target layers (eagle3.extract_layers GGUF key, typically layers 1,
15, 28 for a 28-layer target) rather than tokens alone.
Native-vs-fork reconciled 2026-06-13. Mainline (ggml-org) landed its own EAGLE3 loader in
PR #18039 (88a39274e); rather than maintain two parallel implementations, this fork adopted
mainline's version wholesale via a Path-A merge (9d0602368) and retired its own arch file. The
fork's original converter, conversion/eagle3.py, is now dead code (conversion is handled by
conversion/llama.py; see d487cece8) — fork-specific behavior (fc_norm, GGUF-key back-compat)
is layered on top of mainline's src/models/eagle3.cpp as additive patches.
Three B1 correctness bugs, plus the draft-batch KV-position anchor bug, were fixed as of 380c93384
(2026-05-30) before the current accept numbers were reached:
| Fix | Commit component | Effect |
|---|---|---|
| d2t token remap | B1: 87b5b3d8d (bundle) | Draft tokens are post-sampling mapped from draft-vocab to target-vocab via a d2t.weight I32 tensor; fix is dormant when the draft uses the full target vocabulary (no d2t.weight present) |
| norm_before_residual gating | B1 | eagle3.eagle3.norm_before_residual GGUF key now gates whether g_embd_norm or raw g_embd feeds the residual connection; previously the key had no effect |
| rope_factors propagation | B1 | Per-layer RoPE scaling factors (rope_freqs tensors) were loaded but passed as nullptr to ggml_rope_ext; now forwarded correctly |
| Draft-batch KV-position anchor | 380c93384 | Positions the draft batch from the drafter's own llama_memory_seq_pos_max(ctx_dft)+1 instead of the advancing target dp.n_past; mirrors DFlash fix 003ecc2d1; unblocked accept from 0 % to the measured rates |
Prior to 380c93384, accept was 0 % on every run: the draft batch was anchored to the target
position counter (dp.n_past), which advances every accepted cycle while the drafter KV stays
pinned at the prompt-end position. From cycle 2 on, the Y = X + 1 consecutive-position check in
llama_batch_allocr::init() rejected every draft decode silently ("eagle3 decoder failed").
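The failure mode described above can be reproduced with a toy model. The sketch below is a simulation, not the fork's code: `positions_are_consecutive`, `simulate`, and their parameters are hypothetical names, and it assumes the drafter KV stays pinned at the prompt-end position on every cycle while the target counter advances by a fixed number of committed tokens.

```python
def positions_are_consecutive(kv_pos_max: int, first_batch_pos: int) -> bool:
    """Toy version of the Y = X + 1 check: a batch must start right after the last KV position."""
    return first_batch_pos == kv_pos_max + 1


def simulate(prompt_len: int, cycles: int, tokens_per_cycle: int, anchor: str) -> list:
    """Return, per draft cycle, whether the draft batch passes the position check.

    anchor="target"  -> batch starts at the advancing target counter (pre-fix behaviour)
    anchor="drafter" -> batch starts at the drafter's own pos_max + 1 (post-fix behaviour)
    """
    drafter_pos_max = prompt_len - 1  # assumption: drafter KV stays pinned at prompt end
    target_n_past = prompt_len        # assumption: target counter starts at prompt end
    results = []
    for _ in range(cycles):
        if anchor == "target":
            first_pos = target_n_past
        else:
            first_pos = drafter_pos_max + 1
        results.append(positions_are_consecutive(drafter_pos_max, first_pos))
        target_n_past += tokens_per_cycle  # the target advances every cycle
    return results


pre_fix = simulate(prompt_len=16, cycles=4, tokens_per_cycle=2, anchor="target")
post_fix = simulate(prompt_len=16, cycles=4, tokens_per_cycle=2, anchor="drafter")
print(pre_fix)   # first cycle passes, later cycles fail
print(post_fix)  # every cycle passes
```

In this toy model the target-anchored batch only lines up on the first cycle, which matches the "from cycle 2 on" behaviour described above; anchoring to the drafter's own position keeps every cycle consecutive.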
| Flag | Required | Description |
|---|---|---|
| --spec-type draft-eagle3 | Yes | Selects the EAGLE3 speculative path |
| -md <eagle3-draft.gguf> | Yes | EAGLE3 drafter GGUF (separate file from the target) |
| -m <target.gguf> | Yes | Target causal LM |
| --spec-draft-n-max N | Recommended | Draft depth per cycle. Default 3; set to 1 for production (see ceiling below) |
| -fa on | Recommended | Flash attention |
| --no-mmap | Recommended | Consistent with validated configurations |
Measured configuration: Qwen3.5-9B target + matching eagle3-draft-9b, AMD gfx1150 Vulkan,
--temp 0.
| n_draft | Drafted | Accepted | Accept % | Notes |
|---|---|---|---|---|
| 1 | 101 | 101 | 100.000 % | Every d0 accepted; 1 free token per cycle — recommended |
| 2 | 202 | 101 | 50.000 % | 1/n_draft ceiling (d1 always rejected) |
| 3 | 195 | 65 | 33.333 % | 1/n_draft ceiling (d1 and d2 always rejected) |
At n_draft ≥ 2, accept cannot exceed 1/n_draft under EAGLE3-3.0 regardless of draft model quality. The mechanism:
- d0 is seeded with g_embd = FC(target_hidden), an affine projection of the target's last-layer hidden state. The drafter attends the correct context, so d0 is reliably accepted.
- d1, d2, … are seeded with g_embd = llama_get_embeddings_ith(ctx_dft, -1), the drafter's own prenorm output from the previous step. This diverges immediately from the target distribution: the drafter has no access to the true target hidden states at d1+ positions, so every token after d0 is effectively random with respect to the target. Measured consequence: d1 and d2 are accepted 0 % of the time in the current gate models.
- A better-trained or larger drafter does not help: the recurrence itself feeds the wrong signal. This is an EAGLE3-3.0 architectural property, not a model-quality gap.
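The 1/n_draft figure follows directly from counting. The sketch below is illustrative arithmetic only: `accept_rate` is a hypothetical helper, and the cycle counts (101, 101, 65) are derived from the measured table as drafted / n_draft, under the assumption that exactly one token (d0) is accepted per cycle.

```python
from fractions import Fraction


def accept_rate(cycles: int, n_draft: int, accepted_per_cycle: int = 1) -> Fraction:
    """Accept rate when only the first `accepted_per_cycle` drafts of each cycle are accepted."""
    drafted = cycles * n_draft
    accepted = cycles * min(accepted_per_cycle, n_draft)
    return Fraction(accepted, drafted)


# (n_draft, cycles) pairs derived from the measured table: cycles = drafted / n_draft
rates = {n: accept_rate(cycles, n) for n, cycles in [(1, 101), (2, 101), (3, 65)]}
percent = {n: round(float(r) * 100, 3) for n, r in rates.items()}
print(percent)
```

The number of cycles cancels out, so the rate depends only on n_draft; that is why the ceiling holds for any run length.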
Recommendation: deploy at --spec-draft-n-max 1. Every d0 is accepted (100 %), delivering
1 free token per draft cycle with no rejected-draft overhead.
llama-server requires --parallel 1 — EAGLE3 must run with a single active generating slot.
The drafter indexes target features by the global last token in the verify batch with no per-sequence
mapping; with n_parallel > 1, multiple slots' tokens share one target ubatch and all slots are
seeded from the same (wrong) token → cross-sequence contamination and incorrect drafts.
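A toy illustration of the contamination described above. This is not the fork's code: the batch is modelled as a list of (slot id, feature label) pairs, and both function names are hypothetical. `seed_per_sequence` shows what a per-sequence mapping would return; it is a sketch of the missing plumbing, not an existing implementation.

```python
def seed_by_global_last(ubatch):
    """Every slot is seeded from the last token of the shared ubatch (no per-sequence mapping)."""
    _, last_feature = ubatch[-1]
    return {slot: last_feature for slot, _ in ubatch}


def seed_per_sequence(ubatch):
    """Hypothetical alternative: each slot is seeded from its own last token in the ubatch."""
    seeds = {}
    for slot, feature in ubatch:
        seeds[slot] = feature  # later entries overwrite earlier ones for the same slot
    return seeds


# Two slots interleaved in one target ubatch
shared = [(0, "s0-tok41"), (1, "s1-tok7"), (0, "s0-tok42"), (1, "s1-tok8")]
# Single slot, as enforced by --parallel 1
single = [(0, "s0-tok41"), (0, "s0-tok42")]

contaminated = seed_by_global_last(shared)
expected = seed_per_sequence(shared)
print(contaminated)
print(expected)
```

With a single slot the two functions agree, which is why the global-last indexing is safe under --parallel 1.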
Drafter GGUF is not a standard causal LM. The EAGLE3 draft GGUF requires eagle3.extract_layers
and related hparams. It cannot be loaded as -m (it will fail the architecture check or produce
garbage). Always pass it via -md.
PPL is unchanged by construction — perplexity evaluation exercises only the prefill pass; the speculative path does not fire during PPL measurement.
Landed in b2766ef47 (2026-05-31) — port of PR #18039.
SpecForge 32K-vocab EAGLE3 drafters carry a compact output vocabulary (output.weight shaped
[n_embd, 32000]) plus a d2t.weight[32000] draft-to-target token map, rather than the full
target vocabulary. The compact-vocab path is transparent at the CLI level — use the same flags
as a full-vocab drafter:
llama-speculative-simple \
-m target-248320-vocab.gguf \
-md eagle3-specforge-32000-vocab.gguf \
--spec-type draft-eagle3 \
--spec-draft-n-max 1 \
-fa on -ngl 99 --no-mmap

How it works: llama_model_load_internal detects d2t.weight and derives n_draft_vocab
from its tensor width. The loader (src/models/eagle3.cpp) builds the output head at draft-vocab
width and registers the d2t remap table. Graph-side scatter remap translates compact-vocab logits
to target-vocab positions during speculative decode; no changes are needed to the CLI or the
speculative driver.
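The scatter remap can be pictured with plain lists. The sketch below is a conceptual model, not the graph code in src/models/eagle3.cpp: `scatter_to_target_vocab` and `argmax` are hypothetical helpers, the vocabulary sizes are shrunk for readability, and it assumes d2t holds absolute target token ids. If a given export stores offsets instead, the lookup would need to add the draft index.

```python
import math


def scatter_to_target_vocab(draft_logits, d2t, n_target_vocab):
    """Place each compact-vocab logit at its target-vocab position; unmapped ids stay at -inf."""
    if len(draft_logits) != len(d2t):
        raise ValueError("d2t must have one entry per draft-vocab id")
    target_logits = [-math.inf] * n_target_vocab
    for draft_id, logit in enumerate(draft_logits):
        target_logits[d2t[draft_id]] = logit  # assumption: d2t stores absolute target ids
    return target_logits


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


draft_logits = [0.1, 2.5, -1.0, 0.7]   # 4-entry draft vocabulary (toy size)
d2t = [3, 10, 11, 42]                  # draft id -> target id (toy values)
target_logits = scatter_to_target_vocab(draft_logits, d2t, n_target_vocab=50)
best_target_id = argmax(target_logits)
print(best_target_id)
```

Because unmapped positions are filled with negative infinity, greedy selection over the scattered logits can only pick a token the compact vocabulary actually covers.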
Smoke gate (ROCm gfx1150, 2026-05-31, pipefail runner, RC=0 on both runs):
| Config | Accept % | Notes |
|---|---|---|
| Qwen3.5-35B-A3B-MTP-IQ4_XS (248 320-vocab) + SpecForge-BF16 drafter (32 000-vocab + d2t), n_draft=3 | 33.333 % | compact-vocab path; d2t graph-remap active; coherent output |
| Qwen3.5-9B-IQ4_XS + eagle3-draft-9b (d2t=none), n_draft=3 | 33.333 % | full-vocab; no-regression vs. prior baseline |
Known limitation (pre-existing, harmless): The SpecForge GGUF exports d2t.weight as
GGML_TYPE_I64, but llama_model_eagle3_get_d2t() (commit 87b5b3d8d, already on main)
requires GGML_TYPE_I32 → the host-side fast path returns empty and the log prints d2t=none.
The graph-side remap path is active and correct; output is coherent. Optional future fix: widen
the getter to accept I64, or export d2t as I32 in conversion/llama.py.
- 100 % accept at n_draft=1: every draft cycle produces one free token at zero reject overhead. This is the effective config: no draft rejection means no wasted target verification.
- Higher accept than DFlash solo: 33.333 % at n_draft=3 vs. DFlash solo 25.1 %, despite sharing the same "stateless drafter KV" architecture.
- Server path works: llama-server is fully wired end-to-end via the generic common_speculative_{init,begin,draft,process,accept} spine; no server-specific code is needed. Feature extraction is passive and automatic per ubatch.
- Backend-agnostic: EAGLE3 does not require a specific backend; the drafter runs on whatever backend the draft context is assigned to.
- Architectural ceiling at n_draft ≥ 2: the 1/n_draft hard limit means n_draft=3 cannot exceed 33 % accept, and n_draft=2 cannot exceed 50 %. The ceiling is inherent to EAGLE3-3.0's autoregressive recurrence (see §2). EAGLE 3.1 is the upstream fix path (see §5).
- --parallel 1 only on llama-server: multi-slot server support would require per-sequence feature-index plumbing in the drafter; this is not yet implemented.
- Separate drafter GGUF required: unlike MTP (which bundles the draft head inside the target GGUF), EAGLE3 requires a separately distributed drafter file matched to the target architecture.
- Drafter KV is stateless per iteration: accepted tokens are not decoded into the draft KV. The drafter attends from a frozen positional frame (prompt-end RoPE positions), which is the same tradeoff DFlash makes and is why the ceiling persists even at n_draft=1 as sequence length grows. A future MTP-style catch-up path (decode accepted tokens into the draft KV) would address this but is not yet implemented.
| Config | Accept % | Notes |
|---|---|---|
| Qwen3.5-9B + eagle3-draft-9b, gfx1150 Vulkan, n_draft=1 | 100.000 % | 101/101; recommended config |
| Qwen3.5-9B + eagle3-draft-9b, gfx1150 Vulkan, n_draft=2 | 50.000 % | 101/202; 1/n_draft ceiling |
| Qwen3.5-9B + eagle3-draft-9b, gfx1150 Vulkan, n_draft=3 | 33.333 % | 65/195; 1/n_draft ceiling |
Measured: Qwen3.6-27B-Q4_K_M target + PRISM eagle3 drafter (27B head, compact-vocab), CUDA sm_75,
greedy (--temp 0), default n_draft. 36 cells across Q8_0 / Q4_K_M / IQ3_M × 3 contexts × 4
TriAttn/PFlash quadrants (binary d66c1886e; matrix-234 K-eagle3 chunk).
| Draft quant | Head size | Accept % | TG TPS (ctx4096, noTri/noPf) | Notes |
|---|---|---|---|---|
| Q8_0 | 569 M | 73.333 % | 7.93 t/s | 176/240 |
| Q4_K_M | 368 M | 73.333 % | 7.64 t/s | 176/240 |
| IQ3_M | 303 M | 73.333 % | 7.94 t/s | 176/240 |
Accept % is bit-for-bit identical across all 36 cells: in these runs the target model's verification outcome is independent of draft head quantization. TG TPS variation (~5 %, non-monotonic) is within measurement noise; the 17 GB target dominates over the 300–570 MB drafter.
Recommended minimum: IQ3_M (−47 % VRAM vs Q8_0, identical accept %). Q2_K and Q3_K_M are untested; given that IQ3_M matches Q8_0 they are expected to behave the same, but this has not been measured.
This applies to any EAGLE3.0 head: d0 is seeded from the target hidden state, so the target always verifies its own projection and accept is target-determined, not draft-quality-determined.
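The headline figures in this table can be rechecked with a few lines of arithmetic. The sketch below only uses numbers from the table; `percent` is a hypothetical helper, and the saving is computed from head file sizes on the assumption that VRAM use scales with them.

```python
def percent(numerator: float, denominator: float) -> float:
    """Ratio expressed as a percentage."""
    return 100.0 * numerator / denominator


# Accept rate reported for every quant: 176 accepted out of 240 drafted
accept_pct = round(percent(176, 240), 3)

# Head sizes in MB, taken from the table above
head_mb = {"Q8_0": 569, "Q4_K_M": 368, "IQ3_M": 303}
saving_vs_q8 = {
    quant: round(percent(head_mb["Q8_0"] - size, head_mb["Q8_0"]), 1)
    for quant, size in head_mb.items()
}
print(accept_pct, saving_vs_q8)
```

The IQ3_M saving of roughly 47 % is the figure quoted in the recommendation above.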
When llama_set_eagle3(ctx_tgt, model_eagle3) is called during speculative-init, it sets
cparams.eagle3_extract_enabled = true on the target context (src/llama-context.cpp:1320–1324).
After every target decode, llama_context::process_ubatch copies the hidden-state tensors from the
3 configured auxiliary layers into a host buffer eagle3_state.target_features laid out
[n_aux][n_embd][n_tok] (:1540–1554). Extraction is passive — no extra flags or logit overrides
are needed.
common_speculative_impl_draft_eagle3::draft() (common/speculative.cpp:501–599):
- Reads llama_get_eagle3_target_features(ctx_tgt). This call internally calls ctx_tgt->synchronize() before returning the host pointer, draining any in-flight GPU compute before features are read (:3800–3804).
- CPU-side FC-projects the target features to produce g_embd: one vector per token position, one per auxiliary layer.
- Seeds the draft decode via llama_set_eagle3_g_embeddings and runs d0 autoregressively.
- d1+ steps reuse the drafter's own prenorm output as the next g_embd, which is the source of the architectural ceiling (§2).
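The seeding pattern in the steps above can be sketched as a small loop. This is a schematic model, not a transcription of common/speculative.cpp: `draft_loop`, `toy_step`, and the "target" / "drafter" labels are hypothetical, and the drafter is replaced by a stand-in function that returns a token and a prenorm vector.

```python
def draft_loop(n_draft, fc_of_target_features, drafter_step):
    """Run n_draft steps; step 0 is seeded from the target, later steps from the drafter itself.

    drafter_step(g_embd) must return (token, prenorm_output).
    """
    tokens, seed_sources = [], []
    g_embd, source = fc_of_target_features, "target"
    for _ in range(n_draft):
        token, prenorm = drafter_step(g_embd)
        tokens.append(token)
        seed_sources.append(source)
        # Next step reuses the drafter's own prenorm output as g_embd
        g_embd, source = prenorm, "drafter"
    return tokens, seed_sources


def toy_step(g_embd):
    """Stand-in drafter: derives a token id from the seed and halves it as a fake prenorm output."""
    token = int(sum(g_embd)) % 100
    prenorm = [x * 0.5 for x in g_embd]
    return token, prenorm


tokens, seed_sources = draft_loop(3, [4.0, 6.0], toy_step)
print(tokens, seed_sources)
```

Only the first entry of `seed_sources` comes from the target, which is the structural reason the later drafts lose contact with the target's hidden states.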
The draft batch is anchored to llama_memory_seq_pos_max(ctx_dft, seq_id) + 1 (the drafter's own
next-KV slot), not to the target position counter dp.n_past — this is the fix in 380c93384
that mirrors DFlash commit 003ecc2d1.
After drafting, the shared common_speculative_process / common_speculative_accept driver runs
the target model over the full draft + verify pass and commits accepted tokens. EAGLE3's process()
and accept() are no-ops — rollback and KV management are handled by the driver's checkpoint/restore
machinery (ckpt.update_dft / ckpt.load_dft / llama_memory_seq_rm).
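For greedy decoding (--temp 0), verification in speculative decoding generally amounts to keeping the longest prefix of the draft that matches what the target would have produced. The sketch below shows that rule in isolation; `count_accepted` is a hypothetical helper and the token ids are made up. It is not the driver's actual implementation.

```python
def count_accepted(drafted, verified):
    """Number of leading draft tokens that match the target's own greedy choices."""
    accepted = 0
    for draft_token, target_token in zip(drafted, verified):
        if draft_token != target_token:
            break  # the first mismatch invalidates everything after it
        accepted += 1
    return accepted


# d0 matches, d1 does not: one token accepted even though d2 happens to match
only_d0 = count_accepted([5, 9, 2], [5, 7, 2])
all_match = count_accepted([5, 7, 2], [5, 7, 2])
none_match = count_accepted([4, 7, 2], [5, 7, 2])
print(only_d0, all_match, none_match)
```

Under this rule a rejected d1 also discards d2, so at n_draft=3 a cycle in which only d0 matches contributes one accepted token out of three drafted.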
- MTP speculative decode: 75.56 % accept (Qwen3.5/3.6 9B internal NextN-tail); head bundled inside the target GGUF; --spec-type draft-mtp
- DFlash drafter spec-decode: 25.1 % accept; cross-attention ring drafter; --spec-type dflash
- PHANTOM-X self-speculative n-gram drafter: no separate draft model; --spec-type phantom
- NLD diffusion self-spec: bidirectional draft for diffusion LMs
Future-watch — not yet in this fork.
EAGLE 3.1 was released 2026-05-26 (vLLM v0.22.0; Kimi K2.6 checkpoint is the first compatible
weight set). The 3.1 architecture changes the d1+ g_embd recurrence — specifically, a post-norm
FC path that feeds the draft prenorm with a closer approximation of FC(target_hidden) at each
autoregressive step rather than the draft's own prenorm output.
If the 3.1 fix lands as described, it would dissolve the 1/n_draft ceiling: d1+ tokens would
receive a higher-quality g_embd signal, and accept could in principle approach the d0 rate at all
depths. This is the upstream fix path for the ceiling documented in §2.
Current status in this fork: EAGLE 3.1 is not in tree. There are no 3.1 checkpoint weights
available for the current validated target (Qwen3.5-9B). The compact-vocab (32K-vocab + d2t)
loader prerequisite (formerly tracked) is now resolved — b2766ef47 ports PR #18039
and enables loading SpecForge-style compact-vocab EAGLE3 drafters. The remaining blocker is the
absence of 3.1 checkpoint weights. Monitor upstream vLLM and EAGLE3 repositories for releases.
| Item | Value |
|---|---|
| Status | FUTURE-WATCH — awaiting 3.1 checkpoint + compatible loader |
| Canonical upstream | SafeAILab EAGLE + vLLM v0.22.0 (released 2026-05-26) |
| Architecture change | FC-norm (prenorm) + post-norm dual-path g_embd at d1+ steps — signals closer approximation of FC(target_hidden) rather than drafter prenorm output |
| Impact | Dissolves 1/n_draft ceiling; d1+ accept could approach d0 rate (near 100 %) at all depths |
| Re-check trigger | When EAGLE 3.1 checkpoint weights exist for Qwen3.5-9B (or current validated target) — compact-vocab loader (formerly tracked) is now RESOLVED (b2766ef47) |
| Cross-reference | PR #18039 LANDED b2766ef47 (compact-vocab, 2026-05-31); see §2 for ceiling explanation |
| Next step | Monitor SafeAILab EAGLE and vLLM repositories for weight releases; import when available + validate accept on Qwen3.5-9B |
All BF16 GGUFs rebuilt 2026-06-14 from new-schema converter (PR #18039 path A — keys
eagle3.target_layers / eagle3.target_hidden_size / eagle3.norm_before_residual, no double
prefix). Quant ladders rebuilt 2026-06-19 from those new-schema BF16s. Build: 819 (3ff8220f3) on
gfx1103 (ROCm). IQ3_M was produced without an imatrix: generating one is not possible for standalone
eagle3 GGUFs (no token_embd.weight, so llama-imatrix fails at the tensor count check).
Filename: Qwen3.6-35B-A3B-eagle3-<q>.gguf
| Quant | Size | BPW | Magic | Built |
|---|---|---|---|---|
| BF16 | 408M | 16.00 | 47475546 | 2026-06-14 |
| Q8_0 | 222M | 8.51 | 47475546 | 2026-06-19 |
| Q4_K_M | 148M | 3.67 | 47475546 | 2026-06-19 |
| Q3_K_M | 127M | 3.12 | 47475546 | 2026-06-19 |
| IQ3_M | 122M | 3.01 | 47475546 | 2026-06-19 |
| Q2_K | 112M | 2.76 | 47475546 | 2026-06-19 |
Has fc_norm.weight tensor (EAGLE3.1 FC normalization). 15 tensors total.
Filename: Qwen3.5-9B-eagle3-<q>.gguf
| Quant | Size | BPW | Magic | Built |
|---|---|---|---|---|
| BF16 | 773M | 16.00 | 47475546 | 2026-06-14 |
| Q8_0 | 416M | 8.51 | 47475546 | 2026-06-19 |
| Q4_K_M | 272M | 5.58 | 47475546 | 2026-06-19 |
| Q3_K_M | 234M | 4.69 | 47475546 | 2026-06-19 |
| IQ3_M | 227M | 4.54 | 47475546 | 2026-06-19 |
| Q2_K | 206M | 4.10 | 47475546 | 2026-06-19 |
Has d2t tensor (compact-vocab 32K→248320 mapping). 14 tensors total.
Filename: Qwen3.5-35B-A3B-eagle3-<q>.gguf
| Quant | Size | BPW | Magic | Built |
|---|---|---|---|---|
| BF16 | 464M | 16.00 | 47475546 | 2026-06-14 |
No quant ladder yet — target Qwen3.5-35B-A3B is not in active use; build on demand.
Filename: Qwen3.6-27B-eagle3-<q>.gguf
| Quant | Size | BPW | Magic | Built |
|---|---|---|---|---|
| BF16 | 1.1G | 16.00 | 47475546 | 2026-06-14 |
| Q8_0 | 569M | 8.51 | 47475546 | 2026-06-19 |
| Q4_K_M | 368M | 5.51 | 47475546 | 2026-06-19 |
| IQ3_M | 303M | 4.54 | 47475546 | 2026-06-19 |
Has d2t tensor (compact-vocab). 14 tensors total. Only 3-quant ladder (no Q3_K_M/Q2_K) — lower
priority drafter.
All 4 BF16 heads loaded successfully under llama-gguf r + llama-quantize --dry-run. New-schema
keys confirmed present (eagle3.target_layers, eagle3.target_hidden_size,
eagle3.norm_before_residual). No old double-prefix keys (eagle3.eagle3.*) detected.
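The key check described above can be expressed as a small function over a list of metadata key names. This is a sketch: `check_schema` and the sample key lists are hypothetical, and it assumes the key names have already been read from the GGUF by whatever tool is at hand.

```python
REQUIRED_KEYS = (
    "eagle3.target_layers",
    "eagle3.target_hidden_size",
    "eagle3.norm_before_residual",
)
LEGACY_PREFIX = "eagle3.eagle3."  # old double-prefix schema


def check_schema(key_names):
    """Return (missing new-schema keys, legacy double-prefix keys found)."""
    keys = set(key_names)
    missing = [key for key in REQUIRED_KEYS if key not in keys]
    legacy = sorted(key for key in keys if key.startswith(LEGACY_PREFIX))
    return missing, legacy


# Illustrative key lists, not dumps of real files
new_schema = ["general.architecture", *REQUIRED_KEYS]
old_schema = ["general.architecture", "eagle3.eagle3.norm_before_residual"]

new_result = check_schema(new_schema)
old_result = check_schema(old_schema)
print(new_result)
print(old_result)
```

A head passes when both returned lists are empty, which corresponds to the "keys confirmed present, no double-prefix keys detected" result reported above.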