SD-080 DFlash VRAM Fix — Benchmark Notes
=========================================

BASELINE COMMAND:
  llama-speculative-simple \
    -m /root/models/Qwen3.5-27B-heretic.Q4_K_M.gguf \
    -md /root/models/dflash-draft-q4_k_m.gguf \
    -c 8192 -n 512 --draft-max 5 -fa on -ngl 99 --temp 0 \
    -p "Write a comprehensive Python implementation of a red-black tree data structure with insert, delete, search, and traversal operations. Include proper rotations, color fixing, and all edge cases. Add type hints and docstrings for every method. Then write unit tests for each operation using pytest. Make it production quality with proper error handling."

BUILD:
  SD-079 build: /root/spec-decode-builds/SD-079-turbo-mma-fused-20260425/src/build/bin/
  cmake flags: -DGGML_CUDA=ON -DGGML_NATIVE=ON -DCMAKE_CUDA_COMPILER=/opt/cuda/bin/nvcc -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_CUDA_ARCHITECTURES=86

BASELINE RESULTS:
  decode:  46.094 t/s (514 tokens in 11.151s)
  accept:  56.148% (379/675)
  prefill: 463.620 t/s (64 tokens)
  draft-max: 5
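
  A quick sanity check of the derived figures above, recomputed from the raw
  counts. This is only a sketch: it assumes the accept ratio is
  accepted/drafted tokens, and the function names are illustrative rather than
  anything from the tool's output.

```python
# Raw counts copied from the baseline results.
GENERATED_TOKENS = 514
DECODE_SECONDS = 11.151
ACCEPTED = 379  # assumption: draft tokens accepted by the target model
DRAFTED = 675   # assumption: draft tokens proposed by the drafter


def tokens_per_second(n_tokens: int, seconds: float) -> float:
    """Throughput in tokens per second."""
    return n_tokens / seconds


def accept_rate_pct(accepted: int, drafted: int) -> float:
    """Acceptance rate as a percentage of drafted tokens."""
    return 100.0 * accepted / drafted


decode_tps = tokens_per_second(GENERATED_TOKENS, DECODE_SECONDS)
accept_pct = accept_rate_pct(ACCEPTED, DRAFTED)

print(f"decode: {decode_tps:.3f} t/s")
print(f"accept: {accept_pct:.3f}%")
```

  The recomputed decode rate can differ from the logged one in the last digit
  because the logged time is itself rounded to milliseconds.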

  Drafter sched_reserve growth during run:
    82.31 -> 89.81 -> 104.81 -> 134.81 MiB (4 reservations, i.e. 3 re-reservations)
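
  The step sizes in that sequence can be checked with a few lines. The sketch
  below only uses the four logged values; reading the pattern as a doubling
  bucket is an interpretation of the numbers, not something the log states.

```python
# Logged drafter sched_reserve sizes in MiB, in order of appearance.
reserve_mib = [82.31, 89.81, 104.81, 134.81]

# Growth between consecutive reservations.
deltas = [round(b - a, 2) for a, b in zip(reserve_mib, reserve_mib[1:])]

# Ratio between consecutive growth steps.
ratios = [round(b / a, 2) for a, b in zip(deltas, deltas[1:])]

print("deltas (MiB):", deltas)
print("step ratios:", ratios)
```

  Each step is twice the previous one, which is what a size bucket that
  doubles on overflow would produce.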

  VRAM breakdown (from llama_memory_breakdown_print):
    CUDA0 total used: ~18729 MiB
    Model (target): 15088 MiB
    Model (draft):    904 MiB
    KV cache:         811 MiB (512 MiB KV + 299 MiB recurrent)
    Compute (target): 505 MiB
    Compute (draft):  135 MiB (final, after bucket growth)
    Tree verify:      720 MiB
    Tape GPU:          38 MiB
    Unaccounted:     2324 MiB

  Drafter model: 5 layers, 32 heads, 8 kv heads, n_embd=5120, head_dim=128
  n_target_features=25600 (5 target layers x 5120)

SD-080 FIX: Sliding window cap (GGML_DFLASH_MAX_CTX=4096 default)
  Build: /root/spec-decode-builds/SD-080-dflash-vram-fix-20260425/src/build/bin/
  Changes: dflash_draft.cpp (window + cap logic), llama-context.cpp (set_cross_data bucket cap)

REGRESSION TEST (cap=4096 default, same prompt/flags):
  decode:  46.109 t/s (514 tokens in 11.148s)
  accept:  56.148% (379/675)
  prefill: 459.015 t/s (64 tokens)
  Drafter reserve growth: 82.31 -> 89.81 -> 104.81 -> 134.81 MiB (same as baseline)
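  → No regression: decode and accept rate match baseline (prefill is -1.0% on a 64-token prompt).
    The cap does not engage until the context exceeds 4096 tokens.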

STRESS TEST (cap=256):
  decode:  46.025 t/s (516 tokens in 11.211s)
  accept:  54.245% (377/694)
  Drafter reserve growth: 82.31 -> 89.81 -> 104.81 MiB (stops, never reaches 134.81)
  → Cap verified working. Accept rate drops ~1.9 percentage points (56.148% -> 54.245%) with reduced context.
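
  To illustrate why a cap bounds memory, here is a minimal sliding-window
  sketch. It is NOT the logic in dflash_draft.cpp: the class name and the use
  of a deque are assumptions made for illustration, and plain integers stand
  in for per-token feature rows.

```python
from collections import deque


class SlidingFeatureWindow:
    """Hypothetical helper: keeps only the most recent `cap` rows."""

    def __init__(self, cap: int) -> None:
        if cap <= 0:
            raise ValueError("cap must be positive")
        # deque(maxlen=...) silently drops the oldest item on overflow.
        self.rows = deque(maxlen=cap)

    def append(self, row) -> None:
        self.rows.append(row)

    def __len__(self) -> int:
        return len(self.rows)


# 64 prompt tokens + 514 generated tokens, as in the runs above.
TOTAL_TOKENS = 64 + 514

window = SlidingFeatureWindow(cap=256)
for token_idx in range(TOTAL_TOKENS):
    window.append(token_idx)  # stand-in for one row of target features

oldest, newest = window.rows[0], window.rows[-1]
print(len(window), oldest, newest)
```

  Storage stays at `cap` rows however long the run gets. The drafter only sees
  the most recent tokens, which fits the small accept-rate drop at cap=256.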

F16 TENSOR ATTEMPT (reverted):
  decode:  41.854 t/s (-9.2% regression)
  accept:  55.588%
  Cause: ggml_cast(F16→F32) before matmul adds ~1.1s overhead per 512-token gen.
  VRAM saved: ~200 MiB at cap=4096 (25600×4096×2 vs ×4 bytes).
  Verdict: Not worth a 9% speed hit. The sliding window cap alone saves the bulk
  (~2800 MiB saved at 40K context; buffer capped at 400 MiB regardless of context length).
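
  The regression percentage and the per-generation overhead follow from the two
  decode rates. A small sketch, assuming a 512-token generation as quoted above
  (variable names are illustrative):

```python
BASELINE_TPS = 46.094  # baseline decode rate, t/s
F16_TPS = 41.854       # decode rate with the F16 tensor, t/s
N_TOKENS = 512         # generation length used for the overhead estimate

# Relative change in decode throughput, in percent (negative = slower).
change_pct = 100.0 * (F16_TPS - BASELINE_TPS) / BASELINE_TPS

# Extra wall-clock time for the same number of generated tokens.
extra_seconds = N_TOKENS / F16_TPS - N_TOKENS / BASELINE_TPS

print(f"throughput change: {change_pct:.1f}%")
print(f"extra time per {N_TOKENS} tokens: {extra_seconds:.2f} s")
```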

VRAM SAVINGS ANALYSIS (at 40K context):
  Without fix: target_hidden grows to 25600 × 32768 × 4 bytes = 3200 MiB (~3.1 GiB, F32)
  With cap=4096: 25600 × 4096 × 4 bytes = 400 MiB (F32), fixed regardless of context
  Savings: ~2800 MiB (~2.7 GiB) VRAM at 40K context
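
  The buffer sizes above can be reproduced with the sketch below, working in
  MiB (1 MiB = 2^20 bytes) to match the rest of these notes. The helper name is
  hypothetical; the feature count and token counts are the ones quoted above.

```python
MIB = 1024 * 1024
N_TARGET_FEATURES = 25600  # 5 target layers x 5120


def target_hidden_mib(n_tokens: int, bytes_per_elem: int = 4,
                      n_features: int = N_TARGET_FEATURES) -> float:
    """Size of an n_features x n_tokens buffer in MiB."""
    return n_features * n_tokens * bytes_per_elem / MIB


uncapped_f32 = target_hidden_mib(32768)                 # no cap, F32
capped_f32 = target_hidden_mib(4096)                    # cap=4096, F32
capped_f16 = target_hidden_mib(4096, bytes_per_elem=2)  # cap=4096, F16
saved_by_cap = uncapped_f32 - capped_f32
saved_by_f16 = capped_f32 - capped_f16

print(f"uncapped F32: {uncapped_f32:.0f} MiB")
print(f"capped F32:   {capped_f32:.0f} MiB")
print(f"saved by cap: {saved_by_cap:.0f} MiB")
print(f"saved by F16: {saved_by_f16:.0f} MiB")
```

  The cap accounts for 2800 MiB of the reduction, against 200 MiB for the
  reverted F16 change.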