SD-080 DFlash VRAM Fix — Benchmark Notes
=========================================
BASELINE COMMAND:
llama-speculative-simple \
-m /root/models/Qwen3.5-27B-heretic.Q4_K_M.gguf \
-md /root/models/dflash-draft-q4_k_m.gguf \
-c 8192 -n 512 --draft-max 5 -fa on -ngl 99 --temp 0 \
-p "Write a comprehensive Python implementation of a red-black tree data structure with insert, delete, search, and traversal operations. Include proper rotations, color fixing, and all edge cases. Add type hints and docstrings for every method. Then write unit tests for each operation using pytest. Make it production quality with proper error handling."
BUILD:
SD-079 build: /root/spec-decode-builds/SD-079-turbo-mma-fused-20260425/src/build/bin/
cmake flags: -DGGML_CUDA=ON -DGGML_NATIVE=ON -DCMAKE_CUDA_COMPILER=/opt/cuda/bin/nvcc -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_CUDA_ARCHITECTURES=86
BASELINE RESULTS:
decode: 46.094 t/s (514 tokens in 11.151s)
accept: 56.148% (379/675)
prefill: 463.620 t/s (64 tokens)
draft-max: 5
Drafter sched_reserve growth during run:
82.31 -> 89.81 -> 104.81 -> 134.81 MiB (4 reservations, 3 growth steps)
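Note on the growth pattern: the sketch below only does arithmetic on the four reserve sizes listed above. It shows the step sizes and the ratio between consecutive steps. Reading this as a doubling bucket scheme is an inference from the numbers, not something confirmed against the scheduler code.

```python
# Drafter sched_reserve sizes in MiB, copied from the baseline run above.
reserve_mib = [82.31, 89.81, 104.81, 134.81]

# Size of each growth step (difference between consecutive reservations).
steps = [round(b - a, 2) for a, b in zip(reserve_mib, reserve_mib[1:])]

# Ratio between consecutive steps; 2.0 means a step is twice the previous one.
ratios = [round(b / a, 2) for a, b in zip(steps, steps[1:])]

print("steps (MiB):", steps)
print("step ratios:", ratios)
```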
VRAM breakdown (from llama_memory_breakdown_print):
CUDA0 total used: ~18729 MiB
Model (target): 15088 MiB
Model (draft): 904 MiB
KV cache: 811 MiB (512 MiB KV + 299 MiB recurrent)
Compute (target): 505 MiB
Compute (draft): 135 MiB (final, after bucket growth)
Tree verify: 720 MiB
Tape GPU: 38 MiB
Unaccounted: 2324 MiB
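Cross-check of the breakdown: a small sketch that sums the itemized lines and compares them with the listed total. The notes do not say how "Unaccounted" is defined. The second remainder below, which counts only the target model, KV cache and target compute, is an assumed reading that happens to land within 1 MiB of the listed 2324 MiB.

```python
# Itemized CUDA0 usage in MiB, copied from the breakdown above.
items_mib = {
    "model_target": 15088,
    "model_draft": 904,
    "kv_cache": 811,        # 512 KV + 299 recurrent
    "compute_target": 505,
    "compute_draft": 135,
    "tree_verify": 720,
    "tape_gpu": 38,
}
total_used_mib = 18729      # listed as approximate in the notes

# Remainder when every itemized line is subtracted from the total.
itemized = sum(items_mib.values())
remainder_all = total_used_mib - itemized

# Assumption: subtract only target model, KV cache and target compute.
target_only = (items_mib["model_target"]
               + items_mib["kv_cache"]
               + items_mib["compute_target"])
remainder_target_only = total_used_mib - target_only

print("itemized sum:", itemized, "MiB, remainder:", remainder_all, "MiB")
print("target-only sum:", target_only, "MiB, remainder:", remainder_target_only, "MiB")
```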
Drafter model: 5 layers, 32 heads, 8 kv heads, n_embd=5120, head_dim=128
n_target_features=25600 (5 target layers x 5120)
SD-080 FIX: Sliding window cap (GGML_DFLASH_MAX_CTX=4096 default)
Build: /root/spec-decode-builds/SD-080-dflash-vram-fix-20260425/src/build/bin/
Changes: dflash_draft.cpp (window + cap logic), llama-context.cpp (set_cross_data bucket cap)
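Illustration of the idea only: a minimal Python sketch of a sliding window with a token cap. This is not the code in dflash_draft.cpp. The names max_ctx_from_env and FeatureWindow are hypothetical, and the way the real build parses GGML_DFLASH_MAX_CTX is assumed here, apart from the 4096 default stated above.

```python
import os
from collections import deque


def max_ctx_from_env(default: int = 4096) -> int:
    """Hypothetical helper: read the cap from GGML_DFLASH_MAX_CTX."""
    raw = os.environ.get("GGML_DFLASH_MAX_CTX", "")
    try:
        value = int(raw)
    except ValueError:
        return default          # unset or non-numeric: fall back to default
    return value if value > 0 else default


class FeatureWindow:
    """Keeps target features for at most max_ctx most recent tokens."""

    def __init__(self, max_ctx: int) -> None:
        self.rows = deque(maxlen=max_ctx)  # oldest row is dropped when full
        self.first_pos = 0                 # absolute position of rows[0]

    def append(self, features) -> None:
        if len(self.rows) == self.rows.maxlen:
            self.first_pos += 1            # window slides forward one token
        self.rows.append(features)


window = FeatureWindow(max_ctx=4)
for pos in range(10):
    window.append([float(pos)])
print(len(window.rows), window.first_pos)
```

With a cap of 4 and 10 appended tokens, the window holds the last 4 rows and its first row corresponds to absolute position 6, so memory stays bounded by the cap however long the sequence grows.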
REGRESSION TEST (cap=4096 default, same prompt/flags):
decode: 46.109 t/s (514 tokens in 11.148s)
accept: 56.148% (379/675)
prefill: 459.015 t/s (64 tokens)
Drafter reserve growth: 82.31 -> 89.81 -> 104.81 -> 134.81 MiB (same as baseline)
→ No regression in decode speed or accept rate; prefill differs by ~1% on a 64-token prompt. The cap does not engage until context exceeds 4096 tokens.
STRESS TEST (cap=256):
decode: 46.025 t/s (516 tokens in 11.211s)
accept: 54.245% (377/694)
Drafter reserve growth: 82.31 -> 89.81 -> 104.81 MiB (stops, never reaches 134.81)
→ Cap verified working. Accept rate drops ~1.9 percentage points (56.148% -> 54.245%) with reduced context.
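Worked comparison of the cap=256 run against baseline, using only the reported decode and accept figures. The sketch separates the absolute drop in accept rate (percentage points) from the relative drop, since the two are easy to confuse.

```python
# Reported figures from the baseline and cap=256 runs above.
baseline = {"decode_tps": 46.094, "accept_pct": 56.148}
cap_256 = {"decode_tps": 46.025, "accept_pct": 54.245}

# Absolute drop in accept rate, in percentage points.
accept_drop_pp = baseline["accept_pct"] - cap_256["accept_pct"]

# Relative drop in accept rate, as a percentage of the baseline rate.
accept_drop_rel = 100.0 * accept_drop_pp / baseline["accept_pct"]

# Relative change in decode throughput (negative means slower).
decode_delta_pct = (100.0 * (cap_256["decode_tps"] - baseline["decode_tps"])
                    / baseline["decode_tps"])

print(f"accept drop: {accept_drop_pp:.3f} pp ({accept_drop_rel:.2f}% relative)")
print(f"decode change: {decode_delta_pct:.2f}%")
```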
F16 TENSOR ATTEMPT (reverted):
decode: 41.854 t/s (-9.2% vs. baseline 46.094 t/s)
accept: 55.588%
Cause: ggml_cast(F16→F32) before matmul adds ~1.1s overhead per 512-token gen.
VRAM saved: ~200 MiB at cap=4096 (25600×4096×2 vs ×4 bytes).
Verdict: Not worth a 9% speed hit. The sliding window cap alone saves the bulk:
target_hidden stays at 400 MiB regardless of context length, ~2800 MiB less than the uncapped long-context estimate.
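Arithmetic behind the F16 numbers, as a sketch. The token count of the F16 run is not recorded above, so 514 tokens (same as baseline) is an assumption. The VRAM figure uses the 25600 features and cap=4096 from these notes.

```python
baseline_tps = 46.094
f16_tps = 41.854
n_tokens = 514  # assumption: same generated-token count as the baseline run

# Relative decode change of the F16 attempt (negative means slower).
regression_pct = 100.0 * (f16_tps - baseline_tps) / baseline_tps

# Extra wall time for the same number of tokens at the slower rate.
extra_seconds = n_tokens / f16_tps - n_tokens / baseline_tps

# VRAM saved by storing 2-byte F16 instead of 4-byte F32 at the cap.
n_features, cap = 25600, 4096
saved_mib = n_features * cap * (4 - 2) / 2**20

print(f"decode change: {regression_pct:.1f}%")
print(f"extra time: {extra_seconds:.2f} s")
print(f"VRAM saved: {saved_mib:.0f} MiB")
```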
VRAM SAVINGS ANALYSIS (at 40K context):
Without fix: target_hidden grows to 25600 × 32768 tokens × 4 bytes = 3200 MiB (F32); a full 40960-token context would need 4000 MiB
With cap=4096: 25600 × 4096 × 4 bytes = 400 MiB (F32), fixed regardless of context length
Savings: ~2800 MiB VRAM (3200 - 400) at 32768 tokens
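The sizing above can be reproduced with a short sketch. The function name target_hidden_mib is hypothetical. It assumes the buffer holds one F32 value per target feature per token, as the formulas in this section imply, with 25600 = 5 target layers x 5120.

```python
def target_hidden_mib(n_ctx, n_features=25600, bytes_per_elem=4, cap=None):
    """Size in MiB of an n_features x tokens buffer, optionally capped."""
    tokens = n_ctx if cap is None else min(n_ctx, cap)
    return n_features * tokens * bytes_per_elem / 2**20


uncapped = target_hidden_mib(32768)            # 32768 tokens, no cap
capped = target_hidden_mib(32768, cap=4096)    # same context, cap=4096
full_40k = target_hidden_mib(40960)            # 40960 tokens, no cap

print(f"uncapped: {uncapped:.0f} MiB, capped: {capped:.0f} MiB, "
      f"saved: {uncapped - capped:.0f} MiB")
print(f"uncapped at 40960 tokens: {full_40k:.0f} MiB")
```

Each token costs 25600 x 4 bytes = 100 KiB in F32, so the uncapped buffer grows linearly with context while the capped one stops at 400 MiB.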