Commit e8136a8

K3 Block A vast evidence (honest drafter handling)
Re-ran the feasibility smoke on an H200 with the DFlash-honesty fix. Summary now:
verifier_loadable=true; verifier_forward_ok=true; drafter_loadable=true (backbone memory probe); drafter_faithful_transformers_load=false; drafter_forward_ok=null (n/a, spec-decode-only); drafter_validation_path=vllm_pr_41703_or_sglang.
Verifier throughput: 2.77 tok/s (8 tokens generated in 2.89 s). This confirms hardware feasibility for the verifier only; the DFlash drafting protocol is intentionally NOT claimed here (deferred to the vLLM/SGLang run).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
1 parent 5d75842 commit e8136a8
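The summary flags in the commit message separate what was exercised from what was only probed. The sketch below is one possible way to read them, assuming the field names shown in the report; the helper `claims` and its output keys are hypothetical and are not part of the smoke script.

```python
import json

# Summary block copied from the report added in this commit.
SUMMARY_JSON = """
{
  "status": "pass",
  "verifier_loadable": true,
  "verifier_forward_ok": true,
  "drafter_loadable": true,
  "drafter_faithful_transformers_load": false,
  "drafter_forward_ok": null,
  "drafter_validation_path": "vllm_pr_41703_or_sglang"
}
"""


def claims(summary: dict) -> dict:
    """Hypothetical helper: what this run does and does not establish."""
    return {
        # The verifier was both loaded and run forward.
        "verifier_feasible": summary["verifier_loadable"] is True
        and summary["verifier_forward_ok"] is True,
        # null (None) means "not applicable", which is different from a failure.
        "drafter_forward_tested": summary["drafter_forward_ok"] is not None,
        # The drafting protocol counts as validated only if the drafter was
        # loaded faithfully AND its forward pass succeeded.
        "drafting_protocol_validated": summary["drafter_faithful_transformers_load"] is True
        and summary["drafter_forward_ok"] is True,
    }


summary = json.loads(SUMMARY_JSON)
result = claims(summary)
print(result)
```

Read this way, `status: pass` covers the verifier and the memory probe, while the drafting protocol stays unvalidated until the run named in `drafter_validation_path` is done.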

2 files changed

Lines changed: 123 additions & 0 deletions

results/research/k3_feasibility_smoke_vast_blockA_specdecode_1780990807.json

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
{
  "schema_version": 1,
  "kind": "k3_feasibility_smoke",
  "config": {
    "platform": "cuda",
    "verifier_path": "google/gemma-4-26B-A4B-it",
    "drafter_id": "z-lab/gemma-4-26B-A4B-it-DFlash",
    "prompt_tokens": 512,
    "gen_tokens": 8,
    "seed": 42,
    "skip_drafter": false
  },
  "stages": [
    {
      "stage": "baseline",
      "memory": {
        "label": "baseline",
        "platform": "cuda",
        "current_allocated_bytes": 0,
        "current_reserved_bytes": 0,
        "peak_allocated_bytes": 0,
        "peak_reserved_bytes": 0,
        "device_total_bytes": 150109880320,
        "device_name": "NVIDIA H200"
      }
    },
    {
      "stage": "verifier_loaded",
      "memory": {
        "label": "after_verifier_load",
        "platform": "cuda",
        "current_allocated_bytes": 51611948032,
        "current_reserved_bytes": 51636076544,
        "peak_allocated_bytes": 51611948032,
        "peak_reserved_bytes": 51636076544,
        "device_total_bytes": 150109880320,
        "device_name": "NVIDIA H200"
      },
      "verifier_load_seconds": 13.963492935989052,
      "verifier_kind": "transformers_bf16_cuda"
    },
    {
      "stage": "drafter_loaded",
      "memory": {
        "label": "after_drafter_load",
        "platform": "cuda",
        "current_allocated_bytes": 55328955904,
        "current_reserved_bytes": 55348035584,
        "peak_allocated_bytes": 55328955904,
        "peak_reserved_bytes": 55348035584,
        "device_total_bytes": 150109880320,
        "device_name": "NVIDIA H200"
      },
      "drafter_load_seconds": 4.097004538984038,
      "drafter_kind": "dflash_backbone_memory_probe"
    },
    {
      "stage": "verifier_forward",
      "memory": {
        "label": "after_verifier_forward",
        "platform": "cuda",
        "current_allocated_bytes": 55362510336,
        "current_reserved_bytes": 56193187840,
        "peak_allocated_bytes": 56160551936,
        "peak_reserved_bytes": 56193187840,
        "device_total_bytes": 150109880320,
        "device_name": "NVIDIA H200"
      },
      "metrics": {
        "prefill_seconds": 2.5831943770172074,
        "gen_seconds": 2.8882102399365976,
        "gen_tokens": 8,
        "tokens_per_sec": 2.769881461321741,
        "gen_text_head": "\nThe Kakeya inference engine validates",
        "prompt_token_count": 757
      }
    },
    {
      "stage": "drafter_forward_skipped",
      "reason": "architectures=['DFlashDraftModel'] is not loadable as a standalone transformers model (no auto_map / not a built-in class). DFlash is a block-diffusion speculative-decoding drafter; run it via vLLM (PR #41703) or SGLang per the model card. The transformers path here only loads the qwen3 backbone as a memory probe and does NOT exercise the DFlash drafting protocol.",
      "validation_path": "vllm_pr_41703_or_sglang"
    }
  ],
  "summary": {
    "status": "pass",
    "verifier_loadable": true,
    "verifier_forward_ok": true,
    "drafter_loadable": true,
    "drafter_faithful_transformers_load": false,
    "drafter_forward_ok": null,
    "drafter_note": "architectures=['DFlashDraftModel'] is not loadable as a standalone transformers model (no auto_map / not a built-in class). DFlash is a block-diffusion speculative-decoding drafter; run it via vLLM (PR #41703) or SGLang per the model card. The transformers path here only loads the qwen3 backbone as a memory probe and does NOT exercise the DFlash drafting protocol.",
    "drafter_validation_path": "vllm_pr_41703_or_sglang"
  }
}
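The per-stage memory snapshots support a few quick derived figures. The sketch below recomputes them from values copied out of the report; the variable names are illustrative, and the headroom figure uses peak allocated bytes, so it is an optimistic bound because the allocator's reserved peak is slightly higher.

```python
GIB = 1024 ** 3

# Values copied from the report's stages.
after_verifier = 51_611_948_032  # current_allocated_bytes, stage "verifier_loaded"
after_drafter = 55_328_955_904   # current_allocated_bytes, stage "drafter_loaded"
peak_forward = 56_160_551_936    # peak_allocated_bytes, stage "verifier_forward"
device_total = 150_109_880_320   # device_total_bytes (NVIDIA H200)

# Memory added by the drafter backbone probe on top of the verifier.
drafter_delta = after_drafter - after_verifier

# Transient memory used by the verifier forward pass above the loaded weights.
forward_overhead = peak_forward - after_drafter

# Device memory left at the forward-pass peak (allocated, not reserved).
headroom = device_total - peak_forward

# Decode throughput as reported: generated tokens over generation time.
gen_tokens, gen_seconds = 8, 2.8882102399365976
tokens_per_sec = gen_tokens / gen_seconds

print(f"drafter probe:    {drafter_delta / GIB:.2f} GiB")
print(f"forward overhead: {forward_overhead / GIB:.2f} GiB")
print(f"headroom at peak: {headroom / GIB:.2f} GiB")
print(f"throughput:       {tokens_per_sec:.2f} tok/s")
```

The drafter figure reflects the backbone probe only, so it should be treated as an estimate of the real DFlash footprint rather than a measurement of it.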

results/research/logs/k3_feasibility_smoke_vast_blockA_specdecode_1780990807.log

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
[k3-smoke] platform: cuda
[k3-smoke] verifier: google/gemma-4-26B-A4B-it
[k3-smoke] drafter: z-lab/gemma-4-26B-A4B-it-DFlash
[k3-smoke] prompt n: 512
[k3-smoke] gen n: 8
[k3-smoke] loading verifier (CUDA bf16): google/gemma-4-26B-A4B-it
Loading weights: 100%|██████████| 1013/1013 [00:08<00:00, 114.33it/s]
[k3-smoke] verifier loaded in 14.0s
[k3-smoke] loading drafter (cuda): z-lab/gemma-4-26B-A4B-it-DFlash
[k3-smoke] NOTE: architectures=['DFlashDraftModel'] is not loadable as a standalone transformers model (no auto_map / not a built-in class). DFlash is a block-diffusion speculative-decoding drafter; run it via vLLM (PR #41703) or SGLang per the model card. The transformers path here only loads the qwen3 backbone as a memory probe and does NOT exercise the DFlash drafting protocol.
[k3-smoke] -> loading qwen3 backbone as a MEMORY PROBE ONLY (not a faithful DFlash load; standalone forward will be skipped).
Loading weights: 100%|██████████| 56/56 [00:00<00:00, 364.99it/s]
[transformers] Qwen3ForCausalLM LOAD REPORT from: z-lab/gemma-4-26B-A4B-it-DFlash
Key                       | Status     |
--------------------------+------------+-
hidden_norm.weight        | UNEXPECTED |
fc.weight                 | UNEXPECTED |
lm_head.weight            | MISSING    |
model.embed_tokens.weight | MISSING    |

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
[k3-smoke] drafter loaded in 4.1s (backbone memory probe)
[transformers] The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
[k3-smoke] verifier forward OK; gen=8 tokens in 2.89s (2.77 tok/s)
[k3-smoke] drafter forward SKIPPED (spec-decode-only drafter; validate via vLLM PR #41703 / SGLang — not transformers).
[k3-smoke] report -> results/research/k3_feasibility_smoke_vast_blockA_specdecode_1780990807.json
[k3-smoke] PASS
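The NOTE in the log gives the reason the drafter forward pass is skipped: the checkpoint declares an architecture that is neither a built-in class nor mapped through `auto_map`. A minimal sketch of that decision follows, assuming the check works on the checkpoint's config as a plain dict. The function `standalone_loadable` and the `BUILTIN` set are hypothetical stand-ins and do not reproduce the smoke script's actual code.

```python
def standalone_loadable(config: dict, builtin_architectures: set) -> bool:
    """Hypothetical check: can this checkpoint be loaded as a standalone model?

    Loadable if the config maps custom code via a non-empty "auto_map",
    or if any declared architecture is a class the library ships with.
    """
    architectures = config.get("architectures") or []
    has_auto_map = bool(config.get("auto_map"))
    has_builtin = any(name in builtin_architectures for name in architectures)
    return has_auto_map or has_builtin


# Illustrative stand-in for the set of classes shipped with the library.
BUILTIN = {"Qwen3ForCausalLM"}

# Config shapes assumed for illustration; only "architectures" is taken from the log.
dflash_config = {"architectures": ["DFlashDraftModel"]}
qwen3_config = {"architectures": ["Qwen3ForCausalLM"]}

for name, cfg in [("dflash", dflash_config), ("qwen3", qwen3_config)]:
    verdict = "loadable" if standalone_loadable(cfg, BUILTIN) else "memory probe only"
    print(f"{name}: {verdict}")
```

Under these assumptions the DFlash checkpoint falls through to the memory-probe path, which matches the log: the backbone loads as `Qwen3ForCausalLM` with `lm_head.weight` and `model.embed_tokens.weight` newly initialized, so a standalone forward pass would not be meaningful.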
