Commit c910c9a

add: CC prompt for Llama 4 Scout BF16 shards 1-4 streaming compression
1 parent 6cdfa9b commit c910c9a

1 file changed

Lines changed: 225 additions & 0 deletions

@@ -0,0 +1,225 @@
# SESSION: Llama 4 Scout BF16 — Stream-Index Shards 1-4

## MISSION

Shard 5 (18.2 GB) is DONE → 7.70 MB at 4,735× ratio.
Process the remaining 4 shards. Together with shard 5, this gives us
the full Llama 4 Scout 109B model compressed to bgz17.

## READ FIRST

```bash
# The streaming indexer and HTTP reader that already work:
cat src/hpc/gguf_indexer.rs   # stream_index_gguf(), project_row_to_base17()
cat src/hpc/http_reader.rs    # HttpRangeReader::with_chunk_size()
cat src/hpc/gguf.rs           # GGUF header/tensor parsing, BF16 dequant

# The shard 5 test that PASSED (at the bottom of gguf_indexer.rs):
grep -A 80 "test_stream_index_llama4_bf16_shard5" src/hpc/gguf_indexer.rs
```

Do NOT modify any existing code. Only ADD new test functions.

## SHARD MAP

```
Repo: unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF
Path: BF16/Llama-4-Scout-17B-16E-Instruct-BF16-{NNNNN}-of-00005.gguf

Shard 1:  48,940,000,000 bytes  (~48.94 GB)   layers 0-10 + embeddings
Shard 2:  49,960,000,000 bytes  (~49.96 GB)   layers 11-21
Shard 3:  48,660,000,000 bytes  (~48.66 GB)   layers 22-32
Shard 4:  49,790,000,000 bytes  (~49.79 GB)   layers 33-43
Shard 5:  18,220,000,000 bytes  (~18.22 GB)   layers 44-47 + output  ✓ DONE
─────────────────────────────────────────────────────────────────────
Total:   215,570,000,000 bytes  (~215.57 GB)
```
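
The totals in the shard map can be re-derived from the per-shard byte counts. The sketch below is a standalone check, not repo code; `SHARD_BYTES`, `total_bytes`, and `remaining_bytes` are illustrative names. It confirms the 215.57 GB total and the ~197 GB still to stream for shards 1-4.

```rust
/// Shard sizes in bytes, copied from the shard map (index 0 = shard 1).
const SHARD_BYTES: [u64; 5] = [
    48_940_000_000,
    49_960_000_000,
    48_660_000_000,
    49_790_000_000,
    18_220_000_000,
];

/// Sum of all five shards.
fn total_bytes() -> u64 {
    SHARD_BYTES.iter().sum()
}

/// Bytes still to stream: shards 1-4 only, since shard 5 is already done.
fn remaining_bytes() -> u64 {
    SHARD_BYTES[..4].iter().sum()
}

fn main() {
    println!("total:     {:.2} GB", total_bytes() as f64 / 1e9);
    println!("remaining: {:.2} GB", remaining_bytes() as f64 / 1e9);
}
```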

## WHAT TO BUILD

Add ONE test function that processes all 4 shards sequentially.
NOT 4 separate tests: one function, loop over shards, cleanup between.

```rust
#[test]
#[ignore] // Streams ~197 GB from HuggingFace — takes ~2 hours
fn test_stream_index_llama4_bf16_shards_1_to_4() {
    use super::super::http_reader::HttpRangeReader;
    use std::io::BufWriter;

    let repo = "unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF";

    // (path inside the repo, exact size in bytes)
    let shards = [
        ("BF16/Llama-4-Scout-17B-16E-Instruct-BF16-00001-of-00005.gguf", 48_940_000_000u64),
        ("BF16/Llama-4-Scout-17B-16E-Instruct-BF16-00002-of-00005.gguf", 49_960_000_000u64),
        ("BF16/Llama-4-Scout-17B-16E-Instruct-BF16-00003-of-00005.gguf", 48_660_000_000u64),
        ("BF16/Llama-4-Scout-17B-16E-Instruct-BF16-00004-of-00005.gguf", 49_790_000_000u64),
    ];

    // Labels for the six slots of `by_type`, in index order.
    let type_names = ["Attention", "FeedForward", "Conv2D", "Norm", "Embedding", "Skip"];

    let mut grand_total_source: u64 = 0;
    let mut grand_total_compressed: u64 = 0;
    let mut grand_total_original: u64 = 0; // f32 equivalent
    let mut grand_total_tensors: usize = 0;
    let mut grand_by_type: [(usize, u64, u64); 6] = [(0, 0, 0); 6];

    // Add shard 5 results (already measured)
    let shard5_source: u64 = 18_220_000_000;
    let shard5_compressed: u64 = 7_700_000; // 7.70 MB
    let shard5_original: u64 = 36_440_000_000; // ~36.44 GB f32 equivalent
    grand_total_source += shard5_source;
    grand_total_compressed += shard5_compressed;
    grand_total_original += shard5_original;

    for (i, (filename, size)) in shards.iter().enumerate() {
        let shard_num = i + 1;
        let url = format!("https://huggingface.co/{}/resolve/main/{}", repo, filename);
        let out_path = format!("/tmp/llama4_scout_shard{}.bgz7", shard_num);

        eprintln!();
        eprintln!("━━━ Shard {}/5 ({:.2} GB) ━━━", shard_num, *size as f64 / 1e9);
        eprintln!("  URL: {}", url);

        // 256 MB chunks — fewer HTTP round trips
        let mut reader = HttpRangeReader::with_chunk_size(
            url.clone(), *size, 256 * 1024 * 1024
        );

        let out = std::fs::File::create(&out_path).expect("create output");
        let mut writer = BufWriter::new(out);

        let stats = stream_index_gguf(
            &mut reader,
            &mut writer,
            // Per-tensor progress line: name, type, original bytes, compressed bytes.
            Some(&|name, layer_type, orig, comp| {
                let ratio = if comp > 0 { orig as f64 / comp as f64 } else { 0.0 };
                eprintln!("    {:60} {:12?} {:>12} → {:>8} ({:.0}×)",
                    name, layer_type, orig, comp, ratio);
            }),
        )
        // Build the panic message only on failure (avoids an eager format!).
        .unwrap_or_else(|e| panic!("stream_index_gguf shard {}: {:?}", shard_num, e));

        // Flush and close the output before measuring its size.
        drop(writer);
        let out_size = std::fs::metadata(&out_path).map(|m| m.len()).unwrap_or(0);

        // Per-shard summary
        eprintln!();
        eprintln!("  Shard {} result: {:.2} GB → {:.2} MB ({:.0}×)",
            shard_num, *size as f64 / 1e9, out_size as f64 / 1e6, stats.overall_ratio());
        eprintln!("  Tensors: {} indexed, {} skipped",
            stats.tensors_indexed, stats.tensors_skipped);
        eprintln!("  Downloaded: {:.2} GB", reader.bytes_downloaded() as f64 / 1e9);

        for (j, name) in type_names.iter().enumerate() {
            let (count, orig, comp) = stats.by_type[j];
            if count > 0 {
                let ratio = if comp > 0 { orig as f64 / comp as f64 } else { 0.0 };
                eprintln!("    {:<12} {:>3} tensors: {:>10.2} GB → {:>8.2} MB ({:.0}×)",
                    name, count, orig as f64 / 1e9, comp as f64 / 1e6, ratio);
                grand_by_type[j].0 += count;
                grand_by_type[j].1 += orig;
                grand_by_type[j].2 += comp;
            }
        }

        // Accumulate
        grand_total_source += *size;
        grand_total_compressed += out_size;
        grand_total_original += stats.original_bytes;
        grand_total_tensors += stats.tensors_indexed;

        // CLEANUP: remove output file to free disk for next shard
        // Keep the stats, drop the bytes
        if let Err(e) = std::fs::remove_file(&out_path) {
            eprintln!("  Warning: cleanup failed: {}", e);
        } else {
            eprintln!("  Cleaned up {} (disk freed for next shard)", out_path);
        }

        assert!(stats.tensors_indexed > 0,
            "shard {} should have indexed tensors", shard_num);
    }

    // Grand total (all 5 shards)
    eprintln!();
    eprintln!("━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━");
    eprintln!("LLAMA 4 SCOUT 17B-16E — FULL MODEL (ALL 5 SHARDS)");
    eprintln!("━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━");
    eprintln!("  Source (BF16):     {:.2} GB", grand_total_source as f64 / 1e9);
    eprintln!("  Original (f32):    {:.2} GB", grand_total_original as f64 / 1e9);
    eprintln!("  Compressed:        {:.2} MB", grand_total_compressed as f64 / 1e6);
    eprintln!("  Overall ratio:     {:.0}×", grand_total_original as f64 / grand_total_compressed as f64);
    // Shard 5's tensor count and per-type stats are not hardcoded above,
    // so the tensor count and the breakdown below cover shards 1-4 only.
    eprintln!("  Tensors indexed:   {} (shards 1-4 only)", grand_total_tensors);
    eprintln!();

    for (j, name) in type_names.iter().enumerate() {
        let (count, orig, comp) = grand_by_type[j];
        if count > 0 {
            let ratio = if comp > 0 { orig as f64 / comp as f64 } else { 0.0 };
            eprintln!("  {:<12} {:>4} tensors: {:>10.2} GB → {:>8.2} MB ({:.0}×)",
                name, count, orig as f64 / 1e9, comp as f64 / 1e6, ratio);
        }
    }
    eprintln!("━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━");

    // Sanity checks
    assert!(grand_total_tensors > 100, "should have many tensors across all shards");
    assert!(grand_total_compressed < 200_000_000,
        "full model should be under 200 MB: was {} MB", grand_total_compressed / 1_000_000);
}
```
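
One gap worth knowing about: if `stream_index_gguf` returns an error, the test panics before `remove_file` runs, so the partial output stays in `/tmp`. A drop guard is one optional way to cover that case. The sketch below is standalone and uses a hypothetical `TempOutput` type that does not exist in the repo; it relies only on `std`.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of the repo): deletes the file at `path`
/// when the guard goes out of scope, including during a panic unwind.
struct TempOutput {
    path: PathBuf,
}

impl TempOutput {
    fn new(path: impl Into<PathBuf>) -> Self {
        Self { path: path.into() }
    }

    fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for TempOutput {
    fn drop(&mut self) {
        // Ignore the result: the file may never have been created.
        let _ = fs::remove_file(&self.path);
    }
}

fn main() {
    let path = std::env::temp_dir().join("temp_output_guard_demo.bin");
    {
        let guard = TempOutput::new(path.clone());
        fs::write(guard.path(), b"stand-in for compressed shard output")
            .expect("write demo file");
        assert!(path.exists());
    } // guard dropped here, so the file is removed
    assert!(!path.exists());
}
```

Inside the shard loop, such a guard would be created right after `out_path` is built. The explicit `remove_file` call and its log line could stay as they are, since removing an already-removed file in `drop` is harmless here.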

## CRITICAL CONSTRAINTS

1. **256 MB chunk size** — `HttpRangeReader::with_chunk_size()` already supports this.
   Each shard ~49 GB = ~185 HTTP requests. Not 2250.

2. **CLEANUP between shards** with `std::fs::remove_file()` after recording stats.
   Otherwise 4 × shard output fills disk. We only need the NUMBERS, not the files.
   The final production run will write to a combined output file.

3. **DO NOT modify existing tests** — shard 5 test stays untouched.
   Add the new test function BELOW it in the same `mod tests` block.

4. **DO NOT modify stream_index_gguf() or project_row_to_base17()**.
   These work. Shard 5 proved it. Touch nothing in the production code.

5. **Shard 5 stats hardcoded** — add shard 5's known numbers (7.70 MB output,
   18.22 GB source) to the grand total WITHOUT re-downloading it.
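
The request count in constraint 1 can be checked with a ceiling division. This is a minimal sketch that assumes one range request per chunk with no retries or re-reads, which this document does not confirm about `HttpRangeReader`; `range_requests` and `CHUNK` are illustrative names, not repo code.

```rust
/// Chunk size passed to `with_chunk_size` in the test: 256 * 1024 * 1024 bytes.
const CHUNK: u64 = 256 * 1024 * 1024;

/// Number of range requests needed to read `size` bytes in `chunk`-byte pieces,
/// assuming exactly one request per chunk (ceiling division).
fn range_requests(size: u64, chunk: u64) -> u64 {
    (size + chunk - 1) / chunk
}

fn main() {
    let shards: [u64; 4] = [
        48_940_000_000,
        49_960_000_000,
        48_660_000_000,
        49_790_000_000,
    ];
    let mut total = 0;
    for (i, size) in shards.iter().enumerate() {
        let n = range_requests(*size, CHUNK);
        total += n;
        println!("shard {}: {} requests", i + 1, n);
    }
    println!("total: {} requests", total);
}
```

Under that assumption each shard needs between 182 and 187 requests, 738 across shards 1-4.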

## RUN COMMAND

```bash
cargo test test_stream_index_llama4_bf16_shards_1_to_4 \
    --release -- --ignored --nocapture 2>&1 | tee /tmp/llama4_full.log
```

Expect ~2 hours total. Each shard ~25-30 min (each is ~2.7× the size of shard 5, which took 9 min).
Peak RAM should stay under 1 GB throughout.
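
## EXPECTED OUTPUT

If shard 5's ratio (~4,735×) holds for the MoE-heavy shards 1-4:

```
Shard 1 (48.94 GB):  → ~10 MB
Shard 2 (49.96 GB):  → ~11 MB
Shard 3 (48.66 GB):  → ~10 MB
Shard 4 (49.79 GB):  → ~11 MB
Shard 5 (18.22 GB):  → 7.7 MB (measured)
──────────────────────────────
Total (215.57 GB):   → ~50 MB at ~4,300×
```

Note on the basis: 4,735× is shard 5's f32-equivalent size over its output (36.44 GB → 7.70 MB);
against BF16 source bytes it is ~2,366×. The estimates above divide BF16 source size by 4,735,
so they assume shards 1-4 compress about 2× better than shard 5 did. If the ratio only holds on
the f32 basis, expect ~21 MB per shard and ~91 MB total. Either case passes the 200 MB sanity check.

The MoE expert layers in shards 1-4 (which contain the bulk of the 16 experts'
gate/up/down weights) should compress at 10,000-15,000×, as shard 5 showed.
Attention layers should land at ~2,000×. The embedding layer in shard 1 might compress at a lower ratio.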
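
The projection arithmetic can be reproduced directly. The sketch below is standalone, with illustrative helper names `ratio` and `projected_mb`, and it treats f32-equivalent size as 2× the BF16 byte count, as the hardcoded shard 5 numbers do. It prints shard 5's ratio on both bases and the projected output per shard when 4,735× is applied to BF16 bytes versus f32-equivalent bytes.

```rust
/// Compression ratio: original bytes divided by compressed bytes.
fn ratio(original: u64, compressed: u64) -> f64 {
    original as f64 / compressed as f64
}

/// Projected output in MB if `original` bytes compress at `ratio`.
fn projected_mb(original: u64, ratio: f64) -> f64 {
    original as f64 / ratio / 1e6
}

fn main() {
    // Shard 5 numbers as hardcoded in the test.
    let (s5_bf16, s5_f32, s5_out) = (18_220_000_000u64, 36_440_000_000u64, 7_700_000u64);
    println!("shard 5 vs f32-equivalent: {:.0}x", ratio(s5_f32, s5_out));
    println!("shard 5 vs BF16 source:    {:.0}x", ratio(s5_bf16, s5_out));

    // BF16 sizes of shards 1-4 from the shard map.
    let bf16_sizes = [48_940_000_000u64, 49_960_000_000, 48_660_000_000, 49_790_000_000];
    for (i, bf16) in bf16_sizes.iter().enumerate() {
        println!(
            "shard {}: {:.1} MB (BF16 bytes / 4,735) vs {:.1} MB (f32 bytes / 4,735)",
            i + 1,
            projected_mb(*bf16, 4_735.0),
            projected_mb(*bf16 * 2, 4_735.0),
        );
    }
}
```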

## AFTER THE RUN

1. Commit the test (even if running takes hours, commit the code first)
2. Copy the full output log to `.claude/knowledge/llama4_scout_full_results.md`
3. Push both

Do NOT skip the cleanup step. Do NOT run shards in parallel (RAM).
Do NOT modify anything in src/hpc/ except adding the test function.
