Commit 149daf0

session prompt: Maverick BF16-direct indexing (18 shards, 801 GB)
1 parent a8631a2 commit 149daf0

1 file changed

Lines changed: 81 additions & 0 deletions

@@ -0,0 +1,81 @@
# SESSION: Llama 4 Maverick BF16 — Stream-Index All 18 Shards

## MISSION

Process all 18 BF16 shards of Llama 4 Maverick (17B-128E, 402B total params).
801.47 GB streamed through the BF16-direct indexer with F64x8 SIMD.

Scout (16 experts) compressed to 37.88 MB at 5,693×.
Maverick (128 experts) expected: 90–489 MB.

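As a rough cross-check of the expected range, the sketch below applies Scout's 5,693× ratio to Maverick's 801.47 GB. It assumes decimal units (1 GB = 1000 MB) and that Scout's ratio carries over to Maverick, neither of which this prompt states. `projected_output_mb` is a hypothetical helper, not part of the indexer.

```rust
/// Projected compressed size in MB for `input_gb` of BF16 input at a given
/// compression ratio. Assumption: decimal units (1 GB = 1000 MB).
fn projected_output_mb(input_gb: f64, ratio: f64) -> f64 {
    input_gb * 1000.0 / ratio
}
```

Under those assumptions, `projected_output_mb(801.47, 5693.0)` comes out near 141 MB, which sits inside the 90–489 MB window.
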
## READ FIRST

```bash
cat src/hpc/gguf_indexer.rs   # stream_index_gguf_bf16(), project_tensor_bf16_simd()
cat src/hpc/http_reader.rs    # HttpRangeReader::with_chunk_size()
cat src/hpc/gguf.rs           # GGUF header/tensor parsing

# The test is already written at the bottom of gguf_indexer.rs:
grep -A 5 "test_stream_index_llama4_maverick_bf16_all_shards" src/hpc/gguf_indexer.rs
```

Do NOT modify any existing code. The test function is already there.

## RUN COMMAND

```bash
cargo test test_stream_index_llama4_maverick_bf16_all_shards \
  --release -- --ignored --nocapture 2>&1 | tee /tmp/llama4_maverick_full.log
```

## PIPELINE (already implemented)

```
BF16 bytes → read_tensor_bf16_raw (reusable Vec<u16>, no f32 alloc)
→ project_tensor_bf16_simd (F64x8, 8 rows parallel)
→ project_8rows_bf16_simd (17 zmm accumulators)
→ gather_bf16_x8 (8 indexed u16 loads → F64x8)
→ strided octave (stride=16, 51 of 814 octaves)
→ halftone drop (9 of 17 golden positions)
→ interpolate odd bins from neighbors
→ project_1row_bf16_strided (scalar tail, n_rows % 8)
→ CompressedTensor::write_to (Base17 per row)
→ tail deletion (keep 3 most recent outputs)
```

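A few of the pipeline steps above can be illustrated in plain scalar Rust. This is a minimal sketch and not the code in `gguf_indexer.rs`: the function names are hypothetical stand-ins, the gather is done lane by lane instead of with AVX-512, and the strided sampling is assumed to start at octave 0.

```rust
/// Widen one BF16 value (stored as raw u16 bits) to f64.
/// BF16 is the upper 16 bits of an IEEE-754 f32, so shifting left by 16
/// reproduces the f32 bit pattern exactly; f32 -> f64 is then lossless.
fn bf16_to_f64(bits: u16) -> f64 {
    f32::from_bits((bits as u32) << 16) as f64
}

/// Scalar stand-in for an 8-lane gather: load 8 indexed u16 values and
/// widen each to f64. Panics if an index is out of bounds.
fn gather_bf16_x8_scalar(buf: &[u16], idx: [usize; 8]) -> [f64; 8] {
    idx.map(|i| bf16_to_f64(buf[i]))
}

/// Split `n_rows` into full 8-row blocks (SIMD path) and a scalar tail
/// of `n_rows % 8` rows.
fn split_rows(n_rows: usize) -> (usize, usize) {
    (n_rows / 8, n_rows % 8)
}

/// Number of octaves visited when sampling every `stride`-th octave,
/// starting at octave 0 (assumption).
fn strided_octaves(total: usize, stride: usize) -> usize {
    (total + stride - 1) / stride
}
```

With these definitions, `strided_octaves(814, 16)` gives 51, matching the "51 of 814 octaves" figure, and `split_rows(19)` gives two full blocks plus a 3-row tail.
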
## DISK BUDGET: 26 GB FREE

Output files are tiny (expected 5-27 MB each). Tail deletion keeps the 3 most
recent outputs and deletes older ones. Total output across all 18 shards: 90-489 MB. No disk pressure.

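The tail-deletion policy can be sketched as follows. This is an illustration under assumptions, not the existing implementation: `outputs_to_delete` and `tail_delete` are hypothetical names, and the caller is assumed to track output paths in the order they were written.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Given output paths in the order they were written (oldest first),
/// return the ones to delete so that only the `keep` newest remain.
fn outputs_to_delete(written: &[PathBuf], keep: usize) -> &[PathBuf] {
    let cut = written.len().saturating_sub(keep);
    &written[..cut]
}

/// Delete everything except the `keep` newest outputs and return the
/// survivors. A file that is already gone is not treated as an error.
fn tail_delete(written: Vec<PathBuf>, keep: usize) -> io::Result<Vec<PathBuf>> {
    for path in outputs_to_delete(&written, keep) {
        match fs::remove_file(path) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::NotFound => {}
            Err(e) => return Err(e),
        }
    }
    let cut = written.len().saturating_sub(keep);
    Ok(written[cut..].to_vec())
}
```

With `keep = 3` and the 27 MB upper estimate per file, at most about 81 MB of output is on disk at any time.
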
## SHARD MAP (18 shards, 801.47 GB)

```
Shard 1: 46.17 GB    Shard 10: 42.95 GB
Shard 2: 42.95 GB    Shard 11: 42.95 GB
Shard 3: 42.95 GB    Shard 12: 47.91 GB
Shard 4: 42.95 GB    Shard 13: 42.95 GB
Shard 5: 47.94 GB    Shard 14: 42.95 GB
Shard 6: 42.95 GB    Shard 15: 42.95 GB
Shard 7: 42.95 GB    Shard 16: 47.91 GB
Shard 8: 42.95 GB    Shard 17: 42.95 GB
Shard 9: 47.92 GB    Shard 18: 48.21 GB
```

128 MoE experts, interleaving Dense→MoE→Dense (every odd layer is MoE).

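The shard map can be sanity-checked against the stated total with a short sketch. The sizes are copied from the table above. The helper names are hypothetical, and `is_moe_layer` takes the layer index exactly as this prompt numbers it, which is an assumption about the indexing base.

```rust
/// Shard sizes in GB, copied from the shard map (index 0 is shard 1).
const SHARD_GB: [f64; 18] = [
    46.17, 42.95, 42.95, 42.95, 47.94, 42.95, 42.95, 42.95, 47.92,
    42.95, 42.95, 47.91, 42.95, 42.95, 42.95, 47.91, 42.95, 48.21,
];

/// Sum of all shard sizes in GB.
fn total_gb() -> f64 {
    SHARD_GB.iter().sum()
}

/// 1-based shard number and size of the largest shard.
fn largest_shard() -> (usize, f64) {
    let mut best = (1, SHARD_GB[0]);
    for (i, &gb) in SHARD_GB.iter().enumerate() {
        if gb > best.1 {
            best = (i + 1, gb);
        }
    }
    best
}

/// "Every odd layer is MoE": true for odd layer indices.
fn is_moe_layer(layer: usize) -> bool {
    layer % 2 == 1
}
```

The rounded per-shard sizes sum to about 801.46 GB, which agrees with the 801.47 GB header to within rounding. Shard 18 is the largest at 48.21 GB.
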
## EXPECTED RUNTIME

~8-10 hours total. Each shard ~25-30 min.
Peak RAM: ~142 MB (one reusable u16 buffer, largest tensor).
CPU: not the bottleneck; the run is network-bound (97% fewer BF16→f64 conversions than the f32 path).

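One plausible reading of the 97% figure is that it is the product of the two sampling steps in the pipeline: 51 of 814 octaves, then 9 of 17 positions. The sketch below checks that arithmetic and the average bandwidth the runtime estimate implies. Both helpers are hypothetical, the multiplication of the two factors is an assumption rather than something this prompt states, and units are decimal (1 GB = 1000 MB).

```rust
/// Fraction of BF16 -> f64 conversions that remain after strided octave
/// sampling and the halftone drop, relative to converting every element.
/// Assumption: the two reductions multiply.
fn kept_fraction(octaves_kept: u32, octaves_total: u32, pos_kept: u32, pos_total: u32) -> f64 {
    (octaves_kept as f64 / octaves_total as f64) * (pos_kept as f64 / pos_total as f64)
}

/// Average throughput in MB/s needed to stream `total_gb` in `hours`
/// (decimal units: 1 GB = 1000 MB).
fn required_mb_per_s(total_gb: f64, hours: f64) -> f64 {
    total_gb * 1000.0 / (hours * 3600.0)
}
```

`kept_fraction(51, 814, 9, 17)` is about 0.033, so roughly 97% of conversions are skipped. Streaming 801.47 GB in 8 to 10 hours needs a sustained average of roughly 22 to 28 MB/s. At 25-30 min each, the 18 shards account for 7.5 to 9 hours of that window.
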
## AFTER THE RUN

1. Copy log: `cp /tmp/llama4_maverick_full.log .claude/knowledge/`
2. Push results to `src/hpc/openchat/weights/llama4_maverick_shard{NN}.bgz7`
   (if output files were kept — otherwise just push the log)
3. Commit + push

Do NOT modify anything in src/hpc/ except adding results to knowledge/.
Do NOT run shards in parallel (RAM). Do NOT skip tail cleanup.