| 1 | +# SESSION: Llama 4 Maverick BF16 — Stream-Index All 18 Shards |
| 2 | + |
| 3 | +## MISSION |
| 4 | + |
| 5 | +Process all 18 BF16 shards of Llama 4 Maverick (17B-128E, 402B total params). |
| 6 | +801.47 GB streamed through the BF16-direct indexer with F64x8 SIMD. |
| 7 | + |
| 8 | +For reference, Scout (16 experts) compressed to 37.88 MB, a 5,693× reduction. |
| 9 | +Expected output for Maverick (128 experts): 90–489 MB in total. |
| 10 | + |
| 11 | +## READ FIRST |
| 12 | + |
| 13 | +```bash |
| 14 | +cat src/hpc/gguf_indexer.rs # stream_index_gguf_bf16(), project_tensor_bf16_simd() |
| 15 | +cat src/hpc/http_reader.rs # HttpRangeReader::with_chunk_size() |
| 16 | +cat src/hpc/gguf.rs # GGUF header/tensor parsing |
| 17 | + |
| 18 | +# The test is already written at the bottom of gguf_indexer.rs: |
| 19 | +grep -A 5 "test_stream_index_llama4_maverick_bf16_all_shards" src/hpc/gguf_indexer.rs |
| 20 | +``` |
| 21 | + |
| 22 | +Do NOT modify any existing code. The test function is already there. |
| 23 | + |
| 24 | +## RUN COMMAND |
| 25 | + |
| 26 | +```bash |
| 27 | +cargo test test_stream_index_llama4_maverick_bf16_all_shards \ |
| 28 | + --release -- --ignored --nocapture 2>&1 | tee /tmp/llama4_maverick_full.log |
| 29 | +``` |
| 30 | + |
| 31 | +## PIPELINE (already implemented) |
| 32 | + |
| 33 | +``` |
| 34 | +BF16 bytes → read_tensor_bf16_raw (reusable Vec<u16>, no f32 alloc) |
| 35 | + → project_tensor_bf16_simd (F64x8, 8 rows parallel) |
| 36 | + → project_8rows_bf16_simd (17 zmm accumulators) |
| 37 | + → gather_bf16_x8 (8 indexed u16 loads → F64x8) |
| 38 | + → strided octave (stride=16, 51 of 814 octaves) |
| 39 | + → halftone drop (9 of 17 golden positions) |
| 40 | + → interpolate odd bins from neighbors |
| 41 | + → project_1row_bf16_strided (scalar tail, n_rows % 8) |
| 42 | + → CompressedTensor::write_to (Base17 per row) |
| 43 | + → tail deletion (keep 3 most recent outputs) |
| 44 | +``` |
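The first stages of the pipeline can be illustrated with a minimal scalar sketch. It is not the code in `gguf_indexer.rs`: `bf16_to_f64`, `gather_column_x8`, and `strided_row_sums` are hypothetical names, the buffer is assumed to be row-major, and the stride is applied to columns as a stand-in for the octave stride, whose exact definition is not given here. The one firm fact it relies on is that BF16 is the upper 16 bits of an IEEE-754 f32.

```rust
/// Widen one BF16 value (raw u16 bits) to f64.
/// BF16 is the upper 16 bits of an f32, so shifting left by 16 rebuilds
/// the f32 exactly, and f32 -> f64 is lossless.
fn bf16_to_f64(bits: u16) -> f64 {
    f32::from_bits((bits as u32) << 16) as f64
}

/// Hypothetical scalar stand-in for an 8-lane gather: read column `col`
/// from 8 consecutive rows of a row-major BF16 buffer (assumed layout).
fn gather_column_x8(buf: &[u16], n_cols: usize, first_row: usize, col: usize) -> [f64; 8] {
    let mut lanes = [0.0f64; 8];
    for (lane, out) in lanes.iter_mut().enumerate() {
        *out = bf16_to_f64(buf[(first_row + lane) * n_cols + col]);
    }
    lanes
}

/// Accumulate every `stride`-th column for 8 rows at once.
/// Illustrative only: the real projection uses 17 accumulators per row group.
fn strided_row_sums(buf: &[u16], n_cols: usize, first_row: usize, stride: usize) -> [f64; 8] {
    let mut acc = [0.0f64; 8];
    for col in (0..n_cols).step_by(stride) {
        let lanes = gather_column_x8(buf, n_cols, first_row, col);
        for i in 0..8 {
            acc[i] += lanes[i];
        }
    }
    acc
}
```

Because only the sampled columns are converted, the widening cost scales with the number of samples rather than with the tensor size, and no intermediate f32 buffer is needed.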
| 45 | + |
| 46 | +## DISK BUDGET: 26 GB FREE |
| 47 | + |
| 48 | +Output files are tiny (expected 5-27 MB each). Tail deletion keeps the 3 most |
| 49 | +recent outputs and deletes older ones. Total output produced: 90-489 MB. No disk pressure. |
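The tail-deletion policy can be sketched as follows. This is an assumption about the policy's shape, not the indexer's actual implementation: `tail_to_delete` and `delete_tail` are hypothetical names, and the paths are assumed to be listed oldest first.

```rust
/// Given output paths in the order they were written (oldest first),
/// return the ones to delete so that only the `keep` most recent remain.
fn tail_to_delete(written: &[String], keep: usize) -> Vec<String> {
    let cut = written.len().saturating_sub(keep);
    written[..cut].to_vec()
}

/// Delete everything except the `keep` most recent outputs.
/// Stops at the first I/O error and returns it.
fn delete_tail(written: &[String], keep: usize) -> std::io::Result<()> {
    for path in tail_to_delete(written, keep) {
        std::fs::remove_file(&path)?;
    }
    Ok(())
}
```

With at most three outputs retained at up to 27 MB each, on-disk output would peak near 81 MB, well inside the 26 GB budget.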
| 50 | + |
| 51 | +## SHARD MAP (18 shards, 801.47 GB) |
| 52 | + |
| 53 | +``` |
| 54 | +Shard 1: 46.17 GB Shard 10: 42.95 GB |
| 55 | +Shard 2: 42.95 GB Shard 11: 42.95 GB |
| 56 | +Shard 3: 42.95 GB Shard 12: 47.91 GB |
| 57 | +Shard 4: 42.95 GB Shard 13: 42.95 GB |
| 58 | +Shard 5: 47.94 GB Shard 14: 42.95 GB |
| 59 | +Shard 6: 42.95 GB Shard 15: 42.95 GB |
| 60 | +Shard 7: 42.95 GB Shard 16: 47.91 GB |
| 61 | +Shard 8: 42.95 GB Shard 17: 42.95 GB |
| 62 | +Shard 9: 47.92 GB Shard 18: 48.21 GB |
| 63 | +``` |
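As a sanity check on the shard map, the sizes can be summed and compared with the stated total. The sketch below copies the values from the table; `SHARD_GB`, `total_gb`, and `largest_shard` are names chosen here for illustration only.

```rust
/// Shard sizes in GB, copied from the shard map (index 0 = shard 1).
const SHARD_GB: [f64; 18] = [
    46.17, 42.95, 42.95, 42.95, 47.94, 42.95, 42.95, 42.95, 47.92,
    42.95, 42.95, 47.91, 42.95, 42.95, 42.95, 47.91, 42.95, 48.21,
];

/// Total size of all shards in GB.
fn total_gb(sizes: &[f64]) -> f64 {
    sizes.iter().sum()
}

/// 1-based shard number and size of the largest shard, if any.
fn largest_shard(sizes: &[f64]) -> Option<(usize, f64)> {
    sizes
        .iter()
        .copied()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, gb)| (i + 1, gb))
}
```

The rounded per-shard figures sum to about 801.46 GB, which agrees with the stated 801.47 GB to within rounding. Shard 18 is the largest at 48.21 GB.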
| 64 | + |
| 65 | +128 MoE experts. Layers interleave Dense→MoE→Dense (every odd layer is MoE). |
| 66 | + |
| 67 | +## EXPECTED RUNTIME |
| 68 | + |
| 69 | +~8-10 hours total. Each shard ~25-30 min. |
| 70 | +Peak RAM: ~142 MB (one reusable u16 buffer, largest tensor). |
| 71 | +Bottleneck: network, not CPU (97% fewer BF16→f64 conversions than the f32 path). |
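One plausible reading of the 97% figure, offered as an assumption rather than a statement about the code, is that it is the product of the two sampling steps in the pipeline: 51 of 814 octaves kept, then 9 of 17 positions kept. The sketch below works through that arithmetic and the per-shard runtime; `kept_fraction` and `runtime_hours` are hypothetical helpers.

```rust
/// Fraction of conversions that remain after striding and halftone drop,
/// assuming the two reductions multiply.
fn kept_fraction(octaves_kept: u32, octaves_total: u32, pos_kept: u32, pos_total: u32) -> f64 {
    (octaves_kept as f64 / octaves_total as f64) * (pos_kept as f64 / pos_total as f64)
}

/// Total runtime in hours for `shards` shards at `minutes_per_shard` each.
fn runtime_hours(shards: u32, minutes_per_shard: f64) -> f64 {
    shards as f64 * minutes_per_shard / 60.0
}
```

Under that assumption, `1.0 - kept_fraction(51, 814, 9, 17)` is about 0.967, consistent with the quoted 97%. The per-shard estimate alone gives 7.5 to 9 hours for 18 shards, so the 8 to 10 hour total presumably includes some headroom.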
| 72 | + |
| 73 | +## AFTER THE RUN |
| 74 | + |
| 75 | +1. Copy log: `cp /tmp/llama4_maverick_full.log .claude/knowledge/` |
| 76 | +2. Push results to `src/hpc/openchat/weights/llama4_maverick_shard{NN}.bgz7` |
| 77 | + (if output files were kept — otherwise just push the log) |
| 78 | +3. Commit + push |
| 79 | + |
| 80 | +Do NOT modify anything in src/hpc/ except adding result files under src/hpc/openchat/weights/; the log goes to .claude/knowledge/. |
| 81 | +Do NOT run shards in parallel (RAM). Do NOT skip tail cleanup. |