| 1 | +# Llama 4 Scout 17B-16E — Full Model Compression Results |
| 2 | + |
| 3 | +## Pipeline |
| 4 | + |
| 5 | +BF16-direct → golden-step Base17 → bgz7 container |
| 6 | + |
| 7 | +- `stream_index_gguf_bf16()` with `octave_stride=16` |
| 8 | +- F64x8 SIMD: 8 rows projected in parallel per zmm register |
| 9 | +- Halftone drop: 9 of 17 golden-step positions, odd bins interpolated |
| 10 | +- No f32 intermediate allocation (BF16 → f64 inline) |
| 11 | +- Reusable u16 buffer across all tensors |
| 12 | + |
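The "no f32 intermediate allocation" step can be illustrated with a minimal sketch. BF16 is the upper 16 bits of an IEEE-754 f32, so each value can be widened to f64 element by element while it is being accumulated, without first materialising an f32 buffer. The helper names `bf16_to_f64` and `row_mean_f64` are hypothetical and are not taken from `stream_index_gguf_bf16()`; the actual golden-step projection is not shown here.

```rust
/// Widen one BF16 value (given as raw bits) to f64.
/// BF16 is the top half of an f32 bit pattern, so shifting left by 16
/// reconstructs the f32 exactly, and f32 -> f64 is lossless.
fn bf16_to_f64(bits: u16) -> f64 {
    f64::from(f32::from_bits(u32::from(bits) << 16))
}

/// Mean of a row of raw BF16 values, accumulated in f64.
/// Each element is widened inline, so no f32 buffer is allocated.
/// (Hypothetical stand-in for one step of the real projection.)
fn row_mean_f64(row: &[u16]) -> f64 {
    if row.is_empty() {
        return 0.0;
    }
    let sum: f64 = row.iter().map(|&b| bf16_to_f64(b)).sum();
    sum / row.len() as f64
}
```

For example, `bf16_to_f64(0x3F80)` yields `1.0`, because `0x3F80_0000` is the f32 bit pattern for 1.0.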
| 13 | +## Results |
| 14 | + |
| 15 | +| Shard | Source (BF16) | Compressed | Ratio | |
| 16 | +|-------|---------------|------------|-------| |
| 17 | +| 1 | 48.94 GB | 11.77 MB | 4,159× | |
| 18 | +| 2 | 49.96 GB | 8.32 MB | 6,005× | |
| 19 | +| 3 | 48.66 GB | 5.57 MB | 8,736× | |
| 20 | +| 4 | 49.79 GB | 4.52 MB | 11,016× | |
| 21 | +| 5 | 18.22 GB | 7.70 MB | 2,366× | |
| 22 | +| **Total** | **215.57 GB** | **37.88 MB** | **5,693×** | |
| 23 | + |
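The ratios in the table can be rechecked with a short sketch. It assumes decimal units (1 GB = 1000 MB), which is an assumption and not stated in the table. Because it works from the rounded sizes as printed, the recomputed values can differ from the printed ratios by a few units in the last digits. The names `compression_ratio`, `total_ratio` and `SHARDS` are illustrative only.

```rust
/// (source size in GB, compressed size in MB) per shard, as printed in the table.
const SHARDS: [(f64, f64); 5] = [
    (48.94, 11.77),
    (49.96, 8.32),
    (48.66, 5.57),
    (49.79, 4.52),
    (18.22, 7.70),
];

/// Compression ratio from a source size in GB and a compressed size in MB.
/// ASSUMPTION: decimal units, 1 GB = 1000 MB.
fn compression_ratio(source_gb: f64, compressed_mb: f64) -> f64 {
    source_gb * 1000.0 / compressed_mb
}

/// Aggregate ratio: total source size over total compressed size.
/// This is not the mean of the per-shard ratios.
fn total_ratio(shards: &[(f64, f64)]) -> f64 {
    let source_gb: f64 = shards.iter().map(|s| s.0).sum();
    let compressed_mb: f64 = shards.iter().map(|s| s.1).sum();
    compression_ratio(source_gb, compressed_mb)
}
```

Calling `total_ratio(&SHARDS)` reproduces the Total row to within rounding; the aggregate is weighted by shard size, which is why it sits below the simple average of the five per-shard ratios.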
| 24 | +## Observations |
| 25 | + |
| 26 | +- Shard 1 (embeddings + early layers): larger output due to embedding table |
| 27 | +- Shards 3-4 (middle MoE layers): highest ratios — expert weights are |
| 28 | + highly structured, golden-step averaging captures the per-expert identity |
| 29 | + in 34 bytes per row |
| 30 | +- Shard 5 (final layers + output head): lower ratio — output projection |
| 31 | + has more variance than interior MoE expert weights |
| 32 | + |
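The "34 bytes per row" figure is consistent with a record of 17 values stored as u16. The sketch below makes that arithmetic explicit. It assumes one u16 per golden-step position, which is inferred from the Base17 name and the reusable u16 buffer rather than confirmed by the source; all names are hypothetical.

```rust
use std::mem::size_of;

/// Number of golden-step positions in a Base17 record.
const BASE17_POSITIONS: usize = 17;

/// Bytes per compressed row, ASSUMING one u16 per position (17 * 2 = 34).
const ROW_BYTES: usize = BASE17_POSITIONS * size_of::<u16>();

/// Bytes occupied by one uncompressed BF16 row with `cols` columns.
fn bf16_row_bytes(cols: usize) -> usize {
    cols * size_of::<u16>()
}

/// Per-row size reduction under the assumption above.
fn per_row_ratio(cols: usize) -> f64 {
    bf16_row_bytes(cols) as f64 / ROW_BYTES as f64
}
```

Under this assumption the per-row reduction grows linearly with row width, since the compressed record has a fixed size.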
| 33 | +## Location |
| 34 | + |
| 35 | +`src/hpc/openchat/weights/llama4_scout_shard{1-5}.bgz7` |
| 36 | + |
| 37 | +## Implications for Maverick |
| 38 | + |
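| 39 | +Maverick has 128 experts (8× Scout's 16), so the MoE layers dominate even more. |
| 40 | +If the interior-shard ratios hold (~6,000-11,000×), dividing Maverick's 801 GB gives roughly 73-134 MB. |
| 41 | +Allowing for the lower ratios on the embedding and output shards, 90-180 MB (about 4,450-8,900× overall) is a more realistic range. |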
| 42 | + |
| 43 | +Conservative estimate: ~300 MB (if embedding/attention layers scale worse). |
| 44 | +Optimistic estimate: ~90 MB (if expert sparsity is even higher with 128E). |
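The projections are plain division of source size by an assumed ratio, so they are easy to recompute for other assumptions. The sketch below uses decimal units (1 GB = 1000 MB, an assumption) and hypothetical helper names. By this arithmetic, the ~300 MB conservative figure implies an overall ratio of about 2,670× and the ~90 MB optimistic figure implies about 8,900×.

```rust
/// Projected compressed size in MB for a source of `source_gb` GB
/// at an assumed compression ratio.
/// ASSUMPTION: decimal units, 1 GB = 1000 MB.
fn projected_mb(source_gb: f64, ratio: f64) -> f64 {
    source_gb * 1000.0 / ratio
}

/// Overall ratio required to compress `source_gb` GB down to `target_mb` MB.
fn required_ratio(source_gb: f64, target_mb: f64) -> f64 {
    source_gb * 1000.0 / target_mb
}
```

For instance, `projected_mb(801.0, 6000.0)` gives the projected size at the low end of the interior-shard ratios, and `required_ratio(801.0, 300.0)` gives the ratio behind the conservative estimate.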