|
| 1 | +# HANDOVER: Llama 4 Maverick 128-Expert MoE + Temperature Fix |
| 2 | + |
| 3 | +## Status from this session |
| 4 | + |
| 5 | +Everything on branch `claude/setup-embedding-pipeline-Fa65C`, all merged. |
| 6 | + |
| 7 | +### What's built and working |
| 8 | +- **Qwopus 27B**: 64 layers streamed from 53.8GB BF16, baked to 26.4MB |
| 9 | +- **SiLU gate correction**: 86% material for real BF16 weights (33% of table scale) |
| 10 | +- **4096-centroid codebook**: 248K tokens assigned, 16MB distance table |
| 11 | +- **Real tokenizer**: Qwen BPE 151K vocab via HuggingFace tokenizers crate |
| 12 | +- **Living thought loop**: tension-driven (free energy), autoregressive, ghost prediction |
| 13 | +- **MoE architecture**: 4096 pseudo-experts, top-128, 4-group hierarchy |
| 14 | +- **Gate as NARS modulator**: three modes compared (No Gate, Filter, NARS) |
| 15 | +- **Inference DAG orchestrator**: 4 pipeline templates, NARS path RL |
| 16 | +- **OSINT pipeline**: spider + OCR + reader-lm + NARS expansion |
| 17 | +- **LiteralGraph**: aiwar (221 nodes) + Wikileaks (1872 nodes) |
| 18 | + |
| 19 | +### What's broken: the attractor collapse |
| 20 | +All generation modes collapse to the same dominant centroid ("!" = centroid 36/78). |
| 21 | +Root cause: argmax token selection + coarse routing + no temperature. |
| 22 | + |
| 23 | +**THE FIX** = thinking style temperature + nucleus sampling (top-p). |
| 24 | +These are the SAME thing. Each thinking style maps to a sampling strategy: |
| 25 | +- Analytical = top-p 0.3 (narrow, precise) |
| 26 | +- Creative = top-p 0.95 (wide, exploratory) |
| 27 | +- Metacognitive = adaptive top-p based on free energy |
| 28 | + |
| 29 | +This was intentionally left for the next session — wiring temperature |
| 30 | +INTO the thinking styles so it's done once, correctly. |
| 31 | + |
| 32 | +--- |
| 33 | + |
| 34 | +## Llama 4 Maverick — the target |
| 35 | + |
| 36 | +### Model facts |
| 37 | +``` |
| 38 | +Model: Llama-4-Maverick-17B-128E-Instruct |
| 39 | +Params: ~400B total (17B active per token) |
| 40 | +Experts: 128 (REAL MoE, not dense) |
| 41 | +Top-K: 2 (only 2 experts fire per token; verify against the model config, Meta describes 1 routed + 1 shared expert) |
| 42 | +Layers: 48 |
| 43 | +Hidden: 5120 |
| 44 | +Heads: 40 (8 KV heads, GQA) |
| 45 | +FFN dim: 8192 per expert |
| 46 | +Vocab: 202,048 |
| 47 | +``` |
| 48 | + |
| 49 | +### GGUF source |
| 50 | +``` |
| 51 | +Repo: unsloth/Llama-4-Maverick-17B-128E-Instruct-GGUF |
| 52 | +BF16: 18 shards × ~43-48 GB = ~800 GB total |
| 53 | +Q2_K: 3 shards × ~48 GB = ~146 GB (but loses gate precision!) |
| 54 | +mmproj-BF16.gguf: 1.7 GB (multimodal projector, separate) |
| 55 | +``` |
| 56 | + |
| 57 | +### Why BF16 (not Q2_K) |
| 58 | +The SiLU gate correction proved that 68.9% of gate weights sit near zero. |
| 59 | +Q2_K quantizes to 2 bits — it CANNOT distinguish gate=0.01 from gate=0.05. |
| 60 | +For 128 real experts where the gate IS the routing decision, we need BF16. |
| 61 | +Stream it via HTTP range requests. Never download. |
| 62 | + |
| 63 | +### Per-layer tensor layout (MoE) |
| 64 | +``` |
| 65 | +blk.N.attn_q.weight: 5120 × 5120 = 52.4 MB |
| 66 | +blk.N.attn_k.weight: 5120 × 1024 = 10.5 MB |
| 67 | +blk.N.attn_v.weight: 5120 × 1024 = 10.5 MB |
| 68 | +blk.N.ffn_gate_inp.weight: 5120 × 128 = 1.3 MB ← THE ROUTER |
| 69 | +blk.N.ffn_gate.0.weight: 5120 × 8192 = 83.9 MB ← expert 0 gate |
| 70 | +blk.N.ffn_up.0.weight: 5120 × 8192 = 83.9 MB ← expert 0 up |
| 71 | +blk.N.ffn_down.0.weight: 8192 × 5120 = 83.9 MB ← expert 0 down |
| 72 | +...repeat for experts 1..127... |
| 73 | +blk.N.ffn_gate.127.weight: 5120 × 8192 = 83.9 MB ← expert 127 gate |
| 74 | +blk.N.ffn_up.127.weight: 5120 × 8192 = 83.9 MB |
| 75 | +blk.N.ffn_down.127.weight: 8192 × 5120 = 83.9 MB |
| 76 | +``` |
| 77 | + |
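| 78 | +Per layer: ~32 GB of expert weights (128 × 3 roles × 83.9 MB) |
| 79 | +If all 48 layers were MoE: 48 × 32 GB = ~1.5 TB, which exceeds the ~800 GB BF16 total. |
| 80 | +So only about half the layers can carry 128 experts (24 × 32 GB = ~770 GB, plus attention, embeddings, norms); confirm which ones from the tensor names in the shard 1 header. |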
| 81 | + |
| 82 | +### Streaming strategy |
| 83 | +``` |
| 84 | +Phase 1: Parse shard 1 header (20 MB range request) |
| 85 | + → Get all tensor names, dims, dtypes, offsets |
| 86 | + → Map tensor offsets to shard files |
| 87 | + |
| 88 | +Phase 2: Stream token embeddings (202K × 5120 × 2 = 2.1 GB) |
| 89 | + → CLAM 4096 centroids → build codebook + assignments |
| 90 | + |
| 91 | +Phase 3: Per layer, per expert: |
| 92 | + Stream 256 rows of expert.gate (256 × 8192 × 2 = 4.2 MB) |
| 93 | + Stream 256 rows of expert.up (same) |
| 94 | + Apply SiLU(gate) × up |
| 95 | + CLAM 256 centroids → build 256×256 distance table |
| 96 | + Discard weights |
| 97 | + |
| 98 | + 128 experts × 2 tensors (gate + up) × 4.2 MB = 1.07 GB per layer |
| 99 | + 48 layers × 1.07 GB = 51.5 GB total streaming |
| 100 | + |
| 101 | +Phase 4: Stream router weights per layer |
| 102 | + 5120 × 128 × 2 = 1.3 MB per layer |
| 103 | + → Build 128×128 router distance table (who co-activates?) |
| 104 | +
|
| 105 | +Phase 5: Bake everything |
| 106 | + Per expert: 256×256 = 64 KB |
| 107 | + 128 experts × 64 KB = 8 MB per layer |
| 108 | + 48 layers × 8 MB = 384 MB expert tables |
| 109 | + Plus router + attention + embeddings ≈ 500 MB total |
| 110 | + Compression: 800 GB → 500 MB = 1600× |
| 111 | +``` |
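
Phase 3 can be sketched per row as below: decode BF16 to `f32`, then take SiLU(gate) × up. This assumes the product is elementwise over matching rows of the gate and up tensors. The function names are illustrative, not existing code in `silu_correction.rs`.

```rust
/// Decode one BF16 value (raw 16 bits) to f32.
/// BF16 is the top 16 bits of an IEEE-754 f32, so a left shift is enough.
pub fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// SiLU (swish) activation: x * sigmoid(x).
pub fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Elementwise SiLU(gate) * up for one streamed row.
/// Panics if the two rows differ in length.
pub fn silu_gate_up(gate: &[f32], up: &[f32]) -> Vec<f32> {
    assert_eq!(gate.len(), up.len(), "gate and up rows must have equal length");
    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect()
}
```

Near zero, SiLU is roughly `x / 2`, so gate values such as 0.01 and 0.05 stay distinct after the activation only if the input precision kept them distinct in the first place.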
| 112 | + |
| 113 | +### Multi-shard GGUF parsing |
| 114 | +``` |
| 115 | +Shard 1: header + KV metadata + tensor info for ALL tensors + first data |
| 116 | +Shard 2-18: continuation of tensor data |
| 117 | + |
| 118 | +Tensor offsets in the header are ABSOLUTE (from start of all data). |
| 119 | +To find which shard contains a tensor: |
| 120 | + cumulative_offset = 0 |
| 121 | + for each shard: |
| 122 | + shard_data_size = shard_file_size - shard_header_size |
| 123 | + if tensor_offset < cumulative_offset + shard_data_size: |
| 124 | + → tensor is in this shard |
| 125 | + → local_offset = shard_header_size + (tensor_offset - cumulative_offset) |
| 126 | + cumulative_offset += shard_data_size |
| 127 | +``` |
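
The lookup above can be sketched in Rust as follows. It assumes the layout exactly as described (absolute offsets, one header size per shard). `Shard`, `locate_tensor`, and `range_header` are illustrative names, not existing code. Worth checking against a real shard header before relying on it: if each shard turns out to carry its own tensor info with shard-local offsets, the cumulative walk is unnecessary. A tensor that straddles a shard boundary is not handled here.

```rust
/// One GGUF shard as seen by the resolver. Field names are illustrative.
#[derive(Debug, Clone, Copy)]
pub struct Shard {
    /// Total size of the shard file in bytes.
    pub file_size: u64,
    /// Bytes of header and metadata before the tensor data section.
    pub header_size: u64,
}

/// Map an absolute tensor offset to (shard index, byte offset inside that file).
/// Returns None if the offset lies past the end of the last shard.
pub fn locate_tensor(shards: &[Shard], tensor_offset: u64) -> Option<(usize, u64)> {
    let mut cumulative = 0u64;
    for (idx, shard) in shards.iter().enumerate() {
        // A header larger than the file means bad metadata: give up.
        let data_size = shard.file_size.checked_sub(shard.header_size)?;
        if tensor_offset < cumulative + data_size {
            let local = shard.header_size + (tensor_offset - cumulative);
            return Some((idx, local));
        }
        cumulative += data_size;
    }
    None
}

/// Value for an HTTP `Range` header covering `len` bytes from `local_offset`.
/// HTTP byte ranges are inclusive on both ends; `len` must be non-zero.
pub fn range_header(local_offset: u64, len: u64) -> String {
    assert!(len > 0, "range length must be non-zero");
    format!("bytes={}-{}", local_offset, local_offset + len - 1)
}
```

The returned pair feeds directly into the range request: pick the shard URL by index, then request `range_header(local, tensor_bytes)`.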
| 128 | + |
| 129 | +--- |
| 130 | + |
| 131 | +## Wiring 128 Real Experts (Maverick) vs Current 4096 Pseudo-Experts (Qwopus) |
| 132 | + |
| 133 | +### Current (Qwopus pseudo-MoE) |
| 134 | +``` |
| 135 | +4096 "experts" = 4096 centroids from token embeddings |
| 136 | +Each "expert" is just a row in the 4096×4096 input distance table |
| 137 | +Expert internals: shared 256×256 per-layer tables (all experts use same layers) |
| 138 | +Top-128 selection: argmax on router output |
| 139 | +Expert processing: each runs through same 16 layers |
| 140 | + |
| 141 | +Problem: experts share internals. They're not specialized. |
| 142 | +"Expert 42" and "Expert 1337" run through the SAME layer tables. |
| 143 | +The only difference is their starting position in the 256-space. |
| 144 | +``` |
| 145 | + |
| 146 | +### Target (Maverick real MoE) |
| 147 | +``` |
| 148 | +128 experts, each with THEIR OWN gate/up/down weights |
| 149 | +Expert 0 has: gate_0 (5120×8192), up_0 (5120×8192), down_0 (8192×5120) |
| 150 | +Expert 127 has: gate_127, up_127, down_127 |
| 151 | +Each expert IS a different neural network. Different specialization. |
| 152 | + |
| 153 | +Router (ffn_gate_inp): 5120 × 128 matrix |
| 154 | + Input hidden state × router = 128 scores |
| 155 | + Top-2 scores → 2 experts fire |
| 156 | + Expert outputs weighted by softmax of their scores |
| 157 | + |
| 158 | +This is TRUE sparse MoE: |
| 159 | + 128 specialists, only 2 work per token |
| 160 | + Each specialist has genuinely different weights |
| 161 | + The router learned which specialist handles which inputs |
| 162 | +``` |
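
A minimal sketch of the routing step as described above: score all experts, keep the top `k`, softmax the kept scores, and mix the expert outputs. Taking the softmax over only the selected scores is an assumption, and `route_top_k` and `combine` are hypothetical names. The real model's gating function should be checked against its config.

```rust
/// Pick the `k` highest-scoring experts and softmax-normalize their scores.
/// Returns (expert_id, weight) pairs, highest score first; weights sum to 1.
pub fn route_top_k(scores: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut ranked: Vec<(usize, f32)> = scores.iter().copied().enumerate().collect();
    // Sort descending by score; total_cmp gives a total order even with NaN.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked.truncate(k);
    if ranked.is_empty() {
        return ranked;
    }
    // Softmax over the selected scores only, shifted by the max for stability.
    let max = ranked[0].1;
    let exps: Vec<f32> = ranked.iter().map(|&(_, s)| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    ranked.iter().zip(exps).map(|(&(id, _), e)| (id, e / sum)).collect()
}

/// Weighted sum of expert output vectors, given as (weight, output) pairs.
/// The output length follows the first vector; an empty input gives an empty result.
pub fn combine(weighted: &[(f32, &[f32])]) -> Vec<f32> {
    let dim = weighted.first().map_or(0, |(_, v)| v.len());
    let mut out = vec![0.0f32; dim];
    for (w, v) in weighted {
        for (o, x) in out.iter_mut().zip(v.iter()) {
            *o += w * x;
        }
    }
    out
}
```

With `k = 2`, the two returned ids select which expert tables to run, and the two weights go into `combine` together with those experts' outputs.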
| 163 | + |
| 164 | +### The wiring change |
| 165 | +``` |
| 166 | +Current: |
| 167 | + input → shared router table → top-128 → shared layer tables → output |
| 168 | + |
| 169 | +Target: |
| 170 | + input → router table (128×128, from ffn_gate_inp) → top-2 |
| 171 | + → expert_i: own gate_i table (256×256) + own up_i table + own down_i table |
| 172 | + → expert_j: own gate_j table + own up_j table + own down_j table |
| 173 | + → weighted sum of expert_i and expert_j outputs |
| 174 | + → next layer |
| 175 | + |
| 176 | +Each expert has 3 UNIQUE tables: gate, up, down |
| 177 | +128 experts × 3 tables × 64 KB = 24 MB per layer |
| 178 | +48 layers × 24 MB = 1.15 GB of expert-specific tables |
| 179 | +Plus shared attention tables: 48 × 64 KB = 3 MB |
| 180 | +Plus router tables: 48 × 16 KB = 768 KB |
| 181 | +Total: ~1.2 GB for the complete Maverick brain |
| 182 | +``` |
| 183 | + |
| 184 | +### Code change |
| 185 | +```rust |
| 186 | +// Current (qwopus_moe.rs): |
| 187 | +for &(expert_id, expert_weight) in &active_experts { |
| 188 | + // ALL experts use the SAME layer tables |
| 189 | + let [ref at, ref gt, ref up, ref dn] = layers[l]; |
| 190 | + // ... process through shared tables ... |
| 191 | +} |
| 192 | + |
| 193 | +// Target (stream_maverick.rs): |
| 194 | +for &(expert_id, expert_weight) in &active_experts { |
| 195 | + // Each expert uses ITS OWN tables |
| 196 | + let expert_gate = &expert_tables[l][expert_id].gate; // unique! |
| 197 | + let expert_up = &expert_tables[l][expert_id].up; // unique! |
| 198 | + let expert_down = &expert_tables[l][expert_id].down; // unique! |
| 199 | + // ... process through expert-specific tables ... |
| 200 | +} |
| 201 | +``` |
| 202 | + |
| 203 | +### The SiLU correction for real MoE |
| 204 | +``` |
| 205 | +For Qwopus (dense): SiLU correction changed 33% of table (material) |
| 206 | +For Maverick (MoE): SiLU correction expected to be TRANSFORMATIVE |
| 207 | + |
| 208 | +Why: the router weight (ffn_gate_inp) decides which 2 of 128 fire. |
| 209 | +Raw cosine on router weights: cos(expert_i, expert_j) ≈ 0.95 for all pairs |
| 210 | + → "all experts look similar" → WRONG |
| 211 | +SiLU-corrected: reveals which experts ACTUALLY co-activate |
| 212 | + → expert 3 and expert 42 → correction = -0.8 → never co-fire |
| 213 | + → expert 3 and expert 17 → correction = +0.3 → frequently co-fire |
| 214 | + → the distance table encodes ROUTING, not just similarity |
| 215 | +``` |
| 216 | + |
| 217 | +--- |
| 218 | + |
| 219 | +## Temperature Fix (Do First) |
| 220 | + |
| 221 | +Before streaming Maverick, fix the attractor collapse. It's a small change: |
| 222 | + |
| 223 | +```rust |
| 224 | +// In the living loop / MoE output selection: |
| 225 | + |
| 226 | +// Instead of argmax: |
| 227 | +let winner = peaks[0].0; |
| 228 | + |
| 229 | +// Use nucleus sampling (top-p): |
| 230 | +fn sample_nucleus(peaks: &[(usize, f32)], top_p: f32, temperature: f32) -> usize { |
| 231 | + // Apply temperature |
| 232 | + let scaled: Vec<f32> = peaks.iter() |
| 233 | + .map(|&(_, e)| (e / temperature).exp()) |
| 234 | + .collect(); |
| 235 | + let sum: f32 = scaled.iter().sum(); |
| 236 | + let probs: Vec<f32> = scaled.iter().map(|s| s / sum).collect(); |
| 237 | + |
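| 238 | + // Nucleus: shortest prefix of peaks (sorted descending) whose mass reaches top_p |
| 239 | + let mut cumsum = 0.0; |
| 240 | + let mut cutoff = probs.len(); |
| 241 | + for (i, &p) in probs.iter().enumerate() { |
| 242 | + cumsum += p; |
| 243 | + if cumsum >= top_p { cutoff = i + 1; break; } |
| 244 | + } |
| 245 | + // Draw inside the nucleus in proportion to probability, so temperature shapes the draw |
| 246 | + let mut r = (simple_random() % 1_000_000) as f32 / 1_000_000.0 * cumsum; |
| 247 | + for (i, &p) in probs[..cutoff].iter().enumerate() { if r < p { return peaks[i].0; } r -= p; } |
| 248 | + peaks[0].0 |
| 249 | +} |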
| 250 | + |
| 251 | +// Thinking style maps to temperature: |
| 252 | +let temperature = match thinking_style { |
| 253 | + Analytical | Logical | Systematic => 0.3, // precise |
| 254 | + Creative | Imaginative | Playful => 1.2, // exploratory |
| 255 | + Metacognitive | Reflective => 0.7, // balanced |
| 256 | + _ => 0.8, // default |
| 257 | +}; |
| 258 | +let top_p = match thinking_style { |
| 259 | + Focused | Precise => 0.3, |
| 260 | + Exploratory | Curious => 0.95, |
| 261 | + _ => 0.9, |
| 262 | +}; |
| 263 | +``` |
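
A self-contained sketch of temperature plus top-p sampling, for testing outside the living loop. Assumptions: peaks are `(id, energy)` pairs where higher energy means more likely, the draw inside the nucleus is proportional to probability, and `XorShift64` is a hypothetical stand-in RNG (seed with any non-zero value) so the block needs no crates.

```rust
/// Tiny xorshift64 generator; a stand-in for whatever RNG the engine uses.
pub struct XorShift64(pub u64);

impl XorShift64 {
    /// Next value in [0, 1).
    pub fn next_f32(&mut self) -> f32 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        // Keep the top 24 bits so the value converts to f32 exactly.
        (self.0 >> 40) as f32 / (1u64 << 24) as f32
    }
}

/// Temperature-scaled nucleus (top-p) sampling over `(id, energy)` peaks.
/// Returns None for empty input.
pub fn sample_nucleus(
    peaks: &[(usize, f32)],
    top_p: f32,
    temperature: f32,
    rng: &mut XorShift64,
) -> Option<usize> {
    if peaks.is_empty() {
        return None;
    }
    // Sort a copy descending by energy so the nucleus is a prefix.
    let mut sorted = peaks.to_vec();
    sorted.sort_by(|a, b| b.1.total_cmp(&a.1));
    // Softmax with temperature, shifted by the max to avoid overflow.
    let t = temperature.max(1e-6);
    let max = sorted[0].1;
    let weights: Vec<f32> = sorted.iter().map(|&(_, e)| ((e - max) / t).exp()).collect();
    let total: f32 = weights.iter().sum();
    // Nucleus: shortest prefix whose probability mass reaches top_p.
    let mut mass = 0.0f32;
    let mut cutoff = sorted.len();
    for (i, w) in weights.iter().enumerate() {
        mass += w / total;
        if mass >= top_p {
            cutoff = i + 1;
            break;
        }
    }
    // Draw inside the nucleus in proportion to probability.
    let mut r = rng.next_f32() * mass;
    for (i, w) in weights[..cutoff].iter().enumerate() {
        let p = w / total;
        if r < p {
            return Some(sorted[i].0);
        }
        r -= p;
    }
    // Rounding can leave r marginally above zero; fall back to the last nucleus member.
    Some(sorted[cutoff - 1].0)
}
```

A low temperature or a small `top_p` shrinks the nucleus toward the single best peak (argmax behaviour), while a high temperature with `top_p` near 1.0 spreads draws across all peaks.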
| 264 | + |
| 265 | +This unblocks coherent output. THEN stream Maverick with 128 real experts |
| 266 | +through the now-working generation pipeline. |
| 267 | + |
| 268 | +--- |
| 269 | + |
| 270 | +## File locations |
| 271 | + |
| 272 | +``` |
| 273 | +Streaming script: crates/thinking-engine/examples/stream_maverick.rs |
| 274 | +Qwopus data: crates/thinking-engine/data/Qwopus3.5-27B-v3-BF16-silu/ |
| 275 | +Living loop: crates/thinking-engine/examples/qwopus_living.rs |
| 276 | +MoE example: crates/thinking-engine/examples/qwopus_moe.rs |
| 277 | +NARS gate example: crates/thinking-engine/examples/qwopus_nars_gate.rs |
| 278 | +SiLU correction: crates/thinking-engine/src/silu_correction.rs |
| 279 | +HDR audit: crates/thinking-engine/examples/hdr_audit.rs |
| 280 | +Inference DAG: crates/lance-graph-contract/src/orchestration_mode.rs |
| 281 | +OSINT pipeline: crates/lance-graph-osint/examples/stream_explore.rs |
| 282 | +Felt OCR: ndarray/src/hpc/ocr_felt.rs |
| 283 | +SIMD OCR: ndarray/src/hpc/ocr_simd.rs |
| 284 | +``` |
| 285 | + |
| 286 | +## Session stats |
| 287 | +- 61 commits (57 lance-graph + 4 ndarray) |
| 288 | +- 235K LOC Rust |
| 289 | +- 500+ tests across 18 crates |
| 290 | +- All PRs merged |
| 291 | + |
| 292 | +--- |
| 293 | + |
| 294 | +## γ+φ Golden Ratio HDR Encoding (CRITICAL for gate precision) |
| 295 | + |
| 296 | +### The problem |
| 297 | +Current HDR CDF produces uniform distribution (Mean=127.5 for ALL models). |
| 298 | +But gate weights concentrate at zero (68.9% for Qwopus). |
| 299 | +Uniform encoding wastes resolution on far-from-zero regions where |
| 300 | +the model already knows (strong yes/no). The decision boundary at |
| 301 | +zero gets the SAME 1/256 resolution as obvious regions. |
| 302 | + |
| 303 | +### The fix: γ offset + φ redistribution |
| 304 | +Per-role gamma offsets (from variance audit): |
| 305 | +``` |
| 306 | +Q: γ=0.37 (narrow, less resolution needed) |
| 307 | +K: γ=0.94 (moderate, gate-filtered) |
| 308 | +V: γ=1.33 (wide, most information) |
| 309 | +Gate: γ=1.50 (WIDEST — decision boundary needs MAX resolution) |
| 310 | +Up: γ=0.12 (very narrow after SiLU) |
| 311 | +Down: γ=0.15 (funnel, compressed) |
| 312 | +``` |
| 313 | + |
| 314 | +Golden ratio φ=1.618... ensures the redistribution has no periodic aliasing |
| 315 | +(Weyl equidistribution theorem). The spiral stride in highheelbgz already |
| 316 | +uses φ. The γ+φ encoding applies the same principle to u8 quantization. |
| 317 | + |
| 318 | +### Existing code |
| 319 | +- `bgz-tensor/src/gamma_phi.rs`: GammaProfile, gamma_phi_encode/decode |
| 320 | +- `bgz-tensor/src/codebook_calibrated.rs`: two-pass build with γ calibration |
| 321 | +- `highheelbgz/src/`: SpiralAddress with golden ratio stride |
| 322 | +- `thinking-engine/data/codebooks/CODEBOOKS.md`: per-role γ values documented |
| 323 | + |
| 324 | +### Wiring needed |
| 325 | +Pass 1: build CLAM codebook (existing) |
| 326 | +Pass 2: measure cosine distribution → compute γ offset → apply φ redistribution |
| 327 | +Pass 3: re-encode distance table with γ+φ skewed CDF |
| 328 | +Expected: ~4.2 bits → ~5.5 bits entropy (30% more discrimination) |
| 329 | + |
| 330 | +--- |
| 331 | + |
| 332 | +## Standardized ModelPipeline DTO (6 models) |
| 333 | + |
| 334 | +```rust |
| 335 | +pub struct ModelPipeline { |
| 336 | + pub name: String, |
| 337 | + pub family: ModelFamily, // Embedding, Reranker, Reader, LLM, MoE |
| 338 | + pub tokenizer_path: String, |
| 339 | + pub vocab_size: usize, |
| 340 | + pub hidden_dim: usize, |
| 341 | + pub n_layers: usize, |
| 342 | + pub n_experts: Option<usize>, |
| 343 | + pub n_centroids: usize, |
| 344 | + pub gate_policy: GatePolicy, |
| 345 | + pub gamma_profile: GammaProfile, // per-role γ offsets |
| 346 | + pub silu_corrected: bool, |
| 347 | + pub cross_model_anchor: bool, // Jina = truth anchor |
| 348 | +} |
| 349 | +``` |
| 350 | + |
| 351 | +Six pipelines to wire: |
| 352 | +1. Jina v3 (1024-dim, truth anchor, cross-model reference) |
| 353 | +2. BGE-M3 (1024-dim, multilingual, second anchor) |
| 354 | +3. Reader-LM 1.5B (256-dim palette, HTML→text) |
| 355 | +4. Jina Reranker v3 (cross-encoder, relevance scoring) |
| 356 | +5. Qwopus 27B (5120-dim, 64 layers, SSM hybrid) |
| 357 | +6. Maverick 128E (5120-dim, 48 layers, 128 real MoE experts) |
| 358 | + |
| 359 | +Each needs: tokenizer.json + vocab→centroid mapping + γ+φ HDR tables + SiLU correction |
| 360 | + |
| 361 | +### Jina cross-model eval |
| 362 | +Jina as truth anchor: for any input text, Jina embedding = ground truth similarity. |
| 363 | +Compare: cos(jina_emb_A, jina_emb_B) vs thinking_engine_distance(A, B). |
| 364 | +The gap = how much information our distance table loses. |
| 365 | +With γ+φ encoding + SiLU correction: gap should shrink. |
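
The comparison can be sketched as a per-pair gap averaged over a list of text pairs. This assumes the engine's table distance has already been mapped onto the same [-1, 1] similarity scale as the Jina cosine; that mapping, and the names `cosine` and `mean_abs_gap`, are illustrative and not part of the existing crates.

```rust
/// Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Mean absolute gap between reference (Jina) similarities and engine
/// similarities over the same list of pairs. Both must use the same scale.
pub fn mean_abs_gap(reference: &[f32], engine: &[f32]) -> f32 {
    assert_eq!(reference.len(), engine.len(), "one score per pair on each side");
    if reference.is_empty() {
        return 0.0;
    }
    let total: f32 = reference.iter().zip(engine).map(|(r, e)| (r - e).abs()).sum();
    total / reference.len() as f32
}
```

Running `mean_abs_gap` once before and once after enabling γ+φ encoding and SiLU correction gives a single number to track whether the gap shrinks.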