Commit f10e00d

Merge pull request #111 from AdaWorldAPI/claude/setup-embedding-pipeline-Fa65C
docs: add γ+φ encoding + ModelPipeline DTO + Jina eval to handover

Complete next-session pipeline:
1. Temperature fix (nucleus + thinking styles)
2. γ+φ golden ratio HDR re-encoding (per-role, Gate γ=1.50)
3. StandardizedModelPipeline DTO (6 models)
4. Jina cross-model eval (truth anchor)
5. Maverick 128E streaming (800 GB → ~1.2 GB)

https://claude.ai/code/session_01ChLvBfpJS8dQhHxRD4pYNp
2 parents 3d4cfc9 + 6e9bdd5 commit f10e00d

2 files changed

Lines changed: 493 additions & 0 deletions

Lines changed: 365 additions & 0 deletions
@@ -0,0 +1,365 @@
# HANDOVER: Llama 4 Maverick 128-Expert MoE + Temperature Fix

## Status from this session

Everything on branch `claude/setup-embedding-pipeline-Fa65C`, all merged.

### What's built and working
- **Qwopus 27B**: 64 layers streamed from 53.8 GB BF16, baked to 26.4 MB
- **SiLU gate correction**: 86% material for real BF16 weights (33% of table scale)
- **4096-centroid codebook**: 248K tokens assigned, 16 MB distance table
- **Real tokenizer**: Qwen BPE 151K vocab via HuggingFace tokenizers crate
- **Living thought loop**: tension-driven (free energy), autoregressive, ghost prediction
- **MoE architecture**: 4096 pseudo-experts, top-128, 4-group hierarchy
- **Gate as NARS modulator**: three modes compared (No Gate, Filter, NARS)
- **Inference DAG orchestrator**: 4 pipeline templates, NARS path RL
- **OSINT pipeline**: spider + OCR + reader-lm + NARS expansion
- **LiteralGraph**: aiwar (221 nodes) + Wikileaks (1872 nodes)

### What's broken: the attractor collapse
All generation modes collapse to the same dominant centroid ("!" = centroid 36/78).
Root cause: argmax token selection + coarse routing + no temperature.

**THE FIX** = thinking style temperature + nucleus sampling (top-p).
Thinking style and sampling strategy are the SAME thing. Each thinking style maps to a sampling strategy:
- Analytical = top-p 0.3 (narrow, precise)
- Creative = top-p 0.95 (wide, exploratory)
- Metacognitive = adaptive top-p based on free energy

This was intentionally left for the next session: wiring temperature
INTO the thinking styles so it's done once, correctly.

---

## Llama 4 Maverick: the target

### Model facts
```
Model:    Llama-4-Maverick-17B-128E-Instruct
Params:   ~400B total (17B active per token)
Experts:  128 (REAL MoE, not dense)
Top-K:    2 (only 2 experts fire per token)
Layers:   48
Hidden:   5120
Heads:    40 (8 KV heads, GQA)
FFN dim:  8192 per expert
Vocab:    202,048
```

### GGUF source
```
Repo: unsloth/Llama-4-Maverick-17B-128E-Instruct-GGUF
BF16: 18 shards × ~43-48 GB = ~800 GB total
Q2_K: 3 shards × ~48 GB = ~146 GB (but loses gate precision!)
mmproj-BF16.gguf: 1.7 GB (multimodal projector, separate)
```

### Why BF16 (not Q2_K)
The SiLU gate correction proved that 68.9% of gate weights sit near zero.
Q2_K quantizes to 2 bits, so it CANNOT distinguish gate=0.01 from gate=0.05.
For 128 real experts where the gate IS the routing decision, we need BF16.
Stream it via HTTP range requests. Never download.
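
Each range request needs a byte window expressed as an HTTP `Range` header, whose end position is inclusive. The helper below is a minimal sketch of that one detail; the name `range_header` is hypothetical, and it only builds the header value so the choice of HTTP client stays open.

```rust
/// Build the value of an HTTP `Range` header for `len` bytes starting at `offset`.
/// The end position in `bytes=first-last` is inclusive, hence the `- 1`.
/// Returns `None` for an empty range or on u64 overflow.
fn range_header(offset: u64, len: u64) -> Option<String> {
    if len == 0 {
        return None;
    }
    let last = offset.checked_add(len - 1)?;
    Some(format!("bytes={}-{}", offset, last))
}
```

For a 20 MB header probe, `range_header(0, 20 * 1024 * 1024)` covers bytes 0 through 20971519.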

### Per-layer tensor layout (MoE)
```
blk.N.attn_q.weight:        5120 × 5120 = 52.4 MB
blk.N.attn_k.weight:        5120 × 1024 = 10.5 MB
blk.N.attn_v.weight:        5120 × 1024 = 10.5 MB
blk.N.ffn_gate_inp.weight:  5120 × 128  = 1.3 MB   ← THE ROUTER
blk.N.ffn_gate.0.weight:    5120 × 8192 = 83.9 MB  ← expert 0 gate
blk.N.ffn_up.0.weight:      5120 × 8192 = 83.9 MB  ← expert 0 up
blk.N.ffn_down.0.weight:    8192 × 5120 = 83.9 MB  ← expert 0 down
...repeat for experts 1..127...
blk.N.ffn_gate.127.weight:  5120 × 8192 = 83.9 MB  ← expert 127 gate
blk.N.ffn_up.127.weight:    5120 × 8192 = 83.9 MB
blk.N.ffn_down.127.weight:  8192 × 5120 = 83.9 MB
```

Per layer: ~32 GB of expert weights (128 × 3 roles × 83.9 MB)
Upper bound if all 48 layers carry 128 experts: 48 × 32 GB = ~1.5 TB of expert weights
That upper bound exceeds the ~800 GB BF16 total of the shards (which also holds attention, embeddings, norms), so not every layer can carry a full 128-expert set. Take the real per-layer expert count from the tensor list in the GGUF header.
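
The size arithmetic above is easy to recheck in code. The following is a small sketch using only the shapes listed in the layout (BF16 = 2 bytes per element); the function names are hypothetical.

```rust
/// Bytes occupied by a BF16 tensor with the given shape (2 bytes per element).
fn bf16_bytes(rows: u64, cols: u64) -> u64 {
    rows * cols * 2
}

/// Expert weight bytes for one MoE layer: gate + up + down for every expert.
fn expert_bytes_per_layer(hidden: u64, ffn: u64, experts: u64) -> u64 {
    let gate = bf16_bytes(hidden, ffn);
    let up = bf16_bytes(hidden, ffn);
    let down = bf16_bytes(ffn, hidden);
    experts * (gate + up + down)
}
```

With `hidden = 5120`, `ffn = 8192` and `experts = 128` this gives 83,886,080 bytes per expert tensor and 32,212,254,720 bytes per layer. Multiplying by 48 layers lands well above 800 GB, which is why the per-layer expert count should be confirmed from the header rather than assumed.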

### Streaming strategy
```
Phase 1: Parse shard 1 header (20 MB range request)
  → Get all tensor names, dims, dtypes, offsets
  → Map tensor offsets to shard files

Phase 2: Stream token embeddings (202K × 5120 × 2 = 2.1 GB)
  → CLAM 4096 centroids → build codebook + assignments

Phase 3: Per layer, per expert:
  Stream 256 rows of expert.gate (256 × 8192 × 2 = 4.2 MB)
  Stream 256 rows of expert.up (same)
  Apply SiLU(gate) × up
  CLAM 256 centroids → build 256×256 distance table
  Discard weights

  Per role: 128 experts × 4.2 MB = 538 MB per layer
  Gate + up: 2 × 538 MB ≈ 1.08 GB per layer
  48 layers × 1.08 GB ≈ 51.6 GB total streaming (25.8 GB per role)

Phase 4: Stream router weights per layer
  5120 × 128 × 2 = 1.3 MB per layer
  → Build 128×128 router distance table (who co-activates?)

Phase 5: Bake everything
  Per expert: 256×256 = 64 KB (one SiLU(gate) × up table per expert)
  128 experts × 64 KB = 8 MB per layer
  48 layers × 8 MB = 384 MB expert tables
  Plus router + attention + embeddings ≈ 500 MB total
  Compression: 800 GB → 500 MB = 1600×
  (The 3-tables-per-expert layout in "The wiring change" totals ~1.2 GB instead.)
```
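
The "Apply SiLU(gate) × up" step in Phase 3 can be sketched as below. This is an illustration only: it assumes the BF16 rows have already been decoded to `f32`, and the function names are hypothetical rather than taken from `silu_correction.rs`.

```rust
/// SiLU activation: x * sigmoid(x), written as x / (1 + e^(-x)).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Element-wise SiLU(gate) * up over two equally long rows.
fn silu_gate_up(gate: &[f32], up: &[f32]) -> Vec<f32> {
    assert_eq!(gate.len(), up.len(), "gate and up rows must have equal length");
    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect()
}
```

Because SiLU is close to zero for near-zero gate values, rows whose gate weights sit near zero contribute little to the product, which is the effect the gate correction relies on.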

### Multi-shard GGUF parsing
```
Shard 1:    header + KV metadata + tensor info for ALL tensors + first data
Shard 2-18: continuation of tensor data

Tensor offsets in the header are ABSOLUTE (from start of all data).
To find which shard contains a tensor:
  cumulative_offset = 0
  for each shard:
    shard_data_size = shard_file_size - shard_header_size
    if tensor_offset < cumulative_offset + shard_data_size:
      → tensor is in this shard
      → local_offset = shard_header_size + (tensor_offset - cumulative_offset)
      → stop
    cumulative_offset += shard_data_size

Verify this layout on the real files first: splits written by llama.cpp's
gguf-split are each a complete GGUF file with its own header and tensor info,
and offsets relative to that shard's data section. If these shards follow
that layout, parse every shard header instead of mapping absolute offsets.
```
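
The shard lookup in the pseudocode above translates directly into Rust. This sketch takes the absolute-offset layout exactly as the handover describes it, which is an assumption to check against the real shards. The `Shard` struct and `locate_tensor` are hypothetical names, and a tensor that straddles two shards is not handled.

```rust
/// Sizes of one shard file, as used by the lookup pseudocode.
struct Shard {
    file_size: u64,
    header_size: u64,
}

/// Map an absolute tensor offset to (shard index, byte offset inside that file).
/// Returns `None` if the offset lies beyond the last shard.
fn locate_tensor(shards: &[Shard], tensor_offset: u64) -> Option<(usize, u64)> {
    let mut cumulative = 0u64;
    for (i, shard) in shards.iter().enumerate() {
        // Data bytes in this shard; saturating_sub guards against bad sizes.
        let data_size = shard.file_size.saturating_sub(shard.header_size);
        if tensor_offset < cumulative + data_size {
            return Some((i, shard.header_size + (tensor_offset - cumulative)));
        }
        cumulative += data_size;
    }
    None
}
```

The returned file-local offset is what a range request against that shard would start from.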

---

## Wiring 128 Real Experts (Maverick) vs Current 4096 Pseudo-Experts (Qwopus)

### Current (Qwopus pseudo-MoE)
```
4096 "experts" = 4096 centroids from token embeddings
Each "expert" is just a row in the 4096×4096 input distance table
Expert internals: shared 256×256 per-layer tables (all experts use same layers)
Top-128 selection: argmax on router output
Expert processing: each runs through same 16 layers

Problem: experts share internals. They're not specialized.
"Expert 42" and "Expert 1337" run through the SAME layer tables.
The only difference is their starting position in the 256-space.
```

### Target (Maverick real MoE)
```
128 experts, each with THEIR OWN gate/up/down weights
Expert 0 has: gate_0 (5120×8192), up_0 (5120×8192), down_0 (8192×5120)
Expert 127 has: gate_127, up_127, down_127
Each expert IS a different neural network. Different specialization.

Router (ffn_gate_inp): 5120 × 128 matrix
Input hidden state × router = 128 scores
Top-2 scores → 2 experts fire
Expert outputs weighted by softmax of their scores

This is TRUE sparse MoE:
  128 specialists, only 2 work per token
  Each specialist has genuinely different weights
  The router learned which specialist handles which inputs
```
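
The routing step described above (128 scores, top-2, softmax weights) can be sketched as follows. The function name `top2_softmax` is hypothetical. The softmax is taken over the two selected scores only, which gives the same weights as a softmax over all scores followed by renormalizing the top two.

```rust
/// Pick the two highest router scores and weight them with a softmax over
/// those two scores. Returns [(expert_index, weight); 2], best first,
/// or `None` if fewer than two scores are given.
fn top2_softmax(scores: &[f32]) -> Option<[(usize, f32); 2]> {
    if scores.len() < 2 {
        return None;
    }
    // Sort expert indices by descending score (stable, so ties keep index order).
    let mut order: Vec<usize> = (0..scores.len()).collect();
    order.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    let (i, j) = (order[0], order[1]);
    // Two-way softmax, shifted by the larger score for numerical stability.
    let ej = (scores[j] - scores[i]).exp();
    let wi = 1.0 / (1.0 + ej);
    Some([(i, wi), (j, 1.0 - wi)])
}
```

The two weights always sum to 1, so the weighted sum of the two expert outputs stays on the same scale as a single expert's output.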

### The wiring change
```
Current:
  input → shared router table → top-128 → shared layer tables → output

Target:
  input → router table (128×128, from ffn_gate_inp) → top-2
    → expert_i: own gate_i table (256×256) + own up_i table + own down_i table
    → expert_j: own gate_j table + own up_j table + own down_j table
    → weighted sum of expert_i and expert_j outputs
    → next layer

  Each expert has 3 UNIQUE tables: gate, up, down
  128 experts × 3 tables × 64 KB = 24 MB per layer
  48 layers × 24 MB = 1.15 GB of expert-specific tables
  Plus shared attention tables: 48 × 64 KB = 3 MB
  Plus router tables: 48 × 16 KB = 768 KB
  Total: ~1.2 GB for the complete Maverick brain
```

### Code change
```rust
// Current (qwopus_moe.rs):
for &(expert_id, expert_weight) in &active_experts {
    // ALL experts use the SAME layer tables
    let [ref at, ref gt, ref up, ref dn] = layers[l];
    // ... process through shared tables ...
}

// Target (stream_maverick.rs):
for &(expert_id, expert_weight) in &active_experts {
    // Each expert uses ITS OWN tables
    let expert_gate = &expert_tables[l][expert_id].gate;  // unique!
    let expert_up   = &expert_tables[l][expert_id].up;    // unique!
    let expert_down = &expert_tables[l][expert_id].down;  // unique!
    // ... process through expert-specific tables ...
}
```

### The SiLU correction for real MoE
```
For Qwopus (dense):   SiLU correction changed 33% of table (material)
For Maverick (MoE):   SiLU correction expected to be TRANSFORMATIVE

Why: the router weight (ffn_gate_inp) decides which 2 of 128 fire.
Raw cosine on router weights: cos(expert_i, expert_j) ≈ 0.95 for all pairs
  → "all experts look similar" → WRONG
SiLU-corrected: reveals which experts ACTUALLY co-activate
  (illustrative values, not yet measured on Maverick)
  → expert 3 and expert 42 → correction = -0.8 → never co-fire
  → expert 3 and expert 17 → correction = +0.3 → frequently co-fire
  → the distance table encodes ROUTING, not just similarity
```
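
The "raw cosine on router weights" baseline is the simplest piece to pin down in code. The sketch below builds the uncorrected pairwise table from one router vector per expert (for Maverick, 128 vectors of length 5120 taken from `ffn_gate_inp`). The function names are hypothetical and no SiLU correction is applied here.

```rust
/// Cosine similarity of two equally long vectors; 0.0 if either has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Pairwise cosine table over expert router vectors (one vector per expert).
/// Row-major, n × n; this is the raw table before any correction.
fn router_cosine_table(experts: &[Vec<f32>]) -> Vec<f32> {
    let n = experts.len();
    let mut table = vec![0.0f32; n * n];
    for i in 0..n {
        for j in 0..n {
            table[i * n + j] = cosine(&experts[i], &experts[j]);
        }
    }
    table
}
```

Comparing this raw table with the corrected one entry by entry is a direct way to see which expert pairs the correction actually moves.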

---

## Temperature Fix (Do First)

Before streaming Maverick, fix the attractor collapse. It's a small, local change:

```rust
// In the living loop / MoE output selection.

// Instead of argmax:
//     let winner = peaks[0].0;

// Use nucleus sampling (top-p). `peaks` must be non-empty and sorted by
// descending energy (peaks[0] is the argmax). `u` is a uniform draw in
// [0, 1) supplied by the caller from the loop's RNG.
fn sample_nucleus(peaks: &[(usize, f32)], top_p: f32, temperature: f32, u: f32) -> usize {
    // Temperature-scaled softmax; subtract the max so exp() cannot overflow.
    let max_e = peaks[0].1;
    let scaled: Vec<f32> = peaks.iter()
        .map(|&(_, e)| ((e - max_e) / temperature).exp())
        .collect();
    let sum: f32 = scaled.iter().sum();
    let probs: Vec<f32> = scaled.iter().map(|s| s / sum).collect();

    // Nucleus: smallest prefix whose cumulative probability reaches top_p.
    let mut cumsum = 0.0;
    let mut cutoff = probs.len();
    for (i, &p) in probs.iter().enumerate() {
        cumsum += p;
        if cumsum >= top_p {
            cutoff = i + 1;
            break;
        }
    }

    // Sample inside the nucleus in proportion to probability. Uniform
    // sampling here would discard the temperature-shaped distribution.
    let mass: f32 = probs[..cutoff].iter().sum();
    let mut acc = 0.0;
    for (i, &p) in probs[..cutoff].iter().enumerate() {
        acc += p / mass;
        if u < acc {
            return peaks[i].0;
        }
    }
    peaks[cutoff - 1].0
}

// Thinking style maps to temperature:
let temperature = match thinking_style {
    Analytical | Logical | Systematic => 0.3,   // precise
    Creative | Imaginative | Playful  => 1.2,   // exploratory
    Metacognitive | Reflective        => 0.7,   // balanced
    _                                 => 0.8,   // default
};
// Thinking style maps to top-p:
let top_p = match thinking_style {
    Focused | Precise     => 0.3,
    Exploratory | Curious => 0.95,
    _                     => 0.9,
};
```

This unblocks coherent output. THEN stream Maverick with 128 real experts
through the now-working generation pipeline.

---

## File locations

```
Streaming script:     crates/thinking-engine/examples/stream_maverick.rs
Qwopus data:          crates/thinking-engine/data/Qwopus3.5-27B-v3-BF16-silu/
Living loop:          crates/thinking-engine/examples/qwopus_living.rs
MoE example:          crates/thinking-engine/examples/qwopus_moe.rs
NARS gate example:    crates/thinking-engine/examples/qwopus_nars_gate.rs
SiLU correction:      crates/thinking-engine/src/silu_correction.rs
HDR audit:            crates/thinking-engine/examples/hdr_audit.rs
Inference DAG:        crates/lance-graph-contract/src/orchestration_mode.rs
OSINT pipeline:       crates/lance-graph-osint/examples/stream_explore.rs
Felt OCR:             ndarray/src/hpc/ocr_felt.rs
SIMD OCR:             ndarray/src/hpc/ocr_simd.rs
```

## Session stats
- 61 commits (57 lance-graph + 4 ndarray)
- 235K LOC Rust
- 500+ tests across 18 crates
- All PRs merged

---

## γ+φ Golden Ratio HDR Encoding (CRITICAL for gate precision)

### The problem
The current HDR CDF produces a uniform distribution (Mean=127.5 for ALL models).
But gate weights concentrate at zero (68.9% for Qwopus).
Uniform encoding wastes resolution on far-from-zero regions where
the model already knows (strong yes/no). The decision boundary at
zero gets the SAME 1/256 resolution as obvious regions.

### The fix: γ offset + φ redistribution
Per-role gamma offsets (from variance audit):
```
Q:    γ=0.37 (narrow, less resolution needed)
K:    γ=0.94 (moderate, gate-filtered)
V:    γ=1.33 (wide, most information)
Gate: γ=1.50 (WIDEST: decision boundary needs MAX resolution)
Up:   γ=0.12 (very narrow after SiLU)
Down: γ=0.15 (funnel, compressed)
```

Golden ratio φ=1.618... ensures the redistribution has no periodic aliasing
(Weyl equidistribution theorem). The spiral stride in highheelbgz already
uses φ. The γ+φ encoding applies the same principle to u8 quantization.
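
The equidistribution property can be seen with a few lines of code. The sketch below generates the golden-ratio Weyl sequence frac(n·φ) and counts how evenly its first points fill equal-width bins. It illustrates the principle only and is not the `gamma_phi_encode` implementation; the function names are hypothetical.

```rust
/// Fractional part of the golden ratio, (sqrt(5) - 1) / 2.
/// frac(n * φ) equals frac(n * PHI_FRAC) because φ = 1 + PHI_FRAC.
const PHI_FRAC: f64 = 0.618_033_988_749_894_9;

/// n-th point of the golden-ratio Weyl sequence, in [0, 1).
fn weyl_phi(n: u64) -> f64 {
    (n as f64 * PHI_FRAC).fract()
}

/// Count how many of the first `count` points fall into each of `bins` equal bins.
fn bin_counts(count: u64, bins: usize) -> Vec<u64> {
    let mut hist = vec![0u64; bins];
    for n in 0..count {
        let idx = (weyl_phi(n) * bins as f64) as usize;
        hist[idx.min(bins - 1)] += 1;
    }
    hist
}
```

Because φ is irrational, the points never settle into a repeating pattern, and the per-bin counts stay close to `count / bins` for any prefix length. A rational stride would instead revisit the same few positions.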

### Existing code
- `bgz-tensor/src/gamma_phi.rs`: GammaProfile, gamma_phi_encode/decode
- `bgz-tensor/src/codebook_calibrated.rs`: two-pass build with γ calibration
- `highheelbgz/src/`: SpiralAddress with golden ratio stride
- `thinking-engine/data/codebooks/CODEBOOKS.md`: per-role γ values documented

### Wiring needed
- Pass 1: build CLAM codebook (existing)
- Pass 2: measure cosine distribution → compute γ offset → apply φ redistribution
- Pass 3: re-encode distance table with γ+φ skewed CDF

Expected: ~4.2 bits → ~5.5 bits entropy (30% more discrimination)
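
To check the expected entropy gain after re-encoding, the Shannon entropy of a u8 table can be measured directly. This is a minimal sketch with a hypothetical function name; which table the 4.2 and 5.5 bit figures refer to is not specified here, so the snippet only shows the measurement itself.

```rust
/// Shannon entropy in bits of the byte values in `table` (0.0 for an empty slice).
fn entropy_bits(table: &[u8]) -> f64 {
    if table.is_empty() {
        return 0.0;
    }
    // Histogram of the 256 possible byte values.
    let mut hist = [0u64; 256];
    for &b in table {
        hist[b as usize] += 1;
    }
    let total = table.len() as f64;
    hist.iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total;
            -p * p.log2()
        })
        .sum()
}
```

Running it on the same distance table before and after Pass 3 gives the two numbers to compare; 8.0 bits is the ceiling for u8 entries.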

---

## Standardized ModelPipeline DTO (6 models)

```rust
pub struct ModelPipeline {
    pub name: String,
    pub family: ModelFamily,        // Embedding, Reranker, Reader, LLM, MoE
    pub tokenizer_path: String,
    pub vocab_size: usize,
    pub hidden_dim: usize,
    pub n_layers: usize,
    pub n_experts: Option<usize>,
    pub n_centroids: usize,
    pub gate_policy: GatePolicy,
    pub gamma_profile: GammaProfile, // per-role γ offsets
    pub silu_corrected: bool,
    pub cross_model_anchor: bool,    // Jina = truth anchor
}
```

Six pipelines to wire:
1. Jina v3 (1024-dim, truth anchor, cross-model reference)
2. BGE-M3 (1024-dim, multilingual, second anchor)
3. Reader-LM 1.5B (256-dim palette, HTML→text)
4. Jina Reranker v3 (cross-encoder, relevance scoring)
5. Qwopus 27B (5120-dim, 64 layers, SSM hybrid)
6. Maverick 128E (5120-dim, 48 layers, 128 real MoE experts)

Each needs: tokenizer.json + vocab→centroid mapping + γ+φ HDR tables + SiLU correction

### Jina cross-model eval
Jina as truth anchor: for any input text, Jina embedding = ground truth similarity.
Compare: cos(jina_emb_A, jina_emb_B) vs thinking_engine_distance(A, B).
The gap = how much information our distance table loses.
With γ+φ encoding + SiLU correction: gap should shrink.
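
One simple way to turn "the gap" into a number is the mean absolute difference between the Jina cosine and the engine's similarity over a set of text pairs. The sketch below is only one possible metric, and both function names are hypothetical. The mapping from a u8 table entry to a similarity is an assumption (0 = identical, 255 = opposite) and must be replaced by the actual decode used for the distance tables.

```rust
/// Mean absolute gap between reference similarities (e.g. Jina cosines) and
/// the engine's similarities for the same text pairs, both on the same scale.
/// Returns `None` for empty or mismatched inputs.
fn mean_abs_gap(reference: &[f32], engine: &[f32]) -> Option<f32> {
    if reference.is_empty() || reference.len() != engine.len() {
        return None;
    }
    let total: f32 = reference.iter().zip(engine).map(|(r, e)| (r - e).abs()).sum();
    Some(total / reference.len() as f32)
}

/// Map a u8 distance-table entry to a similarity in [-1, 1].
/// ASSUMPTION: 0 means identical (cos = 1) and 255 means opposite (cos = -1).
fn u8_distance_to_similarity(d: u8) -> f32 {
    1.0 - 2.0 * (d as f32 / 255.0)
}
```

Tracking this value before and after the γ+φ re-encoding and SiLU correction gives a direct test of whether the gap shrinks.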