Commit 07fb51a

session: HiDream-I1 DiT+MoE diffusion indexing + Llama vision diff
1 parent 43cfad0 commit 07fb51a

1 file changed

Lines changed: 149 additions & 0 deletions

@@ -0,0 +1,149 @@
1+
# SESSION: HiDream-I1 DiT+MoE — First Diffusion Model Indexing
2+
3+
## MISSION
4+
5+
Index HiDream-I1-Full (17B DiT+MoE, MIT license) through the bgz17 pipeline.
6+
First cross-domain validation: do image generation MoE experts show the same
7+
structural redundancy as LLM MoE experts (Maverick's 123,000×)?
8+
9+
Also diff HiDream's Llama-3.1-8B text encoder against base Llama-3.1-8B
10+
to see what "learning to see" does to a language model's attention patterns.
11+
12+
## READ FIRST
13+
14+
```bash
15+
cat src/hpc/safetensors.rs # read_safetensors_header, stream_index_safetensors_bf16
16+
cat src/hpc/gguf_indexer.rs # stream_index_gguf_bf16_with_header (shared core)
17+
cat src/hpc/causal_diff.rs # causal_diff, find_reasoning_scaffold
18+
```
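For orientation, the safetensors container starts with an 8-byte little-endian `u64` header length, followed by that many bytes of UTF-8 JSON describing each tensor. The sketch below reads only that raw JSON string using the standard library. The function name `read_header_json` and the `max_len` guard are illustrative assumptions; this is not the repository's `read_safetensors_header`, whose signature is not shown here.

```rust
use std::io::{self, Read};

/// Hypothetical helper: returns the raw JSON header of a safetensors stream.
/// Layout: 8-byte little-endian u64 length N, then N bytes of UTF-8 JSON.
pub fn read_header_json<R: Read>(mut reader: R, max_len: u64) -> io::Result<String> {
    // Read the fixed-size length prefix.
    let mut len_bytes = [0u8; 8];
    reader.read_exact(&mut len_bytes)?;
    let header_len = u64::from_le_bytes(len_bytes);

    // Refuse absurd lengths so a corrupt file cannot trigger a huge allocation.
    if header_len > max_len {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "safetensors header exceeds max_len",
        ));
    }

    // Read exactly the header bytes and validate them as UTF-8.
    let mut buf = vec![0u8; header_len as usize];
    reader.read_exact(&mut buf)?;
    String::from_utf8(buf).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Passing a `std::fs::File` opened on shard 1 would return the JSON whose keys are the tensor names; parsing that JSON is left out to keep the sketch dependency-free.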
19+
20+
## MODEL MAP
21+
22+
```
23+
HiDream-ai/HiDream-I1-Full (MIT, ungated)
24+
25+
Transformer (DiT + MoE):
26+
transformer/diffusion_pytorch_model-{00001..00007}-of-00007.safetensors
27+
Shard 1: 4.99 GB
28+
Shard 2: 4.98 GB
29+
Shard 3: 4.99 GB
30+
Shard 4: 4.98 GB
31+
Shard 5: 4.99 GB
32+
Shard 6: 4.99 GB
33+
Shard 7: 4.29 GB
34+
Total: 34.21 GB
35+
36+
Text Encoders:
37+
text_encoder/model.safetensors 0.49 GB (CLIP-L)
38+
text_encoder_2/model.safetensors 2.77 GB (CLIP-G/OpenCLIP ViT-bigG)
39+
text_encoder_3/model-00001-of-00002 4.99 GB (Llama-3.1-8B shard 1)
40+
text_encoder_3/model-00002-of-00002 4.53 GB (Llama-3.1-8B shard 2)
41+
Total: 12.78 GB
42+
43+
VAE:
44+
vae/diffusion_pytorch_model.safetensors 0.16 GB
45+
46+
Grand total: ~47.15 GB
47+
```
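The subtotals in the map can be sanity-checked by summing the per-file sizes. A minimal sketch, assuming the listed sizes are accurate to two decimals; it stores them as integer hundredths of a GB so the sums are exact. All names here are illustrative and not part of the pipeline.

```rust
/// Per-file sizes from the model map, in hundredths of a GB.
pub const TRANSFORMER_SHARDS: [u32; 7] = [499, 498, 499, 498, 499, 499, 429];
pub const TEXT_ENCODERS: [u32; 4] = [49, 277, 499, 453];
pub const VAE: u32 = 16;

/// Sums a list of sizes given in hundredths of a GB.
pub fn total_centi_gb(sizes: &[u32]) -> u32 {
    sizes.iter().sum()
}

/// Formats hundredths of a GB as "X.YY GB".
pub fn format_gb(centi: u32) -> String {
    format!("{}.{:02} GB", centi / 100, centi % 100)
}

/// Sum of transformer shards, text encoders, and the VAE.
pub fn grand_total_centi_gb() -> u32 {
    total_centi_gb(&TRANSFORMER_SHARDS) + total_centi_gb(&TEXT_ENCODERS) + VAE
}
```

With the sizes as listed, the transformer shards sum to 34.21 GB, the text encoders to 12.78 GB, and everything together to 47.15 GB.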
48+
49+
## PHASE 1: Index Transformer (35 GB, ~1 hour)
50+
51+
The DiT+MoE transformer is the main target. Architecture:
52+
- DiT blocks with self-attention (Q/K/V/O projections)
53+
- MoE expert layers (gate + expert FFN)
54+
- Cross-attention (text conditioning)
55+
- Time-step embeddings
56+
57+
```bash
58+
cargo test test_stream_index_hidream_transformer \
59+
--release -- --ignored --nocapture 2>&1 | tee /tmp/hidream_transformer.log
60+
```
61+
62+
Expected compression:
63+
- MoE expert weights: 50,000-100,000× (if similar to Maverick)
64+
- Attention Q/K/V/O: 500-2,000×
65+
- Cross-attention: unknown — this is NEW (text→image conditioning)
66+
- Time embedding MLP: unknown — sinusoidal structure may compress differently
67+
68+
## PHASE 2: Index Text Encoders (13 GB, ~30 min)
69+
70+
Index all three text encoders. The Llama-3.1-8B encoder is especially
71+
interesting — it's a known architecture fine-tuned for image conditioning.
72+
73+
```bash
74+
cargo test test_stream_index_hidream_text_encoders \
75+
--release -- --ignored --nocapture
76+
```
77+
78+
## PHASE 3: Diff Llama-3.1-8B (what "seeing" adds to "reading")
79+
80+
Compare HiDream's Llama-3.1-8B (text_encoder_3) against base Llama-3.1-8B
81+
(unsloth/Llama-3.1-8B, ungated safetensors).
82+
83+
```bash
84+
# Index base Llama-3.1-8B
85+
cargo test test_stream_index_llama31_8b_base \
86+
--release -- --ignored --nocapture
87+
88+
# Diff
89+
cargo test test_hidream_llama_diff \
90+
--release -- --ignored --nocapture
91+
```
92+
93+
This diff shows which attention heads were re-routed when a language model
94+
learned to condition image generation. The Q/K/V/O shift pattern reveals
95+
what "visual grounding" looks like in weight space.
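One simple way to quantify a per-tensor shift is the relative L2 distance between the base and fine-tuned weights. The sketch below is an illustration of that idea on raw BF16 bit patterns, not the implementation of `causal_diff`; the function names are hypothetical, and the choice of relative L2 as the metric is an assumption.

```rust
/// Widens a raw BF16 bit pattern to f32.
/// BF16 is the upper 16 bits of an IEEE-754 f32, so a left shift suffices.
pub fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Relative L2 shift ||tuned - base|| / ||base|| for two BF16 tensors.
/// Returns None for mismatched lengths, empty input, or an all-zero base.
pub fn relative_shift(base: &[u16], tuned: &[u16]) -> Option<f64> {
    if base.len() != tuned.len() || base.is_empty() {
        return None;
    }
    let (mut diff_sq, mut base_sq) = (0.0f64, 0.0f64);
    for (&b, &t) in base.iter().zip(tuned) {
        let (b, t) = (bf16_to_f32(b) as f64, bf16_to_f32(t) as f64);
        diff_sq += (t - b) * (t - b);
        base_sq += b * b;
    }
    if base_sq == 0.0 {
        return None;
    }
    Some((diff_sq / base_sq).sqrt())
}

/// Sorts (tensor name, shift) pairs so the most-changed tensors come first.
pub fn rank_by_shift(mut shifts: Vec<(String, f64)>) -> Vec<(String, f64)> {
    shifts.sort_by(|a, b| b.1.total_cmp(&a.1));
    shifts
}
```

Running `relative_shift` over matching Q/K/V/O tensors from both checkpoints and passing the results to `rank_by_shift` gives a list of projections ordered by how far they moved.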
96+
97+
Cross-reference with the Qwen3.5 reasoning scaffold:
98+
- Qwen3.5 diff: what "structured reasoning" looks like (Claude distillation)
99+
- HiDream diff: what "visual grounding" looks like (image conditioning)
100+
- Same NARS pipeline, different capability injection
101+
- Do they share attention heads? If yes → multimodal reasoning is routing
102+
103+
## PHASE 4: Cross-Domain MoE Comparison
104+
105+
Compare HiDream's MoE expert compression against Maverick's:
106+
107+
```
108+
Maverick (LLM): 128 experts, 123,000× on gate/up_exps
109+
HiDream (diffusion): N experts, ???× on expert layers
110+
111+
If similar ratios → MoE structural redundancy is architecture-level,
112+
not domain-level. Experts are commodity everywhere.
113+
If different → image generation experts specialize more than
114+
language experts (domain shapes expert identity).
115+
```
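Since the ratios in question span orders of magnitude, "similar" is easier to judge in log space. The sketch below compares two compression ratios by their log10 gap and maps the result onto the two outcomes described above. The `tolerance` parameter and all names are assumptions for illustration; the brief does not define a threshold.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    /// Ratios agree within tolerance: redundancy looks architecture-level.
    ArchitectureLevel,
    /// Ratios differ by more than tolerance: redundancy looks domain-level.
    DomainLevel,
}

/// Orders-of-magnitude gap |log10(a) - log10(b)| between two ratios.
/// Returns None unless both ratios are finite and positive.
pub fn log10_gap(a: f64, b: f64) -> Option<f64> {
    let valid = |r: f64| r.is_finite() && r > 0.0;
    if !valid(a) || !valid(b) {
        return None;
    }
    Some((a.log10() - b.log10()).abs())
}

/// Classifies the comparison; `tolerance` is in orders of magnitude (assumed, e.g. 0.5).
pub fn verdict(maverick: f64, hidream: f64, tolerance: f64) -> Option<Verdict> {
    let gap = log10_gap(maverick, hidream)?;
    Some(if gap <= tolerance {
        Verdict::ArchitectureLevel
    } else {
        Verdict::DomainLevel
    })
}
```

The first argument would be Maverick's 123,000× and the second the measured HiDream expert ratio once Phase 1 has produced it.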
116+
117+
## EXPECTED RESULTS
118+
119+
```
120+
HiDream DiT+MoE transformer (35 GB):
121+
Conservative: 5-10 MB (3,500-7,000×)
122+
If MoE-heavy: 1-3 MB (12,000-35,000×)
123+
124+
CLIP-L (0.49 GB): ~100 KB (5,000×)
125+
CLIP-G (2.77 GB): ~500 KB (5,500×)
126+
Llama-3.1-8B (9.52 GB): ~2 MB (5,000×)
127+
128+
Total ~48 GB: → ~3-13 MB
129+
```
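The ratios in this table are original size divided by index size. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes, 1 KB = 10^3 bytes); the constants and function name are illustrative.

```rust
pub const KB: u64 = 1_000;
pub const MB: u64 = 1_000_000;
pub const GB: u64 = 1_000_000_000;

/// Compression ratio = original bytes / index bytes.
/// Returns None when the index size is zero.
pub fn compression_ratio(original_bytes: u64, index_bytes: u64) -> Option<f64> {
    if index_bytes == 0 {
        return None;
    }
    Some(original_bytes as f64 / index_bytes as f64)
}
```

Under these units, 35 GB indexed into 10 MB is 3,500× and into 5 MB is 7,000×, matching the conservative row; 0.49 GB into 100 KB is 4,900×, which the table rounds to 5,000×.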
130+
131+
## CRITICAL NOTES
132+
133+
1. Use safetensors path: stream_index_safetensors_bf16 (BF16 precision)
134+
2. Tensor names will differ from GGUF conventions — classify_tensor and
135+
classify_projection may need HiDream-specific patterns
136+
3. Check tensor names in shard 1 header first: the naming convention
137+
determines whether classify_tensor catches attention/FFN/MoE correctly
138+
4. If MoE expert tensors are named differently than llama.cpp convention,
139+
add patterns to classify_tensor BEFORE running (or they'll be classified
140+
as generic Attention and compress at lower ratios)
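The notes above amount to a pre-flight check: classify every tensor name from the shard 1 header and list the ones no pattern catches. The sketch below shows one way to do that. The enum, function names, and substring patterns are hypothetical stand-ins, not the repository's `classify_tensor`; the `_exps` and `ffn_gate_inp` patterns follow llama.cpp-style naming, while `.experts.` and `.gate.weight` are guesses at a Hugging Face-style layout that should be verified against the real header.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TensorClass {
    MoeExpert,
    MoeRouter,
    Attention,
    Other,
}

/// Hypothetical name-based classifier; all patterns are illustrative assumptions.
pub fn classify_by_name(name: &str) -> TensorClass {
    let n = name.to_ascii_lowercase();
    // Check MoE patterns first so expert tensors never fall through to a generic class.
    if n.contains("_exps") || n.contains(".experts.") {
        TensorClass::MoeExpert
    } else if n.contains("ffn_gate_inp") || n.ends_with(".gate.weight") {
        TensorClass::MoeRouter
    } else if n.contains("attn") {
        TensorClass::Attention
    } else {
        TensorClass::Other
    }
}

/// Returns the names that no pattern matched, for review before indexing.
pub fn unmatched<'a>(names: &[&'a str]) -> Vec<&'a str> {
    names
        .iter()
        .copied()
        .filter(|n| classify_by_name(n) == TensorClass::Other)
        .collect()
}
```

If `unmatched` returns names that look like expert or attention weights, that is the signal to add patterns before starting the hour-long indexing run.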
141+
142+
## DELIVERABLES
143+
144+
1. bgz7 indexes: /tmp/hidream_transformer_shard{01-07}.bgz7
145+
2. bgz7 indexes: /tmp/hidream_clip_l.bgz7, /tmp/hidream_clip_g.bgz7
146+
3. bgz7 indexes: /tmp/hidream_llama_enc.bgz7 (combined shards)
147+
4. bgz7 indexes: /tmp/llama31_8b_base.bgz7
148+
5. Diff results: .claude/knowledge/hidream_results.md
149+
6. Cross-domain MoE comparison: .claude/knowledge/moe_cross_domain.md
