|
| 1 | +# SESSION: HiDream-I1 DiT+MoE — First Diffusion Model Indexing |
| 2 | + |
| 3 | +## MISSION |
| 4 | + |
| 5 | +Index HiDream-I1-Full (17B DiT+MoE, MIT license) through the bgz17 pipeline. |
| 6 | +First cross-domain validation: do image generation MoE experts show the same |
| 7 | +structural redundancy as LLM MoE experts (Maverick's 123,000×)? |
| 8 | + |
| 9 | +Also diff HiDream's Llama-3.1-8B text encoder against base Llama-3.1-8B |
| 10 | +to see what "learning to see" does to a language model's attention patterns. |
| 11 | + |
| 12 | +## READ FIRST |
| 13 | + |
| 14 | +```bash |
| 15 | +cat src/hpc/safetensors.rs # read_safetensors_header, stream_index_safetensors_bf16 |
| 16 | +cat src/hpc/gguf_indexer.rs # stream_index_gguf_bf16_with_header (shared core) |
| 17 | +cat src/hpc/causal_diff.rs # causal_diff, find_reasoning_scaffold |
| 18 | +``` |
| 19 | + |
| 20 | +## MODEL MAP |
| 21 | + |
| 22 | +``` |
| 23 | +HiDream-ai/HiDream-I1-Full (MIT, ungated) |
| 24 | +
|
| 25 | +Transformer (DiT + MoE): |
| 26 | + transformer/diffusion_pytorch_model-{00001..00007}-of-00007.safetensors |
| 27 | + Shard 1: 4.99 GB |
| 28 | + Shard 2: 4.98 GB |
| 29 | + Shard 3: 4.99 GB |
| 30 | + Shard 4: 4.98 GB |
| 31 | + Shard 5: 4.99 GB |
| 32 | + Shard 6: 4.99 GB |
| 33 | + Shard 7: 4.29 GB |
| 34 | + Total: 34.21 GB |
| 35 | +
|
| 36 | +Text Encoders: |
| 37 | + text_encoder/model.safetensors 0.49 GB (CLIP-L) |
| 38 | + text_encoder_2/model.safetensors 2.77 GB (CLIP-G/OpenCLIP ViT-bigG) |
| 39 | + text_encoder_3/model-00001-of-00002 4.99 GB (Llama-3.1-8B shard 1) |
| 40 | + text_encoder_3/model-00002-of-00002 4.53 GB (Llama-3.1-8B shard 2) |
| 41 | + Total: 12.78 GB |
| 42 | +
|
| 43 | +VAE: |
| 44 | + vae/diffusion_pytorch_model.safetensors 0.16 GB |
| 45 | +
|
| 46 | +Grand total: ~47.15 GB |
| 47 | +``` |
| 48 | + |
| 49 | +## PHASE 1: Index Transformer (35 GB, ~1 hour) |
| 50 | + |
| 51 | +The DiT+MoE transformer is the main target. Architecture: |
| 52 | +- DiT blocks with self-attention (Q/K/V/O projections) |
| 53 | +- MoE expert layers (gate + expert FFN) |
| 54 | +- Cross-attention (text conditioning) |
| 55 | +- Time-step embeddings |
| 56 | + |
| 57 | +```bash |
| 58 | +cargo test test_stream_index_hidream_transformer \ |
| 59 | + --release -- --ignored --nocapture 2>&1 | tee /tmp/hidream_transformer.log |
| 60 | +``` |
| 61 | + |
| 62 | +Expected compression: |
| 63 | +- MoE expert weights: 50,000-100,000× (if similar to Maverick) |
| 64 | +- Attention Q/K/V/O: 500-2,000× |
| 65 | +- Cross-attention: unknown — this is NEW (text→image conditioning) |
| 66 | +- Time embedding MLP: unknown — sinusoidal structure may compress differently |
| 67 | + |
| 68 | +## PHASE 2: Index Text Encoders (13 GB, ~30 min) |
| 69 | + |
| 70 | +Index all three text encoders. The Llama-3.1-8B encoder is especially |
| 71 | +interesting — it's a known architecture fine-tuned for image conditioning. |
| 72 | + |
| 73 | +```bash |
| 74 | +cargo test test_stream_index_hidream_text_encoders \ |
| 75 | + --release -- --ignored --nocapture |
| 76 | +``` |
| 77 | + |
| 78 | +## PHASE 3: Diff Llama-3.1-8B (what "seeing" adds to "reading") |
| 79 | + |
| 80 | +Compare HiDream's Llama-3.1-8B (text_encoder_3) against base Llama-3.1-8B |
| 81 | +(unsloth/Llama-3.1-8B, ungated safetensors). |
| 82 | + |
| 83 | +```bash |
| 84 | +# Index base Llama-3.1-8B |
| 85 | +cargo test test_stream_index_llama31_8b_base \ |
| 86 | + --release -- --ignored --nocapture |
| 87 | + |
| 88 | +# Diff |
| 89 | +cargo test test_hidream_llama_diff \ |
| 90 | + --release -- --ignored --nocapture |
| 91 | +``` |
| 92 | + |
| 93 | +The diff shows which attention heads re-routed when a language model |
| 94 | +learned to condition image generation. The Q/K/V/O shift pattern reveals |
| 95 | +what "visual grounding" looks like in weight space. |
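One way to make "which heads re-routed" concrete is a per-head relative shift between the base and fine-tuned copies of a projection. The sketch below is an illustration only, not the logic inside `causal_diff`: the function name `per_head_shift`, the f32 input, and the row-major `[n_heads * head_dim, hidden]` layout are all assumptions.

```rust
/// Relative L2 shift per attention head between two copies of one projection.
///
/// Assumption: both slices hold a row-major [n_heads * head_dim, hidden] matrix,
/// so each head occupies one contiguous chunk (true for Q/K/V under this layout;
/// an O projection would need its columns grouped instead).
/// Returns ||tuned - base|| / ||base|| for every head.
pub fn per_head_shift(base: &[f32], tuned: &[f32], n_heads: usize) -> Vec<f32> {
    assert_eq!(base.len(), tuned.len(), "tensors must have the same shape");
    assert!(!base.is_empty() && n_heads > 0, "need data and at least one head");
    assert!(base.len() % n_heads == 0, "length must split evenly into heads");

    let per_head = base.len() / n_heads;
    base.chunks(per_head)
        .zip(tuned.chunks(per_head))
        .map(|(b, t)| {
            // Squared norm of the difference and of the base weights for this head.
            let diff_sq: f32 = b.iter().zip(t).map(|(x, y)| (y - x) * (y - x)).sum();
            let base_sq: f32 = b.iter().map(|x| x * x).sum();
            if base_sq == 0.0 {
                0.0 // avoid dividing by zero for an all-zero head
            } else {
                (diff_sq / base_sq).sqrt()
            }
        })
        .collect()
}
```

Heads whose shift sits well above the layer median are candidates for "re-routed"; where to put that cutoff is a judgment call this sketch does not settle.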
| 96 | + |
| 97 | +Cross-reference with the Qwen3.5 reasoning scaffold: |
| 98 | +- Qwen3.5 diff: what "structured reasoning" looks like (Claude distillation) |
| 99 | +- HiDream diff: what "visual grounding" looks like (image conditioning) |
| 100 | +- Same NARS pipeline, different capability injection |
| 101 | +- Do they share attention heads? If yes → multimodal reasoning is routing |
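The "do they share attention heads" question can be scored as a set overlap. The sketch below uses a Jaccard index over `(layer, head)` pairs. The names `HeadId` and `head_overlap` are hypothetical, and it assumes head indices from the two diffs have already been mapped to a comparable indexing, which is not automatic across different base architectures.

```rust
use std::collections::HashSet;

/// (layer index, head index) of a head flagged as shifted by a diff.
pub type HeadId = (usize, usize);

/// Jaccard overlap |A ∩ B| / |A ∪ B| between two sets of shifted heads.
/// Returns 0.0 when both sets are empty.
pub fn head_overlap(a: &HashSet<HeadId>, b: &HashSet<HeadId>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}
```

A value near 1.0 would support the "same heads, different capability" reading; a value near 0.0 would suggest the two capabilities land in different heads.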
| 102 | + |
| 103 | +## PHASE 4: Cross-Domain MoE Comparison |
| 104 | + |
| 105 | +Compare HiDream's MoE expert compression against Maverick's: |
| 106 | + |
| 107 | +``` |
| 108 | +Maverick (LLM): 128 experts, 123,000× on gate/up_exps |
| 109 | +HiDream (diffusion): N experts, ???× on expert layers |
| 110 | +
|
| 111 | +If similar ratios → MoE structural redundancy is architecture-level, |
| 112 | + not domain-level. Experts are commodity everywhere. |
| 113 | +If different → image generation experts specialize more than |
| 114 | + language experts (domain shapes expert identity). |
| 115 | +``` |
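Because the ratios span orders of magnitude, "similar" is easier to judge on a log scale. The sketch below is one possible decision rule, not part of the pipeline: `MoeVerdict`, `compare_ratios`, and the `max_log10_gap` tolerance are all hypothetical, and the tolerance has to be chosen by whoever runs the comparison.

```rust
/// Outcome of the cross-domain comparison described above.
#[derive(Debug, PartialEq)]
pub enum MoeVerdict {
    /// Ratios are close: redundancy looks architecture-level.
    ArchitectureLevel,
    /// Ratios differ: the domain shapes expert identity.
    DomainLevel,
}

/// Compare two compression ratios by their gap in orders of magnitude.
/// `max_log10_gap` is an assumed tolerance (0.5 means "within about 3x").
pub fn compare_ratios(llm_ratio: f64, diffusion_ratio: f64, max_log10_gap: f64) -> MoeVerdict {
    assert!(llm_ratio > 0.0 && diffusion_ratio > 0.0, "ratios must be positive");
    let gap = (llm_ratio.log10() - diffusion_ratio.log10()).abs();
    if gap <= max_log10_gap {
        MoeVerdict::ArchitectureLevel
    } else {
        MoeVerdict::DomainLevel
    }
}
```

For example, with a tolerance of 0.5, a HiDream ratio of 80,000× against Maverick's 123,000× counts as similar, while 2,000× does not.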
| 116 | + |
| 117 | +## EXPECTED RESULTS |
| 118 | + |
| 119 | +``` |
| 120 | +HiDream DiT+MoE transformer (35 GB): |
| 121 | + Conservative: 5-10 MB (3,500-7,000×) |
| 122 | + If MoE-heavy: 1-3 MB (12,000-35,000×) |
| 123 | + |
| 124 | +CLIP-L (0.49 GB): ~100 KB (5,000×) |
| 125 | +CLIP-G (2.77 GB): ~500 KB (5,500×) |
| 126 | +Llama-3.1-8B (9.52 GB): ~2 MB (5,000×) |
| 127 | +
|
| 128 | +Total ~48 GB: → ~3-13 MB |
| 129 | +``` |
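The ratios in the block above follow from dividing the original size by the index size. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes); the helper names are hypothetical:

```rust
/// Decimal units, assumed to match how the sizes above are quoted.
pub const GB: f64 = 1e9;
pub const MB: f64 = 1e6;

/// Compression ratio: original size divided by index size.
pub fn compression_ratio(original_bytes: f64, index_bytes: f64) -> f64 {
    assert!(index_bytes > 0.0, "index size must be positive");
    original_bytes / index_bytes
}

/// Index size implied by a target compression ratio.
pub fn index_size_bytes(original_bytes: f64, ratio: f64) -> f64 {
    assert!(ratio > 0.0, "ratio must be positive");
    original_bytes / ratio
}
```

For instance, `compression_ratio(35.0 * GB, 5.0 * MB)` gives 7,000, the upper end of the conservative transformer estimate.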
| 130 | + |
| 131 | +## CRITICAL NOTES |
| 132 | + |
| 133 | +1. Use safetensors path: stream_index_safetensors_bf16 (BF16 precision) |
| 134 | +2. Tensor names will differ from GGUF conventions — classify_tensor and |
| 135 | + classify_projection may need HiDream-specific patterns |
| 136 | +3. Check tensor names in shard 1 header first: the naming convention |
| 137 | + determines whether classify_tensor catches attention/FFN/MoE correctly |
| 138 | +4. If MoE expert tensors are named differently from the llama.cpp convention, |
| 139 | + add patterns to classify_tensor BEFORE running (or they'll be classified |
| 140 | + as generic Attention and compress at lower ratios) |
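Notes 3 and 4 come down to reading tensor names before indexing. A safetensors file starts with an 8-byte little-endian header length followed by that many bytes of UTF-8 JSON whose top-level keys are the tensor names (plus an optional `__metadata__` entry). The sketch below uses only the standard library and is independent of the repo's `read_safetensors_header`; the names `read_header_json` and `top_level_keys` and the 100 MiB sanity bound are assumptions made for illustration.

```rust
use std::io::{self, Read};

/// Read the JSON header of a safetensors stream as a string.
pub fn read_header_json<R: Read>(mut reader: R) -> io::Result<String> {
    let mut len_bytes = [0u8; 8];
    reader.read_exact(&mut len_bytes)?;
    let len = u64::from_le_bytes(len_bytes);
    // Arbitrary sanity bound so a corrupt file cannot trigger a huge allocation.
    if len > 100 * 1024 * 1024 {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "header too large"));
    }
    let mut buf = vec![0u8; len as usize];
    reader.read_exact(&mut buf)?;
    String::from_utf8(buf).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

/// Collect the top-level keys of a JSON object without a JSON library.
/// Escape sequences inside keys are not decoded (the escaped character is kept).
pub fn top_level_keys(json: &str) -> Vec<String> {
    let mut keys = Vec::new();
    let mut current = String::new();
    let mut depth = 0usize;
    let (mut in_string, mut escaped) = (false, false);
    let (mut expecting_key, mut capturing) = (false, false);
    for c in json.chars() {
        if in_string {
            if escaped {
                escaped = false;
                if capturing { current.push(c); }
            } else if c == '\\' {
                escaped = true;
            } else if c == '"' {
                in_string = false;
                if capturing {
                    keys.push(std::mem::take(&mut current));
                    expecting_key = false;
                }
            } else if capturing {
                current.push(c);
            }
            continue;
        }
        match c {
            '"' => {
                in_string = true;
                // Only the first string after '{' or ',' at depth 1 is a key.
                capturing = depth == 1 && expecting_key;
            }
            '{' | '[' => {
                depth += 1;
                if depth == 1 { expecting_key = true; }
            }
            '}' | ']' => depth = depth.saturating_sub(1),
            ',' if depth == 1 => expecting_key = true,
            _ => {}
        }
    }
    keys
}
```

Passing a `std::fs::File` opened on shard 1 to `read_header_json` and printing `top_level_keys` (skipping `__metadata__`) gives the name list to check against the existing `classify_tensor` patterns.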
| 141 | + |
| 142 | +## DELIVERABLES |
| 143 | + |
| 144 | +1. bgz7 indexes: /tmp/hidream_transformer_shard{01-07}.bgz7 |
| 145 | +2. bgz7 indexes: /tmp/hidream_clip_l.bgz7, /tmp/hidream_clip_g.bgz7 |
| 146 | +3. bgz7 indexes: /tmp/hidream_llama_enc.bgz7 (combined shards) |
| 147 | +4. bgz7 indexes: /tmp/llama31_8b_base.bgz7 |
| 148 | +5. Diff results: .claude/knowledge/hidream_results.md |
| 149 | +6. Cross-domain MoE comparison: .claude/knowledge/moe_cross_domain.md |