
Commit 0829285

unamedkr and claude committed

docs(tier): Qwen3.6-27B llama sub-op tensor reference + split verified

Captured llama-debug's named sub-op tensors at L0 of Qwen3.6-27B:

    attn_norm-N        = MUL(norm × attn_norm.weight)
    conv_input-N       = concat(conv_states, qkv_mixed_transposed), shape {5, 10240}
    conv_output_raw-N  = SSM_CONV(input, conv1d.weight)
    conv_output_silu-N = SILU(conv_output_raw)
    q_conv-N           VIEW {128, 16, n_tokens} ← offset 0
    k_conv-N           VIEW {128, 16, n_tokens} ← offset 16×128=2048
    v_conv_predelta-N  VIEW {128, 48, n_tokens} ← offset 2×16×128=4096

VERIFIED: Q/K/V split offsets match ours exactly. Both engines extract Q at
offset 0, K at offset 2048, and V at offset 4096 of the 10240-dim conv output,
and the channel layout is identical. Yet the element-level diff at L0 element 2
is 22× in magnitude with a sign flip.

The bug must therefore be in one of:
- ssm_conv1d weight load/use (shape {4, 10240}; A3B uses {4, 8192})
- L2_NORM op (we may differ from ggml_l2_norm)
- input_layernorm boundary handling at hidden=5120
- BOS handling (verified: both engines DO add BOS, so not this)

Updated docs/tier_benchmark_2026_04_25.md with the named-tensor reference for
the next session's paired-diff investigation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 2873364 commit 0829285
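
For reference, the offset arithmetic behind the 2048/4096/10240 figures in the message, written out as a tiny self-check; the head counts and head dim come from the message, and the variable names are purely illustrative:

```
# Offset arithmetic behind the verified Q/K/V split (numbers from the commit
# message; variable names are illustrative, not engine identifiers).
head_dim   = 128
n_qk_heads = 16   # Q heads == K heads
n_v_heads  = 48

q_offset = 0
k_offset = n_qk_heads * head_dim                      # 16 * 128 = 2048
v_offset = 2 * n_qk_heads * head_dim                  # 2 * 16 * 128 = 4096
conv_dim = (2 * n_qk_heads + n_v_heads) * head_dim    # (16 + 16 + 48) * 128 = 10240

assert (q_offset, k_offset, v_offset, conv_dim) == (0, 2048, 4096, 10240)
```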

1 file changed

Lines changed: 23 additions & 0 deletions


docs/tier_benchmark_2026_04_25.md

@@ -101,6 +101,29 @@ Sum-level diff was 247% (23.5 vs 6.77). Element-level shows OUTLIER CHANNELS pat

**Concrete next investigation step**: dump first 20 elements of each named tensor at L0 (post_embed, attn_norm_out, qkv_proj_out, conv1d_out, q_split, k_split, v_split, q_l2norm, k_l2norm, gate_silu, delta_state, delta_out, ssm_norm_out, residual). First materially-divergent step localizes the bug.
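
A minimal sketch (Python) of the paired-diff pass described above, assuming each engine dumps the first 20 elements of every named L0 tensor to `.npy` files under per-engine directories; the directory layout, file names, and tolerance below are hypothetical, not something either engine currently produces:

```
# Hypothetical paired-diff driver: compare the first 20 elements of each named
# L0 tensor from the two engines and stop at the first materially divergent step.
import numpy as np

TENSORS = ["post_embed", "attn_norm_out", "qkv_proj_out", "conv1d_out",
           "q_split", "k_split", "v_split", "q_l2norm", "k_l2norm",
           "gate_silu", "delta_state", "delta_out", "ssm_norm_out", "residual"]

def first_divergence(ours_dir="dumps/ours", llama_dir="dumps/llama", rel_tol=1e-2):
    for name in TENSORS:
        a = np.load(f"{ours_dir}/{name}.npy").ravel()[:20].astype(np.float64)
        b = np.load(f"{llama_dir}/{name}.npy").ravel()[:20].astype(np.float64)
        rel = np.abs(a - b) / np.maximum(np.abs(b), 1e-6)
        print(f"{name:14s} max_rel={rel.max():.3e} at element {rel.argmax()}")
        if rel.max() > rel_tol:
            return name  # first step where the engines materially diverge
    return None

print("first divergent tensor:", first_divergence())
```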

**llama sub-op tensor names captured (2026-04-25)** for paired-diff:
```
attn_norm-N = MUL(norm output × attn_norm.weight)
linear_attn_qkv_mixed-N shape {10240, n_tokens}
conv_states_reshaped-N shape {3, 10240} (conv buffer state)
conv_input-N shape {5, 10240} (concat states + new)
conv_output_raw-N SSM_CONV(input, conv1d.weight)
conv_output_silu-N SILU(conv_output_raw)
q_conv-N VIEW shape {128, 16, n_tokens} ← Q at offset 0
q_conv_predelta-N L2_NORM(q_conv)
k_conv-N VIEW shape {128, 16, n_tokens} ← K at offset 16×128=2048
k_conv_predelta-N L2_NORM(k_conv)
v_conv_predelta-N VIEW shape {128, 48, n_tokens} ← V at offset 2×16×128=4096
```
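
A sketch of the chain above in NumPy, reading SSM_CONV as a depthwise causal 1-D convolution (kernel width 4, one filter per channel) followed by SiLU and the three element-offset views. This is a generic reading, not ggml's code: the two new-token columns implied by the {5, 10240} input are inferred from d_conv = 4, and the `weight[k, channel]` indexing is exactly the layout assumption flagged in the list below, so verify both against `ggml_ssm_conv` before leaning on the comparison:

```
import numpy as np

d_conv, channels, head_dim = 4, 10240, 128
n_qk_heads, n_v_heads = 16, 48

# (time, channel) layout for readability; ggml lists the dims the other way round.
conv_input = np.random.randn(d_conv - 1 + 2, channels)  # {5, 10240}: 3 carried state rows + 2 new tokens (inferred)
weight     = np.random.randn(d_conv, channels)          # ssm_conv1d weight {4, 10240}
n_tokens   = conv_input.shape[0] - d_conv + 1           # = 2

# depthwise causal conv: out[t, c] = sum_k conv_input[t + k, c] * weight[k, c]
conv_raw  = np.stack([(conv_input[t:t + d_conv] * weight).sum(axis=0)
                      for t in range(n_tokens)])        # conv_output_raw, {n_tokens, 10240}
conv_silu = conv_raw / (1.0 + np.exp(-conv_raw))        # SILU(x) = x * sigmoid(x)

# element-offset views matching q_conv / k_conv / v_conv_predelta
q = conv_silu[:, 0:2048].reshape(n_tokens, n_qk_heads, head_dim)      # offset 0
k = conv_silu[:, 2048:4096].reshape(n_tokens, n_qk_heads, head_dim)   # offset 16*128
v = conv_silu[:, 4096:10240].reshape(n_tokens, n_v_heads, head_dim)   # offset 2*16*128
```
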
**Verified split offsets match ours**: Q at 0, K at 2048, V at 4096. Our split is `delta_qkv[0:2048]` → Q, `delta_qkv[2048:4096]` → K, `delta_qkv[4096:10240]` → V — ✓ identical.

**Suspected at this point** (since shape/split/load all verified):
- ssm_conv1d weight: shape `{4, 10240}` in GGUF; our loader assumes a specific layout. Verify how we read this 2-D weight with channel dim 10240 (A3B's is 8192).
- L2_NORM op specifics — we may apply it differently than llama's `ggml_l2_norm`; see the sketch after this list.
- input_layernorm to DN: we use `attn_norm` weight; verify boundary handling for hidden=5120.
- BOS token handling differences (both engines DO add BOS, confirmed).
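
To make the L2_NORM suspect concrete, here are the two formulations that most often get mixed up: norm clamped by eps versus eps added inside the square root. They agree for healthy rows but diverge visibly for small-norm rows; which variant (and which eps) `ggml_l2_norm` actually uses should be read from the ggml source rather than assumed:

```
import numpy as np

def l2_norm_clamped(x, eps=1e-6):
    # x / max(||x||, eps), per row over the last axis (e.g. per 128-dim head)
    n = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(n, eps)

def l2_norm_eps_inside(x, eps=1e-6):
    # x / sqrt(||x||^2 + eps): matches the clamped form for large norms,
    # visibly different for near-zero rows
    return x / np.sqrt((x * x).sum(axis=-1, keepdims=True) + eps)

row = np.array([[1e-4, -2e-4, 3e-4]])
print(l2_norm_clamped(row))      # ~unit-norm output
print(l2_norm_eps_inside(row))   # much smaller output: the variants disagree here
```
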
**Memory**: at 16.8 GB, the Q4_K_M model on a 16 GB RAM Mac makes evaluation impractical (constant swapping, ~0.3 tok/s; a -n 30 test took 15+ min). For users wanting to test the 27B, smaller quants are available:
- UD-IQ2_M: 10.1 GB (recommended for 16 GB RAM)
- UD-Q2_K_XL: 11.0 GB
