=== INFERENCE PIPELINE DIAGNOSTICS ===
Loading model: models--mlx-community--Granite-4.0-H-Tiny-4bit-DWQ/snapshots/a892ded1552d6d4089fa644bbff6ccbc54dddc67
Model loaded.
--- TEST 1: Basic GPU ops ---
[DIAG] matmul(ones, 2*ones) expect=8 shape=(4,4) min=8.000000 max=8.000000 mean=8.000000 |mean|=8.000000
[VALS] matmul result: [8.0000, 8.0000, 8.0000, 8.0000, 8.0000, 8.0000, 8.0000, 8.0000]
[hipBLASLt] first call
[hipBLASLt] M=4 N=4 K=4 ta=0 tb=0 lda=4 ldb=4 ldc=4
[DIAG] bf16 matmul expect=8 shape=(4,4) min=8.000000 max=8.000000 mean=8.000000 |mean|=8.000000
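The TEST 1 expectation is easy to reproduce on the CPU: each output element of ones(4,4) @ (2*ones(4,4)) sums K=4 products of 1*2, so every entry must be exactly 8. A minimal numpy check:

```python
import numpy as np

# TEST 1 reference: every element sums K=4 products of 1*2 -> exactly 8.
a = np.ones((4, 4), dtype=np.float32)
b = 2 * np.ones((4, 4), dtype=np.float32)
c = a @ b
print(c.min(), c.max(), c.mean())  # all 8.0
```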
--- TEST 2: quantized_matmul vs dequant ---
[DIAG] q_proj weights not found (w=0 s=0 b=0)
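TEST 2 was skipped because the q_proj tensors were not found, but the comparison it would have run can be sketched as: quantized_matmul(x, w_q, scales, biases) should match a dense matmul against the dequantized weights. The group size of 32 and the affine scheme w = q * scale + bias are assumptions about the 4-bit format, not confirmed from this log:

```python
import numpy as np

# Reference for the skipped TEST 2: dequantize group-wise, then dense matmul.
# Group size 32 and the affine scheme (w = q * scale + bias) are assumptions.
group = 32
rng = np.random.default_rng(0)
q = rng.integers(0, 16, size=(8, 64))                    # 4-bit codes per weight
scales = rng.random((8, 64 // group)).astype(np.float32)
biases = rng.random((8, 64 // group)).astype(np.float32)

w = q.reshape(8, -1, group) * scales[..., None] + biases[..., None]
w = w.reshape(8, 64).astype(np.float32)                  # dequantized weights

x = rng.random((1, 64)).astype(np.float32)
ref = x @ w.T                                            # dense reference output
```

A working quantized_matmul would be asserted element-wise close to `ref` within quantization tolerance.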
--- TEST 3: RMS Norm ---
[DIAG] rms_norm([1,2,3,4]) shape=(1,1,4) min=0.365148 max=1.460593 mean=0.912871 |mean|=0.912871
[VALS] rms_norm([1,2,3,4]) expect≈[.365,.730,1.095,1.461]: [0.3651, 0.7303, 1.0954, 1.4606]
[DIAG] rms_norm(rand bf16 4096) shape=(1,3,4096) min=-3.625000 max=3.531250 mean=0.007225 |mean|=0.797776
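The TEST 3 expectation follows directly from the RMS norm definition, x / sqrt(mean(x^2) + eps) (before any learned weight): for [1,2,3,4], mean(x^2) = 30/4 = 7.5 and sqrt(7.5) ≈ 2.7386, giving the [0.365, 0.730, 1.095, 1.461] values above. A sketch, with eps = 1e-5 as an assumed default:

```python
import numpy as np

# RMS norm reference: x / sqrt(mean(x^2) + eps); eps value is an assumption.
def rms_norm(x, eps=1e-5):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

y = rms_norm(np.array([1.0, 2.0, 3.0, 4.0]))  # ≈ [0.365, 0.730, 1.095, 1.461]
```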
--- TEST 4: RoPE ---
[DIAG] rope(ones, off=0) shape=(1,1,1,128) min=1.000000 max=1.000000 mean=1.000000 |mean|=1.000000
[VALS] rope(ones, off=0): [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000]
[DIAG] rope(ones, off=100) shape=(1,1,1,128) min=-1.414062 max=1.406250 mean=0.586235 |mean|=0.953492
[VALS] rope(ones, off=100): [1.3672, 1.3438, -1.3672, -1.3516, 0.7305, -1.3828, -1.4062, -0.9219, 1.3594, -1.1719, 1.3750, -1.1094, -0.5898, 1.2109, 1.1406, -0.0040, -0.9805, -1.3906, -1.3516, -1.0781]
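The offset=0 result above is the key invariant TEST 4 relies on: every rotation angle is position * inv_freq, so at position 0 all angles are zero and the input passes through unchanged. A sketch of split-half (non-interleaved) RoPE; the base of 10000 and the pairing convention are assumptions, and the model's actual rope_theta may differ, which is why only the offset=0 identity is checked:

```python
import numpy as np

# Split-half RoPE sketch. base=10000 and non-interleaved pairing are
# assumptions; at offset=0 all angles are zero, so rope(x, 0) == x.
def rope(x, offset, base=10000.0):
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)    # one frequency per pair
    theta = offset * inv_freq                       # rotation angles
    x1, x2 = x[..., : d // 2], x[..., d // 2 :]
    return np.concatenate(
        [x1 * np.cos(theta) - x2 * np.sin(theta),
         x1 * np.sin(theta) + x2 * np.cos(theta)], axis=-1)

y0 = rope(np.ones(128), offset=0)                   # identity rotation
```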
--- TEST 5: Full forward pass ---
[DIAG] logits(token=1) shape=(1,1,100352) min=-56.392059 max=106.358070 mean=-12.956909 |mean|=13.429077
[VALS] logits(token=1): [29.7169, 106.3581, 29.6625, 30.8870, 25.5035, 26.7016, 63.0159, 45.6126, 40.3120, 39.8485, 26.7892, 48.8852, 48.5656, 48.9910, 41.7294, 24.7933, 34.7946, 31.9800, 27.5472, 24.7337, 22.3976, 17.8242, 17.8524, 17.3758, 16.2437, 42.6530, 32.3579, 28.8633, 26.7248, 28.7879]
[DIAG] Top-10:
token=100257 logit=27.1867
token=100260 logit=19.6822
token=100259 logit=17.4482
token=100258 logit=5.2841
token=99703 logit=4.3919
token=99519 logit=0.1561
token=99362 logit=-1.0377
token=99783 logit=-2.8212
token=99542 logit=-2.9113
token=99809 logit=-3.0971
[DIAG] logits(step2) shape=(1,1,100352) min=-27.977913 max=23.408852 mean=-4.440776 |mean|=5.364185
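A Top-10 table like the one above can be extracted without fully sorting the 100,352-entry vocabulary by using argpartition and then ordering only the selected candidates. The logits here are synthetic stand-ins:

```python
import numpy as np

# Top-k over a large vocab: partial partition, then sort only k entries.
rng = np.random.default_rng(1)
logits = rng.standard_normal(100352).astype(np.float32)  # synthetic logits
top = np.argpartition(logits, -10)[-10:]                 # 10 largest, unordered
top = top[np.argsort(logits[top])[::-1]]                 # descending by logit
for t in top[:3]:
    print(f"token={t} logit={logits[t]:.4f}")
```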
--- TEST 6: dequantize() sanity ---
[DIAG] dequant([0..7],s=1,b=0) shape=(1,8) min=0.000000 max=7.000000 mean=3.500000 |mean|=3.500000
[VALS] dequant expect=[0,1,2,3,4,5,6,7]: [0.0000, 1.0000, 2.0000, 3.0000, 4.0000, 5.0000, 6.0000, 7.0000]
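The TEST 6 sanity check is the affine dequantization formula, value = q * scale + bias: with scale = 1 and bias = 0 the codes [0..7] must come back unchanged, which is exactly the logged result:

```python
import numpy as np

# TEST 6 formula: dequantize(q, scale, bias) = q * scale + bias.
# With scale=1, bias=0 the codes pass through unchanged.
q = np.arange(8)
out = q * 1.0 + 0.0  # -> [0., 1., 2., 3., 4., 5., 6., 7.]
```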
--- TEST 6b: Warmup pass ---
[DIAG] warmup logits shape=(1,1,100352) min=-30.554613 max=24.993921 mean=-4.758521 |mean|=5.718008
[DIAG] Warmup complete
--- TEST 7: Token-level generation trace ---
[DIAG] encode("What is 2+2?") = [3923, 374, 220, 17, 10, 17, 30] (7 tokens)
[DIAG] Token-by-token decode:
token 3923 -> "What"
token 374 -> " is"
token 220 -> " "
token 17 -> "2"
token 10 -> "+"
token 17 -> "2"
token 30 -> "?"
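A quick invariant behind the token-by-token decode above: concatenating the per-token strings must reconstruct the original prompt exactly. The specific BPE ids are tokenizer-dependent, so only the string roundtrip is checked here:

```python
# Decoded pieces from the trace above; their concatenation must equal the prompt.
pieces = ["What", " is", " ", "2", "+", "2", "?"]
text = "".join(pieces)
```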
[DIAG] Chat template tokens (15): [100264, 882, 100265, 3923, 374, 220, 17, 10, 17, 30, 100257, 198, 100264, 78191, 100265]
[DIAG] Chat template decoded: "<|start_of_role|>user<|end_of_role|>What is 2+2?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>"
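The decoded template above can be sketched as a string assembly: role markers wrap the user turn, and an open assistant turn at the end invites generation. This reproduces only the single-user-turn framing seen in this log, not the full Granite template:

```python
# Single-turn framing matching the decoded chat template in the trace above.
def apply_template(user_msg):
    return ("<|start_of_role|>user<|end_of_role|>" + user_msg
            + "<|end_of_text|>\n"
            + "<|start_of_role|>assistant<|end_of_role|>")

prompt = apply_template("What is 2+2?")
```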
[DIAG] prefill logits shape=(1,15,100352) min=-50.140812 max=92.124290 mean=-1.907799 |mean|=7.610769
[DIAG] Generating 20 tokens (argmax):
step=0 token=93909 text="-ves"
step=1 token=6549 text="125"
step=2 token=6549 text="125"
step=3 token=6549 text="125"
step=4 token=6549 text="125"
step=5 token=6549 text="125"
step=6 token=6549 text="125"
step=7 token=6549 text="125"
step=8 token=6549 text="125"
step=9 token=6549 text="125"
step=10 token=6549 text="125"
step=11 token=6549 text="125"
step=12 token=6549 text="125"
step=13 token=6549 text="125"
step=14 token=6549 text="125"
step=15 token=6549 text="125"
step=16 token=6549 text="125"
step=17 token=6549 text="125"
step=18 token=6549 text="125"
step=19 token=6549 text="125"
[DIAG] Full output (argmax): "-ves125125125125125125125125125125125125125125125125125125125"
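The argmax trace above is plain greedy decoding: take the last-position logits, pick the argmax, append, repeat. A degenerate repeating output like "125125125..." is exactly what this loop produces when the forward pass keeps assigning the same token the top logit. A sketch, where `model` is a hypothetical stand-in that always favors token 6549:

```python
import numpy as np

# Greedy decode loop. `model` is a hypothetical stand-in for the real forward
# pass: it always puts its top logit on token 6549, mimicking the stuck trace.
def model(tokens):
    logits = np.zeros(100352, dtype=np.float32)
    logits[6549] = 10.0
    return logits

tokens = [100264, 882]                       # toy prompt prefix
for _ in range(3):
    tokens.append(int(np.argmax(model(tokens))))
```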
[DIAG] Generating 20 tokens (categorical T=0.7):
step=0 token=89232 text=".Disclaimer"
step=1 token=6549 text="125"
step=2 token=6549 text="125"
step=3 token=6549 text="125"
step=4 token=6549 text="125"
step=5 token=6549 text="125"
step=6 token=6549 text="125"
step=7 token=6549 text="125"
step=8 token=6549 text="125"
step=9 token=6549 text="125"
step=10 token=6549 text="125"
step=11 token=6549 text="125"
step=12 token=6549 text="125"
step=13 token=6549 text="125"
step=14 token=6549 text="125"
step=15 token=6549 text="125"
step=16 token=6549 text="125"
step=17 token=6549 text="125"
step=18 token=6549 text="125"
step=19 token=6549 text="125"
[DIAG] Full output (categorical): ".Disclaimer125125125125125125125125125125125125125125125125125125125"
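The T=0.7 trace uses temperature sampling: softmax(logits / T), then one categorical draw. When the logits are sharply peaked, the draw almost always coincides with argmax, which is why both traces collapse onto the same repeated token after the first step. A sketch with synthetic peaked logits:

```python
import numpy as np

# Temperature sampling: softmax(logits / T), then one categorical draw.
def sample(logits, temperature, rng):
    z = logits / temperature
    z -= z.max()                              # subtract max for stability
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(p), p=p))

logits = np.full(100, -10.0)
logits[42] = 10.0                             # sharply peaked distribution
tok = sample(logits, 0.7, np.random.default_rng(0))  # -> 42 w.h.p.
```

With a 20-logit gap at T=0.7 the peak carries essentially all probability mass, so the draw matches argmax regardless of seed.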
[DIAG] Testing via generate_text (chat.cpp path):
token=89232 text=".Disclaimer"
token=6549 text="125"
token=0 text="!"
token=0 text="!"
token=75948 text=" exporters"
token=0 text="!"
token=0 text="!"
token=0 text="!"
token=93548 text=".optString"
token=0 text="!"
token=0 text="!"
token=0 text="!"
token=44206 text="ITT"
token=0 text="!"
token=0 text="!"
token=0 text="!"
token=44206 text="ITT"
token=0 text="!"
token=0 text="!"
token=0 text="!"
[DIAG] generate_text output: ".Disclaimer125!! exporters!!!.optString!!!ITT!!!ITT!!!"
[DIAG] Prompt: 15 tokens, 38.1461 tokens/s, 0.393225s
Generation: 20 tokens, 26.9525 tokens/s, 0.742047s
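The throughput figures above are just token counts over elapsed wall-clock seconds:

```python
# Throughput = tokens / elapsed seconds, matching the two lines above.
prefill_tps = 15 / 0.393225   # ≈ 38.15 tok/s
decode_tps = 20 / 0.742047    # ≈ 26.95 tok/s
```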
--- TEST 8: random::categorical ---
[DIAG] categorical([..., 10, ...]) = 2 (expect 2)
[DIAG] categorical([..., 10, ...]) = 2 (expect 2)
[DIAG] categorical([..., 10, ...]) = 2 (expect 2)
[DIAG] categorical([..., 10, ...]) = 2 (expect 2)
[DIAG] categorical([..., 10, ...]) = 2 (expect 2)
[DIAG] categorical(peak@17, V=151936) = 17 (expect 17)
[DIAG] categorical(peak@17, V=151936) = 17 (expect 17)
[DIAG] categorical(peak@17, V=151936) = 17 (expect 17)
[DIAG] Testing categorical with real model logits...
terminate called after throwing an instance of 'std::runtime_error'
what(): hipMalloc (unified) failed: an illegal memory access was encountered.