CRITICAL FIXES:
- BUG #1: Multi-block attention for head_dim > 128 (previously only the
  first block was processed; see the first sketch after this list)
  Fixed in: tq_uniform.c, tq_mixed.c, tq_polar.c, tq_neon.c, tq_context.c
  All attention functions now iterate over blocks_per_key blocks per key vector
- BUG #2: Integer attention (tq_uniform_4b_attention_int_ref) is now
  registered in TQ_TRAITS; previously the slow dequantization path was
  always used (see the second sketch below)
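
A minimal sketch of the BUG #1 fix, assuming a hypothetical layout where each key is stored as consecutive 128-dim blocks, each with its own 4-bit codes and per-block scale. The names tq_block_t, TQ_BLOCK_DIM, score_block, and tq_attention_score, plus the offset-8 code centering, are assumptions of this sketch, not the library's actual API; only blocks_per_key appears in the commit:

```c
#include <stddef.h>

#define TQ_BLOCK_DIM 128 /* dims per quantization block (assumed) */

typedef struct {
    float scale;                       /* per-block dequant scale */
    unsigned char q[TQ_BLOCK_DIM / 2]; /* 4-bit codes, two per byte */
} tq_block_t;

/* Dot product of one 128-dim query slice against one quantized block. */
static float score_block(const float *q, const tq_block_t *blk) {
    float acc = 0.0f;
    for (size_t i = 0; i < TQ_BLOCK_DIM; i += 2) {
        unsigned char byte = blk->q[i / 2];
        /* Offset-8 code centering is an assumption of this sketch. */
        acc += q[i]     * (float)((byte & 0x0F) - 8);
        acc += q[i + 1] * (float)((byte >> 4) - 8);
    }
    return acc * blk->scale;
}

/* Before the fix this loop effectively ran once (blocks[0] only). */
float tq_attention_score(const float *query, const tq_block_t *blocks,
                         size_t blocks_per_key) {
    float score = 0.0f;
    for (size_t b = 0; b < blocks_per_key; b++)
        score += score_block(query + b * TQ_BLOCK_DIM, &blocks[b]);
    return score;
}
```

With head_dim=256 there are two such blocks, so the old single-iteration loop silently dropped half of every key, which matches the cosine 0.0000 failure reported below.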
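
A companion sketch of the BUG #2 registration, assuming TQ_TRAITS is a per-format table of kernel function pointers consulted at dispatch time. Only the names TQ_TRAITS and tq_uniform_4b_attention_int_ref come from this commit; the struct layout, stub kernels, and dispatch helper are illustrative:

```c
#include <stddef.h>

typedef float (*tq_attn_fn)(const float *q, const void *keys, size_t n);

typedef struct {
    tq_attn_fn attention_dequant; /* slow path: dequantize, then dot */
    tq_attn_fn attention_int;     /* fast path: integer-domain scoring */
} tq_traits_t;

/* Stubs standing in for the real kernels (which live in tq_uniform.c). */
static float tq_uniform_4b_attention_dequant_ref(const float *q,
                                                 const void *keys, size_t n) {
    (void)q; (void)keys; (void)n; return 0.0f;
}
static float tq_uniform_4b_attention_int_ref(const float *q,
                                             const void *keys, size_t n) {
    (void)q; (void)keys; (void)n; return 0.0f;
}

static const tq_traits_t TQ_TRAITS_UNIFORM_4B = {
    .attention_dequant = tq_uniform_4b_attention_dequant_ref,
    .attention_int     = tq_uniform_4b_attention_int_ref, /* was missing */
};

/* Dispatch prefers the integer path whenever one is registered; before
 * the fix the .attention_int slot was empty, so every call fell back to
 * the dequantization path. */
float tq_attention(const tq_traits_t *t, const float *q,
                   const void *keys, size_t n) {
    return t->attention_int ? t->attention_int(q, keys, n)
                            : t->attention_dequant(q, keys, n);
}
```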
VERIFIED ON REAL MODEL:
- Qwen3.5-0.8B (head_dim=256, 2 blocks per key): cosine 0.9802 (A grade)
- Before the fix: cosine 0.0000 (completely broken for head_dim > 128);
  the cosine metric is sketched below
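
For reference, a minimal sketch of the metric behind these grades: cosine similarity between the full-precision attention output and the TurboQuant output. The double-precision accumulation and the zero-norm handling are choices of the sketch, not necessarily what the demo does:

```c
#include <math.h>
#include <stddef.h>

/* Cosine similarity between a reference output and the quantized-path
 * output; doubles keep the accumulators stable for long vectors. */
float cosine_similarity(const float *ref, const float *out, size_t n) {
    double dot = 0.0, nref = 0.0, nout = 0.0;
    for (size_t i = 0; i < n; i++) {
        dot  += (double)ref[i] * (double)out[i];
        nref += (double)ref[i] * (double)ref[i];
        nout += (double)out[i] * (double)out[i];
    }
    if (nref == 0.0 || nout == 0.0)
        return 0.0f; /* treat a degenerate vector as orthogonal */
    return (float)(dot / (sqrt(nref) * sqrt(nout)));
}
```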
NEW:
- tests/test_multiblock.cpp: 5 tests covering the multi-block paths at
  dim=256 and dim=384 (a toy analogue is sketched after this list)
- tools/tq_realtime_demo.py: end-to-end demo running KV cache compression
  and TurboQuant attention on an actual model (not PyTorch attention)
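
A self-contained toy analogue of the regression the multi-block tests guard against (the real tests are C++ and exercise the library's actual quantizers; the symmetric 4-bit quantizer and all names here are illustrative only). At dim=256, scoring over both blocks must track the float reference far more closely than the first-block-only behavior of the old code:

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdio.h>

#define BLOCK_DIM 128

typedef struct {
    float scale;
    unsigned char q[BLOCK_DIM / 2];
} blk_t;

/* Toy symmetric 4-bit quantizer for one 128-dim block (offset-8 codes). */
static void quantize_block(const float *x, blk_t *b) {
    float amax = 0.0f;
    for (size_t i = 0; i < BLOCK_DIM; i++) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    b->scale = amax > 0.0f ? amax / 7.0f : 1.0f;
    for (size_t i = 0; i < BLOCK_DIM; i += 2) {
        int lo = (int)lroundf(x[i] / b->scale) + 8;
        int hi = (int)lroundf(x[i + 1] / b->scale) + 8;
        if (lo < 0) lo = 0;
        if (lo > 15) lo = 15;
        if (hi < 0) hi = 0;
        if (hi > 15) hi = 15;
        b->q[i / 2] = (unsigned char)(lo | (hi << 4));
    }
}

/* Quantized dot product over the first nblk blocks of a key. */
static float score(const float *q, const blk_t *blocks, size_t nblk) {
    float acc = 0.0f;
    for (size_t bi = 0; bi < nblk; bi++) {
        const blk_t *b = &blocks[bi];
        const float *qb = q + bi * BLOCK_DIM;
        float s = 0.0f;
        for (size_t i = 0; i < BLOCK_DIM; i += 2) {
            s += qb[i]     * (float)((b->q[i / 2] & 0x0F) - 8);
            s += qb[i + 1] * (float)((b->q[i / 2] >> 4) - 8);
        }
        acc += s * b->scale;
    }
    return acc;
}

int main(void) {
    enum { DIM = 256, NBLK = DIM / BLOCK_DIM };
    float q[DIM], k[DIM];
    blk_t blocks[NBLK];
    for (size_t i = 0; i < DIM; i++) {
        q[i] = sinf(0.37f * (float)i);
        k[i] = cosf(0.53f * (float)i);
    }
    for (size_t b = 0; b < NBLK; b++)
        quantize_block(k + b * BLOCK_DIM, &blocks[b]);

    /* Full-precision reference over all 256 dims. */
    float ref = 0.0f;
    for (size_t i = 0; i < DIM; i++) ref += q[i] * k[i];

    float fixed = score(q, blocks, NBLK); /* iterate every block */
    float buggy = score(q, blocks, 1);    /* old behavior: first only */

    printf("ref=%.4f fixed=%.4f first-block-only=%.4f\n",
           (double)ref, (double)fixed, (double)buggy);
    assert(fabsf(fixed - ref) < fabsf(buggy - ref));
    return 0;
}
```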
18/18 tests pass. All existing tests unaffected.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>