Summary:
Rewrite the Attention Sink KV cache implementation from an eviction-based approach to a ring buffer for torch.export compatibility.
Key changes:
- Ring buffer KV cache: Replace dynamic eviction (torch.cat, narrow, shift) with a fixed-size ring buffer updated via index_copy_. Cache layout: [sink slots | ring buffer slots]. Sink tokens (e.g., BOS) occupy fixed positions; window tokens wrap around in the ring buffer region (slot math sketched after this list).
- Remove eviction_batch_size: No longer needed -- the ring buffer overwrites old entries automatically. Removed from all interfaces (attention_sink.py, model.py, llm_config.py, YAML config).
- Remove attention_sink_forward: No more monkey-patching AttentionMHA.forward. Instead, KVCacheWithAttentionSink sets is_ring_buffer=True, and AttentionMHA.forward handles ring buffer models natively: it skips the start_pos bounds check and computes the attention mask after the KV update (masking sketched after this list).
- Remove rerotate_k / position shifting: The ring buffer applies RoPE at each token's original position, so cached keys never need re-rotation.
- Fix C++ runner: Remove TEMPORARY max_new_tokens hack. Add max_seq_len prefill check. Make context length check conditional for sliding window models.
- Rewrite tests: Replace 16 eviction-based tests with 18 ring buffer tests covering sink preservation, ring wrapping, causal masking, and degenerate (sink_size=0) cases.
- Add llama_attention_sink.yaml: Example config for attention sink export.
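
The slot arithmetic for the [sink slots | ring buffer slots] layout can be illustrated with a minimal sketch. This is not the KVCacheWithAttentionSink code from this diff; the class, tensor shapes, and parameter names (sink_size, window_size, head_dim) are assumptions for illustration only.

```python
import torch

class RingBufferKVCacheSketch:
    """Illustrative only: fixed-size cache laid out as [sink slots | ring buffer slots]."""

    def __init__(self, batch: int, heads: int, sink_size: int, window_size: int, head_dim: int):
        self.sink_size = sink_size
        self.window_size = window_size
        max_len = sink_size + window_size
        self.k_cache = torch.zeros(batch, heads, max_len, head_dim)
        self.v_cache = torch.zeros(batch, heads, max_len, head_dim)

    def physical_slots(self, input_pos: torch.Tensor) -> torch.Tensor:
        # input_pos: 1-D long tensor of original token positions.
        # Sink tokens keep their fixed slots; every later token wraps around
        # inside the ring buffer region, overwriting the oldest entry there.
        ring = self.sink_size + (input_pos - self.sink_size) % self.window_size
        return torch.where(input_pos < self.sink_size, input_pos, ring)

    def update(self, input_pos: torch.Tensor, k_val: torch.Tensor, v_val: torch.Tensor):
        # k_val / v_val: [batch, heads, seq_len, head_dim], already RoPE-rotated
        # at their original positions, so overwriting a slot needs no re-rotation.
        slots = self.physical_slots(input_pos)
        self.k_cache.index_copy_(2, slots, k_val)
        self.v_cache.index_copy_(2, slots, v_val)
        return self.k_cache, self.v_cache
```

Because the cache shape is static and the update is an in-place index_copy_ with no torch.cat or narrow, the traced graph keeps fixed shapes, which is the property torch.export needs.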
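
Computing the attention mask after the KV update can be sketched the same way. The cache_positions buffer below (the original position currently stored in each physical slot, -1 when empty) is a hypothetical bookkeeping tensor, not necessarily how the diff tracks slot validity.

```python
import torch

def ring_buffer_mask(cache_positions: torch.Tensor, query_pos: torch.Tensor) -> torch.Tensor:
    """cache_positions: [max_len] original position held by each slot (-1 = empty).
    query_pos: [seq_len] original positions of the current queries.
    Returns a bool mask [seq_len, max_len] where True means 'may attend'."""
    valid = cache_positions >= 0                              # slot holds a real token
    causal = cache_positions[None, :] <= query_pos[:, None]   # never attend to the future
    return valid[None, :] & causal
```

Since the newest keys are written before the mask is built, each query can attend to its own slot; anything older than the window has already been overwritten, so no explicit lower bound is needed, and with sink_size=0 this degenerates to a plain sliding-window mask.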
Differential Revision: D99900289