Add a DeepSeek V4 HC weighted-sum ggml op with CPU, Metal, and meta backend support, and use it in the compressed attention path.
Batch resumed compressed decode projections, reserve a resumed-prompt DeepSeek V4 graph shape, increase the compressed decode replay cap, and place server checkpoints on SWA-spaced prompt tail positions.
On the Apple M3 Max test machine, the retained changes improved synthetic Metal server replay throughput from roughly 127.8/103.4/94.7 tok/s to 165.9/127.1/113.7 tok/s, with a generation sanity check running at about 21.5 tok/s.
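As a quick check on the replay numbers above, the following sketch computes the per-case speedup from the before/after figures quoted in this message. The three values are taken in the order given; what each one corresponds to is not stated here, so the code treats them only as positional cases, and the helper name `speedups` is purely illustrative.

```python
# Before/after replay throughput in tok/s, copied from the commit message.
# The meaning of each of the three cases is not specified there, so they
# are kept as positional entries only.
before = [127.8, 103.4, 94.7]
after = [165.9, 127.1, 113.7]


def speedups(before, after):
    """Return the after/before throughput ratio for each case."""
    return [a / b for b, a in zip(before, after)]


ratios = speedups(before, after)
for b, a, r in zip(before, after, ratios):
    print(f"{b:6.1f} -> {a:6.1f} tok/s  ({(r - 1) * 100:.1f}% faster)")
```

Under these figures the improvement works out to roughly 20 to 30 percent per case.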