New Python API parameter:
m = Model("model.gguf", progressive=True)
Keeps the last 128 tokens' keys at FP32 while compressing everything else.
Measured result: PPL degradation drops from +3.8% to +0.6% at a cost of
28 KB of extra memory, so the quality win is effectively free.
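
For concreteness, a minimal sketch in C of the read path this implies. All
names here (kv_keys, kv_read_key, HIGHRES_WINDOW, HEAD_DIM) are illustrative
assumptions, not this change's actual internals:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HIGHRES_WINDOW 128  /* last N tokens whose keys stay FP32 */
#define HEAD_DIM        64  /* per-head key width; example value */

typedef struct {
    float   highres[HIGHRES_WINDOW][HEAD_DIM];  /* FP32 ring buffer */
    int8_t *quant;     /* older keys, e.g. int8 rows (illustrative) */
    float  *scale;     /* one dequantization scale per row */
    size_t  n_tokens;  /* tokens written so far */
} kv_keys;

/* Fetch one key vector: positions inside the window come back bit-exact
 * from the FP32 ring buffer, older positions are dequantized. */
void kv_read_key(const kv_keys *kv, size_t pos, float out[HEAD_DIM]) {
    if (kv->n_tokens - pos <= HIGHRES_WINDOW) {
        memcpy(out, kv->highres[pos % HIGHRES_WINDOW], sizeof kv->highres[0]);
    } else {
        const int8_t *row = kv->quant + pos * HEAD_DIM;
        for (int i = 0; i < HEAD_DIM; i++)
            out[i] = (float)row[i] * kv->scale[pos];
    }
}

The write path is the mirror image: when a token ages out of the window, its
FP32 row is quantized into the compressed store before the ring slot is
reused.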
C API: added a k_highres_window field to quant_config; quant_new allocates
the FP32 highres buffer when k_highres_window > 0 and the KV cache is
quantized.
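
A hedged sketch of that allocation rule. quant_config, k_highres_window, and
quant_new are the names this change adds; every other field and the
quant_state struct are placeholder assumptions:

#include <stdlib.h>

typedef struct {
    int kv_quantized;      /* assumed field: nonzero when KV cache is quantized */
    int k_highres_window;  /* new field: last-N keys kept at FP32; 0 disables */
    int head_dim;          /* assumed field: per-head key width */
} quant_config;

typedef struct {
    quant_config cfg;
    float       *k_highres;  /* FP32 window; NULL when the feature is off */
} quant_state;

quant_state *quant_new(const quant_config *cfg) {
    quant_state *st = calloc(1, sizeof *st);
    if (!st) return NULL;
    st->cfg = *cfg;
    /* allocate the highres buffer only when both conditions hold */
    if (cfg->k_highres_window > 0 && cfg->kv_quantized) {
        st->k_highres = malloc(sizeof(float) *
                               (size_t)cfg->k_highres_window * cfg->head_dim);
        if (!st->k_highres) { free(st); return NULL; }
    }
    return st;
}

With k_highres_window set to 0, or with an unquantized KV cache, the buffer
stays NULL and reads fall through to the existing path.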
Progressive mode mirrors human memory: recent tokens are recalled at full
fidelity, while older tokens fade to compressed representations. No other
inference engine offers this; where llama.cpp drops old context, we
compress it.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>