From e51a82225957768d26844e663877df83c14e22ff Mon Sep 17 00:00:00 2001
From: Simba Zhang
Date: Fri, 24 Apr 2026 14:40:12 -0700
Subject: [PATCH] fix: remove virtual allocation reference from DeepSeek key takeaways

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 84fca94..52f9fec 100644
--- a/README.md
+++ b/README.md
@@ -88,9 +88,9 @@ Model: [`Thump604/DeepSeek-V4-Flash-MLX-Q3-mixed-gs128-affine`](https://huggingf
 
 > Values shown as `generation speed · peak physical RAM used` (sampled every 0.5s during prefill + generation). The 126 GB model streams the rest from NVMe SSD.
 
 **Key takeaways:**
-- 🏆 **SSD + TurboQuant dominates at long context** — 4.16 tok/s at 40K vs 0.32 tok/s for plain SSD Stream (**13× faster**), with 33% lower GPU allocation (40.6 GB vs 60.5 GB).
+- 🏆 **SSD + TurboQuant dominates at long context** — 4.16 tok/s at 40K vs 0.32 tok/s for plain SSD Stream (**13× faster**). TurboQuant compresses the KV cache so far fewer layers need to stream from SSD per token.
 - At 512-token context all configurations perform similarly (~4.4–4.8 tok/s); TurboQuant's advantage is KV-cache compression at long context.
-- Peak physical RAM (GPU InUse) stays ≤ 17 GB across all configurations — the rest streams from NVMe SSD.
+- Peak physical RAM stays ≤ 17 GB across all configurations — the 126 GB model streams the rest from NVMe SSD.
 
 ---