
fix: remove virtual allocation from DeepSeek key takeaways #83

Merged

solderzzc merged 1 commit into main from fix/readme-takeaways-v2 on Apr 24, 2026

Conversation

@solderzzc
Member

Removes the '33% lower GPU allocation (40.6 GB vs 60.5 GB)' claim — those are virtual allocation numbers and are misleading. Explains the speed difference in plain language instead.

Copilot AI review requested due to automatic review settings April 24, 2026 21:40
solderzzc merged commit 05d0b6c into main on Apr 24, 2026
1 check passed
Copilot AI (Contributor) left a comment


Pull request overview

Updates the DeepSeek-V4-Flash benchmark “Key takeaways” in the README to remove misleading virtual GPU allocation numbers and replace them with a clearer, plain-language explanation tied to KV-cache compression and SSD streaming.

Changes:

  • Removes the “33% lower GPU allocation (40.6 GB vs 60.5 GB)” claim from the DeepSeek key takeaways.
  • Adds a plain-language explanation for the long-context speedup (KV-cache compression reducing SSD streaming pressure); a rough sketch of that effect follows this list.
  • Simplifies the “Peak physical RAM” takeaway to avoid referencing “GPU InUse”.
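For a sense of why the compression matters, here is a hedged back-of-envelope sketch — not code from this repo, and the layer count, per-layer KV size, and memory budget below are all hypothetical placeholders. The idea: with a fixed budget for resident KV cache, compression lets more layers stay in memory, so fewer layers must stream from SSD on each decoded token.

```python
# Hypothetical back-of-envelope model, not code from this repo.
# Assumption: a fixed memory budget holds the resident KV cache; any layer
# whose KV cache does not fit must be streamed from SSD on every decoded token.

def layers_streamed_per_token(total_layers: int, kv_bytes_per_layer: float,
                              budget_bytes: float, compression: float) -> int:
    """Layers whose KV cache must come from SSD on each decoding step."""
    resident = min(total_layers,
                   int(budget_bytes // (kv_bytes_per_layer / compression)))
    return total_layers - resident

GiB = 1 << 30
# Placeholder shapes: 60 layers, 1 GiB of KV cache per layer at long context,
# and a 16 GiB budget for resident KV cache.
print(layers_streamed_per_token(60, 1 * GiB, 16 * GiB, compression=1.0))  # 44
print(layers_streamed_per_token(60, 1 * GiB, 16 * GiB, compression=4.0))  # 0
```

Under these made-up numbers a 4× compression lets the whole KV cache fit in the budget, which is the shape of the effect the new takeaway describes; the real figures for DeepSeek-V4-Flash will differ.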


Comment thread: README.md
**Key takeaways:**
- (removed) 🏆 **SSD + TurboQuant dominates at long context** — 4.16 tok/s at 40K vs 0.32 tok/s for plain SSD Stream (**13× faster**), with 33% lower GPU allocation (40.6 GB vs 60.5 GB).
- (added) 🏆 **SSD + TurboQuant dominates at long context** — 4.16 tok/s at 40K vs 0.32 tok/s for plain SSD Stream (**13× faster**). TurboQuant compresses the KV cache so far fewer layers need to stream from SSD per token.
- At 512-token context all configurations perform similarly (~4.4–4.8 tok/s); TurboQuant's advantage is KV-cache compression at long context.
Copilot AI Apr 24, 2026


In this Key takeaways section, the terminology is inconsistent between “KV cache” (line 91) and “KV-cache” (line 92). Please standardize the phrasing (pick one form) within the section for readability/searchability.

Suggested change
- At 512-token context all configurations perform similarly (~4.4–4.8 tok/s); TurboQuant's advantage is KV-cache compression at long context.
- At 512-token context all configurations perform similarly (~4.4–4.8 tok/s); TurboQuant's advantage is KV cache compression at long context.

