
Commit c08b1bf

unamedkr and claude committed
Add community posting drafts (Reddit, HN) in EN/KO
docs/pr/2026-03-31-reddit-localllama.md — KV compression focus
docs/pr/2026-03-31-hackernews.md — Show HN technical depth
Both with Korean translations and posting strategy notes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent ee6f8de commit c08b1bf

4 files changed

Lines changed: 173 additions & 0 deletions


Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
# Hacker News Korean Post — 2026-03-31

## Title

Show HN: Pure C LLM engine, 3.8x KV cache compression (9K lines, no dependencies)

## URL

https://github.com/quantumaikr/TurboQuant.cpp

## First comment

I built a pure C engine that compresses the KV cache on the fly during LLM inference.

The problem: at long contexts of 32K+ tokens, the KV cache takes up more memory than the weights. For a 4B model at 32K context, the KV cache is 4.4 GB in FP16.

TurboQuant quantizes the KV cache to Q4 during inference, reducing that to 1.2 GB (3.8x compression, 0.999 cosine similarity against FP16 output). Based on three recent papers: TurboQuant (ICLR '26), QJL (AAAI '25), PolarQuant (AISTATS '26).
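For reference, the FP16 figure follows from the standard KV-cache size formula; the layer and head parameters below are illustrative assumptions chosen to be consistent with the stated numbers, not values read from the model config:

```latex
% KV bytes = 2 (K and V) x n_layers x n_kv_heads x head_dim x context x bytes per element
% assumed for illustration: 34 layers, 4 KV heads, head dim 256, FP16 = 2 bytes
2 \cdot 34 \cdot 4 \cdot 256 \cdot 32768 \cdot 2\ \text{bytes} \approx 4.25\ \text{GiB}\ (\text{the}\ {\sim}4.4\ \text{GB above})
% at roughly 4.25 bits per element (4-bit values plus per-group scales):
\tfrac{4.25}{16} \cdot 4.25\ \text{GiB} \approx 1.13\ \text{GiB} \approx 1.2\ \text{GB}, \quad \tfrac{16}{4.25} \approx 3.8\times
```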
Technical details:
- 9,000 lines of C11, libc only, no external dependencies
- Q4 weight quantization + ARM NEON 2-row batching
- Thread pool, integer Q4×Q8 attention (vdotq_s32)
- Multi-architecture: Qwen3.5 (DeltaNet) + Gemma 3 (sliding window)
- Dual tokenizer: GPT2 byte-level BPE + SentencePiece with auto-detection
- TQM format: pre-quantized mmap binary (see the sketch below)
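A rough idea of why an mmap-based format loads almost instantly, as a minimal sketch; the header struct and field names here are hypothetical, not the actual TQM layout:

```c
/* Hypothetical sketch of mmap-loading a pre-quantized model file.
 * The header struct and field names are illustrative; the real TQM
 * layout is defined in the repository. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {
    uint32_t magic;       /* file identifier */
    uint32_t n_layers;    /* transformer layer count */
    uint64_t weights_off; /* byte offset of the quantized weight blob */
} tqm_header_t;           /* hypothetical header layout */

const void *tqm_map(const char *path, size_t *size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    /* Map the whole file read-only; pages fault in lazily, so "loading"
     * is effectively instant and weights are shared via the page cache. */
    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after close */
    if (base == MAP_FAILED) return NULL;
    *size_out = (size_t)st.st_size;
    return base;
}
/* Usage: const tqm_header_t *hdr = (const tqm_header_t *)tqm_map("model.tqm", &len); */
```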
Single-thread speed is on par with llama.cpp (51 vs 50.7 tok/s). The core value is memory efficiency at long contexts, not speed.

Built in 2 days with Claude Code. Released as v0.1.0.

docs/pr/2026-03-31-hackernews.md

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
# Hacker News Post — 2026-03-31

## Title

Show HN: Pure C LLM engine with 3.8x KV cache compression (9K lines, zero deps)

## URL

https://github.com/quantumaikr/TurboQuant.cpp

## Comment (post immediately after submission)

Hi HN, I built an LLM inference engine in pure C that compresses the KV cache during inference.

The problem: at long contexts (32K+ tokens), the KV cache — not the weights — becomes the memory bottleneck. A 4B model at 32K context needs 4.4 GB just for KV in FP16.

TurboQuant quantizes the KV cache to Q4 on-the-fly, reducing that to 1.2 GB (3.8x compression) with 0.999 cosine similarity to FP16 output. Based on three recent papers: TurboQuant (ICLR '26), QJL (AAAI '25), PolarQuant (AISTATS '26).
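To make the "on-the-fly" part concrete, here is a minimal sketch of group-wise 4-bit quantization applied to a K/V vector as it is appended to the cache; the group size, rounding, and packing are simplified assumptions, not the engine's exact kernel (which follows the papers above):

```c
/* Simplified sketch of group-wise 4-bit quantization of one KV vector.
 * Group size, packing, and scale format are illustrative assumptions. */
#include <math.h>
#include <stdint.h>

#define GROUP 32  /* assumed quantization group size */

/* Quantize GROUP floats into 16 packed 4-bit values plus one float scale. */
static void quantize_q4_group(const float *x, uint8_t out[GROUP / 2], float *scale) {
    float amax = 0.0f;
    for (int i = 0; i < GROUP; i++) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    float s = amax / 7.0f;                 /* map [-amax, amax] onto [-7, 7] */
    float inv = (s != 0.0f) ? 1.0f / s : 0.0f;
    for (int i = 0; i < GROUP / 2; i++) {
        int lo = (int)lrintf(x[2 * i]     * inv) + 8;  /* store with a +8 offset */
        int hi = (int)lrintf(x[2 * i + 1] * inv) + 8;
        if (lo < 0) lo = 0; if (lo > 15) lo = 15;
        if (hi < 0) hi = 0; if (hi > 15) hi = 15;
        out[i] = (uint8_t)(lo | (hi << 4));            /* two nibbles per byte */
    }
    *scale = s;
}
```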
Technical details:
- 9,000 lines of C11, libc only, no external dependencies
- Q4 weight quantization with ARM NEON 2-row batching
- Thread pool, integer Q4×Q8 attention (vdotq_s32; see the sketch after this list)
- Multi-architecture: Qwen3.5 (DeltaNet hybrid) + Gemma 3 (sliding window)
- Dual tokenizer: GPT2 byte-level BPE + SentencePiece
- TQM format: pre-quantized mmap binary for instant loading
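To illustrate the integer attention path, a minimal sketch of one Q4×Q8 group dot product using the ARM dot-product intrinsic; the nibble packing order and scale handling are simplified assumptions, not the engine's actual kernel:

```c
/* Minimal sketch of an integer Q4 x Q8 dot product for one 32-element group
 * using the ARMv8.2 dot-product extension (compile with e.g. +dotprod).
 * Packing order and scale handling are simplified assumptions. */
#include <arm_neon.h>
#include <stdint.h>

/* q4: 16 bytes, two 4-bit values per byte (stored with a +8 offset),
 * assumed here to hold elements 0..15 in low nibbles and 16..31 in high.
 * q8: 32 signed 8-bit activations. Returns the dequantized partial sum. */
static inline float dot_q4_q8_group(const uint8_t *q4, const int8_t *q8,
                                    float scale4, float scale8) {
    uint8x16_t packed = vld1q_u8(q4);
    /* unpack low and high nibbles, then remove the +8 offset */
    int8x16_t lo = vsubq_s8(vreinterpretq_s8_u8(vandq_u8(packed, vdupq_n_u8(0x0F))), vdupq_n_s8(8));
    int8x16_t hi = vsubq_s8(vreinterpretq_s8_u8(vshrq_n_u8(packed, 4)), vdupq_n_s8(8));

    int8x16_t b0 = vld1q_s8(q8);       /* first 16 int8 activations  */
    int8x16_t b1 = vld1q_s8(q8 + 16);  /* second 16 int8 activations */

    /* vdotq_s32 accumulates 4-way int8 dot products into int32 lanes */
    int32x4_t acc = vdupq_n_s32(0);
    acc = vdotq_s32(acc, lo, b0);
    acc = vdotq_s32(acc, hi, b1);

    /* reduce the 4 int32 lanes and apply the two group scales */
    return scale4 * scale8 * (float)vaddvq_s32(acc);
}
```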
Speed matches llama.cpp single-thread (51 vs 50.7 tok/s on Qwen3.5-0.8B Q4). The value is memory efficiency at long contexts, not raw speed.

Built in 2 days with Claude Code. v0.1.0 just released.

---
## Posting Notes

- **Best time**: US weekday morning (UTC 14:00-16:00)
- **HN audience cares about**: technical depth, honesty, zero-dep C code, paper implementations
- **Avoid**: marketing language, speed claims without context, "revolutionary" etc.
- **Expected questions**:
  - "How does KV quantization affect perplexity?" → 0.999 cosine, per-layer verified (a minimal check is sketched below)
  - "Why not contribute to llama.cpp?" → Different approach (KV compression is orthogonal)
  - "Built in 2 days?" → With AI assistance (Claude Code), honest about it
Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
# r/LocalLLaMA Korean Post — 2026-03-31

## Title

TurboQuant.cpp — a pure C inference engine that saves 3.2 GB versus llama.cpp at 32K context through 3.8x KV cache compression

## Body

I built a C inference engine focused on one thing llama.cpp doesn't do: **KV cache compression**.

At short contexts, KV memory isn't a big problem. But beyond 32K tokens, the KV cache takes up more memory than the model weights.
**Measured data (Gemma 3 4B):**

```
Context       llama.cpp KV (FP16)   TurboQuant KV (Q4)   Saved
───────────   ───────────────────   ──────────────────   ─────────
4K tokens     544 MB                145 MB               399 MB
32K tokens    4,352 MB              1,156 MB             3,196 MB
128K tokens   17,408 MB             4,624 MB             12,784 MB
```
3.8x compression, with per-layer accuracy verified against PyTorch.

**Speed is competitive but not the point:**
- Single-thread Q4: 51.1 tok/s (llama.cpp: 50.7 tok/s) — on par
- We're not claiming to be faster

**What's different:**
- 3.8x KV cache compression based on the ICLR 2026 TurboQuant paper
- 3 supported models: Gemma 3 4B, Qwen3.5-0.8B, Gemma 3 270M
- Pure C, no external dependencies, ~1MB binary
- Multi-architecture: DeltaNet (Qwen) + sliding window (Gemma; see the sketch below)
- Gemma 4 support is ready
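For the sliding-window (Gemma) path, the point is that the KV cache only ever holds a window's worth of tokens; a minimal sketch of the ring-buffer indexing, with the window size and names as illustrative assumptions:

```c
/* Minimal sketch of sliding-window KV caching as a ring buffer.
 * The window size and naming are illustrative assumptions. */
#include <stddef.h>

#define SWA_WINDOW 1024  /* hypothetical per-layer window length */

/* Map an absolute token position to a cache slot: once the window is full,
 * new tokens overwrite the oldest slot instead of growing the cache. */
static size_t swa_slot(size_t pos) {
    return pos % SWA_WINDOW;
}

/* Number of cached positions the current token may attend to. */
static size_t swa_visible(size_t pos) {
    return (pos + 1 < SWA_WINDOW) ? pos + 1 : SWA_WINDOW;
}
```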
```bash
git clone https://github.com/quantumaikr/TurboQuant.cpp && cd TurboQuant.cpp
bash scripts/quickstart.sh "What is deep learning?"
```

Built in 2 days. 9,000 lines of C. 20 test suites. First release: v0.1.0.

KV cache compression delivers the most value when running long contexts on limited RAM — exactly the scenario that matters most to local LLM users.

GitHub: https://github.com/quantumaikr/TurboQuant.cpp
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# r/LocalLLaMA Post — 2026-03-31

## Title

TurboQuant.cpp — Pure C inference engine with 3.8x KV cache compression. Runs Gemma 3 4B at 32K context using 1.2 GB KV instead of 4.4 GB.

## Body

We built a C inference engine from scratch focused on one thing llama.cpp doesn't do: **compressing the KV cache**.

At short contexts, KV memory doesn't matter much. But at 32K+ tokens, it becomes the dominant memory cost — often larger than the model weights themselves.
**The numbers (Gemma 3 4B):**

```
Context       llama.cpp KV (FP16)   TurboQuant KV (Q4)   Saved
───────────   ───────────────────   ──────────────────   ─────────
4K tokens     544 MB                145 MB               399 MB
32K tokens    4,352 MB              1,156 MB             3,196 MB
128K tokens   17,408 MB             4,624 MB             12,784 MB
```
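If you want to sanity-check the FP16 column, it follows from the standard KV-cache formula; the layer/head parameters in this sketch are assumptions chosen to reproduce the table, not values read from the model config:

```c
/* Reproduces the FP16 column of the table above from the standard
 * KV-cache formula. Layer/head parameters are assumptions consistent
 * with the figures, not values taken from the model config. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const int64_t layers = 34, kv_heads = 4, head_dim = 256; /* assumed */
    const int64_t ctx[] = {4096, 32768, 131072};
    for (int i = 0; i < 3; i++) {
        /* K and V, 2 bytes per FP16 element */
        int64_t fp16 = 2 * layers * kv_heads * head_dim * ctx[i] * 2;
        /* ~4.25 bits per element: 4-bit values plus per-group scales */
        double q4 = (double)fp16 * 4.25 / 16.0;
        printf("%8lld tokens  FP16 %6lld MB   Q4 ~%7.1f MB\n",
               (long long)ctx[i], (long long)(fp16 >> 20), q4 / (1 << 20));
    }
    return 0;
}
```

The Q4 column then corresponds to roughly 4.25 bits per element, which is where the ~3.8x ratio comes from.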
3.8x compression with verified output quality (per-layer verification against PyTorch).

**Speed is competitive, not the selling point:**
- Single-thread Q4: 51.1 tok/s (llama.cpp: 50.7 tok/s) on Qwen3.5-0.8B
- Same ballpark. We're not claiming to be faster.

**What's different:**
- 3.8x KV cache compression (TurboQuant ICLR '26, QJL AAAI '25, PolarQuant AISTATS '26)
- 3 models: Gemma 3 4B, Qwen3.5-0.8B, Gemma 3 270M
- Pure C, zero dependencies, ~1MB binary
- Multi-architecture: DeltaNet hybrid (Qwen) + sliding window (Gemma)
- Gemma 4 ready (same architecture family)
**Quick start:**
```bash
git clone https://github.com/quantumaikr/TurboQuant.cpp && cd TurboQuant.cpp
bash scripts/quickstart.sh "What is deep learning?"
```

Built in 2 days. 9,000 lines of C. 20 test suites. First release: v0.1.0.

The KV compression matters most for long context on limited RAM — exactly the scenario local LLM users care about.

GitHub: https://github.com/quantumaikr/TurboQuant.cpp

---

## Posting Notes

- **Flair**: `New Model` or `Resource`
- **Best time**: UTC Tue-Thu 1-3 PM (US East morning)
- **Expected questions**:
  - "What about quality degradation?" → 0.999 cosine similarity, per-layer PyTorch match
  - "vs llama.cpp?" → Same speed, different value prop (KV compression)
  - "Only 3 models?" → Multi-arch engine, more coming. Gemma 4 ready.
  - "Q4 KV vs FP16 isn't fair" → Both are valid choices. We offer the option llama.cpp doesn't.
