
Commit e8bface

unamedkr and claude committed
Add comprehensive Getting Started guide
docs/getting-started.md — Single page covering:
1. Build (requirements, cmake, test)
2. 30-second demos (ab_test, demo_real_model, speed benchmark)
3. CLI tool (tq info, tq bench, tq +memory, tq demo)
4. Python API (quantize, dequantize, attention, type comparison)
5. C API (quantize, attention, K/V asymmetric, RHT)
6. llama.cpp integration (CMake, type registration, CLI parser)
7. Real model validation (Qwen3.5-0.8B)
8. Benchmark suite (7 benchmarks)
9. Project structure
10. Recommended strategies (from real A/B tests)

README + README.ko.md: Getting Started linked at top of docs table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 1327552 commit e8bface

3 files changed: 285 additions & 2 deletions


README.ko.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -127,12 +127,12 @@ Techniques validated by the r/LocalLLaMA community and llama.cpp Discussion #20969:
 
 | Document | Description |
 |------|------|
+| **[Getting Started](docs/getting-started.md)** | **Build, CLI, Python, C API, llama.cpp — all in one page** |
 | [Architecture](docs/architecture.md) | 4-layer design, type system, dispatch |
 | [Qwen3.5 Validation](docs/qwen35_validation_results.md) | Real-model A/B test results |
 | [Integration Guide](docs/integration_guide.md) | llama.cpp, vLLM, Python |
 | [llama.cpp Plugin](integrations/llamacpp/README.md) | Step-by-step llama.cpp integration guide |
 | [Format Spec](spec/tq_format_v1.md) | Block structure, bit packing |
-| [Performance Deep Dive](bench/speed_int_vs_float_v2.c) | Integer vs FP32 benchmark |
 | [Changelog](CHANGELOG.md) | Full release notes |
 
 ---
```

README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -127,12 +127,12 @@ Built on techniques validated by r/LocalLLaMA community and llama.cpp Discussion
 
 | Document | Description |
 |----------|-------------|
+| **[Getting Started](docs/getting-started.md)** | **Build, CLI, Python, C API, llama.cpp — all in one page** |
 | [Architecture](docs/architecture.md) | 4-layer design, type system, dispatch |
 | [Qwen3.5 Validation](docs/qwen35_validation_results.md) | Real model A/B test results |
 | [Integration Guide](docs/integration_guide.md) | llama.cpp, vLLM, Python |
 | [llama.cpp Plugin](integrations/llamacpp/README.md) | Step-by-step llama.cpp integration |
 | [Format Spec](spec/tq_format_v1.md) | Block structure, bit packing |
-| [Performance Deep Dive](bench/speed_int_vs_float_v2.c) | Integer vs FP32 benchmark |
 | [Changelog](CHANGELOG.md) | Full release notes |
 
 ---
```

docs/getting-started.md

Lines changed: 283 additions & 0 deletions
# Getting Started

A guide to building TurboQuant.cpp, trying it out, and integrating it into your project.

---

## 1. Build

### Requirements

- C11 / C++17 compiler (GCC 9+, Clang 12+, MSVC 2019+)
- CMake 3.20+
- Python 3.8+ (for the Python bindings/CLI)

### Build

```bash
git clone https://github.com/quantumaikr/TurboQuant.cpp
cd TurboQuant.cpp

cmake -B build -DCMAKE_BUILD_TYPE=Release \
      -DTQ_BUILD_TESTS=ON \
      -DTQ_BUILD_BENCH=ON

cmake --build build -j$(sysctl -n hw.ncpu 2>/dev/null || nproc)
```

### Test

```bash
ctest --test-dir build --output-on-failure
# 17 C++ test suites, 100% passing
```

---

## 2. Try It in 30 Seconds

Three demos you can run immediately after building:

```bash
# A/B comparison: FP16 vs each quantization type
./build/ab_test

# Per-model memory savings (Llama, Qwen, Phi)
./build/demo_real_model

# Speed: integer attention vs FP32
./build/speed_int_vs_float
```

---

## 3. CLI Tool (`tq`)

### Install

```bash
# Install the Python bindings (required by the CLI)
pip install -e bindings/python

# Run the CLI
python3 tools/tq info
```

### Commands

```bash
# Quantization type info (★ = recommended)
python3 tools/tq info

# Per-model memory savings calculator
python3 tools/tq +memory llama-3.2-3b 65536
python3 tools/tq +memory qwen3.5-0.8b 131072

# Performance benchmark
python3 tools/tq bench
python3 tools/tq bench --seq-len 2048 --head-dim 256

# JSON output for AI agents
python3 tools/tq info --json
python3 tools/tq +memory llama-3.2-3b 65536 --json

# A/B comparison (requires a build)
python3 tools/tq +compare
```
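The savings that `tq +memory` reports can be sanity-checked by hand. A minimal sketch, assuming a Llama-3.2-3B-like shape (28 layers, 8 KV heads, head dim 128; these numbers are an assumption, not read from the tool) and an effective ~4.25 bits/element for `uniform_4b`:

```python
# Back-of-envelope KV-cache sizing (illustrative; assumes a Llama-3.2-3B-like shape)
n_layers, n_kv_heads, head_dim = 28, 8, 128    # assumed model config
seq_len = 65536                                # context length, as in the CLI example

elems = 2 * n_layers * n_kv_heads * head_dim * seq_len  # K and V
fp16_gib = elems * 2 / 2**30                   # 2 bytes per element
q4_gib = elems * 4.25 / 8 / 2**30              # ~4.25 bits/element incl. scale overhead

print(f"FP16: {fp16_gib:.1f} GiB, uniform_4b: {q4_gib:.2f} GiB "
      f"({fp16_gib / q4_gib:.1f}x smaller)")
```

Note this ratio (16/4.25 ≈ 3.8x) is measured against an FP16 cache; the 7.5x figures elsewhere in this guide use an FP32 baseline (32/bpe, as in the Python API example below).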
### Chat with Qwen3.5-0.8B (real-model inference)

```bash
# Install torch + transformers (first time only)
python3 -m venv /tmp/tq_venv
source /tmp/tq_venv/bin/activate
pip install torch transformers numpy accelerate

# Chat mode
python3 tools/tq_chat.py

# Single question
python3 tools/tq_chat.py "What is KV cache quantization?"

# Benchmark mode
python3 tools/tq_chat.py --benchmark
```

---

## 4. Python API

```bash
pip install -e bindings/python
```

```python
from turboquant import TurboQuant
import numpy as np

tq = TurboQuant("cpu")

# Quantize
keys = np.random.randn(512, 128).astype(np.float32) * 0.15
compressed = tq.quantize_keys(keys, TurboQuant.UNIFORM_4B)
print(f"{keys.nbytes:,} → {len(compressed):,} bytes ({keys.nbytes/len(compressed):.1f}x)")

# Dequantize
decompressed = tq.dequantize_keys(compressed, 512, 128, TurboQuant.UNIFORM_4B)
mse = np.mean((keys - decompressed) ** 2)
print(f"MSE: {mse:.6f}")

# Attention
query = np.random.randn(128).astype(np.float32)
scores = tq.attention(query, compressed, 512, 128, TurboQuant.UNIFORM_4B)

# Compare types
for qtype in [TurboQuant.UNIFORM_4B, TurboQuant.MIXED_4B8, TurboQuant.UNIFORM_2B]:
    name = tq.type_name(qtype)
    bpe = tq.type_bpe(qtype)
    print(f"  {name}: {bpe:.1f} bits, {32/bpe:.1f}x compression")
```

---

## 5. C API

```c
#include "turboquant/turboquant.h"

// Initialize
tq_context_t* ctx;
tq_init(&ctx, TQ_BACKEND_CPU);

// Quantize
size_t size = tq_quantize_keys_size(seq_len, head_dim, TQ_TYPE_UNIFORM_4B);
void* buf = malloc(size);
tq_quantize_keys(ctx, keys, seq_len, head_dim, TQ_TYPE_UNIFORM_4B, buf, size);

// Attention (2.9-4.8x faster than FP32)
float scores[seq_len];
tq_attention(ctx, query, buf, seq_len, head_dim, TQ_TYPE_UNIFORM_4B, scores);

// Asymmetric K/V quantization (key 4-bit + value 2-bit = 9.8x compression)
tq_quantize_kv(ctx, keys, values, n, head_dim,
               TQ_TYPE_UNIFORM_4B, TQ_TYPE_UNIFORM_2B,
               key_out, key_size, val_out, val_size);

// RHT preprocessing (additional 1.8-3.9x MSE improvement)
tq_quantize_keys_rht(ctx, keys, n, head_dim,
                     TQ_TYPE_UNIFORM_4B, seed, out, size);

// Clean up
free(buf);
tq_free(ctx);
```
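The RHT preprocessing mentioned above works by spreading outlier energy across dimensions before quantization. A minimal NumPy sketch of the general idea (random sign flips followed by an orthonormal Hadamard rotation; this is a generic illustration of the technique, not the library's kernel):

```python
import numpy as np

def hadamard(d: int) -> np.ndarray:
    # Sylvester construction; d must be a power of two
    h = np.array([[1.0]])
    while h.shape[0] < d:
        h = np.block([[h, h], [h, -h]])
    return h

rng = np.random.default_rng(0)
d = 128
x = rng.standard_normal(d)
x[7] = 25.0                          # plant an outlier

signs = rng.choice([-1.0, 1.0], d)   # random ±1 diagonal (the "R" in RHT)
H = hadamard(d) / np.sqrt(d)         # orthonormal Hadamard matrix

y = H @ (signs * x)                  # rotated vector: outlier energy is spread out
x_back = signs * (H.T @ y)           # exact inverse, since the transform is orthonormal

print(np.max(np.abs(x)), np.max(np.abs(y)))  # peak magnitude drops sharply
```

Because the transform is orthogonal and exactly invertible, quantizing `y` instead of `x` changes the error distribution (fewer clipped outliers) without losing information.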
### CMake Integration

```cmake
add_subdirectory(TurboQuant.cpp)
target_link_libraries(my_app turboquant)
```

---

## 6. llama.cpp Integration

```bash
# Detailed guide: integrations/llamacpp/README.md

# Add to CMakeLists.txt
add_subdirectory(path/to/TurboQuant.cpp turboquant)
target_link_libraries(llama PRIVATE turboquant)
```

```cpp
#include "integrations/llamacpp/tq_kv_cache.cpp"

// Register types at startup
tq_ggml_register_types();

// CLI parser (21 aliases supported)
tq_type type = tq_parse_kv_cache_type("turbo3"); // or "tq-uniform-4b", "uniform_4b" ...

// List the available types
tq_print_kv_cache_types();
```

---

## 7. Real-Model Validation

Validation against real KV caches from Qwen3.5-0.8B:

```bash
# Dump KV caches from the real model
source /tmp/tq_venv/bin/activate
python3 tests/reference/dump_qwen35_kv.py

# Validate quantization quality
./build/qwen35_validation
```

Validation results: [docs/qwen35_validation_results.md](qwen35_validation_results.md)
---

## 8. Running the Benchmarks

```bash
# Quality (MSE, cosine, cross-platform)
./build/tq_quality

# Performance (throughput, compression, SIMD)
./build/tq_bench

# Integer vs FP32 attention speed
./build/speed_int_vs_float

# Individual kernel performance
./build/bench_kernel

# Memory usage
./build/bench_memory
```

---

## 9. Project Structure

```
include/turboquant/   Public C API
src/core/             Algorithms (polar, qjl, turbo, uniform, mixed, rht)
src/cache/            Paged cache + progressive compression
src/backend/cpu/      CPU kernels (generic, NEON, AVX2)
src/backend/cuda/     CUDA kernels
src/backend/metal/    Metal shaders
tests/                Google Test (17 suites)
bench/                Benchmarks
tools/                CLI tools (tq, tq_chat)
examples/             Examples (C, C++, Python)
integrations/         llama.cpp, vLLM integrations
bindings/python/      Python ctypes bindings
spec/                 Format spec + test vectors
docs/                 Documentation
```

---

## 10. Recommended Quantization Strategies

Based on real Qwen3.5-0.8B A/B tests:

| Scenario | Recommendation | Quality | Compression |
|------|------|------|------|
| **Production default** | `uniform_4b` | A+ (0.994) | 7.5x |
| **Best quality/size trade-off** | K4V2 (key=4b, val=2b) | ~0.97 | 9.8x |
| **Models with heavy outliers** | `mixed_4b8` | A+ (0.994) | 6.4x |
| **Extreme compression** | `uniform_2b` | A (0.953) | 14.2x |
| **Maximum quality** | RHT + `uniform_4b` | A++ | 7.5x |
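The compression column is consistent with an effective bits-per-element of roughly the nominal width plus ~0.25 bits of scale metadata. This is a back-calculation from the table's numbers, not the format definition (see spec/tq_format_v1.md for the actual block layout):

```python
# Reproducing the table's compression ratios from an assumed effective bpe
# (nominal bits + ~0.25 bits/element of metadata; illustrative, not the spec)
ratios = {name: 32 / bpe for name, bpe in [
    ("uniform_4b", 4.25),
    ("uniform_2b", 2.25),
    ("K4V2 average", (4.25 + 2.25) / 2),
]}
for name, r in ratios.items():
    print(f"{name}: {r:.1f}x vs FP32")
```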

```bash
# Check via the CLI
python3 tools/tq info
```
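The core idea behind `uniform_4b` can be sketched in a few lines of NumPy: per-row affine quantization to 4-bit codes. This is a conceptual illustration only; the library's block format, bit packing, and kernels differ:

```python
import numpy as np

def quantize_4b(x: np.ndarray):
    """Per-row affine 4-bit quantization: x ≈ codes * scale + lo."""
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0                            # 16 levels for 4 bits
    codes = np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale, lo

def dequantize_4b(codes, scale, lo):
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
keys = (rng.standard_normal((512, 128)) * 0.15).astype(np.float32)

codes, scale, lo = quantize_4b(keys)
recon = dequantize_4b(codes, scale, lo)
mse = float(np.mean((keys - recon) ** 2))
print(f"MSE: {mse:.6f}")   # small relative to the 0.15 std of the data
```

A real implementation additionally packs two 4-bit codes per byte and stores scale/offset per block, which is where the ~0.25 extra bits per element of overhead comes from.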
