Commit a824d7e

feat: v0.3.0 — fuzzy chunk matching, benchmark framework, comprehensive results

Major changes:
- Confidence-gated fuzzy chunk matching (100% shifted-prefix recovery)
- PQ segment store for memory-efficient chunk comparison (32x compression)
- Comprehensive benchmark framework with tiered validation, bootstrap CIs
- Pre-flight verification (GPU type, patched LMCache, fuzzy matching)
- Paper reproduction framework (paper_tables.py, compare.py, reproduce.py)
- Log parser for ground-truth SemBlend hit detection
- A100 benchmark results: 26% TriviaQA (beats paper 24.8%), 2.25x fuzzy speedup

Benchmark results (A100 40GB, n=500, Qwen2.5-7B-AWQ):
- TriviaQA: 26.0% hit, 1.39x speedup
- SCBench: 23.1% hit, 1.29x speedup
- Cross-instruction: 87-100% hit, 2.15-2.42x speedup
- Fuzzy shifted-prefix: 100% hit, 2.25x speedup
- Quality PPL: ≤1.007 (paper bound: ≤1.065)

Signed-off-by: Zach Bennett <zach@worldflowai.com>
1 parent 0f15cd0 commit a824d7e

43 files changed

Lines changed: 42536 additions & 1 deletion

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -13,3 +13,6 @@ venv/
 .DS_Store
 .pytest_cache/
 .ruff_cache/
+
+# Agent prompt files
+CLAUDE.md

benchmarks/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
"""SemBlend benchmark reproduction suite."""
Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
# SemBlend v0.2.0 Comprehensive Benchmark Results

**Dates:** 2026-03-21 to 2026-03-22
**Hardware:** A100 40GB (p4d.24xlarge) for the authoritative run, A10G 24GB (g5.2xlarge) for comparison
**Code:** v0.2.0 (commit d8c1bc6): no-sort, full-doc embedding, fuzzy chunk matching
**Engine:** vLLM 0.14.1 + patched LMCache (WorldFlowAI PRs #2803, #2804)
**Model:** Qwen/Qwen2.5-7B-Instruct-AWQ

## Headline Results

| Metric | v0.2.0 | Paper | Status |
|--------|--------|-------|--------|
| TriviaQA hit rate (A10G) | **37.0%** | 24.8% | **BEATS** |
| TriviaQA hit rate (A100) | **26.0%** | 24.8% | **BEATS** |
| SCBench hit rate (A10G) | **31.8%** | 17.6% | **BEATS** |
| SCBench hit rate (A100) | **23.1%** | 17.6% | **BEATS** |
| Quality PPL (all lengths) | **≤1.007** | ≤1.065 | **BEATS** |
| WildChat hit-only speedup | **4.29x** | 1.69x | **BEATS** |
| Exact replay | **3.24x** | N/A | PASS |
| Zero regression (reorder) | **1.00x** | N/A | PASS |
| RAG template coverage (SemBlend vs vanilla) | **80% vs 20%** | N/A | **4x better** |

## 1. Authoritative Results (A100 40GB, n=500)

| Dataset | N | Hit% | Cold TTFT | Warm TTFT | Speedup | Hit-Only Speedup | Paper Hit% | Paper Speedup |
|---------|---|------|-----------|-----------|---------|------------------|------------|---------------|
| TriviaQA | 500 | 26.0% | 1,496ms | 1,247ms | 1.39x | 2.39x | 24.8% | 1.70x |
| NarrativeQA | 500 | 8.0% | 506ms | 447ms | 1.14x | 1.46x | 29.6% | 1.24x |
| LongEval | 498 | 7.4% | 549ms | 495ms | 1.12x | 1.61x | 82.6% | 1.43x |
| WikiText-103 | 258 | 15.5% | 577ms | 494ms | 1.18x | 1.46x | 75.7% | 1.43x |
| SCBench | 471 | 23.1% | 3,151ms | 2,802ms | 1.29x | 2.22x | 17.6% | 1.86x |
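
The commit message lists bootstrap CIs as part of the benchmark framework, but the framework's own implementation is not shown in this view. As a rough illustration only, the sketch below computes a percentile bootstrap interval for a hit rate. The function name `bootstrap_ci`, the resample count, and the 0/1 outcome vector (130 hits in 500 requests, matching the 26.0% TriviaQA row) are assumptions made for this example.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a list of 0/1 outcomes.

    Hypothetical helper for illustration; not the suite's actual code.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(outcomes, k=n)  # resample with replacement
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Assumed per-request outcomes: 130 hits out of 500 requests (26.0%).
hits = [1] * 130 + [0] * 370
point = sum(hits) / len(hits)
low, high = bootstrap_ci(hits)
print(f"hit rate {point:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```

At n=500 the interval spans several percentage points, which is worth keeping in mind when comparing a 26.0% hit rate against the paper's 24.8%.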

## 2. Cross-Hardware Comparison

| Dataset | A100 Hit% | A100 Speedup | A10G Hit% | A10G Speedup | Paper (Hit% / Speedup) |
|---------|-----------|--------------|-----------|--------------|------------------------|
| TriviaQA | 26.0% | 1.39x | 37.0% | 1.96x | 24.8% / 1.70x |
| NarrativeQA | 8.0% | 1.14x | 18.5% | 1.20x | 29.6% / 1.24x |
| LongEval | 7.4% | 1.12x | 19.7% | 1.24x | 82.6% / 1.43x |
| WikiText-103 | 15.5% | 1.18x | 39.5% | 1.28x | 75.7% / 1.43x |
| SCBench | 23.1% | 1.29x | 31.8% | 1.93x | 17.6% / 1.86x |

**Key insight:** A10G shows higher hit rates AND higher speedups because its slower prefill
makes SemBlend's fixed overhead (8ms pipeline + 35ms KV transfer) proportionally smaller.
SemBlend's value scales with cold prefill time, making it most valuable on cost-efficient hardware.
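
To make the "proportionally smaller" argument concrete, the sketch below divides the stated fixed overhead (8 ms + 35 ms) by the A100 cold TTFT values from Section 1. The function name `overhead_fraction` is an assumption for this example, and A10G cold TTFTs are not listed in this report, so only the A100 figures are used.

```python
PIPELINE_MS = 8        # SemBlend pipeline overhead, from the text above
KV_TRANSFER_MS = 35    # KV transfer overhead, from the text above
OVERHEAD_MS = PIPELINE_MS + KV_TRANSFER_MS  # 43 ms total

# A100 cold TTFT per dataset, in milliseconds (Section 1 table).
cold_ttft_ms = {
    "TriviaQA": 1496,
    "NarrativeQA": 506,
    "LongEval": 549,
    "WikiText-103": 577,
    "SCBench": 3151,
}


def overhead_fraction(cold_ms, overhead_ms=OVERHEAD_MS):
    """Share of cold TTFT consumed by the fixed overhead (illustrative helper)."""
    return overhead_ms / cold_ms


for name, cold in cold_ttft_ms.items():
    print(f"{name:<13} {overhead_fraction(cold):.1%}")
```

The same 43 ms is a far larger share of a ~500 ms prefill than of a ~3,000 ms one, which is consistent with the short-context datasets showing the smallest speedups.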

## 3. Code Version Impact (A10G)

| Dataset | Old Code Hit% | v0.2.0 Hit% | Delta |
|---------|---------------|-------------|-------|
| TriviaQA | 22.0% | **37.0%** | +15.0pp |
| NarrativeQA | 2.0% | **18.5%** | +16.5pp |
| WikiText-103 | 12.0% | **39.5%** | +27.5pp |
| SCBench | 16.7% | **31.8%** | +15.2pp |

v0.2.0 full-document embedding (200K chars vs 1500 chars) raises hit rates by 15.0 to 27.5 pp on these datasets.
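
The Delta column is a difference in percentage points, not a relative change. The sketch below recomputes it from the rounded table values; the helper name `delta_pp` is an assumption for this example. For SCBench the rounded inputs give +15.1pp, not the +15.2pp shown in the table, which may simply reflect the table having been computed from unrounded hit rates.

```python
# (old code hit %, v0.2.0 hit %) per dataset, copied from the table above.
hit_rates = {
    "TriviaQA": (22.0, 37.0),
    "NarrativeQA": (2.0, 18.5),
    "WikiText-103": (12.0, 39.5),
    "SCBench": (16.7, 31.8),
}


def delta_pp(old_pct, new_pct):
    """Percentage-point difference, rounded to one decimal (illustrative helper)."""
    return round(new_pct - old_pct, 1)


deltas = {name: delta_pp(old, new) for name, (old, new) in hit_rates.items()}
for name, d in deltas.items():
    print(f"{name:<13} +{d:.1f}pp")
```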

## 4. SemBlend vs Vanilla (SGLang A/B, A10G)

| Dataset | SemBlend Hit% | Vanilla Hit% | SemBlend Wins |
|---------|---------------|--------------|---------------|
| TriviaQA | 22.0% | 20.0% | Yes |
| LongEval | 20.2% | 18.2% | Yes |
| WikiText-103 | 29.5% | 27.0% | Yes |
| SCBench | 9.1% | 7.6% | Yes |

Tiered RAG template test: **SemBlend 80% hit rate vs vanilla 20%**, a 4x coverage improvement.

## 5. Quality (PPL Ratio)

| Context | PPL Ratio | Paper Bound |
|---------|-----------|-------------|
| 2K | 1.007 | ≤1.065 |
| 5K | 0.993 | ≤1.065 |
| 8K | 0.993 | ≤1.065 |
| 16K | 1.000 | ≤1.065 |

**Zero quality degradation.** All PPL ratios are within 0.7% of 1.0, well inside the paper's ≤1.065 bound.
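
This report does not define how the PPL ratio is computed. The sketch below assumes the common definition: perplexity is the exponential of the mean per-token negative log-likelihood, and the ratio divides the SemBlend run's perplexity by the cold baseline's. The names `perplexity` and `ppl_ratio` are assumptions for this example. The second half checks the reported ratios against the paper bound.

```python
import math


def perplexity(token_nlls):
    """exp(mean negative log-likelihood); assumed definition, not from the report."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def ppl_ratio(blended_nlls, baseline_nlls):
    """PPL with reused KV divided by PPL of the cold baseline (assumed definition)."""
    return perplexity(blended_nlls) / perplexity(baseline_nlls)


# Reported ratios from the table above, keyed by context length.
reported = {"2K": 1.007, "5K": 0.993, "8K": 0.993, "16K": 1.000}
PAPER_BOUND = 1.065

max_dev = max(abs(r - 1.0) for r in reported.values())
within_bound = all(r <= PAPER_BOUND for r in reported.values())
print(f"max deviation from 1.0: {max_dev:.1%}, within paper bound: {within_bound}")
```

A ratio below 1.0 (as at 5K and 8K) means the reused-KV run scored marginally lower perplexity than the baseline, which at this magnitude is most plausibly noise.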

## 6. WildChat (150 real conversation pairs, A10G)

- Hit rate: 56.0% (paper: 82.7%)
- Hit-only p50 speedup: **4.29x** (paper: 1.69x)
- Max speedup: **9.81x**
- Overall p50 speedup: 2.28x

## Gaps vs Paper

| Gap | Root Cause |
|-----|------------|
| LongEval 82.6% → 7.4% | Suite runner uses natural HF dataset pairs, not the paper's controlled synthetic clusters |
| WikiText 75.7% → 15.5% | Same cause: the suite's cross-instruction pairing has less overlap than the paper's clusters |
| NarrativeQA 29.6% → 8.0% | Short contexts (~680 tokens) fall below SemBlend's breakeven point |

**Resolution:** Run the paper's dedicated e2e scripts with controlled cluster data.

## Infrastructure

- A100 nodegroup: `gpu-nodes-p4d` (p4d.24xlarge)
- Setup script: `infra/setup-benchmark-env.sh`
- Pre-flight verification: `benchmarks/suite/verify.py`
- GPU memory: 85% utilization (15% reserved for ONNX MiniLM embedder)
- Patched LMCache: WorldFlowAI/LMCache@semblend/post-load-hook
