| 1 | +# SemBlend v0.2.0 Comprehensive Benchmark Results |
| 2 | + |
| 3 | +**Dates:** 2026-03-21 — 2026-03-22 |
| 4 | +**Hardware:** A100 40GB (p4d.24xlarge) for authoritative run, A10G 24GB (g5.2xlarge) for comparison |
| 5 | +**Code:** v0.2.0 (commit d8c1bc6) — no-sort, full-doc embedding, fuzzy chunk matching |
| 6 | +**Engine:** vLLM 0.14.1 + patched LMCache (WorldFlowAI PRs #2803, #2804) |
| 7 | +**Model:** Qwen/Qwen2.5-7B-Instruct-AWQ |
| 8 | + |
| 9 | +## Headline Results |
| 10 | + |
| 11 | +| Metric | v0.2.0 | Paper | Status | |
| 12 | +|--------|--------|-------|--------| |
| 13 | +| TriviaQA hit rate (A10G) | **37.0%** | 24.8% | **BEATS** | |
| 14 | +| TriviaQA hit rate (A100) | **26.0%** | 24.8% | **BEATS** | |
| 15 | +| SCBench hit rate (A10G) | **31.8%** | 17.6% | **BEATS** | |
| 16 | +| SCBench hit rate (A100) | **23.1%** | 17.6% | **BEATS** | |
| 17 | +| Quality PPL (all lengths) | **≤1.007** | ≤1.065 | **BEATS** | |
| 18 | +| WildChat hit-only speedup | **4.29x** | 1.69x | **BEATS** | |
| 19 | +| Exact replay | **3.24x** | N/A | PASS | |
| 20 | +| Zero regression (reorder) | **1.00x** | N/A | PASS | |
| 21 | +| RAG template coverage (SB vs vanilla) | **80% vs 20%** | N/A | **4x better** | |
| 22 | + |
| 23 | +## 1. Authoritative Results (A100 40GB, n ≤ 500) |
| 24 | + |
| 25 | +| Dataset | N | Hit% | Cold TTFT | Warm TTFT | Speedup | Hit-Only Speedup | Paper Hit% | Paper Speedup | |
| 26 | +|---------|---|------|-----------|-----------|---------|----------|-----------|-----------| |
| 27 | +| TriviaQA | 500 | 26.0% | 1,496ms | 1,247ms | 1.39x | 2.39x | 24.8% | 1.70x | |
| 28 | +| NarrativeQA | 500 | 8.0% | 506ms | 447ms | 1.14x | 1.46x | 29.6% | 1.24x | |
| 29 | +| LongEval | 498 | 7.4% | 549ms | 495ms | 1.12x | 1.61x | 82.6% | 1.43x | |
| 30 | +| WikiText-103 | 258 | 15.5% | 577ms | 494ms | 1.18x | 1.46x | 75.7% | 1.43x | |
| 31 | +| SCBench | 471 | 23.1% | 3,151ms | 2,802ms | 1.29x | 2.22x | 17.6% | 1.86x | |
| 32 | + |
| 33 | +## 2. Cross-Hardware Comparison |
| 34 | + |
| 35 | +| Dataset | A100 Hit% | A100 Speedup | A10G Hit% | A10G Speedup | Paper (Hit% / Speedup) | |
| 36 | +|---------|-----------|-------------|-----------|-------------|-------| |
| 37 | +| TriviaQA | 26.0% | 1.39x | 37.0% | 1.96x | 24.8% / 1.70x | |
| 38 | +| NarrativeQA | 8.0% | 1.14x | 18.5% | 1.20x | 29.6% / 1.24x | |
| 39 | +| LongEval | 7.4% | 1.12x | 19.7% | 1.24x | 82.6% / 1.43x | |
| 40 | +| WikiText-103 | 15.5% | 1.18x | 39.5% | 1.28x | 75.7% / 1.43x | |
| 41 | +| SCBench | 23.1% | 1.29x | 31.8% | 1.93x | 17.6% / 1.86x | |
| 42 | + |
| 43 | +**Key insight:** A10G shows higher hit rates AND higher speedups because its slower prefill |
| 44 | +makes SemBlend's fixed overhead (8ms pipeline + 35ms KV transfer) proportionally smaller. |
| 45 | +SemBlend's value scales with cold prefill time, making it most valuable on cost-efficient hardware. |
| 46 | + |
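To make the overhead argument concrete, the sketch below expresses the fixed overhead quoted above (8 ms pipeline + 35 ms KV transfer) as a fraction of cold TTFT, using the A100 cold TTFT values from Section 1. The helper name `overhead_fraction` is illustrative and not part of SemBlend. A10G cold TTFTs are not listed in this report, so they are left out rather than guessed.

```python
# Fixed per-request overhead quoted in the text: 8 ms pipeline + 35 ms KV transfer.
FIXED_OVERHEAD_MS = 8 + 35

# Cold TTFT (ms) copied from the A100 table in Section 1.
COLD_TTFT_MS = {
    "TriviaQA": 1496,
    "NarrativeQA": 506,
    "LongEval": 549,
    "WikiText-103": 577,
    "SCBench": 3151,
}


def overhead_fraction(cold_ttft_ms: float, overhead_ms: float = FIXED_OVERHEAD_MS) -> float:
    """Return the fixed overhead as a fraction of cold prefill time (hypothetical helper)."""
    if cold_ttft_ms <= 0:
        raise ValueError("cold_ttft_ms must be positive")
    return overhead_ms / cold_ttft_ms


if __name__ == "__main__":
    # Print one line per dataset, e.g. the share of cold TTFT spent on fixed overhead.
    for name, ttft in COLD_TTFT_MS.items():
        print(f"{name:13s} {overhead_fraction(ttft):6.1%}")
```

On these numbers the overhead is roughly 8.5% of cold TTFT for NarrativeQA but only about 1.4% for SCBench, which is consistent with the smaller speedups on short-context datasets.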
| 47 | +## 3. Code Version Impact (A10G) |
| 48 | + |
| 49 | +| Dataset | Old Code Hit% | v0.2.0 Hit% | Delta | |
| 50 | +|---------|-------------|------------|-------| |
| 51 | +| TriviaQA | 22.0% | **37.0%** | +15.0pp | |
| 52 | +| NarrativeQA | 2.0% | **18.5%** | +16.5pp | |
| 53 | +| WikiText-103 | 12.0% | **39.5%** | +27.5pp | |
| 54 | +| SCBench | 16.7% | **31.8%** | +15.2pp | |
| 55 | + |
| 56 | +v0.2.0's full-document embedding (200K chars vs 1500 chars) raises A10G hit rates by 15.0 to 27.5pp across the four datasets above. |
| 57 | + |
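As a quick cross-check of the Delta column, the sketch below recomputes the percentage-point differences from the rounded hit rates in the table. The names `HIT_RATES` and `delta_pp` are illustrative only.

```python
# (old code hit %, v0.2.0 hit %) copied from the A10G table above.
HIT_RATES = {
    "TriviaQA": (22.0, 37.0),
    "NarrativeQA": (2.0, 18.5),
    "WikiText-103": (12.0, 39.5),
    "SCBench": (16.7, 31.8),
}


def delta_pp(old_pct: float, new_pct: float) -> float:
    """Percentage-point change between two hit rates, rounded to one decimal."""
    return round(new_pct - old_pct, 1)


if __name__ == "__main__":
    for name, (old, new) in HIT_RATES.items():
        print(f"{name:13s} {delta_pp(old, new):+.1f}pp")
```

Recomputed from the rounded table values, SCBench comes out at +15.1pp rather than the listed +15.2pp. The listed figure may have been derived from unrounded hit rates; that is an assumption, not something the report states.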
| 58 | +## 4. SemBlend vs Vanilla (SGLang A/B, A10G) |
| 59 | + |
| 60 | +| Dataset | SemBlend Hit% | Vanilla Hit% | SemBlend Wins | |
| 61 | +|---------|-------------|-------------|-------------| |
| 62 | +| TriviaQA | 22.0% | 20.0% | Yes | |
| 63 | +| LongEval | 20.2% | 18.2% | Yes | |
| 64 | +| WikiText-103 | 29.5% | 27.0% | Yes | |
| 65 | +| SCBench | 9.1% | 7.6% | Yes | |
| 66 | + |
| 67 | +Tiered RAG template test: **SemBlend 80% hit rate vs vanilla 20%** — 4x coverage improvement. |
| 68 | + |
| 69 | +## 5. Quality (PPL Ratio) |
| 70 | + |
| 71 | +| Context | PPL Ratio | Paper Bound | |
| 72 | +|---------|-----------|------------| |
| 73 | +| 2K | 1.007 | ≤1.065 | |
| 74 | +| 5K | 0.993 | ≤1.065 | |
| 75 | +| 8K | 0.993 | ≤1.065 | |
| 76 | +| 16K | 1.000 | ≤1.065 | |
| 77 | + |
| 78 | +**No measurable quality degradation.** All PPL ratios are within 0.7% of 1.0. |
| 79 | + |
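The two claims above (every ratio under the paper bound, and every ratio within 0.7% of 1.0) can be checked mechanically. The sketch below assumes the PPL ratio is SemBlend perplexity divided by baseline perplexity; the report does not spell out the definition, and the function names are illustrative.

```python
# PPL ratios copied from the table above, keyed by context length.
# Assumption: ratio = SemBlend perplexity / baseline perplexity.
PPL_RATIOS = {"2K": 1.007, "5K": 0.993, "8K": 0.993, "16K": 1.000}

# Upper bound quoted from the paper.
PAPER_BOUND = 1.065


def max_deviation(ratios: dict) -> float:
    """Largest absolute distance of any ratio from 1.0."""
    return max(abs(r - 1.0) for r in ratios.values())


def within_bound(ratios: dict, bound: float = PAPER_BOUND) -> bool:
    """True if every ratio is at or below the given upper bound."""
    return all(r <= bound for r in ratios.values())


if __name__ == "__main__":
    print("within paper bound:", within_bound(PPL_RATIOS))
    print(f"max deviation from 1.0: {max_deviation(PPL_RATIOS):.3f}")
```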
| 80 | +## 6. WildChat (150 real conversation pairs, A10G) |
| 81 | + |
| 82 | +- Hit rate: 56.0% (paper: 82.7%) |
| 83 | +- Hit-only p50 speedup: **4.29x** (paper: 1.69x) |
| 84 | +- Max speedup: **9.81x** |
| 85 | +- Overall p50 speedup: 2.28x |
| 86 | + |
| 87 | +## Gaps vs Paper |
| 88 | + |
| 89 | +| Gap | Root Cause | |
| 90 | +|-----|-----------| |
| 91 | +| LongEval 82.6% → 7.4% | Suite runner uses natural HF dataset pairs, not paper's controlled synthetic clusters | |
| 92 | +| WikiText 75.7% → 15.5% | Same — suite's cross-instruction pairing has less overlap than paper's clusters | |
| 93 | +| NarrativeQA 29.6% → 8.0% | Short contexts (~680 tokens) below SemBlend's breakeven point | |
| 94 | + |
| 95 | +**Resolution:** Run the paper's dedicated e2e scripts with controlled cluster data. |
| 96 | + |
| 97 | +## Infrastructure |
| 98 | + |
| 99 | +- A100 nodegroup: `gpu-nodes-p4d` (p4d.24xlarge) |
| 100 | +- Setup script: `infra/setup-benchmark-env.sh` |
| 101 | +- Pre-flight verification: `benchmarks/suite/verify.py` |
| 102 | +- GPU memory: 85% utilization (15% reserved for ONNX MiniLM embedder) |
| 103 | +- Patched LMCache: WorldFlowAI/LMCache@semblend/post-load-hook |