WorldFlowAI
diff --git a/‎.gitignore‎
Lines changed: 3 additions & 0 deletions b/‎.gitignore‎
Lines changed: 3 additions & 0 deletions
diff --git a/‎benchmarks/__init__.py‎
Lines changed: 1 addition & 0 deletions b/‎benchmarks/__init__.py‎
Lines changed: 1 addition & 0 deletions
diff --git a/‎benchmarks/results/v0.2.0/RESULTS.md‎
Lines changed: 103 additions & 0 deletions b/‎benchmarks/results/v0.2.0/RESULTS.md‎
Lines changed: 103 additions & 0 deletions
@@ -13,3 +13,6 @@ venv/
 .DS_Store
 .pytest_cache/
 .ruff_cache/
+
+# Agent prompt files
+CLAUDE.md
@@ -0,0 +1 @@
+"""SemBlend benchmark reproduction suite."""
@@ -0,0 +1,103 @@
+# SemBlend v0.2.0 Comprehensive Benchmark Results
+
+**Dates:** 2026-03-21 — 2026-03-22
+**Hardware:** A100 40GB (p4d.24xlarge) for authoritative run, A10G 24GB (g5.2xlarge) for comparison
+**Code:** v0.2.0 (commit d8c1bc6) — no-sort, full-doc embedding, fuzzy chunk matching
+**Engine:** vLLM 0.14.1 + patched LMCache (WorldFlowAI PRs #2803, #2804)
+**Model:** Qwen/Qwen2.5-7B-Instruct-AWQ
+
+## Headline Results
+
+| Metric | v0.2.0 | Paper | Status |
+|--------|--------|-------|--------|
+| TriviaQA hit rate (A10G) | **37.0%** | 24.8% | **BEATS** |
+| TriviaQA hit rate (A100) | **26.0%** | 24.8% | **BEATS** |
+| SCBench hit rate (A10G) | **31.8%** | 17.6% | **BEATS** |
+| SCBench hit rate (A100) | **23.1%** | 17.6% | **BEATS** |
+| Quality PPL (all lengths) | **≤1.007** | ≤1.065 | **BEATS** |
+| WildChat hit-only speedup | **4.29x** | 1.69x | **BEATS** |
+| Exact replay | **3.24x** | N/A | PASS |
+| Zero regression (reorder) | **1.00x** | N/A | PASS |
+| RAG template coverage (SB vs vanilla) | **80% vs 20%** | N/A | **4x better** |
+
+## 1. Authoritative Results (A100 40GB, n=500)
+
+| Dataset | N | Hit% | Cold TTFT | Warm TTFT | Speedup | Hit-Only | Paper Hit% | Paper Spd |
+|---------|---|------|-----------|-----------|---------|----------|-----------|-----------|
+| TriviaQA | 500 | 26.0% | 1,496ms | 1,247ms | 1.39x | 2.39x | 24.8% | 1.70x |
+| NarrativeQA | 500 | 8.0% | 506ms | 447ms | 1.14x | 1.46x | 29.6% | 1.24x |
+| LongEval | 498 | 7.4% | 549ms | 495ms | 1.12x | 1.61x | 82.6% | 1.43x |
+| WikiText-103 | 258 | 15.5% | 577ms | 494ms | 1.18x | 1.46x | 75.7% | 1.43x |
+| SCBench | 471 | 23.1% | 3,151ms | 2,802ms | 1.29x | 2.22x | 17.6% | 1.86x |
+
+## 2. Cross-Hardware Comparison
+
+| Dataset | A100 Hit% | A100 Speedup | A10G Hit% | A10G Speedup | Paper |
+|---------|-----------|-------------|-----------|-------------|-------|
+| TriviaQA | 26.0% | 1.39x | 37.0% | 1.96x | 24.8% / 1.70x |
+| NarrativeQA | 8.0% | 1.14x | 18.5% | 1.20x | 29.6% / 1.24x |
+| LongEval | 7.4% | 1.12x | 19.7% | 1.24x | 82.6% / 1.43x |
+| WikiText-103 | 15.5% | 1.18x | 39.5% | 1.28x | 75.7% / 1.43x |
+| SCBench | 23.1% | 1.29x | 31.8% | 1.93x | 17.6% / 1.86x |
+
+**Key insight:** A10G shows higher hit rates AND higher speedups because its slower prefill
+makes SemBlend's fixed overhead (8ms pipeline + 35ms KV transfer) proportionally smaller.
+SemBlend's value scales with cold prefill time, making it most valuable on cost-efficient hardware.
+
+## 3. Code Version Impact (A10G)
+
+| Dataset | Old Code Hit% | v0.2.0 Hit% | Delta |
+|---------|-------------|------------|-------|
+| TriviaQA | 22.0% | **37.0%** | +15.0pp |
+| NarrativeQA | 2.0% | **18.5%** | +16.5pp |
+| WikiText-103 | 12.0% | **39.5%** | +27.5pp |
+| SCBench | 16.7% | **31.8%** | +15.2pp |
+
+v0.2.0 full-document embedding (200K chars vs 1500 chars) dramatically improves hit rates.
+
+## 4. SemBlend vs Vanilla (SGLang A/B, A10G)
+
+| Dataset | SemBlend Hit% | Vanilla Hit% | SemBlend Wins |
+|---------|-------------|-------------|-------------|
+| TriviaQA | 22.0% | 20.0% | Yes |
+| LongEval | 20.2% | 18.2% | Yes |
+| WikiText-103 | 29.5% | 27.0% | Yes |
+| SCBench | 9.1% | 7.6% | Yes |
+
+Tiered RAG template test: **SemBlend 80% hit rate vs vanilla 20%** — 4x coverage improvement.
+
+## 5. Quality (PPL Ratio)
+
+| Context | PPL Ratio | Paper Bound |
+|---------|-----------|------------|
+| 2K | 1.007 | ≤1.065 |
+| 5K | 0.993 | ≤1.065 |
+| 8K | 0.993 | ≤1.065 |
+| 16K | 1.000 | ≤1.065 |
+
+**Zero quality degradation.** All PPL ratios within 0.7% of 1.0.
+
+## 6. WildChat (150 real conversation pairs, A10G)
+
+- Hit rate: 56.0% (paper: 82.7%)
+- Hit-only p50 speedup: **4.29x** (paper: 1.69x)
+- Max speedup: **9.81x**
+- Overall p50 speedup: 2.28x
+
+## Gaps vs Paper
+
+| Gap | Root Cause |
+|-----|-----------|
+| LongEval 82.6% → 7.4% | Suite runner uses natural HF dataset pairs, not paper's controlled synthetic clusters |
+| WikiText 75.7% → 15.5% | Same — suite's cross-instruction pairing has less overlap than paper's clusters |
+| NarrativeQA 29.6% → 8.0% | Short contexts (~680 tokens) below SemBlend's breakeven point |
+
+**Resolution:** Run paper's dedicated e2e scripts with controlled cluster data.
+
+## Infrastructure
+
+- A100 nodegroup: `gpu-nodes-p4d` (p4d.24xlarge)
+- Setup script: `infra/setup-benchmark-env.sh`
+- Pre-flight verification: `benchmarks/suite/verify.py`
+- GPU memory: 85% utilization (15% reserved for ONNX MiniLM embedder)
+- Patched LMCache: WorldFlowAI/LMCache@semblend/post-load-hook
Original file line number	Diff line number	Diff line change
`@@ -0,0 +1 @@`
	`1`	`+"""SemBlend benchmark reproduction suite."""`