update bench result

carsontung666 · carsontung666 · commit 6860d4be8dca · 2026-05-25T16:54:16.000+08:00
diff --git a/README.md b/README.md
@@ -183,23 +183,38 @@ Available rankers are `none`, `bm25`, and `vector`. The vector ranker uses
 LiteLLM embeddings (`--embedding-provider`, `--embedding-model`) and leaves
 the default `ranker=none` unchanged.
 
-#### Latest Run (Claude Sonnet 4.6, `--strategy block --ranker none`, top-k=10)
+#### Latest Run
 
-The cutoff for each query is its gold-file count: one-gold queries use top-1,
-two-gold queries use top-2, and so on. The `6+` row contains one 6-gold query
-and one 21-gold outlier.
+Claude Sonnet 4.6, `--ranker none`, `--top-k 10`, 500 queries, 0 failures.
+Block (ConDB) is compared against a **Vertical baseline**: a per-beam-branch
+variant that expands each parent's children into separate subtree blocks
+(`A→B`, `A→C`) with one LLM call per branch, losing the cross-branch view Block keeps.
+
+| variant | recall@gold | exact@gold | MRR | nDCG@10 | avg returned | avg latency |
+|---|---:|---:|---:|---:|---:|---:|
+| Vertical (baseline) | 0.382 | 0.366 | 0.466 | 0.481 | 3.00 | ~24 s |
+| **Block (ConDB)** | **0.711** | **0.672** | **0.805** | **0.813** | 7.20 | ~8 s |
+
+Block gains **+0.33 recall@gold** over Vertical (0.711 vs 0.382) at roughly one-third the latency (~8 s vs ~24 s).
+
+Block per-gold-count breakdown. The cutoff for each query is its gold-file count;
+the `6+` row contains one 6-gold query and one 21-gold outlier:
 
 | gold files | queries | cutoff | recall@gold | exact@gold | found@gold | avg returned |
-|------------|--------:|-------:|------------:|-----------:|-----------:|-------------:|
-| 1          | 430     | 1      | 0.749       | 0.749      | 0.75       | 7.00         |
-| 2          | 48      | 2      | 0.521       | 0.271      | 1.04       | 8.31         |
-| 3          | 13      | 3      | 0.410       | 0.077      | 1.23       | 8.77         |
-| 4          | 6       | 4      | 0.417       | 0.000      | 1.67       | 9.17         |
-| 5          | 1       | 5      | 0.200       | 0.000      | 1.00       | 2.00         |
-| 6+         | 2       | gold   | 0.274       | 0.000      | 2.00       | 10.00        |
-
-The full 500-query run aggregates to `recall@gold=0.711`, `exact@gold=0.672`,
-`MRR=0.805`, `nDCG@10=0.813`, `avg gold=1.24`, and `avg returned=7.20`.
+|---|---:|---:|---:|---:|---:|---:|
+| 1 | 430 | 1 | 0.749 | 0.749 | 0.75 | 7.00 |
+| 2 | 48 | 2 | 0.521 | 0.271 | 1.04 | 8.31 |
+| 3 | 13 | 3 | 0.410 | 0.077 | 1.23 | 8.77 |
+| 4 | 6 | 4 | 0.417 | 0.000 | 1.67 | 9.17 |
+| 5 | 1 | 5 | 0.200 | 0.000 | 1.00 | 2.00 |
+| 6+ | 2 | gold | 0.274 | 0.000 | 2.00 | 10.00 |
+
+Reproduce:
+
+```bash
+python bench/run_swebench_filetree.py --tier all --strategy block    --ranker none --top-k 10
+python bench/run_swebench_filetree.py --tier all --strategy vertical --ranker none --top-k 10
+```
 
 ### Document mode — single long document
 
diff --git a/bench/README.md b/bench/README.md
@@ -20,7 +20,14 @@ Outputs:
 - `report.md`: markdown report
 - `bench.sqlite`: temporary benchmark database
 
-Ranker options:
+Strategies (`--strategy`):
+
+- `auto`: pick `beam` for ≤50 nodes, `block` otherwise.
+- `beam`: depth-first LLM beam expansion (small trees).
+- `block`: token-bounded block partitioning with cross-block merge.
+- `vertical`: baseline — per-beam-branch subtree blocks, no cross-branch view.
+
+Ranker options (apply to `block` / `vertical`):
 
 - `--ranker none`: preserve traversal and block-local LLM order.
 - `--ranker bm25`: lexical path ordering for cross-block merge candidates.
@@ -29,7 +36,7 @@ Ranker options:
 
 ### Latest Full Run
 
-Claude Sonnet 4.6, `tier=all`, `strategy=block`, `ranker=none`, `top_k=10`.
+Claude Sonnet 4.6, `tier=all`, `ranker=none`, `top_k=10`, 500 queries, 0 failures.
 
 Metric notes:
 
@@ -38,6 +45,22 @@ Metric notes:
 - `exact@gold`: all gold files are recovered within that same cutoff.
 - `found@gold`: average number of gold files recovered within that cutoff.
 
+#### Block (ConDB) vs Vertical (baseline)
+
+Vertical is a per-beam-branch variant: each parent expands its children into
+separate subtree blocks (`A→B`, `A→C`), one LLM call per branch. It removes
+the cross-branch view that Block keeps, so it serves as a direct baseline
+for the merged-pool design used in ConDB.
+
+| variant | recall@gold | exact@gold | MRR | nDCG@10 | avg returned | avg latency |
+|---|---:|---:|---:|---:|---:|---:|
+| Vertical (baseline) | 0.382 | 0.366 | 0.466 | 0.481 | 3.00 | ~24 s |
+| **Block (ConDB)** | **0.711** | **0.672** | **0.805** | **0.813** | 7.20 | ~8 s |
+
+Block gains **+0.33 recall@gold** over Vertical (0.711 vs 0.382) at roughly one-third the latency (~8 s vs ~24 s).
+
+#### Block — per-gold-count breakdown
+
 | gold files | queries | cutoff | recall@gold | exact@gold | found@gold | avg returned |
 |---|---:|---:|---:|---:|---:|---:|
 | 1 | 430 | 1 | 0.749 | 0.749 | 0.75 | 7.00 |
@@ -47,5 +70,13 @@ Metric notes:
 | 5 | 1 | 5 | 0.200 | 0.000 | 1.00 | 2.00 |
 | 6+ | 2 | gold | 0.274 | 0.000 | 2.00 | 10.00 |
 
-Full-set aggregate: `n=500`, `recall@gold=0.711`, `exact@gold=0.672`,
-`MRR=0.805`, `nDCG@10=0.813`, `avg gold=1.24`, `avg returned=7.20`.
+#### Vertical — per-gold-count breakdown
+
+| gold files | queries | cutoff | recall@gold | exact@gold | found@gold | avg returned |
+|---|---:|---:|---:|---:|---:|---:|
+| 1 | 430 | 1 | 0.412 | 0.412 | 0.41 | 3.07 |
+| 2 | 48 | 2 | 0.250 | 0.125 | 0.50 | 2.67 |
+| 3 | 13 | 3 | 0.103 | 0.000 | 0.31 | 2.54 |
+| 4 | 6 | 4 | 0.042 | 0.000 | 0.17 | 2.50 |
+| 5 | 1 | 5 | 0.400 | 0.000 | 2.00 | 5.00 |
+| 6+ | 2 | gold | 0.000 | 0.000 | 0.00 | 0.00 |
diff --git a/bench/run_swebench_filetree.py b/bench/run_swebench_filetree.py
@@ -32,6 +32,7 @@
 from contextdb.retriever.algorithm.beam_retriever import BeamRetriever
 from contextdb.retriever.algorithm.block_retriever import BlockRetriever
 from contextdb.retriever.algorithm.ranker import make_ranker
+from contextdb.retriever.algorithm.vertical_retriever import VerticalRetriever
 
 DEFAULT_MODEL = "claude-sonnet-4-6"
 DEFAULT_DATA_DIR = Path("data/swebench_pathonly")
@@ -43,14 +44,15 @@ def make_filesystem_retriever(db: ConDB, args, node_count: int):
         strategy = "beam" if node_count <= 50 else "block"
     if strategy == "beam":
         return BeamRetriever(db.storage, db._llm, mode="filesystem")
-    if strategy == "block":
+    if strategy in ("block", "vertical"):
         ranker = make_ranker(
             args.ranker,
             embedding_provider=args.embedding_provider,
             embedding_model=args.embedding_model,
             embedding_api_key=args.embedding_api_key,
         )
-        return BlockRetriever(
+        cls = VerticalRetriever if strategy == "vertical" else BlockRetriever
+        return cls(
             db.storage,
             db._llm,
             mode="filesystem",
@@ -482,7 +484,7 @@ def main():
     p.add_argument("--model", default=DEFAULT_MODEL)
     p.add_argument("--provider", default="anthropic")
     p.add_argument("--top-k", type=int, default=10)
-    p.add_argument("--strategy", choices=["auto", "beam", "block"], default="auto")
+    p.add_argument("--strategy", choices=["auto", "beam", "block", "vertical"], default="auto")
     p.add_argument("--ranker", choices=["bm25", "vector", "none"], default="none",
                    help="Optional path ordering for Block merge results")
     p.add_argument("--embedding-provider", default="openai")
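
The gold-cutoff metrics reported in the README tables above (`recall@gold`, `exact@gold`, `found@gold`) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `bench/run_swebench_filetree.py`: the function names `gold_cutoff_metrics` and `aggregate` are hypothetical, and `recall@gold` is assumed to be the number of gold files found within the cutoff divided by the gold-file count, which matches the relationship between the `recall@gold` and `found@gold` columns in the tables.

```python
from typing import Dict, List, Sequence, Set


def gold_cutoff_metrics(ranked: Sequence[str], gold: Set[str]) -> Dict[str, float]:
    """Score one query with the cutoff set to its gold-file count.

    `ranked` is the retriever's ordered list of returned paths (it may be
    shorter than the cutoff); `gold` is the set of expected paths.
    """
    if not gold:
        raise ValueError("gold set must be non-empty")
    cutoff = len(gold)                    # one-gold -> top-1, two-gold -> top-2, ...
    top = set(ranked[:cutoff])            # slicing tolerates short result lists
    found = len(top & gold)               # gold files recovered within the cutoff
    return {
        "found@gold": float(found),
        "recall@gold": found / cutoff,    # assumed definition, see lead-in
        "exact@gold": 1.0 if found == cutoff else 0.0,
    }


def aggregate(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each per-query metric over all queries."""
    if not rows:
        raise ValueError("no rows to aggregate")
    keys = ("recall@gold", "exact@gold", "found@gold")
    return {k: sum(r[k] for r in rows) / len(rows) for k in keys}


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    rows = [
        gold_cutoff_metrics(["a.py", "b.py", "c.py"], {"a.py"}),
        gold_cutoff_metrics(["a.py", "b.py", "c.py"], {"a.py", "c.py"}),
    ]
    print(aggregate(rows))
```

Because the cutoff equals the gold-file count, a query with more gold files than `--top-k` (such as the 21-gold outlier) cannot reach `exact@gold = 1` under this definition.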