Commit 6860d4b

update bench result
1 parent 3c77771 commit 6860d4b

3 files changed

Lines changed: 69 additions & 21 deletions

README.md

Lines changed: 29 additions & 14 deletions
@@ -183,23 +183,38 @@ Available rankers are `none`, `bm25`, and `vector`. The vector ranker uses
 LiteLLM embeddings (`--embedding-provider`, `--embedding-model`) and leaves
 the default `ranker=none` unchanged.
 
-#### Latest Run (Claude Sonnet 4.6, `--strategy block --ranker none`, top-k=10)
+#### Latest Run
 
-The cutoff for each query is its gold-file count: one-gold queries use top-1,
-two-gold queries use top-2, and so on. The `6+` row contains one 6-gold query
-and one 21-gold outlier.
+Claude Sonnet 4.6, `--ranker none`, `--top-k 10`, 500 queries, 0 failures.
+Block (ConDB) is compared against a **Vertical baseline** — a per-beam variant
+that expands each parent's children into separate subtree blocks (`A→B`,
+`A→C`), one LLM call per branch, losing the cross-branch view Block keeps.
+
+| variant | recall@gold | exact@gold | MRR | nDCG@10 | avg returned | avg latency |
+|---|---:|---:|---:|---:|---:|---:|
+| Vertical (baseline) | 0.382 | 0.366 | 0.466 | 0.481 | 3.00 | ~24 s |
+| **Block (ConDB)** | **0.711** | **0.672** | **0.805** | **0.813** | 7.20 | ~8 s |
+
+Block gains **+0.33 recall@gold** at ~3× lower latency.
+
+Block per-gold-count breakdown — the cutoff is the query's gold-file count
+(`6+` row contains one 6-gold query and one 21-gold outlier):
 
 | gold files | queries | cutoff | recall@gold | exact@gold | found@gold | avg returned |
-|------------|--------:|-------:|------------:|-----------:|-----------:|-------------:|
-| 1 | 430 | 1 | 0.749 | 0.749 | 0.75 | 7.00 |
-| 2 | 48 | 2 | 0.521 | 0.271 | 1.04 | 8.31 |
-| 3 | 13 | 3 | 0.410 | 0.077 | 1.23 | 8.77 |
-| 4 | 6 | 4 | 0.417 | 0.000 | 1.67 | 9.17 |
-| 5 | 1 | 5 | 0.200 | 0.000 | 1.00 | 2.00 |
-| 6+ | 2 | gold | 0.274 | 0.000 | 2.00 | 10.00 |
-
-The full 500-query run aggregates to `recall@gold=0.711`, `exact@gold=0.672`,
-`MRR=0.805`, `nDCG@10=0.813`, `avg gold=1.24`, and `avg returned=7.20`.
+|---|---:|---:|---:|---:|---:|---:|
+| 1 | 430 | 1 | 0.749 | 0.749 | 0.75 | 7.00 |
+| 2 | 48 | 2 | 0.521 | 0.271 | 1.04 | 8.31 |
+| 3 | 13 | 3 | 0.410 | 0.077 | 1.23 | 8.77 |
+| 4 | 6 | 4 | 0.417 | 0.000 | 1.67 | 9.17 |
+| 5 | 1 | 5 | 0.200 | 0.000 | 1.00 | 2.00 |
+| 6+ | 2 | gold | 0.274 | 0.000 | 2.00 | 10.00 |
+
+Reproduce:
+
+```bash
+python bench/run_swebench_filetree.py --tier all --strategy block --ranker none --top-k 10
+python bench/run_swebench_filetree.py --tier all --strategy vertical --ranker none --top-k 10
+```
 
 ### Document mode — single long document
 
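Reviewer note: the README tables score each query with a cutoff equal to its gold-file count. A minimal sketch of that scoring for a single query follows. The helper name and signature are hypothetical and not taken from `bench/run_swebench_filetree.py`, and treating `recall@gold` as found files divided by gold count is an assumption inferred from the table rows (for example, 1.04 / 2 ≈ 0.52 in the two-gold row).

```python
def gold_cutoff_metrics(ranked_paths, gold_paths):
    """Score one query with the cutoff set to its gold-file count.

    Hypothetical helper for illustration; not the benchmark's own code.
    """
    gold = set(gold_paths)
    if not gold:
        raise ValueError("gold_paths must not be empty")
    cutoff = len(gold)
    # Only the first `cutoff` returned paths count toward the metrics.
    found = len(gold & set(ranked_paths[:cutoff]))
    return {
        "found@gold": found,
        "recall@gold": found / cutoff,
        "exact@gold": 1.0 if found == cutoff else 0.0,
    }


# One gold file: cutoff is top-1, and the gold file is ranked first.
one_gold = gold_cutoff_metrics(["a.py", "b.py"], ["a.py"])
# Two gold files: cutoff is top-2, and only one gold file is inside it.
two_gold = gold_cutoff_metrics(["a.py", "c.py", "b.py"], ["a.py", "b.py"])
```

Under this reading, `recall@gold` and `exact@gold` coincide for one-gold queries, which is consistent with the identical 0.749 values in the first row of the Block table.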
bench/README.md

Lines changed: 35 additions & 4 deletions
@@ -20,7 +20,14 @@ Outputs:
 - `report.md`: markdown report
 - `bench.sqlite`: temporary benchmark database
 
-Ranker options:
+Strategies (`--strategy`):
+
+- `auto`: pick `beam` for ≤50 nodes, `block` otherwise.
+- `beam`: depth-first LLM beam expansion (small trees).
+- `block`: token-bounded block partitioning with cross-block merge.
+- `vertical`: baseline — per-beam-branch subtree blocks, no cross-branch view.
+
+Ranker options (apply to `block` / `vertical`):
 
 - `--ranker none`: preserve traversal and block-local LLM order.
 - `--ranker bm25`: lexical path ordering for cross-block merge candidates.
@@ -29,7 +36,7 @@ Ranker options:
 
 ### Latest Full Run
 
-Claude Sonnet 4.6, `tier=all`, `strategy=block`, `ranker=none`, `top_k=10`.
+Claude Sonnet 4.6, `tier=all`, `ranker=none`, `top_k=10`, 500 queries, 0 failures.
 
 Metric notes:
 
@@ -38,6 +45,22 @@ Metric notes:
 - `exact@gold`: all gold files are recovered within that same cutoff.
 - `found@gold`: average number of gold files recovered within that cutoff.
 
+#### Block (ConDB) vs Vertical (baseline)
+
+Vertical is a per-beam-branch variant: each parent expands its children into
+separate subtree blocks (`A→B`, `A→C`), one LLM call per branch. It removes
+the cross-branch view that Block keeps, so it serves as a direct baseline
+for the merged-pool design used in ConDB.
+
+| variant | recall@gold | exact@gold | MRR | nDCG@10 | avg returned | avg latency |
+|---|---:|---:|---:|---:|---:|---:|
+| Vertical (baseline) | 0.382 | 0.366 | 0.466 | 0.481 | 3.00 | ~24 s |
+| **Block (ConDB)** | **0.711** | **0.672** | **0.805** | **0.813** | 7.20 | ~8 s |
+
+Block is **+0.33 recall@gold** at ~3× lower latency.
+
+#### Block — per-gold-count breakdown
+
 | gold files | queries | cutoff | recall@gold | exact@gold | found@gold | avg returned |
 |---|---:|---:|---:|---:|---:|---:|
 | 1 | 430 | 1 | 0.749 | 0.749 | 0.75 | 7.00 |
@@ -47,5 +70,13 @@ Metric notes:
 | 5 | 1 | 5 | 0.200 | 0.000 | 1.00 | 2.00 |
 | 6+ | 2 | gold | 0.274 | 0.000 | 2.00 | 10.00 |
 
-Full-set aggregate: `n=500`, `recall@gold=0.711`, `exact@gold=0.672`,
-`MRR=0.805`, `nDCG@10=0.813`, `avg gold=1.24`, `avg returned=7.20`.
+#### Vertical — per-gold-count breakdown
+
+| gold files | queries | cutoff | recall@gold | exact@gold | found@gold | avg returned |
+|---|---:|---:|---:|---:|---:|---:|
+| 1 | 430 | 1 | 0.412 | 0.412 | 0.41 | 3.07 |
+| 2 | 48 | 2 | 0.250 | 0.125 | 0.50 | 2.67 |
+| 3 | 13 | 3 | 0.103 | 0.000 | 0.31 | 2.54 |
+| 4 | 6 | 4 | 0.042 | 0.000 | 0.17 | 2.50 |
+| 5 | 1 | 5 | 0.400 | 0.000 | 2.00 | 5.00 |
+| 6+ | 2 | gold | 0.000 | 0.000 | 0.00 | 0.00 |
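
Reviewer note: this commit drops the explicit full-set aggregate sentence, so it is worth checking that the per-gold-count rows still agree with the headline comparison table. The sketch below assumes the headline `recall@gold` and `exact@gold` are query-weighted means of the per-row values. That is an assumption about how the benchmark aggregates, not something the diff states. The row tuples are copied from the two breakdown tables.

```python
# Rows copied from the per-gold-count tables: (queries, recall@gold, exact@gold).
BLOCK = [(430, 0.749, 0.749), (48, 0.521, 0.271), (13, 0.410, 0.077),
         (6, 0.417, 0.000), (1, 0.200, 0.000), (2, 0.274, 0.000)]
VERTICAL = [(430, 0.412, 0.412), (48, 0.250, 0.125), (13, 0.103, 0.000),
            (6, 0.042, 0.000), (1, 0.400, 0.000), (2, 0.000, 0.000)]


def aggregate(rows):
    """Query-weighted mean of recall@gold and exact@gold over all rows."""
    total = sum(n for n, _, _ in rows)
    recall = sum(n * r for n, r, _ in rows) / total
    exact = sum(n * e for n, _, e in rows) / total
    return total, round(recall, 3), round(exact, 3)


block = aggregate(BLOCK)        # (queries, recall@gold, exact@gold)
vertical = aggregate(VERTICAL)
recall_gain = round(block[1] - vertical[1], 2)
```

With the rows as published, the weighted means reproduce the headline values for both variants (0.711 / 0.672 for Block, 0.382 / 0.366 for Vertical), and the difference rounds to the quoted +0.33 recall@gold.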

bench/run_swebench_filetree.py

Lines changed: 5 additions & 3 deletions
@@ -32,6 +32,7 @@
 from contextdb.retriever.algorithm.beam_retriever import BeamRetriever
 from contextdb.retriever.algorithm.block_retriever import BlockRetriever
 from contextdb.retriever.algorithm.ranker import make_ranker
+from contextdb.retriever.algorithm.vertical_retriever import VerticalRetriever
 
 DEFAULT_MODEL = "claude-sonnet-4-6"
 DEFAULT_DATA_DIR = Path("data/swebench_pathonly")
@@ -43,14 +44,15 @@ def make_filesystem_retriever(db: ConDB, args, node_count: int):
         strategy = "beam" if node_count <= 50 else "block"
     if strategy == "beam":
         return BeamRetriever(db.storage, db._llm, mode="filesystem")
-    if strategy == "block":
+    if strategy in ("block", "vertical"):
         ranker = make_ranker(
             args.ranker,
             embedding_provider=args.embedding_provider,
             embedding_model=args.embedding_model,
             embedding_api_key=args.embedding_api_key,
         )
-        return BlockRetriever(
+        cls = VerticalRetriever if strategy == "vertical" else BlockRetriever
+        return cls(
             db.storage,
             db._llm,
             mode="filesystem",
@@ -482,7 +484,7 @@ def main():
     p.add_argument("--model", default=DEFAULT_MODEL)
     p.add_argument("--provider", default="anthropic")
     p.add_argument("--top-k", type=int, default=10)
-    p.add_argument("--strategy", choices=["auto", "beam", "block"], default="auto")
+    p.add_argument("--strategy", choices=["auto", "beam", "block", "vertical"], default="auto")
     p.add_argument("--ranker", choices=["bm25", "vector", "none"], default="none",
                    help="Optional path ordering for Block merge results")
     p.add_argument("--embedding-provider", default="openai")
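
Reviewer note: the dispatch in `make_filesystem_retriever` can be summarized by the standalone sketch below. It only mirrors the selection rules visible in the diff (the 50-node threshold for `auto`, and the shared ranker path for `block` and `vertical`). The helper names `resolve_strategy` and `uses_ranker` are hypothetical and do not exist in the repository. It is also an assumption that the threshold line sits under an `if strategy == "auto":` guard, since that line is outside the hunk.

```python
AUTO_BEAM_MAX_NODES = 50  # threshold taken from the `auto` rule shown in the diff


def resolve_strategy(strategy: str, node_count: int) -> str:
    """Map a `--strategy` value to a concrete strategy name (hypothetical helper)."""
    if strategy == "auto":
        # Small trees go to beam search, everything else to block partitioning.
        return "beam" if node_count <= AUTO_BEAM_MAX_NODES else "block"
    if strategy not in ("beam", "block", "vertical"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    return strategy


def uses_ranker(strategy: str) -> bool:
    """Per the diff, only block and vertical retrievers are built with a ranker."""
    return strategy in ("block", "vertical")


picked_small = resolve_strategy("auto", 50)
picked_large = resolve_strategy("auto", 51)
```

As far as the diff shows, `auto` never resolves to `vertical`, so the baseline only runs when `--strategy vertical` is passed explicitly, as in the Reproduce commands.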
