Commit 979b494

new bench
1 parent 4fbad56 commit 979b494

18 files changed

Lines changed: 894 additions & 2042 deletions

.gitignore

Lines changed: 18 additions & 0 deletions
@@ -69,3 +69,21 @@ venv.bak/
 Thumbs.db
 bench/repos
 bench/results
+
+# Bench output
+bench/runs/
+
+# Dataset local copy (fetched from HF)
+data/
+
+# Local-only working files (scripts, notes, issue log)
+scripts/
+notes/
+issues/
+
+# Miscellaneous local files not part of the repo
+Discussion_*.docx
+PageIndex-report-preview.pdf
+claude-code-windows-setup.md
+examples/TFP_FAA_AIP_2025.pdf
+examples/large_doc.json

README.md

Lines changed: 39 additions & 17 deletions
@@ -121,31 +121,53 @@ result = db.query(tree_id, "question", strategy="block", beam_size=3)
 
 ---
 
-## Benchmark Snapshot
+## Benchmark
 
-Current filesystem benchmark summary lives in [bench/fs_block_beam_vertical.md](bench/fs_block_beam_vertical.md).
+Two benchmarks live under `bench/`.
 
-Run setup: `fs_query_order=prefix`, `beam_size=3`, `max_turns=10`, `5` filesystem queries on `context7` only.
+### Filesystem mode — SWEBench-FileTree
 
-### Claude Opus 4.6
+Runs on [`AmuroEita/SWEBench-FileTree`](https://huggingface.co/datasets/AmuroEita/SWEBench-FileTree),
+a path-only version of SWE-bench code retrieval:
 
-| Retriever | Avg Time (s) | Avg LLM Calls | Hit@1 | Hit@10 | Total Cost (USD) |
-|---|---:|---:|---:|---:|---:|
-| **Block** | 8.44 | 2.4 | 1.00 | 1.00 | 0.2166 |
-| **Vertical** | 28.18 | 6.8 | 0.40 | 1.00 | 0.2900 |
-| **Beam** | 18.36 | 4.8 | 0.60 | 1.00 | 0.2091 |
+- 500 GitHub issues as queries
+- 475 `(repo, commit)` repository snapshots as independent retrieval universes
+- 58,058 file paths; no source code, no file summaries
 
-### Claude Sonnet 4.6
+Given an issue and one snapshot's file tree, return the file(s) the fix
+touches. Specification: `notes/condb_swebench_filetree_bench.md`.
 
-| Retriever | Avg Time (s) | Avg LLM Calls | Hit@1 | Hit@10 | Total Cost (USD) |
-|---|---:|---:|---:|---:|---:|
-| **Block** | 8.42 | 3.4 | 1.00 | 1.00 | 0.0643 |
-| **Vertical** | 20.78 | 7.0 | 0.40 | 0.80 | 0.1712 |
-| **Beam** | 17.84 | 4.8 | 0.40 | 1.00 | 0.1335 |
+```bash
+export ANTHROPIC_API_KEY=sk-ant-...
+python bench/run_swebench_filetree.py --tier medium
+```
+
+Tiers:
+
+```
+strict   107 queries   sanity check (gold path appears in query text)
+medium   133 queries   main report
+loose    261 queries   fuzzy matching
+full     500 queries   includes ~48% path-signal-less queries
+```
 
-`Block` is the best default: perfect Hit@1 across both models, lowest cost on Sonnet 4.6 (prompt caching cuts cost by ~60%), and fastest latency. `Beam` and `Vertical` are sensitive to model version — `Block` is the most robust choice.
+Output goes to `bench/runs/<timestamp>__<tier>/`: `report.md`, `summary.json`,
+`per_query.jsonl`.
+
+### Document mode — single long document
+
+Compares retriever algorithms (Block / Beam / Vertical / ...) on one
+hierarchical document. Reports time, LLM calls, token usage with prompt
+caching, and USD cost.
+
+```bash
+python bench/run_document_bench.py \
+    --doc examples/large_doc.json \
+    --config bench/queries.json
+```
 
-These numbers are benchmark snapshots, not hard guarantees; exact cost and latency will vary with model choice, provider pricing, prompt-cache behavior, and corpus shape.
+Queries live in the config JSON as `{"queries": ["...", "..."]}`. Swap in
+any `--doc` and any `--config` to benchmark a different document.
 
 ---
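
Reviewer note on the filesystem benchmark: the removed tables report Hit@1 and Hit@10, and the `strict` tier is described as "gold path appears in query text". The sketch below shows one plausible way to compute both for a path-retrieval task. It assumes the common definition of Hit@k (a query scores 1 if any gold path is among the top k returned paths) and a verbatim substring test for the tier check. The function names and sample data are hypothetical and are not taken from `bench/run_swebench_filetree.py`.

```python
from typing import Iterable, List, Sequence, Tuple


def hit_at_k(ranked_paths: Sequence[str], gold_paths: Iterable[str], k: int) -> float:
    """Return 1.0 if any gold path is among the first k ranked paths, else 0.0."""
    gold = set(gold_paths)
    return 1.0 if any(path in gold for path in ranked_paths[:k]) else 0.0


def mean_hit_at_k(runs: Sequence[Tuple[List[str], List[str]]], k: int) -> float:
    """Average Hit@k over (ranked_paths, gold_paths) pairs; 0.0 for no runs."""
    scores = [hit_at_k(ranked, gold, k) for ranked, gold in runs]
    return sum(scores) / len(scores) if scores else 0.0


def gold_path_in_query(query: str, gold_paths: Iterable[str]) -> bool:
    """Assumed reading of the `strict` tier: a gold path occurs verbatim in the query."""
    return any(path in query for path in gold_paths)


# Made-up data for illustration only: (paths returned by a retriever, gold paths).
runs = [
    (["src/io/reader.py", "src/io/writer.py"], ["src/io/reader.py"]),
    (["docs/index.md", "pkg/core/cache.py"], ["pkg/core/cache.py"]),
]

hit1 = mean_hit_at_k(runs, 1)    # 0.5: only the first query has a gold path at rank 1
hit10 = mean_hit_at_k(runs, 10)  # 1.0: both queries have a gold path in the top 10
is_strict = gold_path_in_query("Crash in src/io/reader.py on empty input", ["src/io/reader.py"])
```

Because each `(repo, commit)` snapshot is an independent retrieval universe, a scorer along these lines would compare paths only within the snapshot a query belongs to.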

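
Reviewer note on document mode: the new README gives the config shape as `{"queries": ["...", "..."]}`. A minimal sketch of producing such a file and assembling the documented command could look like the following. The helper names and the sample queries are hypothetical; only the `--doc` and `--config` flags and the script path come from the diff, and the script may validate or require more than this sketch assumes.

```python
import json
import shlex
import tempfile
from pathlib import Path
from typing import List, Sequence


def write_queries_config(path: Path, queries: Sequence[str]) -> Path:
    """Write a config file in the {"queries": [...]} shape described in the README."""
    if not queries or not all(isinstance(q, str) and q.strip() for q in queries):
        raise ValueError("queries must be a non-empty sequence of non-empty strings")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"queries": list(queries)}, indent=2), encoding="utf-8")
    return path


def bench_command(doc: str, config: Path) -> List[str]:
    """Build the README command line; only --doc and --config are used."""
    return ["python", "bench/run_document_bench.py", "--doc", doc, "--config", str(config)]


# Sample queries are placeholders, not part of the repository.
sample_queries = ["What are the reporting deadlines?", "Who approves exemptions?"]

# A temporary directory keeps the example from overwriting bench/queries.json.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_queries_config(Path(tmp) / "queries.json", sample_queries)
    loaded = json.loads(config_path.read_text(encoding="utf-8"))
    command = bench_command("examples/large_doc.json", config_path)

print(shlex.join(command))
```

Note that this commit also adds `examples/large_doc.json` to `.gitignore`, so a fresh clone would need its own document passed via `--doc`.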