```bash
python bench/run_swebench_filetree.py --tier medium
```

Tiers (by retriever difficulty; lower difficulty = more path signal in query):

```
easy     107 queries   gold path appears in query text (sanity check)
medium   133 queries   gold filename appears in query (main report)
hard     261 queries   gold module stem appears (fuzzy matching)
all      500 queries   no filter, includes ~48% path-signal-less queries
```
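
The tier definitions above can be read as nested filters on how much of the gold path leaks into the query. The sketch below is an illustration only, not the benchmark's own code: the function name `matching_tiers` is hypothetical, the nesting is inferred from the increasing query counts, and plain case-insensitive substring checks stand in for whatever fuzzy matching the `hard` tier really uses.

```python
from pathlib import PurePosixPath


def matching_tiers(query: str, gold_path: str) -> list[str]:
    """Return the tiers a query would fall into, loosest first.

    Assumption: tiers are nested (easy is a subset of medium, which is a
    subset of hard) and matching is a case-insensitive substring test.
    """
    q = query.lower()
    path = PurePosixPath(gold_path.lower())

    tiers = ["all"]                 # every query belongs to the unfiltered tier
    if path.stem in q:              # module stem, e.g. "query"
        tiers.append("hard")
    if path.name in q:              # filename, e.g. "query.py"
        tiers.append("medium")
    if str(path) in q:              # full gold path
        tiers.append("easy")
    return tiers


if __name__ == "__main__":
    gold = "django/db/models/query.py"
    for text in (
        "Crash in django/db/models/query.py when filtering",
        "Bug in query.py ordering",
        "QuerySet ordering is wrong",
        "Ordering is wrong",
    ):
        print(matching_tiers(text, gold), "<-", text)
```

A query with no trace of the gold path lands only in `all`, which is the group the table describes as path-signal-less.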

Each run writes `report.md`, `summary.json`, and `per_query.jsonl` to
`bench/runs/<timestamp>__<tier>/`.
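
For post-processing, the two machine-readable files can be loaded with the standard library. This is a minimal sketch that assumes only what the file names imply: `summary.json` is a single JSON document and `per_query.jsonl` holds one JSON object per line. The helper name `load_run` is hypothetical, and no particular keys are assumed.

```python
import json
from pathlib import Path


def load_run(run_dir):
    """Load summary.json and per_query.jsonl from one run directory.

    Assumption: per_query.jsonl is JSON Lines (one JSON value per line);
    blank lines are skipped. The schema of both files is left untouched.
    """
    run_dir = Path(run_dir)
    summary = json.loads((run_dir / "summary.json").read_text(encoding="utf-8"))
    with (run_dir / "per_query.jsonl").open(encoding="utf-8") as fh:
        per_query = [json.loads(line) for line in fh if line.strip()]
    return summary, per_query
```

Calling `load_run` with the path of a run directory under `bench/runs/` returns the summary object and a list with one entry per query.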

Block mode can optionally rerank only the cross-block merge candidates before
the file/directory split:

```bash
python bench/run_swebench_filetree.py --tier medium --strategy block --ranker vector
```
181
+
182
+
Available rankers are `none`, `bm25`, and `vector`. The vector ranker uses
183
+
LiteLLM embeddings (`--embedding-provider`, `--embedding-model`) and leaves
184
+
the default `ranker=none` unchanged.
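
To make the reranking step concrete, here is a small self-contained BM25 reranker over a list of candidate strings. It is a generic textbook sketch and not the code behind `--ranker bm25`: the function names, the regex tokenizer, and the `k1`/`b` defaults are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rerank(query, candidates, k1=1.5, b=0.75):
    """Return candidates sorted by descending BM25 score against the query.

    Ties keep their original order because Python's sort is stable.
    """
    docs = [tokenize(c) for c in candidates]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0

    # Document frequency: in how many candidates each term occurs.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scored = []
    for cand, d in zip(candidates, docs):
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, cand))

    scored.sort(key=lambda pair: -pair[0])
    return [cand for _, cand in scored]


if __name__ == "__main__":
    merged = [
        "src/utils/dates.py date parsing helpers",
        "src/auth/login.py session handling",
        "docs/index.md overview",
    ]
    print(bm25_rerank("fix session login bug", merged))
```

Only the candidate list is reordered; nothing is added or dropped, which matches the idea of reranking merge candidates before a later split.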

#### Latest Run

-
`Block` is the best default: perfect Hit@1 across both models, lowest cost on Sonnet 4.6 (prompt caching cuts cost by ~60%), and fastest latency. `Beam` and `Vertical` are sensitive to model version — `Block` is the most robust choice.
```bash
python bench/run_swebench_filetree.py --tier all --strategy block --ranker none --top-k 10
python bench/run_swebench_filetree.py --tier all --strategy vertical --ranker none --top-k 10
```
### Document mode — single long document

Compares retriever algorithms (Block / Beam / Vertical / ...) on one
hierarchical document. Reports time, LLM calls, token usage with prompt
caching, and USD cost.

```bash
python bench/run_document_bench.py \
  --doc examples/large_doc.json \
  --config bench/queries.json
```
-
These numbers are benchmark snapshots, not hard guarantees; exact cost and latency will vary with model choice, provider pricing, prompt-cache behavior, and corpus shape.

Queries live in the config JSON as `{"queries": ["...", "..."]}`. Swap in
any `--doc` and any `--config` to benchmark a different document.
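
A config in that shape can be generated rather than hand-written. The sketch below assumes the file needs nothing beyond the `queries` key shown above; the helper name `write_query_config` and the example path are hypothetical.

```python
import json
from pathlib import Path


def write_query_config(path, queries):
    """Write {"queries": [...]} to path and return the payload.

    Assumption: the benchmark config needs only the "queries" key.
    """
    payload = {"queries": list(queries)}
    Path(path).write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
    return payload


if __name__ == "__main__":
    # Hypothetical output file and placeholder queries.
    write_query_config(
        "my_queries.json",
        ["How is prompt caching configured?", "Where are retries handled?"],
    )
```

The resulting file can then be passed to `bench/run_document_bench.py` through `--config`.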