Commit 2822a3b

revert readme
1 parent ed2de03 commit 2822a3b

2 files changed

Lines changed: 17 additions & 51 deletions

README.md

Lines changed: 16 additions & 50 deletions
@@ -141,63 +141,29 @@ result = db.query(tree_id, "question", strategy="block", beam_size=3)
 
 ## 📈 Benchmark Snapshot
 
-Two benchmarks live under `bench/`.
+Current filesystem benchmark summary lives in [bench/fs_block_beam_vertical.md](bench/fs_block_beam_vertical.md).
 
-### Filesystem mode — SWEBench-FileTree
+Run setup: `fs_query_order=prefix`, `beam_size=3`, `max_turns=10`, `5` filesystem queries on `context7` only.
 
-Runs on [`AmuroEita/SWEBench-FileTree`](https://huggingface.co/datasets/AmuroEita/SWEBench-FileTree),
-a path-only version of SWE-bench code retrieval:
+### Claude Opus 4.6
 
-- 500 GitHub issues as queries
-- 475 `(repo, commit)` repository snapshots as independent retrieval universes
-- 58,058 file paths; no source code, no file summaries
+| Retriever | Avg Time (s) | Avg LLM Calls | Hit@1 | Hit@10 | Total Cost (USD) |
+|---|---:|---:|---:|---:|---:|
+| **Block** | 8.44 | 2.4 | 1.00 | 1.00 | 0.2166 |
+| **Vertical** | 28.18 | 6.8 | 0.40 | 1.00 | 0.2900 |
+| **Beam** | 18.36 | 4.8 | 0.60 | 1.00 | 0.2091 |
 
-Given an issue and one snapshot's file tree, return the file(s) the fix
-touches. Specification: `notes/condb_swebench_filetree_bench.md`.
+### Claude Sonnet 4.6
 
-```bash
-export ANTHROPIC_API_KEY=sk-ant-...
-python bench/run_swebench_filetree.py --tier medium
-```
-
-Tiers (by retriever difficulty; lower difficulty = more path signal in query):
-
-```
-easy    107 queries  gold path appears in query text (sanity check)
-medium  133 queries  gold filename appears in query (main report)
-hard    261 queries  gold module stem appears (fuzzy matching)
-all     500 queries  no filter, includes ~48% path-signal-less queries
-```
-
-Output goes to `bench/runs/<timestamp>__<tier>/`: `report.md`, `summary.json`,
-`per_query.jsonl`.
+| Retriever | Avg Time (s) | Avg LLM Calls | Hit@1 | Hit@10 | Total Cost (USD) |
+|---|---:|---:|---:|---:|---:|
+| **Block** | 8.42 | 3.4 | 1.00 | 1.00 | 0.0643 |
+| **Vertical** | 20.78 | 7.0 | 0.40 | 0.80 | 0.1712 |
+| **Beam** | 17.84 | 4.8 | 0.40 | 1.00 | 0.1335 |
 
-#### Snapshot (Claude Sonnet 4.6, `--strategy auto`, top-k=10)
-
-| tier | n | hit@1 | hit@3 | hit@5 | hit@10 | MRR | nDCG@10 |
-|--------|-----|-------|-------|-------|--------|-------|---------|
-| easy | 107 | 0.776 | 0.841 | 0.841 | 0.841 | 0.805 | 0.772 |
-| medium | 133 | 0.797 | 0.850 | 0.850 | 0.850 | 0.821 | 0.787 |
-
-Path-only filesystem retrieval lands gold in top-10 for ~85% of queries that
-carry any path-level signal. The `hard` tier (261 queries, module-stem
-signal only) is in progress and will be added once Anthropic API rate-limit
-retries are wired in. `all` (the path-signal-less ceiling) has not been run.
-
-### Document mode — single long document
-
-Compares retriever algorithms (Block / Beam / Vertical / ...) on one
-hierarchical document. Reports time, LLM calls, token usage with prompt
-caching, and USD cost.
-
-```bash
-python bench/run_document_bench.py \
-  --doc examples/large_doc.json \
-  --config bench/queries.json
-```
+`Block` is the best default: perfect Hit@1 across both models, lowest cost on Sonnet 4.6 (prompt caching cuts cost by ~60%), and fastest latency. `Beam` and `Vertical` are sensitive to model version — `Block` is the most robust choice.
 
-Queries live in the config JSON as `{"queries": ["...", "..."]}`. Swap in
-any `--doc` and any `--config` to benchmark a different document.
+These numbers are benchmark snapshots, not hard guarantees; exact cost and latency will vary with model choice, provider pricing, prompt-cache behavior, and corpus shape.
 
 ---
 

bench/run_swebench_filetree.py

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@
 from contextdb.adapter.filesystem import FileSystemAdapter
 from contextdb.api.condb import ConDB
 
-DEFAULT_MODEL = "claude-sonnet-4-6"
+DEFAULT_MODEL = "claude-haiku-4-5"
 DEFAULT_DATA_DIR = Path("data/swebench_pathonly")
 K_VALUES = (1, 3, 5, 10)
 
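
For readers comparing the Hit@1 and Hit@10 columns in the README tables with the `K_VALUES = (1, 3, 5, 10)` constant in `bench/run_swebench_filetree.py`, here is a minimal sketch of how a Hit@k score could be computed. It is not taken from the benchmark script: the function names `hit_at_k` and `mean_hit_at_k`, the sample paths, and the "any gold file in the top k counts as a hit" definition are all assumptions made for illustration. The script's actual scoring rule is not visible in this diff.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

# Mirrors the constant visible in the diff; everything else is hypothetical.
K_VALUES = (1, 3, 5, 10)


def hit_at_k(ranked: Sequence[str], gold: Iterable[str], k: int) -> float:
    """Return 1.0 if any gold path appears among the top-k ranked paths.

    Assumption: a query counts as a hit when at least one gold file is
    retrieved. A stricter "all gold files" rule would need a different check.
    """
    gold_set = set(gold)
    return 1.0 if any(path in gold_set for path in ranked[:k]) else 0.0


def mean_hit_at_k(
    runs: List[Tuple[Sequence[str], Iterable[str]]],
    k_values: Sequence[int] = K_VALUES,
) -> Dict[int, float]:
    """Average Hit@k over all (ranked_paths, gold_paths) pairs."""
    if not runs:
        raise ValueError("need at least one query result")
    return {
        k: sum(hit_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)
        for k in k_values
    }


# Hypothetical results for three queries (not real benchmark data).
runs = [
    (["a.py", "b.py"], {"a.py"}),                  # gold at rank 1
    (["x.py", "a.py"], {"a.py"}),                  # gold at rank 2
    (["x.py", "y.py", "z.py", "a.py"], {"a.py"}),  # gold at rank 4
]

for k, score in mean_hit_at_k(runs).items():
    print(f"hit@{k}: {score:.3f}")
```

Under this definition Hit@k can only stay flat or rise as k grows, which matches the pattern in the tables, where Hit@10 is never below Hit@1. With only 5 queries per run, each query moves the score by 0.20, so small differences between retrievers should be read with care.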
