Commit 4255304

Merge pull request #12 from VectifyAI/v0.4
V0.4
2 parents 4fc1161 + e5696a3 commit 4255304

60 files changed

Lines changed: 4619 additions & 2817 deletions

MANIFEST.in

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
include README.md
2+
include LICENSE
3+
include requirements.txt
4+
recursive-include contextdb *.py *.jinja *.yaml py.typed
5+
global-exclude __pycache__
6+
global-exclude *.py[cod]
7+
global-exclude .DS_Store
8+
prune tests
9+
prune bench
10+
prune data
11+
prune examples
12+
prune notes
13+
prune scripts

README.md

Lines changed: 82 additions & 16 deletions
@@ -141,29 +141,95 @@ result = db.query(tree_id, "question", strategy="block", beam_size=3)
141141

142142
## 📈 Benchmark Snapshot
143143

144-
Current filesystem benchmark summary lives in [bench/fs_block_beam_vertical.md](bench/fs_block_beam_vertical.md).
144+
Two benchmarks live under `bench/`.
145145

146-
Run setup: `fs_query_order=prefix`, `beam_size=3`, `max_turns=10`, `5` filesystem queries on `context7` only.
146+
### Filesystem mode — SWEBench-FileTree
147147

148-
### Claude Opus 4.6
148+
Runs on [`AmuroEita/SWEBench-FileTree`](https://huggingface.co/datasets/AmuroEita/SWEBench-FileTree),
149+
a path-only version of SWE-bench code retrieval:
149150

150-
| Retriever | Avg Time (s) | Avg LLM Calls | Hit@1 | Hit@10 | Total Cost (USD) |
151-
|---|---:|---:|---:|---:|---:|
152-
| **Block** | 8.44 | 2.4 | 1.00 | 1.00 | 0.2166 |
153-
| **Vertical** | 28.18 | 6.8 | 0.40 | 1.00 | 0.2900 |
154-
| **Beam** | 18.36 | 4.8 | 0.60 | 1.00 | 0.2091 |
151+
- 500 GitHub issues as queries
152+
- 475 `(repo, commit)` repository snapshots as independent retrieval universes
153+
- 58,058 file paths; no source code, no file summaries
155154

156-
### Claude Sonnet 4.6
155+
Given an issue and one snapshot's file tree, return the file(s) the fix
156+
touches. Specification: `notes/condb_swebench_filetree_bench.md`.
157157

158-
| Retriever | Avg Time (s) | Avg LLM Calls | Hit@1 | Hit@10 | Total Cost (USD) |
159-
|---|---:|---:|---:|---:|---:|
160-
| **Block** | 8.42 | 3.4 | 1.00 | 1.00 | 0.0643 |
161-
| **Vertical** | 20.78 | 7.0 | 0.40 | 0.80 | 0.1712 |
162-
| **Beam** | 17.84 | 4.8 | 0.40 | 1.00 | 0.1335 |
158+
```bash
159+
export ANTHROPIC_API_KEY=sk-ant-...
160+
python bench/run_swebench_filetree.py --tier medium
161+
```
162+
163+
Tiers, ordered by retrieval difficulty (an easier tier means more path signal in the query):
164+
165+
```
166+
easy    107 queries   gold path appears in query text (sanity check)
167+
medium  133 queries   gold filename appears in query (main report)
168+
hard    261 queries   gold module stem appears (fuzzy matching)
169+
all     500 queries   no filter, includes ~48% path-signal-less queries
170+
```
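
The tier descriptions above can be read as a path-signal check on each query. The following is a minimal sketch under that reading, not the filter used by `bench/run_swebench_filetree.py`: the function name `path_signal_tier`, the case-insensitive substring matching, and the assumption that each query is labeled with the strictest signal it contains are all hypothetical.

```python
from pathlib import PurePosixPath


def path_signal_tier(query: str, gold_path: str) -> str:
    """Return the strongest path signal (hypothetical labels) found in the query."""
    text = query.lower()
    path = PurePosixPath(gold_path.lower())
    if str(path) in text:
        return "easy"    # the full gold path is quoted in the query
    if path.name in text:
        return "medium"  # only the filename (with extension) appears
    if path.stem in text:
        return "hard"    # only the module stem appears
    return "none"        # no path signal; such queries are only part of --tier all


gold = "django/db/models/query.py"
print(path_signal_tier("Crash in django/db/models/query.py on filter()", gold))
print(path_signal_tier("query.py raises on empty Q objects", gold))
print(path_signal_tier("The query module drops annotations", gold))
print(path_signal_tier("Admin page renders slowly", gold))
```

Plain substring matching is deliberately loose: a short stem such as `query` will match many unrelated issues, which may be part of why the `hard` tier is described as fuzzy matching.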
171+
172+
Output goes to `bench/runs/<timestamp>__<tier>/`: `report.md`, `summary.json`,
173+
`per_query.jsonl`.
174+
175+
Block mode can optionally rerank only the cross-block merge candidates before
176+
the file/directory split:
177+
178+
```bash
179+
python bench/run_swebench_filetree.py --tier medium --strategy block --ranker vector
180+
```
181+
182+
Available rankers are `none`, `bm25`, and `vector`. The vector ranker uses
183+
LiteLLM embeddings (`--embedding-provider`, `--embedding-model`) and leaves
184+
the default `ranker=none` unchanged.
185+
186+
#### Latest Run
163187

164-
`Block` is the best default: perfect Hit@1 across both models, lowest cost on Sonnet 4.6 (prompt caching cuts cost by ~60%), and fastest latency. `Beam` and `Vertical` are sensitive to model version — `Block` is the most robust choice.
188+
Claude Sonnet 4.6, `--ranker none`, `--top-k 10`, 500 queries, 0 failures.
189+
Block (ConDB) is compared against a **Vertical baseline** — a per-beam variant
190+
that expands each parent's children into separate subtree blocks (`A→B`,
191+
`A→C`), one LLM call per branch, losing the cross-branch view Block keeps.
192+
193+
| variant | recall@gold | exact@gold | MRR | nDCG@10 | avg returned | avg latency |
194+
|---|---:|---:|---:|---:|---:|---:|
195+
| Vertical (baseline) | 0.382 | 0.366 | 0.466 | 0.481 | 3.00 | ~24 s |
196+
| **Block (ConDB)** | **0.711** | **0.672** | **0.805** | **0.813** | 7.20 | ~8 s |
197+
198+
Block gains **+0.33 recall@gold** (0.711 vs 0.382) at ~3× lower latency (~8 s vs ~24 s).
199+
200+
Block per-gold-count breakdown — the cutoff is the query's gold-file count
201+
(`6+` row contains one 6-gold query and one 21-gold outlier):
202+
203+
| gold files | queries | cutoff | recall@gold | exact@gold | found@gold | avg returned |
204+
|---|---:|---:|---:|---:|---:|---:|
205+
| 1 | 430 | 1 | 0.749 | 0.749 | 0.75 | 7.00 |
206+
| 2 | 48 | 2 | 0.521 | 0.271 | 1.04 | 8.31 |
207+
| 3 | 13 | 3 | 0.410 | 0.077 | 1.23 | 8.77 |
208+
| 4 | 6 | 4 | 0.417 | 0.000 | 1.67 | 9.17 |
209+
| 5 | 1 | 5 | 0.200 | 0.000 | 1.00 | 2.00 |
210+
| 6+ | 2 | gold | 0.274 | 0.000 | 2.00 | 10.00 |
211+
212+
Reproduce:
213+
214+
```bash
215+
python bench/run_swebench_filetree.py --tier all --strategy block --ranker none --top-k 10
216+
python bench/run_swebench_filetree.py --tier all --strategy vertical --ranker none --top-k 10
217+
```
218+
219+
### Document mode — single long document
220+
221+
Compares retriever algorithms (Block / Beam / Vertical / ...) on one
222+
hierarchical document. Reports time, LLM calls, token usage with prompt
223+
caching, and USD cost.
224+
225+
```bash
226+
python bench/run_document_bench.py \
227+
--doc examples/large_doc.json \
228+
--config bench/queries.json
229+
```
165230

166-
These numbers are benchmark snapshots, not hard guarantees; exact cost and latency will vary with model choice, provider pricing, prompt-cache behavior, and corpus shape.
231+
Queries live in the config JSON as `{"queries": ["...", "..."]}`. Swap in
232+
any `--doc` and any `--config` to benchmark a different document.
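
As an illustration of the `{"queries": [...]}` shape described above, here is a small sketch that writes and validates such a config file. The helper names `write_query_config` and `load_queries` are hypothetical and are not part of the benchmark script; only the top-level `queries` key comes from the README text.

```python
import json
from pathlib import Path


def write_query_config(path, queries):
    """Write a config file in the {"queries": [...]} shape."""
    Path(path).write_text(json.dumps({"queries": list(queries)}, indent=2), encoding="utf-8")


def load_queries(path):
    """Load the query list and fail early if the shape is not a list of strings."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    queries = data.get("queries") if isinstance(data, dict) else None
    if not isinstance(queries, list) or not all(isinstance(q, str) for q in queries):
        raise ValueError('expected {"queries": ["...", "..."]}')
    return queries
```

A file produced this way could then be passed as `--config` in the command shown above.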
167233

168234
---
169235

bench/README.md

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
1+
# Benchmarks
2+
3+
## SWEBench-FileTree
4+
5+
Path-only code retrieval over `AmuroEita/SWEBench-FileTree`.
6+
7+
```bash
8+
PYTHONPATH=. python bench/run_swebench_filetree.py \
9+
--tier all \
10+
--strategy block \
11+
--ranker none \
12+
--top-k 10 \
13+
--output-dir bench/runs/block_none_all
14+
```
15+
16+
Outputs:
17+
18+
- `summary.json`: aggregate metrics
19+
- `per_query.jsonl`: per-query predictions and metrics
20+
- `report.md`: markdown report
21+
- `bench.sqlite`: temporary benchmark database
22+
23+
Strategies (`--strategy`):
24+
25+
- `auto`: pick `beam` for ≤50 nodes, `block` otherwise.
26+
- `beam`: depth-first LLM beam expansion (small trees).
27+
- `block`: token-bounded block partitioning with cross-block merge.
28+
- `vertical`: baseline — per-beam-branch subtree blocks, no cross-branch view.
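
The `auto` rule in the list above amounts to a single threshold on tree size. A minimal sketch, assuming the node count is known up front; the function name `pick_strategy` and the `small_tree_limit` parameter are illustrative and not taken from the codebase:

```python
def pick_strategy(node_count: int, requested: str = "auto", small_tree_limit: int = 50) -> str:
    """Resolve `auto` to a concrete strategy; pass explicit choices through unchanged."""
    if requested != "auto":
        return requested
    # Per the list above: beam for trees of at most 50 nodes, block otherwise.
    return "beam" if node_count <= small_tree_limit else "block"


print(pick_strategy(50))              # beam
print(pick_strategy(51))              # block
print(pick_strategy(10, "vertical"))  # vertical
```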
29+
30+
Ranker options (apply to `block` / `vertical`):
31+
32+
- `--ranker none`: preserve traversal and block-local LLM order.
33+
- `--ranker bm25`: lexical path ordering for cross-block merge candidates.
34+
- `--ranker vector`: embedding path ordering for cross-block merge candidates;
35+
configure with `--embedding-provider` and `--embedding-model`.
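
To make "lexical path ordering" concrete, the sketch below ranks candidate paths against a query with a textbook Okapi BM25 score over path tokens. It is a generic illustration, not the implementation behind `--ranker bm25`: the tokenizer, the `k1` and `b` defaults, and the function name `bm25_rank` are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, paths, k1=1.5, b=0.75):
    """Return `paths` ordered by descending BM25 score; ties keep input order."""
    docs = [tokenize(p) for p in paths]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = (sum(len(d) for d in docs) / n_docs) or 1.0
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))

    scored = []
    for path, doc in zip(paths, docs):
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / (tf + length_norm)
        scored.append((score, path))
    # sorted() is stable, so equal scores preserve the incoming candidate order.
    return [path for _, path in sorted(scored, key=lambda item: -item[0])]


candidates = ["docs/index.md", "django/db/models/query.py", "django/utils/text.py"]
print(bm25_rank("QuerySet filter fails in models query", candidates)[0])
```

Because paths are short, the length normalization has little to work with; the ordering is driven mostly by which rare directory or file names the query happens to mention.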
36+
37+
### Latest Full Run
38+
39+
Claude Sonnet 4.6, `tier=all`, `ranker=none`, `top_k=10`, 500 queries, 0 failures.
40+
41+
Metric notes:
42+
43+
- `recall@gold`: fraction of gold files recovered when the cutoff is the
44+
query's gold-file count.
45+
- `exact@gold`: all gold files are recovered within that same cutoff.
46+
- `found@gold`: average number of gold files recovered within that cutoff.
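
For a single query, the three metrics above can be sketched as follows. This is a hedged reading of the definitions, not the benchmark's own code: the function name `gold_cutoff_metrics` and the returned dictionary layout are hypothetical, and the aggregate figures are assumed to be means of these per-query values.

```python
def gold_cutoff_metrics(ranked, gold):
    """Score one query, truncating the ranking at the number of gold files."""
    gold_set = set(gold)
    cutoff = len(gold_set)
    if cutoff == 0:
        return {"found@gold": 0, "recall@gold": 0.0, "exact@gold": 0.0}
    top = ranked[:cutoff]                      # cutoff is the gold-file count
    found = len(gold_set.intersection(top))    # gold files recovered within the cutoff
    return {
        "found@gold": found,
        "recall@gold": found / cutoff,
        "exact@gold": 1.0 if found == cutoff else 0.0,
    }


# Two gold files, so only the top two predictions count; b.py at rank 3 is missed.
print(gold_cutoff_metrics(["a.py", "x.py", "b.py"], ["a.py", "b.py"]))
# {'found@gold': 1, 'recall@gold': 0.5, 'exact@gold': 0.0}
```

With one gold file the cutoff is 1, so `recall@gold` and `exact@gold` coincide, which matches the identical values in the 1-gold rows of the breakdown tables.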
47+
48+
#### Block (ConDB) vs Vertical (baseline)
49+
50+
Vertical is a per-beam-branch variant: each parent expands its children into
51+
separate subtree blocks (`A→B`, `A→C`), one LLM call per branch. It removes
52+
the cross-branch view that Block keeps, so it serves as a direct baseline
53+
for the merged-pool design used in ConDB.
54+
55+
| variant | recall@gold | exact@gold | MRR | nDCG@10 | avg returned | avg latency |
56+
|---|---:|---:|---:|---:|---:|---:|
57+
| Vertical (baseline) | 0.382 | 0.366 | 0.466 | 0.481 | 3.00 | ~24 s |
58+
| **Block (ConDB)** | **0.711** | **0.672** | **0.805** | **0.813** | 7.20 | ~8 s |
59+
60+
Block scores **+0.33 recall@gold** over Vertical (0.711 vs 0.382) at ~3× lower latency (~8 s vs ~24 s).
61+
62+
#### Block — per-gold-count breakdown
63+
64+
| gold files | queries | cutoff | recall@gold | exact@gold | found@gold | avg returned |
65+
|---|---:|---:|---:|---:|---:|---:|
66+
| 1 | 430 | 1 | 0.749 | 0.749 | 0.75 | 7.00 |
67+
| 2 | 48 | 2 | 0.521 | 0.271 | 1.04 | 8.31 |
68+
| 3 | 13 | 3 | 0.410 | 0.077 | 1.23 | 8.77 |
69+
| 4 | 6 | 4 | 0.417 | 0.000 | 1.67 | 9.17 |
70+
| 5 | 1 | 5 | 0.200 | 0.000 | 1.00 | 2.00 |
71+
| 6+ | 2 | gold | 0.274 | 0.000 | 2.00 | 10.00 |
72+
73+
#### Vertical — per-gold-count breakdown
74+
75+
| gold files | queries | cutoff | recall@gold | exact@gold | found@gold | avg returned |
76+
|---|---:|---:|---:|---:|---:|---:|
77+
| 1 | 430 | 1 | 0.412 | 0.412 | 0.41 | 3.07 |
78+
| 2 | 48 | 2 | 0.250 | 0.125 | 0.50 | 2.67 |
79+
| 3 | 13 | 3 | 0.103 | 0.000 | 0.31 | 2.54 |
80+
| 4 | 6 | 4 | 0.042 | 0.000 | 0.17 | 2.50 |
81+
| 5 | 1 | 5 | 0.400 | 0.000 | 2.00 | 5.00 |
82+
| 6+ | 2 | gold | 0.000 | 0.000 | 0.00 | 0.00 |
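
As a consistency check, the headline `recall@gold` and `exact@gold` values can be recovered from the two per-gold-count tables by weighting each row by its query count. The sketch assumes the headline numbers are per-query means; the row tuples are copied from the tables above, and the helper name `weighted_mean` is illustrative.

```python
# (queries, recall@gold, exact@gold) per gold-count row, copied from the tables above.
BLOCK = [(430, 0.749, 0.749), (48, 0.521, 0.271), (13, 0.410, 0.077),
         (6, 0.417, 0.000), (1, 0.200, 0.000), (2, 0.274, 0.000)]
VERTICAL = [(430, 0.412, 0.412), (48, 0.250, 0.125), (13, 0.103, 0.000),
            (6, 0.042, 0.000), (1, 0.400, 0.000), (2, 0.000, 0.000)]


def weighted_mean(rows, column):
    """Average one metric column, weighting each row by its query count."""
    total_queries = sum(row[0] for row in rows)
    return sum(row[0] * row[column] for row in rows) / total_queries


print(f"Block    recall@gold {weighted_mean(BLOCK, 1):.3f}  exact@gold {weighted_mean(BLOCK, 2):.3f}")
print(f"Vertical recall@gold {weighted_mean(VERTICAL, 1):.3f}  exact@gold {weighted_mean(VERTICAL, 2):.3f}")
# Block    recall@gold 0.711  exact@gold 0.672
# Vertical recall@gold 0.382  exact@gold 0.366
```

Both rows agree with the comparison table, and the query counts sum to 500, so the breakdown and the headline figures are mutually consistent.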
