VectifyAI
diff --git a/MANIFEST.in b/MANIFEST.in
Lines changed: 13 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 82 additions & 16 deletions
diff --git a/bench/README.md b/bench/README.md
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,13 @@
+include README.md
+include LICENSE
+include requirements.txt
+recursive-include contextdb *.py *.jinja *.yaml py.typed
+global-exclude __pycache__
+global-exclude *.py[cod]
+global-exclude .DS_Store
+prune tests
+prune bench
+prune data
+prune examples
+prune notes
+prune scripts
@@ -141,29 +141,95 @@ result = db.query(tree_id, "question", strategy="block", beam_size=3)
 
 ## 📈 Benchmark Snapshot
 
-Current filesystem benchmark summary lives in [bench/fs_block_beam_vertical.md](bench/fs_block_beam_vertical.md).
+Two benchmarks live under `bench/`.
 
-Run setup: `fs_query_order=prefix`, `beam_size=3`, `max_turns=10`, `5` filesystem queries on `context7` only.
+### Filesystem mode — SWEBench-FileTree
 
-### Claude Opus 4.6
+Runs on [`AmuroEita/SWEBench-FileTree`](https://huggingface.co/datasets/AmuroEita/SWEBench-FileTree),
+a path-only version of SWE-bench code retrieval:
 
-| Retriever | Avg Time (s) | Avg LLM Calls | Hit@1 | Hit@10 | Total Cost (USD) |
-|---|---:|---:|---:|---:|---:|
-| **Block** | 8.44 | 2.4 | 1.00 | 1.00 | 0.2166 |
-| **Vertical** | 28.18 | 6.8 | 0.40 | 1.00 | 0.2900 |
-| **Beam** | 18.36 | 4.8 | 0.60 | 1.00 | 0.2091 |
+- 500 GitHub issues as queries
+- 475 `(repo, commit)` repository snapshots as independent retrieval universes
+- 58,058 file paths; no source code, no file summaries
 
-### Claude Sonnet 4.6
+Given an issue and one snapshot's file tree, return the file(s) the fix
+touches. Specification: `notes/condb_swebench_filetree_bench.md`.
 
-| Retriever | Avg Time (s) | Avg LLM Calls | Hit@1 | Hit@10 | Total Cost (USD) |
-|---|---:|---:|---:|---:|---:|
-| **Block** | 8.42 | 3.4 | 1.00 | 1.00 | 0.0643 |
-| **Vertical** | 20.78 | 7.0 | 0.40 | 0.80 | 0.1712 |
-| **Beam** | 17.84 | 4.8 | 0.40 | 1.00 | 0.1335 |
+```bash
+export ANTHROPIC_API_KEY=sk-ant-...
+python bench/run_swebench_filetree.py --tier medium
+```
+
+Tiers (by retriever difficulty; lower difficulty = more path signal in query):
+
+```
+easy     107 queries   gold path appears in query text (sanity check)
+medium   133 queries   gold filename appears in query    (main report)
+hard     261 queries   gold module stem appears          (fuzzy matching)
+all      500 queries   no filter, includes ~48% path-signal-less queries
+```
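
As an illustration only (not part of this diff), the tier descriptions above could be read as a "strongest path signal" check per query. The helper name `path_signal_tier` and the case-insensitive substring matching are assumptions; the benchmark's real filter is defined in `bench/run_swebench_filetree.py` and may differ.

```python
from pathlib import PurePosixPath


def path_signal_tier(query: str, gold_paths: list[str]) -> str:
    """Return the strongest path signal the query carries for any gold file.

    Hypothetical helper: the tier names come from the README, but the
    matching rules (case-insensitive substring tests) are assumptions.
    """
    text = query.lower()
    paths = [PurePosixPath(p.lower()) for p in gold_paths]
    if any(str(p) in text for p in paths):
        return "easy"    # full gold path appears in the query text
    if any(p.name in text for p in paths):
        return "medium"  # gold filename appears in the query
    if any(p.stem in text for p in paths):
        return "hard"    # gold module stem appears in the query
    return "none"        # no path signal; only reachable via --tier all


gold = ["django/db/models/query.py"]
for issue in (
    "Crash in django/db/models/query.py on filter()",
    "query.py raises TypeError",
    "The query module raises TypeError",
    "Unicode error on save",
):
    print(path_signal_tier(issue, gold), "<-", issue)
```

Plain substring matching is deliberately loose: a short stem such as `io` would match many unrelated words, so a real filter would likely need token boundaries.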
+
+Output goes to `bench/runs/<timestamp>__<tier>/`: `report.md`, `summary.json`,
+`per_query.jsonl`.
+
+Block mode can optionally rerank only the cross-block merge candidates before
+the file/directory split:
+
+```bash
+python bench/run_swebench_filetree.py --tier medium --strategy block --ranker vector
+```
+
+Available rankers are `none`, `bm25`, and `vector`. The vector ranker uses
+LiteLLM embeddings, configured with `--embedding-provider` and
+`--embedding-model`. The default remains `--ranker none`.
+
+#### Latest Run
 
-`Block` is the best default: perfect Hit@1 across both models, lowest cost on Sonnet 4.6 (prompt caching cuts cost by ~60%), and fastest latency. `Beam` and `Vertical` are sensitive to model version — `Block` is the most robust choice.
+Claude Sonnet 4.6, `--tier all`, `--ranker none`, `--top-k 10`, 500 queries, 0 failures.
+Block (ConDB) is compared against a **Vertical baseline**: a per-beam-branch
+variant that expands each parent's children into separate subtree blocks
+(`A→B`, `A→C`) with one LLM call per branch, losing the cross-branch view
+that Block keeps.
+
+| variant | recall@gold | exact@gold | MRR | nDCG@10 | avg returned | avg latency |
+|---|---:|---:|---:|---:|---:|---:|
+| Vertical (baseline) | 0.382 | 0.366 | 0.466 | 0.481 | 3.00 | ~24 s |
+| **Block (ConDB)** | **0.711** | **0.672** | **0.805** | **0.813** | 7.20 | ~8 s |
+
+Block gains **+0.33 recall@gold** (0.711 vs 0.382) at ~3× lower latency (~8 s vs ~24 s).
+
+Per-gold-count breakdown for Block. The cutoff is each query's gold-file
+count; the `6+` row contains one 6-gold query and one 21-gold outlier:
+
+| gold files | queries | cutoff | recall@gold | exact@gold | found@gold | avg returned |
+|---|---:|---:|---:|---:|---:|---:|
+| 1 | 430 | 1 | 0.749 | 0.749 | 0.75 | 7.00 |
+| 2 | 48 | 2 | 0.521 | 0.271 | 1.04 | 8.31 |
+| 3 | 13 | 3 | 0.410 | 0.077 | 1.23 | 8.77 |
+| 4 | 6 | 4 | 0.417 | 0.000 | 1.67 | 9.17 |
+| 5 | 1 | 5 | 0.200 | 0.000 | 1.00 | 2.00 |
+| 6+ | 2 | gold | 0.274 | 0.000 | 2.00 | 10.00 |
+
+Reproduce:
+
+```bash
+python bench/run_swebench_filetree.py --tier all --strategy block    --ranker none --top-k 10
+python bench/run_swebench_filetree.py --tier all --strategy vertical --ranker none --top-k 10
+```
+
+### Document mode — single long document
+
+Compares retriever algorithms (Block / Beam / Vertical / ...) on one
+hierarchical document. Reports time, LLM calls, token usage with prompt
+caching, and USD cost.
+
+```bash
+python bench/run_document_bench.py \
+  --doc examples/large_doc.json \
+  --config bench/queries.json
+```
 
-These numbers are benchmark snapshots, not hard guarantees; exact cost and latency will vary with model choice, provider pricing, prompt-cache behavior, and corpus shape.
+Queries live in the config JSON as `{"queries": ["...", "..."]}`. Swap in
+any `--doc` and any `--config` to benchmark a different document.
 
 ---
 
 
@@ -0,0 +1,82 @@
+# Benchmarks
+
+## SWEBench-FileTree
+
+Path-only code retrieval over `AmuroEita/SWEBench-FileTree`.
+
+```bash
+PYTHONPATH=. python bench/run_swebench_filetree.py \
+  --tier all \
+  --strategy block \
+  --ranker none \
+  --top-k 10 \
+  --output-dir bench/runs/block_none_all
+```
+
+Outputs:
+
+- `summary.json`: aggregate metrics
+- `per_query.jsonl`: per-query predictions and metrics
+- `report.md`: markdown report
+- `bench.sqlite`: temporary benchmark database
+
+Strategies (`--strategy`):
+
+- `auto`: pick `beam` for ≤50 nodes, `block` otherwise.
+- `beam`: depth-first LLM beam expansion (small trees).
+- `block`: token-bounded block partitioning with cross-block merge.
+- `vertical`: baseline — per-beam-branch subtree blocks, no cross-branch view.
+
+Ranker options (apply to `block` / `vertical`):
+
+- `--ranker none`: preserve traversal and block-local LLM order.
+- `--ranker bm25`: lexical path ordering for cross-block merge candidates.
+- `--ranker vector`: embedding path ordering for cross-block merge candidates;
+  configure with `--embedding-provider` and `--embedding-model`.
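
To make "lexical path ordering" concrete, here is a minimal, self-contained BM25 sketch that orders candidate paths against a query. It is not the ranker shipped in this PR: the function names, the tokenizer (split on non-alphanumerics), and the `k1`/`b` defaults are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def bm25_rank(query: str, paths: list[str], k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Order candidate paths by BM25 score against the query, best first."""
    docs = [tokenize(p) for p in paths]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = (sum(len(d) for d in docs) / n_docs) or 1.0
    # Document frequency: number of paths containing each token.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))

    scored = []
    for path, doc in zip(paths, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, path))
    # sorted() is stable, so ties keep their original (traversal) order.
    return [path for _, path in sorted(scored, key=lambda item: -item[0])]


candidates = ["django/db/models/query.py", "docs/index.rst", "django/utils/text.py"]
print(bm25_rank("QuerySet filter fails in models query", candidates))
```

Because the sort is stable, candidates with no lexical overlap keep their incoming order, which is one way a ranker could avoid disturbing the LLM's block-local ordering for ties.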
+
+### Latest Full Run
+
+Claude Sonnet 4.6, `tier=all`, `ranker=none`, `top_k=10`, 500 queries, 0 failures.
+
+Metric notes:
+
+- `recall@gold`: fraction of gold files recovered when the cutoff is the
+  query's gold-file count.
+- `exact@gold`: fraction of queries for which all gold files are recovered
+  within that same cutoff.
+- `found@gold`: average number of gold files recovered within that cutoff.
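
The three metrics above can be sketched per query as follows. This is an illustrative reading of the metric notes, not the benchmark's own code; the function name `gold_cutoff_metrics` and the returned dictionary keys are hypothetical.

```python
def gold_cutoff_metrics(ranked: list[str], gold: list[str]) -> dict:
    """Score one query with the cutoff set to its gold-file count.

    `ranked` is the retriever's ordered output; `gold` is the set of files
    the fix touches. Key names are illustrative, not the benchmark's schema.
    """
    gold_set = set(gold)
    cutoff = len(gold_set)
    top = ranked[:cutoff]                      # keep only the first `cutoff` results
    found = len(gold_set.intersection(top))    # gold files recovered within the cutoff
    return {
        "found": found,
        "recall": found / cutoff if cutoff else 0.0,
        "exact": cutoff > 0 and found == cutoff,
    }


# Two gold files, so only the first two ranked results count.
print(gold_cutoff_metrics(["a.py", "b.py", "c.py"], ["a.py", "c.py"]))
```

Averaging `recall`, `exact`, and `found` over queries would then give table-level `recall@gold`, `exact@gold`, and `found@gold`, under the assumption that the benchmark aggregates with a plain mean.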
+
+#### Block (ConDB) vs Vertical (baseline)
+
+Vertical is a per-beam-branch variant: each parent expands its children into
+separate subtree blocks (`A→B`, `A→C`), one LLM call per branch. It removes
+the cross-branch view that Block keeps, so it serves as a direct baseline
+for the merged-pool design used in ConDB.
+
+| variant | recall@gold | exact@gold | MRR | nDCG@10 | avg returned | avg latency |
+|---|---:|---:|---:|---:|---:|---:|
+| Vertical (baseline) | 0.382 | 0.366 | 0.466 | 0.481 | 3.00 | ~24 s |
+| **Block (ConDB)** | **0.711** | **0.672** | **0.805** | **0.813** | 7.20 | ~8 s |
+
+Block improves recall@gold by **+0.33** (0.711 vs 0.382) at ~3× lower latency (~8 s vs ~24 s).
+
+#### Block — per-gold-count breakdown
+
+| gold files | queries | cutoff | recall@gold | exact@gold | found@gold | avg returned |
+|---|---:|---:|---:|---:|---:|---:|
+| 1 | 430 | 1 | 0.749 | 0.749 | 0.75 | 7.00 |
+| 2 | 48 | 2 | 0.521 | 0.271 | 1.04 | 8.31 |
+| 3 | 13 | 3 | 0.410 | 0.077 | 1.23 | 8.77 |
+| 4 | 6 | 4 | 0.417 | 0.000 | 1.67 | 9.17 |
+| 5 | 1 | 5 | 0.200 | 0.000 | 1.00 | 2.00 |
+| 6+ | 2 | gold | 0.274 | 0.000 | 2.00 | 10.00 |
+
+#### Vertical — per-gold-count breakdown
+
+| gold files | queries | cutoff | recall@gold | exact@gold | found@gold | avg returned |
+|---|---:|---:|---:|---:|---:|---:|
+| 1 | 430 | 1 | 0.412 | 0.412 | 0.41 | 3.07 |
+| 2 | 48 | 2 | 0.250 | 0.125 | 0.50 | 2.67 |
+| 3 | 13 | 3 | 0.103 | 0.000 | 0.31 | 2.54 |
+| 4 | 6 | 4 | 0.042 | 0.000 | 0.17 | 2.50 |
+| 5 | 1 | 5 | 0.400 | 0.000 | 2.00 | 5.00 |
+| 6+ | 2 | gold | 0.000 | 0.000 | 0.00 | 0.00 |
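
As a consistency check on the tables above, the headline `recall@gold` and `exact@gold` values should equal the query-weighted mean of the per-gold-count rows, assuming each headline metric is a plain mean over all 500 queries. The sketch below copies the `queries`, `recall@gold`, and `exact@gold` columns from both breakdown tables; the variable and function names are illustrative.

```python
# (queries, recall@gold, exact@gold) copied from the per-gold-count tables.
BLOCK = [
    (430, 0.749, 0.749), (48, 0.521, 0.271), (13, 0.410, 0.077),
    (6, 0.417, 0.000), (1, 0.200, 0.000), (2, 0.274, 0.000),
]
VERTICAL = [
    (430, 0.412, 0.412), (48, 0.250, 0.125), (13, 0.103, 0.000),
    (6, 0.042, 0.000), (1, 0.400, 0.000), (2, 0.000, 0.000),
]


def weighted_mean(rows, column):
    """Mean of `column` across rows, weighted by each row's query count."""
    total = sum(row[0] for row in rows)
    return sum(row[0] * row[column] for row in rows) / total


for name, rows in (("Block", BLOCK), ("Vertical", VERTICAL)):
    recall = weighted_mean(rows, 1)
    exact = weighted_mean(rows, 2)
    print(f"{name}: recall@gold={recall:.3f} exact@gold={exact:.3f}")
```

Rounded to three decimals, this reproduces the summary rows: 0.711 / 0.672 for Block and 0.382 / 0.366 for Vertical.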