Commit 8d5fd35

version for release
1 parent 6d2a6a3 commit 8d5fd35

33 files changed

Lines changed: 2457 additions & 2292 deletions

README.md

Lines changed: 37 additions & 25 deletions
@@ -8,23 +8,17 @@
 Store, navigate, and query hierarchical document structures with LLM-powered reasoning retrieval.
 </p>

-<h4 align="center">
-<a href="https://github.com/VectifyAI/PageIndex">PageIndex</a>&nbsp; | &nbsp;
-<a href="https://docs.pageindex.ai">Docs</a>&nbsp; | &nbsp;
-<a href="https://discord.com/invite/VuXuf29EUj">Discord</a>
-</h4>
-
 </div>

 ---

 ## What is ConDB?

-**ConDB** is the storage and retrieval engine behind [PageIndex](https://github.com/VectifyAI/PageIndex). It stores hierarchical document trees (generated by PageIndex or other sources) in a SQLite database, and provides LLM-powered **reasoning-based retrieval** to query them — no vector DB, no chunking.
+**ConDB** stores hierarchical document trees in a SQLite database and provides LLM-powered **reasoning-based retrieval** to query them — no vector DB, no chunking. It accepts pageindex-compatible trees, chat trees, and custom hierarchical JSON without taking a runtime code dependency on PageIndex itself.

 **Key capabilities:**

-- **Hierarchical storage** — store tree-structured documents (PDFs, Markdown, custom JSON) in SQLite
+- **Hierarchical storage** — store document trees, chat trees, and custom hierarchical JSON in SQLite
 - **Reasoning-based retrieval** — LLM navigates the tree to find relevant content, like a human expert
 - **Multiple retrieval strategies** — beam search for small trees, block retrieval for large documents
 - **Multi-provider LLM support** — works with Anthropic (Claude) and OpenAI (GPT) out of the box
@@ -38,7 +32,6 @@

 ```bash
 pip install -r requirements.txt
-pip install pageindex # optional, for PDF/Markdown indexing
 ```

 ### Basic Usage
@@ -52,29 +45,28 @@ db = contextdb.open("my_docs.sqlite")
 # Configure LLM
 db.set_llm(provider="anthropic", model="claude-sonnet-4-20250514")

-# Store a PageIndex tree
-tree_id = db.store(pageindex_json, format="pageindex")
+# Store a document tree
+tree_id = db.store(document_tree_json, format="document")

 # Query with LLM reasoning
 result = db.query(tree_id, "What are the key findings?")
 print(result.contents)
 ```

-### Index from files (requires `pageindex`)
+### Index from files with an external tree builder

 ```python
-from contextdb import ContextTree, LLMClient
+from contextdb import ContextTree

-llm = LLMClient("anthropic", api_key="sk-...", model="claude-sonnet-4-20250514")
-ct = ContextTree("context.sqlite", llm=llm)
+def build_markdown_tree(path: str) -> dict:
+    ...

-# Index documents
-tree_id = ct.index_pdf_file("report.pdf")
-tree_id = ct.index_markdown_file("doc.md")
+ct = ContextTree("context.sqlite")

-# Query
-result = ct.query(tree_id, "What are the main topics?", use_llm=True, max_turns=5)
-print(result.contents)
+tree_id = ct.index_markdown_file("doc.md", tree_builder=build_markdown_tree)
+
+# You can also generate a tree out of process and call:
+# tree_id = ct.index_document_tree(document_tree_json)

 ct.close()
 ```
@@ -118,17 +110,37 @@ result = db.query(tree_id, "question", strategy="block", beam_size=3)

 ---

+## Benchmark Snapshot
+
+Current filesystem benchmark summary lives in [bench/fs_block_beam_vertical.md](bench/fs_block_beam_vertical.md).
+
+Run setup for the snapshot below: `beam_size=3`, `max_turns=10`, `5` filesystem queries on `context7` only.
+
+### Block vs Beam vs Vertical
+
+| Retriever | Avg Time (s) | Avg LLM Calls | Hit@1 | Hit@10 | Total Cost (USD) |
+|---|---:|---:|---:|---:|---:|
+| **Block** | 5.47 | 1.00 | 1.00 | 1.00 | 0.0762 |
+| **Vertical** | 7.31 | 1.60 | 1.00 | 1.00 | 0.1486 |
+| **Beam** | 20.18 | 4.60 | 0.60 | 0.80 | 0.1328 |
+
+`Block` is the best default on this `context7` snapshot: same retrieval quality as `Vertical`, with lower latency and fewer model calls. `Beam` is still workable, but it trails clearly on retrieval accuracy.
+
+These numbers are benchmark snapshots, not hard guarantees; exact cost and latency will vary with model choice, provider pricing, prompt-cache behavior, and corpus shape.
+
+---
+
 ## Architecture

 ```
 contextdb/
 ├── api/
 │   ├── condb.py # ConDB — main entry point
-│   └── context_tree.py # ContextTree — file indexing API
+│   └── context_tree.py # ContextTree — tree indexing + query API
 ├── core/
 │   └── storage.py # TreeDB (SQLite), StorageProtocol
 ├── adapter/
-│   └── base.py # PageIndex, ChatIndex, Generic adapters
+│   └── base.py # DocumentTree, ChatIndex, Generic adapters
 ├── retriever/
 │   ├── base.py # Retriever protocols
 │   └── algorithm/ # Beam, Block retrieval strategies
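The relative figures behind the benchmark snapshot table added in this hunk can be checked with a few lines of arithmetic. The sketch below copies the table values verbatim; it assumes that "Total Cost (USD)" is summed over the 5 queries of the run, which the README implies but does not state outright. The helper names are illustrative only.

```python
# Values copied from the "Block vs Beam vs Vertical" table (5 queries, context7).
SNAPSHOT = {
    "Block": {"avg_time_s": 5.47, "avg_llm_calls": 1.00, "total_cost_usd": 0.0762},
    "Vertical": {"avg_time_s": 7.31, "avg_llm_calls": 1.60, "total_cost_usd": 0.1486},
    "Beam": {"avg_time_s": 20.18, "avg_llm_calls": 4.60, "total_cost_usd": 0.1328},
}
NUM_QUERIES = 5  # from the "Run setup" line in the README


def ratio_to(baseline: str, metric: str) -> dict[str, float]:
    """Express one metric for every retriever as a multiple of the baseline's value."""
    base = SNAPSHOT[baseline][metric]
    return {name: row[metric] / base for name, row in SNAPSHOT.items()}


def cost_per_query(name: str) -> float:
    """Assumption: the table's total cost is the sum over all queries in the run."""
    return SNAPSHOT[name]["total_cost_usd"] / NUM_QUERIES


if __name__ == "__main__":
    for metric in ("avg_time_s", "total_cost_usd"):
        for name, value in ratio_to("Block", metric).items():
            print(f"{metric:>15} {name:<9} {value:.2f}x Block")
    for name in SNAPSHOT:
        print(f"{name:<9} ~${cost_per_query(name):.4f} per query")
```

On these figures, `Vertical` takes roughly 1.3 times as long and costs roughly 1.95 times as much as `Block`, while `Beam` takes roughly 3.7 times as long.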
@@ -182,7 +194,7 @@ ct = ContextTree("db.sqlite", llm=MyLLM())

 ## Related Projects

-- [**PageIndex**](https://github.com/VectifyAI/PageIndex)reasoning-based RAG framework that generates hierarchical tree indices from documents
+- [**PageIndex**](https://github.com/VectifyAI/PageIndex)one possible external producer of pageindex-compatible document trees
 - [**AgentFS**](https://github.com/anthropics/agentfs) — filesystem for AI agents

 ---
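The README's new "Index from files with an external tree builder" example leaves `build_markdown_tree` as a stub. A minimal sketch of such a callable follows. The node schema used here (`title`, `text`, `nodes`) is an assumption made for illustration; this diff does not show the tree format ConDB actually expects, so the key names would need to be adapted. `markdown_to_tree` is a hypothetical helper, not part of `contextdb`.

```python
import re
from pathlib import Path

# Hypothetical node schema: {"title": str, "text": str, "nodes": [children]}.
# These key names are assumptions, not ConDB's documented format.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def markdown_to_tree(text: str, root_title: str = "document") -> dict:
    """Nest Markdown sections by heading depth."""
    root = {"title": root_title, "text": "", "nodes": []}
    stack = [(0, root)]  # (heading level, node); the root sits at level 0
    for line in text.splitlines():
        match = HEADING.match(line)
        if match is None:
            # Body text belongs to the most recently opened section.
            stack[-1][1]["text"] += line + "\n"
            continue
        level = len(match.group(1))
        node = {"title": match.group(2).strip(), "text": "", "nodes": []}
        # Close sections at the same or a deeper level before attaching.
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["nodes"].append(node)
        stack.append((level, node))
    return root


def build_markdown_tree(path: str) -> dict:
    """Same signature as the README stub: read a file and return its tree."""
    source = Path(path)
    return markdown_to_tree(source.read_text(encoding="utf-8"), source.stem)
```

This sketch treats any line starting with `#` as a heading, including comment lines inside fenced code, so a real builder would need to track fences as well.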
@@ -191,7 +203,7 @@ ct = ContextTree("db.sqlite", llm=MyLLM())

 ## License

-MIT
+Apache-2.0

 <br/>

bench/benchmark_retrievers.py

Lines changed: 29 additions & 39 deletions
@@ -10,13 +10,11 @@
     python bench/benchmark_retrievers.py --mode doc --doc <document.json> --config <queries.json>

 Filesystem mode:
-    python bench/benchmark_retrievers.py --mode fs --repo-dir <path> --queries-config <queries.json>
+    python bench/benchmark_retrievers.py --mode fs --fs-root <path> --queries-config <queries.json>

 Examples:
     python bench/benchmark_retrievers.py --mode doc --doc examples/large_doc.json --config bench/queries.json
-    python bench/benchmark_retrievers.py --mode fs --repo-dir bench/filesystem/repo --queries-config bench/filesystem/repo/queries.json
-    python bench/benchmark_retrievers.py --mode fs --repo-dir bench/filesystem/arxiv --queries-config bench/filesystem/arxiv/queries.json
-    python bench/benchmark_retrievers.py --mode fs --repo-dir bench/filesystem/context7 --queries-config bench/filesystem/context7/queries.json
+    python bench/benchmark_retrievers.py --mode fs --fs-root bench/filesystem/context7 --queries-config bench/filesystem/context7/queries.json
 """

 import argparse
@@ -27,7 +25,7 @@
 import time
 from dataclasses import asdict, dataclass, field
 from pathlib import Path
-from typing import Any, Optional
+from typing import Optional

 sys.path.insert(0, str(Path(__file__).parent.parent))

@@ -38,6 +36,8 @@

 ALGORITHM_DIR = Path(__file__).parent.parent / "contextdb/retriever/algorithm"
 EXCLUDED_FILES = {"base_retriever.py", "block_cutter.py", "block_types.py", "__init__.py"}
+FS_HIT_LEVELS = (1, 3, 5, 10)
+FS_DISPLAY_HIT_LEVELS = (1, 10)


 def discover_retrievers() -> dict[str, type]:
@@ -91,7 +91,6 @@ class RetrieverMetrics:
     # FS-specific
     retrieved_files: Optional[list[str]] = None
     hit_at: Optional[dict[int, bool]] = None
-    mrr: Optional[float] = None


 @dataclass
@@ -145,11 +144,9 @@ def total(items, attr):

             # FS hit metrics
             if self.mode == "fs":
-                for k in [1, 3, 5, 10]:
+                for k in FS_HIT_LEVELS:
                     hits = [r for r in valid if r.hit_at and k in r.hit_at]
                     s[f"hit@{k}"] = sum(1 for r in hits if r.hit_at[k]) / len(hits) if hits else 0
-                mrr_vals = [r.mrr for r in valid if r.mrr is not None]
-                s["mrr"] = sum(mrr_vals) / len(mrr_vals) if mrr_vals else 0

             summary[name] = s

@@ -205,25 +202,21 @@ def build_entities(flat_list: list) -> dict:
 # ── FS Mode Helpers ─────────────────────────────────────────────────


-def compute_fs_metrics(retrieved_files: list[str], ground_truth: list[str]) -> tuple[dict[int, bool], float]:
-    """Returns (hit_at_k_dict, mrr)."""
+def compute_fs_metrics(retrieved_files: list[str], ground_truth: list[str]) -> dict[int, bool]:
+    """Returns hit-at-k booleans for the retrieved file list."""
     gt_set = set(ground_truth)
     hit_at = {}
-    first_hit_rank = None

     for i, f in enumerate(retrieved_files):
-        if f in gt_set and first_hit_rank is None:
-            first_hit_rank = i + 1
-        for k in [1, 3, 5, 10]:
+        for k in FS_HIT_LEVELS:
             if i < k and f in gt_set:
                 hit_at.setdefault(k, False)
                 hit_at[k] = True

-    for k in [1, 3, 5, 10]:
+    for k in FS_HIT_LEVELS:
         hit_at.setdefault(k, False)

-    mrr = 1.0 / first_hit_rank if first_hit_rank else 0.0
-    return hit_at, mrr
+    return hit_at


 def extract_file_paths(db: TreeDB, tree_id: str, node_ids: list[str]) -> list[str]:
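With MRR removed, the filesystem metric reduces to Hit@k: a query counts as a hit at cutoff k when any ground-truth file appears among the first k retrieved files. The sketch below restates that as a one-line comprehension, which should be equivalent to the loop in `compute_fs_metrics` above, and adds a per-cutoff average in the spirit of the summary code. The function names `hit_at_k` and `hit_rates` are hypothetical and do not appear in the benchmark script.

```python
from typing import Optional

FS_HIT_LEVELS = (1, 3, 5, 10)  # same cutoffs as the constant added in this commit


def hit_at_k(retrieved_files: list[str], ground_truth: list[str],
             levels: tuple[int, ...] = FS_HIT_LEVELS) -> dict[int, bool]:
    """True at cutoff k if any ground-truth file is in the top k results."""
    gt_set = set(ground_truth)
    return {k: any(f in gt_set for f in retrieved_files[:k]) for k in levels}


def hit_rates(per_query: list[Optional[dict[int, bool]]],
              levels: tuple[int, ...] = FS_HIT_LEVELS) -> dict[str, float]:
    """Fraction of scored queries with a hit at each cutoff."""
    rates = {}
    for k in levels:
        # Skip queries that carry no hit data, for example runs that errored.
        scored = [hits for hits in per_query if hits and k in hits]
        rates[f"hit@{k}"] = sum(hits[k] for hits in scored) / len(scored) if scored else 0.0
    return rates


if __name__ == "__main__":
    first = hit_at_k(["a.py", "b.py", "c.py"], ["b.py"])
    second = hit_at_k(["b.py"], ["b.py"])
    print(first)
    print(hit_rates([first, second]))
```

Because Hit@k only records whether a relevant file appears within the cutoff, it no longer distinguishes rank 2 from rank 3 the way MRR did.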
@@ -299,10 +292,9 @@ def run_retriever(

         if mode == "fs" and db and ground_truth:
             retrieved_files = extract_file_paths(db, tree_id, result.nodes)
-            hit_at, mrr = compute_fs_metrics(retrieved_files, ground_truth)
+            hit_at = compute_fs_metrics(retrieved_files, ground_truth)
             metrics.retrieved_files = retrieved_files
             metrics.hit_at = hit_at
-            metrics.mrr = mrr

         return metrics
     except Exception as e:
@@ -328,7 +320,7 @@ def run_benchmark(
     *,
     mode: str = "doc",
     doc_path: Path = None,
-    repo_dir: Path = None,
+    fs_root: Path = None,
     query_list: list = None,
     queries_with_gt: list[dict] = None,
     queries_config_path: Path = None,
@@ -382,25 +374,25 @@ def make_tree(db):
         ignore_patterns = list(DEFAULT_IGNORE_PATTERNS)
         effective_queries_config = queries_config_path
         if effective_queries_config is None:
-            default_queries = repo_dir / "queries.json"
+            default_queries = fs_root / "queries.json"
             if default_queries.exists():
                 effective_queries_config = default_queries

         if effective_queries_config is not None:
             try:
-                rel_cfg = str(effective_queries_config.resolve().relative_to(repo_dir.resolve())).replace("\\", "/")
+                rel_cfg = str(effective_queries_config.resolve().relative_to(fs_root.resolve())).replace("\\", "/")
                 ignore_patterns.append(rel_cfg)
             except ValueError:
                 pass

         if fs_query_order == "prefix":
             queries_with_gt = reorder_fs_queries_by_prefix(queries_with_gt)

-        adapter = FileSystemAdapter(str(repo_dir), ignore_patterns=ignore_patterns)
+        adapter = FileSystemAdapter(str(fs_root), ignore_patterns=ignore_patterns)
         tree_structure, entities = adapter.convert()
         queries = [q["query"] for q in queries_with_gt]
         ground_truths = [q["ground_truth"] for q in queries_with_gt]
-        doc_info = str(repo_dir)
+        doc_info = str(fs_root)
         sections_count = len(entities)

         def make_tree(db):
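The hunk above keeps the queries file out of the indexed tree when it sits under the filesystem root, so the ground-truth file cannot be retrieved as an answer. That step can be isolated as a small helper. This is a sketch only: `config_ignore_pattern` is a hypothetical name, and it uses `as_posix()` in place of the script's `str(...).replace("\\", "/")`, which should yield the same forward-slash path on Windows.

```python
from pathlib import Path
from typing import Optional


def config_ignore_pattern(fs_root: Path, queries_config: Path) -> Optional[str]:
    """Return the config path relative to fs_root, or None if it lives elsewhere."""
    try:
        # relative_to raises ValueError when the config is not under fs_root.
        relative = queries_config.resolve().relative_to(fs_root.resolve())
    except ValueError:
        return None
    return relative.as_posix()


if __name__ == "__main__":
    root = Path("bench/filesystem/context7")
    print(config_ignore_pattern(root, root / "queries.json"))
    print(config_ignore_pattern(root, Path("elsewhere/queries.json")))
```

Resolving both paths first means the comparison still works when one path is given relative to the working directory and the other is absolute.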
@@ -462,7 +454,7 @@ def make_tree(db):
                 parts.append(f"Cache read: {metrics.cache_read_tokens:,}")
             if mode == "fs" and metrics.hit_at:
                 parts.append(f"Hit@1: {metrics.hit_at.get(1, False)}")
-                parts.append(f"MRR: {metrics.mrr:.3f}")
+                parts.append(f"Hit@10: {metrics.hit_at.get(10, False)}")
                 if metrics.retrieved_files:
                     parts.append(f"Files: {metrics.retrieved_files[:5]}")
             print(f" {', '.join(parts)}")
@@ -486,7 +478,7 @@ def print_summary(result: BenchmarkResult):
     print(title)
     print("=" * 70)
     print(f"\nModel: {Config.LLM_PROVIDER}/{Config.LLM_MODEL}")
-    print(f"{'Repository' if result.mode == 'fs' else 'Document'}: {result.document_path}")
+    print(f"{'Filesystem Root' if result.mode == 'fs' else 'Document'}: {result.document_path}")
     print(f"Entities: {result.document_sections}")
     print(f"Queries: {summary['queries_run']}")
     print(f"Retrievers: {', '.join(result.retriever_names)}")
@@ -508,16 +500,12 @@ def print_summary(result: BenchmarkResult):
         rows.append(row)

     if result.mode == "fs":
-        for k in [1, 3, 5, 10]:
+        for k in FS_DISPLAY_HIT_LEVELS:
             row = [f"Hit@{k}"]
             for name in result.retriever_names:
                 val = summary[name].get(f"hit@{k}", 0)
                 row.append(f"{val:.1%}")
             rows.append(row)
-        mrr_row = ["MRR"]
-        for name in result.retriever_names:
-            mrr_row.append(f"{summary[name].get('mrr', 0):.3f}")
-        rows.append(mrr_row)

     col_widths = [max(len(str(row[i])) for row in [headers] + rows) for i in range(len(headers))]
     header_line = " | ".join(h.ljust(w) for h, w in zip(headers, col_widths))
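The summary table now shows only the cutoffs in `FS_DISPLAY_HIT_LEVELS`, although all four levels are still computed. A standalone sketch of the row building and column alignment follows. The `hit_rows` and `format_table` helpers are hypothetical, and the `-+-` separator line is an assumption, since the diff shows only how `col_widths` and `header_line` are built.

```python
FS_DISPLAY_HIT_LEVELS = (1, 10)  # cutoffs shown in the summary table


def hit_rows(summary: dict[str, dict[str, float]], names: list[str],
             levels: tuple[int, ...] = FS_DISPLAY_HIT_LEVELS) -> list[list[str]]:
    """One table row per displayed cutoff, one percentage cell per retriever."""
    rows = []
    for k in levels:
        key = f"hit@{k}"
        rows.append([f"Hit@{k}"] + [f"{summary[name].get(key, 0):.1%}" for name in names])
    return rows


def format_table(headers: list[str], rows: list[list[str]]) -> str:
    """Pad every column to its widest cell, as print_summary does."""
    col_widths = [max(len(str(row[i])) for row in [headers] + rows)
                  for i in range(len(headers))]

    def fmt(row: list[str]) -> str:
        return " | ".join(str(cell).ljust(w) for cell, w in zip(row, col_widths))

    # Assumed separator style; the real script's separator is not in the diff.
    separator = "-+-".join("-" * w for w in col_widths)
    return "\n".join([fmt(headers), separator] + [fmt(row) for row in rows])


if __name__ == "__main__":
    summary = {"block": {"hit@1": 1.0, "hit@10": 1.0},
               "beam": {"hit@1": 0.6, "hit@10": 0.8}}
    names = ["block", "beam"]
    print(format_table(["Metric"] + names, hit_rows(summary, names)))
```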
@@ -549,8 +537,8 @@ def main():
                         help="Path to document JSON file (doc mode)")
     parser.add_argument("--config", "-c", type=Path,
                         help="Path to queries config JSON (doc mode)")
-    parser.add_argument("--repo-dir", type=Path,
-                        help="Path to repository directory (fs mode)")
+    parser.add_argument("--fs-root", type=Path,
+                        help="Path to filesystem root directory (fs mode)")
     parser.add_argument("--queries-config", type=Path,
                         help="Path to queries+ground_truth JSON (fs mode)")
     parser.add_argument("--output", "-o", choices=["text", "json"], default="text")
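Renaming the flag also renames the parsed attribute: `argparse` maps `--fs-root` to `args.fs_root`, so any caller still passing `--repo-dir` is rejected at parse time. The sketch below rebuilds just the fs-mode options to illustrate this. The `--mode` choices and default are assumptions inferred from the usage text, and `validate_fs_args` collects messages instead of calling `sys.exit(1)` as the script does, purely so the checks are easy to exercise.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Subset of the benchmark CLI; --mode choices/default are assumed."""
    parser = argparse.ArgumentParser(description="Retriever benchmark (sketch)")
    parser.add_argument("--mode", choices=["doc", "fs"], default="doc")
    parser.add_argument("--fs-root", type=Path,
                        help="Path to filesystem root directory (fs mode)")
    parser.add_argument("--queries-config", type=Path,
                        help="Path to queries+ground_truth JSON (fs mode)")
    return parser


def validate_fs_args(args: argparse.Namespace) -> list[str]:
    """Return the same error messages main() prints, without exiting."""
    errors = []
    if not args.fs_root or not args.fs_root.exists():
        errors.append("--fs-root required and must exist for fs mode")
    if not args.queries_config or not args.queries_config.exists():
        errors.append("--queries-config required and must exist for fs mode")
    return errors


if __name__ == "__main__":
    args = build_parser().parse_args(["--mode", "fs", "--fs-root", "."])
    for message in validate_fs_args(args):
        print(f"ERROR: {message}")
```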
@@ -589,7 +577,8 @@ def main():
         config = load_config(args.config)
         queries = config.get("queries", [])
         if not queries:
-            print("ERROR: No queries in config"); sys.exit(1)
+            print("ERROR: No queries in config")
+            sys.exit(1)

         print("=" * 70)
         print("Retriever Benchmark (DOC mode)")
@@ -603,22 +592,23 @@ def main():
         )

     elif args.mode == "fs":
-        if not args.repo_dir or not args.repo_dir.exists():
-            print("ERROR: --repo-dir required and must exist for fs mode")
+        if not args.fs_root or not args.fs_root.exists():
+            print("ERROR: --fs-root required and must exist for fs mode")
             sys.exit(1)
         if not args.queries_config or not args.queries_config.exists():
             print("ERROR: --queries-config required and must exist for fs mode")
             sys.exit(1)
         config = load_config(args.queries_config)
         queries_with_gt = config.get("queries", [])
         if not queries_with_gt:
-            print("ERROR: No queries in config"); sys.exit(1)
+            print("ERROR: No queries in config")
+            sys.exit(1)

         print("=" * 70)
         print("Retriever Benchmark (FS mode)")
         print("=" * 70)
         result = run_benchmark(
-            mode="fs", repo_dir=args.repo_dir, queries_with_gt=queries_with_gt,
+            mode="fs", fs_root=args.fs_root, queries_with_gt=queries_with_gt,
             queries_config_path=args.queries_config,
             beam_size=args.beam_size, max_turns=args.max_turns,
             clear_cache=args.clear_cache,
