Commit fbd07f8

SonAIengine and claude committed
Release v0.16.0: engine flip, CDC batching, 30× eval coverage
Five changes that make Synaptic's benchmark numbers match what the SDK actually does:

1. graph.search() default engine: legacy → "evidence"
   - The hybrid BM25 + HNSW + PPR + MMR pipeline that the MCP knowledge_search / agent_search / deep_search paths already used.
   - engine="legacy" kept but deprecated; removal scheduled v0.17.0.
   - Six legacy-specific test files pass engine="legacy" explicitly.
2. CDC sync: N+1 → 1 batch SELECT
   - New SyncStateStore.get_pk_index_batch(...) issues one query per table instead of 3 per row. 10k-row sync drops from ~30s to <100ms of state-lookup cost.
3. Concurrent _ensure_graph() now holds asyncio.Lock with a double-checked fast path. Fixes a latent race where two MCP tools firing on the first turn could construct two SynapticGraph instances.
4. Carries forward v0.15.1's query-mode Kiwi (drops Korean interrogative / copular morphemes at query time only).
5. Evaluation coverage grows from 715 → 22,400 queries by adding HotPotQA-dev (66,635 docs), MuSiQue-Ans-dev (21,100), and 2WikiMultihopQA-dev (56,687). See examples/ablation/download_benchmarks.py + examples/ablation/run_tier1_benchmarks.py.

Net effect, embedder-free:

  Allganize RAG-ko    0.621 → 0.947 (+0.326)
  Allganize RAG-Eval  0.615 → 0.911 (+0.296)
  AutoRAG KO          0.592 → 0.906 (+0.314)
  PublicHealthQA KO   0.318 → 0.546 (+0.228)
  HotPotQA-24 EN      0.727 → 0.875 (+0.148)

English multi-hop debuts at HotPotQA-dev MRR@10 0.784 / 2Wiki-dev 0.795 / MuSiQue-dev 0.590 (500q subsets).

Streaming invariance sharpens on the new default: top-1 identical 54.5% → 100%, |ΔMRR| 0.0100 → 0.0000.

Tests: 819 pass, 3 skipped, 0 regressions.

See docs/RELEASE_NOTES_v0.16.0.md for the full changelog.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 9b633a1 commit fbd07f8

66 files changed

Lines changed: 12821 additions & 293 deletions

.gitignore

Lines changed: 6 additions & 0 deletions
@@ -13,6 +13,12 @@ eval/data/*.sqlite-wal
 eval/data/*.kuzu
 eval/data/*.kuzu.*
 eval/results/
+examples/benchmark_vs_competitors/results/
+examples/benchmark_vs_competitors/hipporag_bench/
+# Example-run output DBs (quickstart, langchain example, etc.)
+*.db
+*.db-shm
+*.db-wal
 temp/
 .DS_Store
 ~/

CHANGELOG.md

Lines changed: 155 additions & 0 deletions
@@ -6,6 +6,161 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

 ## [Unreleased]

+## [0.16.0] - 2026-04-17
+
+v0.16.0 is a **foundational cleanup release**. Five changes that add
+up to a noticeably different product: (1) the retrieval engine
+finally defaults to the hybrid EvidenceSearch pipeline the benchmarks
+have been advertising, (2) CDC sync drops N round-trips to a single
+batch SELECT, (3) concurrent MCP tool calls no longer race on
+first-use initialisation, (4) the query-mode Kiwi improvement
+from the v0.15.1 branch is carried forward, and (5) the evaluation
+surface is expanded 30× — from 715 queries across 5 Korean-heavy
+corpora to 22,400 queries across 8 corpora including standard English
+multi-hop benchmarks (HotPotQA-dev, MuSiQue-Ans, 2WikiMultihopQA).
+The net effect on public Korean benchmarks is large (Allganize RAG-ko
+MRR 0.621 → 0.947 since v0.15.0); English multi-hop debuts at
+HotPotQA-dev MRR 0.784 / 2Wiki-dev MRR 0.795 (500-query subsets,
+embedder-free).
+
+### Changed — `graph.search()` default engine flipped to `"evidence"`
+
+Public API semantics change: `SynapticGraph.search(query, ...)` with
+no `engine=` kwarg now uses the :class:`EvidenceSearch` hybrid
+pipeline (BM25 + HNSW + PPR + MMR + optional cross-encoder) that
+`agent_search` / `knowledge_search` / `deep_search` already used.
+
+Why now. The previous default (``engine="legacy"`` → HybridSearch)
+was retained through v0.15.x for caller stability, but every
+measured benchmark — including the ones quoted in the README and in
+[docs/paper/draft.md](docs/paper/draft.md) — was run through
+EvidenceSearch via the MCP tool path. Leaving SDK callers on the
+legacy cascade meant our own quoted numbers didn't match what a
+first-time user of ``graph.search()`` saw. This release closes that
+gap.
+
+Effect on public FTS-only benchmarks (no embedder, no reranker):
+
+| Dataset | v0.15.0 (legacy) | v0.15.1 (legacy + kiwi) | **v0.16.0 (evidence + kiwi)** | Total Δ |
+|---------|------------------|-------------------------|-------------------------------|---------|
+| Allganize RAG-ko | 0.621 | 0.743 | **0.947** | +0.326 |
+| Allganize RAG-Eval | 0.615 | 0.695 | **0.911** | +0.296 |
+| PublicHealthQA KO | 0.318 | 0.466 | **0.546** | +0.228 |
+| AutoRAG KO | 0.592 | 0.692 | **0.906** | +0.314 |
+| HotPotQA-24 EN | 0.727 | 0.727 | **0.875** | +0.148 |
+
+Migration. ``engine="legacy"`` still works and raises
+:class:`DeprecationWarning`. Legacy-specific features
+(`query_decomposer`, `reranker=LLMReranker(...)` injection, and
+reinforcement-based ranking boost) are only honoured on the legacy
+path and keep their tests under `engine="legacy"`. Scheduled for
+removal in v0.17.0.
+
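Editorial sketch of the default flip and deprecation path described in this entry. `DemoGraph` and `legacy_call_warns` are hypothetical stand-ins written for illustration; they are not Synaptic's code, and the real `SynapticGraph.search()` signature may differ. The sketch assumes the warning is emitted through Python's `warnings` module rather than raised as an exception.

```python
import warnings


class DemoGraph:
    """Hypothetical stand-in for SynapticGraph; not the library's code."""

    def search(self, query: str, *, engine: str = "evidence") -> str:
        if engine == "legacy":
            # Assumed behaviour: warn, then keep serving the legacy path.
            warnings.warn(
                'engine="legacy" is deprecated and scheduled for removal in v0.17.0',
                DeprecationWarning,
                stacklevel=2,
            )
        elif engine != "evidence":
            raise ValueError(f"unknown engine: {engine!r}")
        # A real implementation would dispatch to a search pipeline here;
        # this sketch only reports which path was selected.
        return engine


def legacy_call_warns(graph: DemoGraph) -> bool:
    """Return True if an explicit legacy call emits DeprecationWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        graph.search("example query", engine="legacy")
    return any(issubclass(w.category, DeprecationWarning) for w in caught)
```

Callers that depend on legacy-only features would pass `engine="legacy"` explicitly, as the six legacy-specific test files mentioned in the commit message do.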
+### Changed — streaming invariance sharpens under the new default
+
+`examples/ablation/streaming_experiment.py` re-run on the Evidence
+pipeline (Allganize RAG-ko, 200 docs / 200 queries, 10 random
+streaming batches vs. one batch):
+
+| Metric | v0.15.x (legacy engine) | **v0.16.0 (evidence engine)** |
+|--------|-------------------------|-------------------------------|
+| Bit-wise identical top-10 | 103 / 200 (51.5 %) | **192 / 200 (96.0 %)** |
+| Top-1 identical | 109 / 200 (54.5 %) | **200 / 200 (100 %)** |
+| \|Δ MRR\| | 0.0100 | **0.0000** |
+| Set-equal top-10 | 197 / 200 (98.5 %) | 197 / 200 (98.5 %) |
+
+The streaming-invariance theorem in [docs/paper/theorem.md](docs/paper/theorem.md)
+holds more tightly on the new default.
+
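Editorial sketch of how the three agreement counts in the table could be computed from two runs over the same queries. `compare_rankings` is a hypothetical helper, not the logic of `streaming_experiment.py`. It assumes "bit-wise identical" means the same document IDs in the same order; the script may also compare scores.

```python
from typing import Sequence


def compare_rankings(
    one_batch: Sequence[Sequence[str]],
    streamed: Sequence[Sequence[str]],
    k: int = 10,
) -> dict[str, int]:
    """Count queries whose top-k results agree between two ingestion orders."""
    if len(one_batch) != len(streamed):
        raise ValueError("both runs must cover the same queries")
    counts = {"identical_top_k": 0, "top1_identical": 0, "set_equal_top_k": 0}
    for a, b in zip(one_batch, streamed):
        top_a, top_b = list(a[:k]), list(b[:k])
        if top_a == top_b:  # same IDs in the same order
            counts["identical_top_k"] += 1
        if top_a[:1] == top_b[:1]:  # same first result
            counts["top1_identical"] += 1
        if set(top_a) == set(top_b):  # same IDs, order ignored
            counts["set_equal_top_k"] += 1
    return counts
```

Under these definitions every order-identical query is also set-equal, which is consistent with the table (192 identical versus 197 set-equal).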
+### Fixed — CDC sync: N+1 round-trips → one batch SELECT
+
+Both `HashTableSyncer.sync_table()` and
+`TimestampTableSyncer.sync_table()`
+(`src/synaptic/extensions/cdc/sync.py`) previously issued 3 × N
+awaits per table (one `get_row_hash` + `get_node_id` + `get_fk_edges`
+call per row). They now issue a single
+`SyncStateStore.get_pk_index_batch(...)` call that returns
+`{pk: (node_id, row_hash, fk_json)}` for every changed PK in one
+SQLite round-trip (chunked to 500 PKs per statement defensively).
+
+Impact. For a source table with N changed rows on a 1 ms local
+SQLite round-trip, CDC sync latency drops from 3N ms to ~1 ms — so
+a 100-row change sync moves from ~300 ms of sequential awaits to a
+single query. At 10 k rows the difference is 30 s → <100 ms.
+
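Editorial sketch of the chunked batch lookup described in this entry, using the standard-library `sqlite3` module synchronously. The `sync_state` table and its column names are assumptions made for illustration; the actual `SyncStateStore` schema and its async interface are not shown in this commit page.

```python
import sqlite3
from typing import Iterable

CHUNK = 500  # PKs per statement, matching the chunk size quoted above


def get_pk_index_batch(
    conn: sqlite3.Connection, table: str, pks: Iterable[str]
) -> dict[str, tuple[str, str, str]]:
    """Return {pk: (node_id, row_hash, fk_json)} using one SELECT per chunk."""
    pk_list = list(pks)
    index: dict[str, tuple[str, str, str]] = {}
    for start in range(0, len(pk_list), CHUNK):
        chunk = pk_list[start:start + CHUNK]
        placeholders = ",".join("?" * len(chunk))  # "?,?,?" for three PKs
        # Hypothetical schema: sync_state(table_name, pk, node_id, row_hash, fk_json)
        rows = conn.execute(
            "SELECT pk, node_id, row_hash, fk_json FROM sync_state "
            f"WHERE table_name = ? AND pk IN ({placeholders})",
            [table, *chunk],
        )
        for pk, node_id, row_hash, fk_json in rows:
            index[pk] = (node_id, row_hash, fk_json)
    return index
```

One plausible reason for the 500-PK chunk is that older SQLite builds cap bound parameters at 999 per statement by default, so 500 PKs plus the table name stays well under that limit. PKs with no stored state are simply absent from the returned mapping.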
+### Fixed — concurrent `_ensure_graph()` no longer races
+
+`src/synaptic/mcp/server.py` — lazy graph initialisation is now
+serialised through an `asyncio.Lock` with a double-checked fast path.
+Previously, two tool invocations firing on the same first turn could
+both see `_graph is None` and construct two SynapticGraph instances,
+leaking a backend connection. The fast path (graph already set) still
+requires no lock.
+
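Editorial sketch of double-checked lazy initialisation behind an `asyncio.Lock`, the pattern this entry names. `Graph`, `LazyGraph`, `_build_graph`, and `demo` are hypothetical names for illustration; the actual code in `server.py` is not shown here and may be structured differently.

```python
import asyncio
from typing import Optional


class Graph:
    """Hypothetical stand-in for an expensive-to-build graph object."""

    instances = 0  # counts constructions so the race would be visible

    def __init__(self) -> None:
        Graph.instances += 1


async def _build_graph() -> Graph:
    await asyncio.sleep(0.01)  # simulate opening a backend connection
    return Graph()


class LazyGraph:
    def __init__(self) -> None:
        self._graph: Optional[Graph] = None
        self._lock = asyncio.Lock()

    async def ensure(self) -> Graph:
        if self._graph is not None:  # fast path: no lock once initialised
            return self._graph
        async with self._lock:
            if self._graph is None:  # second check: another task may have won
                self._graph = await _build_graph()
            return self._graph


async def demo() -> int:
    holder = LazyGraph()  # created inside the running event loop
    results = await asyncio.gather(*(holder.ensure() for _ in range(3)))
    assert all(r is results[0] for r in results)
    return Graph.instances
```

Without the lock, all three tasks would pass the `is None` check before the first `await` in `_build_graph()` completes, and each would construct its own instance.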
+### Added — Tier-1 English multi-hop evaluation coverage
+
+`examples/ablation/download_benchmarks.py` pulls HotPotQA-dev
+(distractor), MuSiQue-Ans-dev, and 2WikiMultihopQA-dev from
+HuggingFace and converts them to the BEIR-style JSON that
+`examples/ablation/run_tier1_benchmarks.py` consumes. Every file is
+gitignored and regenerated on demand, so the download is a one-shot
+per clone. Adds an `[eval]` optional extra with `datasets>=3.0`.
+
+Initial numbers on a 500-query subset (embedder-free,
+`engine="evidence"`, 2026-04-17):
+
+| Dataset | Docs | MRR @ 10 | R @ 5 | R @ 10 | Hit @ 10 |
+|---------|-----:|---------:|------:|-------:|---------:|
+| HotPotQA dev (distractor) | 66,635 | **0.784** | 0.585 | 0.658 | 459/500 |
+| MuSiQue-Ans dev | 21,100 | 0.590 | 0.379 | 0.440 | 381/500 |
+| 2WikiMultihopQA dev | 56,687 | **0.795** | 0.501 | 0.552 | 456/500 |
+
+Wall clock on a laptop, total 31 minutes. A full-dataset rerun is
+scheduled for v0.16.1 after the PPR stage's first-hit latency is
+profiled (currently ~1.8 s / query at 66 k docs).
+
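Editorial sketch of the standard definitions behind the table's column headers. The function names are hypothetical, and the exact definitions used by `run_tier1_benchmarks.py` are an assumption: MRR@k as the mean reciprocal rank of the first relevant document, R@k as the mean fraction of gold documents retrieved, and Hit@k as the number of queries with at least one gold document in the top k.

```python
from typing import Sequence, Set


def mrr_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for docs, relevant in zip(ranked, gold):
        for rank, doc_id in enumerate(docs[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked) if ranked else 0.0


def recall_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[Set[str]], k: int = 10) -> float:
    """Mean fraction of each query's gold documents found within the top k."""
    total = 0.0
    for docs, relevant in zip(ranked, gold):
        if relevant:
            total += len(relevant.intersection(docs[:k])) / len(relevant)
    return total / len(ranked) if ranked else 0.0


def hits_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[Set[str]], k: int = 10) -> int:
    """Number of queries with at least one gold document within the top k."""
    return sum(1 for docs, relevant in zip(ranked, gold) if relevant.intersection(docs[:k]))
```

With several gold documents per multi-hop question, recall can sit well below the hit rate, which matches the shape of the numbers in the table.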
+### Deprecated
+
+- `graph.search(engine="legacy")` — emits DeprecationWarning,
+scheduled for removal in v0.17.0.
+- `LLMReranker` / `NoOpReranker` injection on `SynapticGraph`
+(legacy-engine-only).
+- `query_decomposer` kwarg on `SynapticGraph` (legacy-engine-only).
+
+### Changed — query-mode Kiwi lifts FTS-only Korean retrieval quality
+
+`_normalize_korean(text, query_mode=True)` — at search time only —
+drops verb (VV) and adjective (VA) stems and a small set of
+interrogative / copular noise forms that survive POS filtering but
+degrade BM25 ranking on natural-language Korean queries ("설명해주세요",
+"무엇인가요", "어떻게", "대해", …).
+
+Index-time normalisation is **unchanged**. No graph rebuild required.
+Kiwi only fires on queries that are ≥50 % Hangul, so English and
+code-heavy queries are mathematically unchanged.
+
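Editorial sketch of a Hangul-ratio gate like the ≥50 % threshold mentioned in this entry. Both function names are hypothetical, and the denominator (non-whitespace characters) and the character range (precomposed Hangul syllables, U+AC00 to U+D7A3) are assumptions; the library may count differently.

```python
def hangul_ratio(text: str) -> float:
    """Share of non-whitespace characters that are precomposed Hangul syllables."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hangul = sum(1 for c in chars if "\uac00" <= c <= "\ud7a3")
    return hangul / len(chars)


def should_apply_korean_normalisation(query: str, threshold: float = 0.5) -> bool:
    """Gate morphological normalisation on the query being mostly Hangul."""
    return hangul_ratio(query) >= threshold
```

Under these assumptions a query such as "BM25 설명" (two Hangul characters out of six) falls below the threshold and bypasses normalisation entirely.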
+Measured on public FTS-only reproducible benchmarks
+(`examples/ablation/run_ablation.py`, 2026-04-17):
+
+| Dataset | Pre-0.15.1 MRR | v0.15.1 MRR | Δ | Hit @ 10 |
+|---------|----------------|-------------|---|----------|
+| Allganize RAG-ko | 0.621 | **0.743** | +0.122 | 200/200 |
+| Allganize RAG-Eval | 0.615 | **0.695** | +0.080 | 286/300 |
+| PublicHealthQA KO | 0.318 | **0.466** | +0.148 | 65/77 |
+| AutoRAG KO | 0.592 | **0.692** | +0.100 | 114/114 |
+| HotPotQA-24 EN | 0.727 | 0.727 | 0.000 | 24/24 |
+
+Diagnostic that led to the change:
+`examples/ablation/failure_diagnostic.py`. 16 of 20 Allganize RAG-ko
+misses were "generic question form" queries whose topic nouns were
+drowned out by Kiwi-surviving question tails. Applied the same filter
+in `MemoryBackend.search_fts` so benchmark adapters (which all use
+MemoryBackend) pick up the improvement without a SQLite rebuild.
+
+> **Note**: the numbers above are the v0.15.1 delta measured against
+> the v0.15.0 legacy-engine baseline. v0.16.0's engine flip pushes
+> these to 0.947 / 0.911 / 0.546 / 0.906 — see the 0.16.0 entry above.
+
 ## [0.15.0] - 2026-04-15

 ### Added — `graph.search(engine="evidence")` opt-in modern path