@@ -6,6 +6,161 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
66
77## [Unreleased]
88
9+ ## [0.16.0] - 2026-04-17
10+
11+ v0.16.0 is a **foundational cleanup release**. Five changes add
12+ up to a noticeably different product: (1) the retrieval engine
13+ now defaults to the hybrid EvidenceSearch pipeline that the published
14+ benchmarks were measured on, (2) CDC sync replaces its per-row
15+ round-trips with a single batch SELECT, (3) concurrent MCP tool calls
16+ no longer race on first-use initialisation, (4) the query-mode Kiwi
17+ improvement from the v0.15.1 branch is carried forward, and (5) the
18+ evaluation surface grows roughly 31×, from 715 queries across 5
19+ Korean-heavy corpora to 22,400 queries across 8 corpora, including
20+ standard English multi-hop benchmarks (HotPotQA-dev, MuSiQue-Ans,
21+ 2WikiMultihopQA). The net effect on public Korean benchmarks is large
22+ (Allganize RAG-ko MRR 0.621 → 0.947 since v0.15.0); English multi-hop
23+ debuts at HotPotQA-dev MRR 0.784 and 2Wiki-dev MRR 0.795 (500-query
24+ subsets, embedder-free).
25+
26+ ### Changed — `graph.search()` default engine flipped to `"evidence"`
27+
28+ Public API semantics change: `SynapticGraph.search(query, ...)` with
29+ no `engine=` kwarg now uses the :class:`EvidenceSearch` hybrid
30+ pipeline (BM25 + HNSW + PPR + MMR + optional cross-encoder) that
31+ `agent_search` / `knowledge_search` / `deep_search` already used.
32+
33+ Why now. The previous default (`engine="legacy"` → HybridSearch)
34+ was retained through v0.15.x for caller stability, but every
35+ measured benchmark, including the ones quoted in the README and in
36+ [docs/paper/draft.md](docs/paper/draft.md), was run through
37+ EvidenceSearch via the MCP tool path. Leaving SDK callers on the
38+ legacy cascade meant our own quoted numbers didn't match what a
39+ first-time user of `graph.search()` saw. This release closes that
40+ gap.
41+
42+ Effect on public FTS-only benchmarks (MRR; no embedder, no reranker):
43+
44+ | Dataset | v0.15.0 (legacy) | v0.15.1 (legacy + kiwi) | **v0.16.0 (evidence + kiwi)** | Total Δ |
45+ |---------|------------------|-------------------------|-------------------------------|---------|
46+ | Allganize RAG-ko | 0.621 | 0.743 | **0.947** | +0.326 |
47+ | Allganize RAG-Eval | 0.615 | 0.695 | **0.911** | +0.296 |
48+ | PublicHealthQA KO | 0.318 | 0.466 | **0.546** | +0.228 |
49+ | AutoRAG KO | 0.592 | 0.692 | **0.906** | +0.314 |
50+ | HotPotQA-24 EN | 0.727 | 0.727 | **0.875** | +0.148 |
51+
52+ Migration. `engine="legacy"` still works but emits a
53+ :class:`DeprecationWarning`. Legacy-specific features
54+ (`query_decomposer`, `reranker=LLMReranker(...)` injection, and
55+ reinforcement-based ranking boost) are only honoured on the legacy
56+ path and keep their tests under `engine="legacy"`. The legacy engine
57+ is scheduled for removal in v0.17.0.
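The migration behaviour described above can be illustrated with a small, self-contained sketch. The `search` function here is a hypothetical stand-in, not the real `SynapticGraph.search`; only the `engine` kwarg, its two values, and the use of `DeprecationWarning` come from this entry. The helper shows one way a caller's test suite might detect remaining legacy call sites.

```python
import warnings


def search(query: str, engine: str = "evidence") -> str:
    """Hypothetical stand-in for graph.search(); returns the engine it used."""
    if engine == "legacy":
        # Assumed wording; the real warning text is not quoted in this entry.
        warnings.warn(
            'engine="legacy" is deprecated and scheduled for removal in v0.17.0',
            DeprecationWarning,
            stacklevel=2,
        )
    elif engine != "evidence":
        raise ValueError(f"unknown engine: {engine!r}")
    return engine


def uses_deprecated_engine(engine: str) -> bool:
    """Return True if calling search() with this engine emits a DeprecationWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure the warning is not filtered out
        search("example query", engine=engine)
    return any(issubclass(w.category, DeprecationWarning) for w in caught)
```

In a real test suite, running with `python -W error::DeprecationWarning` (a standard interpreter flag) would turn any remaining legacy call into a hard failure.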
58+
59+ ### Changed — streaming invariance sharpens under the new default
60+
61+ `examples/ablation/streaming_experiment.py` was re-run on the Evidence
62+ pipeline (Allganize RAG-ko, 200 docs / 200 queries, 10 random
63+ streaming batches vs. one batch):
64+
65+ | Metric | v0.15.x (legacy engine) | **v0.16.0 (evidence engine)** |
66+ |--------|-------------------------|-------------------------------|
67+ | Bit-wise identical top-10 | 103 / 200 (51.5 %) | **192 / 200 (96.0 %)** |
68+ | Top-1 identical | 109 / 200 (54.5 %) | **200 / 200 (100 %)** |
69+ | \|Δ MRR\| | 0.0100 | **0.0000** |
70+ | Set-equal top-10 | 197 / 200 (98.5 %) | 197 / 200 (98.5 %) |
71+
72+ The streaming-invariance theorem in [docs/paper/theorem.md](docs/paper/theorem.md)
73+ holds more tightly on the new default.
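To make the three agreement rows in the table concrete, here is a minimal sketch of how such counts could be computed from two sets of rankings. It assumes "bit-wise identical" means the same document IDs in the same order; the function name and input shapes are illustrative and are not taken from `streaming_experiment.py`.

```python
def compare_rankings(streamed, batch, k=10):
    """Count queries whose streamed and one-batch rankings agree.

    streamed, batch: {query_id: [doc_id, ...]} ranked best-first.
    """
    counts = {"identical_topk": 0, "identical_top1": 0, "set_equal_topk": 0}
    for qid, batch_ids in batch.items():
        s, b = streamed[qid][:k], batch_ids[:k]
        counts["identical_topk"] += int(s == b)            # same IDs, same order
        counts["identical_top1"] += int(s[:1] == b[:1])    # same best hit
        counts["set_equal_topk"] += int(set(s) == set(b))  # same IDs, any order
    return counts
```

Under this reading, set-equal is the weakest condition, which is consistent with the set-equal count (197) being at least as large as the identical top-10 count in both columns.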
74+
75+ ### Fixed — CDC sync: N+1 round-trips → one batch SELECT
76+
77+ Both `HashTableSyncer.sync_table()` and
78+ `TimestampTableSyncer.sync_table()`
79+ (`src/synaptic/extensions/cdc/sync.py`) previously issued 3 × N
80+ awaits per table (one `get_row_hash` + `get_node_id` + `get_fk_edges`
81+ call per row). They now issue a single
82+ `SyncStateStore.get_pk_index_batch(...)` call that returns
83+ `{pk: (node_id, row_hash, fk_json)}` for every changed PK in one
84+ SQLite round-trip (chunked to 500 PKs per statement defensively).
85+
86+ Impact. For a source table with N changed rows on a 1 ms local
87+ SQLite round-trip, lookup latency drops from ~3N ms to ~1 ms per
88+ 500-PK chunk: a 100-row change sync moves from ~300 ms of sequential
89+ awaits to a single query, and a 10 k-row sync from 30 s to <100 ms.
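The batching pattern can be sketched with the standard-library `sqlite3` module. This is a synchronous illustration only: the table name `sync_state`, its columns, and the function signature are assumptions, and the real `SyncStateStore.get_pk_index_batch(...)` is awaited and may differ in every detail except the returned shape and the 500-PK chunk size.

```python
import sqlite3

CHUNK_SIZE = 500  # PKs per statement, matching the chunk size quoted above


def get_pk_index_batch(conn, table_name, pks):
    """Return {pk: (node_id, row_hash, fk_json)} using one SELECT per chunk."""
    result = {}
    for start in range(0, len(pks), CHUNK_SIZE):
        chunk = pks[start:start + CHUNK_SIZE]
        placeholders = ",".join("?" * len(chunk))
        # Hypothetical schema: one row of sync state per (table_name, pk).
        rows = conn.execute(
            "SELECT pk, node_id, row_hash, fk_json FROM sync_state "
            f"WHERE table_name = ? AND pk IN ({placeholders})",
            [table_name, *chunk],
        )
        for pk, node_id, row_hash, fk_json in rows:
            result[pk] = (node_id, row_hash, fk_json)
    return result
```

PKs with no stored state are simply absent from the result, so a caller can treat a missing key as a new row. One plausible reason for chunking is that SQLite caps the number of bound parameters per statement, and 500 stays well under the historical default.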
90+
91+ ### Fixed — concurrent `_ensure_graph()` no longer races
92+
93+ `src/synaptic/mcp/server.py`: lazy graph initialisation is now
94+ serialised through an `asyncio.Lock` with a double-checked fast path.
95+ Previously, two tool invocations firing on the same first turn could
96+ both see `_graph is None` and construct two SynapticGraph instances,
97+ leaking a backend connection. The fast path (graph already set) still
98+ requires no lock.
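A minimal sketch of double-checked lazy initialisation with `asyncio.Lock` follows. The class and the `_build_graph` helper are hypothetical stand-ins (the real code lives in `src/synaptic/mcp/server.py` and is not reproduced here); the counter exists only to show that construction happens once.

```python
import asyncio


class LazyGraph:
    """Illustrative holder for a lazily built, shared graph object."""

    def __init__(self):
        self._graph = None
        self._lock = asyncio.Lock()
        self.init_count = 0  # demonstration only

    async def _build_graph(self):
        # Stand-in for the expensive SynapticGraph construction.
        await asyncio.sleep(0.01)
        self.init_count += 1
        return object()

    async def ensure_graph(self):
        if self._graph is not None:  # fast path: no lock once initialised
            return self._graph
        async with self._lock:
            if self._graph is None:  # re-check: another task may have won
                self._graph = await self._build_graph()
        return self._graph


async def main():
    holder = LazyGraph()
    # Five concurrent first-use calls, as when several tools fire in one turn.
    graphs = await asyncio.gather(*(holder.ensure_graph() for _ in range(5)))
    return all(g is graphs[0] for g in graphs), holder.init_count
```

Without the second check inside the lock, every task that was waiting on the lock would rebuild the graph after acquiring it, which is the duplicate construction this entry describes.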
99+
100+ ### Added — Tier-1 English multi-hop evaluation coverage
101+
102+ `examples/ablation/download_benchmarks.py` pulls HotPotQA-dev
103+ (distractor), MuSiQue-Ans-dev, and 2WikiMultihopQA-dev from
104+ Hugging Face and converts them to the BEIR-style JSON that
105+ `examples/ablation/run_tier1_benchmarks.py` consumes. Every file is
106+ gitignored and regenerated on demand, so the download is a one-shot
107+ per clone. Also adds an `[eval]` optional extra with `datasets>=3.0`.
108+
109+ Initial numbers on a 500-query subset (embedder-free,
110+ `engine="evidence"`, 2026-04-17):
111+
112+ | Dataset | Docs | MRR@10 | R@5 | R@10 | Hit@10 |
113+ |---------|-----:|-------:|----:|-----:|-------:|
114+ | HotPotQA dev (distractor) | 66,635 | **0.784** | 0.585 | 0.658 | 459/500 |
115+ | MuSiQue-Ans dev | 21,100 | 0.590 | 0.379 | 0.440 | 381/500 |
116+ | 2WikiMultihopQA dev | 56,687 | **0.795** | 0.501 | 0.552 | 456/500 |
117+
118+ Total wall-clock time on a laptop was 31 minutes. A full-dataset rerun
119+ is scheduled for v0.16.1 after the PPR stage's first-hit latency is
120+ profiled (currently ~1.8 s / query at 66 k docs).
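For readers unfamiliar with the column headers in the table above, the sketch below computes MRR@k, R@k, and Hit@k using their standard definitions. It assumes those definitions match what `run_tier1_benchmarks.py` reports, which this entry does not state; the function name and input shapes are illustrative.

```python
def evaluate(rankings, relevant, k=10):
    """Compute mean MRR@k, Recall@k and Hit@k over all queries.

    rankings: {query_id: [doc_id, ...]} ranked best-first.
    relevant: {query_id: set of gold doc_ids} (non-empty sets).
    """
    mrr = recall = hits = 0.0
    for qid, gold in relevant.items():
        top = rankings.get(qid, [])[:k]
        # Rank (1-based) of the first gold document in the top k, if any.
        first = next((i for i, d in enumerate(top, start=1) if d in gold), None)
        if first is not None:
            mrr += 1.0 / first
            hits += 1
        recall += len(gold.intersection(top)) / len(gold)
    n = len(relevant)
    return {"mrr": mrr / n, "recall": recall / n, "hit": hits / n}
```

Multi-hop queries have several gold documents each, so a query can count as a hit while contributing only partial recall. That is why Hit@10 (459/500 on HotPotQA) can sit well above R@10 (0.658).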
121+
122+ ### Deprecated
123+
124+ - `graph.search(engine="legacy")`: emits DeprecationWarning,
125+ scheduled for removal in v0.17.0.
126+ - `LLMReranker` / `NoOpReranker` injection on `SynapticGraph`
127+ (legacy-engine-only).
128+ - `query_decomposer` kwarg on `SynapticGraph` (legacy-engine-only).
129+
130+ ### Changed — query-mode Kiwi lifts FTS-only Korean retrieval quality
131+
132+ `_normalize_korean(text, query_mode=True)`, applied at search time
133+ only, drops verb (VV) and adjective (VA) stems and a small set of
134+ interrogative / copular noise forms that survive POS filtering but
135+ degrade BM25 ranking on natural-language Korean queries ("설명해주세요",
136+ "무엇인가요", "어떻게", "대해", …).
137+
138+ Index-time normalisation is **unchanged**. No graph rebuild required.
139+ Kiwi only fires on queries that are ≥50 % Hangul, so English and
140+ code-heavy queries are processed exactly as before.
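The ≥50 % Hangul gate can be sketched as below. How the project actually counts characters (for example, whether whitespace, digits, or punctuation are in the denominator) is not stated in this entry, so the choices here are assumptions, as are both function names.

```python
def hangul_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Hangul syllables or jamo."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hangul = sum(
        1
        for c in chars
        if "\uac00" <= c <= "\ud7a3"   # precomposed Hangul syllables
        or "\u1100" <= c <= "\u11ff"   # Hangul Jamo
        or "\u3130" <= c <= "\u318f"   # Hangul Compatibility Jamo
    )
    return hangul / len(chars)


def should_apply_query_mode(query: str, threshold: float = 0.5) -> bool:
    """Gate Korean query-mode normalisation on the share of Hangul characters."""
    return hangul_ratio(query) >= threshold
```

With this gate, a mixed query such as "RAG 설명해주세요" (6 of 9 non-space characters are Hangul) would be normalised, while a pure-ASCII query would bypass Kiwi entirely.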
141+
142+ Measured on public FTS-only reproducible benchmarks
143+ (`examples/ablation/run_ablation.py`, 2026-04-17):
144+
145+ | Dataset | Pre-0.15.1 MRR | v0.15.1 MRR | Δ | Hit@10 |
146+ |---------|----------------|-------------|---|--------|
147+ | Allganize RAG-ko | 0.621 | **0.743** | +0.122 | 200/200 |
148+ | Allganize RAG-Eval | 0.615 | **0.695** | +0.080 | 286/300 |
149+ | PublicHealthQA KO | 0.318 | **0.466** | +0.148 | 65/77 |
150+ | AutoRAG KO | 0.592 | **0.692** | +0.100 | 114/114 |
151+ | HotPotQA-24 EN | 0.727 | 0.727 | 0.000 | 24/24 |
152+
153+ Diagnostic that led to the change:
154+ `examples/ablation/failure_diagnostic.py`. 16 of 20 Allganize RAG-ko
155+ misses were "generic question form" queries whose topic nouns were
156+ drowned out by Kiwi-surviving question tails. The same filter is applied
157+ in `MemoryBackend.search_fts` so benchmark adapters (which all use
158+ MemoryBackend) pick up the improvement without a SQLite rebuild.
159+
160+ > **Note**: the numbers above are the v0.15.1 delta measured against
161+ > the v0.15.0 legacy-engine baseline. v0.16.0's engine flip pushes
162+ > these to 0.947 / 0.911 / 0.546 / 0.906 (see the 0.16.0 entry above).
163+
9164## [0.15.0] - 2026-04-15
10165
11166### Added — `graph.search(engine="evidence")` opt-in modern path