docs(task-61): DB adapter compat spec v1 — vector + graph cross-impl audit by earayu · Pull Request #1928 · apecloud/ApeRAG

earayu · 2026-04-30T04:53:17Z

Summary

Architect draft of the task #61 DB adapter compatibility audit spec v1, per @earayu2 directive (msg=8b989470 / msg=2bad8e75 / msg=f26b703e) + PM @不穷 task #72 dispatch.

Key streaming evidence (8 lanes surfaced in parallel)

huangheng msg=ed2f2973: 3 vector P0 candidates
Bryce msg=8e895471 task #69: 11 findings (4 P0 / 3 P1 / 4 P2)
冬柏 msg=3e93bb64 task #67: 3 Protocol method gaps (bulk_upsert P0)
chenyexuan msg=f298011e + PR #1926 (ci(compat): trigger Backend-Compat-Test on real graph_storage path): workflow paths filter P0-W1
cuiwenbo msg=dfebf706 task #70: 3 FE/UX candidates
Planetegg task #65: alias gather P2 + Singapore multitenant=True
ziang task #64 graph store + dongdong task #71 deploy, pending fold-in

P0 must-fix list (4 vector + 1 graph + 1 workflow)

P0-V1 Qdrant legacy mode cross-tenant leak (Singapore not in legacy, no hot-fix)
P0-V2 filter translation silent return None (vs PGVector fail-loud)
P0-V3+V4 score normalization: interpretation direction is inconsistent across distance metrics
P0-G1 bulk_upsert_entity_with_lineage_parts has 0 cross-backend tests
P0-W1 compat-test.yml paths filter dead reference (PR #1926 in flight)
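P0-V2 in the list above contrasts a translator that silently returns None with one that fails loud. A minimal sketch of the difference, assuming a toy filter shape (a plain dict) rather than the real adapter types; the exception name follows the one described in the PR #1930 commit message, while both function names are hypothetical:

```python
import logging

logger = logging.getLogger(__name__)


class UnsupportedFilterError(TypeError):
    """Raised when a filter cannot be translated for the target backend.

    Subclassing TypeError keeps existing `except TypeError` callers working.
    """


def translate_filter_silent(flt):
    """Anti-pattern: an untranslatable filter is logged and dropped."""
    if flt is not None and not isinstance(flt, dict):
        logger.warning("unsupported filter %r, ignoring", flt)
        return None  # the caller now runs an unfiltered search
    return flt


def translate_filter_fail_loud(flt):
    """Fail-loud variant: None means 'no filter'; anything unknown raises."""
    if flt is None:
        return None
    if not isinstance(flt, dict):
        raise UnsupportedFilterError(
            f"cannot translate filter of type {type(flt).__name__}"
        )
    return flt
```

With the silent variant a caller cannot distinguish "no filter requested" from "filter dropped", which is what turns a translation gap into an unfiltered scan.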

P1 allowed differences, declared explicitly

P1-V1 collection init failure behavior diverges
P1-V2 batch upsert atomicity (atomic vs best-effort)
P1-V3 filter Or semantics (risk that a Qdrant should-only filter matches the full set)
P1-G1 remove_relation_lineage_member dual-side rewrite gap
P1-G2 list_entities pagination/sort stability
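For P1-G2, one common way to make pagination stable across backends is to sort on a total order (sort key plus a unique tiebreaker) and page by cursor instead of offset. The sketch below is illustrative only: the entity shape, the `name`/`id` fields, and the function name are assumptions, not the actual graph store Protocol.

```python
def list_entities_page(entities, page_size, after=None):
    """Return (page, next_cursor) using a (name, id) keyset cursor.

    Sorting by name alone leaves ties in backend-dependent order;
    adding the unique id makes the order total, so pages never
    overlap or skip rows.
    """
    def key(entity):
        return (entity["name"], entity["id"])

    ordered = sorted(entities, key=key)
    if after is not None:
        ordered = [e for e in ordered if key(e) > after]
    page = ordered[:page_size]
    has_more = len(ordered) > page_size
    return page, (key(page[-1]) if has_more else None)
```

A caller loops until `next_cursor` is None, passing the previous cursor as `after` on each call.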

Sub-task dispatch (Phase A already dispatched by PM)

#64 ziang (graph) / #65 Planetegg (SRE) / #66 Weston (contract matrix) / #67 冬柏 (testing) / #69 Bryce (vector) / #70 cuiwenbo (FE) / #71 dongdong (deploy) / #72 me (spec)

CR mandatory checklist

Follows the task-17-cr-review-checklist.md framework + huangheng PR #1916 + #1924 sediment family + chenyexuan PR #1922 Lesson #15 + new Lesson #16 candidate (workflow paths dead reference).
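The Lesson #16 candidate (a workflow `paths` filter pointing at a directory that no longer exists) is the kind of check that can be automated. A rough sketch, assuming the filter patterns have already been read out of the workflow file; the function name and example paths are hypothetical, and `pathlib` globbing only approximates GitHub's pattern syntax:

```python
from pathlib import Path


def dead_path_filters(repo_root, patterns):
    """Return the patterns that match nothing under repo_root.

    Negated patterns ("!...") are skipped because matching nothing
    is harmless for an exclusion.
    """
    root = Path(repo_root)
    dead = []
    for pattern in patterns:
        if pattern.startswith("!"):
            continue
        if next(root.glob(pattern), None) is None:
            dead.append(pattern)
    return dead
```

Running this in CI against the checked-out tree would flag a filter such as `pkg/graphdb/**` once the directory has been renamed, instead of letting the workflow silently stop triggering.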

Changes

1 new file / 237 lines: docs/zh-CN/architecture/task-61-db-adapter-compat-spec-v1.md

Test plan

git diff --check passed
cross-link paths correct (task-30 / task-32 / task-system-invariants / cr-checklist)
grep evidence file:line citations complete
team review (each lane owner checks the accuracy of their own lane's evidence integration)
after earayu2 ratifies the spec frame, PM schedules Phase B using the per-P0 three-PR pattern

Not blocking the mainline

PR #1925 (task #30 B3) / PR #1926 (compat-test paths) / Singapore 2pm release / task #31 / task #33 P3.

🤖 Generated with Claude Code

…audit

Architect spec v1 draft per earayu2 directive (msg=8b989470 / msg=2bad8e75 / msg=f26b703e) + PM 不穷 task #72 dispatch.

Streaming evidence integration from 8 lanes:
- huangheng msg=ed2f2973: 3 vector P0 candidates (cross-tenant / filter silent / collection init)
- Bryce msg=8e895471 task #69: 11 vector findings (4 P0 + 3 P1 + 4 P2, including upgraded score normalization P0-V3/V4)
- 冬柏 msg=3e93bb64 task #67: 3 missing Protocol method tests (bulk_upsert_entity_with_lineage_parts P0 + remove_relation_lineage P1 + list_entities P1)
- chenyexuan msg=f298011e + PR #1926: workflow paths filter dead reference P0-W1 (in flight)
- cuiwenbo msg=dfebf706 task #70: FE/UX 3 candidates (score, viz error vs empty, confidence_score)
- Planetegg msg=db7fb085 + msg=41906f4 + msg=41665d7e task #65: alias resolution gather P2-S1 + Singapore QDRANT_MULTITENANT=True (no hot-fix needed) + env shape verify
- ziang task #64 graph store audit (in_progress, will fold-in)
- dongdong task #71 deploy/typed schema (in_progress, will fold-in)

Spec structure:
- §1 inventory by lane with file:line evidence
- §2 gaps by severity (P0 CRITICAL hot-fix candidate / P0 must-fix / P1 allowed differences to declare / P2 performance optimization / YAGNI)
- §3 three-layer design direction per Weston msg=85e527e3 framework
- §4 sub-task dispatch (Phase A 8 lane parallel + Phase B per-P0 three-PR-pattern + Phase C P2 + Phase D PR #1926 unblock)
- §5 acceptance: P0/P1 standards + boundary test gate + e2e + sample-limitation disclaimer
- §6 CR mandatory checklist citing Lesson #11-#16 family from PR #1916/#1924/#1922 sediment + new Lesson #16 candidate (workflow paths dead reference)

Sample limitation: spec evidence comes from the streaming surface, not from the gap list collected by huangzhangshu; fix-forward amend after the huangzhangshu lane completes and the Bryce/ziang audit slices are delivered.

Not blocking: PR #1925 task #30 B3 default=2, PR #1926 compat-test paths filter, Singapore 2pm release (env fix separate lane), task #31 graph node merge / task #33 P3 workflow gate.

Weston msg=13dd5e91 BLOCKER (score normalization severity drift): keep P0-V3+V4 at P0 across §1.1 / §2.2 / §5.3. Score direction is a hard semantic contract for callers and must not appear reversed between PGVector and Qdrant. §2.2 adds an explicit P0-V3+V4 row; §5.3 adds the test_score_normalization_in_vector.py boundary test (parametrized over all 6 cells, metric × adapter).

Streaming integrations (5 lanes):
1. Bryce msg=23a2f514: P0-V1 re-classified from first principles. Qdrant legacy mode tenant isolation is at the collection name level, not the query filter level (verified at qdrant_connector.py:442-446), so it is demoted to P1-V4 defense-in-depth (candidate for a legacy mode deprecation follow-up).
2. Bryce msg=8e895471: 11 vector findings, 4 P0 (cross-tenant demoted / filter silent / score V3+V4) + 3 P1 (collection init / batch atomicity / filter Or semantics) + 4 P2.
3. dongdong msg=4201465a + PR #1929 + cuiwenbo msg=bcec38ad: P0-D1 Helm worker Neo4j env missing (Singapore graph viz root cause); P1-D1 e2e shape matrix gap; P1-D2 Nebula is not first-class in Helm; P1-D3 typed schema lacks vector backend exposure.
4. chenyexuan NIT: Lesson #16 candidate citation added to §6.
5. Planetegg msg=eb9de4b0 NIT: P2-S1 quantified as max_nodes*2 default 1000→2000 / hybrid default 1000 max 5000; msg ID corrections in §7 (msg=41665d7e Singapore multitenant verify, msg=eb9de4b0 P2-S1 quantification, dropped invalid msg=ec358a3e).

冬柏 PR #1927 commit b2234ae folded into §5.3 (38 cases incl. zero-side-effect + replay idempotency post-NIT).

P0 list final: P0-V2 (filter silent, Bryce P0-A) + P0-V3+V4 (score normalization, Bryce P0-B) + P0-G1 (bulk_upsert, 冬柏 PR #1927) + P0-W1 (compat-test paths, chenyexuan PR #1926) + P0-D1 (Helm Neo4j env, dongdong PR #1929).
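The 6-cell boundary test described above (metric × adapter) can be pictured as a small grid runner. This is only a sketch: `check_score_contract`, `run_grid`, and the two fake search functions are hypothetical stand-ins, and a real suite would more likely use `pytest.mark.parametrize` against live adapters.

```python
from itertools import product

ADAPTERS = ("pgvector", "qdrant")
METRICS = ("cosine", "euclid", "dot")
CELLS = list(product(ADAPTERS, METRICS))  # the 6 cells of the grid


def check_score_contract(search, adapter, metric):
    """search() returns normalized scores for a near, mid and far point."""
    near, mid, far = search(adapter, metric)
    for score in (near, mid, far):
        if not 0.0 <= score <= 1.0:
            raise AssertionError(f"score {score} outside [0, 1]")
    if not near > mid > far:
        raise AssertionError("score direction is not higher = better")


def run_grid(search):
    """Run the contract check on every cell and collect failing cells."""
    failures = []
    for adapter, metric in CELLS:
        try:
            check_score_contract(search, adapter, metric)
        except AssertionError as exc:
            failures.append((adapter, metric, str(exc)))
    return failures


def fake_search_ok(adapter, metric):
    # Stand-in for an adapter that honours the contract in every cell.
    return (0.9, 0.5, 0.1)


def fake_search_reversed_euclid(adapter, metric):
    # Stand-in for the drift the BLOCKER guards against: one cell reversed.
    if (adapter, metric) == ("qdrant", "euclid"):
        return (0.1, 0.5, 0.9)
    return (0.9, 0.5, 0.1)
```

Running every cell rather than a single representative is what lets a one-cell direction reversal show up as a named failure.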

@huangheng

…on (task #61 P0-A + P0-B) (#1930)

* feat(vectorstore): cross-adapter filter fail-loud + score normalization (task #61 P0-A + P0-B)

Closes the two task #61 vector-adapter contract gaps PM @不穷 dispatched to me (msg=a387a81e) and architect @符炫炜 ratified (msg=7646eb4f), collapsed onto a single PR per Weston's contract-matrix scope (msg=8beffab5).

P0-A: filter fail-loud
-----------------------

* Add ``UnsupportedFilterError`` to ``aperag.vectorstore.base`` as a cross-adapter exception type. Subclasses ``TypeError`` so existing ``except TypeError`` callers (pgvector translator pre-this-PR) keep working unchanged.
* Qdrant ``_normalize_filter_input`` now raises instead of logging a warning + ``return None``. The previous behaviour silently dropped the filter and degraded the search into a tenant-wide unfiltered scan: a correctness footgun, not graceful degradation.
* Pgvector ``_SqlFilter._walk`` re-types its raise to the same exception so both backends fail the same way on the same input.

P0-B: score normalization onto [0, 1] with higher = better
-----------------------------------------------------------

* Add ``normalize_score(metric, raw)`` and inverse ``denormalize_threshold_to_native(metric, normalized)`` to ``aperag.vectorstore.base``. Cosine clamps to [0, 1]; euclid maps ``-L2`` via ``1/(1+L2)`` onto (0, 1]; dot uses a numerically-stable sigmoid onto (0, 1). All three transforms are monotone so top-k ordering is preserved versus the raw form.
* Both adapters apply ``normalize_score`` before constructing ``SearchHit`` and use ``denormalize_threshold_to_native`` to push ``QueryRequest.score_threshold`` down to the native query (SQL ``WHERE score >= …`` / Qdrant ``score_threshold=``) so the server-side cutoff is exactly equivalent to a Python post-filter on the normalized score. A belt-and-braces post-filter catches any inverse-roundoff drift so the [0, 1] contract holds exactly.
* ``SearchHit.__post_init__`` now validates ``0.0 <= score <= 1.0`` so any future direct-build path that bypassed an adapter's normalization surfaces at the DTO boundary instead of polluting downstream score-threshold logic.
* ``base.VectorStoreConnector`` docstring + ``search()`` contract updated to spell out the §5/§6 invariants.

Tests
-----

* New ``tests/unit_test/vectorstore/test_score_normalization.py``: range invariants per metric, ordering preservation, denormalize→normalize roundtrip on (0, 1), endpoint behaviour (-inf / +inf clamps for pushdown), and ``UnsupportedFilterError isinstance TypeError``.
* Existing translator unit tests updated to assert the cross-adapter exception type while still asserting ``TypeError`` for back-compat.
* ``tests/integration/compat/test_vector_compat.py`` adds three new cross-backend cases (filter fail-loud, score in [0, 1], threshold direction, top-k ranking monotone) so the contract is pinned across PGVector × Qdrant under compat-test, not just per-adapter.

Per spec PR #1928 § 2.2 / § 5.3, follow-up boundary test sub-PR by @huangheng will extend the parametrize fixture to cover the full PGVector × Qdrant × {cosine, euclid, dot} 6-cell grid; this PR ships the cosine cell (the only metric currently exercised by the compat fixture) plus the per-metric unit tests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(vectorstore): annotate cosine-tuned default score thresholds (huangheng NIT 1)

huangheng PR #1930 line-level CR (msg=5eb7315c) NIT 1 fold-in: caller chain audit surfaced that all three in-tree default thresholds (``DEFAULT_VECTOR_SCORE_THRESHOLD = 0.72`` × 2 + retrieval ``score_threshold = 0.5``) were tuned on cosine-distance embeddings. After P0-B normalization the [0, 1] number is directly comparable across adapters but the *intent* is still cosine-grade strictness, so collections that pick ``euclid`` or ``dot`` distance may want to override.

This commit only adds explanatory docstrings; no behaviour change. The metric-aware default refactor (Lesson #12 v7.3 cross-PR default value alignment family) stays as a follow-up sub-PR per huangheng's non-blocker NIT framing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(vectorstore): negate Qdrant euclid raw at adapter boundary (Weston BLOCKER)

Weston msg=86e05a8e caught a real bug in PR #1930's P0-B implementation: ``normalize_score("euclid", raw)`` assumes the canonical "negative L2, higher=better" raw form (which pgvector's ``_score_expr = -(<->)`` produces directly), but Qdrant returns positive L2 distance natively (smaller=better). Result: every Qdrant euclid hit was clamped to L2=0 → score=1.0, and a tight ``score_threshold=0.9`` returned an empty list because the inverse threshold was a negative number that Qdrant re-interpreted as a positive-L2 *upper* bound (vacuous).

Per architect msg=06902347 + huangheng msg=99b52499, fix-forward Option A: keep the shared ``normalize_score`` / ``denormalize_threshold_to_native`` helpers' contract (input is canonical "higher-is-better raw", output is [0, 1]) and convert at the Qdrant adapter boundary for the asymmetric metric. Cosine + dot agree on convention across both backends so they need no boundary work; only euclid is asymmetric.

Changes
-------

* ``aperag/vectorstore/qdrant_connector.py``:
  * ``search()`` now negates ``p.score`` before calling ``normalize_score`` when the metric is euclid.
  * Threshold pushdown: when the metric is euclid, the helper-returned "negative L2" gets flipped back to a "positive L2 upper bound" before passing to Qdrant's native ``score_threshold``. Pre-existing ``+inf`` (return empty) / ``-inf`` (omit threshold) edge cases stay intact.
* ``aperag/vectorstore/base.py``: docstring for the score-normalization block now documents the canonical "higher-is-better raw" convention the helpers operate on, calls out the Qdrant euclid asymmetry explicitly, and pins the responsibility on adapters (math-only helper, adapters do raw → canonical conversion).

Tests (Weston requested cross-metric Qdrant-native verify)
----------------------------------------------------------

``tests/unit_test/vectorstore/test_score_normalization.py`` adds four end-to-end Qdrant ``:memory:`` regressions:

* ``test_qdrant_euclid_normalized_scores_strictly_decreasing_with_distance``: pins Weston's exact failure mode; near/mid/far must produce strictly decreasing normalized scores.
* ``test_qdrant_euclid_score_threshold_filters_far_keeps_near``: pins the threshold-pushdown direction; ``score_threshold=0.9`` must keep the L2=0 near point and drop the L2=3 far point.
* ``test_qdrant_dot_normalized_scores_strictly_increasing_with_inner_product``: explicit pin that dot is *not* asymmetric and a future refactor must not negate it accidentally.
* ``test_qdrant_cosine_normalized_scores_strictly_increasing_with_similarity``: completes the per-metric Qdrant pin so all three native conventions are documented next to each other.

Local ``uv run pytest tests/unit_test/vectorstore/`` → 146 passed, 10 skipped, 1 warning. Existing PGVector + cosine compat tests unchanged.

Sediment fold-in candidates per huangheng msg=99b52499:

* Lesson #12 v9 second-application demo (Weston msg=86e05a8e + Bryce msg=23a2f514, double-source): first-principles verify catches surface-signal mistakes
* Lesson #12 v7 extension candidate: external API contract verify (Qdrant ``p.score`` raw convention vs in-tree docstring assumption)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
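The score transforms and the Qdrant euclid boundary fix described in this commit message can be sketched as follows. This is a reconstruction from the prose, not the code in ``aperag.vectorstore.base``: the two helper names come from the commit message, while ``qdrant_raw_to_canonical`` and ``canonical_threshold_to_qdrant`` are hypothetical names for the adapter-side conversion, and the edge-case handling is an assumption.

```python
import math

METRICS = ("cosine", "euclid", "dot")


def normalize_score(metric: str, raw: float) -> float:
    """Map a canonical (higher-is-better) raw score onto [0, 1]."""
    if metric == "cosine":
        return min(1.0, max(0.0, raw))
    if metric == "euclid":
        l2 = max(0.0, -raw)  # canonical raw is negative L2
        return 1.0 / (1.0 + l2)
    if metric == "dot":
        # Numerically stable sigmoid: never exponentiate a large positive.
        if raw >= 0:
            return 1.0 / (1.0 + math.exp(-raw))
        e = math.exp(raw)
        return e / (1.0 + e)
    raise ValueError(f"unknown metric: {metric}")


def denormalize_threshold_to_native(metric: str, normalized: float) -> float:
    """Inverse of normalize_score, used to push a threshold down."""
    if metric not in METRICS:
        raise ValueError(f"unknown metric: {metric}")
    if normalized <= 0.0:
        return -math.inf  # no cutoff: the caller omits the threshold
    if normalized > 1.0:
        return math.inf   # unreachable: the caller returns no hits
    if metric == "cosine":
        return normalized
    if metric == "euclid":
        return -(1.0 / normalized - 1.0)
    if normalized == 1.0:  # dot: logit diverges at 1
        return math.inf
    return math.log(normalized / (1.0 - normalized))


def qdrant_raw_to_canonical(metric: str, native: float) -> float:
    """Qdrant reports positive L2 for euclid; negate it to canonical form."""
    return -native if metric == "euclid" else native


def canonical_threshold_to_qdrant(metric: str, canonical: float) -> float:
    """Flip a finite canonical euclid threshold into a positive L2 bound.

    Infinite values pass through unchanged; the caller handles them.
    """
    if metric == "euclid" and math.isfinite(canonical):
        return -canonical
    return canonical
```

In this sketch, feeding a positive L2 of 3.0 straight into ``normalize_score("euclid", ...)`` clamps to 1.0, which reproduces the failure mode reported above; negating first yields 0.25. A ``score_threshold`` of 0.9 becomes an L2 upper bound of about 0.111, which keeps the L2=0 point and drops the L2=3 point.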

earayu added 4 commits April 30, 2026 12:52

docs(task-61): § 3.1.1 historical residue cleanup per Weston msg=fdf0…

52d42cb

…4a69 NIT: strike old P0 hot-fix path (P0-V1 already demoted to P1-V4 per Bryce first-principles verify)

docs(task-61): final consistency cleanup per Weston msg=e414d3cf — li…

888896b

…ne 14 count 4+3+4 to 3 P0 + 4 P1 + 4 P2; § 5.1 P0-V1 line removed; § 5.2 P1-V4 defense-in-depth boundary test added

earayu mentioned this pull request Apr 30, 2026

feat(vectorstore): cross-adapter filter fail-loud + score normalization (task #61 P0-A + P0-B) #1930

Merged

6 tasks

earayu merged commit ed8def2 into main Apr 30, 2026
10 checks passed

earayu deleted the architect/task-61-db-compat-spec-v1 branch April 30, 2026 05:12

earayu mentioned this pull request Apr 30, 2026

docs(cr-checklist): task #61 + task #30 B3 close-out sediment fold-in #1932

Merged

3 tasks

earayu merged 4 commits into main from architect/task-61-db-compat-spec-v1
