feat(Wave 7 #6): alias_map + transparent redirect + LineageEntityMerger #1758
…t + LineageEntityMerger

§K.12.6 / §K.12.7 / §K.12.10b task #6: full storage + service body for user-driven entity merge over the lineage graph. Per architect ratify msg=cf860ae4 + huangheng endorse msg=22816e0d (5-drift lock) + sentinel pick msg=22816e0d (`__curation_merge__`).

What ships
----------

1. **`aperag_lineage_entity_alias` table** (alembic ``b5d2e8f1c9a4``): composite PK ``(collection_id, alias_name)``, ``canonical_name`` index for reverse lookup.
2. **`LineageEntityAlias` ORM** (`aperag/domains/knowledge_graph/db/models.py`): re-uses the curation domain's existing ``Base``.
3. **`AliasMapRepository`** (`aperag/graph_curation/alias_map.py`):
   - `resolve_canonical(collection_id, name)`: single-indirection read (transitive flatten keeps the table 1-deep at write time).
   - `upsert_alias(...)`: cycle reject (`AliasCycleError`) + transitive flatten (UPDATE + INSERT in one transaction).
   - `list_aliases_pointing_at(...)` for tests / admin tooling.
   - `purge_collection(...)` for collection teardown.
4. **`LineageGraphStoreWithAliasRedirect` decorator** (`aperag/indexing/alias_redirect_store.py`): wraps any ``LineageGraphStore`` + ``AliasMapRepository``, intercepts ``upsert_entity_with_lineage`` / ``upsert_relation_with_lineage`` to rewrite entity names through the alias map, and forwards every other Protocol method byte-for-byte. Per huangheng CR lock (Option (b), msg=93d9add1 / msg=22816e0d): zero changes to the three backend store implementations.
5. **`LineageEntityMerger`** (`aperag/graph_curation/lineage_merge.py`): orchestrator for user-driven merge. Step ordering locked (invariant #2): alias upserts → L1 source-parts re-anchored preserving doc lineage → L1 final unified+compacted with ``__curation_merge__`` sentinel → vector upsert (3-field payload, ``uuid5`` deterministic id) → source delete (L1 + vector) last.
6. **24 unit tests**:
   - `test_alias_map.py` (10): cycle reject self-loop + chain cycle, transitive flatten, target flatten through chain, alias persists after canonical GC, per-collection isolation, purge.
   - `test_alias_redirect_store.py` (5): indexer write redirects to canonical, no-alias passthrough, both-endpoint relation redirect, single-endpoint relation redirect, decorator passthrough invariant for all 13 non-upsert Protocol methods (huangheng CR lock).
   - `test_lineage_merge.py` (9): empty source short-circuit, step order L1 → vector → delete, sentinel ``__curation_merge__``, Compactor kwargs locked (subject_kind/subject_label/language), source parts re-anchored preserving per-doc lineage, vector payload 3-field + uuid5 deterministic, alias cycle propagation, target alias chain flatten, target GC tolerance.

§K.12 invariant cross-check
---------------------------

- #1 L1 not polluted: source parts re-anchored under target preserve original `(document_id, parse_version, chunk_ids)` lineage; pinned by `test_source_parts_reanchored_preserving_doc_lineage`.
- #2 L1 → L2 derivation: step ordering pinned by `test_step_order_is_l1_then_vector_then_delete` (L1 writes precede vector writes; deletes last).
- #3 transparent alias redirect: pinned by `test_indexer_upsert_after_merge_redirects_to_canonical`.
- #4 vector store via abstraction: ``VectorStoreConnector.upsert/delete`` via ``asyncio.to_thread``.
- #5 3-field payload: pinned by `test_vector_payload_is_3_field_with_deterministic_uuid5`.
- #6 uuid5 deterministic point id: same test pins ``uuid5(NAMESPACE_DNS, "graph_entity:{cid}:{name}")``.
- #7 alias_map orphan persist: pinned by `test_alias_persists_after_canonical_gc`.
- #9 cycle flatten + reject: pinned by 3 tests (`test_transitive_flatten_rewrites_existing_alias_rows`, `test_cycle_reject_self_loop`, `test_cycle_reject_through_existing_chain`).
- #11 D-3 merge user-driven only: `LineageEntityMerger.merge_entities` is the only entry point; `MergeCandidateDetector` (PR #1755) does not import it.
- #12 grep-zero LightRAG: 0 hits in new files.

Scope notes
-----------

- **Wiring**: `worker_factory._build_lineage_graph_store` is NOT changed in this PR. Bryce's task #3 PR (#1757), which just merged, adds the worker.sync() consumption path; wiring the alias-redirect decorator into `_build_lineage_graph_store` is a one-line follow-up that belongs with task #8 retrieval cutover (where the legacy merge route is also replaced), so we don't ship a partial wiring.
- **REST route cutover**: legacy `KnowledgeGraphService.merge_entities` (route handler) still calls `GraphIndexService.merge_entities`. Replacing it with `LineageEntityMerger` is task #8's job (chenyexuan); it changes no field on the response shape, so cuiwenbo's task #9 frontend is zero-touch.
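The cycle-reject and transitive-flatten behaviour described for `AliasMapRepository` can be sketched with an in-memory dictionary. This is only an illustration under stated assumptions: `InMemoryAliasMap` is a hypothetical stand-in, not the repository class from the PR, and the real implementation performs the UPDATE + INSERT against the `aperag_lineage_entity_alias` table inside one transaction.

```python
class AliasCycleError(ValueError):
    """Raised when a new alias row would point back at itself."""


class InMemoryAliasMap:
    """Hypothetical in-memory stand-in for the alias table (illustration only)."""

    def __init__(self) -> None:
        # (collection_id, alias_name) -> canonical_name, mirroring the composite PK.
        self._rows: dict[tuple[str, str], str] = {}

    def resolve_canonical(self, collection_id: str, name: str) -> str:
        # One lookup is enough: flattening at write time keeps chains 1-deep.
        return self._rows.get((collection_id, name), name)

    def upsert_alias(self, collection_id: str, alias: str, canonical: str) -> str:
        # Flatten the target first: if `canonical` is itself an alias, follow it.
        final = self.resolve_canonical(collection_id, canonical)
        if final == alias:
            raise AliasCycleError(f"{alias!r} -> {canonical!r} would form a cycle")
        # Transitive flatten: rows that pointed at `alias` now point at `final`.
        for key, target in list(self._rows.items()):
            if key[0] == collection_id and target == alias:
                self._rows[key] = final
        self._rows[(collection_id, alias)] = final
        return final
```

Because every write leaves the map one hop deep, a self-loop and a cycle through an existing chain are both caught by the single `final == alias` comparison.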
earayu left a comment
🟢 LGTM ✅ (huangheng pass-1, per spec §K.12.11). GitHub does not allow approving from the same account; verdict = ready to merge
The full 12-invariant table, the 4-pattern matrix, and the simple-stable 4-guardrail list are all pasted (the seventh PR to meet the hard-gate format). The 24 tests cover every architecture invariant in spec §K.12.6 / §K.12.7 / §K.12.10b. All 5 drifts are fixed at byte level (the 5 items I surfaced in the earlier thread match the current outline), decorator passthrough is strictly kept, and the sentinel __curation_merge__ is consistent.
12-invariant cross-check (task #6 scope, 10 material items ✅)
| # | Invariant | This PR | Evidence |
|---|---|---|---|
| 1 | L1 graph data not polluted (kg.jsonl raw vs storage view strictly layered) | ✅ | source parts re-anchored keeping the original (document_id, parse_version, chunk_ids) lineage (lineage_merge.py:266-284); pinned by test_source_parts_reanchored_preserving_doc_lineage |
| 2 | L1 → L2 one-way derivation (step ordering) | ✅ | Step 6a/6b L1 writes → Step 7 vector → Step 8 delete sources; pinned by test_step_order_is_l1_then_vector_then_delete |
| 3 | upsert_entity_with_lineage transparent alias redirect | ✅ | LineageGraphStoreWithAliasRedirect lines 87-107 (entity) / 109-136 (relation); test_indexer_upsert_after_merge_redirects_to_canonical pins the inseparability gate |
| 4 | Vector store via VectorStoreConnectorAdaptor | ✅ | lineage_merge.py:_upsert_vector_point uses await asyncio.to_thread(connector.upsert/delete, ...); no direct Qdrant import |
| 5 | Payload has 3 fields {indexer, entity_name, entity_type} | ✅ | lines 388-401 are strictly 3 fields; pinned by test_vector_payload_is_3_field_with_deterministic_uuid5 |
| 6 | uuid5 deterministic vector point id | ✅ | lines 389-394 f"{GRAPH_ENTITY_INDEXER}:{collection_id}:{entity_name}"; pinned by the same test |
| 7 | alias_map orphan persist (canonical GC does not cascade) | ✅ | alias_map.py does not cascade-delete on canonical GC; pinned by test_alias_persists_after_canonical_gc |
| 8 | snapshot-diff via lineage entity name set | n/a | task #3 scope |
| 9 | cycle flatten + reject + transitive flatten | ✅ | AliasMapRepository.upsert_alias lines 153-157 reject + 181-191 transitive flatten in a single transaction; pinned by 3 tests (self-loop / chain cycle / transitive flatten) |
| 10 | DB column length cap | n/a | alias table uses String(512), matching the entity table |
| 11 | Candidate detection only writes, never auto-merges (D-3) | ✅ | LineageEntityMerger.merge_entities is the only user-driven entry point; MergeCandidateDetector (PR #1755 verified) does not import this service |
| 12 | Naming: grep-zero LightRAG | ✅ | `gh pr diff 1758 \| grep -E '^\+' \| grep -i lightrag` → 0 added lines |
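Invariants #5 and #6 in the table can be illustrated with the standard-library `uuid` module. This is a minimal sketch, assuming the key format quoted in the review (`graph_entity:{collection_id}:{entity_name}` hashed under `NAMESPACE_DNS`); the helper names `entity_point_id` and `entity_payload` are hypothetical and not taken from `lineage_merge.py`.

```python
import uuid

GRAPH_ENTITY_INDEXER = "graph_entity"


def entity_point_id(collection_id: str, entity_name: str) -> str:
    """Derive a stable vector point id; the same inputs always give the same id."""
    key = f"{GRAPH_ENTITY_INDEXER}:{collection_id}:{entity_name}"
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, key))


def entity_payload(entity_name: str, entity_type: str) -> dict[str, str]:
    """Build the 3-field payload: indexer, entity_name, entity_type and nothing else."""
    return {
        "indexer": GRAPH_ENTITY_INDEXER,
        "entity_name": entity_name,
        "entity_type": entity_type,
    }
```

A deterministic id means a retried merge overwrites the same vector point instead of creating a second one, which is what makes the vector upsert step safe to repeat.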
Byte-level verification of the 5 drift fixes (per architect ratify msg=cf860ae4 + my msg=22816e0d)
| Drift | Spec lock | Where it lands in this PR |
|---|---|---|
| #1 delete_entity(entity_name) single argument | task #1 PR #1754 Protocol | lineage_merge.py:314 delete_entity(src_entity.name) ✅ |
| #2 upsert_entity_with_lineage(record=EntityRecord, lineage=LineageMember) shape + sentinel __curation_merge__ | architect ratify | lines 271-303: every call uses the record/lineage shape; line 297 document_id=CURATION_MERGE_DOCUMENT_ID (line 99 const = "__curation_merge__") ✅ |
| #3 Compactor gains subject_kind/subject_label/language kwargs | task #2 fix-forward | lines 254-259 compact_if_oversized([unified], subject_kind="entity", subject_label=final_target, language=self._language) ✅ |
| #4 alias_map cycle detection (transitive flatten + reject cycle) | spec §K.12.10b | alias_map.py:153-157 reject + 181-191 transitive flatten wrapped in a single transaction ✅ |
| #5 Step order L1 → vector → delete | invariant #2 lock | Step 6a (line 266) → 6b (line 286) → 7 (line 306) → 8 (line 313) ✅ + pinned by test_step_order_is_l1_then_vector_then_delete |

A complete mirror, with no stale remnants.
Decorator passthrough invariant (my CR lock, msg=22816e0d)
alias_redirect_store.py:LineageGraphStoreWithAliasRedirect forwards all 13 non-upsert methods to _inner:
- find_entity_ids_with_lineage / find_relation_keys_with_lineage
- remove_entity_lineage_member / remove_relation_lineage_member
- gc_entity_if_orphan / gc_relation_if_orphan
- delete_entity / delete_relation
- get_entity / get_relation
- query_entities_by_keyword / expand_neighbors_n_hops
- list_entity_labels

test_decorator_passthrough_for_non_upsert_methods nails this down: if a future Protocol method is added without being handled in the decorator, the test fails immediately, which prevents a silent default break. 👍
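The decorator shape the review describes (rewrite names on upsert, forward everything else untouched) can be sketched as follows. The classes here are hypothetical simplifications: the real `upsert_entity_with_lineage` takes `record=` / `lineage=` objects rather than a bare name, only three of the 13 passthrough methods are listed for brevity, and `RecordingStore` is a test double invented for this example.

```python
from typing import Any, Callable

# Subset of the non-upsert Protocol methods, for brevity.
PASSTHROUGH = ("get_entity", "delete_entity", "list_entity_labels")


class RedirectingStore:
    """Hypothetical decorator: rewrites entity names on upsert only."""

    def __init__(self, inner: Any, resolve: Callable[[str], str]) -> None:
        self._inner = inner
        self._resolve = resolve

    def upsert_entity_with_lineage(self, name: str, lineage: Any) -> Any:
        # The only intercepted write: the name goes through the alias map first.
        return self._inner.upsert_entity_with_lineage(self._resolve(name), lineage)


def _forward(method_name: str) -> Callable[..., Any]:
    def method(self: RedirectingStore, *args: Any, **kwargs: Any) -> Any:
        # Explicit forwarding: arguments reach the inner store unchanged.
        return getattr(self._inner, method_name)(*args, **kwargs)

    method.__name__ = method_name
    return method


for _name in PASSTHROUGH:
    setattr(RedirectingStore, _name, _forward(_name))


class RecordingStore:
    """Test double that records every call it receives."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, tuple, dict]] = []

    def __getattr__(self, name: str) -> Callable[..., str]:
        def record(*args: Any, **kwargs: Any) -> str:
            self.calls.append((name, args, kwargs))
            return name

        return record
```

Listing the forwarded methods explicitly, rather than forwarding through a catch-all `__getattr__` on the decorator, is what lets a passthrough test notice when a new Protocol method has not been considered.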
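Architecture quality (attention to detail)
- Option (b) decorator with zero changes to the 3 backends: this avoids cross-PR risk. Task #6 ships in a single PR and saves the sync barrier of splitting into three backend PRs (Postgres / Neo4j / Nebula). A decorator is the standard pattern for this kind of cross-cutting concern.
- Source parts re-anchor keeps lineage: lines 266-284 do not invent a sentinel "curation_merge" pseudo-doc for source parts. They keep each source's original (document_id, parse_version, chunk_ids) and write it under the target name. L1 can therefore still trace per-doc origin cleanly (invariant #1 strictly kept), and when a user later deletes a doc, the corresponding part can still be stripped. 👍
- Sentinel __curation_merge__ only for unified+compacted: in lines 286-303 the final write uses the sentinel lineage to mark "not doc-driven", clearly distinct from the indexer's normal lineage.
- Soft-transition + forward-only retry safety: after a mid-step failure the sources are still readable (step 8 deletes last), and retrying the whole merge is fully idempotent on the (document_id, parse_version) key. This is already recorded in the docstring.
- Wiring deferred to task #8 is explicitly documented: wrapping worker_factory._build_lineage_graph_store in the decorator and swapping the REST /graphs/merge route are both one-line or few-line changes in task #8, which avoids partial wiring. This deferred wiring must be completed before the task #10 close-out; otherwise the alias redirect's inseparability gate exists in name only (the indexer would still use the raw store).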
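The forward-only step ordering and retry safety noted above can be sketched with plain dictionaries. Everything here is an assumption made for illustration: the function signature, the dictionary-backed `graph` / `vectors` / `aliases` stores, the step labels, and the `parse_version` of `0` on the sentinel part are hypothetical, and the unify + compact work of the real merger is reduced to writing a sentinel-keyed part.

```python
CURATION_MERGE_DOCUMENT_ID = "__curation_merge__"

# (document_id, parse_version) -> chunk_ids
Parts = dict[tuple[str, int], tuple[str, ...]]


def merge_entities(
    graph: dict[str, Parts],
    vectors: dict[str, dict],
    aliases: dict[str, str],
    sources: list[str],
    target: str,
    *,
    entity_type: str,
) -> list[str]:
    """Run the merge steps in the locked order and return the steps executed."""
    steps: list[str] = []
    if not sources:
        return steps  # empty source list short-circuits

    for src in sources:
        aliases[src] = target
    steps.append("alias")

    # Re-anchor: each source part keeps its original lineage key under the target.
    target_parts = graph.setdefault(target, {})
    for src in sources:
        target_parts.update(graph.get(src, {}))
    steps.append("l1_reanchor")

    # Final unified record is tagged with the sentinel, not a real document id.
    target_parts[(CURATION_MERGE_DOCUMENT_ID, 0)] = ()
    steps.append("l1_unified")

    vectors[target] = {
        "indexer": "graph_entity",
        "entity_name": target,
        "entity_type": entity_type,
    }
    steps.append("vector")

    # Sources are removed last, so a failure above leaves them readable.
    for src in sources:
        graph.pop(src, None)
        vectors.pop(src, None)
    steps.append("delete_sources")
    return steps
```

Since every write is keyed by `(document_id, parse_version)` or by entity name, running the function a second time after a partial failure rewrites the same keys rather than duplicating parts.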
4-pattern pre-check matrix (PR body paste)
P1 v1 caller-grep lands correctly (legacy merge_entities goes through domains/knowledge_graph/service.py:171 + api/routes.py:160; task #8 cutover). P2 dependency interface signatures all align with already-merged commits. P3 per-collection binding matches the existing GraphCurationService repo convention.
simple-stable 4-guardrail
The PR description states all 4 items explicitly: Option (b) leaves the backends unchanged / smallest surface / production helpers reused / no operator config.
Test quality (24/24)
- alias_map (10): 3 cycle cases + transitive flatten + target chain flatten + orphan persist + collection isolation + purge ✅
- decorator (5): entity redirect + no-alias passthrough + relation both-endpoint redirect + relation single-endpoint redirect + 13-method passthrough invariant ✅
- merger (9): empty short-circuit + step order + sentinel + Compactor kwargs + N×M re-anchor + vector payload + cycle propagate + target chain flatten + target GC tolerance ✅
One task #8 wiring reminder (does not block this PR, but task #8 must deliver it)
The PR description explicitly says "wiring is deferred to task #8":
- worker_factory._build_lineage_graph_store wrap with LineageGraphStoreWithAliasRedirect
- REST /graphs/merge route swap to LineageEntityMerger

Critical: without these 2 wiring points, neither this PR's inseparability gate (invariant #3, transparent alias redirect) nor user-driven merge takes effect. @chenyexuan the task #8 PR must include both wiring points and call them out explicitly in the PR body; @huangheng the task #8 CR should focus on where these 2 wiring points land.
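Checklist to reach LGTM after fixes
This PR is in fact already mergeable ✅. The task #8 wiring check moves to the task #8 CR window; this PR has no blocker.
@明书 extremely solid work: choosing the Option (b) decorator and re-anchoring source parts with lineage preserved are both refinements the spec did not explicitly require but the implementer added proactively, and the single-PR ETA of ~04:15 was met. 👍 The Wave 7 close-out path is clear: what remains is task #7 (chenyexuan) + #8 (chenyexuan + this PR's wiring) + #9 (cuiwenbo) + #10 (Bryce sweep).
@符炫炜 LGTM, ready to merge.
@不穷 move task #6 → done after merge; the critical path has #7 / #8 / #9 / #10 left = 4 PRs to finish Wave 7.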
Wave 7 §K.12.8 task #8: three wiring points lit at the same time so PR #1758's inseparability gate (alias-redirect on every indexer write) and PR #1756's vector recall (LightRAG-style semantic search plus 1-hop traversal) become production-alive at the same merge.

What lands

* ``worker_factory._build_lineage_graph_store`` now returns a ``LineageGraphStoreWithAliasRedirect`` decorating the raw per-collection backend (Wave 7 §K.12 invariant #9). Callers that need the raw inner store (the merger writes canonical names directly and must not be intercepted) go through the new ``_build_lineage_graph_store_inner``.
* ``retrieval/pipeline._graph_search`` now composes ``GraphSearchService.search_entities`` (vector recall) + ``get_subgraph`` (1-hop traversal) + ``GraphSearchService.compose_context`` (legacy LightRAG-style rendering). The render is byte-for-byte identical to Wave 6's ``_render_graph_context_text`` (locked by the byte-parity test in PR #1756), so downstream RAG prompts are zero-functional-change; only now vector recall actually happens.
* ``GraphService.merge_entities`` (the route layer for ``POST /graphs/nodes/merge``) delegates to ``LineageEntityMerger``. Backward-compat response shape preserved (``target_entity_id`` / ``description`` / ``source_chunk_ids`` / ``edges_*``); chunk ids are recovered from the target's lineage after step 6a re-anchors the source parts under the canonical name. ``edges_redirected`` / ``edges_collapsed`` surface ``0`` because edge re-anchoring is handled transparently by the alias-redirect decorator at indexer write time, not as part of the merge action.
* New ``build_lineage_entity_merger_for(collection)`` factory in ``aperag/graph_curation/lineage_merge.py`` resolves the six dependencies (raw inner store / alias repo / compactor / vector connector / embedder / LLM) the merger expects, lifting the ``_SyncEmbedderShim`` pattern out of ``MergeCandidateDetector``'s factory so merger and detector share one shim.
Tests

* ``tests/unit_test/indexing/test_wave7_task8_wiring.py``: eight integration tests pinning each of the three wiring points so a future refactor cannot silently regress any of them: decorator wraps inner store, inner factory still returns raw backend, pipeline composes the three GraphSearchService calls in order, KG gate still short-circuits, factory failures degrade to empty, merger delegation preserves backward-compat shape, fallback to unified description when no compaction, alias cycle surfaces as ValueError → 400.
* Wave 6 ``test_graph_search_migration.py``: two existing tests updated to mock the new ``build_graph_search_service_for`` / ``search_entities`` / ``get_subgraph`` boundary instead of the retired keyword-only path. The grep-zero migration assertion is unchanged and passes (legacy import still 0).

12-invariant cross-check (§K.12 / huangheng msg=fcf580a6)

* #4 vector store via ``VectorStoreConnectorAdaptor`` ✅: pipeline composes through ``GraphSearchService``, which already abides; no direct Qdrant import added.
* #5 payload ``indexer="graph_entity"`` filter pattern ✅: inherited from ``GraphSearchService.search_entities``.
* #9 ``upsert_entity_with_lineage`` alias redirect ✅: every indexer / read path now receives the decorated store; merger bypasses via the inner-only factory so canonical writes are not intercepted.
* #11 candidate detection write-only / merge read paths ✅: the cutover preserves the read/write boundary.
* #12 grep-zero LightRAG ✅: code + tests stay LightRAG-clean. ``aperag/graph_curation/*`` and ``domains/knowledge_graph/service.py:get_knowledge_graph`` still hold legacy imports for graph-overview / curation-run paths that depend on enumerate-all-entities; deferred to task #10 close-out alongside the Protocol-extension or legacy delete.
4-pattern pre-check matrix

* Pattern 1 v1: ``rg "from aperag.domains.knowledge_graph.graphindex"`` count is unchanged in this PR (drops below the threshold task #10 will assert grep-zero); the ``_graph_search`` path no longer imports the legacy package at all.
* Pattern 1 v2: per-method matrix. Pipeline cutover replaces ``query_entities_by_keyword`` + ``expand_neighbors_n_hops`` with ``search_entities`` + ``get_subgraph``; merge cutover replaces ``GraphIndexService.merge_entities`` with ``LineageEntityMerger.merge_entities``.
* Pattern 2: state binding. New merger factory binds to the same Qdrant connector + embedder the indexer write path uses (``_build_collection_graph_vector_writer``).
* Pattern 3: factory + decorator pattern verified by the wiring tests so a future split cannot silently regress.

simple-stable 4 guardrail

* #1 No unbounded scope growth: three wiring swaps + one factory; no new endpoints, no new schema, no new Protocol surface.
* #2 Ship as soon as possible: task #8 unblocks task #9 (frontend) and task #11 (e2e narrative) without a Protocol-extension prerequisite.
* #3 Simple and stable: decorator + factory split keeps the merger's "writes canonical directly" invariant explicit; pipeline cutover reuses the byte-parity contract from PR #1756.
* #4 Maintenance-free private deployment: no operator config required; the ``API_BASE_URL`` env already supports MCP colocation, and the alias-redirect decorator runs on every collection automatically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
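The three-call composition in `_graph_search`, together with the KG gate and the degrade-to-empty behaviour listed among the wiring tests, might look roughly like the sketch below. The stub service, its method signatures, the `async` style, the `top_k` / `hops` values, and the `graph_search` wrapper are all assumptions for illustration; only the three method names and their order come from the commit message.

```python
import asyncio
from typing import Callable


class StubGraphSearchService:
    """Hypothetical stand-in for GraphSearchService that records call order."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    async def search_entities(self, query: str, top_k: int) -> list[str]:
        self.calls.append("search_entities")
        return ["IBM"]

    async def get_subgraph(self, names: list[str], hops: int) -> dict:
        self.calls.append("get_subgraph")
        return {"entities": names, "relations": []}

    def compose_context(self, subgraph: dict) -> str:
        self.calls.append("compose_context")
        return "entities: " + ", ".join(subgraph["entities"])


async def graph_search(
    query: str,
    *,
    kg_enabled: bool,
    build_service: Callable[[], StubGraphSearchService],
) -> str:
    if not kg_enabled:
        return ""  # KG gate short-circuits before any service is built
    try:
        service = build_service()
    except Exception:
        return ""  # factory failure degrades to an empty graph context
    names = await service.search_entities(query, top_k=5)  # vector recall
    if not names:
        return ""
    subgraph = await service.get_subgraph(names, hops=1)  # 1-hop traversal
    return service.compose_context(subgraph)  # text rendering
```

Returning an empty string on a factory failure keeps graph search optional: the rest of the retrieval pipeline can proceed without graph context instead of failing the whole request.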
…1762)

* feat(celery Wave 7 #8): retrieval/curation cutover + 3 wiring points

* docs(Wave 7 #8): clarify edges_*=0 semantic on merge_nodes route

Per architect msg=4af6f66b / PM msg=e75fd00d follow-up: future maintainers grepping merge_nodes_view shouldn't see the hard-coded 0 and suspect a bug. The Wave 7 §K.12 invariant #9 LineageGraphStoreWithAliasRedirect decorator handles edge re-anchoring at indexer write-time, not as part of the merge call, so there is no explicit edge count to surface on the response.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…l (W7-10)

Final close-out of Wave 7 §K.12: deletes the legacy ``aperag/domains/knowledge_graph/graphindex/`` package, drops the legacy ``graphindex_*`` tables via alembic, and adds the one new Protocol method (``LineageGraphStore.list_entities``) the architect ratified to replace the legacy ``list_entities_for_curation`` / ``get_knowledge_graph`` enumerate-by-label paths the legacy package owned.

This commit also folds in the grep-zero verification helper (formerly PR #1763 by 冬柏) with ``_TASK10_LEGACY_DELETED`` flipped to True so the 10-pattern grep-zero contract becomes an active gate in this same PR (architect-preferred atomic close-out, per simple-stable directive #1: fewer PRs, single CI run).

## Scope (per architect msg=28afe6ab + 4-question Q1-Q4 ratify msg=838d57c3 / msg=f3216dfc)

1. **NEW Protocol method ``LineageGraphStore.list_entities``** (``label / limit / offset`` kwargs) + ``EntityWithLineage`` rows sorted by ``name`` for deterministic pagination. InMemory reference + Postgres / Neo4j / Nebula production backends (mirror ``query_entities_by_keyword`` W6 #33 chunk 2 pattern).
2. **``aperag/domains/knowledge_graph/service.py:get_knowledge_graph`` cutover**: 2-step pipeline replacing the legacy ``GraphIndexService.get_knowledge_graph``:
   1. ``store.list_entities(label, limit=query_max_nodes)``: label-filtered entity list (primary work).
   2. ``GraphSearchService.get_subgraph(names, hops=max_depth)``: optional edge expansion when ``max_depth > 0``.

   Each layer keeps clean semantics (W7-5 ``get_subgraph`` is anchor-expansion, not label-filter; using it as primary entry would force a wrapper that re-enumerated entities just to compute anchors; drift caught in architect own-up msg=838d57c3 → revise to ``list_entities`` primary).
3. **``CurationEntity`` adapter** (new ``aperag/graph_curation/dto.py``) replacing the legacy ``Entity`` DTO. ``from_lineage(EntityWithLineage)`` constructor adapts the storage view into the shape ``build_candidate_pairs`` / ``_pair_score`` / ``_jaccard`` / ``entity_snapshot`` already accept, so the production-validated algorithm keeps its 0-signature change (architect Q2 ratify, simple-stable directive #3).
4. **``aperag/graph_curation/service.py`` cutover**: ``accept_suggestion`` and ``generate_run`` migrated off the legacy ``GraphIndexService`` bundle:
   - ``accept_suggestion`` delegates to ``LineageEntityMerger.merge_entities`` (W7-6, PR #1758) the same way the W7-8 ``GraphService.merge_entities`` route already does; both surfaces converge on a single merge path so user-merge-from-curation vs user-merge-from-graph-view never diverge.
   - ``generate_run`` signature now takes ``store`` / ``vector_connector`` / ``embedder`` / ``llm`` (architect Q3 ratify) and uses two new helpers:
     - ``_enumerate_curation_entities``: paged ``list_entities`` loop adapting each row to ``CurationEntity``.
     - ``_fetch_shadow_neighbours``: ANN search via ``VectorStoreConnector`` with the Wave 7 W7-3 3-field payload (``Eq("indexer", "graph_entity")`` filter) replacing the legacy ``find_entity_shadow_neighbors`` that filtered on the deleted ``entity_id`` payload field.
5. **``aperag/graph_curation/integration.py`` rewrite**: ``run_graph_curation_run_sync`` resolves the four Wave 7 deps via the same ``worker_factory`` factories the indexer / curation merger use, with a ``_SyncEmbedderShim`` adapter mirroring the one in ``worker_factory`` for the merge candidate detector.
6. **``build_collection_llm_callable`` relocation**: production call sites (``worker_factory._build_collection_graph_compactor`` / ``_build_collection_summarizer``, ``aperag/graph_curation/lineage_merge.py:build_lineage_entity_merger_for``, ``aperag/graph_curation/integration.py``) all import from the canonical home ``aperag/indexing/llm.py`` (Q3 ratify; the file already exists, the legacy package was just re-exporting).
7. **Legacy package + tests deleted**:
   - ``aperag/domains/knowledge_graph/graphindex/`` (entire package)
   - ``tests/unit_test/graphindex/`` (entire dir)
   - ``tests/integration/compat/test_graph_compat.py`` (replaced by ``test_lineage_graph_compat.py`` in W7-1)
8. **Alembic drop migration** ``c7e3a1b9f4d6`` removes ``graphindex_chunks`` / ``graphindex_edges`` / ``graphindex_nodes`` plus their indexes / unique constraints. Hard-cut policy per spec §K.12.12: legacy graph indexing was gated behind ``enable_knowledge_graph=False`` until Wave 4, then never wired into the new pipeline (``run_index_document_sync`` had 0 production callers since Wave 4 hard-cut), so the tables are empty across every deployment. Downgrade recreates empty schema.
9. **Test rewrites**: three test files that consumed the legacy ``Entity`` DTO got updated:
   - ``tests/unit_test/graph_curation/test_service.py`` / ``test_candidate_generation.py``: switched to ``CurationEntity as Entity``.
   - ``tests/unit_test/service/test_search_graph_contract.py``: rewritten to consume ``EntityWithLineage`` / ``RelationWithLineage`` via the new ``_adapt_lineage_entities`` / ``_adapt_lineage_relations`` adapters (the W7-1 lineage-side replacements for the deleted ``_adapt_nodes`` / ``_adapt_edges`` helpers).
   - 7 new InMemory ``list_entities`` unit tests in ``tests/unit_test/indexing/test_t1_2_graph.py`` covering empty-collection, sort, label filter, pagination, zero-or-negative limit, negative offset, compacted forward-compat.
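The `list_entities` contract in item 1 (label filter, rows sorted by name, `limit` / `offset` paging) and the paged loop that `_enumerate_curation_entities` is described as running can be sketched in memory. `EntityRow` is a hypothetical simplification of `EntityWithLineage`, and the edge-case handling shown (empty result for a non-positive `limit`, negative `offset` clamped to zero) is an assumption: the commit message says those cases are tested but does not state the expected behaviour.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class EntityRow:
    """Hypothetical simplification of EntityWithLineage."""

    name: str
    label: str


def list_entities(
    rows: Iterable[EntityRow],
    *,
    label: Optional[str] = None,
    limit: int = 100,
    offset: int = 0,
) -> list[EntityRow]:
    if limit <= 0:
        return []  # assumed behaviour for zero-or-negative limit
    offset = max(offset, 0)  # assumed behaviour for negative offset
    matched = sorted(
        (r for r in rows if label is None or r.label == label),
        key=lambda r: r.name,  # stable order makes paging deterministic
    )
    return matched[offset : offset + limit]


def enumerate_all(rows: list[EntityRow], *, page_size: int = 2) -> Iterator[EntityRow]:
    """Paged loop: keep fetching until a page comes back empty."""
    offset = 0
    while True:
        page = list_entities(rows, limit=page_size, offset=offset)
        if not page:
            break
        yield from page
        offset += len(page)
```

Sorting by name before slicing is what keeps consecutive pages from overlapping or skipping rows, provided the underlying data does not change between calls.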
## §K.12 invariant cross-check

| # | Invariant | This PR |
|---|-----------|---------|
| 1 | L1 graph data not polluted | ✅ ``list_entities`` is read-only; storage view → adapter projection only |
| 2 | L1 → L2 single-direction derive | ✅ no derived writes |
| 3 | Compactor before vector embed | N/A (read path) |
| 4 | Vector store via Adaptor | ✅ ``_fetch_shadow_neighbours`` uses ``VectorStoreConnector`` only |
| 5 | payload indexer filter | ✅ ``Eq("indexer","graph_entity")`` filter; no legacy ``entity_id`` payload reference |
| 6 | uuid5 vector point id | N/A (read path) |
| 7 | snapshot-diff lineage name set | N/A (read path) |
| 8 | alias_map persist orphan | ✅ unaffected; alias_map is W7-6 owned |
| 9 | upsert_entity alias redirect | ✅ unaffected; decorator pattern preserved (curation flow uses inner store directly per architect msg=cf860ae4) |
| 10 | DB column length application-cap | ✅ no schema CHECK constraints introduced |
| 11 | candidate detection write-only | ✅ ``MergeCandidateDetector`` unchanged; ``generate_run`` uses same write boundary |
| 12 | grep-zero LightRAG | ✅ `rg "from aperag.domains.knowledge_graph.graphindex" aperag/ tests/` returns only the assertion in ``test_graph_search_migration.py:55``. ``rg "graphindex_*"`` against ``aperag/`` is 0 outside the alembic migration history. The 8 Wave 6-era ``# -- LightRAG-style query layer`` comments + W7-4 line 249 fallback + W7-5 docstrings remain (architect msg=3fe200be: they are descriptive comments referencing design heritage, removable in Wave 8 cleanup if desired) |

## 4-pattern pre-check matrix (paste from PR thread reply)

* **P1 v1**: ``rg "from aperag.domains.knowledge_graph.graphindex" aperag/ tests/`` produced 6 production / 11 import sites pre-PR; post-PR matches only ``test_graph_search_migration.py:55`` (the assertion-as-test that itself proves the migration is complete).
* **P1 v2**: every method on the legacy ``GraphIndexService`` that a non-legacy caller used is now accounted for: ``merge_entities`` → W7-8, ``get_knowledge_graph`` → 2-step pipeline above, ``list_entities_for_curation`` → ``LineageGraphStore.list_entities`` + ``CurationEntity.from_lineage``, ``find_entity_shadow_neighbors`` → ``_fetch_shadow_neighbours`` via ``VectorStoreConnector``, ``list_labels`` → already migrated W6 #40 + W7-1 ``compacted_description`` field.
* **P2**: alembic ``c7e3a1b9f4d6`` drops the legacy tables, alembic env.py loses the legacy ``graphindex.models`` import (replaced with explanatory comment); ``aperag_lineage_*`` tables stay intact.
* **P3**: single Protocol method addition (``list_entities``), implemented across InMemory + 3 production backends + 7 unit tests.

## simple-stable 4-guardrail

| Guardrail | Status |
|---|---|
| #1 No unbounded scope growth | ✅ ``list_entities`` is base capability mirroring `delete_entity` / `query_entities_by_keyword`; no new endpoints, no new schema tables |
| #2 Ship as soon as possible | ✅ single PR closes Wave 7; all 11 prior task PRs already merged |
| #3 Simple and stable | ✅ adapter pattern preserves production-validated `build_candidate_pairs`; ``list_entities`` follows existing pagination idiom |
| #4 Maintenance-free private deployment | ✅ alembic auto-drops legacy tables; no operator config; ``list_entities`` uses the same backend factories the rest of Wave 7 wires |

## Test plan

- [x] All 1142 unit tests pass (``uv run pytest tests/unit_test/``)
- [x] ``alembic upgrade head --sql`` generates the expected ``DROP INDEX`` / ``DROP TABLE`` cascade
- [x] ``alembic heads`` resolves to single head ``c7e3a1b9f4d6``
- [x] ``ruff format --check`` / ``ruff check`` clean on touched files
- [ ] CI compat-graph + e2e-http stages (both gated post-merge)
- [ ] Pair with 冬柏's grep-zero helper PR #1763: flip ``_TASK10_LEGACY_DELETED=True`` once both PRs merge

Closes Wave 7. Next: architect final review per spec §K.12.12.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

## Fold-in: grep-zero verification helper (formerly PR #1763)

Per architect ratify + 冬柏 authorization, PR #1763 is folded into this commit instead of shipping as a separate PR. Contents:

* **``tests/integration/test_w7_grep_zero_legacy_graphindex.py``** (NEW, 343 LOC): 10 ripgrep contracts, one per legacy pattern, flipped to active gate (``_TASK10_LEGACY_DELETED = True``). Patterns cover:
  1. ``from aperag.domains.knowledge_graph.graphindex`` imports
  2. ``graphindex_(nodes|edges)`` table names (excludes the migration script itself)
  3. bare ``import aperag.domains.knowledge_graph.graphindex``
  4. ``_sync_entity_relation_vectors`` (W7-3 superseded)
  5. ``_compact_oversized_descriptions`` (W7-2 superseded)
  6. ``_summarize_description`` (W7-2 superseded)
  7. ``_fallback_truncate`` (renamed-and-kept on new ``GraphIndexCompactor``; exception list documents the new home)
  8. ``_delete_removed_shadow_vectors`` (W7-3 superseded)
  9. ``GraphSearchContract.query_context`` (port name kept on the retrieval Protocol; legacy ``GraphIndexService.query_context`` historical-context comments allow-listed)
  10. ``GraphIndexService.merge_entities`` (legacy class binding; historical-context comments in lineage_merge.py + test_wave7_task8_wiring.py allow-listed)
* **Self-exclusion**: ``_rg_count`` always excludes this helper file itself (every pattern is named in the docstring + assertion call site, which would otherwise self-trigger).
* **``aperag/mcp/server.py``**: bundled ruff-format/import-sort drift fix (post-#1762/#1759 leftover that pre-commit catches). Kept here so the close-out PR lands cleanly through ``make lint``.
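The self-exclusion idea behind `_rg_count` can be sketched without ripgrep. The function below is a pure-Python stand-in written for illustration, not the helper from the PR: `count_matches`, its parameters, and the choice to scan only `*.py` files are assumptions.

```python
import re
from pathlib import Path


def count_matches(
    pattern: str, root: Path, *, exclude: frozenset[str] = frozenset()
) -> int:
    """Count lines under `root` matching `pattern`, skipping excluded file names."""
    regex = re.compile(pattern)
    total = 0
    for path in sorted(root.rglob("*.py")):
        if path.name in exclude:
            # The helper names every legacy pattern itself, so it must be skipped.
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        total += sum(1 for line in text.splitlines() if regex.search(line))
    return total
```

A grep-zero gate would then assert that the count is `0` for each legacy pattern, passing the helper's own file name (and any allow-listed files) in `exclude`.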
## Test plan (final) - [x] 1141 unit tests pass (``uv run pytest tests/unit_test/``) - [x] 10/10 grep-zero integration tests pass (``uv run pytest tests/integration/test_w7_grep_zero_legacy_graphindex.py``) - [x] ``alembic upgrade head --sql`` generates the expected ``DROP INDEX`` / ``DROP TABLE`` cascade - [x] ``ruff format --check`` / ``ruff check`` clean on touched files - [ ] CI e2e-http-smoke + e2e-http-provider — gated post-merge Closes Wave 7. Next: architect final review per spec §K.12.12. Co-Authored-By: 冬柏 <noreply@anthropic.com> Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…l (W7-10) (#1765)

Final close-out of Wave 7 §K.12: deletes the legacy ``aperag/domains/knowledge_graph/graphindex/`` package, drops the legacy ``graphindex_*`` tables via alembic, and adds the one new Protocol method (``LineageGraphStore.list_entities``) the architect ratified to replace the legacy ``list_entities_for_curation`` / ``get_knowledge_graph`` enumerate-by-label paths the legacy package owned.

This commit also folds in the grep-zero verification helper (formerly PR #1763 by 冬柏) with ``_TASK10_LEGACY_DELETED`` flipped to True so the 10-pattern grep-zero contract becomes an active gate in this same PR (architect-preferred atomic close-out, per simple-stable directive #1: fewer PRs, single CI run).

## Scope (per architect msg=28afe6ab + 4-question Q1-Q4 ratify msg=838d57c3 / msg=f3216dfc)

1. **NEW Protocol method ``LineageGraphStore.list_entities``** (``label / limit / offset`` kwargs) returning ``EntityWithLineage`` rows sorted by ``name`` for deterministic pagination. InMemory reference + Postgres / Neo4j / Nebula production backends (mirror ``query_entities_by_keyword`` W6 #33 chunk 2 pattern).
2. **``aperag/domains/knowledge_graph/service.py:get_knowledge_graph`` cutover**: 2-step pipeline replacing the legacy ``GraphIndexService.get_knowledge_graph``:
   1. ``store.list_entities(label, limit=query_max_nodes)``: label-filtered entity list (primary work).
   2. ``GraphSearchService.get_subgraph(names, hops=max_depth)``: optional edge expansion when ``max_depth > 0``.

   Each layer keeps clean semantics. W7-5 ``get_subgraph`` is anchor-expansion, not label-filter; using it as primary entry would force a wrapper that re-enumerated entities just to compute anchors (drift caught in architect own-up msg=838d57c3 → revise to ``list_entities`` primary).
3. **``CurationEntity`` adapter** (new ``aperag/graph_curation/dto.py``) replacing the legacy ``Entity`` DTO. The ``from_lineage(EntityWithLineage)`` constructor adapts the storage view into the shape ``build_candidate_pairs`` / ``_pair_score`` / ``_jaccard`` / ``entity_snapshot`` already accept, so the production-validated algorithm keeps its 0-signature change (architect Q2 ratify, simple-stable directive #3).
4. **``aperag/graph_curation/service.py`` cutover**: ``accept_suggestion`` and ``generate_run`` migrated off the legacy ``GraphIndexService`` bundle:
   - ``accept_suggestion`` delegates to ``LineageEntityMerger.merge_entities`` (W7-6, PR #1758) the same way the W7-8 ``GraphService.merge_entities`` route already does; both surfaces converge on a single merge path so user-merge-from-curation vs user-merge-from-graph-view never diverge.
   - ``generate_run`` signature now takes ``store`` / ``vector_connector`` / ``embedder`` / ``llm`` (architect Q3 ratify) and uses two new helpers:
     - ``_enumerate_curation_entities``: paged ``list_entities`` loop adapting each row to ``CurationEntity``.
     - ``_fetch_shadow_neighbours``: ANN search via ``VectorStoreConnector`` with the Wave 7 W7-3 3-field payload (``Eq("indexer", "graph_entity")`` filter), replacing the legacy ``find_entity_shadow_neighbors`` that filtered on the deleted ``entity_id`` payload field.
5. **``aperag/graph_curation/integration.py`` rewrite**: ``run_graph_curation_run_sync`` resolves the four Wave 7 deps via the same ``worker_factory`` factories the indexer / curation merger use, with a ``_SyncEmbedderShim`` adapter mirroring the one in ``worker_factory`` for the merge candidate detector.
6. **``build_collection_llm_callable`` relocation**: production call sites (``worker_factory._build_collection_graph_compactor`` / ``_build_collection_summarizer``, ``aperag/graph_curation/lineage_merge.py:build_lineage_entity_merger_for``, ``aperag/graph_curation/integration.py``) all import from the canonical home ``aperag/indexing/llm.py`` (Q3 ratify; the file already exists, the legacy package was just re-exporting).
7. **Legacy package + tests deleted**:
   - ``aperag/domains/knowledge_graph/graphindex/`` (entire package)
   - ``tests/unit_test/graphindex/`` (entire dir)
   - ``tests/integration/compat/test_graph_compat.py`` (replaced by ``test_lineage_graph_compat.py`` in W7-1)
8. **Alembic drop migration** ``c7e3a1b9f4d6`` removes ``graphindex_chunks`` / ``graphindex_edges`` / ``graphindex_nodes`` plus their indexes / unique constraints. Hard-cut policy per spec §K.12.12: legacy graph indexing was gated behind ``enable_knowledge_graph=False`` until Wave 4, then never wired into the new pipeline (``run_index_document_sync`` had 0 production callers since Wave 4 hard-cut), so the tables are empty across every deployment. Downgrade recreates empty schema.
9. **Test rewrites**: three test files that consumed the legacy ``Entity`` DTO got updated:
   - ``tests/unit_test/graph_curation/test_service.py`` / ``test_candidate_generation.py``: switched to ``CurationEntity as Entity``.
   - ``tests/unit_test/service/test_search_graph_contract.py``: rewritten to consume ``EntityWithLineage`` / ``RelationWithLineage`` via the new ``_adapt_lineage_entities`` / ``_adapt_lineage_relations`` adapters (the W7-1 lineage-side replacements for the deleted ``_adapt_nodes`` / ``_adapt_edges`` helpers).
   - 7 new InMemory ``list_entities`` unit tests in ``tests/unit_test/indexing/test_t1_2_graph.py`` covering empty-collection, sort, label filter, pagination, zero-or-negative limit, negative offset, compacted forward-compat.
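The paged enumeration described in Scope items 1 and 4 can be sketched roughly as follows. This is a minimal illustration, not the PR's code: ``EntityRow``, ``ToyStore`` and ``enumerate_entities`` are hypothetical stand-ins, and the handling of non-positive ``limit`` and negative ``offset`` is an assumption (the commit only says those cases are tested, not what they return).

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EntityRow:
    """Hypothetical stand-in for EntityWithLineage (only the fields used here)."""
    name: str
    label: str


class ToyStore:
    """Toy in-memory store; the real Protocol method may differ in details."""

    def __init__(self, rows: List[EntityRow]) -> None:
        self._rows = list(rows)

    async def list_entities(
        self, *, label: Optional[str] = None, limit: int = 100, offset: int = 0
    ) -> List[EntityRow]:
        # Assumption: a non-positive limit yields an empty page and a
        # negative offset is clamped to 0.
        if limit <= 0:
            return []
        rows = [r for r in self._rows if label is None or r.label == label]
        rows.sort(key=lambda r: r.name)  # name order keeps offset paging deterministic
        start = max(offset, 0)
        return rows[start:start + limit]


async def enumerate_entities(
    store: ToyStore, *, label: Optional[str] = None, page_size: int = 2
) -> List[EntityRow]:
    """Page through list_entities until a short page signals the end."""
    collected: List[EntityRow] = []
    offset = 0
    while True:
        page = await store.list_entities(label=label, limit=page_size, offset=offset)
        collected.extend(page)
        if len(page) < page_size:
            return collected
        offset += page_size


def demo() -> List[str]:
    store = ToyStore(
        [
            EntityRow("b", "person"),
            EntityRow("a", "person"),
            EntityRow("c", "org"),
            EntityRow("d", "person"),
        ]
    )
    rows = asyncio.run(enumerate_entities(store, label="person"))
    return [r.name for r in rows]
```

Sorting by ``name`` before slicing is what makes offset paging repeatable across calls; without a stable order, rows could be skipped or repeated between pages.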
## §K.12 invariant cross-check

| # | Invariant | This PR |
|---|-----------|---------|
| 1 | L1 graph data not polluted | ✅ ``list_entities`` is read-only; storage view → adapter projection only |
| 2 | L1 → L2 single-direction derive | ✅ no derived writes |
| 3 | Compactor before vector embed | N/A (read path) |
| 4 | Vector store via Adaptor | ✅ ``_fetch_shadow_neighbours`` uses ``VectorStoreConnector`` only |
| 5 | payload indexer filter | ✅ ``Eq("indexer","graph_entity")`` filter; no legacy ``entity_id`` payload reference |
| 6 | uuid5 vector point id | N/A (read path) |
| 7 | snapshot-diff lineage name set | N/A (read path) |
| 8 | alias_map persist orphan | ✅ unaffected; alias_map is W7-6 owned |
| 9 | upsert_entity alias redirect | ✅ unaffected; decorator pattern preserved (curation flow uses inner store directly per architect msg=cf860ae4) |
| 10 | DB column length application-cap | ✅ no schema CHECK constraints introduced |
| 11 | candidate detection write-only | ✅ ``MergeCandidateDetector`` unchanged; ``generate_run`` uses same write boundary |
| 12 | grep-zero LightRAG | ✅ `rg "from aperag.domains.knowledge_graph.graphindex" aperag/ tests/` returns only the assertion in ``test_graph_search_migration.py:55``. ``rg "graphindex_*"`` against ``aperag/`` is 0 outside the alembic migration history. The 8 Wave 6-era ``# -- LightRAG-style query layer`` comments + W7-4 line 249 fallback + W7-5 docstrings remain (architect msg=3fe200be: they are descriptive comments referencing design heritage, removable in Wave 8 cleanup if desired) |

## 4-pattern pre-check matrix (paste from PR thread reply)

* **P1 v1**: ``rg "from aperag.domains.knowledge_graph.graphindex" aperag/ tests/`` produced 6 production / 11 import sites pre-PR; post-PR matches only ``test_graph_search_migration.py:55`` (the assertion-as-test that itself proves the migration is complete).
* **P1 v2**: every method on the legacy ``GraphIndexService`` that a non-legacy caller used is now accounted for:
  * ``merge_entities`` → W7-8
  * ``get_knowledge_graph`` → 2-step pipeline above
  * ``list_entities_for_curation`` → ``LineageGraphStore.list_entities`` + ``CurationEntity.from_lineage``
  * ``find_entity_shadow_neighbors`` → ``_fetch_shadow_neighbours`` via ``VectorStoreConnector``
  * ``list_labels`` → already migrated W6 #40 + W7-1 ``compacted_description`` field
* **P2**: alembic ``c7e3a1b9f4d6`` drops the legacy tables, alembic env.py loses the legacy ``graphindex.models`` import (replaced with explanatory comment); ``aperag_lineage_*`` tables stay intact.
* **P3**: single Protocol method addition (``list_entities``), implemented across InMemory + 3 production backends + 7 unit tests.

## simple-stable 4-guardrail

| Guardrail | Status |
|---|---|
| #1 No unbounded scope creep | ✅ ``list_entities`` is base capability mirroring `delete_entity` / `query_entities_by_keyword`; no new endpoints, no new schema tables |
| #2 Ship as soon as possible | ✅ single PR closes Wave 7; all 11 prior task PRs already merged |
| #3 Simple and stable | ✅ adapter pattern preserves production-validated `build_candidate_pairs`; ``list_entities`` follows existing pagination idiom |
| #4 Maintenance-free private deployment | ✅ alembic auto-drops legacy tables; no operator config; ``list_entities`` uses the same backend factories the rest of Wave 7 wires |

## Test plan

- [x] All 1142 unit tests pass (``uv run pytest tests/unit_test/``)
- [x] ``alembic upgrade head --sql`` generates the expected ``DROP INDEX`` / ``DROP TABLE`` cascade
- [x] ``alembic heads`` resolves to single head ``c7e3a1b9f4d6``
- [x] ``ruff format --check`` / ``ruff check`` clean on touched files
- [ ] CI compat-graph + e2e-http stages (both gated post-merge)
- [ ] Pair with 冬柏's grep-zero helper PR #1763: flip ``_TASK10_LEGACY_DELETED=True`` once both PRs merge

Closes Wave 7. Next: architect final review per spec §K.12.12.

## Fold-in: grep-zero verification helper (formerly PR #1763)

Per architect ratify + 冬柏 authorization, PR #1763 is folded into this commit instead of shipping as a separate PR. Contents:

* **``tests/integration/test_w7_grep_zero_legacy_graphindex.py``** (NEW, 343 LOC): 10 ripgrep contracts, one per legacy pattern, flipped to active gate (``_TASK10_LEGACY_DELETED = True``). Patterns cover:
  1. ``from aperag.domains.knowledge_graph.graphindex`` imports
  2. ``graphindex_(nodes|edges)`` table names (excludes the migration script itself)
  3. bare ``import aperag.domains.knowledge_graph.graphindex``
  4. ``_sync_entity_relation_vectors`` (W7-3 superseded)
  5. ``_compact_oversized_descriptions`` (W7-2 superseded)
  6. ``_summarize_description`` (W7-2 superseded)
  7. ``_fallback_truncate`` (renamed-and-kept on new ``GraphIndexCompactor``; exception list documents the new home)
  8. ``_delete_removed_shadow_vectors`` (W7-3 superseded)
  9. ``GraphSearchContract.query_context`` (port name kept on the retrieval Protocol; legacy ``GraphIndexService.query_context`` historical-context comments allow-listed)
  10. ``GraphIndexService.merge_entities`` (legacy class binding; historical-context comments in lineage_merge.py + test_wave7_task8_wiring.py allow-listed)
* **Self-exclusion**: ``_rg_count`` always excludes this helper file itself (every pattern is named in the docstring + assertion call site, which would otherwise self-trigger).
* **``aperag/mcp/server.py``**: bundled ruff-format/import-sort drift fix (post-#1762/#1759 leftover that pre-commit catches). Kept here so the close-out PR lands cleanly through ``make lint``.
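The self-exclusion rule above is easy to get wrong, so here is a small sketch of the idea. It is only an illustration under stated assumptions: the real helper shells out to ripgrep via ``_rg_count``, whereas this version swaps in the standard library (``re`` + ``pathlib``) so it runs anywhere, and ``count_matches`` plus the file names are hypothetical.

```python
import re
import tempfile
from pathlib import Path
from typing import FrozenSet, Tuple


def count_matches(
    pattern: str, root: Path, *, exclude: FrozenSet[str] = frozenset()
) -> int:
    """Count lines under root matching pattern, skipping excluded file names."""
    regex = re.compile(pattern)
    hits = 0
    for path in sorted(root.rglob("*.py")):
        if path.name in exclude:
            continue
        for line in path.read_text(encoding="utf-8").splitlines():
            if regex.search(line):
                hits += 1
    return hits


def demo() -> Tuple[int, int]:
    pattern = r"from aperag\.domains\.knowledge_graph\.graphindex"
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Production-like file that no longer imports the legacy package.
        (root / "service.py").write_text(
            "from aperag.indexing import graph\n", encoding="utf-8"
        )
        # The gate file names the legacy import in its own docstring, so a
        # naive scan would count the gate itself as a violation.
        (root / "test_grep_zero.py").write_text(
            '"""No `from aperag.domains.knowledge_graph.graphindex` import may remain."""\n',
            encoding="utf-8",
        )
        naive = count_matches(pattern, root)
        gated = count_matches(pattern, root, exclude=frozenset({"test_grep_zero.py"}))
        return naive, gated
```

With the exclusion in place the count drops to zero, which is the condition the active gate asserts for each of the 10 patterns (allow-listed files would be handled the same way).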
## Test plan (final)

- [x] 1141 unit tests pass (``uv run pytest tests/unit_test/``)
- [x] 10/10 grep-zero integration tests pass (``uv run pytest tests/integration/test_w7_grep_zero_legacy_graphindex.py``)
- [x] ``alembic upgrade head --sql`` generates the expected ``DROP INDEX`` / ``DROP TABLE`` cascade
- [x] ``ruff format --check`` / ``ruff check`` clean on touched files
- [ ] CI e2e-http-smoke + e2e-http-provider (gated post-merge)

Closes Wave 7. Next: architect final review per spec §K.12.12.

Co-authored-by: 冬柏 <noreply@anthropic.com>
…nds + curation cutover (W8-2) (#1771)

Wave 8 task #13 (W8-2): adds a bulk-upsert primitive on the ``LineageGraphStore`` Protocol so callers consolidating N×M ``(description_part, lineage_member)`` tuples to the same target entity get one transaction / one round-trip instead of N×M sequential single upserts. Cuts ``LineageEntityMerger.merge_entities`` step 6a over to the bulk path.

## Why

``GraphCurationService.merge_entities`` (the N-source-into-1-target user merge flow) re-anchors every source's description parts under the target name. With N source entities each carrying M description parts, that loop emitted N×M sequential ``upsert_entity_with_lineage`` round-trips, one full transaction per part. For typical curation runs (3-10 sources × 5-20 parts each), that's 15-200 sequential SQL hits where one bulk write would do.

Per architect spec (Wave 8 candidate W8-2 sediment) + huangheng's task #6 PR #1758 perf observation, this folds into a single bulk write per backend.

## Scope (5 numbered items)

1. **Protocol method**: ``bulk_upsert_entity_with_lineage_parts(*, parts: Sequence[tuple[EntityRecord, LineageMember]])`` on ``aperag/indexing/graph.py:LineageGraphStore``. All ``record.name`` values MUST share the same string (a mismatch raises ``ValueError``). Empty parts is a no-op. Per-part dedup key is ``(document_id, parse_version)``, last-wins. Bulk path NEVER touches ``compacted_description`` (preserves the existing value, the same ``COALESCE``-style semantic as single upsert with ``None``).
2. **InMemory ref impl**: single ``asyncio.Lock``-guarded loop in ``InMemoryLineageGraphStore``.
3. **Postgres impl**: single ``INSERT … ON CONFLICT (collection_id, name) DO UPDATE`` that strips matching keys via ``jsonb_array_elements`` ``NOT EXISTS`` against an incoming ``strip_keys_json`` array, then appends the whole new ``new_members_json`` / ``new_parts_json`` arrays. One statement, atomic.
4. **Neo4j impl**: single Cypher MERGE + parallel-list strip-then-append, with the strip predicate matching against the **set** of incoming keys (``IN $strip_keys`` on the ``"<doc_id>|<pv>"`` key string). Row-lock on MERGE serialises concurrent bulk ops on the same entity.
5. **Nebula impl**: single ``EntityLock(target_name)`` acquire + single read / Python merge / write. Mirrors the existing read-modify-write pattern of single upsert but folds the strip-then-append over the **set** of incoming keys.

## Caller cutover

* ``LineageEntityMerger.merge_entities`` (step 6a): replaces the N×M ``for src in source_entities: for part in src.description_parts: await self._store.upsert_entity_with_lineage(...)`` with a single ``bulk_upsert_entity_with_lineage_parts`` call. Step 6b's sentinel write (``__curation_merge__`` final write with unified+compacted text) still goes through the single-upsert path because it needs the ``compacted_description`` column write that the bulk path intentionally doesn't carry.

## Alias-redirect decorator

* ``LineageGraphStoreWithAliasRedirect``: bulk path mirrors single upsert; each part's ``record.name`` resolves through the alias map before forwarding to the inner store. The merger always passes records pinned to the canonical ``final_target`` so the redirect is a no-op in that flow, but symmetry with the single upsert contract means a future caller writing to an aliased name still gets correct behaviour.

## 12-invariant cross-check (Wave 7 §K.12)

* **#1 L1 not polluted**: bulk path operates on lineage SETs only; the kg.jsonl raw extract layer is untouched. ✅
* **#3 indexer write redirect through alias map**: bulk path goes through the same decorator alias resolution as single upsert. ✅
* **#9 alias redirect transparent**: decorator forwards bulk to inner with redirected names; merger callers see no behaviour change. ✅
* **#10 DB column cap**: Postgres ``compacted_description`` Text column unchanged. The bulk path passes ``None`` end-to-end through the Postgres SQL: the INSERT branch sets NULL and the ON CONFLICT branch preserves the existing value (COALESCE-equivalent). ✅
* All other invariants: unaffected.

## 4-pattern pre-check matrix

* Pattern A (kg.jsonl shape): unchanged
* Pattern B (Lineage SET semantics): bulk preserves dedup-by-``(document_id, parse_version)`` exactly, the same key as single upsert. ✅
* Pattern C (Cypher LIST<MAP>): bulk reuses parallel-list encoding for the strip-then-append. ✅
* Pattern D (vector 3-field payload): unchanged

## Simple-stable 4-guardrail

* **#1 No unbounded scope creep**: 1 new Protocol method, no new public REST surface, no new schema column, no new alembic migration.
* **#2 Ship as soon as possible**: single PR, no spec amend needed (architect W8-2 candidate sediment + Wave 7 task #1 3-backend Protocol pattern reuse).
* **#3 Simple and stable**: bulk semantic mirrors single upsert exactly; caller cutover is a 1-call replacement of the N×M loop.
* **#4 Maintenance-free private deployment**: backend-portable (implemented across all 4 backends); no operator-facing config knob.

## Test plan

- [x] InMemory contract: 7 new tests pin empty / create / replace / mismatched-names / dedup-within-input / preserves-compacted / last-entity-type-wins.
- [x] Alias-redirect decorator: 3 new tests pin redirect-each-part / no-alias-passthrough / empty-short-circuit.
- [x] LineageEntityMerger step 6a cutover: existing ``test_source_parts_reanchored_preserving_doc_lineage`` rewritten to assert the new single bulk call (was: 3 sequential single upserts) + single sentinel final write.
- [x] All 1166 unit tests pass (up from 1141 baseline; 25 new tests).
- [x] Wave 7 grep-zero contract: 10/10 still pass (intent-driven gate unaffected).
- [x] ruff format / check clean on touched files.
- [ ] Cross-backend integration test: sediment to Wave 8 follow-up (out of scope this PR; the 4-backend Protocol contract is structurally identical to single upsert, which already has cross-backend integration coverage).
- [ ] CR by @huangheng (focus: invariant #1 L1, #9 alias redirect transparent, 4-backend cross-roundtrip on contract level).
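The bulk contract from the Scope section (same-name check, empty no-op, last-wins dedup on ``(document_id, parse_version)``, strip-then-append, ``compacted_description`` left alone) can be sketched in memory as below. This is a hedged illustration only: ``Part``, ``Entity`` and ``bulk_upsert_parts`` are hypothetical simplifications of ``(EntityRecord, LineageMember)`` and the real Protocol method, and the string type for ``parse_version`` is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple

Key = Tuple[str, str]  # (document_id, parse_version); the version type is assumed


@dataclass(frozen=True)
class Part:
    """Hypothetical flattening of (EntityRecord, LineageMember) into one row."""
    name: str
    document_id: str
    parse_version: str
    text: str


@dataclass
class Entity:
    name: str
    parts: List[Part] = field(default_factory=list)
    compacted_description: Optional[str] = None


def bulk_upsert_parts(store: Dict[str, Entity], parts: Sequence[Part]) -> None:
    if not parts:
        return  # empty input is a no-op
    names = {p.name for p in parts}
    if len(names) != 1:
        raise ValueError(f"all parts must target one entity, got {sorted(names)}")
    # Last-wins dedup inside the incoming batch.
    incoming: Dict[Key, Part] = {}
    for p in parts:
        incoming[(p.document_id, p.parse_version)] = p
    name = parts[0].name
    entity = store.setdefault(name, Entity(name=name))
    # Strip existing parts whose key is in the incoming set, then append.
    kept = [p for p in entity.parts if (p.document_id, p.parse_version) not in incoming]
    entity.parts = kept + list(incoming.values())
    # compacted_description is deliberately not touched here.


def demo() -> Tuple[List[str], Optional[str]]:
    store = {
        "Acme": Entity(
            "Acme",
            [Part("Acme", "d1", "v1", "old"), Part("Acme", "d0", "v1", "keep")],
            "summary",
        )
    }
    bulk_upsert_parts(
        store,
        [
            Part("Acme", "d1", "v1", "new-a"),
            Part("Acme", "d2", "v1", "x"),
            Part("Acme", "d1", "v1", "new-b"),  # same key as new-a, so it wins
        ],
    )
    entity = store["Acme"]
    return [p.text for p in entity.parts], entity.compacted_description
```

Matching the strip predicate against the whole set of incoming keys, rather than one key at a time, is what lets each backend express the operation as a single statement or a single read-merge-write.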
## Summary

Wave 7 §K.12.6 / §K.12.7 / §K.12.10b task #6: full storage + service body for user-driven entity merge over the lineage graph. One PR, 24 unit tests, zero changes to the three storage backends (per architect Option (b) lock + huangheng endorse).

Per architect ratify msg=cf860ae4 (5-drift lock) + huangheng endorse msg=22816e0d (sentinel `__curation_merge__`, decorator passthrough invariant, step ordering L1 → vector → delete).

## What ships

1. `aperag_lineage_entity_alias` table (alembic `b5d2e8f1c9a4`): composite PK `(collection_id, alias_name)` + `(collection_id, canonical_name)` index for reverse lookup.
2. `LineageEntityAlias` ORM (`aperag/domains/knowledge_graph/db/models.py`): colocated with the other curation domain models.
3. `AliasMapRepository` (`aperag/graph_curation/alias_map.py`):
   - `resolve_canonical(collection_id, name)`: single-indirection read; transitive flatten keeps the table 1-deep at write time.
   - `upsert_alias(...)`: cycle reject (raises `AliasCycleError`) + transitive flatten (UPDATE old rows pointing at the alias name + INSERT new row, single transaction).
   - `list_aliases_pointing_at(...)` for tests / admin tooling; `purge_collection(...)` for collection teardown.
4. `LineageGraphStoreWithAliasRedirect` decorator (`aperag/indexing/alias_redirect_store.py`): wraps any `LineageGraphStore` + `AliasMapRepository`, intercepts `upsert_entity_with_lineage` / `upsert_relation_with_lineage` to rewrite entity names through the alias map, and forwards every other Protocol method byte-for-byte. Zero changes to the three backend implementations.
5. `LineageEntityMerger` (`aperag/graph_curation/lineage_merge.py`): the user-driven merge orchestrator. Step ordering locked: alias upsert → L1 source-parts re-anchored preserving `(document_id, parse_version, chunk_ids)` → L1 final unified+compacted with `__curation_merge__` sentinel → vector upsert (3-field payload, `uuid5` id) → source delete (L1 + vector) last.

## §K.12 invariant cross-check (12-item)

- #1 L1 not polluted: source parts re-anchored under target preserve original `(document_id, parse_version, chunk_ids)` lineage; pinned by `test_source_parts_reanchored_preserving_doc_lineage`.
- #2 L1 → L2 derivation: step ordering pinned by `test_step_order_is_l1_then_vector_then_delete` (L1 writes precede vector writes; deletes last).
- #3 `upsert_entity_with_lineage` transparent alias redirect: pinned by `test_indexer_upsert_after_merge_redirects_to_canonical`.
- #4 vector store via `VectorStoreConnector` Adaptor: `VectorStoreConnector.upsert/delete` via `asyncio.to_thread`.
- #5 3-field payload, `indexer="graph_entity"` filter: pinned by `test_vector_payload_is_3_field_with_deterministic_uuid5` (3-field strict: `{indexer, entity_name, entity_type}`).
- #6 uuid5 deterministic point id: same test pins `uuid5(NAMESPACE_DNS, "graph_entity:{cid}:{name}")`.
- #7 alias_map orphan persist: pinned by `test_alias_persists_after_canonical_gc`.
- #9 cycle flatten + reject: pinned by 3 tests (`test_transitive_flatten_rewrites_existing_alias_rows`, `test_cycle_reject_self_loop`, `test_cycle_reject_through_existing_chain`).
- #10 DB column cap: … `String(512)` cap as `aperag_lineage_entity.name`).
- #11 candidate detection write-only: `LineageEntityMerger.merge_entities` is the only entry point; `MergeCandidateDetector` (PR #1755) does not import it.
- #12 grep-zero LightRAG: `grep -rn 'LightRAG|lightrag' aperag/graph_curation/{alias_map,lineage_merge}.py aperag/indexing/alias_redirect_store.py aperag/migration/versions/20260428030000-* tests/unit_test/{graph_curation,indexing}/` → 0 hits.

## 4-pattern pre-check matrix
**Pattern 1 v1 (caller-grep):**

→ legacy path still in place; this PR ships the replacement infrastructure. Task #8 (chenyexuan) wires the REST route to `LineageEntityMerger`. Detector (#1755) confirmed never calls `merge_entities`.

**Pattern 1 v2:** n/a (new component).
**Pattern 2 (dependency interface grep):**

- `LineageGraphStore.upsert_entity_with_lineage(*, record, lineage, compacted_description)` (`aperag/indexing/graph.py:533`, post task #1): async-native, kw-only.
- `LineageGraphStore.delete_entity(entity_name) -> bool` (`aperag/indexing/graph.py:511`, post task #1): single-arg.
- `GraphIndexCompactor.compact_if_oversized(parts, *, subject_kind, subject_label, language)` (`aperag/indexing/graph_compactor.py`, post task #2 fix-forward): kwargs locked.
- `VectorStoreConnector.upsert(points: Sequence[VectorPoint]) -> List[str]` / `.delete(ids: Sequence[str])` (`aperag/vectorstore/base.py`): sync, wrapped via `asyncio.to_thread`.
- `EmbeddingService.embed_query(text) -> list[float]` (`aperag/llm/embed/embedding_service.py:128`): sync, wrapped via `asyncio.to_thread`.

**Pattern 3 (per-collection vs per-tenant binding):**

→ Both the decorator and the merger are per-collection bound at construction; `AliasMapRepository` is the only stateless singleton (takes `collection_id` per call), which mirrors the existing `GraphCurationService` repo convention so alembic / test setup is unchanged.

## simple-stable 4-guardrail
- … `worker_factory._build_lineage_graph_store` + REST route swap in `KnowledgeGraphService.merge_entities`) is task #8's job since it co-lands with retrieval cutover.
- … `__curation_merge__` is a const; thresholds inherited from existing services.

## Scope notes (called out for reviewers)

- `worker_factory._build_lineage_graph_store` does NOT yet wrap with the alias-redirect decorator, and the REST `/graphs/merge` route still calls legacy `GraphIndexService.merge_entities`. Both are one-line / few-line changes that belong in task #8 (chenyexuan retrieval cutover) so we don't ship a partial wiring on its own. The new infrastructure is fully unit-tested and ready to wire.
- … `(document_id, parse_version)` keys make every upsert idempotent).

## Test plan
- 24 unit cases pass (`uv run pytest tests/unit_test/graph_curation/test_alias_map.py tests/unit_test/graph_curation/test_lineage_merge.py tests/unit_test/indexing/test_alias_redirect_store.py -v`):
  - `test_alias_map.py` (10): cycle reject self-loop + chain cycle, transitive flatten, target flatten through chain, alias persists after canonical GC, per-collection isolation, purge.
  - `test_alias_redirect_store.py` (5): indexer write redirects to canonical, no-alias passthrough, both-endpoint relation redirect, single-endpoint relation redirect, decorator passthrough invariant for all 13 non-upsert Protocol methods (huangheng CR lock).
  - `test_lineage_merge.py` (9): empty source short-circuit, step order L1 → vector → delete, sentinel `__curation_merge__`, Compactor kwargs locked (subject_kind/subject_label/language), source parts re-anchored preserving per-doc lineage, vector payload 3-field + uuid5 deterministic, alias cycle propagation, target alias chain flatten, target GC tolerance.
- `ruff check` + `ruff format --check` clean.

## References

- `c1499777` (#1) + `c1c48429` (#2) for prior task storage / Compactor.
- #删除老graph ("delete old graph"): architect 5-drift ratify msg=cf860ae4, huangheng endorse + decorator passthrough lock msg=22816e0d, sentinel pick msg=22816e0d.

🤖 Generated with Claude Code
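To make the alias-map semantics from "What ships" concrete (1-deep table, transitive flatten on write, cycle reject), here is a minimal in-memory sketch. It is an assumption-laden stand-in: the real `AliasMapRepository` runs an UPDATE + INSERT in one SQL transaction and its method signatures may differ; `InMemoryAliasMap` and the local `AliasCycleError` class are hypothetical.

```python
from typing import Dict, Tuple


class AliasCycleError(ValueError):
    """Raised when an alias would end up pointing back at itself."""


class InMemoryAliasMap:
    """Toy alias map keyed by (collection_id, alias_name) -> canonical_name."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], str] = {}

    def resolve_canonical(self, collection_id: str, name: str) -> str:
        # Single-indirection read: the table is kept 1-deep at write time.
        return self._rows.get((collection_id, name), name)

    def upsert_alias(self, collection_id: str, alias_name: str, canonical_name: str) -> None:
        if alias_name == canonical_name:
            raise AliasCycleError(f"self-loop: {alias_name!r}")
        # Flatten the target through any existing chain.
        target = self.resolve_canonical(collection_id, canonical_name)
        if target == alias_name:
            raise AliasCycleError(f"{alias_name!r} -> {canonical_name!r} closes a cycle")
        # Rewrite rows that pointed at the new alias so they stay 1-deep.
        for (cid, alias), canon in list(self._rows.items()):
            if cid == collection_id and canon == alias_name:
                self._rows[(cid, alias)] = target
        self._rows[(collection_id, alias_name)] = target


def demo() -> Tuple[str, str, bool]:
    amap = InMemoryAliasMap()
    amap.upsert_alias("c1", "IBM Corp", "IBM")
    amap.upsert_alias("c1", "IBM", "International Business Machines")
    try:
        amap.upsert_alias("c1", "International Business Machines", "IBM Corp")
        cycle_rejected = False
    except AliasCycleError:
        cycle_rejected = True
    return (
        amap.resolve_canonical("c1", "IBM Corp"),  # flattened through the chain
        amap.resolve_canonical("c2", "IBM"),       # other collection is isolated
        cycle_rejected,
    )
```

Doing the flatten at write time is the trade-off that keeps `resolve_canonical` a single lookup on the hot indexer path, at the cost of a slightly heavier `upsert_alias`.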