
feat(Wave 7 #6): alias_map + transparent redirect + LineageEntityMerger #1758

Merged
earayu merged 1 commit into main from w7-6-graph-curation-merge on Apr 27, 2026

@earayu earayu commented Apr 27, 2026

Summary

Wave 7 §K.12.6 / §K.12.7 / §K.12.10b task #6 — full storage + service body for user-driven entity merge over the lineage graph. One PR, 24 unit tests, zero changes to the three storage backends (per architect Option (b) lock + huangheng endorse).

Per architect ratify msg=cf860ae4 (5-drift lock) + huangheng endorse msg=22816e0d (sentinel __curation_merge__, decorator passthrough invariant, step ordering L1 → vector → delete).

What ships

  1. aperag_lineage_entity_alias table (alembic b5d2e8f1c9a4) — composite PK (collection_id, alias_name) + (collection_id, canonical_name) index for reverse lookup.
  2. LineageEntityAlias ORM (aperag/domains/knowledge_graph/db/models.py) — colocated with the other curation domain models.
  3. AliasMapRepository (aperag/graph_curation/alias_map.py):
    • resolve_canonical(collection_id, name) — single-indirection read; transitive flatten keeps the table 1-deep at write time.
    • upsert_alias(...) — cycle reject (raises AliasCycleError) + transitive flatten (UPDATE old rows pointing at the alias name + INSERT new row, single transaction).
    • list_aliases_pointing_at(...) for tests / admin tooling; purge_collection(...) for collection teardown.
  4. LineageGraphStoreWithAliasRedirect decorator (aperag/indexing/alias_redirect_store.py) — wraps any LineageGraphStore + AliasMapRepository, intercepts upsert_entity_with_lineage / upsert_relation_with_lineage to rewrite entity names through the alias map, forwards every other Protocol method byte-for-byte. Zero changes to the three backend implementations.
  5. LineageEntityMerger (aperag/graph_curation/lineage_merge.py) — the user-driven merge orchestrator. Step ordering locked: alias upsert → L1 source-parts re-anchored preserving (document_id, parse_version, chunk_ids) → L1 final unified+compacted with __curation_merge__ sentinel → vector upsert (3-field payload, uuid5 id) → source delete (L1 + vector) last.
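
The cycle-reject and transitive-flatten rules in item 3 can be sketched with an in-memory dict. This is a minimal illustration, not the repository's code: `InMemoryAliasMap` is a hypothetical stand-in for `AliasMapRepository`, it replaces the SQL UPDATE + INSERT transaction with dict writes, and the positional `upsert_alias` arguments are an assumption (the PR elides them as `...`).

```python
class AliasCycleError(ValueError):
    """Raised when an alias write would make a name resolve back to itself."""


class InMemoryAliasMap:
    """Hypothetical in-memory stand-in for the alias table."""

    def __init__(self) -> None:
        # (collection_id, alias_name) -> canonical_name, mirroring the composite PK.
        self._rows: dict[tuple[str, str], str] = {}

    def resolve_canonical(self, collection_id: str, name: str) -> str:
        # Single lookup: writes keep the map 1-deep, so no chain walking is needed.
        return self._rows.get((collection_id, name), name)

    def upsert_alias(self, collection_id: str, alias_name: str, canonical_name: str) -> None:
        # Flatten the target first so a row never points at another alias.
        target = self.resolve_canonical(collection_id, canonical_name)
        if target == alias_name:
            raise AliasCycleError(f"{alias_name!r} -> {canonical_name!r} would form a cycle")
        # Transitive flatten: rows that pointed at alias_name now point at target.
        for key, value in list(self._rows.items()):
            if key[0] == collection_id and value == alias_name:
                self._rows[key] = target
        self._rows[(collection_id, alias_name)] = target
```

With this shape, writing `A → B` and then `B → C` leaves both `A` and `B` resolving to `C` in one lookup, while a later `C → A` is rejected because `A` already resolves to `C`.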

§K.12 invariant cross-check (12-item)

| # | Invariant | This PR |
|---|---|---|
| 1 | L1 graph data is not polluted | ✅ source parts re-anchored under target preserve original (document_id, parse_version, chunk_ids) lineage; pinned by test_source_parts_reanchored_preserving_doc_lineage |
| 2 | L1 → L2 one-way derivation | ✅ step ordering pinned by test_step_order_is_l1_then_vector_then_delete (L1 writes precede vector writes; deletes last) |
| 3 | upsert_entity_with_lineage transparent alias redirect | ✅ pinned by test_indexer_upsert_after_merge_redirects_to_canonical |
| 4 | Vector store via VectorStoreConnectorAdaptor | ✅ uses VectorStoreConnector.upsert/delete via asyncio.to_thread |
| 5 | payload indexer="graph_entity" filter | ✅ pinned by test_vector_payload_is_3_field_with_deterministic_uuid5 (3-field strict: {indexer, entity_name, entity_type}) |
| 6 | uuid5 deterministic vector point id | ✅ same test pins uuid5(NAMESPACE_DNS, "graph_entity:{cid}:{name}") |
| 7 | alias_map persistence independent of doc lifecycle | ✅ pinned by test_alias_persists_after_canonical_gc |
| 8 | snapshot-diff via lineage entity name set | n/a (covered by task #3 #1757) |
| 9 | cycle flatten + reject | ✅ 3 tests: test_transitive_flatten_rewrites_existing_alias_rows, test_cycle_reject_self_loop, test_cycle_reject_through_existing_chain |
| 10 | DB column length application-layer cap | n/a (alias names use the same String(512) cap as aperag_lineage_entity.name) |
| 11 | Candidate detection only writes candidates, never auto-merges (D-3) | LineageEntityMerger.merge_entities is the only entry point; MergeCandidateDetector (PR #1755) does not import it |
| 12 | Naming: grep-zero LightRAG | `grep -rn 'LightRAG\|lightrag' aperag/graph_curation/{alias_map,lineage_merge}.py aperag/indexing/alias_redirect_store.py aperag/migration/versions/20260428030000-* tests/unit_test/{graph_curation,indexing}/` → 0 hits |
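
Invariants 5 and 6 can be illustrated with the standard-library `uuid` module. This is a sketch under assumptions: the helper names `graph_entity_point_id` and `graph_entity_payload` are hypothetical, while the key format and the three payload fields are taken from the table above.

```python
import uuid

GRAPH_ENTITY_INDEXER = "graph_entity"


def graph_entity_point_id(collection_id: str, entity_name: str) -> str:
    # uuid5 is a pure function of (namespace, name), so re-running a merge
    # maps the same entity to the same vector point id.
    key = f"{GRAPH_ENTITY_INDEXER}:{collection_id}:{entity_name}"
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, key))


def graph_entity_payload(entity_name: str, entity_type: str) -> dict[str, str]:
    # Strict 3-field payload; the "indexer" field is what search filters on.
    return {
        "indexer": GRAPH_ENTITY_INDEXER,
        "entity_name": entity_name,
        "entity_type": entity_type,
    }
```

Because the id is derived rather than generated, a retried vector upsert overwrites the same point instead of creating a duplicate.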

4-pattern pre-check matrix

Pattern 1 v1 — caller-grep:

$ grep -rn 'GraphIndexService.merge_entities\|class.*Alias\|alias_map' aperag/ tests/ | grep -v __pycache__ | grep -v __init__
aperag/domains/knowledge_graph/service.py:171: result = await svc.merge_entities(...)  # legacy KnowledgeGraphService.merge_entities → GraphIndexService.merge_entities (route handler — task #8 cutover)
aperag/domains/knowledge_graph/api/routes.py:160: ...graph_service.merge_entities(...)  # REST POST /graphs/merge (task #8 cutover)

→ legacy path still in place; this PR ships the replacement infrastructure. Task #8 (chenyexuan) wires the REST route to LineageEntityMerger. Detector (#1755) confirmed never calls merge_entities.

Pattern 1 v2 — n/a (new component).

Pattern 2 — dependency interface grep:

  • LineageGraphStore.upsert_entity_with_lineage(*, record, lineage, compacted_description) (aperag/indexing/graph.py:533, post task #1): async-native, kw-only.
  • LineageGraphStore.delete_entity(entity_name) -> bool (aperag/indexing/graph.py:511, post task #1): single-arg.
  • GraphIndexCompactor.compact_if_oversized(parts, *, subject_kind, subject_label, language) (aperag/indexing/graph_compactor.py, post task #2 fix-forward): kwargs locked.
  • VectorStoreConnector.upsert(points: Sequence[VectorPoint]) -> List[str] / .delete(ids: Sequence[str]) (aperag/vectorstore/base.py) — sync, wrapped via asyncio.to_thread.
  • EmbeddingService.embed_query(text) -> list[float] (aperag/llm/embed/embedding_service.py:128) — sync, wrapped via asyncio.to_thread.
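
The "sync, wrapped via asyncio.to_thread" note on the last two dependencies can be sketched as follows. `FakeSyncConnector` is a hypothetical stand-in with simplified point tuples, not the real `VectorStoreConnector`; only `asyncio.to_thread` itself is a real API here.

```python
import asyncio
from typing import Sequence


class FakeSyncConnector:
    """Hypothetical blocking connector with upsert/delete, for illustration only."""

    def __init__(self) -> None:
        self.stored: dict[str, list[float]] = {}

    def upsert(self, points: Sequence[tuple[str, list[float]]]) -> list[str]:
        for point_id, vector in points:
            self.stored[point_id] = vector
        return [point_id for point_id, _ in points]

    def delete(self, ids: Sequence[str]) -> None:
        for point_id in ids:
            self.stored.pop(point_id, None)


async def upsert_points(connector: FakeSyncConnector, points) -> list[str]:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(connector.upsert, points)


async def delete_points(connector: FakeSyncConnector, ids: Sequence[str]) -> None:
    await asyncio.to_thread(connector.delete, ids)
```

A caller inside an async service would simply `await upsert_points(connector, points)`; the connector itself stays untouched.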

Pattern 3 — per-collection vs per-tenant binding:
→ Both the decorator and the merger are per-collection bound at construction; AliasMapRepository is the only stateless singleton (takes collection_id per call) — mirrors existing GraphCurationService repo convention so alembic / test setup is unchanged.

simple-stable 4-guardrail

  1. No unbounded scope growth ✅ Option (b) decorator → 0 changes to 3 storage backends; LLM judge tier deferred (architect ratify); cross-doc full sweep deferred (no caller).
  2. Build it solid and ship soon ✅ ships in one PR; wiring (one-line in worker_factory._build_lineage_graph_store + REST route swap in KnowledgeGraphService.merge_entities) is task #8's job since it co-lands with retrieval cutover.
  3. Simple and stable over complex ✅ existing algorithms reused (Compactor, Vector connector, LLM); transitive flatten is the simplest cycle-safe pattern (1 SQL UPDATE + 1 SQL INSERT in a transaction); alphabetical-first canonical is deterministic.
  4. Maintenance-free after private deployment ✅ no operator config; sentinel __curation_merge__ is a const; thresholds inherited from existing services.

Scope notes (called out for reviewers)

  • Wiring is deferred to task #8: worker_factory._build_lineage_graph_store does NOT yet wrap with the alias-redirect decorator, and the REST /graphs/merge route still calls legacy GraphIndexService.merge_entities. Both are one-line / few-line changes that belong in task #8 (chenyexuan retrieval cutover) so we don't ship a partial wiring on its own. The new infrastructure is fully unit-tested and ready to wire.
  • Soft-transition semantics documented in the merger module docstring: a mid-step failure leaves a partially-merged target (sources still readable until step 8 deletes them) and the merge is forward-only retry safe ((document_id, parse_version) keys make every upsert idempotent).
  • N×M upsert performance (huangheng note in msg=22816e0d): 10 sources × 100 parts = 1000 serial upserts is theoretical — Wave 8+ optimization candidate, not in scope.
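
The forward-only retry claim in the soft-transition note can be illustrated with a toy store. Everything here is hypothetical (`Part`, `FakeLineageStore`, `reanchor`, the `fail_after` switch); it only shows why keying each upsert by (document_id, parse_version) and deleting the source last makes a retry safe.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Part:
    document_id: str
    parse_version: int
    chunk_ids: tuple[str, ...]
    description: str


class FakeLineageStore:
    """Hypothetical store: entity name -> parts keyed by (document_id, parse_version)."""

    def __init__(self) -> None:
        self.parts: dict[str, dict[tuple[str, int], Part]] = {}

    def upsert_part(self, entity_name: str, part: Part) -> None:
        # Same key overwrites the same slot, so repeating an upsert is a no-op.
        key = (part.document_id, part.parse_version)
        self.parts.setdefault(entity_name, {})[key] = part

    def delete_entity(self, entity_name: str) -> bool:
        return self.parts.pop(entity_name, None) is not None


def reanchor(store: FakeLineageStore, source: str, target: str,
             fail_after: Optional[int] = None) -> None:
    # Copy every source part under the target, keeping its original lineage key.
    for i, part in enumerate(list(store.parts.get(source, {}).values())):
        if fail_after is not None and i >= fail_after:
            raise RuntimeError("simulated mid-step failure")
        store.upsert_part(target, part)
    # The source is deleted last, so it stays readable if anything above fails.
    store.delete_entity(source)
```

If the first attempt fails partway, the target holds a subset of the parts and the source is intact; running `reanchor` again rewrites the same keys and then removes the source.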

Test plan

24 unit cases pass (uv run pytest tests/unit_test/graph_curation/test_alias_map.py tests/unit_test/graph_curation/test_lineage_merge.py tests/unit_test/indexing/test_alias_redirect_store.py -v):

  • test_alias_map.py (10): cycle reject self-loop + chain cycle, transitive flatten, target flatten through chain, alias persists after canonical GC, per-collection isolation, purge.
  • test_alias_redirect_store.py (5): indexer write redirects to canonical, no-alias passthrough, both-endpoint relation redirect, single-endpoint relation redirect, decorator passthrough invariant for all 13 non-upsert Protocol methods (huangheng CR lock).
  • test_lineage_merge.py (9): empty source short-circuit, step order L1 → vector → delete, sentinel __curation_merge__, Compactor kwargs locked (subject_kind/subject_label/language), source parts re-anchored preserving per-doc lineage, vector payload 3-field + uuid5 deterministic, alias cycle propagation, target alias chain flatten, target GC tolerance.
  • ruff check + ruff format --check clean.
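
A step-order test of the kind listed above can be written by having every fake append to one shared log. The sketch below is an assumption about the technique, not the repository's test: `toy_merge` is a hypothetical four-step stand-in for the merger, and the fake `upsert_entity_with_lineage` takes a bare name instead of the real kw-only `record` / `lineage` arguments.

```python
import asyncio


class FakeGraphStore:
    def __init__(self, log: list) -> None:
        self._log = log

    async def upsert_entity_with_lineage(self, name: str) -> None:
        self._log.append(("l1_upsert", name))

    async def delete_entity(self, name: str) -> bool:
        self._log.append(("l1_delete", name))
        return True


class FakeVectorConnector:
    def __init__(self, log: list) -> None:
        self._log = log

    def upsert(self, ids: list[str]) -> list[str]:
        self._log.append(("vector_upsert", tuple(ids)))
        return list(ids)

    def delete(self, ids: list[str]) -> None:
        self._log.append(("vector_delete", tuple(ids)))


async def toy_merge(store: FakeGraphStore, vectors: FakeVectorConnector,
                    source: str, target: str) -> None:
    # L1 write first, then the derived vector point, then source deletes last.
    await store.upsert_entity_with_lineage(target)
    await asyncio.to_thread(vectors.upsert, [target])
    await store.delete_entity(source)
    await asyncio.to_thread(vectors.delete, [source])


def step_kinds(log: list) -> list[str]:
    return [kind for kind, _ in log]
```

A test then runs the merge once and asserts on `step_kinds(log)`, so any reordering of the steps changes the recorded sequence and fails the assertion.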

References

🤖 Generated with Claude Code

…t + LineageEntityMerger

§K.12.6 / §K.12.7 / §K.12.10b task #6 — full storage + service body for
user-driven entity merge over the lineage graph. Per architect ratify
msg=cf860ae4 + huangheng endorse msg=22816e0d (5-drift lock) +
sentinel pick msg=22816e0d (`__curation_merge__`).

What ships
----------

1. **`aperag_lineage_entity_alias` table** (alembic
   ``b5d2e8f1c9a4``) — composite PK ``(collection_id, alias_name)``,
   ``canonical_name`` index for reverse lookup.
2. **`LineageEntityAlias` ORM** (`aperag/domains/knowledge_graph/db/
   models.py`) — re-uses the curation domain's existing ``Base``.
3. **`AliasMapRepository`** (`aperag/graph_curation/alias_map.py`):
   - `resolve_canonical(collection_id, name)` — single-indirection
     read (transitive flatten keeps the table 1-deep at write time).
   - `upsert_alias(...)` — cycle reject (`AliasCycleError`) +
     transitive flatten (UPDATE + INSERT in one transaction).
   - `list_aliases_pointing_at(...)` for tests / admin tooling.
   - `purge_collection(...)` for collection teardown.
4. **`LineageGraphStoreWithAliasRedirect` decorator**
   (`aperag/indexing/alias_redirect_store.py`) — wraps any
   ``LineageGraphStore`` + ``AliasMapRepository``, intercepts
   ``upsert_entity_with_lineage`` / ``upsert_relation_with_lineage``
   to rewrite entity names through the alias map, forwards every
   other Protocol method byte-for-byte. Per huangheng CR lock
   (Option (b), msg=93d9add1 / msg=22816e0d): zero changes to the
   three backend store implementations.
5. **`LineageEntityMerger`** (`aperag/graph_curation/lineage_merge.py`)
   — orchestrator for user-driven merge. Step ordering locked
   (invariant #2): alias upserts → L1 source-parts re-anchored
   preserving doc lineage → L1 final unified+compacted with
   ``__curation_merge__`` sentinel → vector upsert (3-field payload,
   ``uuid5`` deterministic id) → source delete (L1 + vector) last.
6. **24 unit tests**:
   - `test_alias_map.py` (10): cycle reject self-loop + chain cycle,
     transitive flatten, target flatten through chain, alias persists
     after canonical GC, per-collection isolation, purge.
   - `test_alias_redirect_store.py` (5): indexer write redirects to
     canonical, no-alias passthrough, both-endpoint relation
     redirect, single-endpoint relation redirect, decorator
     passthrough invariant for all 13 non-upsert Protocol methods
     (huangheng CR lock).
   - `test_lineage_merge.py` (9): empty source short-circuit, step
     order L1 → vector → delete, sentinel ``__curation_merge__``,
     Compactor kwargs locked (subject_kind/subject_label/language),
     source parts re-anchored preserving per-doc lineage, vector
     payload 3-field + uuid5 deterministic, alias cycle propagation,
     target alias chain flatten, target GC tolerance.

§K.12 invariant cross-check
---------------------------

- #1 L1 not polluted: source parts re-anchored under target preserve
  original `(document_id, parse_version, chunk_ids)` lineage —
  pinned by `test_source_parts_reanchored_preserving_doc_lineage`.
- #2 L1 → L2 derivation: step ordering pinned by
  `test_step_order_is_l1_then_vector_then_delete` (L1 writes precede
  vector writes; deletes last).
- #3 transparent alias redirect: pinned by
  `test_indexer_upsert_after_merge_redirects_to_canonical`.
- #4 vector store via abstraction: ``VectorStoreConnector.upsert/delete``
  via ``asyncio.to_thread``.
- #5 3-field payload: pinned by
  `test_vector_payload_is_3_field_with_deterministic_uuid5`.
- #6 uuid5 deterministic point id: same test pins
  ``uuid5(NAMESPACE_DNS, "graph_entity:{cid}:{name}")``.
- #7 alias_map orphan persist: pinned by
  `test_alias_persists_after_canonical_gc`.
- #9 cycle flatten + reject: pinned by 3 tests
  (`test_transitive_flatten_rewrites_existing_alias_rows`,
  `test_cycle_reject_self_loop`,
  `test_cycle_reject_through_existing_chain`).
- #11 D-3 merge user-driven only: `LineageEntityMerger.merge_entities`
  is the only entry point; `MergeCandidateDetector` (PR #1755) does
  not import it.
- #12 grep-zero LightRAG: 0 hits in new files.

Scope notes
-----------

- **Wiring**: `worker_factory._build_lineage_graph_store` is NOT
  changed in this PR. Bryce's task #3 PR (#1757), just merged, adds the
  worker.sync() consumption path; wiring the alias-redirect decorator
  into `_build_lineage_graph_store` is a one-line follow-up that
  belongs with task #8 retrieval cutover (where the legacy merge route
  is also replaced) so we don't ship a partial wiring.
- **REST route cutover**: legacy
  `KnowledgeGraphService.merge_entities` (route handler) still calls
  `GraphIndexService.merge_entities`. Replacing it with
  `LineageEntityMerger` is task #8's job (chenyexuan) — it changes
  no field on the response shape so cuiwenbo task #9 frontend is
  zero-touch.

@earayu earayu left a comment

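🟢 LGTM ✅ (huangheng pass-1, per spec §K.12.11). GitHub does not allow approving from the same account; verdict = ready to merge.

The full 12-invariant table, the 4-pattern matrix, and the simple-stable 4-guardrail list are all pasted in the PR body (the seventh PR to meet the hard-gate format). The 24 tests cover every architectural invariant in spec §K.12.6 / §K.12.7 / §K.12.10b. All 5 drifts are fixed at byte level (the 5 items I surfaced in the earlier thread match the current outline), the decorator passthrough is strictly kept, and the sentinel __curation_merge__ is consistent.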

12-invariant cross-check (task #6 scope, 10 material items ✅)

| # | Invariant | This PR | Evidence |
|---|---|---|---|
| 1 | L1 graph data is not polluted (kg.jsonl raw vs storage view strictly layered) | ✅ | source parts are re-anchored while keeping the original (document_id, parse_version, chunk_ids) lineage (lineage_merge.py:266-284); test_source_parts_reanchored_preserving_doc_lineage |
| 2 | L1 → L2 one-way derivation (step ordering) | ✅ | Step 6a/6b L1 writes → Step 7 vector → Step 8 delete sources; test_step_order_is_l1_then_vector_then_delete |
| 3 | upsert_entity_with_lineage transparent alias redirect | ✅ | LineageGraphStoreWithAliasRedirect lines 87-107 (entity) / 109-136 (relation); test_indexer_upsert_after_merge_redirects_to_canonical pins the inseparability gate |
| 4 | Vector store via VectorStoreConnectorAdaptor | ✅ | lineage_merge.py:_upsert_vector_point calls await asyncio.to_thread(connector.upsert/delete, ...); no direct Qdrant import |
| 5 | 3-field payload {indexer, entity_name, entity_type} | ✅ | lines 388-401, strictly 3 fields; test_vector_payload_is_3_field_with_deterministic_uuid5 |
| 6 | uuid5 deterministic vector point id | ✅ | lines 389-394 f"{GRAPH_ENTITY_INDEXER}:{collection_id}:{entity_name}"; pinned by the same test |
| 7 | alias_map orphan persist (canonical GC does not cascade) | ✅ | alias_map.py does not cascade-delete on canonical GC; test_alias_persists_after_canonical_gc |
| 8 | snapshot-diff via lineage entity name set | n/a | task #3 scope |
| 9 | cycle flatten + reject + transitive flatten | ✅ | AliasMapRepository.upsert_alias lines 153-157 reject + 181-191 transitive flatten in a single transaction; pinned by 3 tests (self-loop / chain cycle / transitive flatten) |
| 10 | DB column length cap | n/a | alias table uses String(512), same as the entity table |
| 11 | Candidate detection only writes candidates, never auto-merges (D-3) | ✅ | LineageEntityMerger.merge_entities is the only user-driven entry point; MergeCandidateDetector (PR #1755, verified) does not import this service |
| 12 | Naming: grep-zero LightRAG | ✅ | `gh pr diff 1758 \| grep -E '^\+' \| grep -i lightrag` → 0 added lines |

Byte-level verification of the 5 drift fixes (per architect ratify msg=cf860ae4 + my msg=22816e0d)

| Drift | Spec lock | Where it lands in this PR |
|---|---|---|
| #1 delete_entity(entity_name) single-arg | task #1 PR #1754 Protocol | lineage_merge.py:314 delete_entity(src_entity.name) |
| #2 upsert_entity_with_lineage(record=EntityRecord, lineage=LineageMember) shape + sentinel __curation_merge__ | architect ratify | lines 271-303: every call uses the record/lineage shape; line 297 document_id=CURATION_MERGE_DOCUMENT_ID (line 99 const = "__curation_merge__") ✅ |
| #3 Compactor gains subject_kind/subject_label/language kwargs | task #2 fix-forward | lines 254-259 compact_if_oversized([unified], subject_kind="entity", subject_label=final_target, language=self._language) |
| #4 alias_map cycle detection (transitive flatten + reject cycle) | spec §K.12.10b | alias_map.py:153-157 reject + 181-191 transitive flatten wrapped in a single transaction ✅ |
| #5 Step order L1 → vector → delete | invariant #2 lock | Step 6a (line 266) → 6b (line 286) → 7 (line 306) → 8 (line 313) ✅ + test_step_order_is_l1_then_vector_then_delete |

A complete mirror, with nothing stale left over.

Decorator passthrough invariant (my CR lock, msg=22816e0d)

In alias_redirect_store.py, LineageGraphStoreWithAliasRedirect forwards all 13 non-upsert methods to _inner:

  • find_entity_ids_with_lineage / find_relation_keys_with_lineage
  • remove_entity_lineage_member / remove_relation_lineage_member
  • gc_entity_if_orphan / gc_relation_if_orphan
  • delete_entity / delete_relation
  • get_entity / get_relation
  • query_entities_by_keyword / expand_neighbors_n_hops
  • list_entity_labels

test_decorator_passthrough_for_non_upsert_methods pins this down: if a future Protocol method is added without the decorator handling it, the test fails immediately, which prevents a silent default break. 👍
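
The intercept-two, forward-the-rest shape can be sketched with a much smaller interface. This is a hedged illustration only: `InnerStore` and `AliasRedirectStore` are hypothetical, expose three methods instead of the real Protocol, and use a plain dict where the PR uses `AliasMapRepository`.

```python
import asyncio
from typing import Optional


class InnerStore:
    """Hypothetical minimal backend standing in for a LineageGraphStore."""

    def __init__(self) -> None:
        self.entities: dict[str, str] = {}

    async def upsert_entity(self, name: str, description: str) -> None:
        self.entities[name] = description

    async def get_entity(self, name: str) -> Optional[str]:
        return self.entities.get(name)

    async def delete_entity(self, name: str) -> bool:
        return self.entities.pop(name, None) is not None


class AliasRedirectStore:
    """Decorator: rewrites names on upsert, forwards everything else unchanged."""

    def __init__(self, inner: InnerStore, aliases: dict[str, str]) -> None:
        self._inner = inner
        self._aliases = aliases

    async def upsert_entity(self, name: str, description: str) -> None:
        # Intercepted: the write lands under the canonical name.
        await self._inner.upsert_entity(self._aliases.get(name, name), description)

    async def get_entity(self, name: str) -> Optional[str]:
        # Explicit passthrough: arguments reach the inner store untouched.
        return await self._inner.get_entity(name)

    async def delete_entity(self, name: str) -> bool:
        return await self._inner.delete_entity(name)
```

Writing each forwarder out by hand, rather than relying on `__getattr__`, is one way to make a newly added method show up as a missing attribute that a passthrough test can catch.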

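Architecture quality (attention to detail)

  1. Option (b) decorator, 0 changes to the 3 backends: avoids cross-PR risk. Task #6 ships as a single PR instead of being split into three backend PRs (Postgres / Neo4j / Nebula), which saves the sync barrier between them. A decorator is the standard pattern for this kind of cross-cutting concern.

  2. Source-parts re-anchor preserves lineage: lines 266-284 do not invent a sentinel "curation_merge" pseudo-doc for the source parts; they keep each source's original (document_id, parse_version, chunk_ids) and write it under the target name. L1 can therefore still trace per-doc origin cleanly (invariant #1 strictly kept), and when a user later deletes a doc, the corresponding part can still be stripped. 👍

  3. Sentinel __curation_merge__ only for the unified+compacted write: the final write at lines 286-303 uses the sentinel lineage to mark it as "not doc-driven", clearly distinct from the indexer's normal lineage.

  4. Soft-transition + forward-only retry safety: after a mid-step failure the sources stay readable (step 8 deletes last), and retrying the whole merge is idempotent because every upsert is keyed by (document_id, parse_version). This is already recorded in the docstring.

  5. Wiring deferred to task #8 is explicitly documented: wrapping worker_factory._build_lineage_graph_store with the decorator and swapping the REST /graphs/merge route are both one-line / few-line changes in task #8, which avoids partial wiring. This deferred wiring must be finished before the task #10 close-out; otherwise the alias-redirect inseparability gate exists in name only (the indexer would still go through the raw store).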

4-pattern pre-check matrix (pasted in PR body)

P1 v1 caller-grep lands correctly (legacy merge_entities at domains/knowledge_graph/service.py:171 + api/routes.py:160, task #8 cutover). P2 dependency interface signatures all align with the already-merged commits. P3 per-collection binding is consistent with the existing GraphCurationService repo convention.

simple-stable 4-guardrail

The PR description states all 4 items explicitly: Option (b) leaves backends unchanged / smallest surface / production helpers reused / no operator config.

Test quality (24/24)

  • alias_map (10): 3 cycle cases + transitive flatten + target chain flatten + orphan persist + collection isolation + purge ✅
  • decorator (5): entity redirect + no-alias passthrough + relation both-endpoint redirect + relation single-endpoint redirect + 13-method passthrough invariant
  • merger (9): empty short-circuit + step order + sentinel + Compactor kwargs + N×M re-anchor + vector payload + cycle propagate + target chain flatten + target GC tolerance ✅

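1 task #8 wiring reminder (does not block this PR, but task #8 must deliver it)

The PR description explicitly says "wiring is deferred to task #8":

  1. worker_factory._build_lineage_graph_store wrap with LineageGraphStoreWithAliasRedirect
  2. REST /graphs/merge route swap to LineageEntityMerger

Critical: without these 2 wiring changes, neither this PR's inseparability gate (invariant #3, transparent alias redirect) nor the user-driven merge takes effect. @chenyexuan the task #8 PR must include both wiring changes and call them out explicitly in the PR body; @huangheng the task #8 CR should focus on where these 2 wiring changes land.

Checklist that would earn an LGTM once fixed

In fact this can already merge ✅. The task #8 wiring check moves to the task #8 CR window; this PR has no blocker.

@明书 extremely solid work. Choosing the Option (b) decorator and having the source-parts re-anchor preserve lineage are both refinements the spec did not explicitly require but the implementer added on their own initiative, and the single-PR ETA of ~04:15 was met. 👍 The Wave 7 close-out path is clear: remaining are task #7 (chenyexuan) + #8 (chenyexuan + this PR's wiring) + #9 (cuiwenbo) + #10 (Bryce sweep).

@符炫炜 LGTM, ready to merge.
@不穷 move task #6 → done after merge; the remaining critical path is #7 / #8 / #9 / #10 = 4 PRs to wrap up Wave 7.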

@earayu earayu merged commit c7e9068 into main Apr 27, 2026
10 checks passed
@earayu earayu deleted the w7-6-graph-curation-merge branch April 27, 2026 18:34
earayu added a commit that referenced this pull request Apr 27, 2026
Wave 7 §K.12.8 task #8 — three wiring points lit at the same time
so PR #1758's inseparability gate (alias-redirect on every indexer
write) and PR #1756's vector recall (LightRAG-style semantic search
plus 1-hop traversal) become production-alive at the same merge.

What lands

* ``worker_factory._build_lineage_graph_store`` now returns a
  ``LineageGraphStoreWithAliasRedirect`` decorating the raw
  per-collection backend (Wave 7 §K.12 invariant #9). Callers that
  need the raw inner store — the merger writes canonical names
  directly and must not be intercepted — go through the new
  ``_build_lineage_graph_store_inner``.

* ``retrieval/pipeline._graph_search`` now composes
  ``GraphSearchService.search_entities`` (vector recall) +
  ``get_subgraph`` (1-hop traversal) +
  ``GraphSearchService.compose_context`` (legacy LightRAG-style
  rendering). The render is byte-for-byte identical to Wave 6's
  ``_render_graph_context_text`` (locked by the byte-parity test in
  PR #1756) so downstream RAG prompts are zero-functional-change —
  only now vector recall actually happens.

* ``GraphService.merge_entities`` (the route layer for
  ``POST /graphs/nodes/merge``) delegates to ``LineageEntityMerger``.
  Backward-compat response shape preserved (``target_entity_id`` /
  ``description`` / ``source_chunk_ids`` / ``edges_*``); chunk ids
  are recovered from the target's lineage after step 6a re-anchors
  the source parts under the canonical name. ``edges_redirected`` /
  ``edges_collapsed`` surface ``0`` because edge re-anchoring is
  handled transparently by the alias-redirect decorator at indexer
  write time, not as part of the merge action.

* New ``build_lineage_entity_merger_for(collection)`` factory in
  ``aperag/graph_curation/lineage_merge.py`` resolves the six
  dependencies (raw inner store / alias repo / compactor /
  vector connector / embedder / LLM) the merger expects, lifting
  the ``_SyncEmbedderShim`` pattern out of
  ``MergeCandidateDetector``'s factory so merger and detector
  share one shim.

Tests

* ``tests/unit_test/indexing/test_wave7_task8_wiring.py`` — eight
  integration tests pinning each of the three wiring points so a
  future refactor cannot silently regress any of them: decorator
  wraps inner store, inner factory still returns raw backend,
  pipeline composes the three GraphSearchService calls in order,
  KG gate still short-circuits, factory failures degrade to empty,
  merger delegation preserves backward-compat shape, fallback to
  unified description when no compaction, alias cycle surfaces as
  ValueError → 400.
* Wave 6 ``test_graph_search_migration.py`` — two existing tests
  updated to mock the new ``build_graph_search_service_for`` /
  ``search_entities`` / ``get_subgraph`` boundary instead of the
  retired keyword-only path. The grep-zero migration assertion is
  unchanged and passes (legacy import still 0).

12-invariant cross-check (§K.12 / huangheng msg=fcf580a6)

* #4 vector store via ``VectorStoreConnectorAdaptor`` ✅ — pipeline
  composes through ``GraphSearchService`` which already abides;
  no direct Qdrant import added.
* #5 payload ``indexer="graph_entity"`` filter pattern ✅ —
  inherited from ``GraphSearchService.search_entities``.
* #9 ``upsert_entity_with_lineage`` alias redirect ✅ — every
  indexer / read path now receives the decorated store; merger
  bypasses via the inner-only factory so canonical writes are
  not intercepted.
* #11 candidate detection write-only / merge read paths ✅ — the
  cutover preserves the read/write boundary.
* #12 grep-zero LightRAG ✅ — code + tests stay LightRAG-clean.
  ``aperag/graph_curation/*`` and ``domains/knowledge_graph/service.py:get_knowledge_graph``
  still hold legacy imports for graph-overview / curation-run
  paths that depend on enumerate-all-entities; deferred to task #10
  close-out alongside the Protocol-extension or legacy delete.

4-pattern pre-check matrix

* Pattern 1 v1: ``rg "from aperag.domains.knowledge_graph.graphindex"``
  count is unchanged in this PR (drops below the threshold task #10
  will assert grep-zero); the ``_graph_search`` path no longer
  imports the legacy package at all.
* Pattern 1 v2: per-method matrix — pipeline cutover replaces
  ``query_entities_by_keyword`` + ``expand_neighbors_n_hops`` with
  ``search_entities`` + ``get_subgraph``; merge cutover replaces
  ``GraphIndexService.merge_entities`` with
  ``LineageEntityMerger.merge_entities``.
* Pattern 2: state binding — new merger factory binds to the same
  Qdrant connector + embedder the indexer write path uses
  (``_build_collection_graph_vector_writer``).
* Pattern 3: factory + decorator pattern verified by the wiring
  tests so a future split cannot silently regress.

simple-stable 4 guardrail

* #1 No unbounded scope growth: three wiring swaps + one factory; no new
  endpoints, no new schema, no new Protocol surface.
* #2 Ship soon: task #8 unblocks task #9 (frontend) and task #11
  (e2e narrative) without a Protocol-extension prerequisite.
* #3 Simple and stable: decorator + factory split keeps the merger's
  "writes canonical directly" invariant explicit; pipeline cutover
  reuses the byte-parity contract from PR #1756.
* #4 Maintenance-free after private deployment: no operator config
  required; the ``API_BASE_URL`` env already supports MCP colocation,
  and the alias-redirect decorator runs on every collection automatically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
earayu added a commit that referenced this pull request Apr 27, 2026
…1762)

* feat(celery Wave 7 #8): retrieval/curation cutover + 3 wiring points

* docs(Wave 7 #8): clarify edges_*=0 semantic on merge_nodes route

Per architect msg=4af6f66b / PM msg=e75fd00d follow-up: future
maintainers grepping merge_nodes_view shouldn't see the
hard-coded 0 and suspect a bug. The Wave 7 §K.12 invariant #9
LineageGraphStoreWithAliasRedirect decorator handles edge
re-anchoring at indexer write-time, not as part of the merge call,
so there is no explicit edge count to surface on the response.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
earayu added a commit that referenced this pull request Apr 27, 2026
…l (W7-10)

Final close-out of Wave 7 §K.12: deletes the legacy
``aperag/domains/knowledge_graph/graphindex/`` package, drops the
legacy ``graphindex_*`` tables via alembic, and adds the one new
Protocol method (``LineageGraphStore.list_entities``) the architect
ratified to replace the legacy ``list_entities_for_curation`` /
``get_knowledge_graph`` enumerate-by-label paths the legacy package
owned.

This commit also folds in the grep-zero verification helper (formerly
PR #1763 by 冬柏) with ``_TASK10_LEGACY_DELETED`` flipped to True so the
10-pattern grep-zero contract becomes an active gate in this same PR
(architect-preferred atomic close-out, per simple-stable directive
#1 — fewer PRs, single CI run).

## Scope (per architect msg=28afe6ab + 4-question Q1-Q4 ratify msg=838d57c3 / msg=f3216dfc)

1. **NEW Protocol method ``LineageGraphStore.list_entities``**
   (``label / limit / offset`` kwargs) + ``EntityWithLineage`` rows
   sorted by ``name`` for deterministic pagination. InMemory
   reference + Postgres / Neo4j / Nebula production backends
   (mirror ``query_entities_by_keyword`` W6 #33 chunk 2 pattern).

2. **``aperag/domains/knowledge_graph/service.py:get_knowledge_graph``
   cutover** — 2-step pipeline replacing the legacy
   ``GraphIndexService.get_knowledge_graph``:

   1. ``store.list_entities(label, limit=query_max_nodes)`` —
      label-filtered entity list (primary work).
   2. ``GraphSearchService.get_subgraph(names, hops=max_depth)`` —
      optional edge expansion when ``max_depth > 0``.

   Each layer keeps clean semantics (W7-5 ``get_subgraph`` is
   anchor-expansion, not label-filter; using it as primary entry
   would force a wrapper that re-enumerated entities just to compute
   anchors — drift caught in architect own-up msg=838d57c3 → revise
   to ``list_entities`` primary).

3. **``CurationEntity`` adapter** (new
   ``aperag/graph_curation/dto.py``) replacing the legacy ``Entity``
   DTO. ``from_lineage(EntityWithLineage)`` constructor adapts the
   storage view into the shape ``build_candidate_pairs`` /
   ``_pair_score`` / ``_jaccard`` / ``entity_snapshot`` already
   accept, so the production-validated algorithm needs zero signature
   changes (architect Q2 ratify, simple-stable directive #3).

4. **``aperag/graph_curation/service.py`` cutover** —
   ``accept_suggestion`` and ``generate_run`` migrated off the
   legacy ``GraphIndexService`` bundle:

   - ``accept_suggestion`` delegates to
     ``LineageEntityMerger.merge_entities`` (W7-6, PR #1758) the same
     way the W7-8 ``GraphService.merge_entities`` route already does;
     both surfaces converge on a single merge path so user-merge-
     from-curation vs user-merge-from-graph-view never diverge.
   - ``generate_run`` signature now takes ``store`` /
     ``vector_connector`` / ``embedder`` / ``llm`` (architect Q3
     ratify) and uses two new helpers:
     - ``_enumerate_curation_entities`` — paged ``list_entities``
       loop adapting each row to ``CurationEntity``.
     - ``_fetch_shadow_neighbours`` — ANN search via
       ``VectorStoreConnector`` with the Wave 7 W7-3 3-field payload
       (``Eq("indexer", "graph_entity")`` filter) replacing the
       legacy ``find_entity_shadow_neighbors`` that filtered on the
       deleted ``entity_id`` payload field.

5. **``aperag/graph_curation/integration.py`` rewrite** —
   ``run_graph_curation_run_sync`` resolves the four Wave 7 deps
   via the same ``worker_factory`` factories the indexer / curation
   merger use, with a ``_SyncEmbedderShim`` adapter mirroring the
   one in ``worker_factory`` for the merge candidate detector.

6. **``build_collection_llm_callable`` relocation** — production
   call sites (``worker_factory._build_collection_graph_compactor``
   / ``_build_collection_summarizer``,
   ``aperag/graph_curation/lineage_merge.py:build_lineage_entity_merger_for``,
   ``aperag/graph_curation/integration.py``) all import from the
   canonical home ``aperag/indexing/llm.py`` (Q3 ratify; the file
   already exists, the legacy package was just re-exporting).

7. **Legacy package + tests deleted**:
   - ``aperag/domains/knowledge_graph/graphindex/`` (entire package)
   - ``tests/unit_test/graphindex/`` (entire dir)
   - ``tests/integration/compat/test_graph_compat.py`` (replaced
     by ``test_lineage_graph_compat.py`` in W7-1)

8. **Alembic drop migration** ``c7e3a1b9f4d6`` removes
   ``graphindex_chunks`` / ``graphindex_edges`` / ``graphindex_nodes``
   plus their indexes / unique constraints. Hard-cut policy per
   spec §K.12.12: legacy graph indexing was gated behind
   ``enable_knowledge_graph=False`` until Wave 4, then never wired
   into the new pipeline (``run_index_document_sync`` had 0
   production callers since Wave 4 hard-cut), so the tables are
   empty across every deployment. Downgrade recreates empty schema.

9. **Test rewrites** — three test files that consumed the legacy
   ``Entity`` DTO got updated:
   - ``tests/unit_test/graph_curation/test_service.py`` /
     ``test_candidate_generation.py`` — switched to
     ``CurationEntity as Entity``.
   - ``tests/unit_test/service/test_search_graph_contract.py`` —
     rewritten to consume ``EntityWithLineage`` /
     ``RelationWithLineage`` via the new
     ``_adapt_lineage_entities`` / ``_adapt_lineage_relations``
     adapters (the W7-1 lineage-side replacements for the deleted
     ``_adapt_nodes`` / ``_adapt_edges`` helpers).
   - 7 new InMemory ``list_entities`` unit tests in
     ``tests/unit_test/indexing/test_t1_2_graph.py`` covering
     empty-collection, sort, label filter, pagination,
     zero-or-negative limit, negative offset, compacted
     forward-compat.
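
The paged ``list_entities`` loop from item 4 can be sketched roughly as below. This is a minimal illustration, not the shipped code: ``EntityRow``, ``ToyStore`` and ``enumerate_entities`` are hypothetical stand-ins, and the handling of a non-positive ``limit`` (empty page) and a negative ``offset`` (clamped to 0) is an assumption about what the new unit tests pin down.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityRow:
    """Hypothetical stand-in for an ``EntityWithLineage`` row."""
    name: str
    label: str


class ToyStore:
    """Hypothetical in-memory store; only the paging contract is modelled."""

    def __init__(self, rows):
        self._rows = list(rows)

    async def list_entities(self, *, label=None, limit=100, offset=0):
        # Assumption: non-positive limit yields nothing, negative offset clamps to 0.
        if limit <= 0:
            return []
        rows = [r for r in self._rows if label is None or r.label == label]
        rows.sort(key=lambda r: r.name)  # stable order keeps offset paging deterministic
        start = max(offset, 0)
        return rows[start:start + limit]


async def enumerate_entities(store, *, label=None, page_size=2):
    """Walk every page until a short page signals the end."""
    out, offset = [], 0
    while True:
        page = await store.list_entities(label=label, limit=page_size, offset=offset)
        out.extend(page)
        if len(page) < page_size:
            return out
        offset += page_size


def demo_names(label="ORG"):
    store = ToyStore([EntityRow("zeta", "ORG"), EntityRow("alpha", "ORG"),
                      EntityRow("mid", "PERSON"), EntityRow("beta", "ORG")])
    rows = asyncio.run(enumerate_entities(store, label=label))
    return [r.name for r in rows]
```

Sorting by ``name`` before slicing is what makes ``offset`` paging repeatable across calls; without a stable order a row could be skipped or returned twice between pages.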

## §K.12 invariant cross-check

| # | Invariant | This PR |
|---|-----------|---------|
| 1 | L1 graph data not polluted | ✅ ``list_entities`` is read-only; storage view → adapter projection only |
| 2 | L1 → L2 single-direction derive | ✅ no derived writes |
| 3 | Compactor before vector embed | N/A — read path |
| 4 | Vector store via Adaptor | ✅ ``_fetch_shadow_neighbours`` uses ``VectorStoreConnector`` only |
| 5 | payload indexer filter | ✅ ``Eq("indexer","graph_entity")`` filter; no legacy ``entity_id`` payload reference |
| 6 | uuid5 vector point id | N/A — read path |
| 7 | snapshot-diff lineage name set | N/A — read path |
| 8 | alias_map persist orphan | ✅ unaffected; alias_map is W7-6 owned |
| 9 | upsert_entity alias redirect | ✅ unaffected; decorator pattern preserved (curation flow uses inner store directly per architect msg=cf860ae4) |
| 10 | DB column length application-cap | ✅ no schema CHECK constraints introduced |
| 11 | candidate detection write-only | ✅ ``MergeCandidateDetector`` unchanged; ``generate_run`` uses same write boundary |
| 12 | grep-zero LightRAG | ✅ `rg "from aperag.domains.knowledge_graph.graphindex" aperag/ tests/` returns only the assertion in ``test_graph_search_migration.py:55``. ``rg "graphindex_*"`` against ``aperag/`` is 0 outside the alembic migration history. The 8 Wave 6-era ``# -- LightRAG-style query layer`` comments + W7-4 line 249 fallback + W7-5 docstrings remain (architect msg=3fe200be — they are descriptive comments referencing design heritage, removable in Wave 8 cleanup if desired) |

## 4-pattern pre-check matrix (paste from PR thread reply)

* **P1 v1** — ``rg "from aperag.domains.knowledge_graph.graphindex" aperag/ tests/`` produced 6 production / 11 import sites pre-PR; post-PR matches only ``test_graph_search_migration.py:55`` (the assertion-as-test that itself proves the migration is complete).
* **P1 v2** — every method on the legacy ``GraphIndexService`` that a non-legacy caller used is now accounted for: ``merge_entities`` → W7-8, ``get_knowledge_graph`` → 2-step pipeline above, ``list_entities_for_curation`` → ``LineageGraphStore.list_entities`` + ``CurationEntity.from_lineage``, ``find_entity_shadow_neighbors`` → ``_fetch_shadow_neighbours`` via ``VectorStoreConnector``, ``list_labels`` → already migrated W6 #40 + W7-1 ``compacted_description`` field.
* **P2** — alembic ``c7e3a1b9f4d6`` drops the legacy tables, alembic env.py loses the legacy ``graphindex.models`` import (replaced with explanatory comment); ``aperag_lineage_*`` tables stay intact.
* **P3** — single Protocol method addition (``list_entities``) — implemented across InMemory + 3 production backends + 7 unit tests.

## simple-stable 4-guardrail

| Guardrail | Status |
|---|---|
| #1 No unbounded scope growth | ✅ ``list_entities`` is base capability mirroring `delete_entity` / `query_entities_by_keyword`; no new endpoints, no new schema tables |
| #2 Ship as soon as possible | ✅ single PR closes Wave 7; all 11 prior task PRs already merged |
| #3 Simple and stable | ✅ adapter pattern preserves production-validated `build_candidate_pairs`; ``list_entities`` follows existing pagination idiom |
| #4 Maintenance-free private deployment | ✅ alembic auto-drops legacy tables; no operator config; ``list_entities`` uses the same backend factories the rest of Wave 7 wires |

## Test plan

- [x] All 1142 unit tests pass (``uv run pytest tests/unit_test/``)
- [x] ``alembic upgrade head --sql`` generates the expected
      ``DROP INDEX`` / ``DROP TABLE`` cascade
- [x] ``alembic heads`` resolves to single head ``c7e3a1b9f4d6``
- [x] ``ruff format --check`` / ``ruff check`` clean on touched files
- [ ] CI compat-graph + e2e-http stages — both gated post-merge
- [ ] Pair with 冬柏's grep-zero helper PR #1763 — flip
      ``_TASK10_LEGACY_DELETED=True`` once both PRs merge

Closes Wave 7. Next: architect final review per spec §K.12.12.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

## Fold-in: grep-zero verification helper (formerly PR #1763)

Per architect ratify + 冬柏 authorization, PR #1763 is folded into
this commit instead of shipping as a separate PR. Contents:

* **``tests/integration/test_w7_grep_zero_legacy_graphindex.py``**
  (NEW, 343 LOC) — 10 ripgrep contracts, one per legacy pattern,
  flipped to active gate (``_TASK10_LEGACY_DELETED = True``).
  Patterns cover:
  1. ``from aperag.domains.knowledge_graph.graphindex`` imports
  2. ``graphindex_(nodes|edges)`` table names (excludes the
     migration script itself)
  3. bare ``import aperag.domains.knowledge_graph.graphindex``
  4. ``_sync_entity_relation_vectors`` (W7-3 superseded)
  5. ``_compact_oversized_descriptions`` (W7-2 superseded)
  6. ``_summarize_description`` (W7-2 superseded)
  7. ``_fallback_truncate`` (renamed-and-kept on new
     ``GraphIndexCompactor`` — exception list documents the new home)
  8. ``_delete_removed_shadow_vectors`` (W7-3 superseded)
  9. ``GraphSearchContract.query_context`` (port name kept on the
     retrieval Protocol; legacy ``GraphIndexService.query_context``
     historical-context comments allow-listed)
  10. ``GraphIndexService.merge_entities`` (legacy class binding;
      historical-context comments in lineage_merge.py +
      test_wave7_task8_wiring.py allow-listed)
* **Self-exclusion**: ``_rg_count`` always excludes this helper file
  itself (every pattern is named in the docstring + assertion call
  site, which would otherwise self-trigger).
* **``aperag/mcp/server.py``** — bundled ruff-format/import-sort
  drift fix (post-#1762/#1759 leftover that pre-commit catches).
  Kept here so the close-out PR lands cleanly through ``make lint``.
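
The self-exclusion idea can be illustrated with a small sketch. The real helper shells out to ripgrep; the version below swaps in a pure-Python ``re`` scan so it runs anywhere, and ``count_matches`` / ``demo_counts`` are hypothetical names, not the helper's actual ``_rg_count``.

```python
import re
import tempfile
from pathlib import Path


def count_matches(pattern, root, *, exclude=()):
    """Count lines under ``root`` matching ``pattern``, skipping excluded files."""
    regex = re.compile(pattern)
    excluded = {Path(p).resolve() for p in exclude}
    total = 0
    for path in sorted(Path(root).rglob("*.py")):
        if path.resolve() in excluded:
            continue  # the helper names every pattern, so it must not count itself
        for line in path.read_text(encoding="utf-8").splitlines():
            if regex.search(line):
                total += 1
    return total


def demo_counts():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        helper = root / "test_grep_zero.py"
        # The helper's own docstring mentions the legacy table name.
        helper.write_text('"""Asserts no graphindex_nodes references remain."""\n',
                          encoding="utf-8")
        (root / "service.py").write_text("TABLE = 'aperag_lineage_entity_alias'\n",
                                         encoding="utf-8")
        pattern = r"graphindex_(nodes|edges)"
        with_self = count_matches(pattern, root)
        without_self = count_matches(pattern, root, exclude=[helper])
        return with_self, without_self
```

Without the exclusion the gate would fail on its own docstring, which is why the helper file is always left out of the count.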

## Test plan (final)

- [x] 1141 unit tests pass (``uv run pytest tests/unit_test/``)
- [x] 10/10 grep-zero integration tests pass
      (``uv run pytest tests/integration/test_w7_grep_zero_legacy_graphindex.py``)
- [x] ``alembic upgrade head --sql`` generates the expected
      ``DROP INDEX`` / ``DROP TABLE`` cascade
- [x] ``ruff format --check`` / ``ruff check`` clean on touched files
- [ ] CI e2e-http-smoke + e2e-http-provider — gated post-merge

Closes Wave 7. Next: architect final review per spec §K.12.12.

Co-Authored-By: 冬柏 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
earayu added a commit that referenced this pull request Apr 27, 2026
…l (W7-10) (#1765)

Co-authored-by: 冬柏 <noreply@anthropic.com>
earayu added a commit that referenced this pull request Apr 28, 2026
…nds + curation cutover (W8-2) (#1771)

Wave 8 task #13 (W8-2): adds a bulk-upsert primitive on the
``LineageGraphStore`` Protocol so callers consolidating N×M
``(description_part, lineage_member)`` tuples to the same target
entity get one transaction / one round-trip instead of N×M
sequential single upserts. Cuts ``LineageEntityMerger.merge_entities``
step 6a over to the bulk path.

## Why

``GraphCurationService.merge_entities`` (the N-source-into-1-target
user merge flow) re-anchors every source's description parts under
the target name. With N source entities each carrying M description
parts, that loop emitted N×M sequential ``upsert_entity_with_lineage``
round-trips — one full transaction per part. For typical curation
runs (3-10 sources × 5-20 parts each), that's 15-200 sequential SQL
hits where one bulk write would do.

Per architect spec (Wave 8 candidate W8-2 sediment) + huangheng's
task #6 PR #1758 perf observation, this folds into a single bulk
write per backend.

## Scope (5 numbered items)

1. **Protocol method** — ``bulk_upsert_entity_with_lineage_parts(*,
   parts: Sequence[tuple[EntityRecord, LineageMember]])`` on
   ``aperag/indexing/graph.py:LineageGraphStore``. All ``record.name``
   values MUST share the same string (asserted as ``ValueError``).
   Empty parts is a no-op. Per-part dedup key is
   ``(document_id, parse_version)`` last-wins. Bulk path NEVER
   touches ``compacted_description`` (preserves existing — same
   ``COALESCE``-style semantic as single upsert with ``None``).
2. **InMemory ref impl** — single ``asyncio.Lock``-guarded loop in
   ``InMemoryLineageGraphStore``.
3. **Postgres impl** — single ``INSERT … ON CONFLICT (collection_id,
   name) DO UPDATE`` that strips matching keys via
   ``jsonb_array_elements`` ``NOT EXISTS`` against an incoming
   ``strip_keys_json`` array, then appends the whole new
   ``new_members_json`` / ``new_parts_json`` arrays. One statement,
   atomic.
4. **Neo4j impl** — single Cypher MERGE + parallel-list strip-then-
   append, with the strip predicate matching against the **set** of
   incoming keys (``IN $strip_keys`` on the ``"<doc_id>|<pv>"`` key
   string). Row-lock on MERGE serialises concurrent bulk ops on the
   same entity.
5. **Nebula impl** — single ``EntityLock(target_name)`` acquire +
   single read / Python merge / write. Mirrors the existing
   read-modify-write pattern of single upsert but folds the strip-
   then-append over the **set** of incoming keys.
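
The contract in item 1 (one target name, empty input as a no-op, last-wins dedup on ``(document_id, parse_version)``, ``compacted_description`` left alone) can be sketched with a toy in-memory version. ``ToyEntity`` and ``bulk_upsert_parts`` are hypothetical names, and the flat tuple shape stands in for the real ``(EntityRecord, LineageMember)`` pairs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyEntity:
    """Hypothetical stored row; field names are illustrative."""
    name: str
    parts: dict = field(default_factory=dict)  # (document_id, parse_version) -> text
    compacted_description: Optional[str] = None


def bulk_upsert_parts(store, parts):
    """Apply many (name, document_id, parse_version, text) tuples to one entity."""
    if not parts:
        return  # empty input is a no-op
    names = {name for name, _, _, _ in parts}
    if len(names) != 1:
        raise ValueError(f"bulk upsert needs one target name, got {sorted(names)}")
    (target,) = names
    entity = store.setdefault(target, ToyEntity(name=target))
    for _, document_id, parse_version, text in parts:
        key = (document_id, parse_version)
        entity.parts.pop(key, None)  # strip any existing part with this key
        entity.parts[key] = text     # append; a later duplicate in the input wins
    # compacted_description is deliberately never written here


def demo_bulk():
    store = {"Acme": ToyEntity(name="Acme", parts={("doc1", 1): "old"},
                               compacted_description="summary")}
    bulk_upsert_parts(store, [
        ("Acme", "doc1", 1, "new"),
        ("Acme", "doc2", 1, "first"),
        ("Acme", "doc2", 1, "second"),
    ])
    return store["Acme"]
```

The strip-then-append step is the same shape each backend implements in its own idiom (SQL, Cypher, or a locked read-modify-write); only the atomicity mechanism differs.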

## Caller cutover

* ``LineageEntityMerger.merge_entities`` (step 6a) — replaces the
  N×M ``for src in source_entities: for part in
  src.description_parts: await self._store.upsert_entity_with_lineage(...)``
  with a single ``bulk_upsert_entity_with_lineage_parts`` call.
  Step 6b's sentinel write (``__curation_merge__`` final write with
  unified+compacted text) still goes through the single-upsert path
  because it needs the ``compacted_description`` column write that
  the bulk path intentionally doesn't carry.

## Alias-redirect decorator

* ``LineageGraphStoreWithAliasRedirect`` — bulk path mirrors single
  upsert: each part's ``record.name`` resolves through the alias
  map before forwarding to the inner store. The merger always
  passes records pinned to the canonical ``final_target`` so the
  redirect is a no-op in that flow, but symmetry with the single
  upsert contract means a future caller writing to an aliased name
  still gets correct behaviour.
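
As a rough sketch of that symmetry, a decorator can resolve each part's name through the alias map and then forward the whole batch to the inner store. ``RecordingStore``, ``AliasRedirectStore`` and the ``bulk_upsert`` method are hypothetical simplifications; the real decorator wraps ``LineageGraphStore`` and reads from ``AliasMapRepository`` rather than a plain dict.

```python
import asyncio


class RecordingStore:
    """Hypothetical inner store that records the names it was asked to write."""

    def __init__(self):
        self.written = []

    async def bulk_upsert(self, parts):
        self.written.extend(name for name, _ in parts)


class AliasRedirectStore:
    """Rewrites each part's name through an alias map, then forwards to the inner store."""

    def __init__(self, inner, alias_map):
        self._inner = inner
        self._alias_map = alias_map  # alias_name -> canonical_name, one level deep

    async def bulk_upsert(self, parts):
        if not parts:
            return  # nothing to resolve or forward
        redirected = [(self._alias_map.get(name, name), payload)
                      for name, payload in parts]
        await self._inner.bulk_upsert(redirected)


def demo_redirect():
    inner = RecordingStore()
    store = AliasRedirectStore(inner, {"ACME Corp": "Acme"})
    asyncio.run(store.bulk_upsert([("ACME Corp", "part-1"), ("Acme", "part-2")]))
    return inner.written
```

Because the alias table is kept one level deep at write time, a single lookup per name is enough; a name with no alias passes through unchanged.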

## 12-invariant cross-check (Wave 7 §K.12)

* **#1 L1 not polluted**: bulk path operates on lineage SETs only; the kg.jsonl
  raw extract layer is untouched. ✅
* **#3 indexer write redirect through alias map**: bulk path goes
  through the same decorator alias resolution as single upsert. ✅
* **#9 alias redirect transparent**: decorator forwards bulk to
  inner with redirected names; merger callers see no behaviour
  change. ✅
* **#10 DB column cap**: Postgres ``compacted_description`` Text
  column unchanged. The bulk path passes ``None`` end-to-end, so in
  the Postgres SQL the INSERT branch sets NULL and the ON CONFLICT
  branch preserves the existing value (COALESCE-equivalent). ✅
* All other invariants: unaffected.

## 4-pattern pre-check matrix

* Pattern A (kg.jsonl shape): unchanged
* Pattern B (Lineage SET semantics): bulk preserves dedup-by-
  ``(document_id, parse_version)`` exactly — same key as single
  upsert. ✅
* Pattern C (Cypher LIST<MAP>): bulk reuses parallel-list encoding
  for the strip-then-append. ✅
* Pattern D (vector 3-field payload): unchanged

## Simple-stable 4-guardrail

* **#1 No unbounded scope growth**: 1 new Protocol method, no new public REST
  surface, no new schema column, no new alembic migration.
* **#2 Ship as soon as possible**: single PR, no spec amendment needed (architect
  W8-2 candidate sediment + Wave 7 task #1 3-backend Protocol
  pattern reuse).
* **#3 Simple and stable**: bulk semantic mirrors single upsert exactly;
  caller cutover is a 1-call replacement of the N×M loop.
* **#4 Maintenance-free private deployment**: backend-portable (implemented
  on all 4 backends); no operator-facing config knob.

## Test plan

- [x] InMemory contract — 7 new tests pin empty / create / replace /
      mismatched-names / dedup-within-input / preserves-compacted /
      last-entity-type-wins.
- [x] Alias-redirect decorator — 3 new tests pin redirect-each-part /
      no-alias-passthrough / empty-short-circuit.
- [x] LineageEntityMerger step 6a cutover — existing
      ``test_source_parts_reanchored_preserving_doc_lineage``
      rewritten to assert the new single bulk call (was: 3
      sequential single upserts) + single sentinel final write.
- [x] All 1166 unit tests pass (up from 1141 baseline; 25 new tests).
- [x] Wave 7 grep-zero contract — 10/10 still pass (intent-driven
      gate unaffected).
- [x] ruff format / check clean on touched files.
- [ ] Cross-backend integration test — sediment to Wave 8 follow-up
      (out of scope this PR; the 4-backend Protocol contract is
      structurally identical to single upsert which already has
      cross-backend integration coverage).
- [ ] CR by @huangheng (focus: invariant #1 L1, #9 alias redirect
      transparent, 4-backend cross-roundtrip on contract level).