feat(graph Wave 7 W7-3): GraphModalityWorker.sync() Phase 3 #1757

Merged
earayu merged 1 commit into main from bryce/wave7-task3-sync-extension on Apr 27, 2026

@earayu earayu commented Apr 27, 2026

Summary

Wave 7 task #3: extends GraphModalityWorker.sync() with the four-step Phase 3 that turns the lineage-store rebuild (Phase 1+2, Wave 4-6) into a complete LightRAG-style graph index. This activates entity / relation vector recall in production for the first time since Wave 4 (the legacy _sync_entity_relation_vectors write path has had zero callers since the hard cut, leaving the Qdrant entity index empty).

Phase 3 step ordering (per spec §K.12.3 + huangheng msg=16a38734)

  1. Compactor (W7-2 dep): for each entity / relation just touched, run GraphIndexCompactor.compact_if_oversized over description_parts text and write the unified summary back via upsert_*_with_lineage(..., compacted_description=...). None → COALESCE preserves the existing column.
  2. Embed + vector upsert: hash compacted (or name + raw parts fallback) → await asyncio.to_thread(connector.upsert, [point]). Point id = uuid5(NAMESPACE_DNS, f"graph_entity:{collection_id}:{name}") → deterministic overwrite. Payload = 3 fields locked at spec §K.12.5 ratify msg=acbd0003 / msg=d3f4e6f8: {indexer, entity_name, entity_type} (no collection_id payload — it lives in the uuid5 id while the connector handles per-tenant guard).
  3. Snapshot-diff delete: pre-sync entity-name set minus post-sync set → ids to delete. Source: lineage store, never an ANN list-all (invariant #7).
  4. MergeCandidateDetector (W7-4 dep, optional): pass affected entity names to detect_for_sync. D-3 lock — detector only writes PENDING suggestions, never auto-merges.
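
As a rough illustration of step 2, the deterministic id and the 3-field payload could be built as below. This is a sketch rather than the worker's code: the helper names `entity_point_id` and `entity_payload` are hypothetical, and returning the id as a string is an assumption. Only the uuid5 seed format and the three payload keys are taken from this PR.

```python
import uuid


def entity_point_id(collection_id: str, name: str) -> str:
    """Same (collection, name) always maps to the same UUID, so a re-sync overwrites."""
    seed = f"graph_entity:{collection_id}:{name}"
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, seed))


def entity_payload(name: str, entity_type: str) -> dict:
    """3-field payload; collection_id is deliberately absent (it is folded into the id)."""
    return {"indexer": "graph_entity", "entity_name": name, "entity_type": entity_type}


if __name__ == "__main__":
    first = entity_point_id("col-1", "Linus Torvalds")
    again = entity_point_id("col-1", "Linus Torvalds")
    other = entity_point_id("col-2", "Linus Torvalds")
    print(first == again, first == other)
```

Because the collection id is part of the seed, the same entity name in two collections gets two different point ids even when both collections share one backing store.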

Optional dependencies + degradation

The Phase 3 deps (compactor, embedder, vector_connector, merge_detector) are all optional. Without embedder + vector_connector Phase 3 short-circuits → Wave 6 lineage-only behaviour preserved (every existing test still passes). Failures inside any individual step are logged and swallowed — the lineage critical path always survives a partial Phase 3.
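
The degradation rules above (short-circuit when deps are unwired, log and swallow per-step failures) can be sketched as follows. `run_phase3` and `RecordingConnector` are hypothetical stand-ins used only for illustration; the real worker's method names and point type are not shown in this PR description.

```python
import asyncio
import logging
from typing import Callable, List, Optional

logger = logging.getLogger(__name__)


class RecordingConnector:
    """Stand-in vector connector with a blocking upsert (hypothetical)."""

    def __init__(self) -> None:
        self.points: list = []

    def upsert(self, points: list) -> None:
        if any(name == "bad" for name, _ in points):
            raise RuntimeError("vector backend hiccup")
        self.points.extend(points)


async def run_phase3(
    names: List[str],
    embedder: Optional[Callable[[str], List[float]]] = None,
    vector_connector: Optional[RecordingConnector] = None,
) -> int:
    """Return how many points were written; never raises."""
    if not (embedder and vector_connector):
        return 0  # deps unwired: lineage-only behaviour is preserved
    written = 0
    for name in names:
        try:
            vector = embedder(name)
            # The connector call is synchronous, so run it off the event loop.
            await asyncio.to_thread(vector_connector.upsert, [(name, vector)])
            written += 1
        except Exception:
            logger.warning("Phase 3 step failed for %r; continuing", name, exc_info=True)
    return written
```

Calling `asyncio.run(run_phase3(["a"]))` with no deps returns immediately, and one failing point does not stop the points after it from being written.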

worker_factory wiring

  • compactor via the shared build_collection_llm_callable (same resolver the graph extractor uses).
  • vector_connector + embedder via the existing _build_collection_qdrant_connector helper — graph entity / relation vectors share the per-collection Qdrant collection that holds chunk vectors, distinguished by the indexer payload key.
  • merge_detector constructed when both connector + embedder resolved, with a thin sync-embedder shim adapting the graph worker's (text -> list[float]) callable into the embed_query shape the detector expects.
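
A minimal sketch of the shim mentioned in the last bullet, assuming the detector only needs an object with a synchronous `embed_query(text)` method. The class name `SyncEmbedderShim` and the exact `embed_query` signature are assumptions, not the code in this PR.

```python
from typing import Callable, List


class SyncEmbedderShim:
    """Adapts a plain (text -> list[float]) callable to an embed_query-style object."""

    def __init__(self, embed_text: Callable[[str], List[float]]) -> None:
        self._embed_text = embed_text

    def embed_query(self, text: str) -> List[float]:
        # Pure delegation: the shim adds no caching or batching of its own.
        return self._embed_text(text)


if __name__ == "__main__":
    shim = SyncEmbedderShim(lambda text: [float(len(text))])
    print(shim.embed_query("Linus"))
```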

When any production dep fails to construct (no completion model, broken embedder config), the factory logs a warning and the worker falls back to the no-op default — mirrors the summary worker pattern.

§K.12 invariant cross-check (per architect msg=fcf580a6 directive)

| # | Invariant | This PR |
|---|---|---|
| 1 | L1 graph data not polluted | ✅ kg.jsonl untouched; Phase 3 only reads from / writes derived columns |
| 2 | L1 → L2 single-direction derivation (compacted is derived) | ✅ Compactor reads description_parts, writes compacted_description and vector; never the reverse |
| 3 | Compactor at sync end, before vector embed | ✅ Step ordering Phase 3 = compactor → embed; test_w7_phase3_compactor_runs_before_embed pins this with a "COMPACTED"-stub + capturing embedder |
| 4 | Vector store via VectorStoreConnector(Adaptor) abstraction | ✅ no Qdrant client import; type hint is VectorStoreConnector Protocol; sync connector.upsert / delete wrapped in await asyncio.to_thread(...) per architect msg=acbd0003 |
| 5 | indexer="graph_entity"/"graph_relation" payload filter | test_w7_phase3_writes_vector_point_with_3_field_payload asserts payload keys exactly + no collection_id leakage |
| 6 | uuid5 deterministic vector point id | test_w7_phase3_uuid5_id_is_deterministic re-syncs same content + asserts id stable; format graph_entity:<cid>:<name> / graph_relation:<cid>:<src>-><tgt>:<type> |
| 7 | snapshot-diff via lineage entity name set | _capture_entity_names builds post-sync set from kg_entity_records ∪ store-survived-pre-sync; never calls vector_connector.scroll/list. test_w7_phase3_snapshot_diff_preserves_cross_doc_entity proves shared entities don't get deleted |
| 8 | alias_map orphan-persists across canonical gc | N/A (W7-6) |
| 9 | upsert_entity_with_lineage transparent alias redirect | N/A (W7-6) |
| 10 | DB column length is application-layer cap, not schema CHECK | N/A (W7-1 / W7-2) |
| 11 | Candidate detection writes-only, no auto-merge (D-3) | ✅ Step D calls merge_detector.detect_for_sync which only writes PENDING GraphCurationSuggestion rows; test_w7_phase3_merge_detector_invoked_with_affected_names pins the call shape |
| 12 | grep-zero LightRAG naming | ✅ no LightRAG strings introduced |

4-pattern pre-check matrix

  • P1 v1 (caller import count): GraphModalityWorker.sync callers stay in aperag/indexing/ orchestrator wiring only; new optional kwargs default to None so external call shape is backward-compat — no caller updates needed in this PR.
  • P1 v2 (caller method coverage): sync() still returns None; nothing observable beyond Phase 3 side effects on the vector store + suggestion table. Existing 1083 unit tests pass unchanged.
  • P2 (state binding): dependencies all already on main —
    • aperag/indexing/graph_compactor.py (Wave 7 W7-2, merged c1c48429)
    • aperag/indexing/merge_candidate_detector.py (W7-4, merged 0dbf9fd1)
    • aperag/vectorstore/connector.py VectorStoreConnector(Adaptor) (Wave 4)
  • P3 (Protocol method state): no new LineageGraphStore Protocol methods needed. Phase 3 uses existing get_entity / get_relation for the read-modify-write loop and writes via the compacted_description kwarg shipped in W7-1.

simple-stable directive 4 guardrails

| Guardrail | Status |
|---|---|
| #1 don't expand scope | ✅ no new Protocol methods; deps optional; ~330-line worker delta + ~470-line test delta; worker_factory wiring follows existing helper patterns |
| #2 ship fast | ✅ behind merged W7-1 / W7-2 / W7-4; production wiring + unit test suite in one PR |
| #3 simple > complex | ✅ failures swallowed (lineage critical path inviolate); deterministic uuid5 ids; same connector + embedder helpers reused across vector / summary / vision workers |
| #4 private-deploy maintenance-free | ✅ all knobs via existing collection LLM / embedder resolvers; no new operator config; alembic-free (vector schema lives in Qdrant collection from chunk vectors) |

Test plan

  • 12 new InMemory unit tests in tests/unit_test/indexing/test_t1_2_graph.py covering:
    • Phase 3 skipped when vector deps unwired (Wave 6 backward-compat)
    • 3-field payload exact (no collection_id leak)
    • uuid5 id deterministic across resyncs
    • Compactor runs before embed (ordering invariant)
    • Compactor None falls back to name + raw parts
    • Snapshot-diff deletes gc'd vector points
    • Cross-doc shared entity NOT deleted (Linus mentioned in doc_A + doc_B; doc_A re-parsed without him; vector survives)
    • Relation Phase 3 mirrors entity (3-field payload, indexer="graph_relation", uuid5 id format)
    • MergeCandidateDetector.detect_for_sync invoked with correct affected names
    • Detector failure non-fatal (lineage + vector upsert intact)
    • Compactor failure non-fatal (embedder gets fallback text)
    • Vector upsert failure non-fatal (lineage row intact)
  • All 1117 unit tests pass (1083 + 34 new across W7-1 + W7-3): uv run pytest tests/unit_test/
  • ruff format --check + ruff check clean on touched files
  • CI compat-graph stage (pure unit testing only — Phase 3 integration with real Qdrant lives in W7-5 / W7-8 verification, intentionally not in this PR scope)

Spec / decision references

🤖 Generated with Claude Code

Adds Phase 3 to ``GraphModalityWorker.sync()``: the four steps that
turn the lineage-store rebuild (Phase 1+2, Wave 4-6) into a complete
LightRAG-style graph index — compact, embed, vector upsert,
snapshot-diff delete, and merge-candidate detect.

Step ordering (per spec §K.12.3 + huangheng msg=16a38734 invariant
list):

1. **Compactor** (W7-2): for each entity / relation just touched by
   this sync, run ``GraphIndexCompactor.compact_if_oversized`` over the
   per-doc ``description_parts`` and write the unified summary back via
   ``upsert_*_with_lineage(..., compacted_description=...)``. Returning
   ``None`` (below threshold or compactor unwired) leaves the
   COALESCE-preserved column alone.
2. **Embed + vector upsert**: hash the compacted summary (or the
   ``name`` + raw parts fallback when the compactor opted out) and
   ``await asyncio.to_thread(connector.upsert, [VectorPoint(...)])``
   the result. Vector point id is
   ``uuid5(NAMESPACE_DNS, f"graph_entity:{collection_id}:{name}")`` —
   deterministic so a re-sync overwrites instead of leaking. Payload is
   the 3-field shape locked at spec §K.12.5 ratify msg=acbd0003 /
   msg=d3f4e6f8: ``{indexer, entity_name, entity_type}``. No
   ``collection_id`` payload — that lives in the uuid5 id (cross-
   collection uniqueness in a shared backing store) while the connector
   handles per-tenant guard.
3. **Snapshot-diff delete**: pre-sync entity-name set
   (``find_entity_ids_with_lineage`` from Phase 1) minus post-sync set
   (kg.jsonl entities ∪ pre-sync entities still alive after gc) → ids
   to delete. Computed against the lineage store, never an ANN
   ``list_all`` (invariant #7).
4. **MergeCandidateDetector** (W7-4, optional): pass the affected
   entity names to ``detect_for_sync`` so PENDING auto-detect
   suggestions get persisted for the curator UI. D-3 lock — detector
   never auto-merges.
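
A small worked example of the snapshot-diff in step 3, using plain sets. The function name `names_to_delete` and the `still_alive` parameter are hypothetical; the set algebra follows the description above (post-sync = kg.jsonl entities ∪ pre-sync entities still alive after gc).

```python
from typing import Set


def names_to_delete(pre_sync: Set[str], kg_entities: Set[str], still_alive: Set[str]) -> Set[str]:
    """Entity names whose vector points should be removed after this sync."""
    # Post-sync set: what the new parse produced, plus pre-sync names that
    # survived gc because another document still references them.
    post_sync = kg_entities | (pre_sync & still_alive)
    return pre_sync - post_sync


if __name__ == "__main__":
    pre_sync = {"Linus Torvalds", "Git", "Subversion"}
    kg_entities = {"Git"}             # doc_A re-parsed: only Git is mentioned now
    still_alive = {"Linus Torvalds"}  # doc_B still mentions him, so gc kept the row
    print(names_to_delete(pre_sync, kg_entities, still_alive))
```

Only names that were present before the sync can be deleted, and a name kept alive by another document is never in the result, which is the cross-doc case the tests pin.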

The Phase 3 dependencies (``compactor``, ``embedder``,
``vector_connector``, ``merge_detector``) are all optional kwargs.
Wave 6 callers that don't wire them get the lineage-only behaviour
unchanged — Phase 3 returns early when the vector connector or
embedder is unset. Failures inside any step (compactor LLM flake,
embedder error, vector backend hiccup, detector raise) are logged and
swallowed so the lineage critical path always survives a partial
Phase 3.

``worker_factory._build_graph_worker`` wires production deps:
* ``compactor`` via the shared ``build_collection_llm_callable`` (the
  same resolver the graph extractor uses).
* ``vector_connector`` + ``embedder`` via the existing
  ``_build_collection_qdrant_connector`` helper — the graph entity /
  relation vectors go into the same Qdrant collection as chunk
  vectors, distinguished by the ``indexer`` payload key.
* ``merge_detector`` constructed once the connector + embedder
  resolved, with a thin shim adapting the sync ``(text -> list[float])``
  callable used by the graph worker into the ``embed_query`` shape the
  detector expects.

When any of these fail to construct (no completion model,
collection's embedder unresolved, etc.) the factory logs a warning and
falls back to the no-op default, mirroring the summary worker pattern.
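
The construct-or-fall-back pattern described above might look like the sketch below. `build_optional` is a hypothetical helper, and returning `None` as the "no-op default" is an assumption about how the worker treats unwired deps.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def build_optional(name: str, builder: Callable[[], T]) -> Optional[T]:
    """Try to construct a production dependency; on failure, warn and return None."""
    try:
        return builder()
    except Exception as exc:
        logger.warning("graph worker: %s unavailable, falling back to no-op: %s", name, exc)
        return None


def broken_embedder() -> object:
    # Simulates a collection whose embedder cannot be resolved.
    raise ValueError("collection embedder unresolved")


if __name__ == "__main__":
    embedder = build_optional("embedder", broken_embedder)
    compactor = build_optional("compactor", lambda: "llm-callable")
    print(embedder, compactor)
```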

Tests:
* 12 new InMemory unit tests in
  ``tests/unit_test/indexing/test_t1_2_graph.py`` covering twelve
  cases (Phase 3 skipped without deps; 3-field payload exact;
  uuid5 id deterministic across resyncs; compactor before embed;
  fallback on compactor None; snapshot-diff delete on doc cascade;
  cross-doc shared entity not deleted; relation Phase 3 mirrors
  entity; detector receives correct names; detector failure
  non-fatal; compactor failure non-fatal; vector upsert failure
  non-fatal). All 1117 unit tests pass.

§K.12 invariant cross-check: this PR materially honours #2 (L1 → L2
single-direction derivation), #3 (Compactor runs before vector
embed), #4 (vector store via
``VectorStoreConnector(Adaptor)`` abstraction — no Qdrant client
import), #5 (3-field payload + ``indexer`` filter), #6 (uuid5
deterministic id), #7 (snapshot-diff via lineage name set, not ANN
list-all), #11 (D-3 detector writes-only, never auto-merges), #12
(no LightRAG strings introduced).

4-pattern pre-check matrix:
* P1 v1: ``GraphModalityWorker.sync`` callers — orchestrator wiring
  in ``aperag/indexing/`` only; the new optional kwargs default to
  ``None`` so call shape is backward-compat.
* P1 v2: caller return-shape expectations — sync still returns
  ``None``; nothing observable beyond Phase 3 side effects on the
  vector store + suggestion table.
* P2 (state binding): ``GraphIndexCompactor`` is W7-2 (merged
  ``c1c48429``), ``MergeCandidateDetector`` is W7-4 (merged
  ``0dbf9fd1``), ``VectorStoreConnector`` is the existing Wave 4
  abstraction. All three already in main.
* P3 (Protocol method state): no new ``LineageGraphStore`` Protocol
  methods needed — Phase 3 reads via existing ``get_entity`` /
  ``get_relation`` and writes via the ``compacted_description`` kwarg
  shipped in W7-1.

Closes Wave 7 task #3.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@earayu earayu left a comment

🟢 LGTM ✅ (huangheng pass-1, per spec §K.12.11). GitHub does not allow approving from the same account; verdict = ready to merge.

The strict four-step Phase 3 ordering, the 3-field payload, the deterministic uuid5 id, the snapshot-diff via name set, Wave 6 backward compatibility (Phase 3 short-circuits when its optional deps are all None), and swallowed failures that keep the lineage critical path inviolate all align with spec §K.12.3, my msg=16a38734 7-invariant lock, and the architect's msg=acbd0003 lock. This is the sixth PR to meet the hard-gate format (mirroring the standard of PR #1751/#1752/#1754/#1755/#1756).

12-invariant cross-check (task #3 scope)

| # | Invariant | This PR / evidence |
|---|---|---|
| 1 | L1 graph data not polluted (kg.jsonl raw vs. storage view strictly layered) | kg.jsonl untouched; Phase 3 only reads description_parts and writes the derived compacted_description and vector points |
| 2 | L1 → L2 single-direction derivation | Compactor reads description_parts → writes compacted_description (derived within L1) → embed → vector point (L2 from L1); never the reverse |
| 3 | Compactor at sync end, before vector embed | Phase 3 order at lines 1428-1463: compactor (Step A) → embed/upsert (Step B, lines 1434-1437) → snapshot-diff (Step C, lines 1445-1461) → detector (Step D); test_w7_phase3_compactor_runs_before_embed |
| 4 | Vector store via VectorStoreConnectorAdaptor | type hint uses the abstract VectorStoreConnector Protocol; asyncio.to_thread(connector.upsert/delete, ...) per the architect's msg=acbd0003 sync-wrap pattern |
| 5 | payload indexer="graph_entity" filter / 3 fields | ✅ for entity: lines 1537-1541, the entity payload is exactly the 3 fields {indexer, entity_name, entity_type} with no collection_id leak; test_w7_phase3_writes_vector_point_with_3_field_payload. ⚠️ relation: see the minor observations below |
| 6 | uuid5 deterministic vector point id | lines 1611-1620: entity f"graph_entity:{cid}:{name}" / relation f"graph_relation:{cid}:{src}->{tgt}:{type}"; test_w7_phase3_uuid5_id_is_deterministic pins that re-syncing the same content yields the same id |
| 7 | snapshot-diff via lineage entity name set | _capture_entity_names: kg_entity_records ∪ store-survived-pre-sync; line 1449 pre_sync - post_sync = gc_entity_names; does not call vector_connector.scroll; test_w7_phase3_snapshot_diff_preserves_cross_doc_entity pins that cross-doc shared entities are not gc'd |
| 8 | alias_map orphan-persists | n/a; task #6 scope |
| 9 | upsert transparent alias redirect | n/a; task #6 scope |
| 10 | DB column length is an application-layer cap | n/a; W7-1/W7-2 scope; task #3 does not touch the schema |
| 11 | Candidate detection writes only, no auto-merge (D-3) | Step D calls merge_detector.detect_for_sync(...) (verified read-only in PR #1755); test_w7_phase3_merge_detector_invoked_with_affected_names pins the call shape |
| 12 | Naming: grep-zero LightRAG | gh pr diff 1757 \| grep -E '^\+' \| grep -i lightrag → 0 added lines |

Phase 3 step ordering strictness check

# graph.py line 1428-1463
# Step A — Compactor
for entity in affected_entities_for_phase3:
    await self._compact_and_persist_entity(entity)  # ← 1st
# Step B — Embed + vector upsert
for entity in affected_entities_for_phase3:
    await self._upsert_entity_vector_point(entity)  # ← 2nd
for relation in affected_relations_for_phase3:
    await self._upsert_relation_vector_point(relation)
# Step C — Snapshot-diff delete
gc_entity_names = pre_sync_entity_names - post_sync_entity_names
ids = [self._entity_vector_id(name) for name in gc_entity_names]
await asyncio.to_thread(self._vector_connector.delete, ids)  # ← 3rd
# Step D — Detector
await self._merge_detector.detect_for_sync(...)  # ← 4th

Exactly mirrors spec §K.12.3 and invariant #3. ✅
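
The ordering pin named in the table (a "COMPACTED" stub plus a capturing embedder) can be approximated in isolation as below. `embed_entity` is a hypothetical reduction of Steps A and B to one function, and the newline-joined fallback text is an assumption; the real test drives `GraphModalityWorker.sync()` instead.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


async def embed_entity(
    name: str,
    parts: List[str],
    compactor: Optional[Callable[[List[str]], Awaitable[Optional[str]]]],
    embedder: Callable[[str], List[float]],
) -> List[float]:
    """Step A (compact) runs strictly before Step B (embed)."""
    compacted = await compactor(parts) if compactor else None
    # Fallback when the compactor opts out; the exact join format is an assumption.
    text = compacted if compacted is not None else "\n".join([name, *parts])
    return embedder(text)


def test_compactor_runs_before_embed() -> None:
    seen: List[str] = []

    async def compacted_stub(parts: List[str]) -> Optional[str]:
        return "COMPACTED"

    def capturing_embedder(text: str) -> List[float]:
        seen.append(text)
        return [0.0]

    asyncio.run(embed_entity("Linus", ["part a", "part b"], compacted_stub, capturing_embedder))
    # The embedder can only have seen "COMPACTED" if compaction already ran.
    assert seen == ["COMPACTED"]
```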

Optional deps + degradation check

  • Existing Wave 6 sync (no Phase 3 deps) → if not (self._embedder and self._vector_connector): return short-circuits → the 1083 existing unit tests are not broken ✅
  • Per-step failures swallowed: lines 1525-1531 / 1545-1550 / 1558-1566 / 1578-1587 are all wrapped in try/except → lineage critical path inviolate ✅
  • test_w7_phase3_skipped_when_deps_unwired + test_w7_phase3_compactor_failure_non_fatal + test_w7_phase3_vector_upsert_failure_non_fatal + test_w7_phase3_detector_failure_non_fatal cover all of these paths

4-pattern pre-check matrix (pasted in the PR body)

P1 v1 callers, P1 v2 caller method, P2 dependency state, and P3 Protocol method state are all pasted; every Phase 3 dep references an already-merged commit (W7-1 c1499777 / W7-2 c1c48429 / W7-4 0dbf9fd1).

simple-stable directive 4-guardrail

The PR description makes all four explicit: no new Protocol method / smallest surface / production helpers reuse / alembic-free.

Test quality

12 new tests + 1117 total pass:

  • Phase 3 skip when unwired (Wave 6 backward-compat)
  • 3-field payload exact (no collection_id leak)
  • uuid5 id deterministic across resyncs
  • Compactor → embed ordering invariant pinned
  • Compactor None falls back to name + raw parts
  • Snapshot-diff deletes gc'd vector
  • Cross-doc shared entity NOT deleted (Linus Torvalds is in doc_A + doc_B; doc_A is re-parsed and no longer mentions him → not gc'd → vector retained)
  • Relation Phase 3 mirror entity (3-field payload, indexer="graph_relation", uuid5 format)
  • Detector invoked with affected names
  • Detector / Compactor / Vector failure non-fatal × 3

Complete.

⚠️ 2 minor architecture observations (non-blocking, recorded for future tasks)

Observation 1: Relation payload field naming is overloaded

Lines 1572-1576, relation vector point payload:

payload={
    "indexer": "graph_relation",
    "entity_name": f"{relation.source}->{relation.target}",  # ← composite value reuses the entity_name field
    "entity_type": relation.relation_type,
}

indexer="graph_relation" distinguishes the type, but in the relation context the payload field entity_name holds a "source->target" composite, not a single entity name. Readers must parse it in an indexer-aware way.

Currently OK: MergeCandidateDetector and GraphSearchService both filter strictly on Eq("indexer", "graph_entity"), so a relation hit is never treated as an entity.

Suggested future cleanup (add to the task #10 close-out list): give the relation payload distinct fields, e.g. {indexer, source, target, relation_type}, so that entity_name no longer carries two meanings.
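
To make the suggestion concrete, here is a side-by-side sketch of the payload as written today and the distinct-field shape proposed above. Both function names are hypothetical, and the proposed shape is only the reviewer's example, not something this PR ships.

```python
def relation_payload_current(source: str, target: str, relation_type: str) -> dict:
    """Shape written by this PR: entity_name carries a "source->target" composite."""
    return {
        "indexer": "graph_relation",
        "entity_name": f"{source}->{target}",
        "entity_type": relation_type,
    }


def relation_payload_proposed(source: str, target: str, relation_type: str) -> dict:
    """Proposed cleanup: one field per concept, so readers need no string parsing."""
    return {
        "indexer": "graph_relation",
        "source": source,
        "target": target,
        "relation_type": relation_type,
    }


if __name__ == "__main__":
    print(relation_payload_current("Linus Torvalds", "Git", "created"))
    print(relation_payload_proposed("Linus Torvalds", "Git", "created"))
```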

Observation 2: Relation vectors are currently unused by task #5 search_relations

The task #3 PR writes graph_relation vector points (per spec §K.12.3 "embed entity/relation"), but task #5 (PR #1756, already merged) implements search_relations as a 1-hop expansion that does not query graph_relation vectors.

Result: relation vectors are written to the store but search_relations does not read them; for now they are a forward-compat reserve with no consumer.

From the spec's angle: the spec §K.12.3 GraphSearchService description says "Qdrant recall of entity & relation (payload filter indexer=...)", i.e. the spec expects search_relations to use vector recall. Task #5's 1-hop expansion is a conservative subset (the PR #1756 description explicitly says "vector store carries no per-relation vectors in Wave 7").

So: task #3 writing relation vectors is spec-correct; task #5's search_relations is currently incomplete and should be upgraded in a follow-up.

Suggested follow-up (not blocking this PR): in task #5-bis, or during the task #7/#8 wiring stage, upgrade search_relations to use graph_relation vector recall (Eq("indexer", "graph_relation")) so that the relation vectors written by task #3 are fully used. Without the upgrade, writing relation vectors is storage waste; with it, the LightRAG-style full recall the spec expects is achieved.

This can go on the task #10 close-out cleanup list or be a Wave 8 optimization candidate.

Checklist to reach LGTM after fixes

In fact this is already mergeable ✅. Both minor observations are candidates to carry forward; within its own scope this PR fully aligns with the spec.

@bryce very solid work. The strict 4-step Phase 3 ordering plus the cross-doc preserve test show strong attention to detail (Linus is in doc_A + doc_B, and re-parsing doc_A does not delete the vector; this kind of multi-doc shared-entity gc-correctness pin is a hard guarantee of product stability). 👍

@符炫炜 LGTM, OK to merge.
@不穷 please move task #3 → done after merge; tasks #6 / #7 / #8 / #9 / #10 remain on the critical path.

@bryce additions to the task #10 close-out cleanup list:

  • (earlier round, task #1) 8 Wave 6 era # -- LightRAG-style query layer comments
  • (earlier round, task #4) line 249 legacy name fallback
  • (earlier round, task #5) 6 docstrings with descriptive "LightRAG-style" wording
  • (this round, task #3) Observation 1: redesign the relation payload fields (entity_name/entity_type → source/target/relation_type); Observation 2: upgrade search_relations to use graph_relation vector recall (full alignment with spec §K.12.3)

Wave 7 progress: 6/10 task PRs + 2 spec PRs = 8 PRs merged, after which close-out can proceed (remaining: #6 / #7 / #8 / #9 / #10 = 5 PRs).

@earayu earayu merged commit 37bdb73 into main Apr 27, 2026
10 checks passed
@earayu earayu deleted the bryce/wave7-task3-sync-extension branch April 27, 2026 18:28
earayu added a commit that referenced this pull request Apr 27, 2026
…t + LineageEntityMerger (#1758)

§K.12.6 / §K.12.7 / §K.12.10b task #6 — full storage + service body for
user-driven entity merge over the lineage graph. Per architect ratify
msg=cf860ae4 + huangheng endorse msg=22816e0d (5-drift lock) +
sentinel pick msg=22816e0d (`__curation_merge__`).

What ships
----------

1. **`aperag_lineage_entity_alias` table** (alembic
   ``b5d2e8f1c9a4``) — composite PK ``(collection_id, alias_name)``,
   ``canonical_name`` index for reverse lookup.
2. **`LineageEntityAlias` ORM** (`aperag/domains/knowledge_graph/db/
   models.py`) — re-uses the curation domain's existing ``Base``.
3. **`AliasMapRepository`** (`aperag/graph_curation/alias_map.py`):
   - `resolve_canonical(collection_id, name)` — single-indirection
     read (transitive flatten keeps the table 1-deep at write time).
   - `upsert_alias(...)` — cycle reject (`AliasCycleError`) +
     transitive flatten (UPDATE + INSERT in one transaction).
   - `list_aliases_pointing_at(...)` for tests / admin tooling.
   - `purge_collection(...)` for collection teardown.
4. **`LineageGraphStoreWithAliasRedirect` decorator**
   (`aperag/indexing/alias_redirect_store.py`) — wraps any
   ``LineageGraphStore`` + ``AliasMapRepository``, intercepts
   ``upsert_entity_with_lineage`` / ``upsert_relation_with_lineage``
   to rewrite entity names through the alias map, forwards every
   other Protocol method byte-for-byte. Per huangheng CR lock
   (Option (b), msg=93d9add1 / msg=22816e0d): zero changes to the
   three backend store implementations.
5. **`LineageEntityMerger`** (`aperag/graph_curation/lineage_merge.py`)
   — orchestrator for user-driven merge. Step ordering locked
   (invariant #2): alias upserts → L1 source-parts re-anchored
   preserving doc lineage → L1 final unified+compacted with
   ``__curation_merge__`` sentinel → vector upsert (3-field payload,
   ``uuid5`` deterministic id) → source delete (L1 + vector) last.
6. **24 unit tests**:
   - `test_alias_map.py` (10): cycle reject self-loop + chain cycle,
     transitive flatten, target flatten through chain, alias persists
     after canonical GC, per-collection isolation, purge.
   - `test_alias_redirect_store.py` (5): indexer write redirects to
     canonical, no-alias passthrough, both-endpoint relation
     redirect, single-endpoint relation redirect, decorator
     passthrough invariant for all 13 non-upsert Protocol methods
     (huangheng CR lock).
   - `test_lineage_merge.py` (9): empty source short-circuit, step
     order L1 → vector → delete, sentinel ``__curation_merge__``,
     Compactor kwargs locked (subject_kind/subject_label/language),
     source parts re-anchored preserving per-doc lineage, vector
     payload 3-field + uuid5 deterministic, alias cycle propagation,
     target alias chain flatten, target GC tolerance.
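
The cycle-reject and transitive-flatten rules from item 3 can be illustrated with an in-memory stand-in. This is a sketch under assumptions: `InMemoryAliasMap` is hypothetical, uses a dict instead of the `aperag_lineage_entity_alias` table, and skips the transaction handling the real repository needs.

```python
from typing import Dict, Tuple


class AliasCycleError(ValueError):
    """Raised when an alias would point back at itself, directly or via a chain."""


class InMemoryAliasMap:
    def __init__(self) -> None:
        # (collection_id, alias_name) -> canonical_name, kept 1-deep at write time.
        self._rows: Dict[Tuple[str, str], str] = {}

    def resolve_canonical(self, collection_id: str, name: str) -> str:
        # Single-indirection read: one lookup suffices because writes flatten chains.
        return self._rows.get((collection_id, name), name)

    def upsert_alias(self, collection_id: str, alias_name: str, canonical_name: str) -> None:
        # Flatten the target through any existing alias first.
        target = self.resolve_canonical(collection_id, canonical_name)
        if target == alias_name:
            raise AliasCycleError(f"{alias_name!r} would alias itself")
        # Transitive flatten: rows that pointed at alias_name now point at target.
        for key, value in list(self._rows.items()):
            if key[0] == collection_id and value == alias_name:
                self._rows[key] = target
        self._rows[(collection_id, alias_name)] = target


if __name__ == "__main__":
    aliases = InMemoryAliasMap()
    aliases.upsert_alias("c1", "L. Torvalds", "Linus")
    aliases.upsert_alias("c1", "Linus", "Linus Torvalds")
    print(aliases.resolve_canonical("c1", "L. Torvalds"))
```

After the second upsert the older row is rewritten to point straight at the new canonical name, so reads never have to walk a chain.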

§K.12 invariant cross-check
---------------------------

- #1 L1 not polluted: source parts re-anchored under target preserve
  original `(document_id, parse_version, chunk_ids)` lineage —
  pinned by `test_source_parts_reanchored_preserving_doc_lineage`.
- #2 L1 → L2 derivation: step ordering pinned by
  `test_step_order_is_l1_then_vector_then_delete` (L1 writes precede
  vector writes; deletes last).
- #3 transparent alias redirect: pinned by
  `test_indexer_upsert_after_merge_redirects_to_canonical`.
- #4 vector store via abstraction: ``VectorStoreConnector.upsert/delete``
  via ``asyncio.to_thread``.
- #5 3-field payload: pinned by
  `test_vector_payload_is_3_field_with_deterministic_uuid5`.
- #6 uuid5 deterministic point id: same test pins
  ``uuid5(NAMESPACE_DNS, "graph_entity:{cid}:{name}")``.
- #7 alias_map orphan persist: pinned by
  `test_alias_persists_after_canonical_gc`.
- #9 cycle flatten + reject: pinned by 3 tests
  (`test_transitive_flatten_rewrites_existing_alias_rows`,
  `test_cycle_reject_self_loop`,
  `test_cycle_reject_through_existing_chain`).
- #11 D-3 merge user-driven only: `LineageEntityMerger.merge_entities`
  is the only entry point; `MergeCandidateDetector` (PR #1755) does
  not import it.
- #12 grep-zero LightRAG: 0 hits in new files.
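
A minimal sketch of the alias-redirect decorator idea (item 4 of "What ships", pinned by invariant #3 above). The class name, the async method signature, and the `resolve` callable are assumptions for illustration; the relation upsert is omitted for brevity, and the real `LineageGraphStore` Protocol is not reproduced here.

```python
import asyncio
from typing import Any, Callable, List, Tuple


class AliasRedirectStore:
    """Wraps a store, rewriting entity names on upsert and forwarding everything else."""

    def __init__(self, inner: Any, resolve: Callable[[str, str], str]) -> None:
        self._inner = inner
        self._resolve = resolve

    async def upsert_entity_with_lineage(self, collection_id: str, name: str, **kwargs: Any) -> Any:
        canonical = self._resolve(collection_id, name)
        return await self._inner.upsert_entity_with_lineage(collection_id, canonical, **kwargs)

    def __getattr__(self, attr: str) -> Any:
        # Only reached for attributes not defined on the wrapper: plain passthrough.
        return getattr(self._inner, attr)


class FakeStore:
    """Hypothetical inner store that records upserts."""

    def __init__(self) -> None:
        self.upserts: List[Tuple[str, str]] = []

    async def upsert_entity_with_lineage(self, collection_id: str, name: str, **kwargs: Any) -> None:
        self.upserts.append((collection_id, name))

    async def get_entity(self, collection_id: str, name: str) -> Tuple[str, str]:
        return (collection_id, name)


if __name__ == "__main__":
    inner = FakeStore()
    aliases = {"Linus T.": "Linus Torvalds"}
    store = AliasRedirectStore(inner, lambda cid, name: aliases.get(name, name))
    asyncio.run(store.upsert_entity_with_lineage("c1", "Linus T."))
    print(inner.upserts)
```

Because the wrapper only overrides the upsert path, the backend store classes need no changes, which matches the "zero changes to the three backend store implementations" constraint.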

Scope notes
-----------

- **Wiring**: `worker_factory._build_lineage_graph_store` is NOT
  changed in this PR. Bryce's task #3 PR (#1757) just merged adds the
  worker.sync() consumption path; wiring the alias-redirect decorator
  into `_build_lineage_graph_store` is a one-line follow-up that
  belongs with task #8 retrieval cutover (where the legacy merge route
  is also replaced) so we don't ship a partial wiring.
- **REST route cutover**: legacy
  `KnowledgeGraphService.merge_entities` (route handler) still calls
  `GraphIndexService.merge_entities`. Replacing it with
  `LineageEntityMerger` is task #8's job (chenyexuan) — it changes
  no field on the response shape so cuiwenbo task #9 frontend is
  zero-touch.
earayu added a commit that referenced this pull request Apr 28, 2026
Replaces the Wave 7 conservative 1-hop expansion in
``GraphSearchService.search_relations`` with direct vector recall
filtered on ``Eq("indexer", "graph_relation")``. Task #3 (PR #1757)
has been writing relation vectors all along; this is finally the
consumer side, completing the LightRAG-style full recall the spec
§K.12.6 expected (huangheng W8-1 sediment from task #3 PR CR
msg=d4ad0259).

Algorithm
---------

1. Embed query via ``embedder.embed_query``.
2. ``vector_connector.search`` with ``Eq("indexer", "graph_relation")``
   filter + the same threshold/top_k knobs ``search_entities`` uses.
3. Parse each hit's payload (``entity_name="src->tgt"`` +
   ``entity_type=relation_type`` per task #3 writer
   ``aperag/indexing/graph.py:1631``) by splitting on ``->``;
   skip hits that don't parse cleanly.
4. Reverse-lookup full ``RelationWithLineage`` via
   ``asyncio.gather(*store.get_relation(...))`` (architect ratify
   approach (a), msg=cf860ae4 — preserves ``compose_context``
   byte-parity rendering required by the task #5 invariant).
5. Drop ``None`` results (edge GC'd between sync and search).

Failure paths (embedder / vector store down) swallow and return
``[]``, mirroring ``search_entities``.
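
Steps 3 to 5 of the algorithm can be sketched as below. The function names and the `get_relation(source, target, relation_type)` signature are hypothetical; splitting at the first ``->`` is also an assumption, since the commit message does not say how names that themselves contain ``->`` are handled.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Tuple

RelationKey = Tuple[str, str, str]


def parse_relation_payload(payload: Optional[dict]) -> Optional[RelationKey]:
    """Return (source, target, relation_type), or None when the hit does not parse cleanly."""
    if not payload:
        return None
    name = payload.get("entity_name")
    relation_type = payload.get("entity_type")
    if not isinstance(name, str) or not relation_type or "->" not in name:
        return None
    source, _, target = name.partition("->")
    if not source or not target:
        return None
    return source, target, relation_type


async def resolve_hits(
    payloads: List[Optional[dict]],
    get_relation: Callable[[str, str, str], Awaitable[Optional[object]]],
) -> List[object]:
    keys: List[RelationKey] = []
    for payload in payloads:
        key = parse_relation_payload(payload)
        if key is not None and key not in keys:  # keep hit order, dedup by key
            keys.append(key)
    relations = await asyncio.gather(*(get_relation(*key) for key in keys))
    # Drop edges that were GC'd between sync and search.
    return [relation for relation in relations if relation is not None]
```

`asyncio.gather` returns results in the order the awaitables were passed, so hit order is preserved through the reverse lookup.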

§K.12 invariant cross-check
---------------------------

- #4 vector store via abstraction: uses ``VectorStoreConnector``,
  no Qdrant-specific imports.
- #5 indexer filter: ``Eq("indexer", "graph_relation")`` pinned in
  ``test_search_relations_uses_graph_relation_filter``.
- #11 D-3 read-only: never invokes any mutation method.
- #12 grep-zero LightRAG: 0 hits in the new code.
- All other invariants n/a (read-only path, no schema changes).

Test coverage
-------------

10 new/replaced cases in ``test_graph_search_service.py``:
- empty query / zero-topk short-circuit (no embed, no search)
- ``Eq("indexer","graph_relation")`` filter + threshold + top_k
  pinning
- payload parse + ``store.get_relation`` reverse lookup happy path
- hit-order preservation + payload-key dedup
- payload skip cases: missing payload, missing arrow, missing
  entity_type, empty source, empty target
- GC tolerance: edge deleted between sync and search
- embedder failure swallowed
- vector store failure swallowed

Plus 14 pre-existing ``search_entities`` / ``get_subgraph`` /
``compose_context`` tests retained — total 24/24 pass.