Commit 37bdb73
feat(graph): GraphModalityWorker.sync() Phase 3 (Wave 7 W7-3) (#1757)
Adds Phase 3 to ``GraphModalityWorker.sync()``, the four steps that
turn the lineage-store rebuild (Phase 1+2, Wave 4-6) into a complete
LightRAG-style graph index: compact, embed + vector upsert,
snapshot-diff delete, and merge-candidate detect.
Step ordering (per spec §K.12.3 + huangheng msg=16a38734 invariant
list):
1. **Compactor** (W7-2): for each entity / relation just touched by
this sync, run ``GraphIndexCompactor.compact_if_oversized`` over the
per-doc ``description_parts`` and write the unified summary back via
``upsert_*_with_lineage(..., compacted_description=...)``. Returning
``None`` (below threshold or compactor unwired) leaves the
COALESCE-preserved column alone.
2. **Embed + vector upsert**: embed the compacted summary (or the
``name`` + raw parts fallback when the compactor opted out) and
``await asyncio.to_thread(connector.upsert, [VectorPoint(...)])``
the result. Vector point id is
``uuid5(NAMESPACE_DNS, f"graph_entity:{collection_id}:{name}")`` —
deterministic so a re-sync overwrites instead of leaking. Payload is
the 3-field shape locked at spec §K.12.5 ratify msg=acbd0003 /
msg=d3f4e6f8: ``{indexer, entity_name, entity_type}``. No
``collection_id`` payload — that lives in the uuid5 id (cross-
collection uniqueness in a shared backing store) while the connector
handles per-tenant guard.
3. **Snapshot-diff delete**: pre-sync entity-name set
(``find_entity_ids_with_lineage`` from Phase 1) minus post-sync set
(kg.jsonl entities ∪ pre-sync entities still alive after gc) → ids
to delete. Computed against the lineage store, never an ANN
``list_all`` (invariant #7).
4. **MergeCandidateDetector** (W7-4, optional): pass the affected
entity names to ``detect_for_sync`` so PENDING auto-detect
suggestions get persisted for the curator UI. D-3 lock — detector
never auto-merges.
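Steps 2 and 3 above can be sketched roughly as follows. This is a
minimal illustration, not the code in this commit: the helper names
(``entity_point_id``, ``entity_payload``, ``names_to_delete``) and the
``"graph_entity"`` value for ``indexer`` are assumptions; only the uuid5
format string and the three payload keys come from the description
above.

```python
import uuid
from typing import Dict, Set


def entity_point_id(collection_id: str, name: str) -> str:
    """Deterministic point id: the same inputs always yield the same UUID."""
    key = f"graph_entity:{collection_id}:{name}"
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, key))


def entity_payload(name: str, entity_type: str) -> Dict[str, str]:
    """3-field payload; the indexer value used here is an assumption."""
    return {
        "indexer": "graph_entity",
        "entity_name": name,
        "entity_type": entity_type,
    }


def names_to_delete(
    pre_sync: Set[str], kg_entities: Set[str], alive_after_gc: Set[str]
) -> Set[str]:
    """Snapshot diff: pre-sync names that are absent from the post-sync set."""
    # Post-sync set = freshly extracted entities, plus pre-sync entities
    # that still have lineage after gc (e.g. shared with another doc).
    post_sync = kg_entities | (pre_sync & alive_after_gc)
    return pre_sync - post_sync
```

Because the id depends only on ``collection_id`` and ``name``, a re-sync
of the same entity targets the same point, and the delete set is derived
purely from name sets rather than from a scan of the vector store.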
The Phase 3 dependencies (``compactor``, ``embedder``,
``vector_connector``, ``merge_detector``) are all optional kwargs.
Wave 6 callers that don't wire them get the lineage-only behaviour
unchanged — Phase 3 returns early when the vector connector or
embedder is unset. Failures inside any step (compactor LLM flake,
embedder error, vector backend hiccup, detector raise) are logged and
swallowed so the lineage critical path always survives a partial
Phase 3.
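The early-return and log-and-swallow behaviour described above might
look like the sketch below. It is a synchronous simplification under
stated assumptions: ``run_phase3`` and its callable parameters are
hypothetical stand-ins for the real async worker and its dependencies,
and it returns the upserted names only to make the behaviour
observable (the real ``sync`` returns ``None``).

```python
import logging
from typing import Callable, List, Optional, Sequence

logger = logging.getLogger("phase3_sketch")


def run_phase3(
    names: Sequence[str],
    embedder: Optional[Callable[[str], List[float]]] = None,
    upsert: Optional[Callable[[str, List[float]], None]] = None,
    compact: Optional[Callable[[str], Optional[str]]] = None,
    detect: Optional[Callable[[List[str]], None]] = None,
) -> List[str]:
    """Run the optional steps; return the names upserted. Never raises."""
    if embedder is None or upsert is None:
        return []  # lineage-only behaviour: nothing to embed or store
    upserted: List[str] = []
    for name in names:
        text = name
        if compact is not None:
            try:
                # None from the compactor means "use the fallback text".
                text = compact(name) or name
            except Exception:
                logger.warning("compactor failed for %r", name, exc_info=True)
        try:
            upsert(name, embedder(text))
            upserted.append(name)
        except Exception:
            logger.warning("embed/upsert failed for %r", name, exc_info=True)
    if detect is not None:
        try:
            detect(list(names))
        except Exception:
            logger.warning("merge detector failed", exc_info=True)
    return upserted


def flaky_upsert(name: str, vector: List[float]) -> None:
    """Demo backend that fails for one specific name."""
    if name == "b":
        raise RuntimeError("vector backend hiccup")


def broken_detector(names: List[str]) -> None:
    """Demo detector that always fails."""
    raise RuntimeError("detector raise")
```

Each step has its own ``try`` block, so one failing entity or a failing
detector does not prevent the remaining work from running.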
``worker_factory._build_graph_worker`` wires production deps:
* ``compactor`` via the shared ``build_collection_llm_callable`` (the
same resolver the graph extractor uses).
* ``vector_connector`` + ``embedder`` via the existing
``_build_collection_qdrant_connector`` helper — the graph entity /
relation vectors go into the same Qdrant collection as chunk
vectors, distinguished by the ``indexer`` payload key.
* ``merge_detector`` constructed once the connector + embedder are
resolved, with a thin shim adapting the sync ``(text -> list[float])``
callable used by the graph worker into the ``embed_query`` shape the
detector expects.
When any of these fail to construct (no completion model,
collection's embedder unresolved, etc.) the factory logs a warning and
falls back to the no-op default, mirroring the summary worker pattern.
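The embedder shim mentioned above could be as small as the following
sketch. ``SyncEmbedderShim`` is a hypothetical name, and the assumption
that ``embed_query`` is a synchronous method taking one string and
returning a list of floats is not confirmed by this message.

```python
from typing import Callable, List


class SyncEmbedderShim:
    """Hypothetical adapter exposing ``embed_query`` over a plain callable."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn

    def embed_query(self, text: str) -> List[float]:
        # Delegate straight to the worker's (text -> list[float]) callable.
        return self._embed_fn(text)
```

Wrapping the callable this way lets the detector and the graph worker
share one embedding function without either knowing the other's
interface.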
Tests:
* 12 new InMemory unit tests in
``tests/unit_test/indexing/test_t1_2_graph.py`` covering twelve cases
(Phase 3 skipped without deps; 3-field payload exact;
uuid5 id deterministic across resyncs; compactor before embed;
fallback on compactor None; snapshot-diff delete on doc cascade;
cross-doc shared entity not deleted; relation Phase 3 mirrors
entity; detector receives correct names; detector failure
non-fatal; compactor failure non-fatal; vector upsert failure
non-fatal). All 1117 unit tests pass.
§K.12 invariant cross-check: this PR materially honours #2 (L1 → L2
single-direction derivation), #3 (Compactor runs before vector
embed), #4 (vector store via
``VectorStoreConnector(Adaptor)`` abstraction — no Qdrant client
import), #5 (3-field payload + ``indexer`` filter), #6 (uuid5
deterministic id), #7 (snapshot-diff via lineage name set, not ANN
list-all), #11 (D-3 detector writes-only, never auto-merges), #12
(no LightRAG strings introduced).
4-pattern pre-check matrix:
* P1 v1: ``GraphModalityWorker.sync`` callers — orchestrator wiring
in ``aperag/indexing/`` only; the new optional kwargs default to
``None`` so call shape is backward-compat.
* P1 v2: caller return-shape expectations — sync still returns
``None``; nothing observable beyond Phase 3 side effects on the
vector store + suggestion table.
* P2 (state binding): ``GraphIndexCompactor`` is W7-2 (merged
``c1c48429``), ``MergeCandidateDetector`` is W7-4 (merged
``0dbf9fd1``), ``VectorStoreConnector`` is the existing Wave 4
abstraction. All three already in main.
* P3 (Protocol method state): no new ``LineageGraphStore`` Protocol
methods needed — Phase 3 reads via existing ``get_entity`` /
``get_relation`` and writes via the ``compacted_description`` kwarg
shipped in W7-1.
Closes Wave 7 task #3.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 3bf3116 commit 37bdb73
3 files changed
Lines changed: 1174 additions & 1 deletion