Commit 37bdb73

earayu and claude authored
feat(graph): GraphModalityWorker.sync() Phase 3 (Wave 7 W7-3) (#1757)
Adds Phase 3 to ``GraphModalityWorker.sync()``: the four steps that turn the lineage-store rebuild (Phase 1+2, Wave 4-6) into a complete LightRAG-style graph index (compact, embed + vector upsert, snapshot-diff delete, and merge-candidate detect).

Step ordering (per spec §K.12.3 + huangheng msg=16a38734 invariant list):

1. **Compactor** (W7-2): for each entity / relation just touched by this sync, run ``GraphIndexCompactor.compact_if_oversized`` over the per-doc ``description_parts`` and write the unified summary back via ``upsert_*_with_lineage(..., compacted_description=...)``. Returning ``None`` (below threshold or compactor unwired) leaves the COALESCE-preserved column alone.
2. **Embed + vector upsert**: hash the compacted summary (or the ``name`` + raw parts fallback when the compactor opted out) and ``await asyncio.to_thread(connector.upsert, [VectorPoint(...)])`` the result. The vector point id is ``uuid5(NAMESPACE_DNS, f"graph_entity:{collection_id}:{name}")``, which is deterministic, so a re-sync overwrites the existing point instead of leaking a new one. The payload is the 3-field shape locked at spec §K.12.5 ratify msg=acbd0003 / msg=d3f4e6f8: ``{indexer, entity_name, entity_type}``. There is no ``collection_id`` in the payload: it lives in the uuid5 id (cross-collection uniqueness in a shared backing store), while the connector handles the per-tenant guard.
3. **Snapshot-diff delete**: the pre-sync entity-name set (``find_entity_ids_with_lineage`` from Phase 1) minus the post-sync set (kg.jsonl entities ∪ pre-sync entities still alive after gc) gives the ids to delete. This is computed against the lineage store, never an ANN ``list_all`` (invariant #7).
4. **MergeCandidateDetector** (W7-4, optional): pass the affected entity names to ``detect_for_sync`` so PENDING auto-detect suggestions get persisted for the curator UI. D-3 lock: the detector never auto-merges.

The Phase 3 dependencies (``compactor``, ``embedder``, ``vector_connector``, ``merge_detector``) are all optional kwargs.
Wave 6 callers that don't wire them get the lineage-only behaviour unchanged: Phase 3 returns early when the vector connector or embedder is unset. Failures inside any step (compactor LLM flake, embedder error, vector backend hiccup, detector raise) are logged and swallowed, so the lineage critical path always survives a partial Phase 3.

``worker_factory._build_graph_worker`` wires production deps:

* ``compactor`` via the shared ``build_collection_llm_callable`` (the same resolver the graph extractor uses).
* ``vector_connector`` + ``embedder`` via the existing ``_build_collection_qdrant_connector`` helper. The graph entity / relation vectors go into the same Qdrant collection as chunk vectors, distinguished by the ``indexer`` payload key.
* ``merge_detector`` constructed once the connector + embedder have resolved, with a thin shim adapting the sync ``(text -> list[float])`` callable used by the graph worker into the ``embed_query`` shape the detector expects.

When any of these fail to construct (no completion model, collection's embedder unresolved, etc.) the factory logs a warning and falls back to the no-op default, mirroring the summary worker pattern.

Tests:

* 12 new InMemory unit tests in ``tests/unit_test/indexing/test_t1_2_graph.py`` cover the following cases: Phase 3 skipped without deps; 3-field payload exact; uuid5 id deterministic across resyncs; compactor before embed; fallback on compactor None; snapshot-diff delete on doc cascade; cross-doc shared entity not deleted; relation Phase 3 mirrors entity; detector receives correct names; detector failure non-fatal; compactor failure non-fatal; vector upsert failure non-fatal. All 1117 unit tests pass.

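Steps 2 and 3 can be sketched in isolation. This is a minimal illustration, not the worker's code: the function names, the ``"graph"`` value for ``indexer`` and the sample data are assumptions. Only the uuid5 recipe, the three payload keys and the set arithmetic come from the description above.

```python
import uuid
from typing import Dict, Set


def graph_entity_point_id(collection_id: str, name: str) -> str:
    """Deterministic point id, using the uuid5 recipe quoted in step 2."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, f"graph_entity:{collection_id}:{name}"))


def graph_entity_payload(entity_name: str, entity_type: str) -> Dict[str, str]:
    """The 3-field payload. The "graph" indexer value is an assumption."""
    return {"indexer": "graph", "entity_name": entity_name, "entity_type": entity_type}


def names_to_delete(pre_sync: Set[str], extracted: Set[str], still_alive: Set[str]) -> Set[str]:
    """Pre-sync names minus the post-sync set (extracted union still-alive-after-gc)."""
    return pre_sync - (extracted | still_alive)


# Hypothetical sample data: "Bob" lost all lineage, "Carol" is kept by another doc.
pre_sync = {"Alice", "Bob", "Carol"}
extracted = {"Alice", "Dave"}
still_alive = {"Carol"}

stale_names = names_to_delete(pre_sync, extracted, still_alive)
stale_ids = [graph_entity_point_id("col-1", n) for n in sorted(stale_names)]
```

Because the id depends only on the collection id and the entity name, the ids to delete can be recomputed from the lineage name sets alone, without listing the vector store.
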
§K.12 invariant cross-check: this PR materially honours

* #2 (L1 → L2 single-direction derivation),
* #3 (Compactor runs before vector embed),
* #4 (vector store via ``VectorStoreConnector(Adaptor)`` abstraction, no Qdrant client import),
* #5 (3-field payload + ``indexer`` filter),
* #6 (uuid5 deterministic id),
* #7 (snapshot-diff via lineage name set, not ANN list-all),
* #11 (D-3 detector writes-only, never auto-merges),
* #12 (no LightRAG strings introduced).

4-pattern pre-check matrix:

* P1 v1: ``GraphModalityWorker.sync`` callers. Orchestrator wiring lives in ``aperag/indexing/`` only; the new optional kwargs default to ``None``, so the call shape is backward-compatible.
* P1 v2: caller return-shape expectations. ``sync`` still returns ``None``; nothing is observable beyond the Phase 3 side effects on the vector store + suggestion table.
* P2 (state binding): ``GraphIndexCompactor`` is W7-2 (merged ``c1c48429``), ``MergeCandidateDetector`` is W7-4 (merged ``0dbf9fd1``), ``VectorStoreConnector`` is the existing Wave 4 abstraction. All three are already in main.
* P3 (Protocol method state): no new ``LineageGraphStore`` Protocol methods needed. Phase 3 reads via the existing ``get_entity`` / ``get_relation`` and writes via the ``compacted_description`` kwarg shipped in W7-1.

Closes Wave 7 task #3.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
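
The backward-compatible call shape and the log-and-swallow behaviour described in this message can be sketched as below. This is a hypothetical outline, not the worker's implementation: ``run_phase3``, the step tuples and the returned list are invented for illustration (the real ``sync`` returns ``None``), and the sketch assumes later steps still run after an earlier one fails, which the message does not spell out.

```python
import asyncio
import logging
from typing import Awaitable, Callable, List, Optional, Tuple

logger = logging.getLogger("graph_phase3_sketch")

# A step is a (name, zero-argument coroutine function) pair. Hypothetical shape.
Step = Tuple[str, Callable[[], Awaitable[None]]]


async def run_phase3(
    steps: List[Step],
    embedder: Optional[object] = None,
    vector_connector: Optional[object] = None,
) -> List[str]:
    """Run steps in order and return the names of the ones that completed."""
    # Early return: callers that wire neither dependency keep lineage-only behaviour.
    if embedder is None or vector_connector is None:
        return []
    completed: List[str] = []
    for name, step in steps:
        try:
            await step()
        except Exception:
            # Log and swallow so a partial Phase 3 never fails the caller.
            logger.exception("Phase 3 step %r failed; continuing", name)
        else:
            completed.append(name)
    return completed


async def ok() -> None:
    return None


async def boom() -> None:
    raise RuntimeError("vector backend hiccup")


steps = [("compact", ok), ("upsert", boom), ("detect", ok)]
skipped = asyncio.run(run_phase3(steps))
partial = asyncio.run(run_phase3(steps, embedder=object(), vector_connector=object()))
```

With no dependencies wired, ``skipped`` is empty; with both wired, ``partial`` lists only the steps that did not raise.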
1 parent 3bf3116 commit 37bdb73

3 files changed

Lines changed: 1174 additions & 1 deletion
