feat(graph Wave 7 W7-1): compacted_description schema + delete methods #1754
Adds the ``compacted_description`` derived-cache column to ``EntityWithLineage`` / ``RelationWithLineage`` and the matching ``LineageGraphStore`` Protocol kwarg on ``upsert_*_with_lineage``, plus ``delete_entity`` / ``delete_relation`` Protocol methods used by the curation merge flow (Wave 7 W7-6).

The column is the storage-side container for the ``GraphIndexCompactor`` (Wave 7 W7-2) LLM-summarised unified description, fed to the vector embedding step (W7-3) so embedded text stays bounded under ``max_description_chars`` (default 8000, application-layer enforced; no DB CHECK per spec §K.12.5).

Implementation contract: ``compacted_description=None`` (default) preserves the existing column value via ``COALESCE`` so the Wave 4 indexer hot path (per-chunk upserts that don't compute compacted summaries) does NOT clear a Compactor-written value. A non-None string overwrites. Postgres uses ``COALESCE(EXCLUDED, table)`` in the ON CONFLICT SET, Neo4j uses ``COALESCE($param, n.prop)``, Nebula implements the same semantics in Python on the read-modify-write loop.

``EntityRecord`` / ``RelationRecord`` (kg.jsonl raw extraction DTOs) are deliberately untouched so the kg.jsonl serialisation contract is not polluted by derived data.

Cross-backend coverage:

* ``aperag_lineage_entity`` / ``aperag_lineage_relation`` Postgres tables: alembic ``a3b7c4d8e2f1`` adds nullable ``Text`` columns.
* Neo4j ``aperag_LineageEntity`` / ``aperag_LineageRelation`` nodes: property added on MERGE ON CREATE init + COALESCE-style preserve.
* Nebula tags: ``CREATE TAG`` includes the new property; idempotent ``ALTER TAG ADD`` runs for pre-Wave-7 schemas (swallows duplicate property errors).

Tests:

* ``tests/unit_test/indexing/test_t1_2_graph.py`` adds 10 InMemory reference tests (round-trip, COALESCE-preserve, lineage-clobber safety gate per huangheng msg=828c83cc, delete idempotency).
* ``tests/integration/compat/test_lineage_graph_compat.py`` adds 11 cross-backend tests including the 100k-char roundtrip safety gate (huangheng msg=4d93a6c5 gate #1) catching driver-side buffer truncation across asyncpg / neo4j-python-driver / nebula-python.

§K.12 invariant cross-check: this PR lands the schema half of invariant #2 (L1 → L2 single-direction derivation) and #10 (DB column length is application-layer cap, not schema CHECK). #1 (L1 graph data not polluted) is honoured by leaving ``EntityRecord`` / ``RelationRecord`` / kg.jsonl untouched.

4-pattern pre-check matrix:

* P1 v1: ``from aperag.indexing.graph import EntityRecord`` → 1 caller (test_graph_extractor.py); none write ``compacted_description``.
* P1 v2: every existing caller of ``EntityWithLineage`` / ``RelationWithLineage`` only reads the lineage fields; the new field is opt-in via dataclass default.
* P2 (state binding): legacy ``_compact_oversized_descriptions`` lives at ``aperag/domains/knowledge_graph/graphindex/service.py`` and is migrated by Wave 7 W7-2 (separate task); W7-1 only provides the storage container.
* P3 (Protocol method state): ``delete_entity`` / ``delete_relation`` added across Protocol + 3 backends + InMemory reference; existing ``gc_*_if_orphan`` is preserved as the conditional variant.

Closes Wave 7 task #1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
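The idempotent ``ALTER TAG ADD`` step described for Nebula follows a common pattern: attempt the schema change, swallow only the "already exists" error, and re-raise anything else. The sketch below illustrates that shape with Python's built-in ``sqlite3`` as a stand-in. It is not the Nebula code from this PR: the helper name ``add_column_if_missing``, the table name, and the error-string match are all assumptions made for illustration.

```python
import sqlite3


def add_column_if_missing(conn, table, column, decl):
    """Try to add a column; return True if added, False if it already existed.

    Hypothetical helper: only the duplicate-column error is swallowed,
    every other failure is re-raised so real problems stay visible.
    """
    try:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
        return True
    except sqlite3.OperationalError as exc:
        if "duplicate column name" in str(exc).lower():
            return False
        raise


conn = sqlite3.connect(":memory:")
# Simulates a pre-Wave-7 schema that lacks the new column.
conn.execute("CREATE TABLE lineage_entity (name TEXT PRIMARY KEY, description TEXT)")

first = add_column_if_missing(conn, "lineage_entity", "compacted_description", "TEXT")
second = add_column_if_missing(conn, "lineage_entity", "compacted_description", "TEXT")

# PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
columns = [row[1] for row in conn.execute("PRAGMA table_info(lineage_entity)")]
```

Running the helper twice leaves the schema in the same state, which is what lets such a migration run on every startup without operator intervention.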
earayu
left a comment
🟢 LGTM ✅ (huangheng pass-1, per spec §K.12.11). GitHub does not allow approving from the same account; verdict = ready to merge
The PR body pastes the full 12-invariant + 4-pattern + 4-guardrail set (the third PR to meet the hard-gate format). The Option B compacted_description kwarg is consistent across the 3 backends, the 21 tests cover my 4 safety gates (100k roundtrip / lineage preserve / COALESCE preserve / overwrite semantics), and naming is grep-zero for LightRAG (the diff introduces no new occurrences). The critical path is unblocked; #3 / #6 can start.
12-invariant cross-check (task #1 scope)
| # | Invariant | This PR | Evidence |
|---|---|---|---|
| 1 | L1 graph data not polluted (kg.jsonl raw extraction vs storage view strictly layered) | ✅ | EntityRecord / RelationRecord / to_dict / kg.jsonl schema all untouched; compacted_description exists only on EntityWithLineage / RelationWithLineage (line 288 / 298); Option B is implemented strictly |
| 2 | L1 → L2 single-direction derivation | ✅ | compacted_description is computed by the W7-2 Compactor and written in W7-3; this PR is the schema container only; readers fall back to description_parts.text when it is NULL (spec §K.12.4) |
| 3 | Compactor runs at the end of sync, before vector embed | n/a | scope-correct (W7-3) |
| 4 | Vector store via VectorStoreConnectorAdaptor | n/a | scope-correct (W7-3) |
| 5 | payload indexer="graph_entity"/"graph_relation" filter | n/a | scope-correct (W7-3) |
| 6 | uuid5 deterministic vector point id | n/a | scope-correct (W7-3) |
| 7 | snapshot-diff via lineage entity name set | n/a | scope-correct (W7-3) |
| 8 | alias_map persistence independent of doc lifecycle | n/a | scope-correct (W7-6) |
| 9 | upsert_entity_with_lineage transparent alias redirect | n/a | scope-correct (W7-6) |
| 10 | DB column length is an application-layer cap, not a schema CHECK | ✅ | Postgres Text / Neo4j STRING / Nebula string are all unbounded; the alembic migration has 0 CHECK constraints; the 100k roundtrip test pins this down across 3 backends |
| 11 | Candidate detection only writes, never auto-merges (D-3) | n/a | scope-correct (W7-4) |
| 12 | Naming grep-zero LightRAG | ✅ | gh pr diff 1754 \| grep -E '^\+' \| grep -i lightrag → 0 added lines; pre-existing Wave 6 comments (e.g. # -- LightRAG-style query layer (Wave 6 #33 chunk 2)) will be cleaned up together at the item #10 close-out and are out of scope for this PR |
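To make invariants #2 and #10 concrete, here is a minimal sketch of an application-layer cap combined with the reader fallback. It assumes truncate-not-fail semantics and the 8000-character default quoted in the PR; the function names ``cap_description`` and ``text_for_embedding``, the newline join of the parts, and capping the fallback text as well are illustrative assumptions, not the Compactor's actual code.

```python
from typing import Optional, Sequence

# Default taken from the PR text (max_description_chars); enforced in code, not by a DB CHECK.
MAX_DESCRIPTION_CHARS = 8000


def cap_description(text: str, max_chars: int = MAX_DESCRIPTION_CHARS) -> str:
    """Truncate instead of failing when the text exceeds the cap."""
    return text if len(text) <= max_chars else text[:max_chars]


def text_for_embedding(
    compacted_description: Optional[str],
    description_parts: Sequence[str],
    max_chars: int = MAX_DESCRIPTION_CHARS,
) -> str:
    """Prefer the derived cache; fall back to the raw parts when it is NULL/None."""
    if compacted_description is not None:
        source = compacted_description
    else:
        # Assumption: parts are joined with newlines for the fallback path.
        source = "\n".join(description_parts)
    return cap_description(source, max_chars)


oversized = "x" * 100_000
capped = cap_description(oversized)
fallback = text_for_embedding(None, ["part one", "part two"])
cached = text_for_embedding("compact summary", ["part one", "part two"])
```

Because the columns themselves are unbounded, a 100k-character value still round-trips through storage; only the application decides what gets embedded.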
Option B implementation verification (cross-backend COALESCE preserve semantics)
| Backend | Implementation | Semantics | Verification |
|---|---|---|---|
| Postgres | ON CONFLICT DO UPDATE SET compacted_description = COALESCE(EXCLUDED.compacted_description, {ENTITY_TABLE}.compacted_description) (line 473-476) | None=preserve / non-None=overwrite | test_compacted_description_preserved_on_subsequent_none_upsert + test_compacted_description_overwritten_on_subsequent_non_none_upsert |
| Neo4j | n.compacted_description = COALESCE($compacted_description, n.compacted_description) (line 425-427) | Same as above | Same tests, run cross-backend |
| Nebula | Read-modify-write under EntityLock.acquire(name): final_compacted = compacted_description if compacted_description is not None else existing_compacted (line 837 / 873) | Same as above | Explicit preserve at the Python layer; race-safe under entity lock |
The semantics are consistent across the 3 backends ✅. The key Option B risk (the per-chunk hot path clobbering the Compactor cache) is pinned down by test_compacted_description_preserved_on_subsequent_none_upsert.
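The preserve-on-None behaviour in the table can be reproduced in a few lines. The sketch below uses Python's built-in ``sqlite3`` as a stand-in for Postgres (SQLite 3.24+ supports the same ``ON CONFLICT ... DO UPDATE`` form with an ``excluded`` pseudo-table). The table layout and the helper names ``upsert`` / ``read_compacted`` are simplified assumptions, not the project's schema or API.

```python
import sqlite3

# Same shape as the Postgres statement quoted above: a NULL incoming value
# keeps whatever is already stored; a non-NULL value replaces it.
UPSERT_SQL = """
INSERT INTO lineage_entity (name, description, compacted_description)
VALUES (?, ?, ?)
ON CONFLICT(name) DO UPDATE SET
    description = excluded.description,
    compacted_description = COALESCE(
        excluded.compacted_description,
        lineage_entity.compacted_description
    )
"""


def upsert(conn, name, description, compacted_description=None):
    """Hypothetical stand-in for upsert_entity_with_lineage."""
    conn.execute(UPSERT_SQL, (name, description, compacted_description))


def read_compacted(conn, name):
    row = conn.execute(
        "SELECT compacted_description FROM lineage_entity WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lineage_entity ("
    "name TEXT PRIMARY KEY, description TEXT, compacted_description TEXT)"
)

upsert(conn, "alice", "chunk 1")                # hot path, no summary yet
upsert(conn, "alice", "chunk 1", "summary v1")  # Compactor-style write
upsert(conn, "alice", "chunk 2")                # hot path again, kwarg is None
after_hot_path = read_compacted(conn, "alice")
upsert(conn, "alice", "chunk 2", "summary v2")  # non-None overwrites
after_overwrite = read_compacted(conn, "alice")
```

The Nebula row expresses the same rule in Python rather than SQL: read the existing value under the entity lock and keep it when the kwarg is None.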
4 safety gate verification (per huangheng msg=4d93a6c5 + msg=828c83cc)
| Gate | Test | Status |
|---|---|---|
| 100k roundtrip schema-side margin | test_compacted_description_supports_100k_chars_roundtrip (line 704): writes 100k chars, reads them back, asserts length and content equality, across 3 backends | ✅ |
| Lineage preservation under Compactor write (msg=828c83cc) | test_compacted_description_does_not_clobber_lineage_on_compactor_write (line 730): seed multi-doc, multi-version lineage → Compactor write → assert ALL lineage fields intact + compacted_description set | ✅ exactly the test I asked for |
| Alembic head chain | down_revision = "f8a4c5b9d3e1" (current head per aperag/migration/versions/20260427040000-f8a4c5b9d3e1_*) → chain is correct | ✅ |
| ORM 100% mirror cross-backend | 11 cross-backend compat tests + 10 InMemory tests = 21 total covers default None / kwarg roundtrip / COALESCE preserve / overwrite / 100k / lineage-preserve / relation symmetric / delete-entity-unconditional / delete-entity-idempotent / delete-relation-unconditional / delete-relation-idempotent | ✅ |
4-pattern pre-check matrix (PR body paste)
The P1 v1+v2 / P2 / P3 outputs are all pasted. Pattern 1 v1 caller-grep result: from aperag.indexing.graph import EntityRecord has only 1 site (used by a test), and no caller writes compacted_description to EntityRecord, so the Option B boundary is strict.
simple-stable directive 4 guardrail
All 4 items are explicitly marked ✅ in the PR body: schema-only PR / minimal Protocol surface / Option B is simple / alembic + ALTER TAG auto-migration is maintenance-free.
delete_entity / delete_relation verification
The new Protocol methods (line 511 / 526) are implemented in all 3 backends plus the InMemory reference. The semantics are clear: unconditional remove, as distinct from gc_*_if_orphan (conditional on empty lineage). They serve the source cleanup path of task #6 GraphCurationService.merge_entities. The return is idempotent (line 522-523 docstring, pinned by 4 unit tests).
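The difference between the unconditional delete and the conditional GC can be sketched with a toy in-memory store. Everything below is an illustrative assumption: the class ``TinyStore``, the synchronous methods, and the boolean return values are not taken from the PR, whose real Protocol methods may be async and return something else.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Entity:
    name: str
    lineage_docs: Set[str] = field(default_factory=set)
    compacted_description: Optional[str] = None


class TinyStore:
    """Toy stand-in for an in-memory lineage graph store (hypothetical)."""

    def __init__(self) -> None:
        self._entities: Dict[str, Entity] = {}

    def put(self, name: str, docs: Set[str]) -> None:
        self._entities[name] = Entity(name=name, lineage_docs=set(docs))

    def has(self, name: str) -> bool:
        return name in self._entities

    def gc_entity_if_orphan(self, name: str) -> bool:
        """Conditional: remove only when no document lineage remains."""
        entity = self._entities.get(name)
        if entity is None or entity.lineage_docs:
            return False
        del self._entities[name]
        return True

    def delete_entity(self, name: str) -> bool:
        """Unconditional and idempotent: a missing entity is not an error."""
        return self._entities.pop(name, None) is not None


store = TinyStore()
store.put("merged-source", {"doc-1", "doc-2"})

gc_result = store.gc_entity_if_orphan("merged-source")  # lineage present, so kept
still_there = store.has("merged-source")
first_delete = store.delete_entity("merged-source")     # removed regardless of lineage
second_delete = store.delete_entity("merged-source")    # repeat call is a no-op
```

A merge flow needs the unconditional variant because the merged-source entity still carries lineage at the moment it is dropped, so the orphan check would refuse to remove it.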
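Naming grep-zero (Wave 7 close-out reminder)
The PR diff introduces no new LightRAG strings. However, Wave 6 era comments remain in the repository: 8 occurrences such as # -- LightRAG-style query layer (Wave 6 #33 chunk 2) in graph.py / postgres.py / neo4j.py / nebula.py. These are out of scope for task #1, but the Wave 7 close-out (item #10 grep-zero verify section) must sweep them. @bryce / @符炫炜, please remember to add a cleanup commit to task #10.
Non-blocking; this PR can merge.
Checklist to fix before LGTM
In fact it can already merge ✅. The "Wave 6 era LightRAG comments" item above is a reminder for task #10, not a blocker for this PR.
@bryce very solid work: the 4 safety gates are mirrored exactly, the 3 backends are symmetric, and both the COALESCE preserve invariant and the lineage-clobber test hit the mark. This is the third PR to meet the Wave 7 hard-gate format (mirroring the PR #1751 / #1752 standard) 👍.
@符炫炜 LGTM, ready to merge. After the merge, Bryce's task #3 and 明书's task #6 storage work are unblocked.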
Summary
Wave 7 task #1: adds the compacted_description derived-cache column to LineageGraphStore Protocol DTOs and the matching kwarg on upsert_*_with_lineage, plus delete_entity / delete_relation Protocol methods used by the curation merge flow (Wave 7 W7-6).

This PR delivers ONLY the schema container; the LLM-summary computation that populates it lives in Wave 7 W7-2 (GraphIndexCompactor, parallel PR by 明书).

Design (per spec §K.12.4)

* aperag_lineage_entity / aperag_lineage_relation get a nullable Text column.
* EntityWithLineage / RelationWithLineage (storage-view dataclasses) get compacted_description: str | None = None.
* upsert_*_with_lineage accepts a new compacted_description kwarg (Option B per architect msg=6322608e / msg=6926f1ff lock): None (default) preserves any existing column value via COALESCE; the Wave 4 indexer hot path (per-chunk upserts that don't compute summaries) MUST NOT clobber a Compactor-written value.
* EntityRecord / RelationRecord (kg.jsonl raw extraction DTOs) intentionally untouched: kg.jsonl is the LLM extraction contract; derived data does not pollute it (per huangheng msg=4d93a6c5 + architect own-up msg=6322608e: Protocol DTO naming spec must grep existing DTOs first; sediments to feedback_architect_design_impl_gap_monitor.md mini-pattern 4).

Cross-backend implementation

| Backend | Schema | Upsert |
|---|---|---|
| Postgres | Text NULL column on both lineage tables; alembic a3b7c4d8e2f1 adds them | COALESCE(EXCLUDED.compacted_description, table.compacted_description) in ON CONFLICT DO UPDATE SET |
| Neo4j | aperag_LineageEntity / aperag_LineageRelation, init NULL on MERGE ON CREATE | COALESCE($compacted_description, n.compacted_description) in trailing SET |
| Nebula | string NULL tag prop; idempotent ALTER TAG ADD runs for pre-Wave-7 schemas (swallows duplicate-property errors); fresh deploys get the column from CREATE TAG | EntityLock.acquire(name): read existing compacted_description, fall back to it when kwarg is None |

delete_entity(name) / delete_relation(source, target, type) are unconditional removes (the conditional gc_*_if_orphan family stays intact), used by GraphCurationService.merge_entities (W7-6) to drop merged-source rows after their parts have been UNION'd into the canonical entity.

DB column length: Text / STRING / string are all unbounded. Per spec §K.12.5 the cap (per-part 5K / per-entity 100 parts / compacted_description ≤ 8K) is application-layer enforced by the Compactor + graph_extractor, not a schema CHECK constraint. Schema CHECKs would have to be replicated cross-backend; truncate-not-fail in the Compactor is the canonical enforcement.

§K.12 invariant cross-check (per architect msg=fcf580a6 directive)

* EntityRecord / RelationRecord / kg.jsonl untouched
* VectorStoreConnectorAdaptor abstraction
* indexer="graph_entity"/"graph_relation" payload filter
* upsert_entity_with_lineage transparent alias redirect

4-pattern pre-check matrix (per feedback_architect_design_impl_gap_monitor.md)

* P1 v1: from aperag.indexing.graph import EntityRecord matches 1 call site (tests/integration/test_graph_extractor.py:30); none write compacted_description.
* P1 v2: EntityWithLineage / RelationWithLineage are referenced in 10 files; all consumers only read the existing lineage fields. The new field is dataclass-default opt-in so no consumer breaks.
* P2: aperag/domains/knowledge_graph/graphindex/service.py:158 (_compact_oversized_descriptions); migration to the new tree is W7-2 (separate task by 明书). This PR provides only the storage container.
* P3: delete_entity / delete_relation Protocol methods added across LineageGraphStore Protocol + 3 production backends + InMemoryLineageGraphStore reference. Existing gc_*_if_orphan is preserved as the conditional variant.

simple-stable directive 4 guardrails

* … NULL; Nebula idempotent ALTER TAG handles pre-Wave-7 schemas without operator intervention

Test plan

* tests/unit_test/indexing/test_t1_2_graph.py: 10 new InMemory tests: None, round-trip, COALESCE-preserve on None kwarg, overwrite on non-None
* tests/integration/compat/test_lineage_graph_compat.py: 11 new cross-backend tests: None per backend
* … uv run pytest tests/unit_test/)
* ruff format --check ./aperag clean (only pre-existing unrelated drift in mineru_parser.py)
* ruff check ./aperag clean
* alembic heads resolves to single head a3b7c4d8e2f1
* alembic upgrade head --sql generates expected ALTER TABLE ... ADD COLUMN compacted_description TEXT for both lineage tables

Spec / decision references

* delete_entity / delete_relation addition: PM msg=d3f4e6f8

🤖 Generated with Claude Code