feat(graph Wave 7 W7-1): compacted_description schema + delete methods #1754

Merged
earayu merged 1 commit into main from bryce/wave7-task1-compacted-description on Apr 27, 2026

@earayu commented Apr 27, 2026

Summary

Wave 7 task #1 — adds the compacted_description derived-cache column to LineageGraphStore Protocol DTOs and the matching kwarg on upsert_*_with_lineage, plus delete_entity / delete_relation Protocol methods used by the curation merge flow (Wave 7 W7-6).

This PR delivers ONLY the schema container; the LLM-summary computation that populates it lives in Wave 7 W7-2 (GraphIndexCompactor, parallel PR by 明书).

Design (per spec §K.12.4)

  • aperag_lineage_entity / aperag_lineage_relation get a nullable Text column.
  • EntityWithLineage / RelationWithLineage (storage-view dataclasses) get compacted_description: str | None = None.
  • upsert_*_with_lineage accepts a new compacted_description kwarg (Option B per architect msg=6322608e / msg=6926f1ff lock):
    • None (default) preserves any existing column value via COALESCE — the Wave 4 indexer hot path (per-chunk upserts that don't compute summaries) MUST NOT clobber a Compactor-written value.
    • non-None string overwrites.
  • EntityRecord / RelationRecord (kg.jsonl raw extraction DTOs) intentionally untouched: kg.jsonl is the LLM extraction contract; derived data does not pollute it (per huangheng msg=4d93a6c5 + architect own-up msg=6322608e — Protocol DTO naming spec must grep existing DTOs first; sediments to feedback_architect_design_impl_gap_monitor.md mini-pattern 4).
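The Option B contract above can be illustrated with a minimal in-memory sketch. This is not the project's `InMemoryLineageGraphStore`: the class name `SketchLineageStore`, the `get` helper, and the reduced field set are assumptions made for illustration; only the `compacted_description: str | None = None` field and the None-preserves / non-None-overwrites rule come from the design notes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EntityWithLineage:
    # Simplified stand-in: the real storage-view dataclass carries more lineage fields.
    name: str
    description_parts: list[str] = field(default_factory=list)
    compacted_description: str | None = None  # derived cache, NULL until compacted


class SketchLineageStore:
    """Hypothetical in-memory store used only to show the kwarg semantics."""

    def __init__(self) -> None:
        self._entities: dict[str, EntityWithLineage] = {}

    def upsert_entity_with_lineage(
        self, name: str, part: str, *, compacted_description: str | None = None
    ) -> EntityWithLineage:
        row = self._entities.setdefault(name, EntityWithLineage(name=name))
        row.description_parts.append(part)
        # None means "no opinion": keep whatever a previous writer stored.
        if compacted_description is not None:
            row.compacted_description = compacted_description
        return row

    def get(self, name: str) -> EntityWithLineage | None:
        return self._entities.get(name)
```

In this sketch a per-chunk upsert that omits the kwarg leaves a previously written summary in place, which is the behaviour the indexer hot path relies on.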

Cross-backend implementation

| Backend | Storage | Preserve semantics |
| --- | --- | --- |
| Postgres | Text NULL column on both lineage tables; alembic a3b7c4d8e2f1 adds them | COALESCE(EXCLUDED.compacted_description, table.compacted_description) in ON CONFLICT DO UPDATE SET |
| Neo4j | Node property on aperag_LineageEntity / aperag_LineageRelation, init NULL on MERGE ON CREATE | COALESCE($compacted_description, n.compacted_description) in trailing SET |
| Nebula | string NULL tag prop; idempotent ALTER TAG ADD runs for pre-Wave-7 schemas (swallows duplicate-property errors); fresh deploys get the column from CREATE TAG | In-Python read-modify-write under EntityLock.acquire(name): read existing compacted_description, fall back to it when kwarg is None |
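To make the Postgres row concrete, here is a small runnable sketch of the `COALESCE`-in-`ON CONFLICT` pattern. It uses Python's built-in `sqlite3` as a stand-in for Postgres (SQLite 3.24+ supports the same upsert shape with the `excluded` pseudo-table); the table name `lineage_entity` and the helpers `upsert` / `read` are illustrative assumptions, not the PR's actual SQL or schema.

```python
import sqlite3

# In-memory SQLite database standing in for the Postgres lineage table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lineage_entity ("
    " name TEXT PRIMARY KEY,"
    " compacted_description TEXT NULL)"
)

UPSERT_SQL = """
INSERT INTO lineage_entity (name, compacted_description)
VALUES (?, ?)
ON CONFLICT (name) DO UPDATE SET
    compacted_description = COALESCE(
        excluded.compacted_description,        -- incoming value, if not NULL
        lineage_entity.compacted_description   -- otherwise keep the stored one
    )
"""


def upsert(name, compacted_description=None):
    # Passing None binds SQL NULL, which COALESCE skips in favour of the stored value.
    conn.execute(UPSERT_SQL, (name, compacted_description))


def read(name):
    row = conn.execute(
        "SELECT compacted_description FROM lineage_entity WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None
```

One consequence of this design, visible in the sketch, is that an upsert cannot reset the column back to NULL: clearing a cached summary would need a separate path.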

delete_entity(name) / delete_relation(source, target, type) are unconditional removes (the conditional gc_*_if_orphan family stays intact), used by GraphCurationService.merge_entities (W7-6) to drop merged-source rows after their parts have been UNION'd into the canonical entity.
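The difference between the unconditional delete and the conditional gc variant can be sketched as follows. The store class, the lineage representation (a set of document ids per entity), the concrete method name `gc_entity_if_orphan`, and the boolean return values are assumptions for illustration; the PR text only states that deletes are unconditional and idempotent while `gc_*_if_orphan` is conditional on empty lineage.

```python
class SketchStore:
    """Hypothetical store: entity name -> set of document ids that mention it."""

    def __init__(self):
        self._lineage = {}

    def add(self, name, doc_id=None):
        docs = self._lineage.setdefault(name, set())
        if doc_id is not None:
            docs.add(doc_id)

    def gc_entity_if_orphan(self, name):
        # Conditional: only removes the row when no lineage is left.
        docs = self._lineage.get(name)
        if docs is None or docs:
            return False
        del self._lineage[name]
        return True

    def delete_entity(self, name):
        # Unconditional: removes the row even if lineage remains.
        # Idempotent: deleting a missing row is not an error, it returns False.
        return self._lineage.pop(name, None) is not None

    def has(self, name):
        return name in self._lineage
```

In a merge flow of the kind described above, the caller would first copy the source entity's parts into the canonical entity and only then call `delete_entity` on the source.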

DB column length: Text / STRING / string are all unbounded — per spec §K.12.5 the cap (per-part 5K / per-entity 100 parts / compacted_description ≤ 8K) is application-layer enforced by the Compactor + graph_extractor, not a schema CHECK constraint. Schema CHECKs would have to be replicated cross-backend; truncate-not-fail in the Compactor is the canonical enforcement.
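A truncate-not-fail cap at the application layer might look like the sketch below. The constant and function names are hypothetical, "5K" and "8K" are read here as 5,000 and 8,000 characters, and keeping the first 100 parts (rather than, say, the most recent) is an assumption; the actual Compactor and graph_extractor logic lives in other tasks.

```python
# Assumed numeric readings of the caps quoted from spec §K.12.5.
MAX_PART_CHARS = 5_000
MAX_PARTS_PER_ENTITY = 100
MAX_COMPACTED_CHARS = 8_000


def cap_parts(parts):
    """Truncate instead of raising: keep at most 100 parts of at most 5,000 chars."""
    return [part[:MAX_PART_CHARS] for part in parts[:MAX_PARTS_PER_ENTITY]]


def cap_compacted(text):
    """Truncate the summary to the cap; None (not yet compacted) passes through."""
    if text is None:
        return None
    return text[:MAX_COMPACTED_CHARS]
```

Because slicing never raises on short input, oversize data degrades to a shorter value rather than failing the write, which is why no backend needs a matching CHECK constraint.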

§K.12 invariant cross-check (per architect msg=fcf580a6 directive)

| # | Invariant | This PR |
| --- | --- | --- |
| 1 | L1 graph data not polluted (kg.jsonl raw vs storage view layered) | EntityRecord / RelationRecord / kg.jsonl untouched |
| 2 | L1 → L2 single-direction derivation (compacted is derived from description_parts) | ✅ schema container ready; Compactor produces value in W7-2 |
| 3 | Compactor at sync-end, before vector embed | N/A (W7-3 wires this) |
| 4 | Vector store via VectorStoreConnectorAdaptor abstraction | N/A (W7-3) |
| 5 | indexer="graph_entity" / "graph_relation" payload filter | N/A (W7-3) |
| 6 | uuid5 deterministic vector point id | N/A (W7-3) |
| 7 | snapshot-diff via lineage entity name set | N/A (W7-3) |
| 8 | alias_map orphan-persists across canonical gc | N/A (W7-6) |
| 9 | upsert_entity_with_lineage transparent alias redirect | N/A (W7-6) |
| 10 | DB column length is application-layer cap, not schema CHECK | ✅ TEXT / STRING / string unbounded; cap lives in Compactor (W7-2) |
| 11 | Candidate detection writes-only, no auto-merge (D-3) | N/A (W7-4) |
| 12 | grep-zero LightRAG naming | ✅ no LightRAG strings introduced |

4-pattern pre-check matrix (per feedback_architect_design_impl_gap_monitor.md)

  • P1 v1 (import count): from aperag.indexing.graph import EntityRecord matches 1 call site (tests/integration/test_graph_extractor.py:30); none write compacted_description.
  • P1 v2 (caller-method matrix): EntityWithLineage / RelationWithLineage are referenced in 10 files — all consumers only read the existing lineage fields. The new field is dataclass-default opt-in so no consumer breaks.
  • P2 (state binding): legacy compactor lives at aperag/domains/knowledge_graph/graphindex/service.py:158 (_compact_oversized_descriptions) — migration to the new tree is W7-2 (separate task by 明书). This PR provides only the storage container.
  • P3 (Protocol method state): new delete_entity / delete_relation Protocol methods added across LineageGraphStore Protocol + 3 production backends + InMemoryLineageGraphStore reference. Existing gc_*_if_orphan is preserved as the conditional variant.

simple-stable directive 4 guardrails

| Guardrail | Status |
| --- | --- |
| #1 don't expand scope | ✅ schema-only PR; Compactor logic is W7-2 |
| #2 ship fast | ✅ minimal Protocol surface change (1 kwarg + 2 delete methods); 11 cross-backend tests + 10 InMemory tests |
| #3 simple > complex | ✅ Option B kwarg over Option A new method (architect lock msg=6322608e); COALESCE preserve-semantics kept identical across 3 backends |
| #4 private-deploy maintenance-free | ✅ alembic auto-migrates Postgres; Neo4j ON CREATE inits NULL; Nebula idempotent ALTER TAG handles pre-Wave-7 schemas without operator intervention |

Test plan

  • tests/unit_test/indexing/test_t1_2_graph.py — 10 new InMemory tests:
    • default None, round-trip, COALESCE-preserve on None kwarg, overwrite on non-None
    • lineage-clobber safety gate (huangheng msg=828c83cc): Compactor write must not drop other lineage fields
    • relation symmetric coverage, delete entity/relation idempotency
  • tests/integration/compat/test_lineage_graph_compat.py — 11 new cross-backend tests:
    • default None per backend
    • explicit kwarg round-trip
    • COALESCE preserve invariant (the principal Option B risk)
    • non-None overwrite
    • 100k-char roundtrip safety gate (huangheng msg=4d93a6c5 gate #1): catches driver-side buffer truncation across asyncpg / neo4j-python-driver / nebula-python
    • lineage-preserve under Compactor write
    • relation symmetric coverage
    • delete unconditionally + idempotent return
  • All 1083 unit tests pass (uv run pytest tests/unit_test/)
  • ruff format --check ./aperag clean (only pre-existing unrelated drift in mineru_parser.py)
  • ruff check ./aperag clean
  • alembic heads resolves to single head a3b7c4d8e2f1
  • alembic upgrade head --sql generates expected ALTER TABLE ... ADD COLUMN compacted_description TEXT for both lineage tables
  • CI compat-graph stage (Postgres + Neo4j + Nebula real-backend matrix) — requires e2e infra; reviewer please confirm green before merge
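The 100k-char roundtrip gate listed above could be expressed as a backend-agnostic check along these lines. This is a sketch, not the test in `test_lineage_graph_compat.py`: the helper names and the `write` / `read` callables are assumptions standing in for a real store's upsert and read calls.

```python
import random
import string


def make_payload(n_chars, seed=7):
    # Non-repeating, seeded payload so truncation or corruption changes the value.
    rng = random.Random(seed)
    return "".join(rng.choices(string.ascii_letters + string.digits, k=n_chars))


def assert_roundtrip(write, read, n_chars=100_000):
    """write(name, text) stores the value; read(name) returns what the backend kept."""
    payload = make_payload(n_chars)
    write("entity-under-test", payload)
    stored = read("entity-under-test")
    assert stored is not None, "value was dropped"
    assert len(stored) == n_chars, f"length changed: {len(stored)} != {n_chars}"
    assert stored == payload, "content changed in transit"


# Usage against a plain dict, which trivially passes.
rows = {}
assert_roundtrip(rows.__setitem__, rows.__getitem__)
```

Checking both length and content separately gives a clearer failure message when a driver silently truncates at a buffer boundary.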

Spec / decision references

🤖 Generated with Claude Code

Adds the ``compacted_description`` derived-cache column to
``EntityWithLineage`` / ``RelationWithLineage`` and the matching
``LineageGraphStore`` Protocol kwarg on ``upsert_*_with_lineage``,
plus ``delete_entity`` / ``delete_relation`` Protocol methods used
by the curation merge flow (Wave 7 W7-6).

The column is the storage-side container for the
``GraphIndexCompactor`` (Wave 7 W7-2) LLM-summarised unified
description, fed to the vector embedding step (W7-3) so embedded
text stays bounded under ``max_description_chars`` (default 8000,
application-layer enforced — no DB CHECK per spec §K.12.5).

Implementation contract: ``compacted_description=None`` (default)
preserves the existing column value via ``COALESCE`` so the Wave 4
indexer hot path (per-chunk upserts that don't compute compacted
summaries) does NOT clear a Compactor-written value. A non-None
string overwrites. Postgres uses ``COALESCE(EXCLUDED, table)`` in
the ON CONFLICT SET, Neo4j uses ``COALESCE($param, n.prop)``, Nebula
implements the same semantics in Python on the read-modify-write
loop. ``EntityRecord`` / ``RelationRecord`` (kg.jsonl raw extraction
DTOs) are deliberately untouched so the kg.jsonl serialisation
contract is not polluted by derived data.

Cross-backend coverage:
* ``aperag_lineage_entity`` / ``aperag_lineage_relation`` Postgres
  tables: alembic ``a3b7c4d8e2f1`` adds nullable ``Text`` columns.
* Neo4j ``aperag_LineageEntity`` / ``aperag_LineageRelation`` nodes:
  property added on MERGE ON CREATE init + COALESCE-style preserve.
* Nebula tags: ``CREATE TAG`` includes the new property; idempotent
  ``ALTER TAG ADD`` runs for pre-Wave-7 schemas (swallows duplicate
  property errors).

Tests:
* ``tests/unit_test/indexing/test_t1_2_graph.py`` adds 10 InMemory
  reference tests (round-trip, COALESCE-preserve, lineage-clobber
  safety gate per huangheng msg=828c83cc, delete idempotency).
* ``tests/integration/compat/test_lineage_graph_compat.py`` adds 11
  cross-backend tests including the 100k-char roundtrip safety gate
  (huangheng msg=4d93a6c5 gate #1) catching driver-side buffer
  truncation across asyncpg / neo4j-python-driver / nebula-python.

§K.12 invariant cross-check: this PR lands the schema half of
invariant #2 (L1 → L2 single-direction derivation) and #10 (DB
column length is application-layer cap, not schema CHECK). #1
(L1 graph data not polluted) is honoured by leaving ``EntityRecord``
/ ``RelationRecord`` / kg.jsonl untouched.

4-pattern pre-check matrix:
* P1 v1: ``from aperag.indexing.graph import EntityRecord`` →
  1 caller (test_graph_extractor.py); none write
  ``compacted_description``.
* P1 v2: every existing caller of ``EntityWithLineage`` /
  ``RelationWithLineage`` only reads the lineage fields; the new
  field is opt-in via dataclass default.
* P2 (state binding): legacy ``_compact_oversized_descriptions``
  lives at ``aperag/domains/knowledge_graph/graphindex/service.py``
  and is migrated by Wave 7 W7-2 (separate task); W7-1 only
  provides the storage container.
* P3 (Protocol method state): ``delete_entity`` /
  ``delete_relation`` added across Protocol + 3 backends + InMemory
  reference; existing ``gc_*_if_orphan`` is preserved as the
  conditional variant.

Closes Wave 7 task #1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Collaborator Author

@earayu left a comment

🟢 LGTM ✅ (huangheng pass-1, per spec §K.12.11). GitHub does not allow approving from the same account; verdict = ready to merge.

The PR body pastes the full 12-invariant, 4-pattern, and 4-guardrail checks (the third PR to meet the hard-gate format). The Option B compacted_description kwarg is consistent across all 3 backends, the 21 tests cover my 4 safety gates (100k roundtrip / lineage preserve / COALESCE preserve / overwrite semantics), and naming is grep-zero for LightRAG (the diff introduces no new occurrences). The critical path is unblocked; #3 / #6 can start.

12-invariant cross-check (task #1 scope)

| # | Invariant | This PR | Evidence |
| --- | --- | --- | --- |
| 1 | L1 graph data not polluted (kg.jsonl raw extraction vs storage view strictly layered) | | EntityRecord / RelationRecord / to_dict / kg.jsonl schema all untouched; compacted_description exists only on EntityWithLineage / RelationWithLineage (line 288 / 298); Option B is implemented strictly |
| 2 | L1 → L2 single-direction derivation | | compacted_description is computed by the W7-2 Compactor and written by W7-3; this PR is the schema container only; readers fall back to description_parts.text when it is NULL (spec §K.12.4) |
| 3 | Compactor at sync-end, before vector embed | n/a | scope-correct (W7-3) |
| 4 | Vector store via VectorStoreConnectorAdaptor | n/a | scope-correct (W7-3) |
| 5 | payload indexer="graph_entity"/"graph_relation" filter | n/a | scope-correct (W7-3) |
| 6 | uuid5 deterministic vector point id | n/a | scope-correct (W7-3) |
| 7 | snapshot-diff via lineage entity name set | n/a | scope-correct (W7-3) |
| 8 | alias_map persistence independent of doc lifecycle | n/a | scope-correct (W7-6) |
| 9 | upsert_entity_with_lineage transparent alias redirect | n/a | scope-correct (W7-6) |
| 10 | DB column length is application-layer cap, not schema CHECK | | Postgres Text / Neo4j STRING / Nebula string are all unbounded; alembic adds 0 CHECK constraints; the 100k roundtrip test pins this across 3 backends |
| 11 | Candidate detection writes-only, no auto-merge (D-3) | n/a | scope-correct (W7-4) |
| 12 | grep-zero LightRAG naming | | gh pr diff 1754 \| grep -E '^\+' \| grep -i lightrag → 0 added lines; pre-existing Wave 6 comments (e.g. # -- LightRAG-style query layer (Wave 6 #33 chunk 2)) will be cleaned up together at the item #10 close-out and are out of scope for this PR |

Option B implementation verification (cross-backend COALESCE preserve semantics)

| Backend | Implementation | Semantics | Verification |
| --- | --- | --- | --- |
| Postgres | ON CONFLICT DO UPDATE SET compacted_description = COALESCE(EXCLUDED.compacted_description, {ENTITY_TABLE}.compacted_description) (line 473-476) | None=preserve / non-None=overwrite | test_compacted_description_preserved_on_subsequent_none_upsert + test_compacted_description_overwritten_on_subsequent_non_none_upsert |
| Neo4j | n.compacted_description = COALESCE($compacted_description, n.compacted_description) (line 425-427) | Same as above | Same as above, cross-backend |
| Nebula | Read-modify-write under EntityLock.acquire(name): final_compacted = compacted_description if compacted_description is not None else existing_compacted (line 837 / 873) | Same as above | Explicit preserve in the Python layer; race-safe under entity lock |

Semantics are consistent across the 3 backends ✅. The key Option B risk (the per-chunk hot path clobbering the Compactor cache) is pinned down by test_compacted_description_preserved_on_subsequent_none_upsert.
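For the Nebula row, where the database offers no `COALESCE`-on-upsert, the preserve rule has to be applied in Python while holding a per-entity lock. A minimal asyncio sketch of that shape follows; `SketchEntityLock`, `SketchNebulaStore`, and the dict used as the backend are hypothetical stand-ins, and only the `final_compacted = ... if ... is not None else existing_compacted` expression mirrors what the review quotes.

```python
import asyncio
from collections import defaultdict


class SketchEntityLock:
    """Hypothetical per-name lock registry (not the project's EntityLock)."""

    def __init__(self):
        self._locks = defaultdict(asyncio.Lock)

    def acquire(self, name):
        # asyncio.Lock is an async context manager, so callers use `async with`.
        return self._locks[name]


class SketchNebulaStore:
    def __init__(self):
        self._rows = {}  # entity name -> compacted_description
        self._lock = SketchEntityLock()

    async def upsert(self, name, *, compacted_description=None):
        async with self._lock.acquire(name):
            existing_compacted = self._rows.get(name)  # read
            await asyncio.sleep(0)  # simulated backend round trip
            final_compacted = (  # modify
                compacted_description
                if compacted_description is not None
                else existing_compacted
            )
            self._rows[name] = final_compacted  # write
            return final_compacted


async def demo():
    store = SketchNebulaStore()
    await store.upsert("Ada", compacted_description="summary v1")
    # Concurrent per-chunk upserts without the kwarg must not clear the summary.
    await asyncio.gather(*(store.upsert("Ada") for _ in range(5)))
    return await store.upsert("Ada")
```

Holding the lock across the read and the write is what keeps a concurrent Compactor write from being lost between the two steps.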

4 safety gate verification (per huangheng msg=4d93a6c5 + msg=828c83cc)

| Gate | Test | Status |
| --- | --- | --- |
| 100k roundtrip schema-side margin | test_compacted_description_supports_100k_chars_roundtrip (line 704): writes 100k chars, reads them back, asserts length and content equality, across 3 backends | |
| Lineage preservation under Compactor write (msg=828c83cc) | test_compacted_description_does_not_clobber_lineage_on_compactor_write (line 730): seeds multi-doc, multi-version lineage → Compactor write → asserts ALL lineage fields intact + compacted_description set | ✅ exactly the test I asked for |
| Alembic head chain | down_revision = "f8a4c5b9d3e1" (current head per aperag/migration/versions/20260427040000-f8a4c5b9d3e1_*) | → chain is correct |
| ORM 100% mirror cross-backend | 11 cross-backend compat tests + 10 InMemory tests = 21 total, covering default None / kwarg roundtrip / COALESCE preserve / overwrite / 100k / lineage-preserve / relation symmetric / delete-entity-unconditional / delete-entity-idempotent / delete-relation-unconditional / delete-relation-idempotent | |

4-pattern pre-check matrix (pasted in PR body)

The P1 v1+v2 / P2 / P3 outputs are all pasted. The Pattern 1 v1 caller grep lands on a single site for from aperag.indexing.graph import EntityRecord (used by a test), and no caller writes compacted_description to EntityRecord, so the Option B boundary is strict.

simple-stable directive 4 guardrail

All 4 items are explicitly marked ✅ in the PR body: schema-only PR / minimal Protocol surface / Option B is simple / alembic + ALTER TAG auto-migration is maintenance-free.

delete_entity / delete_relation verification

The new Protocol methods (line 511 / 526) are fully implemented in the 3 backends and the InMemory reference. The semantics are clear: unconditional remove, as opposed to gc_*_if_orphan (conditional on empty lineage). They serve the source-cleanup path of task #6 GraphCurationService.merge_entities. The return is idempotent (pinned by the docstring at line 522-523 and 4 unit tests).

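Naming grep-zero (Wave 7 close-out reminder)

The PR diff introduces no new LightRAG strings. However, the repository still contains Wave 6 era comments in graph.py / postgres.py / neo4j.py / nebula.py, such as # -- LightRAG-style query layer (Wave 6 #33 chunk 2), 8 occurrences in total. These are out of scope for task #1, but the Wave 7 close-out (item #10, grep-zero verify section) must sweep them. @bryce / @符炫炜: please remember to add a cleanup commit in task #10.

Non-blocking; this PR can merge.

Checklist of fixes needed for LGTM

This can in fact already merge ✅. The "Wave 6 era LightRAG comments" item above is a reminder for task #10, not a blocker for this PR.

@bryce very solid work: the 4 safety gates are mirrored exactly, the 3 backends are symmetric, and both the COALESCE preserve invariant and the lineage-clobber test are hit. This is the third PR to meet the Wave 7 hard-gate format (mirroring the PR #1751 / #1752 standard) 👍.

@符炫炜 LGTM, OK to merge. After the merge, Bryce's task #3 and 明书's task #6 storage work are unblocked.

@不穷 please move task #1 → done after merge; task #3 / #6 can start.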

@earayu merged commit c149977 into main on Apr 27, 2026
15 checks passed
@earayu deleted the bryce/wave7-task1-compacted-description branch on April 27, 2026 18:05