Skip to content

Commit 0535bf4

Browse files
earayuclaude
andauthored
refactor(task-31-a3): description-free graph_curation 7 call sites (#1941)
* refactor(task-31-a3): description-free graph_curation 7 call sites Wave 5 description-NULL invariant (task #31 spec § 3.1.5): graph extractor stopped emitting `description` text post Wave 5 task #5 (facts/vectors split). The dedup detection / scoring / snapshot / accept-apply paths still read `entity.description` / `compacted_description` / `description_parts` and would either silently degrade scoring (always-empty bag-of-tokens) or leak stale fragments from pre-Wave-5 rows into reviewer-facing suggestions. Fix the 6 detector / snapshot call sites + 1 apply path enumerated in the spec, plus 1 service-layer helper surfaced by the boundary test grep gate: 1. candidate_generation.py:38 entity_snapshot — drop description 2. candidate_generation.py:179 _lexical_signals — drop description Jaccard token overlap 3. candidate_generation.py:196 _pair_score — drop description scoring weight (signal no longer emitted; branch is dead) 4. dto.py CurationEntity.from_lineage — set description="" instead of deriving from compacted / description_parts; keep field on the dataclass for back-compat with callers that still pass it 5. merge_candidate_detector._description_text_for_scoring → _embedding_query_text — embed `<name> (<entity_type>)` (mirror of how the graph_vectors worker writes the entity vector, Wave 5 task #5 / #7); the legacy method always short-circuited to "" post Wave 5 so detection produced zero candidates 6. merge_candidate_detector._to_legacy_entity — pass description="" instead of reading from entity 7. merge_candidate_detector._snapshot — drop description key from persisted entity_snapshots payload +1 lineage_merge.py — add merge_entities_apply_description_free variant for the async accept-apply worker (task #31 § 3.1.5). Skips LLM unified description / Compactor pass / __curation_merge__ sentinel description write / vector embed write per the spec «不调» list. Legacy merge_entities path is preserved for manual sync API back-compat (Lesson #14 multi-iteration cleanup follow-up). +1 service._fetch_shadow_neighbors — replace `entity.description or entity.name` with `entity.name`; post Wave 5 the description is always "" so the fallback was a no-op, and reading description here violates the boundary gate. Boundary gate (tests/boundaries/test_graph_curation_description_free.py, 4 AST-level assertions per spec § 5.2.a): - graph_curation_modules_do_not_read_entity_description - merge_candidate_detector_does_not_read_entity_description - lineage_merge_apply_description_free_does_not_read_entity_description - lineage_merge_apply_description_free_does_not_call_llm_or_compactor Allowlist: - lineage_merge.merge_entities (legacy back-compat) excluded by file - dto.py field declaration excluded (annotation, not a read) - LineageMergeResult.compacted_description (non-entity result shape used by legacy sync handle_action API) excluded by base name Wave-5 invariant codify pattern (Lesson #18 candidate, per huangheng PR #1932 + chenyexuan PR #1933 first-application demo): lesson sediment (cr-checklist § 四 Wave 5 description-NULL family) + mechanical gate (this boundary test) — paired so future regressions fail at CI not at review time. Tests: 1466 unit + 104 boundary all green. Risk: 0 production behavior change for legacy sync handle_action API (merge_entities preserved); new accept-apply async path uses the description-free variant exclusively. Spec: docs/zh-CN/architecture/task-31-graph-node-merge-spec-v1.md § 3.1.5 Task: task #77 (Phase A3) under task #31 umbrella Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(task-31-a3): fold huangheng cr-checklist Lesson #14/#18 NITs Per @huangheng cr-checklist Lesson #14 + #18 候选 cross-link verify (msg=be330423) — 2 non-blocker NITs on PR #1941 fix-forwarded: NIT 1 (service.py:244 deprecation marker): Add deprecation comment on the legacy sync ``handle_action()`` API return-shape line that reads ``merge_result.compacted_description``. Aligns with Lesson #14 «老 path 保留 + 标 deprecation» pattern (matches the ``lineage_merge.merge_entities`` deprecation marker added by the main commit), and explicitly cross-links the boundary test allowlist mechanism (``NON_ENTITY_BASE_NAMES``) so future grep-based audits don't dispatch on the read. NIT 2 (boundary test docstring bonus catch cross-link): Add explicit Lesson #18 候选 second-application demo trail in ``tests/boundaries/test_graph_curation_description_free.py`` module docstring — cite the ``service.py:845`` bonus catch (``text = entity.description or entity.name`` inside ``GraphCurationService._fetch_shadow_neighbors``) as canonical proof of the «lesson sediment + mechanical gate 双 layer codification» value. The spec § 3.1.5 ratify (符炫炜 + Bryce + ziang + huangzhangshu + Weston multi-source review) listed exactly 6+1 sites and every reviewer + spec author missed this 7th hidden read; the boundary gate caught it on first run, turning ``reviewer-as-detector`` into ``CI-as-detector`` per the Lesson #18 thesis. 0 production code change beyond comment / docstring text. Tests: 4/4 boundary test pass + ruff format / check clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(boundary): include dto.py in description-free AST scan Per @huangzhangshu BLOCKER (PR #1941 testing-lane CR, msg=2deb5407) + @ziang second-source ratify (msg=f485803c) + @不穷 PM dispatch (msg=a6cd42c9): the boundary gate ``test_graph_curation_modules_do_not_read_entity_description`` was whole-file excluding ``aperag/graph_curation/dto.py`` to avoid flagging the dataclass field declaration. But spec § 3.1.5 item 4 explicitly lists ``CurationEntity.from_lineage`` as one of the 6 description-free call sites, so the gate must catch future regressions that re-introduce ``entity.compacted_description`` / ``entity.description_parts`` reads inside ``from_lineage``. The whole-file exclusion was a false-positive prevention that turned out to be unnecessary: the AST walker matches ``ast.Attribute`` reads only, and dataclass field annotations (``description: str = ""``) are ``ast.AnnAssign`` nodes with ``target=ast.Name``, while constructor keyword args (``cls(description="")``) are ``ast.keyword`` nodes — neither is an ``ast.Attribute`` access on an entity object. Drops the whole-file exclusion and adds two reinforcing sister-tests so future maintainers do not regress this: * ``test_dto_module_is_in_boundary_scope`` — synthetic-AST positive control: feeds a fake ``from_lineage`` body that reads ``entity.compacted_description`` through the same offender detector and asserts the offender is surfaced. If a future refactor breaks the AST walker, this test catches the silent protection-loss. * ``test_dto_field_declaration_is_not_a_false_positive`` — live negative control: confirms the production ``dto.py`` produces zero offenders, with a docstring directing future maintainers to fix the walker (NOT re-allowlist the file) if a false- positive is ever observed. 6/6 boundary tests pass + ruff format / check clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 0ae069c commit 0535bf4

7 files changed

Lines changed: 834 additions & 62 deletions

File tree

aperag/graph_curation/candidate_generation.py

Lines changed: 18 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -36,11 +36,16 @@ class CandidatePair:
3636

3737

3838
def entity_snapshot(entity: Entity) -> dict[str, Any]:
39+
# Wave 5 description-NULL invariant (task #31 A3, spec § 3.1.5):
40+
# snapshot must not surface ``entity.description`` — Wave 5 graph
41+
# extractor stopped emitting descriptions, so reading here would
42+
# serialize an empty string at best and a stale fragment at worst.
43+
# Reviewers / async accept-apply workers operate on name + type +
44+
# source_chunk_count instead.
3945
return {
4046
"entity_id": entity.entity_id,
4147
"entity_name": entity.name,
4248
"entity_type": entity.type,
43-
"description": entity.description,
4449
"source_chunk_count": len(entity.source_chunk_ids or ()),
4550
}
4651

@@ -176,9 +181,12 @@ def _lexical_signals(left: Entity, right: Entity) -> dict[str, Any]:
176181
if name_token_overlap >= 0.5:
177182
signals["name_token_overlap"] = round(name_token_overlap, 3)
178183

179-
description_overlap = _jaccard(_tokens(left.description), _tokens(right.description))
180-
if description_overlap >= 0.2:
181-
signals["description_token_overlap"] = round(description_overlap, 3)
184+
# Wave 5 description-NULL invariant (task #31 A3, spec § 3.1.5):
185+
# ``description`` is no longer a valid lexical signal — Wave 5
186+
# graph extractor stopped emitting descriptions, so a Jaccard over
187+
# description tokens would compare two empty bags and never trip.
188+
# Removed entirely to enforce the invariant via shape (no read =
189+
# no silent regression to "" vs "" comparisons that always pass).
182190

183191
return signals
184192

@@ -193,8 +201,12 @@ def _pair_score(signals: Dict[str, Any]) -> float:
193201
score += 0.30
194202
if signals.get("name_token_overlap"):
195203
score += min(float(signals["name_token_overlap"]) * 0.35, 0.30)
196-
if signals.get("description_token_overlap"):
197-
score += min(float(signals["description_token_overlap"]) * 0.20, 0.15)
204+
# Wave 5 description-NULL invariant (task #31 A3, spec § 3.1.5):
205+
# ``description_token_overlap`` weight removed — _lexical_signals no
206+
# longer emits the signal post Wave 5 (descriptions are NULL), so
207+
# this branch is dead. Removed to keep scoring weights canonical
208+
# across detector + service paths and let the boundary test
209+
# (test_graph_curation_description_free) grep-zero this surface.
198210
if signals.get("shared_chunk_count"):
199211
score += min(int(signals["shared_chunk_count"]) * 0.20, 0.50)
200212
if signals.get("vector_score") is not None:

aperag/graph_curation/dto.py

Lines changed: 28 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -56,13 +56,17 @@ class CurationEntity:
5656
* ``name`` — same as legacy.
5757
* ``type`` — entity type / label. Pulled from
5858
``EntityWithLineage.entity_type``.
59-
* ``description`` — single-string view used by ``_pair_score`` for
60-
Jaccard token overlap. Prefers
61-
``EntityWithLineage.compacted_description`` when populated by the
62-
Wave 7 ``GraphIndexCompactor``, falling back to the joined
63-
per-doc ``description_parts`` text so newly-extracted entities
64-
that have not yet hit the compactor still produce a usable
65-
similarity signal.
59+
* ``description`` — preserved for backward-compat with the legacy
60+
DTO shape; Wave 5 description-NULL invariant (task #31 A3, spec
61+
§ 3.1.5) means new dedup detection paths must NOT read this
62+
field. ``from_lineage`` always sets it to ``""`` — the Wave 5
63+
graph extractor no longer emits ``description_parts`` /
64+
``compacted_description``, so any non-empty value here would be
65+
stale residue. Field kept (instead of removed) only so existing
66+
callers that pass ``description=""`` keyword don't break;
67+
boundary test ``test_graph_curation_description_free`` grep-
68+
zeroes any ``entity.description`` *read* in
69+
``aperag/graph_curation/**`` + ``aperag/indexing/merge_candidate_detector.py``.
6670
* ``source_chunk_ids`` — flattened across all
6771
``EntityWithLineage.source_lineage`` members so the candidate
6872
generator's ``shared_chunks`` heuristic still works.
@@ -72,7 +76,12 @@ class CurationEntity:
7276
collection_id: str
7377
name: str
7478
type: str
75-
description: str
79+
# Wave 5 description-NULL invariant (task #31 A3): always ``""``
80+
# for entities materialised via :meth:`from_lineage`. Field kept
81+
# default-empty for backward-compat with callers that explicitly
82+
# pass ``description=""`` (or pre-Wave-5 fixtures); new code paths
83+
# must not read it (boundary test enforced).
84+
description: str = ""
7685
source_chunk_ids: Sequence[str] = field(default_factory=tuple)
7786

7887
def __post_init__(self) -> None:
@@ -97,12 +106,17 @@ def from_lineage(
97106
``collection_id`` is supplied by the caller because the
98107
``EntityWithLineage`` view is per-collection-bound at the
99108
store level and does not carry the id field on the row.
100-
"""
101-
if entity.compacted_description:
102-
description = entity.compacted_description
103-
else:
104-
description = "\n\n".join(p.text for p in entity.description_parts if p.text)
105109
110+
Wave 5 description-NULL invariant (task #31 A3, spec § 3.1.5
111+
item 4): description input no longer depends on
112+
``entity.compacted_description`` / ``entity.description_parts``
113+
— Wave 5 graph extractor stopped emitting them, so the legacy
114+
derivation would always collapse to ``""`` and any non-empty
115+
value (e.g. on a stale row from before the cut-over) would
116+
leak into dedup scoring. Set explicitly to ``""`` here so the
117+
invariant is enforced at the boundary, not silently relied on
118+
downstream.
119+
"""
106120
chunk_ids: list[str] = []
107121
for member in entity.source_lineage:
108122
chunk_ids.extend(member.chunk_ids)
@@ -112,7 +126,7 @@ def from_lineage(
112126
collection_id=collection_id,
113127
name=entity.name,
114128
type=entity.entity_type,
115-
description=description,
129+
description="",
116130
source_chunk_ids=tuple(chunk_ids),
117131
)
118132

aperag/graph_curation/lineage_merge.py

Lines changed: 169 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -189,6 +189,20 @@ async def merge_entities(
189189
Empty ``source_names`` is a no-op (returns immediately with
190190
``LineageMergeResult(final_target=target_name, ...,
191191
merged_source_ids=[])``).
192+
193+
⚠️ **Wave 5 description-NULL invariant note** (task #31 A3,
194+
spec § 3.1.5): this is the *legacy* description-bearing path
195+
— it still calls the LLM unified description / Compactor /
196+
``__curation_merge__`` sentinel write / vector embed write.
197+
It is preserved for the manual / sync ``handle_action()`` API
198+
compatibility lane (per spec § 3.1.5 «老 ``lineage_merge.py``
199+
路径保留作 manual API 兼容,标 deprecation Lesson #14 multi-
200+
iteration cleanup follow-up»). The async accept-apply worker
201+
introduced by task #31 must use
202+
:meth:`merge_entities_apply_description_free` instead — that
203+
variant skips every LLM/compactor/vector-embed step so it
204+
respects the Wave 5 invariant that
205+
``EntityRecord.description`` is always ``NULL``/``""``.
192206
"""
193207
if not source_names:
194208
return LineageMergeResult(
@@ -328,6 +342,161 @@ async def merge_entities(
328342
compacted_description=compacted,
329343
)
330344

345+
async def merge_entities_apply_description_free(
346+
self,
347+
*,
348+
target_name: str,
349+
source_names: Sequence[str],
350+
merged_by: str | None = None,
351+
) -> LineageMergeResult:
352+
"""Wave 5 description-NULL apply variant — task #31 A3 (spec
353+
§ 3.1.5).
354+
355+
Async accept-apply worker entry point. Mirrors
356+
:meth:`merge_entities` step ordering (1 → 2 → 3 → 6a → 8) but
357+
deliberately **skips** the four description-bearing steps
358+
forbidden by the Wave 5 invariant:
359+
360+
* Step 4 — LLM unified description **(skipped)**: graph
361+
extractor no longer emits ``description_parts`` after Wave
362+
5, so unifying empty fragments would call the LLM with no
363+
input and burn budget for no value.
364+
* Step 5 — :class:`GraphIndexCompactor` pass **(skipped)**:
365+
there is no unified description to compact.
366+
* Step 6b — final ``__curation_merge__`` sentinel upsert
367+
**(skipped)**: nothing to anchor; the canonical entity
368+
carries no description text post Wave 5.
369+
* Step 7 — vector layer write **(skipped)**: graph entity
370+
vectors are owned by the ``graph_vectors`` worker (Wave 5
371+
task #5 / #7 split). The merger never re-embeds a
372+
description here — the orphan source vectors are GC'd by
373+
the ``q:graph_vector`` clean-up lane (task #11) so the
374+
accept-apply path does not need to call the vector store at
375+
all (no write, no delete).
376+
377+
What still runs:
378+
379+
1. ``alias_repo.resolve_canonical(target_name)`` — flatten
380+
any pre-existing alias chain so the final target is
381+
canonical (1-hop guarantee, same as legacy step 1).
382+
2. ``alias_repo.upsert_alias(alias=source, target=...)`` per
383+
source — guarantees future indexer writes redirect.
384+
:class:`AliasCycleError` propagates.
385+
3. ``store.get_entity`` for target + sources — needed only to
386+
read ``source_lineage`` so per-doc lineage members can be
387+
re-anchored under the canonical name. ``description_parts``
388+
is post-Wave-5 always ``[]`` and is **not** read.
389+
4. (Step 6a equivalent) Re-anchor each source's
390+
``source_lineage`` members under the target name with
391+
``EntityRecord.description=""`` so per-doc tracking is
392+
preserved (invariant #1, L1 不污染) without re-introducing
393+
description residue.
394+
5. (Step 8 equivalent) Delete sources from L1 only — vector
395+
store is left to task #11 GC, mirroring how every other
396+
description-free path treats orphan vectors.
397+
398+
Failure mode is identical to :meth:`merge_entities` — the
399+
merge is not a single SQL transaction across all steps; the
400+
alias upsert (step 2) is transactional inside
401+
:class:`AliasMapRepository`, and every downstream write is
402+
keyed on ``(document_id, parse_version)`` so a retried merge
403+
is idempotent.
404+
405+
Returns ``LineageMergeResult`` with
406+
``unified_description=""`` and ``compacted_description=None``
407+
— fields are kept on the result shape for backward-compat
408+
with the legacy caller (UI layer reuses the same shape) but
409+
new callers should ignore them.
410+
"""
411+
if not source_names:
412+
return LineageMergeResult(
413+
final_target=target_name,
414+
merged_source_ids=[],
415+
unified_description="",
416+
compacted_description=None,
417+
)
418+
419+
# Step 1 — flatten target chain (1-hop).
420+
final_target = await self._alias_repo.resolve_canonical(collection_id=self._collection_id, name=target_name)
421+
422+
# Step 2 — record alias rows. Cycle reject propagates as
423+
# AliasCycleError so the caller can abort cleanly.
424+
for src in source_names:
425+
await self._alias_repo.upsert_alias(
426+
collection_id=self._collection_id,
427+
alias_name=src,
428+
target=final_target,
429+
merged_by=merged_by,
430+
)
431+
432+
# Step 3 — read target + sources (lineage only; description_parts
433+
# is empty post-Wave-5 and is intentionally not read here).
434+
target_entity = await self._store.get_entity(final_target)
435+
if target_entity is None:
436+
# Target was GC'd between merge initiation and execution —
437+
# alias rows already written; no L1 update can succeed.
438+
logger.warning(
439+
"lineage_merger(description_free): target %r missing at merge time (collection=%s); "
440+
"alias rows written but no L1 update performed",
441+
final_target,
442+
self._collection_id,
443+
)
444+
return LineageMergeResult(
445+
final_target=final_target,
446+
merged_source_ids=list(source_names),
447+
unified_description="",
448+
compacted_description=None,
449+
)
450+
451+
source_entities: list[EntityWithLineage] = []
452+
for src in source_names:
453+
row = await self._store.get_entity(src)
454+
if row is not None:
455+
source_entities.append(row)
456+
457+
# Step 6a equivalent — re-anchor source lineage members under
458+
# the target name with ``description=""`` so per-doc tracking
459+
# is preserved without leaking Wave-5-erased description text.
460+
# We re-anchor per ``LineageMember`` instead of per
461+
# ``DescriptionPart`` because post-Wave-5 entities carry
462+
# lineage rows but no description parts, and the indexer write
463+
# contract still requires a record + lineage tuple.
464+
bulk_parts: list[tuple[EntityRecord, LineageMember]] = []
465+
for src in source_entities:
466+
for member in src.source_lineage:
467+
chunk_ids = tuple(member.chunk_ids)
468+
bulk_parts.append(
469+
(
470+
EntityRecord(
471+
name=final_target,
472+
entity_type=target_entity.entity_type,
473+
description="",
474+
source_chunk_ids=chunk_ids,
475+
),
476+
LineageMember(
477+
document_id=member.document_id,
478+
parse_version=member.parse_version,
479+
tenant_scope_key=member.tenant_scope_key,
480+
chunk_ids=chunk_ids,
481+
),
482+
)
483+
)
484+
if bulk_parts:
485+
await self._store.bulk_upsert_entity_with_lineage_parts(parts=bulk_parts)
486+
487+
# Step 8 equivalent — delete sources from L1 only. Vector
488+
# cleanup is owned by the ``q:graph_vector`` GC lane
489+
# (task #11) per the description-free contract.
490+
for src_entity in source_entities:
491+
await self._store.delete_entity(src_entity.name)
492+
493+
return LineageMergeResult(
494+
final_target=final_target,
495+
merged_source_ids=list(source_names),
496+
unified_description="",
497+
compacted_description=None,
498+
)
499+
331500
# ------------------------------------------------------------------
332501
# internals
333502
# ------------------------------------------------------------------

aperag/graph_curation/service.py

Lines changed: 27 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -241,6 +241,24 @@ async def _load(session: AsyncSession):
241241
# in W7-8). ``edges_redirected`` / ``edges_collapsed`` are 0
242242
# by design — alias redirect happens at indexer write-time via
243243
# the decorator, not as a per-merge count we can surface.
244+
#
245+
# ⚠️ DEPRECATED (task #31 A3, spec § 3.1.5 — Lesson #14 multi-
246+
# iteration cleanup family): the ``merge_result`` carries
247+
# ``compacted_description`` / ``unified_description`` only on
248+
# the legacy ``LineageEntityMerger.merge_entities`` path —
249+
# post-Wave-5 the description-free apply variant
250+
# (``merge_entities_apply_description_free``) returns ``None``
251+
# / ``""`` for both fields. This sync ``handle_action()`` API
252+
# is preserved for back-compat with existing callers that
253+
# consume ``merge_result.description`` in the response;
254+
# follow-up cleanup once all consumers migrate to the async
255+
# accept-apply path will drop this field from the response
256+
# shape entirely. The boundary test
257+
# ``tests/boundaries/test_graph_curation_description_free.py``
258+
# explicitly allowlists ``merge_result.compacted_description``
259+
# via the ``NON_ENTITY_BASE_NAMES`` mechanism (it is a
260+
# ``LineageMergeResult`` field, not an ``entity`` description
261+
# read).
244262
merge_description = merge_result.compacted_description or merge_result.unified_description
245263
chunk_ids: set[str] = set()
246264
target_after = await merger._store.get_entity(merge_result.final_target) # noqa: SLF001
@@ -842,7 +860,15 @@ async def _fetch_shadow_neighbors(
842860
flt = Eq("indexer", "graph_entity")
843861
out: dict[str, list[tuple[str, float]]] = {}
844862
for entity in entities:
845-
text = entity.description or entity.name
863+
# Wave 5 description-NULL invariant (task #31 A3, spec § 3.1.5):
864+
# ``CurationEntity.description`` is always ``""`` post Wave 5
865+
# (see ``CurationEntity.from_lineage``); the legacy
866+
# ``description or name`` fallback short-circuits to ``name``
867+
# uniformly, so read the name directly. Mirrors the
868+
# ``MergeCandidateDetector._embedding_query_text`` shape but
869+
# this consumer pre-dates the helper and only needs a stable
870+
# text input — name is sufficient.
871+
text = entity.name
846872
if not text:
847873
continue
848874
try:

0 commit comments

Comments
 (0)