Skip to content

Commit 836fe54

Browse files
earayuclaude
andcommitted
fix(celery T2.1): revert cutover to worker single-TX + add graph cleanup dispatch
Three architect rulings (msg=492315e8) applied as a follow-up on top of T2.1 commit 7f51d44: **Ruling 1 — §F.3 cutover MUST be single-TX in worker, not split.** My earlier T2.1 implementation split §F.3 into: - worker writes statement 1 (status=ACTIVE) on sync success - reconciler.reconcile_cutover() writes statements 2+3 (demote+promote) ~30s later on its next cycle That's a drift from the §F.3 explicit lock "Three statements in one transaction" + introduces an ACTIVE-but-not-is_serving inconsistency window the spec disallows in §F.4. Fix: - aperag/indexing/orchestrator.py: replace _finalize_active with _finalize_active_with_cutover — runs all 3 §F.3 statements (status=ACTIVE → demote prior is_serving for (doc, modality) → promote new) inside the same session.begin() block. The §F.1 partial unique index (uniq_document_index_v2_serving) is honoured naturally because statement 2 (demote-FALSE) precedes statement 3 (promote-TRUE) — index never sees two TRUE rows mid-TX. - aperag/indexing/reconciler.py: remove reconcile_cutover() entirely + drop from run_reconcile_loop (now 3 scans: PENDING dispatch + FAILED retry + RUNNING reclaim). Module docstring updated to reference the orchestrator-side cutover with explicit pointer to architect ruling msg=492315e8 Ruling 1. - aperag/indexing/__init__.py: remove reconcile_cutover from exports. - tests: 3 reconciler-cutover tests rewritten as 3 orchestrator-side cutover tests (process_one_task happy path → ACTIVE+is_serving=TRUE in one TX; partial unique invariant; per-modality independence). E2E smoke updated — no longer needs a separate reconcile_cutover step after orchestrator completion. **Ruling 3 — graph cleanup MUST land in T2.1, not punt to T2.2.** T2.2 lane is quota + bulkhead per Ruling 2 simplification; cleanup is squarely chenyexuan's lane. Earlier T2.1 logged a WARN + skipped graph backend cleanup as a "T2.2 follow-up" — that was a scope leak. Fix in aperag/indexing/cleanup.py: - Two distinct entry points with different graph semantics: (A) cleanup_orphan_parse_versions — orphan parse_v GC. For graph, this is a backend no-op because §D.3.6 sync supersede already removed old parse_version's lineage members when the new parse_version was written (per amended §D.3.2 canonical PR #1725 head a0a4799 — sync clears by document_id, not parse_v). Counted under "graph_noop" for telemetry; DB row still GC'd. (B) cleanup_for_deleted_documents — caller-driven, runs when a document is removed. For non-graph: flat backend delete per (document_id, parse_version). For graph: invoke the lineage cleanup loop on the worker's underlying LineageGraphStore via EntityLock (Wave 1 conventions exposed as _store + _entity_lock). Per-document dedup so multiple parse_version rows for the same doc share one lineage cleanup call. - _is_graph_worker uses Modality.GRAPH check (no graph.py import) to avoid pulling Nebula/Neo4j extras into the cleanup module. - _flat_backend_delete_callable renamed for clarity (was _backend_delete_callable); only walks worker._backend (vector / fulltext / summary / vision); explicitly does NOT walk _store. - _cleanup_graph_lineage_for_document implements §D.3 lineage cleanup at the storage layer using duck-typed Wave 1 conventions (find_entity_ids_with_lineage / remove_entity_lineage_member / optional gc_entity_if_orphan + symmetric relation cleanup). Each entity is serialized through entity_lock.acquire() so a concurrent graph sync cannot race. Tests: - test_cleanup_orphan_parse_version_for_graph_is_backend_noop — asserts graph_noop counter increments + backend delete is NOT called + DB row is still GC'd - test_cleanup_for_deleted_documents_removes_non_graph_backend_per_parse_version — vector path: delete_by_filter per parse_version + rows dropped - test_cleanup_for_deleted_documents_calls_graph_lineage_cleanup_once_per_doc — graph path: lineage cleanup runs once per doc regardless of how many parse_version rows; entity GC + relation cleanup invoked - test_cleanup_for_deleted_documents_handles_empty_input - _StubLineageGraphStore + _StubAsyncLock + _GraphLikeWorker test fixtures — duck-typed stand-ins for GraphModalityWorker so we don't need graph extras to test cleanup Local pytest: full indexing suite + Phase 3 audit gate = 104 passed, 0 failed (Wave 1 + T2.1 + T2.2 + T2.1 follow-up all green together on rebased branch on Bryce's commit 057409f). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 057409f commit 836fe54

5 files changed

Lines changed: 691 additions & 218 deletions

File tree

aperag/indexing/__init__.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -27,6 +27,7 @@
2727
from aperag.indexing.cleanup import (
2828
CLEANUP_INTERVAL_SECONDS,
2929
ORPHAN_COOLDOWN_SECONDS,
30+
cleanup_for_deleted_documents,
3031
cleanup_orphan_parse_versions,
3132
find_orphan_parse_versions,
3233
run_cleanup_loop,
@@ -127,7 +128,6 @@
127128
HEARTBEAT_STALE_SECONDS,
128129
RECONCILE_BATCH_SIZE,
129130
RECONCILE_INTERVAL_SECONDS,
130-
reconcile_cutover,
131131
reconcile_failed_retry,
132132
reconcile_pending_dispatch,
133133
reconcile_running_reclaim,
@@ -242,13 +242,13 @@
242242
"reconcile_pending_dispatch",
243243
"reconcile_failed_retry",
244244
"reconcile_running_reclaim",
245-
"reconcile_cutover",
246245
"run_reconcile_loop",
247246
# Cleanup (T2.1)
248247
"CLEANUP_INTERVAL_SECONDS",
249248
"ORPHAN_COOLDOWN_SECONDS",
250249
"find_orphan_parse_versions",
251250
"cleanup_orphan_parse_versions",
251+
"cleanup_for_deleted_documents",
252252
"run_cleanup_loop",
253253
# Quota (T2.2 §H.5)
254254
"DEFAULT_TENANT_FALLBACK",

0 commit comments

Comments
 (0)