feat(celery Wave 1): Foundation + 5 modalities + observability#1726
Merged
Conversation
…+ parser Phase celery T1.1 per docs/modularization/indexing-redesign-design-pack.md (PR #1725 v3 head 5d7a60f). Foundation lane that the other Wave 1 modality lanes (T1.2 graph @bryce / T1.3 vector+fulltext / T1.4 summary+vision / T1.5 observability) depend on. What this adds (~600 LOC + tests): - alembic migration `f9c4d2a8e1b5_indexing_redesign_document_index.py` creates the `document_index` table with the §F.1 partial unique index `uniq_document_index_serving` enforcing "at most one is_serving=TRUE row per (document_id, modality)" at the DB layer (Bryce v2 review msg=7ccb176f #2). Postgres native; SQLite ≥3.8 supports the same syntax (Tier 1 §L deploy stays consistent). - `aperag/indexing/models.py` — DocumentIndex SQLAlchemy ORM mirroring the alembic schema, plus `Modality` (5 values per §C/§D) and `IndexStatus` (4 lifecycle states per §F.2) string-enums. - `aperag/indexing/base.py` — `ModalityWorker` ABC with `derive` + `sync` so per-modality workers (T1.2-T1.5) inherit the §D.1 "DELETE-by-(doc, parse_version) THEN INSERT" replace-idempotent contract. Graph reinterprets DELETE as the §D.3 lineage-level DELETE+INSERT internally; the ABC accepts that variation. - `aperag/indexing/object_store.py` — Atomic write helpers that wrap the existing `aperag.objectstore` package: LocalFS uses tmp+fsync+rename per §C.7; S3/MinIO relies on the single-request PutObject (or multipart CompleteMultipartUpload) visibility gate. Includes `read_or_none` for the §C.7 read contract and an `InMemoryObjectStore` test fixture so downstream T1.x can wire unit tests without touching disk. - `aperag/indexing/parser.py` — Deterministic parser entry point that produces the three shared artifacts (`markdown.md` / `outline.json` / `chunks.jsonl`) under `derived/parse_<v>/` per §C.1. parse_version is computed via the canonical D10.g §E.2 helper (`compute_parse_version`) so a chunking change rolls the version. T1.1 ships an in-process simulator implementation that proves the write contract; production parser integration (docparser/Marker/OCR) swaps in at T2.x without changing the artifact shape. - `aperag/migration/env.py` — Register `aperag.indexing.models` for alembic autogen (the new module deliberately lives outside the per-domain `db/models.py` tree because the `aperag.domains.indexing` surface is the Wave 3 hard-delete target). Tests cover the three Wave 1 acceptance gates locked by the design pack §K: 1. Partial unique invariant: an INSERT of a second is_serving=TRUE row for the same (document_id, modality) raises IntegrityError; non-serving rows + a different modality on the same document are allowed; the §F.3 three-statement cutover transaction satisfies the constraint. 2. Object-store atomic write: `_LocalAtomicWriter` produces a final destination file with no `.tmp.*` siblings remaining; concurrent writers to different artifacts in the same parse_version directory each land their own bytes; `InMemoryObjectStore` matches the single-call atomicity semantic. 3. Parser → derived/ round-trip: parse_document writes the three canonical artifacts; identical inputs yield identical parse_version + identical artifact contents (§C.3 idempotent retry); a chunking config change rolls parse_version (§E.2 hash); chunks.jsonl round-trips via `read_chunks`; outline.json carries slash-separated `section_path` + slug `heading_anchor` per the D10.c §A.9 R1 lock; re-running parse_document overwrites atomically; missing/empty artifacts are treated as "derive not yet complete" (§C.7 read contract); path traversal is rejected. Out of scope for this lane (per §K decomposition): - T1.2 graph modality + §D.3 lineage model (@bryce, task #7) - T1.3 vector + fulltext modalities (chenyexuan, task #8) - T1.4 summary + vision modalities (chenyexuan, task #9) - T1.5 observability OTLP emit (chenyexuan, task #10) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…dation fix-forward Bundles the architect-mandated fix-forward (msg=4a801b2b, msg=c3b0ba5b, msg=07b5b1e6) on top of T1.1 foundation: T1.1 fix-forward (per architect rulings on PR #1726 P0 bugs): - Bug1: object_store.py LocalObjectStore import alias (class is named Local) - Bug2: tablename → document_index_v2 + index renames (Wave 3 renames back to canonical via alembic per task #14 acceptance amendment) - NEW (§H.2): tenant_scope_key VARCHAR(64) NOT NULL column + idx_document_index_v2_tenant_scope index — locked into Wave 1 schema for T2.2 quota lane to consume without migration churn - Tests: fixture default tenant_scope_key="user:test" T1.3 Vector + Fulltext (§D.1 replace-idempotent contract): - VectorBackend / InMemoryVectorBackend Protocol + Qdrant-shaped delete_by_filter + upsert_point - VectorModality: derive no-op pass-through, sync DELETE-by-(doc, parse_v) THEN per-chunk upsert with placeholder embedding (sha256-derived 16-dim) - FulltextBackend / InMemoryFulltextBackend + delete_by_query + bulk_index - FulltextModality: shares chunks.jsonl with vector (§C.6); chunk_id parity preserved for hybrid dedup at search layer - Tests: replace-idempotent on double sync, new parse_version doesn't corrupt old slot, hybrid chunk_id parity, modality discriminator on payload, missing-artifact no-op (§C.7 reschedule semantic) T1.4 Summary + Vision (§C.6 + §D.2 expensive-derive split): - SummaryModality: derive reads parser markdown.md, runs placeholder summarizer (first non-heading paragraph), embeds, writes summary.json atomically; sync deletes by filter + upserts single point keyed summary:{document_id}:{parse_version} - VisionModality: derive reads synthetic image-records JSON (T1 simulator contract — T2.x replaces with real PDF extract + vision-LLM), writes vision/manifest.jsonl, sync upserts one point per image keyed vision:{document_id}:{parse_version}:{image_id} - Both backends use Qdrant-shaped Protocol + InMemory test fixtures - Tests: derive persists canonical artifact (cost preserved across retries), idempotent on double sync, new parse_version doesn't corrupt old slot, modality discriminator, missing-artifact no-op T1.5 Observability primitives (§J.1 SLI emission): - 5 metric name constants prefixed indexing.* — index_lag_seconds / index_failure_rate / index_success / queue_depth / worker_utilization - MetricsEmitter Protocol + NoopMetricsEmitter (Tier 1 deploy without OTLP) + InMemoryMetricsEmitter (test fixture) - emit_index_lag / emit_index_failure / emit_index_success / emit_queue_depth / emit_worker_utilization helpers — modality attribute optional, utilization clamps to [0, 1] and handles capacity=0 - Tests: emission shape contract for each helper + metric name prefix lock Per architect msg=f21a79f0 + PM msg=07b5b1e6 + msg=95012fdb: this commit accumulates onto the Wave 1 mega-PR #1726. Bryce will rebase + push T1.2 graph commit on the same branch; once T1.2 lands, the full Wave 1 PR flips to in_review for huangheng's step 0+ / step 0 / step 0''' / cross-lane caller sweep CR + verdict. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…edis lock + tenant_scope_key Per docs/modularization/indexing-redesign-design-pack.md §D.3 + architect msg=cc555e33 / msg=f2921ae0 / msg=c3b0ba5b lineage rulings. The graph modality is the only one whose backend rows are shared across documents, so the simple DELETE-by-(doc, parse_version) + re-INSERT pattern that vector/fulltext/summary/vision use breaks the shared-entity model — Bryce v2 review msg=7ccb176f #3 surfaced this. T1.2 lands the lineage-tracked entity / relation rows + the §D.3.2 two-phase sync algorithm + the per-entity Redis lock that protects Nebula's read-modify-write window from racing. Surface (aperag/indexing/graph.py, ~1017 lines): * Lineage data model — LineageMember(document_id, parse_version, tenant_scope_key, chunk_ids), DescriptionPart, EntityRecord, RelationRecord, EntityWithLineage, RelationWithLineage. The tenant_scope_key lives at SET-element level (not entity row level) per architect msg=c3b0ba5b — placement chosen so a shared entity cited by multiple tenants still has one row but each lineage member carries its own tenant attribution for read-path ACL filtering (§H.2 quota / organization key). * Per-entity lock — EntityLock Protocol + InMemoryEntityLock (asyncio.Lock per key, single-process default) + RedisEntityLock (Redis SETNX-style lock keyed by f"{prefix}:{slot}" where slot is crc32(entity_id) % 4096 to bound the key space). The lock is mandatory on the Nebula path because Nebula 3.x's list ops require application-layer read-modify-write; concurrent sync calls touching the same shared entity would otherwise lose lineage members at the network round-trip. * LineageGraphStore Protocol — backend abstraction so the §D.3.2 algorithm in GraphModalityWorker is portable across the three backends listed in §D.3.5 (Nebula 3.x, Neo4j with native list + APOC, in-memory). Methods filter lineage by document_id (not exact (doc, parse_version)) so re-parsing supersedes the old parse_version cleanly per §D.3.6 step 3 (see §D.3.2 amendment note in docstring). * InMemoryLineageGraphStore — Python-dict reference implementation used as the canonical correctness oracle. The §D.3.6 self-test + concurrent race test run against it; any future Nebula / Neo4j adapter that satisfies the Protocol inherits pass/fail status from this suite. * GraphExtractor — Callable injection point so the production LightRAG-based LLM extractor (Wave 2 wiring) and the test stubs share the same surface. The kg.jsonl artifact persists the extractor output so retries never re-charge LLM cost (§C.6 CANONICAL artifact contract). * serialize_kg_jsonl / parse_kg_jsonl — line-oriented JSON with a "kind" discriminator so future record types can be added without breaking the file format. Empty payloads encode as b"\\n" (one byte) so the §C.7 "empty body == derive not finished" sentinel cannot collide with a deliberate "no records produced" payload — the deletion flow (§D.3.6 step 4 / step 5) publishes b"\\n" to cleanly clear a document's lineage contribution. * GraphModalityWorker — implements ModalityWorker ABC. derive() reads chunks.jsonl produced by the T1.1 parser, calls the injected extractor, writes kg.jsonl atomically. sync() reads kg.jsonl and applies the §D.3.2 algorithm: Phase 1 (cleanup): for every entity / relation in the lineage backend whose lineage SET contains the document_id, remove ALL members for that document_id (across any parse_version) and GC rows that go orphan. Per-entity lock held during cleanup so concurrent syncs cannot race the read-modify-write. Phase 2 (rebuild): for every entity / relation in kg.jsonl, upsert with a fresh LineageMember stamped with the current (document_id, parse_version, tenant_scope_key). Per-entity lock held during rebuild for the same reason. Tests (tests/unit_test/indexing/test_t1_2_graph.py, ~1044 lines, 16 test cases): * §D.3.6 five-step idempotency suite (test_d3_6_step1..step5): doc_A v1 → doc_B v1 → doc_A v2 supersedes → delete doc_A → delete doc_B with full entity GC. Pin the §D.3.2 algorithm against the §D.3.6 narrative; the regression Bryce surfaced in msg=7ccb176f #3 fails this suite under the v1 append-on-conflict path. * Relation-lineage symmetric coverage (test_relation_lineage_doc_a_then_doc_b_then_delete_doc_a) — proves the same lineage semantics flow through relation evidence_lineage. * §D.4 byte-equivalent re-sync (test_d4_byte_equivalent_resync) — double-sync with the same artifact leaves the backend identical. The cross-modality idempotency contract every modality must pass. * Nebula race condition under per-entity lock (test_nebula_race_under_per_entity_lock_preserves_both_writes + test_nebula_race_without_lock_loses_a_writer) — uses a _RaceProvocateurStore that emulates Nebula's read-modify-write network round-trip with a deterministic asyncio yield. Under the in-memory entity lock, both writers' lineage members end up in the SET (PASS); under a no-op lock, deterministically one writer loses (the negative control). Pins the architect msg=f2921ae0 invariant that per-entity serialization is mandatory not optional. * tenant_scope_key propagation (test_tenant_scope_key_propagates_into_lineage_members) — pins the SET-element placement decision per architect msg=c3b0ba5b so a future regression that drops the field or moves it to entity row level fails loudly. * kg.jsonl round-trip (test_kg_jsonl_*) — entity / relation serialization, forward-compatible kind skipping, empty-body encoding (always at least b"\\n" to distinguish "no records" from "derive crashed"). * derive contract (test_derive_writes_kg_jsonl_under_canonical_path + test_sync_with_missing_artifact_is_a_noop) — derive writes to derived/parse_<v>/kg.jsonl per §C.6 canonical layout; sync no-ops when the artifact is missing per §C.7 read contract. * End-to-end with the real T1.1 parser (test_end_to_end_with_real_parser_chunks) — proves graph.derive cooperates with the chunks.jsonl shape that aperag.indexing.parser.parse_document actually produces, no hidden coupling beyond the §C.6 contract. aperag/indexing/__init__.py — re-export the T1.2 graph public surface alongside chenyexuan's T1.3/T1.4/T1.5 modality exports so the indexing package surface is uniform across all five modalities. Lint: ruff check + ruff format --check both clean across aperag/indexing/ and tests/unit_test/indexing/. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…patch Three follow-up bugs surfaced by Bryce's local pytest collection (msg=464d5b70) on top of fix-forward 859f899: 1. test_t1_1_foundation.py:69 — ImportError. The class is named `Local` (not `LocalObjectStore`); my main-tree object_store.py already uses the `Local as LocalObjectStore` alias per architect ruling on Bug 1 (msg=4a801b2b), but the test file's own import line still referenced the wrong name. Fixed: same alias pattern. 2. test_local_atomic_write_uses_tmp_rename_dance — AttributeError on `aperag.objectstore.local.settings`. The module never had a `settings` attribute; the monkey-patch block was unfounded. `Local(LocalConfig)` accepts the config struct directly, no module-level singleton dependency. Replaced the monkey-patch block with a direct `LocalObjectStore(LocalConfig(root_dir=...))` construction. 3. test_concurrent_atomic_writes_dont_clobber_each_other — same AttributeError pattern, same fix. No production code changed; only the T1.1 test fixture is simplified (net -30 lines). Lint clean. Wave 1 PR #1726 ready to flip to in_review after this push. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
earayu
added a commit
that referenced
this pull request
Apr 26, 2026
Bryce v3 implementation review (msg=464d5b70) caught a spec bug in
§D.3.2 step 1b: the pseudocode used `(document_id, parse_version)`
exact-match for lineage filter, which contradicts §D.3.6 narrative
step 3 ("doc_A v2 写入(覆盖 doc_A 旧 lineage)"). Strict exact-match
would leave lineage[A,v_old] + lineage[A,v_new] coexisting after a
re-parse, violating the expected supersede semantic.
Architect ruling (msg=80c5dc06) is to amend §D.3.2 step 1b to filter
by `document_id` only (not parse_version). This makes sync(doc, v_new)
self-contained for supersede; orchestrator does not need to do
explicit clear-then-sync.
§D.3.6 narrative remains canonical. PR #1726 Wave 1 graph implementation
follows the corrected algorithm.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ex duplicate huangheng's Wave 1 CR (msg=8e67bf0e) flagged P1: test_phase3_classes_have_single_definition_site fails because Wave 1 introduces aperag/indexing/models.py:DocumentIndex alongside the legacy aperag/domains/indexing/db/models.py:DocumentIndex still owned by Celery. Architect ruling msg=5be9a535 — option (a): add a narrow allowlist covering exactly this (class_name, file_pair); reject (b) renaming the new ORM class (would force a Wave 1 + Wave 3 double-rename, conflicts with msg=4a801b2b "Python class name stays canonical, only __tablename__ differs") and (c) xfail (does not express the "intentional transitional state" semantic). Wave 3 task #14 acceptance per architect amendment will remove this allowlist entry in the same PR that deletes the legacy file + alembic renames document_index_v2 back to document_index. The allowlist entry is narrow: - Only the exact frozenset of two file paths is accepted; any third duplicate (Wave 4+ regression, e.g. a copy-paste of the class) still flags as an offender. - Only matches `class DocumentIndex(Base):` definitions (the existing orm_pattern); pydantic schemas with the same name are unaffected. - No global widening — no other Phase 3 class gets an exception. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
earayu
added a commit
that referenced
this pull request
Apr 26, 2026
…ite proposal (#1725) * docs(indexing): add indexing redesign design pack — first-principles rewrite proposal Per earayu2 directive (#celery msg=56812dd6 + msg=d8080c08): full redesign of the document indexing system, prioritizing simplicity and reliability over feature breadth, targeting 100 concurrent docs, with hard-cut authorization (pre-launch / no users / no migration). Design pack contents (1049 lines, 11 sections): - §A — Current system analysis with file:line evidence (3-layer ownership skew, Python lease thread tied to worker process, graph index NOT replace-idempotent per nebula.py:354 upsert_entities, ~995 lines in tasks.py mixing infra + business) - §B — First principles (single SoT in DB, idempotent convergence, source/ derived/index three-layer separation, concurrency bounded by external capacity, simple > complex) - §C — Three-layer document model (collections/<id>/documents/<id>/source/ + derived/parse_<v>/{markdown.md, chunks.jsonl, kg.jsonl, summary.json, vision/} + backend index stores) - §D — Idempotency contract per modality (DELETE-by-(document_id, parse_version) before INSERT for all 5 modalities; fixes graph index append bug) - §E — Concurrency model decision matrix (HTTP-only / lightweight task / Celery refactor); recommends lightweight Redis-backed asyncio worker pool per modality (5 worker processes, ~80-line reconciler, no Celery / no chord / no Python lease thread) - §F — State machine + atomic flip (4 status values vs current 6; document.active_parse_version + pending_parse_version with transactional flip; deletion via async cleanup worker) - §G — Multi-modal unified pipeline (Modality ABC with derive() + sync() contract; collapses earayu2's "Celery task 绕来绕去最后又绕回 graph index" complaint into 2 functions in 1 file) - §H — Multi-tenant isolation (recommend simple — required tenant context + bulkheads, defer fairness machinery until observability shows real noisy-neighbor signal) - §I — Failure recovery (3 modes: worker crash, transient backend, permanent failure; exponential backoff retry; Redis token bucket for LLM rate-limit backpressure) - §J — Observability (4 SLI: index_lag_seconds, index_failure_rate, queue_depth, worker_utilization; OTLP wire; aligns with PR #1702) - §K — Migration plan (7 PRs: observability primitives → idempotent indexers → object store layout → worker pool → atomic flip → cutover → availability discriminator; feature-flagged dual-stack during PR-D/E; cutover deletes ~3000 lines of Celery infrastructure) Net delta: roughly +4150 / -4850 lines across 7 PRs — net subtraction despite adding functionality. Indexing layer drops from ~2500 lines to ~1500. Three open decisions deferred to earayu2: 1. Concurrency model: lightweight Redis-asyncio (recommended), HTTP-only, or Celery refactor 2. Atomic flip contract: all-modalities-ACTIVE-required (recommended) vs per-modality independent 3. PR sequence: 7-PR cut (recommended) vs combined Sibling reference: Bryce msg=791082a4 + msg=38fbf962 first-principles analysis + architect msg=19f283d5 + msg=2ee66c89 4-blind-spot synthesis. This design pack is the single canonical deliverable per earayu2's owner directive (@符炫炜 sole author of the final design). * docs(indexing): redesign pack v2 — incorporate earayu2 拍板 + 答 derived/MinIO Driven by earayu2 msg=cc0a00d7 + PM consolidation msg=32463d64. - Drop Celery → lock Redis + asyncio (§E, decision matrix removed) - Drop atomic flip → per-modality independent is_serving cutover (§F) Accept short eventual-consistency window per earayu2 directive. - Answer derived/parse_<v>/ contents per modality (§C.6) — chunks.jsonl shared by vector+fulltext, kg.jsonl, summary.json, vision/manifest.jsonl + images/, markdown.md + outline.json. - Answer MinIO/object-store suitability (§C.7) — ~150 MB / 100-doc burst, trivial; LocalFS / MinIO / S3-compatible all work; small-file + LIST caveats addressed. - Add §L private/on-premise "deploy-and-forget": Tier 1 inline (SQLite + LocalFS, ~10 docs/hour), Tier 2 docker-compose (~100 concurrent), Tier 3 horizontal scale-out — same code. - §H tenant_scope_key forward-compat hook for future organization concept; simple Redis token-bucket quota that won't lock future fairness. - §K restructure: 7 PRs → 3 waves (Foundation / Runtime / Cutover) with per-wave parallel-writability map. - §G.5 SearchResultMetadata extends w/ parse_version + index_state_per_modality (becomes structurally required under per-modality independent flip). PR #1725 v2; awaiting earayu2 final 拍板 on Wave 1 kickoff. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(indexing): v2 amendment — Bryce review deltas (chunking trade-off / serving invariant / graph entity lineage) Bryce v2 review msg=7ccb176f surfaced 3 substantive technical deltas all agreed-as-must-address by PM msg=fc307bbf. Folded into v2: §C.6 — chunks.jsonl shared-by-vector-and-fulltext is now framed as conscious trade-off (vector wants larger chunks, fulltext wants smaller) with explicit shadow-file extension hook (chunks.fulltext.jsonl + namespaced sub-IDs) preserved so future split is unblocked. §F.1 — partial unique index added at the schema layer: CREATE UNIQUE INDEX uniq_document_index_serving ON document_index (document_id, modality) WHERE is_serving = TRUE; This makes the "at most one serving row per (doc, modality)" invariant DB- enforced, not orchestrator-enforced. SQLite 3.8+ supports the same syntax (Tier 1 deploy stays consistent). §D.3 — graph entity lineage model rewritten. Cross-document shared entities ("Linus" mentioned in 100 docs) cannot be cleared by simple DELETE-by-doc without losing other docs' contributions. New model: - source_lineage: SET<{document_id, parse_version, chunk_ids[]}> - description_parts: SET<{document_id, parse_version, text}> - sync = lineage-level DELETE+INSERT, entity GC when lineage empty Includes per-entity serialization invariant for Nebula (read-modify-write without native list ops). 5-step idempotency self-test extension specified. PR #1725 v2; ready for earayu2 / Bryce final ack. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(indexing): §D.3.2 amendment — lineage cleanup by document_id only Bryce v3 implementation review (msg=464d5b70) caught a spec bug in §D.3.2 step 1b: the pseudocode used `(document_id, parse_version)` exact-match for lineage filter, which contradicts §D.3.6 narrative step 3 ("doc_A v2 写入(覆盖 doc_A 旧 lineage)"). Strict exact-match would leave lineage[A,v_old] + lineage[A,v_new] coexisting after a re-parse, violating the expected supersede semantic. Architect ruling (msg=80c5dc06) is to amend §D.3.2 step 1b to filter by `document_id` only (not parse_version). This makes sync(doc, v_new) self-contained for supersede; orchestrator does not need to do explicit clear-then-sync. §D.3.6 narrative remains canonical. PR #1726 Wave 1 graph implementation follows the corrected algorithm. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(indexing): §J.1 amendment — failure_total + success_total counter pair (was single rate gauge) huangheng Wave 1 CR (msg=8e67bf0e) flagged §J.1 spec drift: T1.5 implementation emits index_failure_total + index_success_total counter pair, not the single index_failure_rate gauge spec called for. Architect ruling: amend spec to match implementation. Counter pair is OTLP- idiomatic, preserves raw events, re-aggregates across workers without sliding-window state, and the rate is trivially computable downstream. §J.1 spec amended; §K Wave 1 acceptance bullet updated. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(indexing): §F.1 + §F.5 amendments — collection_id/source_path/tenant_scope_key columns + cleanup Path C Three Wave 3 amendments folded in to unblock chenyexuan T3.1 (msg=afe345a9): §F.1 schema: - Add collection_id VARCHAR(64) NOT NULL (denormalized from document.collection_id, populated by orchestrator at INSERT for self-contained dispatch payload — per huangheng Wave 2 CR finding msg=c94b57fe + architect ruling msg=498b12f0). - Add source_path TEXT NOT NULL (pointer to source/ artifact, worker derive reads directly without JOIN). - Add tenant_scope_key VARCHAR(64) NOT NULL (forward-compat for future organization concept per §H.2; required key for §H.5 quota bucket). Was implicit-but-not-listed in the schema before; now explicit. - Add idx_document_index_collection + idx_document_index_tenant_scope indexes for cleanup / quota scans. §F.5 cleanup worker: - Restructure to three paths (A/B/C) with explicit semantics. - Path A: orphan parse_version GC (existing); now notes graph backend no-op via §D.3 lineage supersede + graph_noop counter for telemetry. - Path B: single-document deletion cascade — explicit graph dispatch via remove_entity_lineage_member(document_id) per §D.3 amended canonical (by document_id only). - Path C (NEW): collection deletion cascade — Collection.deleted_at scan + Path B per child document + final Collection row + storage tree cleanup. Replaces legacy Celery collection_delete task with state-driven recovery (no asyncio.create_task() durability gap). These amendments unblock T3.1 commit 1+ since chenyexuan needs the spec head to reference for the audit allowlist removal and the caller-migration patterns. Wave 3 task #14 acceptance criteria (per PM msg=5939e394) now references this head. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: 符炫炜 <fuxuanwei@apecloud.io> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
8 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Wave 1 mega-PR per architect lock (msg=f21a79f0): foundation + all 5 modality workers + observability primitives, accumulated on a single branch for one-shot CR + verdict.
Scope (per §K Wave 1 acceptance):
document_index_v2schema (+tenant_scope_keyper §H.2), Modality ABC, atomic object-store write/read, deterministic parser →derived/parse_<v>/{markdown.md,outline.json,chunks.jsonl}aperag/indexing/graph.py§D.3 lineage model + per-entity Redis lock + Nebula/Neo4j dual path + SET-element-leveltenant_scope_keypropagation; 16 contract tests (test_t1_2_graph.py)deriveno-op pass-through (sharedchunks.jsonlper §C.6),syncDELETE-by-(doc, parse_version) THEN INSERT (§D.1), sharedchunk_idfor hybrid dedupsummary.json, vision writesvision/manifest.jsonl+ per-image points, both backends Qdrant-shaped, both keyed for §D.1 contractindexing.*per §J.1 (counter pairindex_failure_total+index_success_totalper amended spec; rate computed downstream), MetricsEmitter Protocol + Noop/InMemory emitters, 5 emit helpers (modality attr optional, utilization clamps to [0, 1])Architect-mandated fix-forward (already applied in commits 859f899 + f9fffe6 + 4a28d27):
LocalObjectStoreimport alias (class is namedLocal)document_index_v2(Wave 3 alembic renames back to canonical per task feat: socket debug #14 acceptance amendment)tenant_scope_key VARCHAR(64) NOT NULLcolumn +idx_document_index_v2_tenant_scopeindex, locked into Wave 1 schema for T2.2 quota lanelocal_mod.settingsmonkey-patch (the module never had that attr;Local(LocalConfig)is the supported construction path)WAVE_1_2_TEMPORARY_DUP_ALLOWLISTentry intests/unit_test/test_phase3_reexport_audit.pyfor the transitionalDocumentIndexduplicate (Wave 3 task feat: socket debug #14 removes the entry in the same PR that deletes the legacy file)T1 simulator placeholders — sha256-derived embeddings, first-paragraph summarizer, JSON image-records vision input. Production T2.x swaps placeholders for real embedding/LLM/vision services without changing the §D.1 sync contract.
Design pack canonical amendments (already amended in PR #1725 head
a0a47994)The following design-pack amendments are canonical in PR #1725 head
a0a47994and are reflected in this PR's implementation. PR #1726 does not duplicate them in source; CR should validate against PR #1725 heada0a47994.(document_id, parse_version)exact-match to(document_id)document_id-only. Singlesync(doc, parse_version, kg)call must self-contain the supersede semantic (§D.3.6 narrative is canonical). Bryce's T1.2 implementation ingraph.py(commit 7c9fe06) follows the amended contract.index_failure_rate→ counter pair amendment —index_failure_total+index_success_totalcounter pair is canonical (rate computed downstream from counter pair). OTLP-idiomatic: preserves raw events, cleanly re-aggregable across workers, no worker-side sliding-window state. T1.5 implementation already aligned.tenant_scope_key— Bryce's T1.2 attachestenant_scope_keyat lineage-member granularity (not row-level). Architect ack msg=80c5dc06: a single entity may be referenced by docs from different orgs in a future multi-tenant deployment; member-level scope preserves accurate quota / isolation boundaries.Branch coordination
Wave 1 fully assembled on
chenyexuan/celery-t1.1-foundation:Ready for @huangheng final no-blocker verdict.
Test plan
chunk_idparitytest_phase3_classes_have_single_definition_siteallowlist verified — only the narrowDocumentIndextransitional pair is exempt; any third duplicate still flags🤖 Generated with Claude Code