feat(celery Wave 1): Foundation + 5 modalities + observability by earayu · Pull Request #1726 · apecloud/ApeRAG

earayu · 2026-04-26T14:45:25Z

Summary

Wave 1 mega-PR per architect lock (msg=f21a79f0): foundation + all 5 modality workers + observability primitives, accumulated on a single branch for one-shot CR + verdict.

Scope (per §K Wave 1 acceptance):

T1.1 Foundation — document_index_v2 schema (+ tenant_scope_key per §H.2), Modality ABC, atomic object-store write/read, deterministic parser → derived/parse_<v>/{markdown.md,outline.json,chunks.jsonl}
T1.2 Graph modality (Bryce, commit 7c9fe06) — aperag/indexing/graph.py §D.3 lineage model + per-entity Redis lock + Nebula/Neo4j dual path + SET-element-level tenant_scope_key propagation; 16 contract tests (test_t1_2_graph.py)
T1.3 Vector + Fulltext modalities — Qdrant-shaped Protocol + InMemory backends, derive no-op pass-through (shared chunks.jsonl per §C.6), sync DELETE-by-(doc, parse_version) THEN INSERT (§D.1), shared chunk_id for hybrid dedup
T1.4 Summary + Vision modalities — expensive-derive split (§C.6 + §D.2): summary writes summary.json, vision writes vision/manifest.jsonl + per-image points, both backends Qdrant-shaped, both keyed for §D.1 contract
T1.5 Observability primitives — 5 SLI metric constants prefixed indexing.* per §J.1 (counter pair index_failure_total + index_success_total per amended spec; rate computed downstream), MetricsEmitter Protocol + Noop/InMemory emitters, 5 emit helpers (modality attr optional, utilization clamps to [0, 1])

Architect-mandated fix-forward (already applied in commits 859f899 + f9fffe6 + 4a28d27):

Bug 1 — LocalObjectStore import alias (class is named Local)
Bug 2 — table renamed to document_index_v2 (Wave 3 alembic renames back to canonical per task feat: socket debug #14 acceptance amendment)
NEW per §H.2 — tenant_scope_key VARCHAR(64) NOT NULL column + idx_document_index_v2_tenant_scope index, locked into Wave 1 schema for T2.2 quota lane
T1.1 test follow-up (commit f9fffe6) — same import alias inside the test file; removed bogus local_mod.settings monkey-patch (the module never had that attr; Local(LocalConfig) is the supported construction path)
T1.1 audit follow-up (commit 4a28d27) — narrow WAVE_1_2_TEMPORARY_DUP_ALLOWLIST entry in tests/unit_test/test_phase3_reexport_audit.py for the transitional DocumentIndex duplicate (Wave 3 task feat: socket debug #14 removes the entry in the same PR that deletes the legacy file)

T1 simulator placeholders — sha256-derived embeddings, first-paragraph summarizer, JSON image-records vision input. Production T2.x swaps placeholders for real embedding/LLM/vision services without changing the §D.1 sync contract.

Design pack canonical amendments (already amended in PR #1725 head `a0a47994`)

The following design-pack amendments are canonical in PR #1725 head a0a47994 and are reflected in this PR's implementation. PR #1726 does not duplicate them in source; CR should validate against PR #1725 head a0a47994.

§D.3.2 step 1b lineage filter — amended from (document_id, parse_version) exact-match to (document_id) document_id-only. Single sync(doc, parse_version, kg) call must self-contain the supersede semantic (§D.3.6 narrative is canonical). Bryce's T1.2 implementation in graph.py (commit 7c9fe06) follows the amended contract.
§J.1 index_failure_rate → counter pair amendment — index_failure_total + index_success_total counter pair is canonical (rate computed downstream from counter pair). OTLP-idiomatic: preserves raw events, cleanly re-aggregable across workers, no worker-side sliding-window state. T1.5 implementation already aligned.
SET-element-level tenant_scope_key — Bryce's T1.2 attaches tenant_scope_key at lineage-member granularity (not row-level). Architect ack msg=80c5dc06: a single entity may be referenced by docs from different orgs in a future multi-tenant deployment; member-level scope preserves accurate quota / isolation boundaries.

Branch coordination

Wave 1 fully assembled on chenyexuan/celery-t1.1-foundation:

T1.1 + T1.3/T1.4/T1.5 + observability + fix-forward by chenyexuan
T1.2 graph by Bryce
T1.1 test leftover fixes by chenyexuan
T1.1 audit allowlist by chenyexuan

Ready for @huangheng final no-blocker verdict.

Test plan

T1.1 partial unique invariant: blocks 2nd serving row, allows non-serving + cross-modality (12 tests)
T1.1 atomic write tmp+rename + concurrent-write no-clobber + path traversal rejection
T1.1 parser → 3 canonical artifacts + deterministic parse_version + chunks.jsonl round-trip + outline section_path
T1.2 graph §D.3.6 5-step idempotency + relation symmetry + §D.4 byte-equivalent re-sync + Nebula race-with-vs-without-lock + tenant_scope_key propagation + kg.jsonl round-trip + e2e parser integration (16 tests)
T1.3 vector/fulltext idempotency (double sync byte-equivalent), new parse_version doesn't corrupt old slot, hybrid chunk_id parity
T1.4 summary/vision derive persists canonical artifact (cost preserved across retries) + sync idempotency + missing-artifact no-op (§C.7)
T1.5 emit shape contract for each helper + metric name prefix lock + utilization clamp/zero-capacity
Phase 3 audit test_phase3_classes_have_single_definition_site allowlist verified — only the narrow DocumentIndex transitional pair is exempt; any third duplicate still flags
huangheng final no-blocker verdict + CI 4/4

🤖 Generated with Claude Code

@bryce

…+ parser Phase celery T1.1 per docs/modularization/indexing-redesign-design-pack.md (PR #1725 v3 head 5d7a60f). Foundation lane that the other Wave 1 modality lanes (T1.2 graph @bryce / T1.3 vector+fulltext / T1.4 summary+vision / T1.5 observability) depend on. What this adds (~600 LOC + tests): - alembic migration `f9c4d2a8e1b5_indexing_redesign_document_index.py` creates the `document_index` table with the §F.1 partial unique index `uniq_document_index_serving` enforcing "at most one is_serving=TRUE row per (document_id, modality)" at the DB layer (Bryce v2 review msg=7ccb176f #2). Postgres native; SQLite ≥3.8 supports the same syntax (Tier 1 §L deploy stays consistent). - `aperag/indexing/models.py` — DocumentIndex SQLAlchemy ORM mirroring the alembic schema, plus `Modality` (5 values per §C/§D) and `IndexStatus` (4 lifecycle states per §F.2) string-enums. - `aperag/indexing/base.py` — `ModalityWorker` ABC with `derive` + `sync` so per-modality workers (T1.2-T1.5) inherit the §D.1 "DELETE-by-(doc, parse_version) THEN INSERT" replace-idempotent contract. Graph reinterprets DELETE as the §D.3 lineage-level DELETE+INSERT internally; the ABC accepts that variation. - `aperag/indexing/object_store.py` — Atomic write helpers that wrap the existing `aperag.objectstore` package: LocalFS uses tmp+fsync+rename per §C.7; S3/MinIO relies on the single-request PutObject (or multipart CompleteMultipartUpload) visibility gate. Includes `read_or_none` for the §C.7 read contract and an `InMemoryObjectStore` test fixture so downstream T1.x can wire unit tests without touching disk. - `aperag/indexing/parser.py` — Deterministic parser entry point that produces the three shared artifacts (`markdown.md` / `outline.json` / `chunks.jsonl`) under `derived/parse_<v>/` per §C.1. parse_version is computed via the canonical D10.g §E.2 helper (`compute_parse_version`) so a chunking change rolls the version. T1.1 ships an in-process simulator implementation that proves the write contract; production parser integration (docparser/Marker/OCR) swaps in at T2.x without changing the artifact shape. - `aperag/migration/env.py` — Register `aperag.indexing.models` for alembic autogen (the new module deliberately lives outside the per-domain `db/models.py` tree because the `aperag.domains.indexing` surface is the Wave 3 hard-delete target). Tests cover the three Wave 1 acceptance gates locked by the design pack §K: 1. Partial unique invariant: an INSERT of a second is_serving=TRUE row for the same (document_id, modality) raises IntegrityError; non-serving rows + a different modality on the same document are allowed; the §F.3 three-statement cutover transaction satisfies the constraint. 2. Object-store atomic write: `_LocalAtomicWriter` produces a final destination file with no `.tmp.*` siblings remaining; concurrent writers to different artifacts in the same parse_version directory each land their own bytes; `InMemoryObjectStore` matches the single-call atomicity semantic. 3. Parser → derived/ round-trip: parse_document writes the three canonical artifacts; identical inputs yield identical parse_version + identical artifact contents (§C.3 idempotent retry); a chunking config change rolls parse_version (§E.2 hash); chunks.jsonl round-trips via `read_chunks`; outline.json carries slash-separated `section_path` + slug `heading_anchor` per the D10.c §A.9 R1 lock; re-running parse_document overwrites atomically; missing/empty artifacts are treated as "derive not yet complete" (§C.7 read contract); path traversal is rejected. Out of scope for this lane (per §K decomposition): - T1.2 graph modality + §D.3 lineage model (@bryce, task #7) - T1.3 vector + fulltext modalities (chenyexuan, task #8) - T1.4 summary + vision modalities (chenyexuan, task #9) - T1.5 observability OTLP emit (chenyexuan, task #10) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…dation fix-forward Bundles the architect-mandated fix-forward (msg=4a801b2b, msg=c3b0ba5b, msg=07b5b1e6) on top of T1.1 foundation: T1.1 fix-forward (per architect rulings on PR #1726 P0 bugs): - Bug1: object_store.py LocalObjectStore import alias (class is named Local) - Bug2: tablename → document_index_v2 + index renames (Wave 3 renames back to canonical via alembic per task #14 acceptance amendment) - NEW (§H.2): tenant_scope_key VARCHAR(64) NOT NULL column + idx_document_index_v2_tenant_scope index — locked into Wave 1 schema for T2.2 quota lane to consume without migration churn - Tests: fixture default tenant_scope_key="user:test" T1.3 Vector + Fulltext (§D.1 replace-idempotent contract): - VectorBackend / InMemoryVectorBackend Protocol + Qdrant-shaped delete_by_filter + upsert_point - VectorModality: derive no-op pass-through, sync DELETE-by-(doc, parse_v) THEN per-chunk upsert with placeholder embedding (sha256-derived 16-dim) - FulltextBackend / InMemoryFulltextBackend + delete_by_query + bulk_index - FulltextModality: shares chunks.jsonl with vector (§C.6); chunk_id parity preserved for hybrid dedup at search layer - Tests: replace-idempotent on double sync, new parse_version doesn't corrupt old slot, hybrid chunk_id parity, modality discriminator on payload, missing-artifact no-op (§C.7 reschedule semantic) T1.4 Summary + Vision (§C.6 + §D.2 expensive-derive split): - SummaryModality: derive reads parser markdown.md, runs placeholder summarizer (first non-heading paragraph), embeds, writes summary.json atomically; sync deletes by filter + upserts single point keyed summary:{document_id}:{parse_version} - VisionModality: derive reads synthetic image-records JSON (T1 simulator contract — T2.x replaces with real PDF extract + vision-LLM), writes vision/manifest.jsonl, sync upserts one point per image keyed vision:{document_id}:{parse_version}:{image_id} - Both backends use Qdrant-shaped Protocol + InMemory test fixtures - Tests: derive persists canonical artifact (cost preserved across retries), idempotent on double sync, new parse_version doesn't corrupt old slot, modality discriminator, missing-artifact no-op T1.5 Observability primitives (§J.1 SLI emission): - 5 metric name constants prefixed indexing.* — index_lag_seconds / index_failure_rate / index_success / queue_depth / worker_utilization - MetricsEmitter Protocol + NoopMetricsEmitter (Tier 1 deploy without OTLP) + InMemoryMetricsEmitter (test fixture) - emit_index_lag / emit_index_failure / emit_index_success / emit_queue_depth / emit_worker_utilization helpers — modality attribute optional, utilization clamps to [0, 1] and handles capacity=0 - Tests: emission shape contract for each helper + metric name prefix lock Per architect msg=f21a79f0 + PM msg=07b5b1e6 + msg=95012fdb: this commit accumulates onto the Wave 1 mega-PR #1726. Bryce will rebase + push T1.2 graph commit on the same branch; once T1.2 lands, the full Wave 1 PR flips to in_review for huangheng's step 0+ / step 0 / step 0''' / cross-lane caller sweep CR + verdict. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…edis lock + tenant_scope_key Per docs/modularization/indexing-redesign-design-pack.md §D.3 + architect msg=cc555e33 / msg=f2921ae0 / msg=c3b0ba5b lineage rulings. The graph modality is the only one whose backend rows are shared across documents, so the simple DELETE-by-(doc, parse_version) + re-INSERT pattern that vector/fulltext/summary/vision use breaks the shared-entity model — Bryce v2 review msg=7ccb176f #3 surfaced this. T1.2 lands the lineage-tracked entity / relation rows + the §D.3.2 two-phase sync algorithm + the per-entity Redis lock that protects Nebula's read-modify-write window from racing. Surface (aperag/indexing/graph.py, ~1017 lines): * Lineage data model — LineageMember(document_id, parse_version, tenant_scope_key, chunk_ids), DescriptionPart, EntityRecord, RelationRecord, EntityWithLineage, RelationWithLineage. The tenant_scope_key lives at SET-element level (not entity row level) per architect msg=c3b0ba5b — placement chosen so a shared entity cited by multiple tenants still has one row but each lineage member carries its own tenant attribution for read-path ACL filtering (§H.2 quota / organization key). * Per-entity lock — EntityLock Protocol + InMemoryEntityLock (asyncio.Lock per key, single-process default) + RedisEntityLock (Redis SETNX-style lock keyed by f"{prefix}:{slot}" where slot is crc32(entity_id) % 4096 to bound the key space). The lock is mandatory on the Nebula path because Nebula 3.x's list ops require application-layer read-modify-write; concurrent sync calls touching the same shared entity would otherwise lose lineage members at the network round-trip. * LineageGraphStore Protocol — backend abstraction so the §D.3.2 algorithm in GraphModalityWorker is portable across the three backends listed in §D.3.5 (Nebula 3.x, Neo4j with native list + APOC, in-memory). Methods filter lineage by document_id (not exact (doc, parse_version)) so re-parsing supersedes the old parse_version cleanly per §D.3.6 step 3 (see §D.3.2 amendment note in docstring). * InMemoryLineageGraphStore — Python-dict reference implementation used as the canonical correctness oracle. The §D.3.6 self-test + concurrent race test run against it; any future Nebula / Neo4j adapter that satisfies the Protocol inherits pass/fail status from this suite. * GraphExtractor — Callable injection point so the production LightRAG-based LLM extractor (Wave 2 wiring) and the test stubs share the same surface. The kg.jsonl artifact persists the extractor output so retries never re-charge LLM cost (§C.6 CANONICAL artifact contract). * serialize_kg_jsonl / parse_kg_jsonl — line-oriented JSON with a "kind" discriminator so future record types can be added without breaking the file format. Empty payloads encode as b"\\n" (one byte) so the §C.7 "empty body == derive not finished" sentinel cannot collide with a deliberate "no records produced" payload — the deletion flow (§D.3.6 step 4 / step 5) publishes b"\\n" to cleanly clear a document's lineage contribution. * GraphModalityWorker — implements ModalityWorker ABC. derive() reads chunks.jsonl produced by the T1.1 parser, calls the injected extractor, writes kg.jsonl atomically. sync() reads kg.jsonl and applies the §D.3.2 algorithm: Phase 1 (cleanup): for every entity / relation in the lineage backend whose lineage SET contains the document_id, remove ALL members for that document_id (across any parse_version) and GC rows that go orphan. Per-entity lock held during cleanup so concurrent syncs cannot race the read-modify-write. Phase 2 (rebuild): for every entity / relation in kg.jsonl, upsert with a fresh LineageMember stamped with the current (document_id, parse_version, tenant_scope_key). Per-entity lock held during rebuild for the same reason. Tests (tests/unit_test/indexing/test_t1_2_graph.py, ~1044 lines, 16 test cases): * §D.3.6 five-step idempotency suite (test_d3_6_step1..step5): doc_A v1 → doc_B v1 → doc_A v2 supersedes → delete doc_A → delete doc_B with full entity GC. Pin the §D.3.2 algorithm against the §D.3.6 narrative; the regression Bryce surfaced in msg=7ccb176f #3 fails this suite under the v1 append-on-conflict path. * Relation-lineage symmetric coverage (test_relation_lineage_doc_a_then_doc_b_then_delete_doc_a) — proves the same lineage semantics flow through relation evidence_lineage. * §D.4 byte-equivalent re-sync (test_d4_byte_equivalent_resync) — double-sync with the same artifact leaves the backend identical. The cross-modality idempotency contract every modality must pass. * Nebula race condition under per-entity lock (test_nebula_race_under_per_entity_lock_preserves_both_writes + test_nebula_race_without_lock_loses_a_writer) — uses a _RaceProvocateurStore that emulates Nebula's read-modify-write network round-trip with a deterministic asyncio yield. Under the in-memory entity lock, both writers' lineage members end up in the SET (PASS); under a no-op lock, deterministically one writer loses (the negative control). Pins the architect msg=f2921ae0 invariant that per-entity serialization is mandatory not optional. * tenant_scope_key propagation (test_tenant_scope_key_propagates_into_lineage_members) — pins the SET-element placement decision per architect msg=c3b0ba5b so a future regression that drops the field or moves it to entity row level fails loudly. * kg.jsonl round-trip (test_kg_jsonl_*) — entity / relation serialization, forward-compatible kind skipping, empty-body encoding (always at least b"\\n" to distinguish "no records" from "derive crashed"). * derive contract (test_derive_writes_kg_jsonl_under_canonical_path + test_sync_with_missing_artifact_is_a_noop) — derive writes to derived/parse_<v>/kg.jsonl per §C.6 canonical layout; sync no-ops when the artifact is missing per §C.7 read contract. * End-to-end with the real T1.1 parser (test_end_to_end_with_real_parser_chunks) — proves graph.derive cooperates with the chunks.jsonl shape that aperag.indexing.parser.parse_document actually produces, no hidden coupling beyond the §C.6 contract. aperag/indexing/__init__.py — re-export the T1.2 graph public surface alongside chenyexuan's T1.3/T1.4/T1.5 modality exports so the indexing package surface is uniform across all five modalities. Lint: ruff check + ruff format --check both clean across aperag/indexing/ and tests/unit_test/indexing/. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…patch Three follow-up bugs surfaced by Bryce's local pytest collection (msg=464d5b70) on top of fix-forward 859f899: 1. test_t1_1_foundation.py:69 — ImportError. The class is named `Local` (not `LocalObjectStore`); my main-tree object_store.py already uses the `Local as LocalObjectStore` alias per architect ruling on Bug 1 (msg=4a801b2b), but the test file's own import line still referenced the wrong name. Fixed: same alias pattern. 2. test_local_atomic_write_uses_tmp_rename_dance — AttributeError on `aperag.objectstore.local.settings`. The module never had a `settings` attribute; the monkey-patch block was unfounded. `Local(LocalConfig)` accepts the config struct directly, no module-level singleton dependency. Replaced the monkey-patch block with a direct `LocalObjectStore(LocalConfig(root_dir=...))` construction. 3. test_concurrent_atomic_writes_dont_clobber_each_other — same AttributeError pattern, same fix. No production code changed; only the T1.1 test fixture is simplified (net -30 lines). Lint clean. Wave 1 PR #1726 ready to flip to in_review after this push. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Bryce v3 implementation review (msg=464d5b70) caught a spec bug in §D.3.2 step 1b: the pseudocode used `(document_id, parse_version)` exact-match for lineage filter, which contradicts §D.3.6 narrative step 3 ("doc_A v2 写入（覆盖 doc_A 旧 lineage）"). Strict exact-match would leave lineage[A,v_old] + lineage[A,v_new] coexisting after a re-parse, violating the expected supersede semantic. Architect ruling (msg=80c5dc06) is to amend §D.3.2 step 1b to filter by `document_id` only (not parse_version). This makes sync(doc, v_new) self-contained for supersede; orchestrator does not need to do explicit clear-then-sync. §D.3.6 narrative remains canonical. PR #1726 Wave 1 graph implementation follows the corrected algorithm. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ex duplicate huangheng's Wave 1 CR (msg=8e67bf0e) flagged P1: test_phase3_classes_have_single_definition_site fails because Wave 1 introduces aperag/indexing/models.py:DocumentIndex alongside the legacy aperag/domains/indexing/db/models.py:DocumentIndex still owned by Celery. Architect ruling msg=5be9a535 — option (a): add a narrow allowlist covering exactly this (class_name, file_pair); reject (b) renaming the new ORM class (would force a Wave 1 + Wave 3 double-rename, conflicts with msg=4a801b2b "Python class name stays canonical, only __tablename__ differs") and (c) xfail (does not express the "intentional transitional state" semantic). Wave 3 task #14 acceptance per architect amendment will remove this allowlist entry in the same PR that deletes the legacy file + alembic renames document_index_v2 back to document_index. The allowlist entry is narrow: - Only the exact frozenset of two file paths is accepted; any third duplicate (Wave 4+ regression, e.g. a copy-paste of the class) still flags as an offender. - Only matches `class DocumentIndex(Base):` definitions (the existing orm_pattern); pydantic schemas with the same name are unaffected. - No global widening — no other Phase 3 class gets an exception. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ite proposal (#1725) * docs(indexing): add indexing redesign design pack — first-principles rewrite proposal Per earayu2 directive (#celery msg=56812dd6 + msg=d8080c08): full redesign of the document indexing system, prioritizing simplicity and reliability over feature breadth, targeting 100 concurrent docs, with hard-cut authorization (pre-launch / no users / no migration). Design pack contents (1049 lines, 11 sections): - §A — Current system analysis with file:line evidence (3-layer ownership skew, Python lease thread tied to worker process, graph index NOT replace-idempotent per nebula.py:354 upsert_entities, ~995 lines in tasks.py mixing infra + business) - §B — First principles (single SoT in DB, idempotent convergence, source/ derived/index three-layer separation, concurrency bounded by external capacity, simple > complex) - §C — Three-layer document model (collections/<id>/documents/<id>/source/ + derived/parse_<v>/{markdown.md, chunks.jsonl, kg.jsonl, summary.json, vision/} + backend index stores) - §D — Idempotency contract per modality (DELETE-by-(document_id, parse_version) before INSERT for all 5 modalities; fixes graph index append bug) - §E — Concurrency model decision matrix (HTTP-only / lightweight task / Celery refactor); recommends lightweight Redis-backed asyncio worker pool per modality (5 worker processes, ~80-line reconciler, no Celery / no chord / no Python lease thread) - §F — State machine + atomic flip (4 status values vs current 6; document.active_parse_version + pending_parse_version with transactional flip; deletion via async cleanup worker) - §G — Multi-modal unified pipeline (Modality ABC with derive() + sync() contract; collapses earayu2's "Celery task 绕来绕去最后又绕回 graph index" complaint into 2 functions in 1 file) - §H — Multi-tenant isolation (recommend simple — required tenant context + bulkheads, defer fairness machinery until observability shows real noisy-neighbor signal) - §I — Failure recovery (3 modes: worker crash, transient backend, permanent failure; exponential backoff retry; Redis token bucket for LLM rate-limit backpressure) - §J — Observability (4 SLI: index_lag_seconds, index_failure_rate, queue_depth, worker_utilization; OTLP wire; aligns with PR #1702) - §K — Migration plan (7 PRs: observability primitives → idempotent indexers → object store layout → worker pool → atomic flip → cutover → availability discriminator; feature-flagged dual-stack during PR-D/E; cutover deletes ~3000 lines of Celery infrastructure) Net delta: roughly +4150 / -4850 lines across 7 PRs — net subtraction despite adding functionality. Indexing layer drops from ~2500 lines to ~1500. Three open decisions deferred to earayu2: 1. Concurrency model: lightweight Redis-asyncio (recommended), HTTP-only, or Celery refactor 2. Atomic flip contract: all-modalities-ACTIVE-required (recommended) vs per-modality independent 3. PR sequence: 7-PR cut (recommended) vs combined Sibling reference: Bryce msg=791082a4 + msg=38fbf962 first-principles analysis + architect msg=19f283d5 + msg=2ee66c89 4-blind-spot synthesis. This design pack is the single canonical deliverable per earayu2's owner directive (@符炫炜 sole author of the final design). * docs(indexing): redesign pack v2 — incorporate earayu2 拍板 + 答 derived/MinIO Driven by earayu2 msg=cc0a00d7 + PM consolidation msg=32463d64. - Drop Celery → lock Redis + asyncio (§E, decision matrix removed) - Drop atomic flip → per-modality independent is_serving cutover (§F) Accept short eventual-consistency window per earayu2 directive. - Answer derived/parse_<v>/ contents per modality (§C.6) — chunks.jsonl shared by vector+fulltext, kg.jsonl, summary.json, vision/manifest.jsonl + images/, markdown.md + outline.json. - Answer MinIO/object-store suitability (§C.7) — ~150 MB / 100-doc burst, trivial; LocalFS / MinIO / S3-compatible all work; small-file + LIST caveats addressed. - Add §L private/on-premise "deploy-and-forget": Tier 1 inline (SQLite + LocalFS, ~10 docs/hour), Tier 2 docker-compose (~100 concurrent), Tier 3 horizontal scale-out — same code. - §H tenant_scope_key forward-compat hook for future organization concept; simple Redis token-bucket quota that won't lock future fairness. - §K restructure: 7 PRs → 3 waves (Foundation / Runtime / Cutover) with per-wave parallel-writability map. - §G.5 SearchResultMetadata extends w/ parse_version + index_state_per_modality (becomes structurally required under per-modality independent flip). PR #1725 v2; awaiting earayu2 final 拍板 on Wave 1 kickoff. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(indexing): v2 amendment — Bryce review deltas (chunking trade-off / serving invariant / graph entity lineage) Bryce v2 review msg=7ccb176f surfaced 3 substantive technical deltas all agreed-as-must-address by PM msg=fc307bbf. Folded into v2: §C.6 — chunks.jsonl shared-by-vector-and-fulltext is now framed as conscious trade-off (vector wants larger chunks, fulltext wants smaller) with explicit shadow-file extension hook (chunks.fulltext.jsonl + namespaced sub-IDs) preserved so future split is unblocked. §F.1 — partial unique index added at the schema layer: CREATE UNIQUE INDEX uniq_document_index_serving ON document_index (document_id, modality) WHERE is_serving = TRUE; This makes the "at most one serving row per (doc, modality)" invariant DB- enforced, not orchestrator-enforced. SQLite 3.8+ supports the same syntax (Tier 1 deploy stays consistent). §D.3 — graph entity lineage model rewritten. Cross-document shared entities ("Linus" mentioned in 100 docs) cannot be cleared by simple DELETE-by-doc without losing other docs' contributions. New model: - source_lineage: SET<{document_id, parse_version, chunk_ids[]}> - description_parts: SET<{document_id, parse_version, text}> - sync = lineage-level DELETE+INSERT, entity GC when lineage empty Includes per-entity serialization invariant for Nebula (read-modify-write without native list ops). 5-step idempotency self-test extension specified. PR #1725 v2; ready for earayu2 / Bryce final ack. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(indexing): §D.3.2 amendment — lineage cleanup by document_id only Bryce v3 implementation review (msg=464d5b70) caught a spec bug in §D.3.2 step 1b: the pseudocode used `(document_id, parse_version)` exact-match for lineage filter, which contradicts §D.3.6 narrative step 3 ("doc_A v2 写入（覆盖 doc_A 旧 lineage）"). Strict exact-match would leave lineage[A,v_old] + lineage[A,v_new] coexisting after a re-parse, violating the expected supersede semantic. Architect ruling (msg=80c5dc06) is to amend §D.3.2 step 1b to filter by `document_id` only (not parse_version). This makes sync(doc, v_new) self-contained for supersede; orchestrator does not need to do explicit clear-then-sync. §D.3.6 narrative remains canonical. PR #1726 Wave 1 graph implementation follows the corrected algorithm. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(indexing): §J.1 amendment — failure_total + success_total counter pair (was single rate gauge) huangheng Wave 1 CR (msg=8e67bf0e) flagged §J.1 spec drift: T1.5 implementation emits index_failure_total + index_success_total counter pair, not the single index_failure_rate gauge spec called for. Architect ruling: amend spec to match implementation. Counter pair is OTLP- idiomatic, preserves raw events, re-aggregates across workers without sliding-window state, and the rate is trivially computable downstream. §J.1 spec amended; §K Wave 1 acceptance bullet updated. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(indexing): §F.1 + §F.5 amendments — collection_id/source_path/tenant_scope_key columns + cleanup Path C Three Wave 3 amendments folded in to unblock chenyexuan T3.1 (msg=afe345a9): §F.1 schema: - Add collection_id VARCHAR(64) NOT NULL (denormalized from document.collection_id, populated by orchestrator at INSERT for self-contained dispatch payload — per huangheng Wave 2 CR finding msg=c94b57fe + architect ruling msg=498b12f0). - Add source_path TEXT NOT NULL (pointer to source/ artifact, worker derive reads directly without JOIN). - Add tenant_scope_key VARCHAR(64) NOT NULL (forward-compat for future organization concept per §H.2; required key for §H.5 quota bucket). Was implicit-but-not-listed in the schema before; now explicit. - Add idx_document_index_collection + idx_document_index_tenant_scope indexes for cleanup / quota scans. §F.5 cleanup worker: - Restructure to three paths (A/B/C) with explicit semantics. - Path A: orphan parse_version GC (existing); now notes graph backend no-op via §D.3 lineage supersede + graph_noop counter for telemetry. - Path B: single-document deletion cascade — explicit graph dispatch via remove_entity_lineage_member(document_id) per §D.3 amended canonical (by document_id only). - Path C (NEW): collection deletion cascade — Collection.deleted_at scan + Path B per child document + final Collection row + storage tree cleanup. Replaces legacy Celery collection_delete task with state-driven recovery (no asyncio.create_task() durability gap). These amendments unblock T3.1 commit 1+ since chenyexuan needs the spec head to reference for the audit allowlist removal and the caller-migration patterns. Wave 3 task #14 acceptance criteria (per PM msg=5939e394) now references this head. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: 符炫炜 <fuxuanwei@apecloud.io> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

earayu and others added 2 commits April 26, 2026 22:43

earayu changed the title ~~feat(celery T1.1): Foundation — schema + Modality ABC + object_store + parser~~ feat(celery Wave 1): Foundation + 5 modalities + observability Apr 26, 2026

earayu and others added 2 commits April 26, 2026 23:40

earayu merged commit 0348a78 into main Apr 26, 2026
4 checks passed

earayu deleted the chenyexuan/celery-t1.1-foundation branch April 26, 2026 16:09

earayu mentioned this pull request Apr 27, 2026

feat(celery Wave 3): T3.1 hard-cut legacy Celery indexing layer #1729

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(celery Wave 1): Foundation + 5 modalities + observability#1726

feat(celery Wave 1): Foundation + 5 modalities + observability#1726
earayu merged 5 commits into
mainfrom
chenyexuan/celery-t1.1-foundation

earayu commented Apr 26, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

earayu commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Design pack canonical amendments (already amended in PR #1725 head a0a47994)

Branch coordination

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

earayu commented Apr 26, 2026 •

edited

Loading

Design pack canonical amendments (already amended in PR #1725 head `a0a47994`)