Skip to content

feat(celery Wave 1): Foundation + 5 modalities + observability#1726

Merged
earayu merged 5 commits into
mainfrom
chenyexuan/celery-t1.1-foundation
Apr 26, 2026
Merged

feat(celery Wave 1): Foundation + 5 modalities + observability#1726
earayu merged 5 commits into
mainfrom
chenyexuan/celery-t1.1-foundation

Conversation

@earayu
Copy link
Copy Markdown
Collaborator

@earayu earayu commented Apr 26, 2026

Summary

Wave 1 mega-PR per architect lock (msg=f21a79f0): foundation + all 5 modality workers + observability primitives, accumulated on a single branch for one-shot CR + verdict.

Scope (per §K Wave 1 acceptance):

  • T1.1 Foundationdocument_index_v2 schema (+ tenant_scope_key per §H.2), Modality ABC, atomic object-store write/read, deterministic parser → derived/parse_<v>/{markdown.md,outline.json,chunks.jsonl}
  • T1.2 Graph modality (Bryce, commit 7c9fe06) — aperag/indexing/graph.py §D.3 lineage model + per-entity Redis lock + Nebula/Neo4j dual path + SET-element-level tenant_scope_key propagation; 16 contract tests (test_t1_2_graph.py)
  • T1.3 Vector + Fulltext modalities — Qdrant-shaped Protocol + InMemory backends, derive no-op pass-through (shared chunks.jsonl per §C.6), sync DELETE-by-(doc, parse_version) THEN INSERT (§D.1), shared chunk_id for hybrid dedup
  • T1.4 Summary + Vision modalities — expensive-derive split (§C.6 + §D.2): summary writes summary.json, vision writes vision/manifest.jsonl + per-image points, both backends Qdrant-shaped, both keyed for §D.1 contract
  • T1.5 Observability primitives — 5 SLI metric constants prefixed indexing.* per §J.1 (counter pair index_failure_total + index_success_total per amended spec; rate computed downstream), MetricsEmitter Protocol + Noop/InMemory emitters, 5 emit helpers (modality attr optional, utilization clamps to [0, 1])

Architect-mandated fix-forward (already applied in commits 859f899 + f9fffe6 + 4a28d27):

  1. Bug 1 — LocalObjectStore import alias (class is named Local)
  2. Bug 2 — table renamed to document_index_v2 (Wave 3 alembic renames back to canonical per task feat: socket debug #14 acceptance amendment)
  3. NEW per §H.2 — tenant_scope_key VARCHAR(64) NOT NULL column + idx_document_index_v2_tenant_scope index, locked into Wave 1 schema for T2.2 quota lane
  4. T1.1 test follow-up (commit f9fffe6) — same import alias inside the test file; removed bogus local_mod.settings monkey-patch (the module never had that attr; Local(LocalConfig) is the supported construction path)
  5. T1.1 audit follow-up (commit 4a28d27) — narrow WAVE_1_2_TEMPORARY_DUP_ALLOWLIST entry in tests/unit_test/test_phase3_reexport_audit.py for the transitional DocumentIndex duplicate (Wave 3 task feat: socket debug #14 removes the entry in the same PR that deletes the legacy file)

T1 simulator placeholders — sha256-derived embeddings, first-paragraph summarizer, JSON image-records vision input. Production T2.x swaps placeholders for real embedding/LLM/vision services without changing the §D.1 sync contract.

Design pack canonical amendments (already amended in PR #1725 head a0a47994)

The following design-pack amendments are canonical in PR #1725 head a0a47994 and are reflected in this PR's implementation. PR #1726 does not duplicate them in source; CR should validate against PR #1725 head a0a47994.

  • §D.3.2 step 1b lineage filter — amended from (document_id, parse_version) exact-match to (document_id) document_id-only. Single sync(doc, parse_version, kg) call must self-contain the supersede semantic (§D.3.6 narrative is canonical). Bryce's T1.2 implementation in graph.py (commit 7c9fe06) follows the amended contract.
  • §J.1 index_failure_rate → counter pair amendmentindex_failure_total + index_success_total counter pair is canonical (rate computed downstream from counter pair). OTLP-idiomatic: preserves raw events, cleanly re-aggregable across workers, no worker-side sliding-window state. T1.5 implementation already aligned.
  • SET-element-level tenant_scope_key — Bryce's T1.2 attaches tenant_scope_key at lineage-member granularity (not row-level). Architect ack msg=80c5dc06: a single entity may be referenced by docs from different orgs in a future multi-tenant deployment; member-level scope preserves accurate quota / isolation boundaries.

Branch coordination

Wave 1 fully assembled on chenyexuan/celery-t1.1-foundation:

  • T1.1 + T1.3/T1.4/T1.5 + observability + fix-forward by chenyexuan
  • T1.2 graph by Bryce
  • T1.1 test leftover fixes by chenyexuan
  • T1.1 audit allowlist by chenyexuan

Ready for @huangheng final no-blocker verdict.

Test plan

  • T1.1 partial unique invariant: blocks 2nd serving row, allows non-serving + cross-modality (12 tests)
  • T1.1 atomic write tmp+rename + concurrent-write no-clobber + path traversal rejection
  • T1.1 parser → 3 canonical artifacts + deterministic parse_version + chunks.jsonl round-trip + outline section_path
  • T1.2 graph §D.3.6 5-step idempotency + relation symmetry + §D.4 byte-equivalent re-sync + Nebula race-with-vs-without-lock + tenant_scope_key propagation + kg.jsonl round-trip + e2e parser integration (16 tests)
  • T1.3 vector/fulltext idempotency (double sync byte-equivalent), new parse_version doesn't corrupt old slot, hybrid chunk_id parity
  • T1.4 summary/vision derive persists canonical artifact (cost preserved across retries) + sync idempotency + missing-artifact no-op (§C.7)
  • T1.5 emit shape contract for each helper + metric name prefix lock + utilization clamp/zero-capacity
  • Phase 3 audit test_phase3_classes_have_single_definition_site allowlist verified — only the narrow DocumentIndex transitional pair is exempt; any third duplicate still flags
  • huangheng final no-blocker verdict + CI 4/4

🤖 Generated with Claude Code

earayu and others added 2 commits April 26, 2026 22:43
…+ parser

Phase celery T1.1 per docs/modularization/indexing-redesign-design-pack.md
(PR #1725 v3 head 5d7a60f). Foundation lane that the other Wave 1
modality lanes (T1.2 graph @bryce / T1.3 vector+fulltext / T1.4
summary+vision / T1.5 observability) depend on.

What this adds (~600 LOC + tests):

- alembic migration `f9c4d2a8e1b5_indexing_redesign_document_index.py`
  creates the `document_index` table with the §F.1 partial unique
  index `uniq_document_index_serving` enforcing "at most one
  is_serving=TRUE row per (document_id, modality)" at the DB layer
  (Bryce v2 review msg=7ccb176f #2). Postgres native; SQLite ≥3.8
  supports the same syntax (Tier 1 §L deploy stays consistent).

- `aperag/indexing/models.py` — DocumentIndex SQLAlchemy ORM mirroring
  the alembic schema, plus `Modality` (5 values per §C/§D) and
  `IndexStatus` (4 lifecycle states per §F.2) string-enums.

- `aperag/indexing/base.py` — `ModalityWorker` ABC with `derive` +
  `sync` so per-modality workers (T1.2-T1.5) inherit the §D.1
  "DELETE-by-(doc, parse_version) THEN INSERT" replace-idempotent
  contract. Graph reinterprets DELETE as the §D.3 lineage-level
  DELETE+INSERT internally; the ABC accepts that variation.

- `aperag/indexing/object_store.py` — Atomic write helpers that wrap
  the existing `aperag.objectstore` package: LocalFS uses
  tmp+fsync+rename per §C.7; S3/MinIO relies on the single-request
  PutObject (or multipart CompleteMultipartUpload) visibility gate.
  Includes `read_or_none` for the §C.7 read contract and an
  `InMemoryObjectStore` test fixture so downstream T1.x can wire
  unit tests without touching disk.

- `aperag/indexing/parser.py` — Deterministic parser entry point that
  produces the three shared artifacts (`markdown.md` / `outline.json`
  / `chunks.jsonl`) under `derived/parse_<v>/` per §C.1. parse_version
  is computed via the canonical D10.g §E.2 helper
  (`compute_parse_version`) so a chunking change rolls the version.
  T1.1 ships an in-process simulator implementation that proves the
  write contract; production parser integration (docparser/Marker/OCR)
  swaps in at T2.x without changing the artifact shape.

- `aperag/migration/env.py` — Register `aperag.indexing.models` for
  alembic autogen (the new module deliberately lives outside the
  per-domain `db/models.py` tree because the `aperag.domains.indexing`
  surface is the Wave 3 hard-delete target).

Tests cover the three Wave 1 acceptance gates locked by the design
pack §K:

1. Partial unique invariant: an INSERT of a second is_serving=TRUE
   row for the same (document_id, modality) raises IntegrityError;
   non-serving rows + a different modality on the same document are
   allowed; the §F.3 three-statement cutover transaction satisfies
   the constraint.

2. Object-store atomic write: `_LocalAtomicWriter` produces a final
   destination file with no `.tmp.*` siblings remaining; concurrent
   writers to different artifacts in the same parse_version directory
   each land their own bytes; `InMemoryObjectStore` matches the
   single-call atomicity semantic.

3. Parser → derived/ round-trip: parse_document writes the three
   canonical artifacts; identical inputs yield identical
   parse_version + identical artifact contents (§C.3 idempotent
   retry); a chunking config change rolls parse_version (§E.2 hash);
   chunks.jsonl round-trips via `read_chunks`; outline.json carries
   slash-separated `section_path` + slug `heading_anchor` per the
   D10.c §A.9 R1 lock; re-running parse_document overwrites
   atomically; missing/empty artifacts are treated as "derive not
   yet complete" (§C.7 read contract); path traversal is rejected.

Out of scope for this lane (per §K decomposition):
- T1.2 graph modality + §D.3 lineage model (@bryce, task #7)
- T1.3 vector + fulltext modalities (chenyexuan, task #8)
- T1.4 summary + vision modalities (chenyexuan, task #9)
- T1.5 observability OTLP emit (chenyexuan, task #10)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…dation fix-forward

Bundles the architect-mandated fix-forward (msg=4a801b2b, msg=c3b0ba5b,
msg=07b5b1e6) on top of T1.1 foundation:

T1.1 fix-forward (per architect rulings on PR #1726 P0 bugs):
- Bug1: object_store.py LocalObjectStore import alias (class is named Local)
- Bug2: tablename → document_index_v2 + index renames (Wave 3 renames back
  to canonical via alembic per task #14 acceptance amendment)
- NEW (§H.2): tenant_scope_key VARCHAR(64) NOT NULL column +
  idx_document_index_v2_tenant_scope index — locked into Wave 1 schema for
  T2.2 quota lane to consume without migration churn
- Tests: fixture default tenant_scope_key="user:test"

T1.3 Vector + Fulltext (§D.1 replace-idempotent contract):
- VectorBackend / InMemoryVectorBackend Protocol + Qdrant-shaped
  delete_by_filter + upsert_point
- VectorModality: derive no-op pass-through, sync DELETE-by-(doc, parse_v)
  THEN per-chunk upsert with placeholder embedding (sha256-derived 16-dim)
- FulltextBackend / InMemoryFulltextBackend + delete_by_query + bulk_index
- FulltextModality: shares chunks.jsonl with vector (§C.6); chunk_id parity
  preserved for hybrid dedup at search layer
- Tests: replace-idempotent on double sync, new parse_version doesn't
  corrupt old slot, hybrid chunk_id parity, modality discriminator on
  payload, missing-artifact no-op (§C.7 reschedule semantic)

T1.4 Summary + Vision (§C.6 + §D.2 expensive-derive split):
- SummaryModality: derive reads parser markdown.md, runs placeholder
  summarizer (first non-heading paragraph), embeds, writes summary.json
  atomically; sync deletes by filter + upserts single point keyed
  summary:{document_id}:{parse_version}
- VisionModality: derive reads synthetic image-records JSON (T1 simulator
  contract — T2.x replaces with real PDF extract + vision-LLM), writes
  vision/manifest.jsonl, sync upserts one point per image keyed
  vision:{document_id}:{parse_version}:{image_id}
- Both backends use Qdrant-shaped Protocol + InMemory test fixtures
- Tests: derive persists canonical artifact (cost preserved across
  retries), idempotent on double sync, new parse_version doesn't corrupt
  old slot, modality discriminator, missing-artifact no-op

T1.5 Observability primitives (§J.1 SLI emission):
- 5 metric name constants prefixed indexing.* — index_lag_seconds /
  index_failure_rate / index_success / queue_depth / worker_utilization
- MetricsEmitter Protocol + NoopMetricsEmitter (Tier 1 deploy without
  OTLP) + InMemoryMetricsEmitter (test fixture)
- emit_index_lag / emit_index_failure / emit_index_success /
  emit_queue_depth / emit_worker_utilization helpers — modality attribute
  optional, utilization clamps to [0, 1] and handles capacity=0
- Tests: emission shape contract for each helper + metric name prefix lock

Per architect msg=f21a79f0 + PM msg=07b5b1e6 + msg=95012fdb: this commit
accumulates onto the Wave 1 mega-PR #1726. Bryce will rebase + push T1.2
graph commit on the same branch; once T1.2 lands, the full Wave 1 PR
flips to in_review for huangheng's step 0+ / step 0 / step 0''' /
cross-lane caller sweep CR + verdict.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@earayu earayu changed the title feat(celery T1.1): Foundation — schema + Modality ABC + object_store + parser feat(celery Wave 1): Foundation + 5 modalities + observability Apr 26, 2026
earayu and others added 2 commits April 26, 2026 23:40
…edis lock + tenant_scope_key

Per docs/modularization/indexing-redesign-design-pack.md §D.3 +
architect msg=cc555e33 / msg=f2921ae0 / msg=c3b0ba5b lineage rulings.

The graph modality is the only one whose backend rows are shared
across documents, so the simple DELETE-by-(doc, parse_version) +
re-INSERT pattern that vector/fulltext/summary/vision use breaks the
shared-entity model — Bryce v2 review msg=7ccb176f #3 surfaced this.
T1.2 lands the lineage-tracked entity / relation rows + the §D.3.2
two-phase sync algorithm + the per-entity Redis lock that protects
Nebula's read-modify-write window from racing.

Surface (aperag/indexing/graph.py, ~1017 lines):

* Lineage data model — LineageMember(document_id, parse_version,
  tenant_scope_key, chunk_ids), DescriptionPart, EntityRecord,
  RelationRecord, EntityWithLineage, RelationWithLineage. The
  tenant_scope_key lives at SET-element level (not entity row level)
  per architect msg=c3b0ba5b — placement chosen so a shared entity
  cited by multiple tenants still has one row but each lineage member
  carries its own tenant attribution for read-path ACL filtering
  (§H.2 quota / organization key).

* Per-entity lock — EntityLock Protocol + InMemoryEntityLock
  (asyncio.Lock per key, single-process default) + RedisEntityLock
  (Redis SETNX-style lock keyed by f"{prefix}:{slot}" where slot is
  crc32(entity_id) % 4096 to bound the key space). The lock is
  mandatory on the Nebula path because Nebula 3.x's list ops require
  application-layer read-modify-write; concurrent sync calls touching
  the same shared entity would otherwise lose lineage members at the
  network round-trip.

* LineageGraphStore Protocol — backend abstraction so the §D.3.2
  algorithm in GraphModalityWorker is portable across the three
  backends listed in §D.3.5 (Nebula 3.x, Neo4j with native list +
  APOC, in-memory). Methods filter lineage by document_id (not
  exact (doc, parse_version)) so re-parsing supersedes the old
  parse_version cleanly per §D.3.6 step 3 (see §D.3.2 amendment
  note in docstring).

* InMemoryLineageGraphStore — Python-dict reference implementation
  used as the canonical correctness oracle. The §D.3.6 self-test
  + concurrent race test run against it; any future Nebula / Neo4j
  adapter that satisfies the Protocol inherits pass/fail status from
  this suite.

* GraphExtractor — Callable injection point so the production
  LightRAG-based LLM extractor (Wave 2 wiring) and the test stubs
  share the same surface. The kg.jsonl artifact persists the
  extractor output so retries never re-charge LLM cost (§C.6
  CANONICAL artifact contract).

* serialize_kg_jsonl / parse_kg_jsonl — line-oriented JSON with a
  "kind" discriminator so future record types can be added without
  breaking the file format. Empty payloads encode as b"\\n" (one
  byte) so the §C.7 "empty body == derive not finished" sentinel
  cannot collide with a deliberate "no records produced" payload —
  the deletion flow (§D.3.6 step 4 / step 5) publishes b"\\n" to
  cleanly clear a document's lineage contribution.

* GraphModalityWorker — implements ModalityWorker ABC. derive()
  reads chunks.jsonl produced by the T1.1 parser, calls the injected
  extractor, writes kg.jsonl atomically. sync() reads kg.jsonl and
  applies the §D.3.2 algorithm:
    Phase 1 (cleanup): for every entity / relation in the lineage
      backend whose lineage SET contains the document_id, remove
      ALL members for that document_id (across any parse_version)
      and GC rows that go orphan. Per-entity lock held during
      cleanup so concurrent syncs cannot race the read-modify-write.
    Phase 2 (rebuild): for every entity / relation in kg.jsonl,
      upsert with a fresh LineageMember stamped with the current
      (document_id, parse_version, tenant_scope_key). Per-entity
      lock held during rebuild for the same reason.

Tests (tests/unit_test/indexing/test_t1_2_graph.py, ~1044 lines, 16
test cases):

* §D.3.6 five-step idempotency suite (test_d3_6_step1..step5):
  doc_A v1 → doc_B v1 → doc_A v2 supersedes → delete doc_A → delete
  doc_B with full entity GC. Pin the §D.3.2 algorithm against the
  §D.3.6 narrative; the regression Bryce surfaced in msg=7ccb176f
  #3 fails this suite under the v1 append-on-conflict path.

* Relation-lineage symmetric coverage
  (test_relation_lineage_doc_a_then_doc_b_then_delete_doc_a) — proves
  the same lineage semantics flow through relation evidence_lineage.

* §D.4 byte-equivalent re-sync (test_d4_byte_equivalent_resync) —
  double-sync with the same artifact leaves the backend identical.
  The cross-modality idempotency contract every modality must pass.

* Nebula race condition under per-entity lock
  (test_nebula_race_under_per_entity_lock_preserves_both_writes +
  test_nebula_race_without_lock_loses_a_writer) — uses a
  _RaceProvocateurStore that emulates Nebula's read-modify-write
  network round-trip with a deterministic asyncio yield. Under the
  in-memory entity lock, both writers' lineage members end up in
  the SET (PASS); under a no-op lock, deterministically one writer
  loses (the negative control). Pins the architect msg=f2921ae0
  invariant that per-entity serialization is mandatory not optional.

* tenant_scope_key propagation
  (test_tenant_scope_key_propagates_into_lineage_members) — pins
  the SET-element placement decision per architect msg=c3b0ba5b so
  a future regression that drops the field or moves it to entity
  row level fails loudly.

* kg.jsonl round-trip (test_kg_jsonl_*) — entity / relation
  serialization, forward-compatible kind skipping, empty-body
  encoding (always at least b"\\n" to distinguish "no records"
  from "derive crashed").

* derive contract (test_derive_writes_kg_jsonl_under_canonical_path
  + test_sync_with_missing_artifact_is_a_noop) — derive writes to
  derived/parse_<v>/kg.jsonl per §C.6 canonical layout; sync no-ops
  when the artifact is missing per §C.7 read contract.

* End-to-end with the real T1.1 parser
  (test_end_to_end_with_real_parser_chunks) — proves graph.derive
  cooperates with the chunks.jsonl shape that
  aperag.indexing.parser.parse_document actually produces, no
  hidden coupling beyond the §C.6 contract.

aperag/indexing/__init__.py — re-export the T1.2 graph public
surface alongside chenyexuan's T1.3/T1.4/T1.5 modality exports so
the indexing package surface is uniform across all five modalities.

Lint: ruff check + ruff format --check both clean across
aperag/indexing/ and tests/unit_test/indexing/.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…patch

Three follow-up bugs surfaced by Bryce's local pytest collection
(msg=464d5b70) on top of fix-forward 859f899:

1. test_t1_1_foundation.py:69 — ImportError. The class is named
   `Local` (not `LocalObjectStore`); my main-tree object_store.py
   already uses the `Local as LocalObjectStore` alias per architect
   ruling on Bug 1 (msg=4a801b2b), but the test file's own import
   line still referenced the wrong name. Fixed: same alias pattern.

2. test_local_atomic_write_uses_tmp_rename_dance — AttributeError on
   `aperag.objectstore.local.settings`. The module never had a
   `settings` attribute; the monkey-patch block was unfounded.
   `Local(LocalConfig)` accepts the config struct directly, no
   module-level singleton dependency. Replaced the monkey-patch
   block with a direct `LocalObjectStore(LocalConfig(root_dir=...))`
   construction.

3. test_concurrent_atomic_writes_dont_clobber_each_other — same
   AttributeError pattern, same fix.

No production code changed; only the T1.1 test fixture is simplified
(net -30 lines). Lint clean. Wave 1 PR #1726 ready to flip to
in_review after this push.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
earayu added a commit that referenced this pull request Apr 26, 2026
Bryce v3 implementation review (msg=464d5b70) caught a spec bug in
§D.3.2 step 1b: the pseudocode used `(document_id, parse_version)`
exact-match for lineage filter, which contradicts §D.3.6 narrative
step 3 ("doc_A v2 写入(覆盖 doc_A 旧 lineage)"). Strict exact-match
would leave lineage[A,v_old] + lineage[A,v_new] coexisting after a
re-parse, violating the expected supersede semantic.

Architect ruling (msg=80c5dc06) is to amend §D.3.2 step 1b to filter
by `document_id` only (not parse_version). This makes sync(doc, v_new)
self-contained for supersede; orchestrator does not need to do
explicit clear-then-sync.

§D.3.6 narrative remains canonical. PR #1726 Wave 1 graph implementation
follows the corrected algorithm.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ex duplicate

huangheng's Wave 1 CR (msg=8e67bf0e) flagged P1:
test_phase3_classes_have_single_definition_site fails because Wave 1
introduces aperag/indexing/models.py:DocumentIndex alongside the legacy
aperag/domains/indexing/db/models.py:DocumentIndex still owned by Celery.

Architect ruling msg=5be9a535 — option (a): add a narrow allowlist
covering exactly this (class_name, file_pair); reject (b) renaming the
new ORM class (would force a Wave 1 + Wave 3 double-rename, conflicts
with msg=4a801b2b "Python class name stays canonical, only
__tablename__ differs") and (c) xfail (does not express the
"intentional transitional state" semantic).

Wave 3 task #14 acceptance per architect amendment will remove this
allowlist entry in the same PR that deletes the legacy file + alembic
renames document_index_v2 back to document_index.

The allowlist entry is narrow:
- Only the exact frozenset of two file paths is accepted; any third
  duplicate (Wave 4+ regression, e.g. a copy-paste of the class) still
  flags as an offender.
- Only matches `class DocumentIndex(Base):` definitions (the existing
  orm_pattern); pydantic schemas with the same name are unaffected.
- No global widening — no other Phase 3 class gets an exception.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@earayu earayu merged commit 0348a78 into main Apr 26, 2026
4 checks passed
@earayu earayu deleted the chenyexuan/celery-t1.1-foundation branch April 26, 2026 16:09
earayu added a commit that referenced this pull request Apr 26, 2026
…ite proposal (#1725)

* docs(indexing): add indexing redesign design pack — first-principles rewrite proposal

Per earayu2 directive (#celery msg=56812dd6 + msg=d8080c08): full redesign of
the document indexing system, prioritizing simplicity and reliability over
feature breadth, targeting 100 concurrent docs, with hard-cut authorization
(pre-launch / no users / no migration).

Design pack contents (1049 lines, 11 sections):

- §A — Current system analysis with file:line evidence (3-layer ownership skew,
  Python lease thread tied to worker process, graph index NOT replace-idempotent
  per nebula.py:354 upsert_entities, ~995 lines in tasks.py mixing infra +
  business)

- §B — First principles (single SoT in DB, idempotent convergence, source/
  derived/index three-layer separation, concurrency bounded by external
  capacity, simple > complex)

- §C — Three-layer document model (collections/<id>/documents/<id>/source/ +
  derived/parse_<v>/{markdown.md, chunks.jsonl, kg.jsonl, summary.json,
  vision/} + backend index stores)

- §D — Idempotency contract per modality (DELETE-by-(document_id, parse_version)
  before INSERT for all 5 modalities; fixes graph index append bug)

- §E — Concurrency model decision matrix (HTTP-only / lightweight task /
  Celery refactor); recommends lightweight Redis-backed asyncio worker pool
  per modality (5 worker processes, ~80-line reconciler, no Celery / no chord /
  no Python lease thread)

- §F — State machine + atomic flip (4 status values vs current 6;
  document.active_parse_version + pending_parse_version with transactional
  flip; deletion via async cleanup worker)

- §G — Multi-modal unified pipeline (Modality ABC with derive() + sync()
  contract; collapses earayu2's "Celery task 绕来绕去最后又绕回 graph index"
  complaint into 2 functions in 1 file)

- §H — Multi-tenant isolation (recommend simple — required tenant context +
  bulkheads, defer fairness machinery until observability shows real
  noisy-neighbor signal)

- §I — Failure recovery (3 modes: worker crash, transient backend, permanent
  failure; exponential backoff retry; Redis token bucket for LLM rate-limit
  backpressure)

- §J — Observability (4 SLI: index_lag_seconds, index_failure_rate,
  queue_depth, worker_utilization; OTLP wire; aligns with PR #1702)

- §K — Migration plan (7 PRs: observability primitives → idempotent indexers →
  object store layout → worker pool → atomic flip → cutover → availability
  discriminator; feature-flagged dual-stack during PR-D/E; cutover deletes
  ~3000 lines of Celery infrastructure)

Net delta: roughly +4150 / -4850 lines across 7 PRs — net subtraction despite
adding functionality. Indexing layer drops from ~2500 lines to ~1500.

Three open decisions deferred to earayu2:
1. Concurrency model: lightweight Redis-asyncio (recommended), HTTP-only, or
   Celery refactor
2. Atomic flip contract: all-modalities-ACTIVE-required (recommended) vs
   per-modality independent
3. PR sequence: 7-PR cut (recommended) vs combined

Sibling reference: Bryce msg=791082a4 + msg=38fbf962 first-principles analysis
+ architect msg=19f283d5 + msg=2ee66c89 4-blind-spot synthesis. This design
pack is the single canonical deliverable per earayu2's owner directive
(@符炫炜 sole author of the final design).

* docs(indexing): redesign pack v2 — incorporate earayu2 拍板 + 答 derived/MinIO

Driven by earayu2 msg=cc0a00d7 + PM consolidation msg=32463d64.

- Drop Celery → lock Redis + asyncio (§E, decision matrix removed)
- Drop atomic flip → per-modality independent is_serving cutover (§F)
  Accept short eventual-consistency window per earayu2 directive.
- Answer derived/parse_<v>/ contents per modality (§C.6) — chunks.jsonl
  shared by vector+fulltext, kg.jsonl, summary.json, vision/manifest.jsonl
  + images/, markdown.md + outline.json.
- Answer MinIO/object-store suitability (§C.7) — ~150 MB / 100-doc burst,
  trivial; LocalFS / MinIO / S3-compatible all work; small-file + LIST
  caveats addressed.
- Add §L private/on-premise "deploy-and-forget":
  Tier 1 inline (SQLite + LocalFS, ~10 docs/hour),
  Tier 2 docker-compose (~100 concurrent),
  Tier 3 horizontal scale-out — same code.
- §H tenant_scope_key forward-compat hook for future organization concept;
  simple Redis token-bucket quota that won't lock future fairness.
- §K restructure: 7 PRs → 3 waves (Foundation / Runtime / Cutover) with
  per-wave parallel-writability map.
- §G.5 SearchResultMetadata extends w/ parse_version + index_state_per_modality
  (becomes structurally required under per-modality independent flip).

PR #1725 v2; awaiting earayu2 final 拍板 on Wave 1 kickoff.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(indexing): v2 amendment — Bryce review deltas (chunking trade-off / serving invariant / graph entity lineage)

Bryce v2 review msg=7ccb176f surfaced 3 substantive technical deltas all
agreed-as-must-address by PM msg=fc307bbf. Folded into v2:

§C.6 — chunks.jsonl shared-by-vector-and-fulltext is now framed as conscious
trade-off (vector wants larger chunks, fulltext wants smaller) with explicit
shadow-file extension hook (chunks.fulltext.jsonl + namespaced sub-IDs)
preserved so future split is unblocked.

§F.1 — partial unique index added at the schema layer:
  CREATE UNIQUE INDEX uniq_document_index_serving
      ON document_index (document_id, modality)
      WHERE is_serving = TRUE;
This makes the "at most one serving row per (doc, modality)" invariant DB-
enforced, not orchestrator-enforced. SQLite 3.8+ supports the same syntax
(Tier 1 deploy stays consistent).

§D.3 — graph entity lineage model rewritten. Cross-document shared entities
("Linus" mentioned in 100 docs) cannot be cleared by simple DELETE-by-doc
without losing other docs' contributions. New model:
  - source_lineage: SET<{document_id, parse_version, chunk_ids[]}>
  - description_parts: SET<{document_id, parse_version, text}>
  - sync = lineage-level DELETE+INSERT, entity GC when lineage empty
Includes per-entity serialization invariant for Nebula (read-modify-write
without native list ops). 5-step idempotency self-test extension specified.

PR #1725 v2; ready for earayu2 / Bryce final ack.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(indexing): §D.3.2 amendment — lineage cleanup by document_id only

Bryce v3 implementation review (msg=464d5b70) caught a spec bug in
§D.3.2 step 1b: the pseudocode used `(document_id, parse_version)`
exact-match for lineage filter, which contradicts §D.3.6 narrative
step 3 ("doc_A v2 写入(覆盖 doc_A 旧 lineage)"). Strict exact-match
would leave lineage[A,v_old] + lineage[A,v_new] coexisting after a
re-parse, violating the expected supersede semantic.

Architect ruling (msg=80c5dc06) is to amend §D.3.2 step 1b to filter
by `document_id` only (not parse_version). This makes sync(doc, v_new)
self-contained for supersede; orchestrator does not need to do
explicit clear-then-sync.

§D.3.6 narrative remains canonical. PR #1726 Wave 1 graph implementation
follows the corrected algorithm.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(indexing): §J.1 amendment — failure_total + success_total counter pair (was single rate gauge)

huangheng Wave 1 CR (msg=8e67bf0e) flagged §J.1 spec drift: T1.5
implementation emits index_failure_total + index_success_total counter
pair, not the single index_failure_rate gauge spec called for. Architect
ruling: amend spec to match implementation. Counter pair is OTLP-
idiomatic, preserves raw events, re-aggregates across workers without
sliding-window state, and the rate is trivially computable downstream.

§J.1 spec amended; §K Wave 1 acceptance bullet updated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(indexing): §F.1 + §F.5 amendments — collection_id/source_path/tenant_scope_key columns + cleanup Path C

Three Wave 3 amendments folded in to unblock chenyexuan T3.1 (msg=afe345a9):

§F.1 schema:
- Add collection_id VARCHAR(64) NOT NULL (denormalized from
  document.collection_id, populated by orchestrator at INSERT for
  self-contained dispatch payload — per huangheng Wave 2 CR finding
  msg=c94b57fe + architect ruling msg=498b12f0).
- Add source_path TEXT NOT NULL (pointer to source/ artifact, worker
  derive reads directly without JOIN).
- Add tenant_scope_key VARCHAR(64) NOT NULL (forward-compat for future
  organization concept per §H.2; required key for §H.5 quota bucket).
  Was implicit-but-not-listed in the schema before; now explicit.
- Add idx_document_index_collection + idx_document_index_tenant_scope
  indexes for cleanup / quota scans.

§F.5 cleanup worker:
- Restructure to three paths (A/B/C) with explicit semantics.
- Path A: orphan parse_version GC (existing); now notes graph backend
  no-op via §D.3 lineage supersede + graph_noop counter for telemetry.
- Path B: single-document deletion cascade — explicit graph dispatch
  via remove_entity_lineage_member(document_id) per §D.3 amended canonical
  (by document_id only).
- Path C (NEW): collection deletion cascade — Collection.deleted_at
  scan + Path B per child document + final Collection row + storage
  tree cleanup. Replaces legacy Celery collection_delete task with
  state-driven recovery (no asyncio.create_task() durability gap).

These amendments unblock T3.1 commit 1+ since chenyexuan needs the
spec head to reference for the audit allowlist removal and the
caller-migration patterns. Wave 3 task #14 acceptance criteria
(per PM msg=5939e394) now references this head.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: 符炫炜 <fuxuanwei@apecloud.io>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant