Commit 0348a78
feat(celery Wave 1): Foundation + 5 modalities + observability (#1726)
* feat(celery T1.1): Foundation — schema + Modality ABC + object_store + parser
Phase celery T1.1 per docs/modularization/indexing-redesign-design-pack.md
(PR #1725 v3 head 5d7a60f). Foundation lane that the other Wave 1
modality lanes (T1.2 graph @bryce / T1.3 vector+fulltext / T1.4
summary+vision / T1.5 observability) depend on.
What this adds (~600 LOC + tests):
- alembic migration `f9c4d2a8e1b5_indexing_redesign_document_index.py`
creates the `document_index` table with the §F.1 partial unique
index `uniq_document_index_serving` enforcing "at most one
is_serving=TRUE row per (document_id, modality)" at the DB layer
(Bryce v2 review msg=7ccb176f #2). Postgres native; SQLite ≥3.8
supports the same syntax (Tier 1 §L deploy stays consistent).
- `aperag/indexing/models.py` — DocumentIndex SQLAlchemy ORM mirroring
the alembic schema, plus `Modality` (5 values per §C/§D) and
`IndexStatus` (4 lifecycle states per §F.2) string-enums.
- `aperag/indexing/base.py` — `ModalityWorker` ABC with `derive` +
`sync` so per-modality workers (T1.2-T1.5) inherit the §D.1
"DELETE-by-(doc, parse_version) THEN INSERT" replace-idempotent
contract. Graph reinterprets DELETE as the §D.3 lineage-level
DELETE+INSERT internally; the ABC accepts that variation.
- `aperag/indexing/object_store.py` — Atomic write helpers that wrap
the existing `aperag.objectstore` package: LocalFS uses
tmp+fsync+rename per §C.7; S3/MinIO relies on the single-request
PutObject (or multipart CompleteMultipartUpload) visibility gate.
Includes `read_or_none` for the §C.7 read contract and an
`InMemoryObjectStore` test fixture so downstream T1.x can wire
unit tests without touching disk.
- `aperag/indexing/parser.py` — Deterministic parser entry point that
produces the three shared artifacts (`markdown.md` / `outline.json`
/ `chunks.jsonl`) under `derived/parse_<v>/` per §C.1. parse_version
is computed via the canonical D10.g §E.2 helper
(`compute_parse_version`) so a chunking change rolls the version.
T1.1 ships an in-process simulator implementation that proves the
write contract; production parser integration (docparser/Marker/OCR)
swaps in at T2.x without changing the artifact shape.
- `aperag/migration/env.py` — Register `aperag.indexing.models` for
alembic autogen (the new module deliberately lives outside the
per-domain `db/models.py` tree because the `aperag.domains.indexing`
surface is the Wave 3 hard-delete target).
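The LocalFS tmp+fsync+rename write and the `read_or_none` read contract described above can be sketched as follows. This is a minimal illustration, not the code in `aperag/indexing/object_store.py`: the function `atomic_write`, its signature, and the `.tmp.` naming scheme are assumptions made for the example.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def atomic_write(dest: Path, data: bytes) -> None:
    """Write data so readers see either the previous file or the complete new one."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives next to the destination so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(prefix=dest.name + ".tmp.", dir=str(dest.parent))
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before publishing the name
        os.replace(tmp_name, dest)  # atomic swap of the directory entry
    except BaseException:
        # On any failure, remove the temp file so no .tmp.* sibling is left behind.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def read_or_none(path: Path) -> Optional[bytes]:
    """Return the bytes, or None when the artifact is missing or empty."""
    try:
        data = path.read_bytes()
    except FileNotFoundError:
        return None
    return data or None
```

A caller would treat `None` from `read_or_none` as "derive not yet complete" and reschedule rather than fail. The sketch does not fsync the parent directory, which a stricter durability contract might also require.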
Tests cover the three Wave 1 acceptance gates locked by the design
pack §K:
1. Partial unique invariant: an INSERT of a second is_serving=TRUE
row for the same (document_id, modality) raises IntegrityError;
non-serving rows + a different modality on the same document are
allowed; the §F.3 three-statement cutover transaction satisfies
the constraint.
2. Object-store atomic write: `_LocalAtomicWriter` produces a final
destination file with no `.tmp.*` siblings remaining; concurrent
writers to different artifacts in the same parse_version directory
each land their own bytes; `InMemoryObjectStore` matches the
single-call atomicity semantic.
3. Parser → derived/ round-trip: parse_document writes the three
canonical artifacts; identical inputs yield identical
parse_version + identical artifact contents (§C.3 idempotent
retry); a chunking config change rolls parse_version (§E.2 hash);
chunks.jsonl round-trips via `read_chunks`; outline.json carries
slash-separated `section_path` + slug `heading_anchor` per the
D10.c §A.9 R1 lock; re-running parse_document overwrites
atomically; missing/empty artifacts are treated as "derive not
yet complete" (§C.7 read contract); path traversal is rejected.
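The partial unique invariant in gate 1 can be reproduced with SQLite's standard-library driver. The sketch below keeps the index name from this commit but uses a cut-down column set chosen for illustration. The `cutover` helper is a guess at a demote-then-promote swap; it does not claim to be the exact §F.3 three-statement transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE document_index (
        id INTEGER PRIMARY KEY,
        document_id TEXT NOT NULL,
        modality TEXT NOT NULL,
        parse_version TEXT NOT NULL,
        is_serving INTEGER NOT NULL DEFAULT 0
    );
    -- Partial unique index: uniqueness applies only to rows with is_serving = 1.
    CREATE UNIQUE INDEX uniq_document_index_serving
        ON document_index (document_id, modality)
        WHERE is_serving = 1;
    """
)


def insert_row(document_id: str, modality: str, parse_version: str, is_serving: bool) -> None:
    """Insert one row; the `with` block commits on success and rolls back on error."""
    with conn:
        conn.execute(
            "INSERT INTO document_index (document_id, modality, parse_version, is_serving) "
            "VALUES (?, ?, ?, ?)",
            (document_id, modality, parse_version, int(is_serving)),
        )


def cutover(document_id: str, modality: str, new_version: str) -> None:
    """Hypothetical swap: demote the serving row, then promote the new one, in one transaction."""
    with conn:
        conn.execute(
            "UPDATE document_index SET is_serving = 0 "
            "WHERE document_id = ? AND modality = ? AND is_serving = 1",
            (document_id, modality),
        )
        conn.execute(
            "UPDATE document_index SET is_serving = 1 "
            "WHERE document_id = ? AND modality = ? AND parse_version = ?",
            (document_id, modality, new_version),
        )


def serving_versions(document_id: str, modality: str) -> list:
    """Return the parse_version of every serving row for (document_id, modality)."""
    rows = conn.execute(
        "SELECT parse_version FROM document_index "
        "WHERE document_id = ? AND modality = ? AND is_serving = 1",
        (document_id, modality),
    ).fetchall()
    return [row[0] for row in rows]
```

Inserting a second serving row for the same (document_id, modality) raises `sqlite3.IntegrityError`, while non-serving rows and other modalities are unaffected. Demoting before promoting keeps the constraint satisfied at every statement boundary.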
Out of scope for this lane (per §K decomposition):
- T1.2 graph modality + §D.3 lineage model (@bryce, task #7)
- T1.3 vector + fulltext modalities (chenyexuan, task #8)
- T1.4 summary + vision modalities (chenyexuan, task #9)
- T1.5 observability OTLP emit (chenyexuan, task #10)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(celery Wave 1): T1.3+T1.4+T1.5 modalities + observability + foundation fix-forward
Bundles the architect-mandated fix-forward (msg=4a801b2b, msg=c3b0ba5b,
msg=07b5b1e6) on top of T1.1 foundation:
T1.1 fix-forward (per architect rulings on PR #1726 P0 bugs):
- Bug1: object_store.py LocalObjectStore import alias (class is named Local)
- Bug2: tablename → document_index_v2 + index renames (Wave 3 renames back
to canonical via alembic per task #14 acceptance amendment)
- NEW (§H.2): tenant_scope_key VARCHAR(64) NOT NULL column +
idx_document_index_v2_tenant_scope index — locked into Wave 1 schema for
T2.2 quota lane to consume without migration churn
- Tests: fixture default tenant_scope_key="user:test"
T1.3 Vector + Fulltext (§D.1 replace-idempotent contract):
- VectorBackend / InMemoryVectorBackend Protocol + Qdrant-shaped
delete_by_filter + upsert_point
- VectorModality: derive no-op pass-through, sync DELETE-by-(doc, parse_v)
THEN per-chunk upsert with placeholder embedding (sha256-derived 16-dim)
- FulltextBackend / InMemoryFulltextBackend + delete_by_query + bulk_index
- FulltextModality: shares chunks.jsonl with vector (§C.6); chunk_id parity
preserved for hybrid dedup at search layer
- Tests: replace-idempotent on double sync, new parse_version doesn't
corrupt old slot, hybrid chunk_id parity, modality discriminator on
payload, missing-artifact no-op (§C.7 reschedule semantic)
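The replace-idempotent sync for the vector modality (delete by (doc, parse_version), then upsert per chunk) might look like the sketch below. The backend class, the `sync_vector` function, the point-id format, and the payload field names are assumptions made for illustration; only the `delete_by_filter` / `upsert_point` method names and the sha256-derived 16-dim placeholder come from the description above.

```python
import hashlib
from typing import Any, Dict, List


def placeholder_embedding(text: str, dim: int = 16) -> List[float]:
    """Deterministic stand-in vector: the first `dim` sha256 bytes scaled to [0, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


class InMemoryVectorBackend:
    """Hypothetical test double with two Qdrant-shaped calls."""

    def __init__(self) -> None:
        self.points: Dict[str, Dict[str, Any]] = {}

    def delete_by_filter(self, **match: Any) -> int:
        # Delete every point whose payload matches all of the given key/value pairs.
        doomed = [
            pid
            for pid, point in self.points.items()
            if all(point["payload"].get(k) == v for k, v in match.items())
        ]
        for pid in doomed:
            del self.points[pid]
        return len(doomed)

    def upsert_point(self, point_id: str, vector: List[float], payload: Dict[str, Any]) -> None:
        self.points[point_id] = {"vector": vector, "payload": payload}


def sync_vector(
    backend: InMemoryVectorBackend,
    document_id: str,
    parse_version: str,
    chunks: List[Dict[str, str]],
) -> None:
    """DELETE-by-(doc, parse_version) THEN INSERT, so a retry converges to the same state."""
    backend.delete_by_filter(
        document_id=document_id, parse_version=parse_version, modality="vector"
    )
    for chunk in chunks:
        backend.upsert_point(
            point_id=f"vector:{document_id}:{parse_version}:{chunk['chunk_id']}",
            vector=placeholder_embedding(chunk["text"]),
            payload={
                "document_id": document_id,
                "parse_version": parse_version,
                "modality": "vector",  # discriminator so modalities can share a collection
                "chunk_id": chunk["chunk_id"],
            },
        )
```

Because the delete is scoped to one parse_version, syncing a new parse_version leaves the old slot untouched, and a re-sync with fewer chunks removes the stale points instead of leaving them behind.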
T1.4 Summary + Vision (§C.6 + §D.2 expensive-derive split):
- SummaryModality: derive reads parser markdown.md, runs placeholder
summarizer (first non-heading paragraph), embeds, writes summary.json
atomically; sync deletes by filter + upserts single point keyed
summary:{document_id}:{parse_version}
- VisionModality: derive reads synthetic image-records JSON (T1 simulator
contract — T2.x replaces with real PDF extract + vision-LLM), writes
vision/manifest.jsonl, sync upserts one point per image keyed
vision:{document_id}:{parse_version}:{image_id}
- Both backends use Qdrant-shaped Protocol + InMemory test fixtures
- Tests: derive persists canonical artifact (cost preserved across
retries), idempotent on double sync, new parse_version doesn't corrupt
old slot, modality discriminator, missing-artifact no-op
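For the T1.4 summary path, a placeholder summarizer that returns the first non-heading paragraph could be as small as the sketch below. The point-key formats are taken from the description above; the function names and the `summary.json` field names are assumptions for illustration.

```python
import json


def first_non_heading_paragraph(markdown: str) -> str:
    """Return the first blank-line-separated block that has non-heading text."""
    for block in markdown.split("\n\n"):
        body = [
            line.strip()
            for line in block.splitlines()
            if line.strip() and not line.lstrip().startswith("#")
        ]
        if body:
            return " ".join(body)
    return ""


def summary_point_id(document_id: str, parse_version: str) -> str:
    """Single summary point per (document, parse_version)."""
    return f"summary:{document_id}:{parse_version}"


def vision_point_id(document_id: str, parse_version: str, image_id: str) -> str:
    """One vision point per image in the document."""
    return f"vision:{document_id}:{parse_version}:{image_id}"


def build_summary_artifact(document_id: str, parse_version: str, markdown: str) -> bytes:
    """Serialize the derive output; sorted keys keep retries byte-identical."""
    record = {
        "document_id": document_id,
        "parse_version": parse_version,
        "summary": first_non_heading_paragraph(markdown),
    }
    return json.dumps(record, sort_keys=True).encode("utf-8")
```

Persisting the derive output before sync is what lets a retried sync reuse the artifact instead of re-running the expensive step.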
T1.5 Observability primitives (§J.1 SLI emission):
- 5 metric name constants prefixed indexing.* — index_lag_seconds /
index_failure_rate / index_success / queue_depth / worker_utilization
- MetricsEmitter Protocol + NoopMetricsEmitter (Tier 1 deploy without
OTLP) + InMemoryMetricsEmitter (test fixture)
- emit_index_lag / emit_index_failure / emit_index_success /
emit_queue_depth / emit_worker_utilization helpers — modality attribute
optional, utilization clamps to [0, 1] and handles capacity=0
- Tests: emission shape contract for each helper + metric name prefix lock
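The emitter Protocol and the utilization clamp can be sketched as below. The metric names follow the list above; the `emit` signature, the `records` attribute, and the choice to report 0.0 when capacity is zero are assumptions, since the commit message only says that capacity=0 is handled.

```python
from typing import Dict, List, Optional, Protocol, Tuple

METRIC_NAMES = (
    "indexing.index_lag_seconds",
    "indexing.index_failure_rate",
    "indexing.index_success",
    "indexing.queue_depth",
    "indexing.worker_utilization",
)
WORKER_UTILIZATION = "indexing.worker_utilization"


class MetricsEmitter(Protocol):
    def emit(self, name: str, value: float, attributes: Dict[str, str]) -> None: ...


class NoopMetricsEmitter:
    """Used when no OTLP endpoint is configured; drops every sample."""

    def emit(self, name: str, value: float, attributes: Dict[str, str]) -> None:
        return None


class InMemoryMetricsEmitter:
    """Test fixture that records every sample for later assertions."""

    def __init__(self) -> None:
        self.records: List[Tuple[str, float, Dict[str, str]]] = []

    def emit(self, name: str, value: float, attributes: Dict[str, str]) -> None:
        self.records.append((name, value, dict(attributes)))


def emit_worker_utilization(
    emitter: MetricsEmitter, busy: int, capacity: int, modality: Optional[str] = None
) -> float:
    """Emit busy/capacity clamped to [0, 1]; a non-positive capacity reports 0.0."""
    ratio = 0.0 if capacity <= 0 else busy / capacity
    ratio = max(0.0, min(1.0, ratio))
    attributes = {"modality": modality} if modality else {}
    emitter.emit(WORKER_UTILIZATION, ratio, attributes)
    return ratio
```

Call sites depend only on the Protocol, so swapping the no-op emitter for an OTLP-backed one would not change the workers.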
Per architect msg=f21a79f0 + PM msg=07b5b1e6 + msg=95012fdb: this commit
accumulates onto the Wave 1 mega-PR #1726. Bryce will rebase + push T1.2
graph commit on the same branch; once T1.2 lands, the full Wave 1 PR
flips to in_review for huangheng's step 0+ / step 0 / step 0''' /
cross-lane caller sweep CR + verdict.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(celery T1.2): graph modality — §D.3 lineage model + per-entity Redis lock + tenant_scope_key
Per docs/modularization/indexing-redesign-design-pack.md §D.3 +
architect msg=cc555e33 / msg=f2921ae0 / msg=c3b0ba5b lineage rulings.
The graph modality is the only one whose backend rows are shared
across documents, so the simple DELETE-by-(doc, parse_version) +
re-INSERT pattern that vector/fulltext/summary/vision use breaks the
shared-entity model — Bryce v2 review msg=7ccb176f #3 surfaced this.
T1.2 lands the lineage-tracked entity / relation rows + the §D.3.2
two-phase sync algorithm + the per-entity Redis lock that protects
Nebula's read-modify-write window from racing.
Surface (aperag/indexing/graph.py, ~1017 lines):
* Lineage data model — LineageMember(document_id, parse_version,
tenant_scope_key, chunk_ids), DescriptionPart, EntityRecord,
RelationRecord, EntityWithLineage, RelationWithLineage. The
tenant_scope_key lives at SET-element level (not entity row level)
per architect msg=c3b0ba5b — placement chosen so a shared entity
cited by multiple tenants still has one row but each lineage member
carries its own tenant attribution for read-path ACL filtering
(§H.2 quota / organization key).
* Per-entity lock — EntityLock Protocol + InMemoryEntityLock
(asyncio.Lock per key, single-process default) + RedisEntityLock
(Redis SETNX-style lock keyed by f"{prefix}:{slot}" where slot is
crc32(entity_id) % 4096 to bound the key space). The lock is
mandatory on the Nebula path because Nebula 3.x's list ops require
application-layer read-modify-write; concurrent sync calls touching
the same shared entity would otherwise overwrite each other's lineage
members in the window between the read and the write-back.
* LineageGraphStore Protocol — backend abstraction so the §D.3.2
algorithm in GraphModalityWorker is portable across the three
backends listed in §D.3.5 (Nebula 3.x, Neo4j with native list +
APOC, in-memory). Methods filter lineage by document_id (not
exact (doc, parse_version)) so re-parsing supersedes the old
parse_version cleanly per §D.3.6 step 3 (see §D.3.2 amendment
note in docstring).
* InMemoryLineageGraphStore — Python-dict reference implementation
used as the canonical correctness oracle. The §D.3.6 self-test
+ concurrent race test run against it; any future Nebula / Neo4j
adapter that satisfies the Protocol inherits pass/fail status from
this suite.
* GraphExtractor — Callable injection point so the production
LightRAG-based LLM extractor (Wave 2 wiring) and the test stubs
share the same surface. The kg.jsonl artifact persists the
extractor output so retries never re-charge LLM cost (§C.6
CANONICAL artifact contract).
* serialize_kg_jsonl / parse_kg_jsonl — line-oriented JSON with a
"kind" discriminator so future record types can be added without
breaking the file format. Empty payloads encode as b"\n" (one
byte) so the §C.7 "empty body == derive not finished" sentinel
cannot collide with a deliberate "no records produced" payload;
the deletion flow (§D.3.6 step 4 / step 5) publishes b"\n" to
cleanly clear a document's lineage contribution.
* GraphModalityWorker — implements ModalityWorker ABC. derive()
reads chunks.jsonl produced by the T1.1 parser, calls the injected
extractor, writes kg.jsonl atomically. sync() reads kg.jsonl and
applies the §D.3.2 algorithm:
Phase 1 (cleanup): for every entity / relation in the lineage
backend whose lineage SET contains the document_id, remove
ALL members for that document_id (across any parse_version)
and GC rows that go orphan. Per-entity lock held during
cleanup so concurrent syncs cannot race the read-modify-write.
Phase 2 (rebuild): for every entity / relation in kg.jsonl,
upsert with a fresh LineageMember stamped with the current
(document_id, parse_version, tenant_scope_key). Per-entity
lock held during rebuild for the same reason.
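The two-phase sync and the lock-slot derivation can be sketched against a plain dict store. This is an illustration under assumptions, not the code in `aperag/indexing/graph.py`: the store class, the `sync_document` signature, and the `graph-lock` prefix are hypothetical, only entities are modelled (relations would follow the same pattern), and locking is omitted because the sketch is single-threaded.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class LineageMember:
    document_id: str
    parse_version: str
    tenant_scope_key: str
    chunk_ids: Tuple[str, ...]


def lock_slot(entity_id: str, prefix: str = "graph-lock", slots: int = 4096) -> str:
    """Map an entity id to one of `slots` lock keys so the key space stays bounded."""
    return f"{prefix}:{zlib.crc32(entity_id.encode('utf-8')) % slots}"


class InMemoryLineageStore:
    """Hypothetical dict-backed store: entity id -> set of lineage members."""

    def __init__(self) -> None:
        self.entities: Dict[str, Set[LineageMember]] = {}


def sync_document(
    store: InMemoryLineageStore,
    document_id: str,
    parse_version: str,
    tenant_scope_key: str,
    extracted: Dict[str, Tuple[str, ...]],
) -> None:
    """Apply cleanup then rebuild; `extracted` maps entity id -> supporting chunk ids."""
    # Phase 1 (cleanup): drop this document's members across every parse_version
    # and garbage-collect entities that no other document references.
    for entity_id in list(store.entities):
        kept = {m for m in store.entities[entity_id] if m.document_id != document_id}
        if kept:
            store.entities[entity_id] = kept
        else:
            del store.entities[entity_id]
    # Phase 2 (rebuild): stamp a fresh member for each entity in the artifact.
    for entity_id, chunk_ids in extracted.items():
        member = LineageMember(document_id, parse_version, tenant_scope_key, chunk_ids)
        store.entities.setdefault(entity_id, set()).add(member)
```

Syncing a document with an empty `extracted` mapping plays the role of the deletion flow: phase 1 clears the document's contribution and phase 2 adds nothing back. Two entities can hash to the same slot, which only costs some extra serialization.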
Tests (tests/unit_test/indexing/test_t1_2_graph.py, ~1044 lines, 16
test cases):
* §D.3.6 five-step idempotency suite (test_d3_6_step1..step5):
doc_A v1 → doc_B v1 → doc_A v2 supersedes → delete doc_A → delete
doc_B with full entity GC. Pin the §D.3.2 algorithm against the
§D.3.6 narrative; the regression Bryce surfaced in msg=7ccb176f
#3 fails this suite under the v1 append-on-conflict path.
* Relation-lineage symmetric coverage
(test_relation_lineage_doc_a_then_doc_b_then_delete_doc_a) — proves
the same lineage semantics flow through relation evidence_lineage.
* §D.4 byte-equivalent re-sync (test_d4_byte_equivalent_resync) —
double-sync with the same artifact leaves the backend identical.
The cross-modality idempotency contract every modality must pass.
* Nebula race condition under per-entity lock
(test_nebula_race_under_per_entity_lock_preserves_both_writes +
test_nebula_race_without_lock_loses_a_writer) — uses a
_RaceProvocateurStore that emulates Nebula's read-modify-write
network round-trip with a deterministic asyncio yield. Under the
in-memory entity lock, both writers' lineage members end up in
the SET (PASS); under a no-op lock, deterministically one writer
loses (the negative control). Pins the architect msg=f2921ae0
invariant that per-entity serialization is mandatory not optional.
* tenant_scope_key propagation
(test_tenant_scope_key_propagates_into_lineage_members) — pins
the SET-element placement decision per architect msg=c3b0ba5b so
a future regression that drops the field or moves it to entity
row level fails loudly.
* kg.jsonl round-trip (test_kg_jsonl_*) — entity / relation
serialization, forward-compatible kind skipping, empty-body
encoding (always at least b"\n" to distinguish "no records"
from "derive crashed").
* derive contract (test_derive_writes_kg_jsonl_under_canonical_path
+ test_sync_with_missing_artifact_is_a_noop) — derive writes to
derived/parse_<v>/kg.jsonl per §C.6 canonical layout; sync no-ops
when the artifact is missing per §C.7 read contract.
* End-to-end with the real T1.1 parser
(test_end_to_end_with_real_parser_chunks) — proves graph.derive
cooperates with the chunks.jsonl shape that
aperag.indexing.parser.parse_document actually produces, no
hidden coupling beyond the §C.6 contract.
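The race pair (lock versus no-op lock) can be reproduced in a few lines of asyncio. The sketch below is a simplified stand-in for `_RaceProvocateurStore`, not the test code itself; the class and function names are hypothetical, and `asyncio.sleep(0)` plays the role of the deterministic yield between read and write-back.

```python
import asyncio
from typing import Callable, Dict, List


class RoundTripStore:
    """Emulates a backend whose list updates are read, modified client-side, then written."""

    def __init__(self) -> None:
        self.lineage: Dict[str, List[str]] = {}

    async def read(self, entity_id: str) -> List[str]:
        return list(self.lineage.get(entity_id, []))

    async def write(self, entity_id: str, members: List[str]) -> None:
        self.lineage[entity_id] = members


class NoopLock:
    """Negative control: an async context manager that serializes nothing."""

    async def __aenter__(self) -> None:
        return None

    async def __aexit__(self, *exc_info: object) -> None:
        return None


async def add_member(store: RoundTripStore, lock, entity_id: str, member: str) -> None:
    async with lock:
        current = await store.read(entity_id)
        await asyncio.sleep(0)  # yield so the other writer can run before the write-back
        await store.write(entity_id, current + [member])


async def race(lock_factory: Callable[[], object]) -> List[str]:
    """Run two writers against one entity and return the surviving members."""
    store = RoundTripStore()
    lock = lock_factory()  # created inside the running loop
    await asyncio.gather(
        add_member(store, lock, "alice", "doc_A"),
        add_member(store, lock, "alice", "doc_B"),
    )
    return sorted(store.lineage["alice"])
```

Running `asyncio.run(race(asyncio.Lock))` keeps both members, while `asyncio.run(race(NoopLock))` leaves only one, because both writers read the empty list before either writes back.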
aperag/indexing/__init__.py — re-export the T1.2 graph public
surface alongside chenyexuan's T1.3/T1.4/T1.5 modality exports so
the indexing package surface is uniform across all five modalities.
Lint: ruff check + ruff format --check both clean across
aperag/indexing/ and tests/unit_test/indexing/.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(celery T1.1 tests): leftover ImportError + bogus settings monkey-patch
Three follow-up bugs surfaced by Bryce's local pytest collection
(msg=464d5b70) on top of fix-forward 859f899:
1. test_t1_1_foundation.py:69 — ImportError. The class is named
`Local` (not `LocalObjectStore`); my main-tree object_store.py
already uses the `Local as LocalObjectStore` alias per architect
ruling on Bug 1 (msg=4a801b2b), but the test file's own import
line still referenced the wrong name. Fixed: same alias pattern.
2. test_local_atomic_write_uses_tmp_rename_dance — AttributeError on
`aperag.objectstore.local.settings`. The module never had a
`settings` attribute; the monkey-patch block was unfounded.
`Local(LocalConfig)` accepts the config struct directly, no
module-level singleton dependency. Replaced the monkey-patch
block with a direct `LocalObjectStore(LocalConfig(root_dir=...))`
construction.
3. test_concurrent_atomic_writes_dont_clobber_each_other — same
AttributeError pattern, same fix.
No production code changed; only the T1.1 test fixture is simplified
(net -30 lines). Lint clean. Wave 1 PR #1726 ready to flip to
in_review after this push.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(celery T1.1 audit): narrow allowlist for transitional DocumentIndex duplicate
huangheng's Wave 1 CR (msg=8e67bf0e) flagged P1:
test_phase3_classes_have_single_definition_site fails because Wave 1
introduces aperag/indexing/models.py:DocumentIndex alongside the legacy
aperag/domains/indexing/db/models.py:DocumentIndex still owned by Celery.
Architect ruling msg=5be9a535 — option (a): add a narrow allowlist
covering exactly this (class_name, file_pair); reject (b) renaming the
new ORM class (would force a Wave 1 + Wave 3 double-rename, conflicts
with msg=4a801b2b "Python class name stays canonical, only
__tablename__ differs") and (c) xfail (does not express the
"intentional transitional state" semantic).
Wave 3 task #14 acceptance per architect amendment will remove this
allowlist entry in the same PR that deletes the legacy file + alembic
renames document_index_v2 back to document_index.
The allowlist entry is narrow:
- Only the exact frozenset of two file paths is accepted; any third
duplicate (Wave 4+ regression, e.g. a copy-paste of the class) still
flags as an offender.
- Only matches `class DocumentIndex(Base):` definitions (the existing
orm_pattern); pydantic schemas with the same name are unaffected.
- No global widening — no other Phase 3 class gets an exception.
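The narrowness of the allowlist can be illustrated with a small check over (class name, file path) pairs. The two file paths come from this commit; the `ALLOWED_DUPLICATES` name, the `find_offenders` function, and the input shape are assumptions, and the real audit test may be structured differently.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple

# Exactly one transitional duplicate is tolerated, keyed by class name.
ALLOWED_DUPLICATES: Dict[str, FrozenSet[str]] = {
    "DocumentIndex": frozenset(
        {
            "aperag/indexing/models.py",
            "aperag/domains/indexing/db/models.py",
        }
    ),
}


def find_offenders(definitions: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Return classes defined in more than one file, minus exact allowlist matches."""
    sites: Dict[str, Set[str]] = defaultdict(set)
    for class_name, path in definitions:
        sites[class_name].add(path)
    offenders: Dict[str, Set[str]] = {}
    for class_name, paths in sites.items():
        if len(paths) <= 1:
            continue
        # Only the exact set of paths is accepted; a third copy no longer matches.
        if ALLOWED_DUPLICATES.get(class_name) == frozenset(paths):
            continue
        offenders[class_name] = paths
    return offenders
```

Comparing against the whole frozenset, rather than checking each path for membership, is what makes a third copy of the class fail the audit.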
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 704b3cf commit 0348a78
20 files changed
Lines changed: 5875 additions & 13 deletions
File tree
- aperag
  - indexing
  - migration
    - versions
- tests/unit_test
  - indexing