feat(celery Wave 6 #34): chunk_id schema canonical unification by earayu · Pull Request #1738 · apecloud/ApeRAG

earayu · 2026-04-27T11:31:06Z

Summary

Wave 6 task #34 — parser canonical schema unification (per spec docs/modularization/indexing-redesign-design-pack.md §K.11.1 row 34).

Canonical name point_id for the Qdrant point identifier across the three modalities that share the {delete_by_filter, upsert_point} adapter (vector / summary / vision):

VectorBackend.upsert_point(chunk_id=) → upsert_point(point_id=) (matches summary + vision protocol shape)
_QdrantPointBackend.upsert_point() drops dual chunk_id | point_id transition shim → single point_id param
Adapter no longer injects misleading chunk_id into summary / vision payloads (their points are not chunks); each modality now controls its own payload schema
Vector worker still keeps chunk_id in payload (parser-emitted, what hybrid retrieval reads) — §C.6 trade-off preserved

utc_now audit: only aperag/indexing/reconciler.py uses utc_now, and those uses are application-level (lease comparison + gmt_updated stamps), distinct from Wave 5 P5B's server_default=CURRENT_TIMESTAMP migration which covers ORM creation defaults. Documented inline; no further migration needed.

Production-readiness 三类 layer (per §K.11.4)

must-be-real: parser-layer chunk_id schema unified across vector/summary/vision Protocol surfaces ✅
may-be-gated: vector worker keeps chunk_id in payload for hybrid-dedup with fulltext (§C.6 contract preserved) ✅
fully-resolves: huangheng T1 obs B + Wave 5 P5B chunk_id 5th item deferred ✅

simple-stable 4 guardrail check

Guardrail	#34 verdict
不无限扩范围	✅ scope = vector protocol + adapter, no cross-cutting refactor
功能做实	✅ real schema rename, no transitional shim left behind
简单稳定	✅ drops dual-API polymorphism + drops hidden payload side-effect (setdefault chunk_id)
私有化免维护	✅ canonical naming self-evident from signature, no tribal knowledge needed

hard-cut policy

Per earayu2 msg=30c81478: no production data → schema breaking change applied directly. No backward-compat shim, no deprecation window. New contract test asserts that chunk_id= keyword now raises TypeError to catch accidental regression to the pre-Wave-6 dual API.

Files changed

aperag/indexing/vector.py — protocol param + InMemoryVectorBackend record key + worker callsite
aperag/indexing/worker_factory.py — _QdrantPointBackend.upsert_point() simplification
aperag/indexing/reconciler.py — utc_now audit comment
tests/unit_test/indexing/test_chunk_id_schema_canonical.py (new, 7 contract tests)
tests/integration/test_cleanup_fan_out.py (5 callsite updates)
tests/unit_test/indexing/test_t1_3_vector_fulltext.py (hybrid-dedup test reads payload chunk_id now)
tests/unit_test/indexing/test_t2_1_runtime.py (3 callsite updates)
tests/unit_test/indexing/test_t3_1_dispatcher_path_c.py (3 callsite updates + 1 record read)

Test plan

152/152 tests/unit_test/indexing/ pass
21/21 indexing integration tests pass (test_cleanup_fan_out + test_dispatch_with_parse + test_inline_mode_smoke + test_parse_async_roundtrip)
7/7 new contract tests pass (test_chunk_id_schema_canonical.py)
ruff check + ruff format clean
CI lint-and-unit + e2e green

🤖 Generated with Claude Code

Per docs/modularization/indexing-redesign-design-pack.md §K.11.1 #34 (huangheng T1 obs B + Wave 5 P5B chunk_id 5th item deferred). ## Changes aperag/indexing/vector.py - VectorBackend.upsert_point(chunk_id=) → upsert_point(point_id=) (canonical Qdrant naming, aligned with summary/vision protocols) - InMemoryVectorBackend record key renamed chunk_id → point_id - Vector worker callsite passes point_id=chunk["chunk_id"]; chunk_id remains in payload for hybrid-dedup with fulltext (§C.6 trade-off lock) aperag/indexing/worker_factory.py - _QdrantPointBackend.upsert_point(): single point_id keyword (drop pre-Wave-6 dual chunk_id|point_id transition shim) - Drop merged_payload.setdefault("chunk_id", identifier) — adapter no longer injects misleading chunk_id into summary/vision payloads (their points are not chunks); each modality controls its payload aperag/indexing/reconciler.py - utc_now audit comment: utc_now usage in reconciler is application-level wall-clock for lease comparison + gmt_updated stamps, distinct from Wave 5 P5B server_default=CURRENT_TIMESTAMP ORM-creation defaults. No further migration needed. tests/unit_test/indexing/test_chunk_id_schema_canonical.py (new) - 7 contract tests pin canonical naming: * VectorBackend / SummaryBackend / VisionBackend Protocols use point_id * InMemoryVectorBackend round-trips point_id at record level + chunk_id at payload level * Legacy chunk_id= keyword raises TypeError on InMemoryVectorBackend * _QdrantPointBackend.upsert_point uses single point_id param * _QdrantPointBackend does not inject chunk_id into payload * Parser chunks.jsonl chunk_id field naming preserved tests/integration/test_cleanup_fan_out.py tests/unit_test/indexing/test_t1_3_vector_fulltext.py tests/unit_test/indexing/test_t2_1_runtime.py tests/unit_test/indexing/test_t3_1_dispatcher_path_c.py - backend.upsert_point(chunk_id=...) → point_id=... - record reads p["chunk_id"] → p["point_id"] / p["payload"]["chunk_id"] ## Production-readiness 三类 layer (per §K.11.4) - must-be-real: parser layer chunk_id field schema unified across vector/summary/vision backend protocol surfaces; remaining utc_now usage audited (reconciler.py only, application-level, distinct from ORM defaults Wave 5 P5B already migrated) - may-be-gated: vector worker still keeps chunk_id in payload for hybrid-dedup with fulltext (§C.6 contract preserved) - fully-resolves: huangheng T1 obs B + Wave 5 P5B chunk_id 5th item deferred (per spec §K.11.1 row 34) ## simple-stable 4 guardrail (per feedback_simple_stable_zero_maintenance.md) 1. 不无限扩范围 ✅ — scope limited to vector protocol + adapter, no cross-cutting touch 2. 功能做实 ✅ — real schema rename, no transitional shim left behind 3. 简单稳定 ✅ — drops dual-API polymorphism + drops hidden payload side-effect (setdefault chunk_id), each layer's contract is now self-evident from signature 4. 私有化免维护 ✅ — operator/dev reading code sees canonical naming without needing to track which arg name belongs to which modality ## hard-cut policy (per earayu2 msg=30c81478) No production data → schema breaking change applied directly. No backward-compat shim, no deprecation window. Test that the legacy chunk_id= keyword raises TypeError catches accidental regression. ## Pre-check (per K.11.5 Pattern 2) grep upsert_point across aperag/indexing/ + tests/ — all callsites audited, all callers migrated, contract test pins canonical name. ## Local gates - 152/152 indexing unit tests pass - 21/21 indexing integration tests pass (cleanup_fan_out + dispatch_with_parse + inline_mode_smoke + parse_async_roundtrip) - 7/7 new contract tests pass - ruff check + ruff format clean Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…WARN log Per docs/modularization/indexing-redesign-design-pack.md §K.11.1 #38 (huangheng PR #1734 sync point 7 + feedback_announce_equals_landed.md narrative-truth invariant). ## Problem (pre-Wave-6 narrative-truth violation) `aperag.cache.application_runtime.get_application_cache()` keyed the cached `ApplicationCache` instance on the running asyncio event loop (the Redis async client is loop-bound). When the running loop changed (test process restart, worker reinit, asyncio re-initialisation, etc.), the function silently swapped the cache for a `NoopApplicationCacheBackend` with `enabled=False`. From that point on the worker paid the full LiteLLM / embedding round-trip on every request, with no log line, no metric, and no operator-visible signal that caching had degraded to zero. ## Fix When the loop-switch branch fires: - emit a WARN log with operator-actionable phrasing - bump `application_cache_metrics["application_runtime"]["loop_switch_rebuild"]` (uses the existing ApplicationCacheMetrics instance — no new metrics infra) - reset `_async_cache=None` and fall through to the normal initialisation path (which re-establishes the real Redis client on the new loop) The Noop fallback is now reached only when Redis is genuinely unreachable on the new loop — never as a silent loop-switch downgrade. ## Tests (4 new) tests/unit_test/cache/test_application_cache.py - test_application_cache_rebuilds_on_loop_switch_instead_of_silent_noop: two `_run_in_new_loop()` calls produce distinct ApplicationCache instances, second wraps real `ApplicationRedisCacheBackend` (not `NoopApplicationCacheBackend`); metric counter incremented to 1; WARN log captured. - test_application_cache_does_not_increment_loop_switch_metric_on_first_build: first-ever build does not bump the loop-switch counter. - test_application_cache_repeat_call_in_same_loop_is_singleton_no_metric_bump: same-loop repeat returns identical singleton, no metric bump. - test_application_cache_falls_back_to_noop_when_redis_unreachable_on_new_loop: if Redis genuinely fails on the new loop, rebuild attempts the real client, falls back to Noop, but metric still records the loop switch (operators see the rebuild attempt + the Redis failure WARN that was already there). ## Production-readiness 三类 layer (per §K.11.4) - must-be-real: real WARN log + real metric counter + real Redis client rebuild on cross-loop call (no silent zero-cache) - may-be-gated: first cross-loop request may pay rebuild cost (re-ping Redis) — observable via metric, not silent - fully-resolves: huangheng PR #1734 sync point 7 + feedback_announce_equals_landed.md narrative-truth lock ## simple-stable 4 guardrail 1. 不无限扩范围 ✅ — single-function fix in application_runtime.py, reuses existing `application_cache_metrics` infra, no new public API 2. 功能做实 ✅ — real cache rebuild not a placeholder 3. 简单稳定 ✅ — fewer paths than before (one rebuild path replaces the silent-Noop branch), each branch logged 4. 私有化免维护 ✅ — operator can grep `loop_switch_rebuild` in metrics + log "rebuilding for new event loop" to diagnose worker setup issues ## Local gates - 43/43 cache unit tests pass (39 existing + 4 new) - ruff check + ruff format clean Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

earayu and others added 2 commits April 27, 2026 19:30

earayu merged commit 0711647 into main Apr 27, 2026
3 checks passed

earayu deleted the chenyexuan/celery-wave6 branch April 27, 2026 11:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(celery Wave 6 #34): chunk_id schema canonical unification#1738

feat(celery Wave 6 #34): chunk_id schema canonical unification#1738
earayu merged 2 commits into
mainfrom
chenyexuan/celery-wave6

earayu commented Apr 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

earayu commented Apr 27, 2026

Summary

Production-readiness 三类 layer (per §K.11.4)

simple-stable 4 guardrail check

hard-cut policy

Files changed

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant