Skip to content

feat(celery Wave 6 #34): chunk_id schema canonical unification#1738

Merged
earayu merged 2 commits into
mainfrom
chenyexuan/celery-wave6
Apr 27, 2026
Merged

feat(celery Wave 6 #34): chunk_id schema canonical unification#1738
earayu merged 2 commits into
mainfrom
chenyexuan/celery-wave6

Conversation

@earayu
Copy link
Copy Markdown
Collaborator

@earayu earayu commented Apr 27, 2026

Summary

Wave 6 task #34 — parser canonical schema unification (per spec docs/modularization/indexing-redesign-design-pack.md §K.11.1 row 34).

Canonical name point_id for the Qdrant point identifier across the three modalities that share the {delete_by_filter, upsert_point} adapter (vector / summary / vision):

  • VectorBackend.upsert_point(chunk_id=)upsert_point(point_id=) (matches summary + vision protocol shape)
  • _QdrantPointBackend.upsert_point() drops dual chunk_id | point_id transition shim → single point_id param
  • Adapter no longer injects misleading chunk_id into summary / vision payloads (their points are not chunks); each modality now controls its own payload schema
  • Vector worker still keeps chunk_id in payload (parser-emitted, what hybrid retrieval reads) — §C.6 trade-off preserved

utc_now audit: only aperag/indexing/reconciler.py uses utc_now, and those uses are application-level (lease comparison + gmt_updated stamps), distinct from Wave 5 P5B's server_default=CURRENT_TIMESTAMP migration which covers ORM creation defaults. Documented inline; no further migration needed.

Production-readiness 三类 layer (per §K.11.4)

  • must-be-real: parser-layer chunk_id schema unified across vector/summary/vision Protocol surfaces ✅
  • may-be-gated: vector worker keeps chunk_id in payload for hybrid-dedup with fulltext (§C.6 contract preserved) ✅
  • fully-resolves: huangheng T1 obs B + Wave 5 P5B chunk_id 5th item deferred ✅

simple-stable 4 guardrail check

Guardrail #34 verdict
不无限扩范围 ✅ scope = vector protocol + adapter, no cross-cutting refactor
功能做实 ✅ real schema rename, no transitional shim left behind
简单稳定 ✅ drops dual-API polymorphism + drops hidden payload side-effect (setdefault chunk_id)
私有化免维护 ✅ canonical naming self-evident from signature, no tribal knowledge needed

hard-cut policy

Per earayu2 msg=30c81478: no production data → schema breaking change applied directly. No backward-compat shim, no deprecation window. New contract test asserts that chunk_id= keyword now raises TypeError to catch accidental regression to the pre-Wave-6 dual API.

Files changed

  • aperag/indexing/vector.py — protocol param + InMemoryVectorBackend record key + worker callsite
  • aperag/indexing/worker_factory.py_QdrantPointBackend.upsert_point() simplification
  • aperag/indexing/reconciler.py — utc_now audit comment
  • tests/unit_test/indexing/test_chunk_id_schema_canonical.py (new, 7 contract tests)
  • tests/integration/test_cleanup_fan_out.py (5 callsite updates)
  • tests/unit_test/indexing/test_t1_3_vector_fulltext.py (hybrid-dedup test reads payload chunk_id now)
  • tests/unit_test/indexing/test_t2_1_runtime.py (3 callsite updates)
  • tests/unit_test/indexing/test_t3_1_dispatcher_path_c.py (3 callsite updates + 1 record read)

Test plan

  • 152/152 tests/unit_test/indexing/ pass
  • 21/21 indexing integration tests pass (test_cleanup_fan_out + test_dispatch_with_parse + test_inline_mode_smoke + test_parse_async_roundtrip)
  • 7/7 new contract tests pass (test_chunk_id_schema_canonical.py)
  • ruff check + ruff format clean
  • CI lint-and-unit + e2e green

🤖 Generated with Claude Code

earayu and others added 2 commits April 27, 2026 19:30
Per docs/modularization/indexing-redesign-design-pack.md §K.11.1 #34
(huangheng T1 obs B + Wave 5 P5B chunk_id 5th item deferred).

## Changes

aperag/indexing/vector.py
- VectorBackend.upsert_point(chunk_id=) → upsert_point(point_id=)
  (canonical Qdrant naming, aligned with summary/vision protocols)
- InMemoryVectorBackend record key renamed chunk_id → point_id
- Vector worker callsite passes point_id=chunk["chunk_id"]; chunk_id
  remains in payload for hybrid-dedup with fulltext (§C.6 trade-off lock)

aperag/indexing/worker_factory.py
- _QdrantPointBackend.upsert_point(): single point_id keyword (drop
  pre-Wave-6 dual chunk_id|point_id transition shim)
- Drop merged_payload.setdefault("chunk_id", identifier) — adapter no
  longer injects misleading chunk_id into summary/vision payloads
  (their points are not chunks); each modality controls its payload

aperag/indexing/reconciler.py
- utc_now audit comment: utc_now usage in reconciler is
  application-level wall-clock for lease comparison + gmt_updated
  stamps, distinct from Wave 5 P5B server_default=CURRENT_TIMESTAMP
  ORM-creation defaults. No further migration needed.

tests/unit_test/indexing/test_chunk_id_schema_canonical.py (new)
- 7 contract tests pin canonical naming:
  * VectorBackend / SummaryBackend / VisionBackend Protocols use point_id
  * InMemoryVectorBackend round-trips point_id at record level + chunk_id at payload level
  * Legacy chunk_id= keyword raises TypeError on InMemoryVectorBackend
  * _QdrantPointBackend.upsert_point uses single point_id param
  * _QdrantPointBackend does not inject chunk_id into payload
  * Parser chunks.jsonl chunk_id field naming preserved

tests/integration/test_cleanup_fan_out.py
tests/unit_test/indexing/test_t1_3_vector_fulltext.py
tests/unit_test/indexing/test_t2_1_runtime.py
tests/unit_test/indexing/test_t3_1_dispatcher_path_c.py
- backend.upsert_point(chunk_id=...) → point_id=...
- record reads p["chunk_id"] → p["point_id"] / p["payload"]["chunk_id"]

## Production-readiness 三类 layer (per §K.11.4)

- must-be-real: parser layer chunk_id field schema unified across
  vector/summary/vision backend protocol surfaces; remaining utc_now
  usage audited (reconciler.py only, application-level, distinct from
  ORM defaults Wave 5 P5B already migrated)
- may-be-gated: vector worker still keeps chunk_id in payload for
  hybrid-dedup with fulltext (§C.6 contract preserved)
- fully-resolves: huangheng T1 obs B + Wave 5 P5B chunk_id 5th item
  deferred (per spec §K.11.1 row 34)

## simple-stable 4 guardrail (per feedback_simple_stable_zero_maintenance.md)

1. 不无限扩范围 ✅ — scope limited to vector protocol + adapter, no
   cross-cutting touch
2. 功能做实 ✅ — real schema rename, no transitional shim left behind
3. 简单稳定 ✅ — drops dual-API polymorphism + drops hidden payload
   side-effect (setdefault chunk_id), each layer's contract is now
   self-evident from signature
4. 私有化免维护 ✅ — operator/dev reading code sees canonical naming
   without needing to track which arg name belongs to which modality

## hard-cut policy (per earayu2 msg=30c81478)

No production data → schema breaking change applied directly. No
backward-compat shim, no deprecation window. Test that the legacy
chunk_id= keyword raises TypeError catches accidental regression.

## Pre-check (per K.11.5 Pattern 2)

grep upsert_point across aperag/indexing/ + tests/ — all callsites
audited, all callers migrated, contract test pins canonical name.

## Local gates

- 152/152 indexing unit tests pass
- 21/21 indexing integration tests pass (cleanup_fan_out +
  dispatch_with_parse + inline_mode_smoke + parse_async_roundtrip)
- 7/7 new contract tests pass
- ruff check + ruff format clean

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…WARN log

Per docs/modularization/indexing-redesign-design-pack.md §K.11.1 #38
(huangheng PR #1734 sync point 7 + feedback_announce_equals_landed.md
narrative-truth invariant).

## Problem (pre-Wave-6 narrative-truth violation)

`aperag.cache.application_runtime.get_application_cache()` keyed the
cached `ApplicationCache` instance on the running asyncio event loop
(the Redis async client is loop-bound). When the running loop
changed (test process restart, worker reinit, asyncio
re-initialisation, etc.), the function silently swapped the cache
for a `NoopApplicationCacheBackend` with `enabled=False`. From that
point on the worker paid the full LiteLLM / embedding round-trip on
every request, with no log line, no metric, and no operator-visible
signal that caching had degraded to zero.

## Fix

When the loop-switch branch fires:
- emit a WARN log with operator-actionable phrasing
- bump
  `application_cache_metrics["application_runtime"]["loop_switch_rebuild"]`
  (uses the existing ApplicationCacheMetrics instance — no new
  metrics infra)
- reset `_async_cache=None` and fall through to the normal
  initialisation path (which re-establishes the real Redis client on
  the new loop)

The Noop fallback is now reached only when Redis is genuinely
unreachable on the new loop — never as a silent loop-switch
downgrade.

## Tests (4 new)

tests/unit_test/cache/test_application_cache.py
- test_application_cache_rebuilds_on_loop_switch_instead_of_silent_noop:
  two `_run_in_new_loop()` calls produce distinct ApplicationCache
  instances, second wraps real `ApplicationRedisCacheBackend` (not
  `NoopApplicationCacheBackend`); metric counter incremented to 1;
  WARN log captured.
- test_application_cache_does_not_increment_loop_switch_metric_on_first_build:
  first-ever build does not bump the loop-switch counter.
- test_application_cache_repeat_call_in_same_loop_is_singleton_no_metric_bump:
  same-loop repeat returns identical singleton, no metric bump.
- test_application_cache_falls_back_to_noop_when_redis_unreachable_on_new_loop:
  if Redis genuinely fails on the new loop, rebuild attempts the
  real client, falls back to Noop, but metric still records the
  loop switch (operators see the rebuild attempt + the Redis
  failure WARN that was already there).

## Production-readiness 三类 layer (per §K.11.4)

- must-be-real: real WARN log + real metric counter + real Redis
  client rebuild on cross-loop call (no silent zero-cache)
- may-be-gated: first cross-loop request may pay rebuild cost
  (re-ping Redis) — observable via metric, not silent
- fully-resolves: huangheng PR #1734 sync point 7 +
  feedback_announce_equals_landed.md narrative-truth lock

## simple-stable 4 guardrail

1. 不无限扩范围 ✅ — single-function fix in
   application_runtime.py, reuses existing
   `application_cache_metrics` infra, no new public API
2. 功能做实 ✅ — real cache rebuild not a placeholder
3. 简单稳定 ✅ — fewer paths than before (one rebuild path replaces
   the silent-Noop branch), each branch logged
4. 私有化免维护 ✅ — operator can grep
   `loop_switch_rebuild` in metrics + log "rebuilding for new event
   loop" to diagnose worker setup issues

## Local gates

- 43/43 cache unit tests pass (39 existing + 4 new)
- ruff check + ruff format clean

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@earayu earayu merged commit 0711647 into main Apr 27, 2026
3 checks passed
@earayu earayu deleted the chenyexuan/celery-wave6 branch April 27, 2026 11:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant