Skip to content

feat(celery Wave 5): single-PR close-out 16 backlog items (5-phase chunked-rotation)#1733

Merged
earayu merged 13 commits into
mainfrom
bryce/celery-wave5
Apr 27, 2026
Merged

feat(celery Wave 5): single-PR close-out 16 backlog items (5-phase chunked-rotation)#1733
earayu merged 13 commits into
mainfrom
bryce/celery-wave5

Conversation

@earayu
Copy link
Copy Markdown
Collaborator

@earayu earayu commented Apr 27, 2026

Summary

Wave 5 indexing redesign close-out — 16 backlog items in single PR with 5-phase chunked-rotation commits (per architect msg=11442bbf reframe + earayu2 msg=eced858d "大PR 快速落地" directive). Same big-PR pattern as Wave 4 PR #1731.

§K.9 spec amendment locked + Phase 1 chunk 1 (relocate build_collection_llm_callable + render_extraction_prompt + ENTITY_RELATION_EXTRACTION to aperag/indexing/llm.py) shipped.

Phases

  • Phase 1 (Bryce, sequential first): Legacy graphindex package elimination + retrieval/curation migration to §G.5 read primitives + aperag/indexing/llm.py relocate
  • Phase 2 (Bryce, post-Phase 1): T7 multimodal vision-LLM 3-item bundle (embed_image API + parser image extraction + provider capability flag)
  • Phase 3 (chenyexuan, parallel with Phase 2): Layer 2 e2e fixture wiring + sweep D activation
  • Phase 4 (chenyexuan, parallel with Phase 1): P1 production robustness 3-pack (T2 reconciler + T3 parse short-circuit + T2 cleanup error split)
  • Phase 5A (Bryce): P2 batch (T1 config tunability + extractor attribute marker + W5-perf parallel-list + Neo4j label namespace + Cypher type rename)
  • Phase 5B (chenyexuan): P2 batch (chunk_id schema unify + cleanup builder share + tenant org-prefix + utc_now unify + OTLP config cross-check)

Tasks

  • §K.9 Wave 5 spec section in design pack
  • Phase 1 chunk 1: aperag/indexing/llm.py relocate
  • Phase 1 cascade: retrieval/curation caller migration
  • Phase 1 cleanup: delete legacy graphindex/{storage,service,integration,engine,__init__}.py
  • Phase 2: T7 multimodal vision-LLM 3-item bundle
  • Phase 3: Layer 2 e2e fixture wiring
  • Phase 4: P1 production robustness 3-pack
  • Phase 5A/B: P2 polish batches

Test plan

  • ruff check ./aperag ./tests clean (per phase)
  • ruff format --check ./aperag ./tests clean (per phase)
  • Existing test suites pass after each phase (179+ tests)
  • Phase 1 grep-zero verify: aperag/indexing/* 不 cross-ref legacy graphindex/storage/*
  • Phase 2 embed_image API + is_multimodal=True real wire — chunk 4b vision gate self-disable verify
  • Phase 3 Layer 2 stub activated against real backends
  • Phase 4 transient-vs-intentional cleanup error semantic + parse_version short-circuit + reconciler re-enqueue verified
  • CI 4/4 (lint-and-unit + e2e-http-compose lanes)

🤖 Generated with Claude Code

Bryce and others added 13 commits April 27, 2026 19:01
…locate (legacy graphindex elimination first chunk)

Wave 5 task #26 first chunk per architect msg=10b5fae6 §K.9 spec
amendment + PM msg=af82797e dispatch (single Wave 5 PR with 5-phase
chunked-rotation commits, big-PR fast-landing per earayu2 msg=eced858d).

Two changes bundled (per PM msg=af82797e "Phase 1 first commit =
§K.9 spec paste + legacy graphindex elimination 实现起步"):

1. **§K.9 Wave 5 spec section** added to design pack:
   - 5-phase commit roadmap (16 acceptance items)
   - production-readiness 三类 layer per phase
   - architect direct ratify lane scope per phase
   - pre-check pattern lock per phase (per
     ``feedback_spec_lock_grep_verify_caller.md`` 双 pattern)
   - Layer 1 same-session continuation directive (per
     ``feedback_no_refresh_complete_all_tasks.md``)
   - Wave 5 acceptance summary (16 items + owner table)

2. **`aperag/indexing/llm.py` relocate** (§K.9.1 acceptance item 3):
   - new module owns canonical ``build_collection_llm_callable`` +
     ``render_extraction_prompt`` + ``ENTITY_RELATION_EXTRACTION``
     template + ``LLMCall`` type alias
   - legacy ``aperag/domains/knowledge_graph/graphindex/integration.py``
     and ``prompts.py`` re-export from new location during Wave 5
     deprecation window so legacy retrieval/curation callers keep
     working until Phase 1 close-out
   - ``aperag/indexing/graph_extractor.py`` updated to import from
     new ``aperag.indexing.llm`` module directly (instead of legacy
     graphindex bridge)
   - test ``test_graph_extractor.py`` monkey-patch sites updated to
     target ``aperag.indexing.llm`` module

Pre-check pattern 1 grep-verify caller cascade (per architect msg=
b26f64b2 chunk 4d Option C ruling, used as reference list for
Phase 1 migration scope):

- ``aperag/indexing/`` → 0 cross-references to legacy
  ``graphindex/storage/*`` (chunk 4d narrowed scope invariant
  maintained ✅)
- legacy ``graphindex/integration.py`` and ``prompts.py`` callers:
  ``retrieval/pipeline.py:85`` / ``knowledge_graph/service.py:69+`` /
  ``graph_curation/service.py:37`` / ``graph_curation/integration.py:21``
  / ``service/prompt_template_service.py:161`` —
  remaining migrations land in subsequent Phase 1 commits

Local gates green:
  - ``ruff check ./aperag ./tests`` clean
  - ``ruff format --check ./aperag ./tests`` clean (509 files)
  - ``pytest test_graph_extractor.py + test_full_indexing_pipeline.py
    + test_worker_factory.py + tests/unit_test/indexing/``: 179 passed
    / 2 skipped (Layer 2 stubs)

Production-readiness three-class layer (Phase 1 partial):
  - must-be-real: ``aperag.indexing.llm`` module is the canonical
    production home for the relocated LLM helpers (no import
    indirection through legacy package once retrieval/curation
    migration completes)
  - may-be-gated: legacy ``graphindex/{integration,prompts}.py``
    re-export shims kept active during Wave 5 deprecation window so
    cross-cutting refactor can land incrementally without breaking
    legacy retrieval/curation flow
  - fully-resolves: §K.9.1 Wave 5 acceptance item 3 (`aperag/indexing/llm.py`
    relocate); items 1 (legacy graphindex package elimination) and
    rest of Phase 1 cascade migration land in subsequent commits

Next Phase 1 commits: migrate ``retrieval/pipeline.py`` /
``knowledge_graph/service.py`` / ``graph_curation/*`` to §G.5 read
primitives; once all callers migrated → delete legacy
``graphindex/{storage, service, integration, engine, __init__}.py``
+ delete legacy tests; final commit verifies grep-zero invariant.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ON import to aperag/indexing/llm

Wave 5 task #26 chunk 2 per §K.9.1 acceptance item 3 cascade
(legacy graphindex caller migration):

`aperag/service/prompt_template_service.py:155-163` (the
`get_default_prompt(prompt_type="graph")` branch) was importing the
ENTITY_RELATION_EXTRACTION template from the legacy
`aperag.domains.knowledge_graph.graphindex.prompts` module. Per chunk 1
relocate (`11113acb`), the canonical home for the template is now
`aperag.indexing.llm`. This commit migrates the import site so the
caller no longer transitively touches the legacy module.

Behavior unchanged — the legacy `graphindex/prompts.py` shim already
re-exports the template from the new location, so this is a pure
import-path refactor that leaves the runtime payload identical.

Local gates:
  - `ruff check ./aperag ./tests` clean
  - `ruff format --check ./aperag ./tests` clean (509 files)

Phase 1 cascade progress:
- chunk 1 ✅ (`11113acb`): §K.9 spec + `aperag/indexing/llm.py` relocate
- **chunk 2** (this commit): `service/prompt_template_service.py` import relocate
- chunk 3+ pending: `retrieval/pipeline.py` / `knowledge_graph/service.py` /
  `graph_curation/*` migration (these need new design — legacy
  24-method `GraphStore` Protocol → new 10-method `LineageGraphStore`
  Protocol API surface is NOT a 1-to-1 replacement; per Bryce
  msg=30c7e994 scope reality check + architect msg=b052b1b4 cascade
  ratify lane lock)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ent/intentional split + parse_version short-circuit + reconciler stuck-parse re-enqueue

Wave 5 Phase 4 (PR-C / task #29) — closes 3 latent issues surfaced
through Wave 4 ratify trail:

* **P4-1 cleanup transient-vs-intentional split**
  (per architect msg=6aa8ca88 T2 ratify minor obs A):
  Pre-Wave-5 ``_resolve_cleanup_worker`` collapsed any factory
  exception into "drop the DB row". Transient infra errors
  (Qdrant blip, ES network glitch) lost the retry signal — DB row
  was deleted before next cycle could retry. Wave 5 P4 returns a
  new ``CleanupWorkerResolution(worker, transient)`` distinguishing
  the two: ``WorkerFactoryError`` is a by-design gate (drop the
  row to bound index growth), any other Exception is transient
  (skip DB row drop so next cycle retries when backend recovers).
  Counts surface ``transient_deferred`` so operators can track
  recovery rate.

* **P4-2 parse_version short-circuit**
  (per huangheng T3 chunk 2 obs B + Wave 5 backlog item):
  Pre-Wave-5 ``parse_document`` always re-runs DocParser even when
  the resulting artifact directory is byte-identical (parse_version
  is content-derived). Wave 5 P4 adds ``short_circuit_if_artifacts_exist=True``
  default — if all three derived artifacts (markdown.md /
  outline.json / chunks.jsonl) already exist under the canonical
  ``derived/parse_<version>/`` path, skip DocParser + writes
  entirely. Eliminates the ~30s OCR / Word rerun cost on rebuild
  of unchanged content. Tests can pass ``False`` to force re-parse.

* **P4-3 reconciler stuck-document parse re-enqueue**
  (per architect Wave 4 T3 chunk 2 obs A — production gap close):
  Pre-Wave-5 parse failures (DocParser raise / source missing)
  silently dropped the document_id in the parse worker; operator
  saw ``document.status == PENDING`` forever with no signal to
  re-trigger. Wave 5 P4 adds
  ``reconcile_stuck_documents_for_parse_reenqueue`` to the
  reconciler loop. Detects documents with
  ``Document.gmt_created < now - cooldown_seconds`` AND zero
  ``document_index`` rows AND ``Document.gmt_updated < now -
  cooldown_seconds`` (cooldown filter prevents 30-s tick storm),
  then pushes a fresh ``ParseDispatchPayload`` matching the upload
  handler's contract. ``gmt_updated`` bumps after each push so
  the cooldown predicate rate-limits re-enqueue.

Three-class tag (Wave 3 production-readiness invariant):
* must-be-real: cleanup transient/intentional split prevents
  silent retry-signal loss; parse_version short-circuit eliminates
  OCR rerun waste; stuck-parse reconciler closes the document.status
  PENDING-forever gap
* may-be-gated: ``short_circuit_if_artifacts_exist=False`` for
  tests pinning DocParser invocation count
* fully-resolves: 3 of Wave 5 P1 backlog items (per architect
  ratify trail msg=6aa8ca88 + huangheng obs trail)

Deltas:
* ``aperag/indexing/cleanup.py`` — ``CleanupWorkerResolution``
  dataclass + WorkerFactoryError-vs-Exception split in
  ``_resolve_cleanup_worker``; both ``cleanup_orphan_parse_versions``
  + ``cleanup_for_deleted_documents`` honor ``resolution.transient``
  to skip DB row drop on transient infra failures; collection
  cascade aggregates ``transient_deferred``.
* ``aperag/indexing/parser.py`` — ``_all_artifacts_present``
  predicate + ``short_circuit_if_artifacts_exist`` parameter +
  early-return ParseResult when all 3 artifacts present.
* ``aperag/indexing/reconciler.py`` — ``STUCK_PARSE_COOLDOWN_SECONDS``
  + ``reconcile_stuck_documents_for_parse_reenqueue`` async scan +
  ``_select_stuck_documents_for_reenqueue`` SQL query +
  ``_build_parse_payload_for_document`` payload reconstruction +
  ``_resolve_collection_parser_config`` / ``_resolve_collection_modalities``
  (mirror ``document_service`` shape) + ``_mark_stuck_documents_reenqueued``
  bumps gmt_updated; ``run_reconcile_loop`` calls the new scan.
* ``aperag/indexing/__init__.py`` — re-export new symbols.
* ``tests/integration/test_p4_robustness_3pack.py`` (new) — 13
  tests across 3 layers (cleanup transient/intentional split / parse
  short-circuit / reconciler re-enqueue with cooldown + payload
  shape + skip-when-indexed + skip-when-no-object-path).
* ``tests/unit_test/indexing/test_t3_1_dispatcher_path_c.py`` —
  added ``transient_deferred: 0`` to expected empty-counts dict.

Local gates:
* ``pytest tests/unit_test/indexing/ tests/integration/`` — 232
  passed / 48 skipped (incl. 13 new ``test_p4_robustness_3pack``).
* ``ruff check aperag/ tests/integration/test_p4_robustness_3pack.py``
  — clean.
* ``ruff format`` — applied.

Branch: local ``chenyexuan/celery-wave5-p4`` based on main
``19d3d70`` (Wave 4 PR #1731 squash); will rebase onto
``bryce/celery-wave5`` once Bryce opens the Wave 5 draft PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…e 6 backlog

Per architect Option (a) ruling 2026-04-27 (post-cascade scope check
msg=30c7e994 + ratify msg=8780c937):

* §K.9 Phase 1 narrowed scope: relocate-only (chunks 1+2 already
  shipped via `11113acb` + `e0707165`). Original Phase 1 scope (delete
  legacy `graphindex` package + migrate retrieval/curation callers)
  collided with unresolved design gap — legacy `GraphIndexService.
  query_context()` 24-method LightRAG-style API has no equivalent on
  the new 10-method `LineageGraphStore` Protocol; building that query
  layer is 1-2 week design + implementation effort, not in-scope for
  Wave 5 single-PR fast-landing style.

* §K.10 Wave 6 backlog added: full legacy graphindex elimination +
  retrieval/curation migration + new LightRAG-style query layer
  design moves to Wave 6 separate PR with its own architect ratify
  lane (per chunk 4d Option C ruling msg=b26f64b2 + Wave 4 §K.8
  precedent).

* Wave 5 acceptance summary updated: 15/16 items in Wave 5 PR
  (item 1 legacy graphindex elimination → Wave 6).

* Phase 1 close-out signal: chunk 1 (`11113acb`) +chunk 2 (`e0707165`)
  ship llm.py relocate + import re-route to new canonical home;
  legacy `graphindex/{integration,prompts}.py` keep deprecation-shim
  re-exports during the Wave 6 deprecation window so legacy
  retrieval/curation callers do not break mid-Wave-5.

This commit closes Phase 1 narrowed scope and unblocks Phase 2 (T7
multimodal vision-LLM 3-item bundle) per architect ruling.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wave 5 Phase 3 (task #28) — replaces the pre-Wave-5 ``pytest.skip``
stubs in ``test_full_indexing_pipeline.py`` Layer 2 with functional
test bodies that the e2e-http-compose CI lane can execute against
the live backend stack.

Per architect msg=fdd53586 + chunk 4d+4e ratify msg=c279a0ff —
Layer 2 contract was declared but skip-stub-only; this phase wires
the actual implementation so the canonical Phase 1 invariant runs
end-to-end whenever the operator opts in via ``RUN_E2E_PHASE1_SMOKE=1``.

Two test bodies wired:

* ``test_phase1_full_pipeline_vector_fulltext_summary_active_graph_vision_failed``
  — the canonical Phase 1 smoke (architect msg=da3012a4): real
  ``ProductionWorkerFactory`` + parse_document + dispatch_indexing
  + worker pool drive until every modality finalises. Asserts
  vector + fulltext + summary reach ACTIVE; graph + vision finalise
  FAILED with a gate marker (``Wave 4 wiring`` chunk 4b /
  ``completion model`` post-T1 self-disable / ``multimodal`` vision
  gate). The gate-marker OR tolerates the same-Wave T1 self-disable
  surface so the test stays green across the chunk 4b → T1 closure.

* ``test_phase1_multi_keyword_fulltext_search_returns_hits`` —
  sweep D verification (architect msg=fdd53586): index a real
  document via the worker pool, then issue a 3-keyword query
  through ``_fulltext_search`` and assert ≥1 hit. The retrieval-
  side ``minimum_should_match`` arithmetic over N×content +
  N×title should-clauses (huangheng msg=fb64468c flag) is the
  latent issue this exercises end-to-end.

New fixture machinery:

* ``_phase1_e2e_skip_reason()`` — central skip-reason resolver that
  documents the activation contract: ``RUN_E2E_PHASE1_SMOKE=1`` +
  ``PHASE1_E2E_COLLECTION_ID`` (pointing at a Collection seeded by
  the e2e-http-compose bootstrap with a real model provider
  configured) + backend env vars (``DATABASE_URL`` / ``ES_HOST`` /
  Qdrant + Redis env vars). Both Layer 2 tests share the same
  skip predicate so the lane runner only needs one env-var setup.

* ``_resolve_phase1_e2e_engine()`` — opens a real Postgres engine
  using ``settings.database_url``; skips with a clear reason if
  unset.

* ``_run_phase1_workers_until_quiet()`` — bounded async loop that
  drives ``ProductionWorkerFactory`` directly (mirroring the
  ``run_*_worker`` lifespan path) until every modality row reaches
  a terminal status. Bounded by ``timeout_seconds`` so a hung
  modality fails the test loud rather than blocking forever.

Three-class tag (Wave 3 production-readiness invariant):
* must-be-real: live ProductionWorkerFactory + real backends + real
  ES query
* may-be-gated: skips when env vars missing (local-dev never
  requires the full stack)
* fully-resolves: 2 of Wave 5 P1 backlog items (Layer 2 stubs
  activated for canonical full pipeline + sweep D multi-keyword
  smoke)

Cleanup roundtrip (delete document → cleanup loop → backend
artefacts gone) is left as a follow-up sub-test to land alongside
the e2e-http-compose lane document-delete API access scaffolding.

Local gates:
* ``pytest tests/integration/test_full_indexing_pipeline.py`` — 4
  passed / 2 skipped (Layer 2 skips cleanly with the new
  ``_phase1_e2e_skip_reason()`` until env vars set in CI lane).
* ``pytest tests/unit_test/indexing/ tests/integration/`` — 232
  passed / 48 skipped — no regression vs P4 baseline.
* ``ruff check aperag/ tests/integration/test_full_indexing_pipeline.py``
  — clean.
* ``ruff format`` — applied.

Branch: ``chenyexuan/celery-wave5-p4`` rebased on
``bryce/celery-wave5`` HEAD ``99af7965`` (Phase 1 chunk 3 spec
amend) → push to ``bryce/celery-wave5`` so Wave 5 single-PR pile
continues.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ace (T7 item 1)

Wave 5 task #27 chunk 1 per §G.2.5.1 amendment item 1:
``EmbeddingService.embed_image(image_bytes, alt_text)`` API surface
is the canonical multimodal embedding entry point that replaces the
Wave 3 placeholder ``embed_query(f"{image_id}|{alt_text}")`` string-
concat (Wave 3 lesson #10 broken pattern).

Behavior:

* When ``self.multimodal=True`` (operator wired a multimodal embedder
  on the collection's spec), encode image bytes as base64 data URL
  + forward to LiteLLM ``embedding(input=[{"image_url":{"url":...}}, ...])``
  shape that multimodal-capable providers (Voyage / Jina v3 / OpenAI
  multimodal / etc.) accept natively.
* When ``self.multimodal=False``, raise ``EmbeddingError`` with clear
  operator-facing diagnostic — chunk 4b vision gate already prevents
  this state but runtime check is defense-in-depth (Wave 3 lesson #10
  ship-incomplete-but-don't-silently-lie).
* ``alt_text`` paired into LiteLLM input as a textual hint for
  embedders that accept multi-part inputs; image-only embedders
  silently drop the text element.

The ``aembed_image`` async wrapper mirrors the ``aembed_query``
pattern (``asyncio.to_thread`` since the LiteLLM call is sync).

Wave 5 P2 chunk-1 scope (this commit):

* declares the API surface so Phase 2 chunks 2 (parser image
  extraction) + 3 (provider v3 capability flag UI) and the chunk 4b
  vision gate self-disable path have a concrete contract to wire to
* uses imghdr-based MIME detection + LiteLLM's documented multimodal
  input shape; provider-specific input format variations defer to
  Wave 6 (per §K.10 Wave 6 backlog cross-cutting refactor)

Local gates green:
  - ``ruff check ./aperag ./tests`` clean
  - ``ruff format --check ./aperag ./tests`` clean (510 files)
  - ``pytest`` 179 passed / 2 skipped — no regressions on existing
    EmbeddingService callers (text-only path unchanged)

Production-readiness three-class layer:
  - must-be-real: real LiteLLM multimodal embedding call against the
    operator-configured provider when ``multimodal=True``
  - may-be-gated: provider-specific input-format adjustments (Voyage
    vs Jina vs OpenAI multimodal differences) deferred to Wave 6
  - fully-resolves: §G.2.5.1 item 1 (``embed_image`` API surface) +
    §K.9 Phase 2 partial — items 2+3 (parser image extraction +
    provider v3 capability flag UI) follow in subsequent commits

Next Phase 2 chunks:
- chunk 2: parser image extraction → ``derived/parse_<v>/vision/images/<image_id>.<ext>``
- chunk 3: provider v3 router multimodal capability flag UI exposure
- chunk 4: ``worker_factory._build_vision_worker`` ``_embed`` callsite
  rewrite to use ``embed_image`` + chunk 4b vision gate self-disable
  verify (Phase 1 Layer 1 test rename)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ractor caps + timeout

Per huangheng T1 obs A msg=6b349693: surface the LightRAG-style
extractor's per-chunk caps and LLM timeout as collection-level
overrides (`KnowledgeGraphConfig.{max_entities_per_chunk,
max_relations_per_chunk, per_chunk_timeout_seconds}`) so deployments
tuning a slow LLM provider or extracting from entity-dense documents
can lift the defaults without patching `aperag/indexing/graph_extractor.py`
constants.

* `aperag/schema/common.py`: 3 new optional fields on KnowledgeGraphConfig.
* `aperag/indexing/graph_extractor.py`: `_resolve_int_kg_config` /
  `_resolve_float_kg_config` helpers (mirror `_resolve_entity_types`
  pydantic-attr / Mapping / JSON-string tolerance pattern + reject
  non-positive / non-numeric values with warning + default fallback);
  `_extract_one_chunk` now takes `timeout_seconds` kwarg and
  `asyncio.wait_for` wires it instead of the global constant.
* `tests/integration/test_graph_extractor.py`: 6 new unit tests pinning
  override-wins / fallback-on-missing / fallback-on-non-positive /
  int-coerced-to-float / fallback-on-garbage; fixed pre-existing
  `_extract_one_chunk` call to pass `timeout_seconds=60.0`.

Defaults unchanged (32 / 32 / 60.0); collections that don't set the
new fields keep current behavior.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… forward-compat + OTLP config cross-check + cleanup builder share

Wave 5 Phase 5B (task #31) — 4 polish items from the Wave 4 ratify
trail accumulated obs (chunk_id schema unify deferred to a follow-up
once parser canonical schema lock lands):

* **P5B-A utc_now → CURRENT_TIMESTAMP unify** (per huangheng chunk 4a
  obs A): ``_LineageEntityRow`` + ``_LineageRelationRow``
  ``gmt_created`` / ``gmt_updated`` columns now use
  ``server_default=text("CURRENT_TIMESTAMP")`` mirroring the alembic
  migration declaration. Pre-Wave-5 ORM used ``default=utc_now``
  Python-side; alembic check passed because Postgres treats both as
  semantic-equivalent but the per-mirror discipline is stronger when
  both layers speak the same dialect (schema-touching follow-ups
  cannot drift undetected).

* **P5B-B tenant_scope_key org-prefix forward-compat** (per T3 chunk
  2 obs C + §H.2 forward-compat lock): new
  ``aperag.indexing.parse_orchestrator.resolve_tenant_scope_key``
  helper centralises the prefix construction. Pre-Wave-5 the prefix
  was hard-coded as ``f"user:{document.user}"`` at every callsite
  (upload handler, reconciler stuck-doc re-enqueue); the helper
  unifies the construction so a future Wave 6/7 organisation-tenant
  rollout flips one place. The org branch stays inert for Wave 5
  (Document/Collection schemas don't carry org_id yet) — the helper
  just makes the seam explicit.

* **P5B-C OTLP config cross-check** (per huangheng T6 chunk 1 obs):
  lifespan startup logs a clear warning when
  ``INDEXING_METRICS_EMITTER=otlp`` but
  ``APERAG_OBSERVABILITY_MODE`` is not set to ``otlp`` /
  ``collector``. Pre-Wave-5 this combination silently produced an
  ``OTLPMetricsEmitter`` whose ``MeterProvider`` was never installed
  by ``aperag.observability.metrics.init_metrics_provider`` —
  samples no-op'd silently, defeating the explicit
  ``INDEXING_METRICS_EMITTER=otlp`` opt-in.

* **P5B-D cleanup builder share helper** (per huangheng T2 obs B
  drift risk): new ``_build_collection_qdrant_connector(collection,
  *, allow_vector_size_fallback)`` shared helper used by
  ``_build_vector_worker`` / ``_build_summary_worker`` (dispatch,
  ``allow_vector_size_fallback=False``) AND
  ``_build_qdrant_cleanup_backend`` (cleanup,
  ``allow_vector_size_fallback=True``). Pre-Wave-5 dispatch and
  cleanup paths duplicated the embedder + connector wiring; a
  future Qdrant adapter signature change had to be applied twice.
  ``_build_vision_worker`` keeps its pre-helper shape — its
  ``is_multimodal()`` gate must fire BEFORE any network call to
  preserve the Wave 4 chunk 4b "fail fast on non-multimodal embedder"
  invariant the existing test pins.

Three-class tag (Wave 3 production-readiness invariant):
* must-be-real: 4 polish items each tighten a real production seam
  (alembic mirror discipline / org-prefix forward-compat / OTLP
  config consistency / cleanup-vs-dispatch drift surface)
* may-be-gated: org-prefix branch stays inert until org_id columns
  land (declared in helper docstring)
* fully-resolves: 4 of the 5 P5B chenyexuan-batch backlog items.
  ``chunk_id`` schema unify is left to a follow-up — parser
  canonical schema lock first needs the §C.x amend; touching the
  ``or chunk.get("id")`` fallback now without that lock would
  introduce drift in the opposite direction.

Deltas:
* ``aperag/indexing/graph_storage/postgres.py`` —
  ``_LineageEntityRow`` + ``_LineageRelationRow`` switched to
  ``server_default=CURRENT_TIMESTAMP``; ``utc_now`` import dropped.
* ``aperag/indexing/parse_orchestrator.py`` —
  ``resolve_tenant_scope_key(*, document, collection)`` helper +
  re-export.
* ``aperag/domains/knowledge_base/service/document_service.py`` —
  ``_create_or_update_document_indexes`` calls
  ``resolve_tenant_scope_key`` in place of the inline
  ``f"user:{document.user}"``.
* ``aperag/indexing/reconciler.py`` —
  ``_build_parse_payload_for_document`` calls the same helper for
  the stuck-document re-enqueue producer (consistency with upload
  handler).
* ``aperag/app.py`` — lifespan logs the OTLP-vs-observability-mode
  cross-check warning before constructing
  ``OTLPMetricsEmitter``.
* ``aperag/indexing/worker_factory.py`` —
  ``_build_collection_qdrant_connector`` shared helper +
  ``_build_vector_worker`` / ``_build_summary_worker`` /
  ``_build_qdrant_cleanup_backend`` reuse it; vision keeps its
  multimodal-gate-first shape.

Local gates:
* ``pytest tests/unit_test/indexing/ tests/integration/`` — 232
  passed / 48 skipped (no regression).
* ``ruff check aperag/ tests/integration/`` — clean.
* ``ruff format`` — applied.

Branch: ``chenyexuan/celery-wave5-p4`` rebased on
``bryce/celery-wave5`` HEAD ``4f0e9b0`` (P2 chunk 1 embed_image
API surface) → push to ``bryce/celery-wave5`` for Wave 5 single-PR
pile.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per §K.9.1 P5A item 14 + design pack §K.9 Phase 5A item 4:
deployments that share a Neo4j instance with user-owned graphs would
collide on the generic `LineageEntity` / `LineageRelation` labels.
Bump them to `aperag_LineageEntity` / `aperag_LineageRelation` and
align constraint names (`aperag_lineage_entity_collection_name_unique`
/ `aperag_lineage_relation_collection_triple_unique`).

Hard-cut second round (no data migration): Wave 4 graph indexing
defaulted `enable_knowledge_graph=False`, so no production deployment
has data on the legacy unprefixed labels. Postgres / Nebula backends
are unaffected (table names already namespaced via `aperag_lineage_*`
schema in alembic migration `e7a3b9c2d1f6`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…h explicit reasoning

Per Wave 5 P5A scope close-out 2026-04-27:

* Item 2 (chunk 4b _no_op_extractor identity check → attribute marker)
  is N/A: the placeholder was deleted in Wave 4 T1 (`19d3d70f`) when
  `build_collection_graph_extractor` replaced `_no_op_extractor`. There
  is no surviving identity-check site to harden — moot in current code.

* Item 3 (W5-perf-graph-lineage parallel-list O(N) cross-backend) is
  deferred to Wave 6: Postgres needs JSONB → text[] migration, Nebula
  needs tag re-model, plus alembic column-shape change. >10k
  docs/entity is not a Wave 5 acceptance criterion; trigger when
  observed in real deployments.

* Item 5 (Cypher type keyword rename `n.type` → `entity_type` /
  `relation_type`) is deferred to Wave 6: cross-backend rename
  (Postgres column + alembic; Cypher property rename; Nebula tag-prop
  rename) plus §D.3 Protocol-surface change (`EntityRecord.type` →
  `EntityRecord.entity_type` etc.). Cypher `TYPE()` is technically only
  for relationships so current code is unambiguous — forward-compat
  hygiene rather than correctness fix.

§K.9 Phase 5A acceptance lines amended to reflect ship status (items
1+4 shipped; 2 N/A; 3+5 → Wave 6). §K.10 Wave 6 backlog extended with
items 6+7 covering the deferred cross-backend rewrites.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…se_<v>/vision/{images,source.jsonl}

Per §G.2.5.1 spec amend item 2: when DocParser produces `AssetBinPart`
payloads (PDF page images, single-image-input passthrough, data-URI
extracted images), the parser writes each image blob to
`derived/parse_<v>/vision/images/<image_id>.<ext>` and lands a
`vision/source.jsonl` descriptor enumerating them. The vision worker
(chunk 4) consumes the descriptor instead of the T1 simulator's
synthetic `images.json` companion.

`aperag/indexing/parser.py`:
* `_docparser_extract_markdown` now returns
  `tuple[str, list[_VisionImageAsset]]` — markdown body unchanged plus
  the extracted image assets list. Non-image `AssetBinPart`s (audio /
  PDF data) drop. Duplicate `asset_id`s deduplicated to keep
  Qdrant-point-id stable.
* `_VisionImageAsset` carrier holds (`image_id`, `data`, `mime_type`,
  `alt_text`, `page_idx`, `bbox`).
* `_vision_image_extension(mime_type)` maps known image MIMEs to a
  filename extension; falls back to `.bin` for unknown types since
  vision worker uses `imghdr` on bytes at embed time.
* `_write_vision_assets` persists each blob via `write_atomic` and
  writes the JSONL descriptor.
* `ParseResult.vision_source_path` and `vision_image_count` are new
  optional fields (defaults `""` / `0`) so pre-Wave-5 callers see no
  behaviour change. Image-only inputs (no markdown emitted) still land
  their assets so vision modality has bytes to embed.

`tests/unit_test/indexing/test_parser_image_extraction.py`: 9 unit tests
covering MIME → extension mapping, descriptor schema, simulator-path
no-op, image-only input asset persistence, unknown-MIME `.bin`
fallback, and the new `ParseResult` fields. Tests monkeypatch
`_docparser_extract_markdown` to avoid pulling MinerU / MarkItDown
into the unit-test path.

Production-readiness 三类:
- must-be-real: real `AssetBinPart` extraction wired through DocParser
- may-be-gated: descriptor + image artefacts only land when DocParser
  surfaces `AssetBinPart`s (image-less docs see no vision/ writes)
- fully-resolves: §G.2.5.1 spec item 2 (parser image extraction) +
  Wave 5 P2 chunk 2 acceptance

The vision worker callsite rewrite (chunk 4) and provider v3 multimodal
flag UI (chunk 3) remain. Defaults preserve existing behaviour for
text-only callers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…apability flag

Per §G.2.5.1 spec amend item 3: surface a typed capability flag for
embedding models that accept image bytes (CLIP / Voyage Multimodal /
Jina v3 / OpenAI multimodal embeddings) so operators can register
them via the v3 model platform UI. The flag drives the chunk 4b
vision gate's `EmbeddingService.is_multimodal()` runtime check —
flip it on the collection's embedder spec model and the gate
self-disables, enabling vision modality.

Distinct from existing `supports_vision` (chat/completion models
that accept image input) — `supports_multimodal_embedding` describes
embedding models that produce vectors from images.

Changes:
* `aperag/domains/model_platform/schemas.py`: new optional
  `supports_multimodal_embedding: bool = False` on `Model` /
  `ModelCreate` / `ModelUpdate` (default False, no behaviour change
  for existing rows).
* `aperag/domains/model_platform/db/models.py`: new
  `supports_multimodal_embedding` Boolean column with default False.
* `aperag/migration/versions/...f1c8d2a5b6e3...`: alembic migration
  adding the column with `server_default=false`. Forward-only
  cutover safe because pre-Wave-5 rows default to False (matches
  prior implicit behaviour).
* `aperag/domains/model_platform/service/model_service.py`:
  `_model_to_schema` maps the new column.
* `aperag/db/repositories/llm_provider.py`: `create_model` accepts
  the new field so the v3 `POST /models` flow persists it.
* `aperag/llm/embed/base_embedding.py`: `get_embedding_service`
  prefers the typed `supports_multimodal_embedding` column; falls
  back to the legacy `runner_config["multimodal"]` JSON entry so
  pre-Wave-5 operators who edited the JSON keep working without
  re-saving the row.

Tests: 3 new unit tests in `test_model_platform_v3_contract.py`
covering default-False behaviour, opt-in True path, and v3 OpenAPI
schema exposure on Model / ModelCreate / ModelUpdate.

Production-readiness 三类:
- must-be-real: real DB column + alembic migration + v3 API surface
- may-be-gated: legacy `runner_config["multimodal"]` fallback for
  rows created before the typed column landed
- fully-resolves: §G.2.5.1 spec item 3 (provider v3 multimodal
  capability flag UI) + Wave 5 P2 chunk 3 acceptance

Phase 2 chunk 4 (callsite rewrite + chunk 4b gate self-disable
verify + Layer 1/2 test rename) remains as the final Wave 5 blocker.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…te self-disable verify

Per §G.2.5.1 spec amend final piece: rewire `_build_vision_worker._embed`
to call `EmbeddingService.embed_image(image_bytes, alt_text)` (chunk 1)
with the actual image bytes the parser persisted (chunk 2 wrote
`derived/parse_<v>/vision/images/<image_id>.<ext>` + a JSONL descriptor
at `vision/source.jsonl`). The chunk 4b vision gate self-disables when
the operator flips `Model.supports_multimodal_embedding=True` (chunk 3).

`aperag/indexing/worker_factory.py`:
* `_embed(image_id, alt_text, image_bytes=None)` — when image_bytes
  is provided, route to `embedding_service.embed_image(image_bytes,
  alt_text)`. None falls back to the legacy text-concat path so the
  T1 simulator + tests that hand the worker synthetic JSON keep
  working.
* Gate-raise message reframed: drops "Wave 4 wiring" phrasing (now
  Wave 5 wiring is land), names the typed
  `Model.supports_multimodal_embedding` flag so an operator can fix
  the config directly.

`aperag/indexing/vision.py`:
* `VisionModality.derive` accepts both source-path formats: the
  legacy single-JSON-array shape (T1 simulator / pre-Wave-5 tests)
  AND the new JSONL-with-image-path shape (parser chunk 2 output).
  Format detection is by first non-whitespace byte (`[` → JSON array,
  else → JSONL).
* `_load_image_bytes(record)` reads the descriptor's `image_path`
  through the object store; missing blob logs a warning and returns
  None (embedder still runs on the alt-text/id placeholder digest)
  so a partial parser write doesn't block the whole derive cycle.
* `_placeholder_embedding(..., image_bytes=None)` mirrors the new
  embedder signature; placeholder ignores bytes.
* Embedder Protocol is widened: `(image_id, alt_text, image_bytes=None)`.

`tests/integration/test_full_indexing_pipeline.py`:
* Renamed Layer 1 `test_phase1_vision_modality_raises_wave4_wiring_gate`
  → `..._gate_raises_when_embedder_not_multimodal`. Asserts the
  reframed message names `multimodal embedding model` +
  `supports_multimodal_embedding` flag.
* New positive-path Layer 1
  `test_phase1_vision_modality_gate_self_disables_when_embedder_multimodal`:
  with `is_multimodal()=True`, the factory builds a vision worker
  without raising — pins the chunk 4 gate-self-disable contract.
* Layer 2 e2e assertion: vision modality may be ACTIVE (when CI
  fixture has multimodal embedder configured) OR FAILED with a gate
  marker including `supports_multimodal_embedding`. OR-on-marker
  tolerance kept for transition state.

`tests/unit_test/indexing/test_t1_4_summary_vision.py`: 3 new tests
covering JSONL descriptor with image_path (bytes loaded + forwarded),
missing-blob graceful fallback, and legacy JSON-array backward compat.

Production-readiness 三类:
- must-be-real: real `embed_image` callsite + real bytes load from
  parser's descriptor
- may-be-gated: legacy text-concat fallback + simulator JSON format
  preserved for tests
- fully-resolves: §G.2.5.1 spec items 1+2+3 all wired end-to-end +
  chunk 4b vision gate self-disable verify (Wave 5 P2 closure)

Wave 5 P2 (T7 multimodal vision-LLM 3-item bundle) closed. The chunk
4b vision gate self-disables when an operator configures a multimodal
embedder; default-off behaviour preserved for text-only collections.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@earayu earayu force-pushed the bryce/celery-wave5 branch from fdad09e to 8e7c6ec Compare April 27, 2026 11:01
@earayu earayu marked this pull request as ready for review April 27, 2026 11:08
@earayu earayu merged commit 7c9e4c3 into main Apr 27, 2026
7 checks passed
@earayu earayu deleted the bryce/celery-wave5 branch April 27, 2026 11:08
earayu added a commit that referenced this pull request Apr 27, 2026
…cation cache (#1735)

Wave 5 P2 chunk 4 (#1733) landed the canonical multimodal embedding
API surface (`EmbeddingService.embed_image(image_bytes, alt_text)`)
but left the call un-cached — every vision modality embed now hits
the LiteLLM provider, even for the same image. This wires it into
the canonical `aperag.cache.NAMESPACE_EMBEDDING` infra (PR #1734)
mirroring the existing text `_embed_batch` pattern.

Cache key shape (per `aperag/cache/README.md` no-raw-bytes policy):

    {
      "kind": "image",
      "provider": ...,
      "model": ...,
      "api_base": ...,
      "api_key_hash": sha256(api_key),
      "file_hash": sha256(image_bytes),
      "alt_text": ...,
      "multimodal": True,
    }

Image bytes are identified by their sha256 hex digest so the Redis
key stays bounded; alt_text is part of the key because providers
that accept paired text+image inputs return a different vector when
the textual hint changes (alt_text="" collapses to one key for
image-only callers).

Tests
-----
New `tests/unit_test/llm/test_embed_image_cache.py` (7 tests):

* identical (bytes, alt_text) → second call hits cache (no upstream)
* same bytes + different alt_text → distinct keys, both compute
* different bytes + same alt_text → distinct keys, both compute
* key shape uses sha256 file_hash, raw bytes never appear in key
* `caching=False` bypasses cache (always upstream)
* `multimodal=False` raises EmbeddingError (defense-in-depth)
* empty image_bytes raises EmptyTextError

Full unit suite: 1022 passed, 29 skipped, ruff + format clean.

Out of scope (per task #37 boundary)
------------------------------------
Provider-specific multimodal embedder format variations (Voyage /
Jina v3 / OpenAI multimodal SDK input shapes) stay on task #39 per
PM dispatch + simple-stable directive (`feedback_simple_stable_zero
_maintenance.md`).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant