feat(celery Wave 5 P2 chunk 4): vision callsite rewrite + chunk 4b gate self-disable verify

earayu · claude · earayu · commit fdad09e98e47 · 2026-04-27T19:00:16.000+08:00
Final piece of the §G.2.5.1 spec amendment: rewire `_build_vision_worker._embed`
to call `EmbeddingService.embed_image(image_bytes, alt_text)` (chunk 1)
with the actual image bytes the parser persisted (chunk 2 writes
`derived/parse_<v>/vision/images/<image_id>.<ext>` plus a JSONL descriptor
at `vision/source.jsonl`). The chunk 4b vision gate self-disables when
the operator flips `Model.supports_multimodal_embedding=True` (chunk 3).

`aperag/indexing/worker_factory.py`:
* `_embed(image_id, alt_text, image_bytes=None)` — when image_bytes
  is provided, route to `embedding_service.embed_image(image_bytes,
  alt_text)`. `image_bytes=None` falls back to the legacy text-concat path so the
  T1 simulator + tests that hand the worker synthetic JSON keep
  working.
* Gate-raise message reframed: drops the "Wave 4 wiring" phrasing (the
  Wave 5 wiring has now landed) and names the typed
  `Model.supports_multimodal_embedding` flag so an operator can fix
  the config directly.
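
A minimal sketch of the `_embed` routing described above, assuming a
hypothetical `StubEmbeddingService` in place of the real
`EmbeddingService` (the stub's return values are invented purely so the
two paths are distinguishable):

```python
from typing import Optional


class StubEmbeddingService:
    """Hypothetical stand-in for EmbeddingService (illustration only)."""

    def embed_query(self, text: str) -> list[float]:
        # Fake vector: the length of the text surrogate.
        return [float(len(text))]

    def embed_image(self, image_bytes: bytes, alt_text: str = "") -> list[float]:
        # Fake vector: byte count and alt-text length.
        return [float(len(image_bytes)), float(len(alt_text))]


def make_embed(service):
    """Build an embedder closure with the widened three-argument signature."""

    def _embed(image_id: str, alt_text: str, image_bytes: Optional[bytes] = None) -> list[float]:
        if image_bytes is None:
            # Legacy simulator path: no bytes, embed the text surrogate.
            return service.embed_query(f"{image_id}|{alt_text}")
        # Production path: real bytes go to the multimodal API.
        return service.embed_image(image_bytes=image_bytes, alt_text=alt_text)

    return _embed


embed = make_embed(StubEmbeddingService())
legacy_vector = embed("img-1", "a cat")
image_vector = embed("img-1", "a cat", b"\x89PNG")
```

Keeping `image_bytes` optional is what lets the simulator fixtures call
the same closure without standing up real image blobs.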

`aperag/indexing/vision.py`:
* `VisionModality.derive` accepts both source-path formats: the
  legacy single-JSON-array shape (T1 simulator / pre-Wave-5 tests)
  AND the new JSONL-with-image-path shape (parser chunk 2 output).
  Format detection is by first non-whitespace byte (`[` → JSON array,
  else → JSONL).
* `_load_image_bytes(record)` reads the descriptor's `image_path`
  through the object store; missing blob logs a warning and returns
  None (embedder still runs on the alt-text/id placeholder digest)
  so a partial parser write doesn't block the whole derive cycle.
* `_placeholder_embedding(..., image_bytes=None)` mirrors the new
  embedder signature; placeholder ignores bytes.
* Embedder Protocol is widened: `(image_id, alt_text, image_bytes=None)`.

`tests/integration/test_full_indexing_pipeline.py`:
* Renamed Layer 1 `test_phase1_vision_modality_raises_wave4_wiring_gate`
  → `..._gate_raises_when_embedder_not_multimodal`. Asserts the
  reframed message names `multimodal embedding model` +
  `supports_multimodal_embedding` flag.
* New positive-path Layer 1
  `test_phase1_vision_modality_gate_self_disables_when_embedder_multimodal`:
  with `is_multimodal()=True`, the factory builds a vision worker
  without raising — pins the chunk 4 gate-self-disable contract.
* Layer 2 e2e assertion: vision modality may be ACTIVE (when the CI
  fixture has a multimodal embedder configured) OR FAILED with a gate
  marker including `supports_multimodal_embedding`. OR-on-marker
  tolerance is kept for the transition state.
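
The Layer 2 tolerance can be read as a small predicate. The sketch
below is hypothetical: the helper name and the literal status strings
`"ACTIVE"` / `"FAILED"` are assumptions for illustration (the test
itself compares against `IndexStatus` values); the marker tuple is the
one used in the test:

```python
from typing import Optional

GATE_MARKERS = (
    "multimodal",
    "completion model",
    "supports_multimodal_embedding",
    "Wave 4 wiring",
)


def vision_outcome_ok(status: str, is_serving: bool, error_message: Optional[str]) -> bool:
    """Accept ACTIVE-and-serving, or FAILED with a recognisable gate marker."""
    if status == "ACTIVE":  # assumed string value
        return is_serving
    if status == "FAILED":  # assumed string value
        msg = error_message or ""
        return any(marker in msg for marker in GATE_MARKERS)
    return False


active_ok = vision_outcome_ok("ACTIVE", True, None)
gated_ok = vision_outcome_ok("FAILED", False, "set Model.supports_multimodal_embedding=True")
silent_fail = vision_outcome_ok("FAILED", False, "connection reset")
```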

`tests/unit_test/indexing/test_t1_4_summary_vision.py`: 3 new tests
covering JSONL descriptor with image_path (bytes loaded + forwarded),
missing-blob graceful fallback, and legacy JSON-array backward compat.
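
The three scenarios those tests cover can be sketched against a
dict-backed store. This is an assumption-laden illustration: the plain
`dict` stands in for the object store, and `load_image_bytes` is a
hypothetical free-function version of `_load_image_bytes`:

```python
import logging
from typing import Optional

logger = logging.getLogger("vision.sketch")


def load_image_bytes(store: dict[str, bytes], record: dict) -> Optional[bytes]:
    """Return the blob at record['image_path'], or None when unavailable."""
    image_path = record.get("image_path")
    if not image_path:
        # Legacy simulator record: no image_path field at all.
        return None
    body = store.get(str(image_path))
    if body is None:
        # Partial parser write: warn and let the caller fall back.
        logger.warning("image_path %s referenced in descriptor but blob missing", image_path)
        return None
    return body


store = {"derived/parse_v1/vision/images/a.png": b"\x89PNG"}
loaded = load_image_bytes(store, {"image_id": "a", "image_path": "derived/parse_v1/vision/images/a.png"})
missing = load_image_bytes(store, {"image_id": "b", "image_path": "derived/parse_v1/vision/images/b.png"})
legacy = load_image_bytes(store, {"image_id": "c", "alt_text": "no path"})
```

Returning `None` instead of raising keeps one missing blob from
failing the whole derive cycle.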

Production-readiness, three categories:
- must-be-real: real `embed_image` callsite + real bytes load from
  parser's descriptor
- may-be-gated: legacy text-concat fallback + simulator JSON format
  preserved for tests
- fully-resolves: §G.2.5.1 spec items 1+2+3 all wired end-to-end +
  chunk 4b vision gate self-disable verify (Wave 5 P2 closure)

Wave 5 P2 (T7 multimodal vision-LLM 3-item bundle) closed. The chunk
4b vision gate self-disables when an operator configures a multimodal
embedder; default-off behaviour preserved for text-only collections.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
diff --git a/aperag/indexing/vision.py b/aperag/indexing/vision.py
@@ -112,7 +112,21 @@ def points_for_document(self, document_id: str, parse_version: str | None = None
         return sorted(out, key=lambda r: r["point_id"])
 
 
-def _placeholder_embedding(image_id: str, alt_text: str, dim: int = SIMULATOR_VISION_EMBEDDING_DIM) -> list[float]:
+def _placeholder_embedding(
+    image_id: str,
+    alt_text: str,
+    image_bytes: bytes | None = None,
+    dim: int = SIMULATOR_VISION_EMBEDDING_DIM,
+) -> list[float]:
+    """Deterministic synthetic embedding for tests + the simulator.
+
+    Wave 5 P2 chunk 4: the optional ``image_bytes`` parameter mirrors
+    the production embedder signature (real multimodal embedders consume
+    bytes). The placeholder ignores it — the digest is over
+    ``image_id|alt_text`` so re-running the simulator produces the same
+    vector for the same record.
+    """
+    del image_bytes  # placeholder ignores actual bytes
     digest = hashlib.sha256(f"{image_id}|{alt_text}".encode("utf-8")).digest()
     repeat = (dim + len(digest) - 1) // len(digest)
     expanded = (digest * repeat)[:dim]
@@ -144,11 +158,21 @@ async def derive(
     ) -> DeriveResult:
         """Extract images, run vision-LLM, persist manifest atomically.
 
-        T1 simulator contract: ``source_path`` is a JSON file in the
-        object store containing ``[{"image_id": ..., "alt_text": ...,
-        "page_idx": int|None, "bbox": [...]|None}, ...]``. The real
-        T2.x pipeline replaces this with PDF extraction; the
-        ``manifest.jsonl`` schema is the contract that must not change.
+        Two source-path formats are supported (the production parser
+        wrote the second one starting Wave 5 P2 chunk 2):
+
+        * **Legacy simulator** (single JSON array): a JSON list at
+          ``source_path`` holding ``[{"image_id": ..., "alt_text": ...,
+          "page_idx": int|None, "bbox": [...]|None}, ...]``. No image
+          bytes — the embedder gets ``image_bytes=None`` and falls back
+          to the alt-text/id placeholder digest.
+        * **Wave 5 production descriptor** (JSONL with ``image_path``):
+          one record per line at ``derived/parse_<v>/vision/source.jsonl``
+          with ``{"image_id", "image_path", "mime_type", "alt_text",
+          "page_idx", "bbox"}``. The worker loads each image's bytes
+          from ``image_path`` and hands them to the embedder so a real
+          multimodal embedding model can produce a real visual vector
+          (Wave 5 P2 chunk 4 callsite rewrite).
         """
         body = read_or_none(self._store, source_path)
         if body is None:
@@ -158,12 +182,7 @@ async def derive(
             )
             return DeriveResult(derived_artifact_path="")
 
-        try:
-            image_records = json.loads(body)
-        except json.JSONDecodeError as exc:
-            raise ValueError(f"vision.derive expected JSON list at {source_path}, got non-JSON: {exc}") from exc
-        if not isinstance(image_records, list):
-            raise ValueError(f"vision.derive expected JSON list of image records, got {type(image_records).__name__}")
+        image_records = self._parse_descriptor(body, source_path=source_path)
 
         parts = source_path.split("/")
         try:
@@ -179,7 +198,8 @@ async def derive(
         for record in image_records:
             image_id = record["image_id"]
             alt_text = record.get("alt_text", "")
-            embedding = self._embedder(image_id, alt_text)
+            image_bytes = self._load_image_bytes(record)
+            embedding = self._embedder(image_id, alt_text, image_bytes=image_bytes)
             entry = {
                 "image_id": image_id,
                 "alt_text": alt_text,
@@ -251,6 +271,60 @@ async def sync(
                 payload=payload,
             )
 
+    def _parse_descriptor(self, body: bytes, *, source_path: str) -> list[dict]:
+        """Decode the source-image descriptor file.
+
+        Tolerates both the legacy single-JSON-array shape and the
+        Wave 5 P2 chunk 2 JSONL-with-image-path shape (parser writes
+        the latter). The first non-whitespace character of the body
+        picks the format: ``[`` means a JSON array, anything else is JSONL.
+        """
+        text = body.decode("utf-8")
+        stripped = text.lstrip()
+        if stripped.startswith("["):
+            try:
+                records = json.loads(text)
+            except json.JSONDecodeError as exc:
+                raise ValueError(f"vision.derive expected JSON list at {source_path}, got non-JSON: {exc}") from exc
+            if not isinstance(records, list):
+                raise ValueError(f"vision.derive expected JSON list of image records, got {type(records).__name__}")
+            return [record for record in records if isinstance(record, dict)]
+        records: list[dict] = []
+        for line in text.splitlines():
+            if not line.strip():
+                continue
+            try:
+                record = json.loads(line)
+            except json.JSONDecodeError as exc:
+                raise ValueError(f"vision.derive expected JSONL at {source_path}, malformed line: {exc}") from exc
+            if not isinstance(record, dict):
+                raise ValueError(f"vision.derive expected JSONL records to be objects, got {type(record).__name__}")
+            records.append(record)
+        return records
+
+    def _load_image_bytes(self, record: dict) -> bytes | None:
+        """Load image bytes from the descriptor's ``image_path`` if
+        present.
+
+        Returns ``None`` for legacy simulator records (no ``image_path``
+        field) so the embedder falls back to its non-bytes path. A
+        missing-blob error logs but doesn't raise — the embedder still
+        runs with ``image_bytes=None`` so a partial parser write doesn't
+        block the whole derive cycle.
+        """
+        image_path = record.get("image_path")
+        if not image_path:
+            return None
+        body = read_or_none(self._store, str(image_path))
+        if body is None:
+            logger.warning(
+                "vision.derive: image_path %s referenced in descriptor but blob missing; "
+                "embedder falls back to text-only path",
+                image_path,
+            )
+            return None
+        return body
+
 
 __all__ = [
     "VisionModality",
diff --git a/aperag/indexing/worker_factory.py b/aperag/indexing/worker_factory.py
@@ -442,23 +442,28 @@ def _build_vision_worker(*, collection: Any, object_store: Any) -> ModalityWorke
     embedding_service, vector_size = get_collection_embedding_service_sync(collection)
     if not embedding_service.is_multimodal():
         raise WorkerFactoryError(
-            "vision modality requires a real multimodal vision-LLM (Wave 4 wiring); "
-            "current text-only embedder produces fake string-concat vision vectors — "
-            "set collection.config.enable_vision=false until Wave 4 lands "
-            "OR configure a multimodal embedding model on the collection's embedding spec"
+            "vision modality requires a multimodal embedding model "
+            "(set Model.supports_multimodal_embedding=True on the collection's "
+            "embedder spec — Voyage Multimodal / Jina v3 / OpenAI multimodal / etc.) "
+            "OR set collection.config.enable_vision=false to keep the modality off"
         )
     qdrant_collection = generate_vector_db_collection_name(collection.id)
     adaptor = get_vector_db_connector(qdrant_collection, vector_size=vector_size)
     backend = _QdrantPointBackend(connector=adaptor.connector)
 
-    def _embed(image_id: str, alt_text: str) -> list[float]:
-        # ``is_multimodal()`` gate above only verifies that operators
-        # explicitly opted into a multimodal embedder. The body is
-        # still the Wave 3 string-concat placeholder until T7 replaces
-        # it with a real image-bytes path (load image from object
-        # store → multimodal embed); ``embed_query`` of an alt-text
-        # surrogate is not actual visual indexing.
-        return embedding_service.embed_query(f"{image_id}|{alt_text}")
+    def _embed(image_id: str, alt_text: str, image_bytes: bytes | None = None) -> list[float]:
+        # Wave 5 P2 chunk 4: real callsite rewrite — load the actual
+        # image bytes from the parser's descriptor (chunk 2 writes
+        # ``vision/source.jsonl`` with ``image_path``) and call the
+        # multimodal-aware ``embed_image`` API surface (chunk 1).
+        # ``image_bytes=None`` only happens on the legacy simulator
+        # path used by tests — fall back to the text-only embedding so
+        # those fixtures keep working without standing up real image
+        # bytes. The chunk 4b gate above already prevents a non-multi-
+        # modal embedder from reaching this body.
+        if image_bytes is None:
+            return embedding_service.embed_query(f"{image_id}|{alt_text}")
+        return embedding_service.embed_image(image_bytes=image_bytes, alt_text=alt_text)
 
     return VisionModality(backend=backend, store=object_store, embedder=_embed)
 
diff --git a/tests/integration/test_full_indexing_pipeline.py b/tests/integration/test_full_indexing_pipeline.py
@@ -211,18 +211,19 @@ async def _run() -> None:
         engine.dispose()
 
 
-def test_phase1_vision_modality_raises_wave4_wiring_gate(monkeypatch: pytest.MonkeyPatch):
-    """Layer 1 gate invariant: vision modality requires a real
-    multimodal embedder. The Wave 3 vision gate (Wave 4 backlog #7)
-    raises ``WorkerFactoryError`` with ``"Wave 4 wiring"`` until T7
-    lands a multimodal model. Phase 1 smoke pins this — Phase 2 (after
-    T7) flips it to ACTIVE assertion.
+def test_phase1_vision_modality_gate_raises_when_embedder_not_multimodal(monkeypatch: pytest.MonkeyPatch):
+    """Layer 1 gate invariant: vision modality requires a multimodal
+    embedding model. When ``EmbeddingService.is_multimodal()`` is False
+    the gate raises ``WorkerFactoryError`` with an operator-actionable
+    message naming the typed `Model.supports_multimodal_embedding`
+    capability flag (Wave 5 P2 chunk 3) so the operator can fix the
+    config.
+
+    Wave 5 P2 chunk 4 reframed the message — it no longer claims
+    "Wave 4 wiring" since the multimodal pieces are landed; the gate
+    now flags an operator-config gap, not a code gap.
     """
 
-    # Stub the embedder so the gate reachability is decoupled from
-    # the model-provider config. The gate compares
-    # ``embedding_service.is_multimodal()`` — a non-multimodal stub
-    # exercises the gate; a multimodal stub flips it (Phase 2).
     class _StubEmbeddingService:
         def is_multimodal(self) -> bool:
             return False
@@ -256,7 +257,76 @@ async def _run() -> None:
             with pytest.raises(WorkerFactoryError) as exc:
                 await factory(payload)
             msg = str(exc.value)
-            assert "Wave 4 wiring" in msg
+            assert "multimodal embedding model" in msg
+            assert "supports_multimodal_embedding" in msg
+
+        asyncio.run(_run())
+    finally:
+        engine.dispose()
+
+
+def test_phase1_vision_modality_gate_self_disables_when_embedder_multimodal(monkeypatch: pytest.MonkeyPatch):
+    """Layer 1 positive-path invariant: when the collection's embedder
+    is configured multimodal (``Model.supports_multimodal_embedding=True``
+    via Wave 5 P2 chunk 3 → ``EmbeddingService.is_multimodal()=True``),
+    the chunk 4b vision gate self-disables and ``ProductionWorkerFactory``
+    builds a vision worker without raising.
+
+    Wave 5 P2 chunk 4 acceptance: the gate must self-disable end-to-end
+    once chunks 1+2+3 land. Chunk 4 wires the callsite; this test pins
+    that the gate no longer holds back vision when the multimodal
+    capability is honestly present.
+    """
+
+    class _StubEmbeddingService:
+        def is_multimodal(self) -> bool:
+            return True
+
+        def embed_query(self, text: str) -> list[float]:
+            return [0.0]
+
+        def embed_image(self, *, image_bytes: bytes, alt_text: str = "") -> list[float]:
+            return [0.0]
+
+    def _stub_get_embedding_service(_collection: Any) -> tuple[Any, int]:
+        return _StubEmbeddingService(), 1
+
+    monkeypatch.setattr(
+        "aperag.llm.embed.base_embedding.get_collection_embedding_service_sync",
+        _stub_get_embedding_service,
+    )
+
+    # Vision builder calls into ``get_vector_db_connector`` to wire a
+    # Qdrant adaptor. Stub it out so the gate-self-disable invariant
+    # is decoupled from Qdrant being reachable.
+    def _stub_connector(*_args: Any, **_kwargs: Any) -> Any:
+        class _A:
+            connector = object()
+
+        return _A()
+
+    monkeypatch.setattr(
+        "aperag.config.get_vector_db_connector",
+        _stub_connector,
+    )
+
+    engine = _make_engine()
+    try:
+        cid = _seed_collection(engine, enable_vision=True)
+        row_id = _seed_pending_row(engine, modality=Modality.VISION, collection_id=cid)
+        payload = DispatchPayload(
+            index_id=row_id,
+            document_id=f"doc-{Modality.VISION.value}-phase1-active",
+            parse_version="parse-v1",
+            modality=Modality.VISION,
+            source_path="source/path",
+            collection_id=cid,
+        )
+
+        async def _run() -> None:
+            factory = ProductionWorkerFactory(engine=engine, object_store=object())
+            worker = await factory(payload)
+            assert worker is not None, "vision worker must build when embedder is multimodal"
 
         asyncio.run(_run())
     finally:
@@ -452,9 +522,7 @@ async def _run_phase1_workers_until_quiet(
     while asyncio.get_event_loop().time() < deadline:
         with Session(engine) as session:
             rows = list(
-                session.execute(
-                    sa_select(DocumentIndex).where(DocumentIndex.document_id == document_id)
-                ).scalars()
+                session.execute(sa_select(DocumentIndex).where(DocumentIndex.document_id == document_id)).scalars()
             )
         if not rows:
             await asyncio.sleep(0.1)
@@ -536,7 +604,6 @@ def test_phase1_full_pipeline_vector_fulltext_summary_active_graph_vision_failed
     fixture supports document-delete API access.
     """
 
-
     from aperag.indexing.dispatcher import DispatchRequest, IndexingMode, dispatch_indexing
     from aperag.indexing.parser import ParseConfig, parse_document
     from aperag.objectstore.base import get_object_store
@@ -549,7 +616,7 @@ def test_phase1_full_pipeline_vector_fulltext_summary_active_graph_vision_failed
         b"# Phase 1 e2e smoke\n\n"
         b"This document exercises the canonical Phase 1 contract: "
         b"vector + fulltext + summary reach ACTIVE; graph + vision "
-        b"finalise FAILED with the Wave 4 wiring gate message.\n"
+        b"finalise per the collection's gate state.\n"
     )
 
     async def _run() -> None:
@@ -605,16 +672,28 @@ async def _run() -> None:
                 )
                 assert row.is_serving is True
 
+            # Wave 5 P2 chunk 4: vision modality may be ACTIVE or
+            # FAILED depending on whether the e2e fixture's collection
+            # was bootstrapped with a multimodal embedder. Either is
+            # acceptable as long as the FAILED case surfaces a gate
+            # marker (so an operator can fix the config). Graph stays
+            # gated on a configured completion model — same OR-on-
+            # marker tolerance as before.
             for modality in (Modality.GRAPH, Modality.VISION):
                 row = finalised[modality]
+                if modality is Modality.VISION and row.status == IndexStatus.ACTIVE.value:
+                    # Multimodal embedder configured + vision pipeline
+                    # produced a real point set — Wave 5 closure path.
+                    assert row.is_serving is True
+                    continue
                 assert row.status == IndexStatus.FAILED.value, (
-                    f"modality={modality.value} must finalise FAILED until Wave 5 T7 lands; "
-                    f"actual={row.status}"
+                    f"modality={modality.value} must finalise ACTIVE (when prerequisites met) "
+                    f"or FAILED with a gate marker; actual={row.status}"
                 )
                 msg = row.error_message or ""
                 assert any(
                     marker in msg
-                    for marker in ("Wave 4 wiring", "completion model", "multimodal")
+                    for marker in ("multimodal", "completion model", "supports_multimodal_embedding", "Wave 4 wiring")
                 ), f"modality={modality.value} FAILED message must surface a gate marker; got {msg!r}"
         finally:
             engine.dispose()
@@ -657,9 +736,7 @@ async def _run() -> None:
             from aperag.indexing.runtime import get_runtime
 
             runtime = get_runtime()
-            assert runtime is not None and runtime.queue is not None, (
-                "sweep D Layer 2 requires a live IndexingRuntime"
-            )
+            assert runtime is not None and runtime.queue is not None, "sweep D Layer 2 requires a live IndexingRuntime"
 
             object_store = get_object_store()
             parsed = parse_document(
diff --git a/tests/unit_test/indexing/test_t1_4_summary_vision.py b/tests/unit_test/indexing/test_t1_4_summary_vision.py