From 3687587e520c51dc1209ca0ccecb2a5d598b524a Mon Sep 17 00:00:00 2001 From: earayu Date: Thu, 30 Apr 2026 18:05:29 +0800 Subject: [PATCH 1/2] =?UTF-8?q?feat(vectorstore):=20task=20#61=20P1-V=20ve?= =?UTF-8?q?ctor=20adapter=20family=20=E2=80=94=20capability=20+=20filter?= =?UTF-8?q?=20Or=20guard=20+=20retrieve=20defense-in-depth?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes task #83 per PM @不穷 dispatch (msg=29c9e753). Folds 4 P1-V items from task #61 spec v1 § 2.3 into a single PR: P1-V1 — collection init failure contract documentation ------------------------------------------------------ ``ensure_collection`` Protocol docstring now spells out the cross- adapter contract (idempotent / race-safe / fail-loud / cache-not- poisoned-on-failure). Both adapters already implement these behaviours; the documentation closes the spec drift gap so future implementers have a checklist. P1-V2 — batch upsert atomicity capability declaration ----------------------------------------------------- New :class:`VectorBackendCapabilities` frozen dataclass on the base module declares static per-backend behaviour flags. Each ``VectorStoreConnector`` subclass exposes an instance via the ``BACKEND_CAPABILITIES`` class-level attribute: * ``PgvectorVectorStoreConnector.BACKEND_CAPABILITIES.supports_atomic_batch_upsert = True`` (PGVector wraps bulk INSERT ON CONFLICT in ``engine.begin()`` — mid-batch failure rolls back the whole batch). * ``QdrantVectorStoreConnector.BACKEND_CAPABILITIES.supports_atomic_batch_upsert = False`` (Qdrant ``client.upsert(points, wait=True)`` is best-effort per-point — partial writes possible on mid-batch failure). ``upsert`` Protocol docstring now points at the capability flag so callers know to chunk + verify on backends that declare ``False``. P1-V3 — filter Or empty-parts guard ----------------------------------- ``Or.__post_init__`` already rejects empty ``parts`` at DSL construction. Both adapter translators now also guard at the translator boundary so a future refactor that bypasses the constructor (e.g. ``object.__setattr__(or_node, "parts", ())`` on the frozen dataclass, or a ``dataclasses.replace`` with empty parts) can't silently degrade to a vacuous "match everything" disjunction: * ``aperag/vectorstore/pgvector_connector.py:_SqlFilter._walk`` — raises ``UnsupportedFilterError`` on empty post-walk parts. * ``aperag/vectorstore/qdrant_connector.py:_translate_filter`` — raises ``UnsupportedFilterError`` on empty post-prune subs (so ``rest.Filter(should=[])`` — which Qdrant treats as match-all — is unreachable). P1-V4 — Qdrant legacy mode defense-in-depth ------------------------------------------- ``QdrantVectorStoreConnector.retrieve`` now applies the same ``TENANT_PAYLOAD_KEY`` filter in **both** multitenant and legacy modes, but with a backwards-compatible "no payload key → pass through" branch so legacy-only rows that don't carry the payload key keep working: * In multitenant mode: filter is the primary tenant-isolation layer (unchanged behaviour). * In legacy mode: collection-name isolation is the primary layer; the new payload-level filter is belt-and-braces against tooling drift / migration mistakes that could plant a stray foreign-tenant row in a legacy collection. The new ``BACKEND_CAPABILITIES.supports_legacy_mode`` flag declares which adapter supports the legacy layout (PGVector ``False``, Qdrant ``True``) so callers can tell the difference machine- readably. Tests ----- * ``tests/unit_test/vectorstore/test_backend_capabilities.py`` (new) — pins shape + per-flag values for each adapter. Coordinates with cuiwenbo task #87 P1-D3 collection metadata Pydantic projection so the static capability matrix stays consistent across PRs. * ``tests/unit_test/vectorstore/test_pgvector_translator.py`` and ``test_qdrant_filter_translation.py`` — pin the new Or empty-parts guard with frozen-dataclass-bypass coverage. * ``tests/unit_test/vectorstore/test_qdrant_multitenancy_integration.py`` — new ``test_retrieve_legacy_mode_filters_stray_foreign_payload`` exercises the P1-V4 belt-and-braces filter on a real ``:memory:`` Qdrant client: legacy-mode rows without payload key pass through (backward compat), own-tenant payload passes, foreign-tenant payload is dropped. Local: ``uv run pytest tests/unit_test/vectorstore/`` → **156 passed, 10 skipped, 1 warning**. Spec / scope alignment ---------------------- * task #61 spec v1 § 2.3 P1-V1 → ensure_collection contract doc ✅ * task #61 spec v1 § 2.3 P1-V2 → BACKEND_CAPABILITIES.supports_atomic_batch_upsert ✅ * task #61 spec v1 § 2.3 P1-V3 → Or empty-parts guard ✅ * task #61 spec v1 § 2.3 P1-V4 → retrieve defense-in-depth + supports_legacy_mode ✅ * Lesson #14 multi-iteration cleanup — legacy mode flagged via ``supports_legacy_mode`` so a future PR can drop the mode entirely once telemetry confirms zero production usage ✅ * Lesson #17 backend 收敛 contract — capability declaration is the backend-side contract that lets callers (FE / API / MCP) read a single source of truth instead of forking on backend type ✅ Follow-ups (NOT in this PR) --------------------------- * task #84 P1-G1+G2 graph store boundary tests — ziang * task #85 P1-D1 e2e shape matrix — huangzhangshu * task #86 P1-D2 Helm Nebula first-class — Planetegg * task #87 P1-D3 collection metadata vector_backend projection — cuiwenbo + dongdong (consumes ``BACKEND_CAPABILITIES`` values) * task #88 P2-S1+S2 batch alias resolution — Bryce after this PR Co-Authored-By: Claude Opus 4.7 --- aperag/vectorstore/base.py | 87 +++++++++++++-- aperag/vectorstore/pgvector_connector.py | 27 +++++ aperag/vectorstore/qdrant_connector.py | 54 +++++++++- .../vectorstore/test_backend_capabilities.py | 102 ++++++++++++++++++ .../vectorstore/test_pgvector_translator.py | 26 +++++ .../test_qdrant_filter_translation.py | 21 ++++ .../test_qdrant_multitenancy_integration.py | 53 +++++++++ 7 files changed, 359 insertions(+), 11 deletions(-) create mode 100644 tests/unit_test/vectorstore/test_backend_capabilities.py diff --git a/aperag/vectorstore/base.py b/aperag/vectorstore/base.py index 928179873..6bdf1199a 100644 --- a/aperag/vectorstore/base.py +++ b/aperag/vectorstore/base.py @@ -46,7 +46,8 @@ import math from abc import ABC, abstractmethod -from typing import Any, Dict, List, Sequence +from dataclasses import dataclass +from typing import Any, ClassVar, Dict, List, Sequence from aperag.vectorstore.dto import ( QueryRequest, @@ -185,6 +186,48 @@ def denormalize_threshold_to_native(distance: str, normalized: float) -> float: raise ValueError(f"Unknown distance metric: {distance!r}") +@dataclass(frozen=True) +class VectorBackendCapabilities: + """Static capability flags for a vector store backend (task #61 P1-V2 / P1-V3 / P1-V4). + + Per task-61 spec v1 § 2.3 「允许差异但显式 declaration」: not every + backend supports every operation atomically / identically, so + callers (FE / API / MCP / capability-aware optimizers) need a + machine-readable declaration of what each adapter is actually + capable of, instead of guessing from the backend name. + + These are **static** declarations — they describe what the backend + *can* do, independent of any runtime probe / fallback logic. A + runtime degradation surface (e.g. "PG connection pool exhausted → + graph search degraded to fulltext-only") is a separate concern and + intentionally NOT covered here (see architect msg=3163bb4b). + + Each adapter exposes its capabilities via the + :attr:`VectorStoreConnector.BACKEND_CAPABILITIES` class-level + attribute. Callers (e.g. cuiwenbo task #87 P1-D3 collection + metadata Pydantic projection) read this static declaration and + surface it on the API. + """ + + #: P1-V2 — does ``upsert(points)`` apply the entire batch + #: atomically? PGVector wraps the INSERT ON CONFLICT in a + #: ``engine.begin()`` transaction so a mid-batch failure rolls back + #: the whole batch (``True``). Qdrant's ``client.upsert(points, + #: wait=True)`` is best-effort per-point — a mid-batch failure can + #: leave some points written and others not (``False``). Callers + #: that need atomic semantics must chunk + verify on Qdrant; on + #: PGVector the semantics come for free. + supports_atomic_batch_upsert: bool + + #: P1-V4 — does the backend support a "legacy" non-multitenant + #: physical layout (one collection per tenant, no payload-level + #: tenant filter)? Qdrant supports both legacy and multitenant + #: modes (``True``); PGVector is multitenant-only (``False``). + #: Legacy mode is preserved for migration rollback compatibility + #: only — new collections always use the multitenant layout. + supports_legacy_mode: bool + + class VectorStoreConnector(ABC): """Abstract contract for per-tenant vector storage. @@ -194,6 +237,11 @@ class VectorStoreConnector(ABC): rest; unknown keys must never be a hard error. """ + #: Static capability flags for this backend (task #61 P1-V2/V3/V4). + #: Each concrete subclass overrides with its actual values; see + #: :class:`VectorBackendCapabilities` for the per-flag contract. + BACKEND_CAPABILITIES: ClassVar[VectorBackendCapabilities] + def __init__(self, ctx: Dict[str, Any], **_kwargs: Any) -> None: self.ctx = ctx @@ -214,9 +262,22 @@ def ensure_collection(self) -> None: """Idempotently make sure the physical storage (Qdrant collection, pgvector table, …) exists for this connector's shape. - Must be safe to call from many connectors concurrently: typical - implementations use ``CREATE IF NOT EXISTS`` / ``collection_exists - ? no-op : create`` with module-level deduping caches. + Cross-adapter contract (task #61 P1-V1): + + * **Idempotent** — repeat calls after first success are a no-op, + gated through the per-process ``_ENSURED_*`` cache. + * **Race-safe** — concurrent calls from multiple processes / + connectors must not all fail when the underlying CREATE + collides. PGVector relies on ``CREATE IF NOT EXISTS``; Qdrant + treats "already exists" responses on ``create_collection`` as + success. + * **Fail-loud** — any other failure (missing privilege, bad + config, transient DB outage) raises so the caller sees the + error rather than silently leaving an unusable connector. + * **Cache-not-poisoned-on-failure** — failed runs MUST NOT + populate the ``_ENSURED_*`` cache; the next call retries. + Otherwise a transient failure during boot would wedge the + connector for the rest of the process lifetime. """ @abstractmethod @@ -242,8 +303,22 @@ def upsert(self, points: Sequence[VectorPoint]) -> List[str]: Must inject the tenant guard into each point's storage so later searches / deletes can filter by it. Returns the ids written (in - input order). Raises on write failure — callers treat the batch - as atomic per-point. + input order). Raises on write failure. + + **Batch atomicity** is **backend-specific** and declared on + :attr:`BACKEND_CAPABILITIES.supports_atomic_batch_upsert` + (task #61 P1-V2): + + * PGVector wraps the bulk INSERT ON CONFLICT in + ``engine.begin()`` so a mid-batch failure rolls back the + entire batch (``supports_atomic_batch_upsert=True``). + * Qdrant's ``client.upsert(points, wait=True)`` is best-effort + per-point — a mid-batch failure can leave a partial write + (``supports_atomic_batch_upsert=False``). + + Callers that need atomic-batch semantics must read the + capability flag and chunk + verify when ``False``; on + ``True``-declaring backends the semantics come for free. """ @abstractmethod diff --git a/aperag/vectorstore/pgvector_connector.py b/aperag/vectorstore/pgvector_connector.py index 98f432a4b..3da5dae5b 100644 --- a/aperag/vectorstore/pgvector_connector.py +++ b/aperag/vectorstore/pgvector_connector.py @@ -73,6 +73,7 @@ from aperag.vectorstore.base import ( UnsupportedFilterError, + VectorBackendCapabilities, VectorStoreConnector, denormalize_threshold_to_native, normalize_score, @@ -254,6 +255,21 @@ def _walk(self, flt: VectorFilter) -> str: return "(" + " AND ".join(parts) + ")" if isinstance(flt, Or): parts = [self._walk(p) for p in flt.parts] + # task #61 P1-V3 defense-in-depth: ``Or.__post_init__`` + # already rejects empty ``parts`` at DSL construction so + # this list is normally non-empty. The translator-level + # guard catches future refactors that bypass the DSL + # constructor (e.g. ``dataclasses.replace(or_node, parts=())``) + # before they reach pgvector. An empty Or in SQL would + # collapse to ``()`` which is a syntax error — but in + # principle could degrade to a vacuous "always true" via + # some future translator change. Symmetric with the Qdrant + # ``Or`` translator guard for cross-adapter parity. + if not parts: + raise UnsupportedFilterError( + "pgvector: Or filter has zero translatable parts; " + "an empty Or is a vacuous disjunction (task #61 P1-V3)." + ) return "(" + " OR ".join(parts) + ")" if isinstance(flt, Not): return f"NOT ({self._walk(flt.inner)})" @@ -325,6 +341,17 @@ def _vector_literal(vec: Sequence[float]) -> str: class PgvectorVectorStoreConnector(VectorStoreConnector): """pgvector implementation of ``VectorStoreConnector``.""" + #: Static capability declaration (task #61 P1-V2 / P1-V4). + #: ``upsert`` wraps the bulk INSERT ON CONFLICT in a SQLAlchemy + #: ``engine.begin()`` transaction so a mid-batch failure rolls back + #: the whole batch — atomic. Legacy mode is not supported on + #: pgvector (would require one table per tenant; defeats the point + #: of sharing PG with the main ApeRAG DB). + BACKEND_CAPABILITIES = VectorBackendCapabilities( + supports_atomic_batch_upsert=True, + supports_legacy_mode=False, + ) + def __init__(self, ctx: Dict[str, Any], **kwargs: Any) -> None: super().__init__(ctx, **kwargs) diff --git a/aperag/vectorstore/qdrant_connector.py b/aperag/vectorstore/qdrant_connector.py index 809a0525c..5fe07e1a1 100644 --- a/aperag/vectorstore/qdrant_connector.py +++ b/aperag/vectorstore/qdrant_connector.py @@ -51,6 +51,7 @@ from aperag.vectorstore.base import ( UnsupportedFilterError, + VectorBackendCapabilities, VectorStoreConnector, denormalize_threshold_to_native, normalize_score, @@ -271,6 +272,21 @@ def _translate_filter(flt: Optional[VectorFilter]) -> Optional[rest.Filter]: if isinstance(flt, Or): subs = [_translate_filter(p) for p in flt.parts] subs = [s for s in subs if s is not None] + # task #61 P1-V3 defense-in-depth: ``Or.__post_init__`` already + # rejects empty ``parts`` at DSL construction, so this list is + # normally non-empty. The translator-level guard catches future + # refactors that bypass the DSL constructor (e.g. + # ``dataclasses.replace(or_node, parts=())``) before they reach + # Qdrant. Without this, ``rest.Filter(should=[])`` is a vacuous + # disjunction that Qdrant treats as "match everything" — a + # silent data-correctness footgun. Cross-adapter parity with + # the pgvector ``_SqlFilter._walk`` Or-empty guard. + if not subs: + raise UnsupportedFilterError( + "qdrant: Or filter has zero translatable parts after pruning; " + "an empty Or is a vacuous disjunction that would match every " + "point in the collection (task #61 P1-V3)." + ) return rest.Filter(should=subs) if isinstance(flt, Not): sub = _translate_filter(flt.inner) @@ -416,6 +432,18 @@ def _extract_vector(p: Any) -> Optional[List[float]]: class QdrantVectorStoreConnector(VectorStoreConnector): """Qdrant implementation of ``VectorStoreConnector``.""" + #: Static capability declaration (task #61 P1-V2 / P1-V4). + #: Qdrant's ``client.upsert(points, wait=True)`` is best-effort + #: per-point — a mid-batch failure can leave a partial write — so + #: ``supports_atomic_batch_upsert=False``. The legacy + #: (``multitenant=False``) layout is supported, where each ApeRAG + #: collection gets its own physical Qdrant collection; new + #: deployments default to multitenant. + BACKEND_CAPABILITIES = VectorBackendCapabilities( + supports_atomic_batch_upsert=False, + supports_legacy_mode=True, + ) + def __init__(self, ctx: Dict[str, Any], **kwargs: Any) -> None: super().__init__(ctx, **kwargs) @@ -713,11 +741,27 @@ def retrieve( ) for p in raw ] - # Represent "no vector requested" as empty list rather than None - # to keep VectorPoint.__post_init__ happy (vector must be list). - if not self.multitenant: - return out - return [p for p in out if p.payload.get(TENANT_PAYLOAD_KEY) == self._tenant.id] + # task #61 P1-V4 defense-in-depth: apply the same payload-level + # tenant filter to BOTH multitenant and legacy modes. In + # multitenant mode the filter is the primary tenant-isolation + # layer (the physical collection is shared). In legacy mode + # the primary isolation is the per-tenant physical collection + # name (``collection_name == tenant_id``) — this connector is + # already bound to a tenant-specific physical collection, so + # foreign-tenant ids cannot return cross-tenant data — but the + # ``TENANT_PAYLOAD_KEY in payload`` short-circuit means legacy + # rows that don't carry the payload key still pass through + # unchanged, while a stray multitenant-style row that ended up + # in a legacy collection (e.g. via tooling drift) would be + # filtered out. Lesson #14 multi-iteration cleanup family — + # legacy mode is preserved for migration rollback only; a + # future PR can drop the mode entirely once telemetry confirms + # zero production usage. + return [ + p + for p in out + if TENANT_PAYLOAD_KEY not in p.payload or p.payload.get(TENANT_PAYLOAD_KEY) == self._tenant.id + ] # ================================================================ delete def delete(self, ids: Sequence[str]) -> None: diff --git a/tests/unit_test/vectorstore/test_backend_capabilities.py b/tests/unit_test/vectorstore/test_backend_capabilities.py new file mode 100644 index 000000000..c4d33a5e5 --- /dev/null +++ b/tests/unit_test/vectorstore/test_backend_capabilities.py @@ -0,0 +1,102 @@ +# Copyright 2026 ApeCloud, Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Static capability declaration tests (task #61 P1-V2 / P1-V4). + +Each :class:`VectorStoreConnector` subclass declares its +``BACKEND_CAPABILITIES`` class-level attribute so callers +(API / FE / capability-aware optimizers) can read a machine-readable +declaration of what the adapter actually does — instead of guessing +from the backend name. + +These tests pin the static values so: + +1. A future refactor that drops a flag declaration on a concrete + subclass fails fast (e.g. removing ``BACKEND_CAPABILITIES`` from + :class:`PgvectorVectorStoreConnector` makes the flag undefined). +2. The cross-adapter capability matrix surfaces in code review as a + single test file — consumers (cuiwenbo task #87 P1-D3 collection + metadata Pydantic projection) read these values verbatim, so any + change to the behaviour they describe must update both the adapter + docstring + this test in the same PR. +""" + +from __future__ import annotations + +from aperag.vectorstore.base import VectorBackendCapabilities, VectorStoreConnector +from aperag.vectorstore.pgvector_connector import PgvectorVectorStoreConnector +from aperag.vectorstore.qdrant_connector import QdrantVectorStoreConnector + +# --------------------------------------------------------------------- +# Shape — ensure both adapters declare the attribute and it's the +# right type. +# --------------------------------------------------------------------- + + +def test_pgvector_declares_backend_capabilities(): + caps = PgvectorVectorStoreConnector.BACKEND_CAPABILITIES + assert isinstance(caps, VectorBackendCapabilities) + + +def test_qdrant_declares_backend_capabilities(): + caps = QdrantVectorStoreConnector.BACKEND_CAPABILITIES + assert isinstance(caps, VectorBackendCapabilities) + + +def test_abstract_base_does_not_set_concrete_capabilities(): + """:class:`VectorStoreConnector` is abstract — only concrete + subclasses declare a value. Keeping the base class assignment + absent means a future subclass that forgets to declare gets a + ``AttributeError`` at the call site, not a silent default.""" + # ``BACKEND_CAPABILITIES`` is a ``ClassVar`` annotation on the base + # class without a value, so it doesn't actually exist on the base. + assert "BACKEND_CAPABILITIES" not in VectorStoreConnector.__dict__ + + +# --------------------------------------------------------------------- +# Capability matrix values — pinned by spec § 2.3 + task #83 P1-V* +# implementation. cuiwenbo task #87 P1-D3 reads these values for the +# collection metadata Pydantic projection, so changes here must be +# coordinated with that PR. +# --------------------------------------------------------------------- + + +def test_pgvector_supports_atomic_batch_upsert(): + """PGVector wraps the bulk INSERT ON CONFLICT in + ``engine.begin()`` so a mid-batch failure rolls back the entire + batch (task #61 P1-V2).""" + assert PgvectorVectorStoreConnector.BACKEND_CAPABILITIES.supports_atomic_batch_upsert is True + + +def test_qdrant_does_not_support_atomic_batch_upsert(): + """Qdrant ``client.upsert(points, wait=True)`` is best-effort + per-point — a mid-batch failure can leave some points written and + others not (task #61 P1-V2). Callers needing atomic-batch + semantics must chunk + verify.""" + assert QdrantVectorStoreConnector.BACKEND_CAPABILITIES.supports_atomic_batch_upsert is False + + +def test_pgvector_does_not_support_legacy_mode(): + """PGVector is multitenant-only — a per-tenant table layout would + require dropping the shared-PG topology entirely (task #61 P1-V4).""" + assert PgvectorVectorStoreConnector.BACKEND_CAPABILITIES.supports_legacy_mode is False + + +def test_qdrant_supports_legacy_mode(): + """Qdrant supports both legacy (``multitenant=False``, + one-collection-per-tenant) and multitenant + (``multitenant=True``, shared-collection + payload filter) + layouts, controlled by the ``multitenant`` ctx flag (task #61 + P1-V4). New deployments default to multitenant.""" + assert QdrantVectorStoreConnector.BACKEND_CAPABILITIES.supports_legacy_mode is True diff --git a/tests/unit_test/vectorstore/test_pgvector_translator.py b/tests/unit_test/vectorstore/test_pgvector_translator.py index 6aa9ea028..220e942a4 100644 --- a/tests/unit_test/vectorstore/test_pgvector_translator.py +++ b/tests/unit_test/vectorstore/test_pgvector_translator.py @@ -153,6 +153,32 @@ class Bogus: _translate_filter(Bogus()) # type: ignore[arg-type] +def test_translate_or_with_zero_translatable_parts_raises(): + """Pinned by task #61 P1-V3: an Or filter that ends up with zero + translatable parts (after the construction guard is bypassed) MUST + raise rather than degrade to a vacuous "always-true" SQL fragment. + + ``Or.__post_init__`` already rejects empty ``parts`` at construction; + we exercise the translator-level defense-in-depth path by + constructing the dataclass directly. Cross-adapter parity with the + Qdrant translator's identical guard. + """ + import dataclasses + + from aperag.vectorstore.base import UnsupportedFilterError + from aperag.vectorstore.filters import Eq, Or + + # ``object.__setattr__`` works around the frozen dataclass so we + # can simulate a downstream caller that built an Or via + # ``dataclasses.replace`` with empty ``parts``. + or_node = Or(parts=(Eq(key="k", value="v"),)) + object.__setattr__(or_node, "parts", ()) + assert dataclasses.is_dataclass(or_node) + + with pytest.raises(UnsupportedFilterError, match="zero translatable parts"): + _translate_filter(or_node) + + # --------------------------------------------------------------------------- # vector literal # --------------------------------------------------------------------------- diff --git a/tests/unit_test/vectorstore/test_qdrant_filter_translation.py b/tests/unit_test/vectorstore/test_qdrant_filter_translation.py index fcef49922..5eaa4ccd3 100644 --- a/tests/unit_test/vectorstore/test_qdrant_filter_translation.py +++ b/tests/unit_test/vectorstore/test_qdrant_filter_translation.py @@ -60,6 +60,27 @@ def test_normalize_unknown_type_raises_unsupported_filter_error(): qc._normalize_filter_input(object()) +def test_translate_or_with_zero_translatable_parts_raises(): + """Pinned by task #61 P1-V3: an Or filter that ends up with zero + translatable parts (after the construction guard is bypassed) MUST + raise rather than produce a vacuous ``Filter(should=[])`` that + Qdrant would treat as "match everything". + + ``Or.__post_init__`` already rejects empty ``parts`` at construction; + we exercise the translator-level defense-in-depth path by mutating + a frozen dataclass via ``object.__setattr__`` to simulate a + downstream caller that ended up with empty parts. + """ + from aperag.vectorstore.base import UnsupportedFilterError + from aperag.vectorstore.filters import Eq, Or + + or_node = Or(parts=(Eq(key="k", value="v"),)) + object.__setattr__(or_node, "parts", ()) + + with pytest.raises(UnsupportedFilterError, match="zero translatable parts"): + qc._translate_filter(or_node) + + # --------------------------------------------------------------------------- # leaf nodes # --------------------------------------------------------------------------- diff --git a/tests/unit_test/vectorstore/test_qdrant_multitenancy_integration.py b/tests/unit_test/vectorstore/test_qdrant_multitenancy_integration.py index 7101dcb65..172c9c88f 100644 --- a/tests/unit_test/vectorstore/test_qdrant_multitenancy_integration.py +++ b/tests/unit_test/vectorstore/test_qdrant_multitenancy_integration.py @@ -300,6 +300,59 @@ def test_retrieve_drops_foreign_tenant_points(shared_client): assert retrieved_ids == set(ids_a), f"leaked foreign points: {retrieved_ids - set(ids_a)}" +def test_retrieve_legacy_mode_filters_stray_foreign_payload(shared_client): + """Pinned by task #61 P1-V4: even in legacy (per-tenant physical + collection) mode the ``retrieve()`` post-filter must drop any + point that carries a ``TENANT_PAYLOAD_KEY`` payload whose value + doesn't match the connector's tenant id. + + Primary isolation in legacy mode is the physical collection name + (``collection_name == tenant_id``) — a stray foreign-tenant point + can't normally reach a legacy collection. But tooling drift / + migration mistakes / re-ingest scripts could accidentally write + one. The defense-in-depth filter catches that case before it + surfaces as a cross-tenant data leak. + + Legacy rows that don't carry ``TENANT_PAYLOAD_KEY`` at all (the + common case in legacy mode) pass through unchanged — backward- + compatible with the historical legacy data. + """ + legacy = _make_connector("col_legacy_owner", client=shared_client, multitenant=False) + + # Seed 3 points: 1 with no payload key (typical legacy row), 1 + # with the matching tenant id, 1 with a *foreign* tenant id (the + # stray case the filter must catch). + no_payload_id = str(uuid.uuid4()) + own_payload_id = str(uuid.uuid4()) + foreign_payload_id = str(uuid.uuid4()) + shared_client.upsert( + collection_name=legacy.collection_name, + points=[ + rest.PointStruct(id=no_payload_id, vector=VEC_A, payload={}), + rest.PointStruct( + id=own_payload_id, + vector=VEC_A, + payload={TENANT_PAYLOAD_KEY: legacy.tenant.id}, + ), + rest.PointStruct( + id=foreign_payload_id, + vector=VEC_A, + payload={TENANT_PAYLOAD_KEY: "col_some_other_tenant"}, + ), + ], + wait=True, + ) + + out = legacy.retrieve([no_payload_id, own_payload_id, foreign_payload_id]) + out_ids = {p.id for p in out} + + # The no-payload + own-payload rows pass through; the foreign- + # payload row is filtered. + assert no_payload_id in out_ids, "legacy row without TENANT_PAYLOAD_KEY must pass through (backward compat)" + assert own_payload_id in out_ids, "own-tenant payload must pass through" + assert foreign_payload_id not in out_ids, "stray foreign-tenant payload must be dropped (P1-V4 defense-in-depth)" + + # --------------------------------------------------------------------------- # delete_by_filter # --------------------------------------------------------------------------- From 542dd65f4b43f910d2a22cedac6c64af2be4cc2d Mon Sep 17 00:00:00 2001 From: earayu Date: Thu, 30 Apr 2026 18:11:44 +0800 Subject: [PATCH 2/2] fix(vectorstore): mode-specific tenant filter on Qdrant retrieve (Weston PR #1948 BLOCKER) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Weston PR #1948 architecture CR (msg=910cad66 BLOCKER) caught a real correctness regression in the initial P1-V4 commit: the uniform "no payload key → pass through" branch leaked stray ``{}`` payload rows in the **shared multitenant collection** to every tenant on a ``retrieve(ids=...)`` call. Local Qdrant ``:memory:`` repro (per Weston): a multitenant connector ``tenant_a`` writes a point with ``payload={}`` directly to the shared collection, then ``tenant_a.retrieve([id])`` returns the row. Because ``upsert()`` always stamps the payload key, the only way a missing-key row reaches the shared collection is tooling drift / migration drift — exactly the case P1-V4 defense-in-depth is supposed to catch. Fix --- Mode-specific semantics: * **Multitenant mode** (shared physical collection): STRICT — every row MUST carry ``TENANT_PAYLOAD_KEY`` matching the connector's tenant id. No "no payload key → pass through" branch, because the shared collection means a missing key would expose the row to every tenant. * **Legacy mode** (per-tenant physical collection, unchanged from initial commit): PERMISSIVE — a row that doesn't carry the payload key still passes through (typical pre-multitenant data shape), but a stray foreign-tenant payload gets dropped (catches tooling drift / migration mistakes). Tests ----- ``test_retrieve_multitenant_mode_strict_requires_payload_key`` (new) — Weston's exact repro: seed shared collection with ``{}`` payload + own-tenant payload + foreign-tenant payload, assert only the own-tenant row passes through. The legacy-mode permissive counterpart (``test_retrieve_legacy_mode_filters_stray_foreign_payload``) stays unchanged so a future refactor that unifies them silently re-opens the leak fails fast. Local: ``uv run pytest tests/unit_test/vectorstore/`` → **157 passed, 10 skipped** (one new case). Sediment trigger ---------------- This is Lesson #12 v9 fifth-application demo same family — Weston first-principles repro catches the unified branch as silent leak that I missed when applying the legacy-compat optimization uniformly. The narrower ``mode-specific`` framing matches the spec language ("legacy compat for legacy mode only") more precisely. Co-Authored-By: Claude Opus 4.7 --- aperag/vectorstore/qdrant_connector.py | 48 ++++++++++------ .../test_qdrant_multitenancy_integration.py | 55 +++++++++++++++++++ 2 files changed, 87 insertions(+), 16 deletions(-) diff --git a/aperag/vectorstore/qdrant_connector.py b/aperag/vectorstore/qdrant_connector.py index 5fe07e1a1..92bb75746 100644 --- a/aperag/vectorstore/qdrant_connector.py +++ b/aperag/vectorstore/qdrant_connector.py @@ -741,22 +741,38 @@ def retrieve( ) for p in raw ] - # task #61 P1-V4 defense-in-depth: apply the same payload-level - # tenant filter to BOTH multitenant and legacy modes. In - # multitenant mode the filter is the primary tenant-isolation - # layer (the physical collection is shared). In legacy mode - # the primary isolation is the per-tenant physical collection - # name (``collection_name == tenant_id``) — this connector is - # already bound to a tenant-specific physical collection, so - # foreign-tenant ids cannot return cross-tenant data — but the - # ``TENANT_PAYLOAD_KEY in payload`` short-circuit means legacy - # rows that don't carry the payload key still pass through - # unchanged, while a stray multitenant-style row that ended up - # in a legacy collection (e.g. via tooling drift) would be - # filtered out. Lesson #14 multi-iteration cleanup family — - # legacy mode is preserved for migration rollback only; a - # future PR can drop the mode entirely once telemetry confirms - # zero production usage. + # task #61 P1-V4 defense-in-depth — payload-level tenant + # filter applied with **mode-specific semantics** (per Weston + # msg=910cad66 BLOCKER catch on initial commit: a uniform + # "no payload key → pass through" branch leaked stray ``{}`` + # payload rows in the shared multitenant collection to every + # tenant on ``retrieve(ids=...)``): + # + # * Multitenant mode (shared physical collection): payload + # key is the **primary** isolation layer. STRICT — a row + # must carry ``TENANT_PAYLOAD_KEY`` matching this tenant. + # No "no payload key → pass through" branch here, because + # in the shared collection a missing key would expose the + # row to every tenant, not just this one. ``upsert()`` + # stamps the key on every point we write, so the only way + # a missing-key row reaches the collection is tooling + # drift / migration drift, exactly the case this gate is + # meant to catch. + # * Legacy mode (per-tenant physical collection): the + # collection name is the primary isolation layer + # (``collection_name == tenant_id``); the payload-level + # filter is belt-and-braces. PERMISSIVE — a row that + # doesn't carry the payload key still passes through + # (typical legacy data shape pre-multitenant), but a stray + # foreign-tenant payload gets dropped (catches tooling + # drift / migration mistakes). + # + # Lesson #14 multi-iteration cleanup family — legacy mode is + # preserved for migration rollback only; a future PR can drop + # the mode entirely once telemetry confirms zero production + # usage. + if self.multitenant: + return [p for p in out if p.payload.get(TENANT_PAYLOAD_KEY) == self._tenant.id] return [ p for p in out diff --git a/tests/unit_test/vectorstore/test_qdrant_multitenancy_integration.py b/tests/unit_test/vectorstore/test_qdrant_multitenancy_integration.py index 172c9c88f..911907450 100644 --- a/tests/unit_test/vectorstore/test_qdrant_multitenancy_integration.py +++ b/tests/unit_test/vectorstore/test_qdrant_multitenancy_integration.py @@ -300,6 +300,61 @@ def test_retrieve_drops_foreign_tenant_points(shared_client): assert retrieved_ids == set(ids_a), f"leaked foreign points: {retrieved_ids - set(ids_a)}" +def test_retrieve_multitenant_mode_strict_requires_payload_key(shared_client): + """Pinned by task #61 P1-V4 BLOCKER from Weston msg=910cad66: + in multitenant mode the payload-level tenant filter must be + **strict** — a row MUST carry ``TENANT_PAYLOAD_KEY`` matching + this tenant. No "no payload key → pass through" branch in + multitenant mode, because the physical collection is shared, + and a missing-key row would otherwise expose to every tenant + on a ``retrieve(ids=...)`` call. + + The legacy-mode permissive branch (no payload key passes + through) is exercised by + :func:`test_retrieve_legacy_mode_filters_stray_foreign_payload`; + this test pins the multitenant strict counterpart so a future + refactor can't unify them and silently re-open the leak. + """ + a = _make_connector("col_aaaaaaaaaaaaa_a", client=shared_client) + + # Seed three points directly on the shared multitenant collection: + # 1. ``{}`` payload (no TENANT_PAYLOAD_KEY at all — the stray + # case Weston caught). + # 2. Own-tenant payload — must pass through. + # 3. Foreign-tenant payload — must be dropped. + no_payload_id = str(uuid.uuid4()) + own_payload_id = str(uuid.uuid4()) + foreign_payload_id = str(uuid.uuid4()) + shared_client.upsert( + collection_name=a.collection_name, + points=[ + rest.PointStruct(id=no_payload_id, vector=VEC_A, payload={}), + rest.PointStruct( + id=own_payload_id, + vector=VEC_A, + payload={TENANT_PAYLOAD_KEY: a.tenant.id}, + ), + rest.PointStruct( + id=foreign_payload_id, + vector=VEC_A, + payload={TENANT_PAYLOAD_KEY: "col_some_other_tenant"}, + ), + ], + wait=True, + ) + + out = a.retrieve([no_payload_id, own_payload_id, foreign_payload_id]) + out_ids = {p.id for p in out} + + # Strict: only own-tenant row passes. + assert no_payload_id not in out_ids, ( + "multitenant mode must NOT pass-through rows missing TENANT_PAYLOAD_KEY — " + "shared collection means a missing key would leak to every tenant" + ) + assert own_payload_id in out_ids, "own-tenant payload must pass through" + assert foreign_payload_id not in out_ids, "foreign-tenant payload must be dropped" + + def test_retrieve_legacy_mode_filters_stray_foreign_payload(shared_client): """Pinned by task #61 P1-V4: even in legacy (per-tenant physical collection) mode the ``retrieve()`` post-filter must drop any