
feat: Add Amazon S3 Vectors document store integration #3149

Open
dotKokott wants to merge 7 commits into deepset-ai:main from dotKokott:feature/amazon-s3-vectors-integration

Conversation


dotKokott (Contributor) commented on Apr 13, 2026

Related Issues

  • Implements deepset-ai#2110: Amazon S3 Vectors (DocStore)

Proposed Changes:

Adds an Amazon S3 Vectors document store integration — a serverless vector storage capability native to S3.

Components:

  • S3VectorsDocumentStore — full DocumentStore protocol (write, count, filter, delete)
  • S3VectorsEmbeddingRetriever — embedding-based retrieval with server-side metadata filtering

Key design decisions:

  • Content stored as non-filterable metadata (AWS-recommended pattern for large text)
  • Cosine distance converted to similarity score (1 - distance) for Haystack convention
  • Blob data uses base64 encoding for round-trip fidelity
  • filter_documents() uses list_vectors(returnData=True, returnMetadata=True) with client-side filtering (warning logged) since S3 Vectors has no standalone filter API
  • Batch existence checks for DuplicatePolicy.SKIP/NONE (batches of 100)
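The distance-to-score conversion above can be sketched as follows. This is a minimal sketch: the PR only specifies `1 - distance` for cosine, so the euclidean mapping shown here is an illustrative assumption, not the PR's exact formula.

```python
def distance_to_score(distance: float, metric: str = "cosine") -> float:
    """Convert an S3 Vectors distance into a Haystack-style similarity score."""
    if metric == "cosine":
        # Cosine distance is 1 - cosine similarity, so score = 1 - distance.
        return 1.0 - distance
    if metric == "euclidean":
        # Illustrative monotonic mapping; the PR's exact euclidean formula
        # is not shown in this description.
        return 1.0 / (1.0 + distance)
    raise ValueError(f"Unsupported metric: {metric}")
```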

Known limitations (documented in README):

  • top_k capped at 100 (service limit)
  • query_vectors does not return embedding data
  • 40KB total metadata per vector, 2KB filterable
  • Only float32 vectors, cosine/euclidean distance metrics, and eventual consistency
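A pre-write size check against the metadata limits above might look like the following sketch. It assumes sizes are measured as serialized UTF-8 JSON bytes, which may differ from how the service actually counts; the function names are illustrative.

```python
import json

TOTAL_LIMIT = 40 * 1024       # total metadata bytes per vector (service limit above)
FILTERABLE_LIMIT = 2 * 1024   # filterable metadata subset

def _json_bytes(d: dict) -> int:
    # Compact JSON serialization as a proxy for the stored metadata size.
    return len(json.dumps(d, separators=(",", ":")).encode("utf-8"))

def metadata_within_limits(filterable: dict, non_filterable: dict) -> bool:
    total = _json_bytes(filterable) + _json_bytes(non_filterable)
    return total <= TOTAL_LIMIT and _json_bytes(filterable) <= FILTERABLE_LIMIT
```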

How did you test it?

  • 26 unit tests — serialization, score conversion, filter conversion, duplicate policy logic, document conversion (mocked boto3)
  • 12 integration tests — full lifecycle against live AWS S3 Vectors, with pytestmark credential guard for CI
  • hatch run test:all, hatch run fmt, hatch run test:types
  • Example script (examples/example.py) verified against live AWS

Notes for the reviewer

This PR was fully generated with an AI assistant. I have reviewed the changes and run the relevant tests.
Structure and test style follow the Pinecone integration pattern.

Checklist

@github-actions bot added the topic:CI and type:documentation (Improvements or additions to documentation) labels on Apr 13, 2026
Implements issue deepset-ai#2110 - Amazon S3 Vectors document store integration with:

- S3VectorsDocumentStore: full DocumentStore protocol (count, write, filter, delete)
- S3VectorsEmbeddingRetriever: embedding-based retrieval with metadata filtering
- Filter conversion from Haystack format to S3 Vectors filter syntax
- Auto-creation of vector buckets and indexes
- AWS credential support via Secret (or default credential chain)
- 49 unit tests covering store, retriever, filters, and serialization
- README with usage examples and known limitations
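The Haystack-to-S3-Vectors filter conversion listed above can be sketched like this. Assumptions to note: `to_s3_filter` and `_OP_MAP` are illustrative names rather than the PR's code, and the MongoDB-style `$eq`/`$and` operator syntax is assumed from the S3 Vectors metadata filtering model.

```python
# Map Haystack comparison operators to assumed S3 Vectors filter operators.
_OP_MAP = {"==": "$eq", "!=": "$ne", ">": "$gt", ">=": "$gte",
           "<": "$lt", "<=": "$lte", "in": "$in", "not in": "$nin"}

def to_s3_filter(f: dict) -> dict:
    """Recursively convert a Haystack filter dict to S3 Vectors syntax."""
    if f.get("operator") in ("AND", "OR"):
        key = "$and" if f["operator"] == "AND" else "$or"
        return {key: [to_s3_filter(c) for c in f["conditions"]]}
    # Leaf comparison: strip Haystack's "meta." prefix from the field path.
    field = f["field"].removeprefix("meta.")
    return {field: {_OP_MAP[f["operator"]]: f["value"]}}
```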
…rkflow

- boto3 lower bound set to 1.42.0 (when s3vectors service was added)
- pydoc filename changed to amazon_s3_vectors.md (underscores, matching folder name)
- Quote $GITHUB_OUTPUT in workflow to fix shellcheck SC2086
- Flatten test classes into standalone functions (matching pinecone/qdrant pattern)
- Assert full serialized dict structure in to_dict/from_dict tests
- Use Mock(spec=...) for retriever tests instead of MagicMock+patch
- Verify _embedding_retrieval call args match exactly
- Add test_from_dict_no_filter_policy (backward compat)
- Add test_init_is_lazy
Remove tests that just verify mock plumbing (count, write, delete calling
the mock client). Keep tests that verify our actual logic:
- Serialization roundtrip (full dict structure)
- Score conversion (cosine + euclidean)
- Filter conversion (pure function with real logic)
- Duplicate policy batch checks (SKIP/NONE)
- Document <-> S3 vector conversion
- Input validation

Before: 49 unit tests (many testing mock behavior)
After: 26 unit tests (all testing our code) + 12 integration tests
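The duplicate-policy batch check kept by these tests can be sketched as follows. The helper names and the `get_vectors` call shape are illustrative assumptions, not the PR's exact code; only the batch size of 100 comes from the description above.

```python
from typing import Iterator

BATCH_SIZE = 100  # batch limit described in the PR

def batched(ids: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    # Yield successive fixed-size chunks of the id list.
    for i in range(0, len(ids), size):
        yield ids[i : i + size]

def existing_ids(client, index_arn: str, ids: list[str]) -> set[str]:
    """Check which document ids already exist, 100 keys per request."""
    found: set[str] = set()
    for chunk in batched(ids):
        # Illustrative call; the real boto3 s3vectors operation may differ.
        resp = client.get_vectors(indexArn=index_arn, keys=chunk)
        found.update(v["key"] for v in resp.get("vectors", []))
    return found
```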
- Class docstring: top_k cap, dimension limit, metadata limits, float32 only
- write_documents: embedding required, 40KB metadata limit
- _embedding_retrieval: top_k=100 cap, no embeddings in response
- Retriever run: top_k=100, server-side filters, no embeddings returned
…ity, deduplicate retrieval logic

- Replace hand-rolled _apply_filters_in_memory/_document_matches/_compare
  with haystack.utils.filters.document_matches_filter (same utility used by
  InMemoryDocumentStore). Gains NOT operator, nested dotted field paths, and
  date comparison support for free. (-65 lines)
- Deduplicate blob/content reconstruction in _embedding_retrieval() by
  reusing _s3_vector_to_document() + dataclasses.replace() (-20 lines)
- Make filter_documents() warning conditional on filters actually being
  provided (no warning when listing all documents)
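For reference, here is a toy version of the client-side matching that `document_matches_filter` performs on Haystack-format filters. It is a simplified sketch only: the real utility additionally handles the NOT operator, nested dotted field paths, and date comparisons, as noted above.

```python
def matches(filters: dict, meta: dict) -> bool:
    """Toy matcher for Haystack-style filter dicts against a metadata dict."""
    op = filters.get("operator")
    if op in ("AND", "OR"):
        results = [matches(c, meta) for c in filters["conditions"]]
        return all(results) if op == "AND" else any(results)
    # Leaf comparison: look up the field, dropping the "meta." prefix.
    actual = meta.get(filters["field"].removeprefix("meta."))
    value = filters["value"]
    if op == "==":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "in":
        return actual in value
    if op == "not in":
        return actual not in value
    raise ValueError(f"Unsupported operator: {op}")
```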
@dotKokott dotKokott force-pushed the feature/amazon-s3-vectors-integration branch from 1df9666 to 90c4977 Compare April 13, 2026 13:28
@dotKokott dotKokott marked this pull request as ready for review April 13, 2026 13:39
@dotKokott dotKokott requested a review from a team as a code owner April 13, 2026 13:39
@dotKokott dotKokott requested review from anakin87 and removed request for a team April 13, 2026 13:39
@dotKokott dotKokott marked this pull request as draft April 13, 2026 13:39
@dotKokott dotKokott marked this pull request as ready for review April 13, 2026 13:40
dotKokott (Contributor, Author) commented:

CI: Integration tests need AWS credential setup

The integration tests currently run unconditionally in CI with no AWS credentials configured. The tests have a pytestmark = pytest.mark.skipif(not _aws_credentials_available(), ...) guard, so they silently skip (0 collected), but this means:

  • Integration tests never actually run in CI — only locally by developers with AWS credentials
  • The "combined" coverage badge will just reflect unit test coverage

What needs to happen

The workflow should match the amazon_bedrock.yml pattern — add an OIDC role assumption step and gate the integration test run on its success:

# Do not authenticate on PRs from forks and on PRs created by dependabot
- name: AWS authentication
  id: aws-auth
  if: github.event_name == 'schedule' || (github.event.pull_request.head.repo.full_name == github.repository && !startsWith(github.event.pull_request.head.ref, 'dependabot/'))
  uses: aws-actions/configure-aws-credentials@ec61189d14ec14c8efccab744f656cffd0e33f37
  with:
    aws-region: us-east-1
    role-to-assume: ${{ secrets.AWS_S3_VECTORS_CI_ROLE_ARN }}

- name: Run integration tests
  if: success() && steps.aws-auth.outcome == 'success'
  run: hatch run test:integration-cov-append-retry

Prerequisites (maintainer action required)

  1. Create an IAM role with s3vectors:* permissions (scoped to haystack-test-* bucket names)
  2. Configure the role's trust policy for GitHub OIDC (token.actions.githubusercontent.com)
  3. Add the role ARN as a repository secret (e.g. AWS_S3_VECTORS_CI_ROLE_ARN)
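For step 2, a standard GitHub OIDC trust policy looks roughly like the sketch below. `<ACCOUNT_ID>` and `<ORG>/<REPO>` are placeholders, and the `sub` condition should be tightened to the branches/environments you actually trust:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": { "token.actions.githubusercontent.com:aud": "sts.amazonaws.com" },
        "StringLike": { "token.actions.githubusercontent.com:sub": "repo:<ORG>/<REPO>:*" }
      }
    }
  ]
}
```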



Development

Successfully merging this pull request may close these issues.

Amazon S3 Vectors (DocStore)
