feat: Add Amazon S3 Vectors document store integration #3149

Open

dotKokott wants to merge 7 commits into deepset-ai:main
Conversation
Implements issue deepset-ai#2110, Amazon S3 Vectors document store integration, with:
- `S3VectorsDocumentStore`: full DocumentStore protocol (count, write, filter, delete)
- `S3VectorsEmbeddingRetriever`: embedding-based retrieval with metadata filtering
- Filter conversion from Haystack format to S3 Vectors filter syntax
- Auto-creation of vector buckets and indexes
- AWS credential support via Secret (or the default credential chain)
- 49 unit tests covering store, retriever, filters, and serialization
- README with usage examples and known limitations
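The filter conversion mentioned above is the part with the most real logic. A minimal sketch of the idea (the function name, the operator map, and the exact output shape are assumptions for illustration, not the PR's actual code), translating Haystack's filter dicts into the MongoDB-style operator syntax that S3 Vectors metadata filters accept:

```python
# Sketch (assumed names): convert a Haystack filter dict into the
# MongoDB-style operator syntax used by S3 Vectors metadata filters.
OP_MAP = {"==": "$eq", "!=": "$ne", ">": "$gt", ">=": "$gte",
          "<": "$lt", "<=": "$lte", "in": "$in", "not in": "$nin"}

def convert_filters(f: dict) -> dict:
    if "conditions" in f:  # logical node: AND / OR over sub-conditions
        logical = {"AND": "$and", "OR": "$or"}[f["operator"]]
        return {logical: [convert_filters(c) for c in f["conditions"]]}
    # comparison node, e.g. {"field": "meta.genre", "operator": "==", "value": "docs"}
    field = f["field"].removeprefix("meta.")
    return {field: {OP_MAP[f["operator"]]: f["value"]}}

haystack_filter = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.genre", "operator": "==", "value": "docs"},
        {"field": "meta.year", "operator": ">=", "value": 2023},
    ],
}
print(convert_filters(haystack_filter))
# {'$and': [{'genre': {'$eq': 'docs'}}, {'year': {'$gte': 2023}}]}
```

Because this is a pure function, it is also the easiest part of the store to unit-test without mocking boto3.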
…rkflow
- boto3 lower bound set to 1.42.0 (when the s3vectors service was added)
- pydoc filename changed to amazon_s3_vectors.md (underscores, matching the folder name)
- Quote $GITHUB_OUTPUT in the workflow to fix shellcheck SC2086
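The SC2086 fix above is the standard one: quote the `$GITHUB_OUTPUT` expansion so the redirect target is not subject to word splitting or globbing. A small self-contained illustration (using `mktemp` as a stand-in for the file GitHub Actions provides):

```shell
# SC2086: an unquoted $GITHUB_OUTPUT would undergo word splitting/globbing.
# Quoting the expansion makes the redirect safe even for paths with spaces.
GITHUB_OUTPUT="$(mktemp)"          # stand-in for the Actions-provided file
echo "version=1.2.3" >> "$GITHUB_OUTPUT"
cat "$GITHUB_OUTPUT"               # prints: version=1.2.3
```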
- Flatten test classes into standalone functions (matching the pinecone/qdrant pattern)
- Assert the full serialized dict structure in to_dict/from_dict tests
- Use Mock(spec=...) for retriever tests instead of MagicMock + patch
- Verify _embedding_retrieval call args match exactly
- Add test_from_dict_no_filter_policy (backward compatibility)
- Add test_init_is_lazy
Remove tests that only verify mock plumbing (count, write, delete calling the mock client). Keep tests that verify our actual logic:
- Serialization roundtrip (full dict structure)
- Score conversion (cosine + euclidean)
- Filter conversion (pure function with real logic)
- Duplicate policy batch checks (SKIP/NONE)
- Document <-> S3 vector conversion
- Input validation

Before: 49 unit tests (many testing mock behavior). After: 26 unit tests (all testing our code) + 12 integration tests.
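The score conversion the tests cover maps the service's distances to Haystack's higher-is-better scores. A minimal sketch of one plausible mapping (the cosine formula `1 - distance` matches the PR description; the euclidean formula here is an assumption for illustration):

```python
# Sketch (assumed formulas): S3 Vectors returns a *distance*; Haystack
# documents carry a higher-is-better *score*. Cosine distance maps
# naturally via 1 - d; for euclidean, 1 / (1 + d) is one common choice.
def distance_to_score(distance: float, metric: str) -> float:
    if metric == "cosine":
        return 1.0 - distance          # cosine distance 0.0 -> score 1.0
    if metric == "euclidean":
        return 1.0 / (1.0 + distance)  # monotone decreasing, bounded (0, 1]
    raise ValueError(f"unsupported metric: {metric}")

print(distance_to_score(0.2, "cosine"))     # 0.8
print(distance_to_score(0.0, "euclidean"))  # 1.0
```

A pure function like this is trivially unit-testable for both metrics, which is presumably why it survived the test-suite pruning.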
- Class docstring: top_k cap, dimension limit, metadata limits, float32 only
- write_documents: embedding required, 40KB metadata limit
- _embedding_retrieval: top_k=100 cap, no embeddings in response
- Retriever run: top_k=100, server-side filters, no embeddings returned
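The `write_documents` constraints above can be sketched as a small validation helper (the function name and the exact error wording are assumptions; the 40KB figure comes from the PR's documented limit):

```python
import json

# Sketch (assumed names): validation mirroring the documented constraints --
# every document needs an embedding, and S3 Vectors caps per-vector
# metadata at 40 KB.
MAX_METADATA_BYTES = 40 * 1024  # service limit per the PR's docstrings

def validate_for_write(doc_id: str, embedding, meta: dict) -> None:
    if embedding is None:
        raise ValueError(f"Document '{doc_id}' has no embedding; "
                         "S3 Vectors stores vectors, not raw text.")
    size = len(json.dumps(meta).encode("utf-8"))
    if size > MAX_METADATA_BYTES:
        raise ValueError(f"Metadata for '{doc_id}' is {size} bytes, "
                         f"over the {MAX_METADATA_BYTES}-byte limit.")

validate_for_write("doc-1", [0.1, 0.2], {"genre": "docs"})  # passes silently
```

Failing fast in Python gives a clearer error than letting boto3 surface a service-side validation exception mid-batch.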
…ity, deduplicate retrieval logic
- Replace the hand-rolled _apply_filters_in_memory/_document_matches/_compare with haystack.utils.filters.document_matches_filter (the same utility used by InMemoryDocumentStore). Gains NOT operator, nested dotted field paths, and date comparison support for free. (-65 lines)
- Deduplicate blob/content reconstruction in _embedding_retrieval() by reusing _s3_vector_to_document() + dataclasses.replace() (-20 lines)
- Make the filter_documents() warning conditional on filters actually being provided (no warning when listing all documents)
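The `_s3_vector_to_document()` + `dataclasses.replace()` deduplication pattern can be sketched like this (the `Doc` class here is a simplified stand-in for Haystack's `Document`, and the field names of the S3 vector dicts are assumptions):

```python
from dataclasses import dataclass, replace
from typing import Optional

# Sketch: one converter builds a document from an S3 vector, and the
# retrieval path reuses it, attaching the query score immutably via
# dataclasses.replace() instead of duplicating the reconstruction code.
@dataclass(frozen=True)
class Doc:
    id: str
    content: str
    score: Optional[float] = None

def s3_vector_to_doc(vec: dict) -> Doc:
    # the single place that reconstructs content from vector metadata
    return Doc(id=vec["key"], content=vec["metadata"]["content"])

def retrieval_hit_to_doc(hit: dict) -> Doc:
    # reuse the converter, then add the retrieval score without mutation
    return replace(s3_vector_to_doc(hit), score=hit["distance_score"])

hit = {"key": "doc-1", "metadata": {"content": "hello"}, "distance_score": 0.9}
print(retrieval_hit_to_doc(hit))
# Doc(id='doc-1', content='hello', score=0.9)
```

Keeping the reconstruction in one function means any change to how metadata encodes content automatically applies to both `filter_documents()` and embedding retrieval.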
1df9666 to 90c4977
Contributor
Author
CI: Integration tests need AWS credential setup

The integration tests currently run unconditionally in CI with no AWS credentials configured. The tests have a pytest mark credential guard, but the CI job still needs credentials to actually exercise them.
What needs to happen

The workflow should match the pattern below:

```yaml
# Do not authenticate on PRs from forks and on PRs created by dependabot
- name: AWS authentication
  id: aws-auth
  if: github.event_name == 'schedule' || (github.event.pull_request.head.repo.full_name == github.repository && !startsWith(github.event.pull_request.head.ref, 'dependabot/'))
  uses: aws-actions/configure-aws-credentials@ec61189d14ec14c8efccab744f656cffd0e33f37
  with:
    aws-region: us-east-1
    role-to-assume: ${{ secrets.AWS_S3_VECTORS_CI_ROLE_ARN }}

- name: Run integration tests
  if: success() && steps.aws-auth.outcome == 'success'
  run: hatch run test:integration-cov-append-retry
```

Prerequisites (maintainer action required)
Related Issues

- deepset-ai#2110
Proposed Changes:
Adds an Amazon S3 Vectors document store integration — a serverless vector storage capability native to S3.
Components:

- `S3VectorsDocumentStore`: full DocumentStore protocol (write, count, filter, delete)
- `S3VectorsEmbeddingRetriever`: embedding-based retrieval with server-side metadata filtering

Key design decisions:
- Scores are converted to similarities (`1 - distance`) to follow the Haystack convention
- `filter_documents()` uses `list_vectors(returnData=True, returnMetadata=True)` with client-side filtering (a warning is logged), since S3 Vectors has no standalone filter API
- `DuplicatePolicy.SKIP`/`NONE` are handled in batches of 100

Known limitations (documented in README):
- `top_k` capped at 100 (service limit)
- `query_vectors` does not return embedding data
- `float32` only; `cosine`/`euclidean` distance; eventual consistency

How did you test it?
- Unit and integration tests, with a pytest mark credential guard for CI
- `hatch run test:all`, `hatch run fmt`, `hatch run test:types`
- Example script (`examples/example.py`) verified against live AWS

Notes for the reviewer
This PR was fully generated with an AI assistant. I have reviewed the changes and run the relevant tests.
Structure and test style follow the Pinecone integration pattern.
Checklist
`fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:`.