feat: Add Amazon S3 Vectors document store integration#3149
Conversation
1df9666 to
90c4977
Compare
CI: Integration tests need AWS credential setupThe integration tests currently run unconditionally in CI with no AWS credentials configured. The tests have a
What needs to happenThe workflow should match the # Do not authenticate on PRs from forks and on PRs created by dependabot
- name: AWS authentication
id: aws-auth
if: github.event_name == 'schedule' || (github.event.pull_request.head.repo.full_name == github.repository && !startsWith(github.event.pull_request.head.ref, 'dependabot/'))
uses: aws-actions/configure-aws-credentials@ec61189d14ec14c8efccab744f656cffd0e33f37
with:
aws-region: us-east-1
role-to-assume: ${{ secrets.AWS_S3_VECTORS_CI_ROLE_ARN }}
- name: Run integration tests
if: success() && steps.aws-auth.outcome == 'success'
run: hatch run test:integration-cov-append-retryPrerequisites (maintainer action required)
|
|
@dotKokott I'll try to take a look in the next few days. Have you tried the integration yourself in a real-world setting with AWS? |
I have tried all integration tests and examples on my AWS account. However I did not try with any large datasets. That might be next thing to validate: does this work as expected with real load. |
anakin87
left a comment
There was a problem hiding this comment.
I left some initial comments.
Will take a better look soon
Implements issue deepset-ai#2110 - Amazon S3 Vectors document store integration with: - S3VectorsDocumentStore: full DocumentStore protocol (count, write, filter, delete) - S3VectorsEmbeddingRetriever: embedding-based retrieval with metadata filtering - Filter conversion from Haystack format to S3 Vectors filter syntax - Auto-creation of vector buckets and indexes - AWS credential support via Secret (or default credential chain) - 49 unit tests covering store, retriever, filters, and serialization - README with usage examples and known limitations
…rkflow - boto3 lower bound set to 1.42.0 (when s3vectors service was added) - pydoc filename changed to amazon_s3_vectors.md (underscores, matching folder name) - Quote $GITHUB_OUTPUT in workflow to fix shellcheck SC2086
- Flatten test classes into standalone functions (matching pinecone/qdrant pattern) - Assert full serialized dict structure in to_dict/from_dict tests - Use Mock(spec=...) for retriever tests instead of MagicMock+patch - Verify _embedding_retrieval call args match exactly - Add test_from_dict_no_filter_policy (backward compat) - Add test_init_is_lazy
Remove tests that just verify mock plumbing (count, write, delete calling the mock client). Keep tests that verify our actual logic: - Serialization roundtrip (full dict structure) - Score conversion (cosine + euclidean) - Filter conversion (pure function with real logic) - Duplicate policy batch checks (SKIP/NONE) - Document <-> S3 vector conversion - Input validation Before: 49 unit tests (many testing mock behavior) After: 26 unit tests (all testing our code) + 12 integration tests
- Class docstring: top_k cap, dimension limit, metadata limits, float32 only - write_documents: embedding required, 40KB metadata limit - _embedding_retrieval: top_k=100 cap, no embeddings in response - Retriever run: top_k=100, server-side filters, no embeddings returned
…ity, deduplicate retrieval logic - Replace hand-rolled _apply_filters_in_memory/_document_matches/_compare with haystack.utils.filters.document_matches_filter (same utility used by InMemoryDocumentStore). Gains NOT operator, nested dotted field paths, and date comparison support for free. (-65 lines) - Deduplicate blob/content reconstruction in _embedding_retrieval() by reusing _s3_vector_to_document() + dataclasses.replace() (-20 lines) - Make filter_documents() warning conditional on filters actually being provided (no warning when listing all documents)
Matches the pattern used by the amazon_bedrock workflow: - top-level id-token: write permission - AWS_REGION env var - configure-aws-credentials step (skipped on fork PRs and dependabot) - integration tests gated on successful auth
Matches the repo convention used across other integrations.
Trim the README down to badges + integration page / changelog links + the AWS auth note, matching the pattern used by amazon_bedrock, anthropic, qdrant, and other integrations in this repo. The rich usage / limitations content is preserved for the haystack.deepset.ai integration page. Addresses deepset-ai#3149 (comment)
…ments Move type / embedding validation out of the per-batch loop and into a single pass before any `put_vectors` call, so a bad input never leaves the store with a partial write. Also align with Haystack's duplicate-policy convention: `DuplicatePolicy.FAIL` is now supported and, together with `DuplicatePolicy.NONE`, raises `DuplicateDocumentError` (was `DocumentStoreError` for NONE) when any target id already exists. Unit tests updated accordingly. Addresses deepset-ai#3149 (comment)
…ion tests Replace the handcrafted integration suite with Haystack's standard DocumentStore test contract, following the pgvector / pinecone pattern: * tests/conftest.py: shared `document_store` fixture. One vector bucket per session, one index per test for isolation. Wraps `write_documents` and `delete_documents` to (a) inject a default zero embedding for any Document missing one, since S3 Vectors requires embeddings, and (b) sleep briefly afterwards to absorb eventual consistency. * test_document_store.py: appends `TestDocumentStore` integration class inheriting CountDocumentsTest, WriteDocumentsTest, DeleteDocumentsTest, FilterableDocsFixtureMixin. Overrides `assert_documents_are_equal` for float32 round-trip tolerance. * test_filters.py: appends `TestFilters` inheriting FilterDocumentsTest. `filter_documents` already routes matching through haystack's document_matches_filter, so no operators are skipped. * test_integration.py: trimmed to the retriever-specific tests not covered by the base mixins (embedding retrieval, retriever component, to_dict/from_dict roundtrip on a live store). Coverage goes from 12 handcrafted integration tests to 53 (10 store + 39 filter + 4 retriever). Addresses deepset-ai#3149 (comment)
c9e8399 to
3eb9a38
Compare
| with warnings.catch_warnings(): | ||
| warnings.simplefilter("ignore") |
There was a problem hiding this comment.
Could you please explain why ignoring these warnings?
| # --------------------------------------------------------------------------- | ||
| # Integration tests — exercise a real S3 Vectors bucket via the `document_store` | ||
| # fixture in conftest.py. The mixins below come from Haystack's test kit so we | ||
| # get its standard Document Store contract for free. | ||
| # --------------------------------------------------------------------------- |
There was a problem hiding this comment.
| # --------------------------------------------------------------------------- | |
| # Integration tests — exercise a real S3 Vectors bucket via the `document_store` | |
| # fixture in conftest.py. The mixins below come from Haystack's test kit so we | |
| # get its standard Document Store contract for free. | |
| # --------------------------------------------------------------------------- |
| # --------------------------------------------------------------------------- | ||
| # Integration tests — run Haystack's full filter contract against a real S3 | ||
| # Vectors index. `S3VectorsDocumentStore.filter_documents` delegates the | ||
| # actual matching to `haystack.utils.filters.document_matches_filter`, so the | ||
| # only S3-specific quirk we have to absorb here is the float32 round-trip on | ||
| # embeddings. | ||
| # --------------------------------------------------------------------------- |
There was a problem hiding this comment.
| # --------------------------------------------------------------------------- | |
| # Integration tests — run Haystack's full filter contract against a real S3 | |
| # Vectors index. `S3VectorsDocumentStore.filter_documents` delegates the | |
| # actual matching to `haystack.utils.filters.document_matches_filter`, so the | |
| # only S3-specific quirk we have to absorb here is the float32 round-trip on | |
| # embeddings. | |
| # --------------------------------------------------------------------------- |
| from haystack_integrations.document_stores.amazon_s3_vectors import S3VectorsDocumentStore | ||
|
|
||
|
|
||
| @pytest.mark.integration |
There was a problem hiding this comment.
let's put the following tests in test_embedding_retriever.py
| for i in range(0, len(documents), _WRITE_BATCH_SIZE): | ||
| batch = documents[i : i + _WRITE_BATCH_SIZE] | ||
|
|
||
| # Batch-check for existing documents when needed | ||
| existing_ids: set[str] = set() | ||
| if policy in (DuplicatePolicy.SKIP, DuplicatePolicy.NONE, DuplicatePolicy.FAIL): | ||
| batch_ids = [doc.id for doc in batch] | ||
| for j in range(0, len(batch_ids), _GET_BATCH_SIZE): | ||
| id_chunk = batch_ids[j : j + _GET_BATCH_SIZE] | ||
| response = client.get_vectors( | ||
| vectorBucketName=self.vector_bucket_name, | ||
| indexName=self.index_name, | ||
| keys=id_chunk, | ||
| ) | ||
| for v in response.get("vectors", []): | ||
| existing_ids.add(v["key"]) | ||
|
|
||
| if policy in (DuplicatePolicy.NONE, DuplicatePolicy.FAIL) and existing_ids: | ||
| msg = ( | ||
| f"Document(s) {sorted(existing_ids)} already exist in the document store. " | ||
| "Use DuplicatePolicy.OVERWRITE or DuplicatePolicy.SKIP." | ||
| ) | ||
| raise DuplicateDocumentError(msg) |
There was a problem hiding this comment.
Doing this check before writing seems expensive to me
- is there a simpler and faster way to check existing docs ids?
- if not, I'd only support
DuplicatePolicy.OVERWRITEas we do for example in Pinecone
|
In addition, running integration tests locally (which are skipped in the CI for PRs from forks), I get several errors. An example: |
Related Issues
Proposed Changes:
Adds an Amazon S3 Vectors document store integration — a serverless vector storage capability native to S3.
Components:
S3VectorsDocumentStore— full DocumentStore protocol (write, count, filter, delete)S3VectorsEmbeddingRetriever— embedding-based retrieval with server-side metadata filteringKey design decisions:
1 - distance) for Haystack conventionfilter_documents()useslist_vectors(returnData=True, returnMetadata=True)with client-side filtering (warning logged) since S3 Vectors has no standalone filter APIDuplicatePolicy.SKIP/NONE(batches of 100)Known limitations (documented in README):
top_kcapped at 100 (service limit)query_vectorsdoes not return embedding datafloat32,cosine/euclidean, eventual consistencyHow did you test it?
pytestmarkcredential guard for CIhatch run test:all,hatch run fmt,hatch run test:typesexamples/example.py) verified against live AWSNotes for the reviewer
This PR was fully generated with an AI assistant. I have reviewed the changes and run the relevant tests.
Structure and test style follow the Pinecone integration pattern.
Checklist
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:.