feat: Add Amazon S3 Vectors document store integration by dotKokott · Pull Request #3149 · deepset-ai/haystack-core-integrations

dotKokott · 2026-04-13T12:49:32Z

Related Issues

fixes Amazon S3 Vectors (DocStore) #2110

Proposed Changes:

Adds an Amazon S3 Vectors document store integration — a serverless vector storage capability native to S3.

Components:

S3VectorsDocumentStore — full DocumentStore protocol (write, count, filter, delete)
S3VectorsEmbeddingRetriever — embedding-based retrieval with server-side metadata filtering

Key design decisions:

Content stored as non-filterable metadata (AWS-recommended pattern for large text)
Cosine distance converted to similarity score (1 - distance) for Haystack convention
Blob data uses base64 encoding for round-trip fidelity
filter_documents() uses list_vectors(returnData=True, returnMetadata=True) with client-side filtering (warning logged) since S3 Vectors has no standalone filter API
Batch existence checks for DuplicatePolicy.SKIP/NONE (batches of 100)
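
Two of the decisions above, the score conversion and the base64 blob round trip, can be sketched as follows. This is a minimal illustration rather than the PR's code: the function names are hypothetical, and only the `1 - distance` rule and the use of base64 come from the description.

```python
import base64


def cosine_distance_to_score(distance: float) -> float:
    """Map a cosine distance to a similarity score (higher means more similar)."""
    return 1.0 - distance


def encode_blob(data: bytes) -> str:
    """Encode raw bytes as ASCII text so they can live in string metadata."""
    return base64.b64encode(data).decode("ascii")


def decode_blob(text: str) -> bytes:
    """Invert encode_blob, recovering the original bytes exactly."""
    return base64.b64decode(text.encode("ascii"))


if __name__ == "__main__":
    payload = bytes(range(256))
    # Every possible byte value survives the encode/decode round trip.
    print(decode_blob(encode_blob(payload)) == payload)
    print(cosine_distance_to_score(0.25))
```

Base64 inflates the payload by roughly a third, which matters given the per-vector metadata limit listed under the known limitations.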

Known limitations (documented in README):

top_k capped at 100 (service limit)
query_vectors does not return embedding data
40KB total metadata per vector, 2KB filterable
Only float32, cosine/euclidean, eventual consistency

How did you test it?

26 unit tests — serialization, score conversion, filter conversion, duplicate policy logic, document conversion (mocked boto3)
12 integration tests — full lifecycle against live AWS S3 Vectors, with pytestmark credential guard for CI
hatch run test:all, hatch run fmt, hatch run test:types
Example script (examples/example.py) verified against live AWS

Notes for the reviewer

This PR was fully generated with an AI assistant. I have reviewed the changes and run the relevant tests.
Structure and test style follow the Pinecone integration pattern.

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.

dotKokott · 2026-04-13T14:18:48Z

CI: Integration tests need AWS credential setup

The integration tests currently run unconditionally in CI with no AWS credentials configured. The tests have a pytestmark = pytest.mark.skipif(not _aws_credentials_available(), ...) guard so they silently skip (0 collected), but this means:

Integration tests never actually run in CI — only locally by developers with AWS credentials
The "combined" coverage badge will just reflect unit test coverage
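
For context, a credential guard of this kind might look like the sketch below. The PR's actual `_aws_credentials_available()` is not shown in this thread, so this version is an assumption: it only inspects environment variables and never contacts AWS, and the optional `env` parameter exists purely to make it testable.

```python
import os
from typing import Mapping, Optional


def aws_credentials_available(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if static keys or a named profile are set in the environment.

    Hypothetical stand-in for the PR's helper; it does not validate the
    credentials, it only checks that something is configured.
    """
    env = os.environ if env is None else env
    has_static_keys = bool(env.get("AWS_ACCESS_KEY_ID")) and bool(env.get("AWS_SECRET_ACCESS_KEY"))
    has_profile = bool(env.get("AWS_PROFILE"))
    return has_static_keys or has_profile


# In a test module, the result would typically feed a module-level skip marker:
#   pytestmark = pytest.mark.skipif(
#       not aws_credentials_available(), reason="AWS credentials not configured"
#   )
```

Because a module-level `skipif` skips silently, a guard like this hides the integration tests whenever CI has no credentials, which is the problem described above.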

What needs to happen

The workflow should match the amazon_bedrock.yml pattern — add an OIDC role assumption step and gate the integration test run on its success:

# Do not authenticate on PRs from forks and on PRs created by dependabot
- name: AWS authentication
  id: aws-auth
  if: github.event_name == 'schedule' || (github.event.pull_request.head.repo.full_name == github.repository && !startsWith(github.event.pull_request.head.ref, 'dependabot/'))
  uses: aws-actions/configure-aws-credentials@ec61189d14ec14c8efccab744f656cffd0e33f37
  with:
    aws-region: us-east-1
    role-to-assume: ${{ secrets.AWS_S3_VECTORS_CI_ROLE_ARN }}

- name: Run integration tests
  if: success() && steps.aws-auth.outcome == 'success'
  run: hatch run test:integration-cov-append-retry

Prerequisites (maintainer action required)

Create an IAM role with s3vectors:* permissions (scoped to haystack-test-* bucket names)
Configure the role's trust policy for GitHub OIDC (token.actions.githubusercontent.com)
Add the role ARN as a repository secret (e.g. AWS_S3_VECTORS_CI_ROLE_ARN)

anakin87 · 2026-04-15T07:52:35Z

@dotKokott I'll try to take a look in the next few days.

Have you tried the integration yourself in a real-world setting with AWS?

dotKokott · 2026-04-15T10:57:12Z

> @dotKokott I'll try to take a look in the next few days.
>
> Have you tried the integration yourself in a real-world setting with AWS?

I have tried all integration tests and examples on my AWS account.

However, I did not try it with any large datasets. That might be the next thing to validate: whether this works as expected under real load.

anakin87

I left some initial comments.
I will take a closer look soon.

Implements issue deepset-ai#2110 - Amazon S3 Vectors document store integration with:
- S3VectorsDocumentStore: full DocumentStore protocol (count, write, filter, delete)
- S3VectorsEmbeddingRetriever: embedding-based retrieval with metadata filtering
- Filter conversion from Haystack format to S3 Vectors filter syntax
- Auto-creation of vector buckets and indexes
- AWS credential support via Secret (or default credential chain)
- 49 unit tests covering store, retriever, filters, and serialization
- README with usage examples and known limitations

…rkflow
- boto3 lower bound set to 1.42.0 (when s3vectors service was added)
- pydoc filename changed to amazon_s3_vectors.md (underscores, matching folder name)
- Quote $GITHUB_OUTPUT in workflow to fix shellcheck SC2086

- Flatten test classes into standalone functions (matching pinecone/qdrant pattern)
- Assert full serialized dict structure in to_dict/from_dict tests
- Use Mock(spec=...) for retriever tests instead of MagicMock+patch
- Verify _embedding_retrieval call args match exactly
- Add test_from_dict_no_filter_policy (backward compat)
- Add test_init_is_lazy

Remove tests that just verify mock plumbing (count, write, delete calling the mock client). Keep tests that verify our actual logic:
- Serialization roundtrip (full dict structure)
- Score conversion (cosine + euclidean)
- Filter conversion (pure function with real logic)
- Duplicate policy batch checks (SKIP/NONE)
- Document <-> S3 vector conversion
- Input validation

Before: 49 unit tests (many testing mock behavior)
After: 26 unit tests (all testing our code) + 12 integration tests

- Class docstring: top_k cap, dimension limit, metadata limits, float32 only
- write_documents: embedding required, 40KB metadata limit
- _embedding_retrieval: top_k=100 cap, no embeddings in response
- Retriever run: top_k=100, server-side filters, no embeddings returned

…ity, deduplicate retrieval logic
- Replace hand-rolled _apply_filters_in_memory/_document_matches/_compare with haystack.utils.filters.document_matches_filter (same utility used by InMemoryDocumentStore). Gains NOT operator, nested dotted field paths, and date comparison support for free. (-65 lines)
- Deduplicate blob/content reconstruction in _embedding_retrieval() by reusing _s3_vector_to_document() + dataclasses.replace() (-20 lines)
- Make filter_documents() warning conditional on filters actually being provided (no warning when listing all documents)

Matches the pattern used by the amazon_bedrock workflow:
- top-level id-token: write permission
- AWS_REGION env var
- configure-aws-credentials step (skipped on fork PRs and dependabot)
- integration tests gated on successful auth

Matches the repo convention used across other integrations.

Trim the README down to badges + integration page / changelog links + the AWS auth note, matching the pattern used by amazon_bedrock, anthropic, qdrant, and other integrations in this repo. The rich usage / limitations content is preserved for the haystack.deepset.ai integration page. Addresses deepset-ai#3149 (comment)

…ments

Move type / embedding validation out of the per-batch loop and into a single pass before any `put_vectors` call, so a bad input never leaves the store with a partial write.

Also align with Haystack's duplicate-policy convention: `DuplicatePolicy.FAIL` is now supported and, together with `DuplicatePolicy.NONE`, raises `DuplicateDocumentError` (was `DocumentStoreError` for NONE) when any target id already exists.

Unit tests updated accordingly.

Addresses deepset-ai#3149 (comment)
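
The validate-first ordering described in this commit can be sketched as below. It is an illustration under assumptions, not the PR's implementation: `Doc` is a hypothetical stand-in for a Haystack `Document`, `put_batch` stands in for the `put_vectors` call, and the batch size is passed in because the PR's real value is not stated here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document."""

    id: str
    embedding: Optional[List[float]] = None


def write_in_batches(
    docs: Sequence[object],
    put_batch: Callable[[List[Doc]], None],
    batch_size: int,
) -> int:
    """Validate every document first, then write in batches; return the count written."""
    # Pass 1: validate everything. If this raises, nothing has been written yet.
    for doc in docs:
        if not isinstance(doc, Doc):
            raise ValueError(f"Expected Doc, got {type(doc).__name__}")
        if doc.embedding is None:
            raise ValueError(f"Document {doc.id} has no embedding")

    # Pass 2: only now start issuing writes.
    written = 0
    for start in range(0, len(docs), batch_size):
        batch = list(docs[start : start + batch_size])
        put_batch(batch)
        written += len(batch)
    return written
```

With validation inside the per-batch loop, a bad document in the last batch would surface only after earlier batches had already been stored; hoisting the check removes that failure mode, although a service error midway through pass 2 can still leave a partial write.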

…ion tests

Replace the handcrafted integration suite with Haystack's standard DocumentStore test contract, following the pgvector / pinecone pattern:

* tests/conftest.py: shared `document_store` fixture. One vector bucket per session, one index per test for isolation. Wraps `write_documents` and `delete_documents` to (a) inject a default zero embedding for any Document missing one, since S3 Vectors requires embeddings, and (b) sleep briefly afterwards to absorb eventual consistency.
* test_document_store.py: appends `TestDocumentStore` integration class inheriting CountDocumentsTest, WriteDocumentsTest, DeleteDocumentsTest, FilterableDocsFixtureMixin. Overrides `assert_documents_are_equal` for float32 round-trip tolerance.
* test_filters.py: appends `TestFilters` inheriting FilterDocumentsTest. `filter_documents` already routes matching through haystack's document_matches_filter, so no operators are skipped.
* test_integration.py: trimmed to the retriever-specific tests not covered by the base mixins (embedding retrieval, retriever component, to_dict/from_dict roundtrip on a live store).

Coverage goes from 12 handcrafted integration tests to 53 (10 store + 39 filter + 4 retriever).

Addresses deepset-ai#3149 (comment)
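
The float32 round-trip tolerance mentioned for `assert_documents_are_equal` can be illustrated with the standard library alone. The helper names and tolerance values below are assumptions chosen for the example, not taken from the PR.

```python
import math
import struct
from typing import List, Sequence


def to_float32(values: Sequence[float]) -> List[float]:
    """Round each value to the nearest float32, returned as Python floats."""
    fmt = f"{len(values)}f"
    return list(struct.unpack(fmt, struct.pack(fmt, *values)))


def embeddings_close(
    a: Sequence[float],
    b: Sequence[float],
    rel_tol: float = 1e-6,
    abs_tol: float = 1e-6,
) -> bool:
    """Compare two embeddings element-wise within a tolerance."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(a, b)
    )


if __name__ == "__main__":
    original = [0.1, 0.2, 0.3]
    stored = to_float32(original)  # what a float32-only store would hand back
    print(stored == original)      # exact equality is lost
    print(embeddings_close(original, stored))
```

Python floats are 64-bit, so values such as 0.1 change slightly once stored as float32; an exact-equality assertion on embeddings would therefore fail even when the store behaves correctly.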

anakin87 · 2026-05-21T15:34:47Z

+            with warnings.catch_warnings():
+                warnings.simplefilter("ignore")


Could you please explain why these warnings are ignored?
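
For reference, a blanket `simplefilter("ignore")` hides every warning raised inside the block. A narrower alternative is to ignore a single category, as in this sketch; `noisy_call` and the choice of `DeprecationWarning` are hypothetical, since the thread does not say which warnings the PR is suppressing.

```python
import warnings
from typing import List, Tuple


def noisy_call() -> int:
    """Hypothetical function that emits two different kinds of warning."""
    warnings.warn("deprecated API", DeprecationWarning)
    warnings.warn("suspicious input", UserWarning)
    return 42


def call_ignoring_deprecations() -> Tuple[int, List[type]]:
    """Run noisy_call, hiding only DeprecationWarning and recording the rest."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # baseline: let every warning through
        warnings.simplefilter("ignore", DeprecationWarning)  # then hide one category
        result = noisy_call()
    return result, [w.category for w in caught]
```

Later `simplefilter` calls take precedence over earlier ones, so the category-specific rule wins for `DeprecationWarning` while other warnings remain visible.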

anakin87 · 2026-05-21T15:36:42Z

+# ---------------------------------------------------------------------------
+# Integration tests — exercise a real S3 Vectors bucket via the `document_store`
+# fixture in conftest.py. The mixins below come from Haystack's test kit so we
+# get its standard Document Store contract for free.
+# ---------------------------------------------------------------------------


Suggested change

# ---------------------------------------------------------------------------
# Integration tests — exercise a real S3 Vectors bucket via the `document_store`
# fixture in conftest.py. The mixins below come from Haystack's test kit so we
# get its standard Document Store contract for free.
# ---------------------------------------------------------------------------

anakin87 · 2026-05-21T15:37:40Z

+# ---------------------------------------------------------------------------
+# Integration tests — run Haystack's full filter contract against a real S3
+# Vectors index. `S3VectorsDocumentStore.filter_documents` delegates the
+# actual matching to `haystack.utils.filters.document_matches_filter`, so the
+# only S3-specific quirk we have to absorb here is the float32 round-trip on
+# embeddings.
+# ---------------------------------------------------------------------------


Suggested change

# ---------------------------------------------------------------------------
# Integration tests — run Haystack's full filter contract against a real S3
# Vectors index. `S3VectorsDocumentStore.filter_documents` delegates the
# actual matching to `haystack.utils.filters.document_matches_filter`, so the
# only S3-specific quirk we have to absorb here is the float32 round-trip on
# embeddings.
# ---------------------------------------------------------------------------

anakin87 · 2026-05-21T15:38:46Z

+from haystack_integrations.document_stores.amazon_s3_vectors import S3VectorsDocumentStore
+
+
+@pytest.mark.integration


Let's put the following tests in test_embedding_retriever.py.

anakin87 · 2026-05-21T15:45:04Z

+        for i in range(0, len(documents), _WRITE_BATCH_SIZE):
+            batch = documents[i : i + _WRITE_BATCH_SIZE]
+
+            # Batch-check for existing documents when needed
+            existing_ids: set[str] = set()
+            if policy in (DuplicatePolicy.SKIP, DuplicatePolicy.NONE, DuplicatePolicy.FAIL):
+                batch_ids = [doc.id for doc in batch]
+                for j in range(0, len(batch_ids), _GET_BATCH_SIZE):
+                    id_chunk = batch_ids[j : j + _GET_BATCH_SIZE]
+                    response = client.get_vectors(
+                        vectorBucketName=self.vector_bucket_name,
+                        indexName=self.index_name,
+                        keys=id_chunk,
+                    )
+                    for v in response.get("vectors", []):
+                        existing_ids.add(v["key"])
+
+                if policy in (DuplicatePolicy.NONE, DuplicatePolicy.FAIL) and existing_ids:
+                    msg = (
+                        f"Document(s) {sorted(existing_ids)} already exist in the document store. "
+                        "Use DuplicatePolicy.OVERWRITE or DuplicatePolicy.SKIP."
+                    )
+                    raise DuplicateDocumentError(msg)


Doing this check before writing seems expensive to me.

Is there a simpler and faster way to check for existing document IDs?

If not, I'd only support DuplicatePolicy.OVERWRITE, as we do in Pinecone, for example.
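
To make the cost concrete, the sketch below reproduces the chunked existence check from the diff against an in-memory fake and counts the round trips. `FakeVectorsClient` and `find_existing_ids` are hypothetical names; only the `get_vectors` keyword arguments, the response shape, and the chunk size of 100 are taken from the PR.

```python
from typing import Dict, Iterable, List, Set


class FakeVectorsClient:
    """Hypothetical in-memory stand-in for the boto3 client that counts calls."""

    def __init__(self, stored_keys: Iterable[str]) -> None:
        self._stored = set(stored_keys)
        self.calls = 0

    def get_vectors(
        self, *, vectorBucketName: str, indexName: str, keys: List[str]
    ) -> Dict[str, List[Dict[str, str]]]:
        self.calls += 1
        return {"vectors": [{"key": k} for k in keys if k in self._stored]}


def find_existing_ids(
    client: FakeVectorsClient,
    bucket: str,
    index: str,
    ids: List[str],
    chunk_size: int = 100,
) -> Set[str]:
    """Return the subset of ids already stored, querying chunk_size ids per call."""
    existing: Set[str] = set()
    for start in range(0, len(ids), chunk_size):
        response = client.get_vectors(
            vectorBucketName=bucket,
            indexName=index,
            keys=ids[start : start + chunk_size],
        )
        existing.update(v["key"] for v in response.get("vectors", []))
    return existing
```

Writing 450 documents under SKIP, NONE, or FAIL therefore adds five read requests before the first write, whereas OVERWRITE needs none.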

anakin87 · 2026-05-21T16:09:13Z

In addition, running integration tests locally (which are skipped in the CI for PRs from forks), I get several errors. An example:

FAILED tests/test_document_store.py::TestDocumentStore::test_delete_documents -
botocore.errorfactory.ValidationException: An error occurred (ValidationException) when calling the PutVectors operation: 
Invalid record for key '657...': cosine distance does not support vectors with zero norm
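
The error is consistent with the test fixture injecting an all-zero default embedding, which has zero norm and is therefore rejected by a cosine index. One possible remedy, sketched here as an assumption rather than as the fix adopted in the PR, is to default to a constant unit-norm vector instead; `default_embedding` and `l2_norm` are hypothetical helpers.

```python
import math
from typing import List, Sequence


def default_embedding(dimension: int) -> List[float]:
    """Return a constant vector with unit L2 norm, so it is never zero-norm."""
    if dimension <= 0:
        raise ValueError("dimension must be positive")
    value = 1.0 / math.sqrt(dimension)
    return [value] * dimension


def l2_norm(vector: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))
```

Every document that receives this default ends up with the same embedding, which is fine for count, write, and delete tests but would make similarity rankings among those documents meaningless.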

github-actions bot added the topic:CI and type:documentation labels Apr 13, 2026

dotKokott force-pushed the feature/amazon-s3-vectors-integration branch from 1df9666 to 90c4977 Compare April 13, 2026 13:28

dotKokott marked this pull request as ready for review April 13, 2026 13:39

dotKokott requested a review from a team as a code owner April 13, 2026 13:39

dotKokott requested review from anakin87 and removed request for a team April 13, 2026 13:39

dotKokott marked this pull request as draft April 13, 2026 13:39

dotKokott marked this pull request as ready for review April 13, 2026 13:40

anakin87 requested changes Apr 20, 2026


Comment thread .github/workflows/amazon_s3_vectors.yml

Comment thread integrations/amazon_s3_vectors/README.md

Comment thread ...zon_s3_vectors/src/haystack_integrations/document_stores/amazon_s3_vectors/document_store.py Outdated

anakin87 requested changes Apr 23, 2026


Comment thread ...zon_s3_vectors/src/haystack_integrations/document_stores/amazon_s3_vectors/document_store.py Outdated

Comment thread integrations/amazon_s3_vectors/tests/test_integration.py

dotKokott added 12 commits May 16, 2026 12:39

fix: pin haystack-ai>=2.26.1 for FilterPolicy support

7381e96

ci: add AWS authentication step for integration tests

52854c6


docs: use single backticks for inline code in docstrings

fdc3067


dotKokott force-pushed the feature/amazon-s3-vectors-integration branch from c9e8399 to 3eb9a38 Compare May 16, 2026 10:40

anakin87 requested changes May 21, 2026


dotKokott wants to merge 12 commits into deepset-ai:main from dotKokott:feature/amazon-s3-vectors-integration


2 participants
