Skip to content

feat: Add Amazon S3 Vectors document store integration#3149

Open
dotKokott wants to merge 12 commits into
deepset-ai:mainfrom
dotKokott:feature/amazon-s3-vectors-integration
Open

feat: Add Amazon S3 Vectors document store integration#3149
dotKokott wants to merge 12 commits into
deepset-ai:mainfrom
dotKokott:feature/amazon-s3-vectors-integration

Conversation

@dotKokott
Copy link
Copy Markdown
Contributor

@dotKokott dotKokott commented Apr 13, 2026

Related Issues

Proposed Changes:

Adds an Amazon S3 Vectors document store integration — a serverless vector storage capability native to S3.

Components:

  • S3VectorsDocumentStore — full DocumentStore protocol (write, count, filter, delete)
  • S3VectorsEmbeddingRetriever — embedding-based retrieval with server-side metadata filtering

Key design decisions:

  • Content stored as non-filterable metadata (AWS-recommended pattern for large text)
  • Cosine distance converted to similarity score (1 - distance) for Haystack convention
  • Blob data uses base64 encoding for round-trip fidelity
  • filter_documents() uses list_vectors(returnData=True, returnMetadata=True) with client-side filtering (warning logged) since S3 Vectors has no standalone filter API
  • Batch existence checks for DuplicatePolicy.SKIP/NONE (batches of 100)

Known limitations (documented in README):

  • top_k capped at 100 (service limit)
  • query_vectors does not return embedding data
  • 40KB total metadata per vector, 2KB filterable
  • Only float32, cosine/euclidean, eventual consistency

How did you test it?

  • 26 unit tests — serialization, score conversion, filter conversion, duplicate policy logic, document conversion (mocked boto3)
  • 12 integration tests — full lifecycle against live AWS S3 Vectors, with pytestmark credential guard for CI
  • hatch run test:all, hatch run fmt, hatch run test:types
  • Example script (examples/example.py) verified against live AWS

Notes for the reviewer

This PR was fully generated with an AI assistant. I have reviewed the changes and run the relevant tests.
Structure and test style follow the Pinecone integration pattern.

Checklist

@github-actions github-actions Bot added topic:CI type:documentation Improvements or additions to documentation labels Apr 13, 2026
@dotKokott dotKokott force-pushed the feature/amazon-s3-vectors-integration branch from 1df9666 to 90c4977 Compare April 13, 2026 13:28
@dotKokott dotKokott marked this pull request as ready for review April 13, 2026 13:39
@dotKokott dotKokott requested a review from a team as a code owner April 13, 2026 13:39
@dotKokott dotKokott requested review from anakin87 and removed request for a team April 13, 2026 13:39
@dotKokott dotKokott marked this pull request as draft April 13, 2026 13:39
@dotKokott dotKokott marked this pull request as ready for review April 13, 2026 13:40
@dotKokott
Copy link
Copy Markdown
Contributor Author

CI: Integration tests need AWS credential setup

The integration tests currently run unconditionally in CI with no AWS credentials configured. The tests have a pytestmark = pytest.mark.skipif(not _aws_credentials_available(), ...) guard so they silently skip (0 collected), but this means:

  • Integration tests never actually run in CI — only locally by developers with AWS credentials
  • The "combined" coverage badge will just reflect unit test coverage

What needs to happen

The workflow should match the amazon_bedrock.yml pattern — add an OIDC role assumption step and gate the integration test run on its success:

# Do not authenticate on PRs from forks and on PRs created by dependabot
- name: AWS authentication
  id: aws-auth
  if: github.event_name == 'schedule' || (github.event.pull_request.head.repo.full_name == github.repository && !startsWith(github.event.pull_request.head.ref, 'dependabot/'))
  uses: aws-actions/configure-aws-credentials@ec61189d14ec14c8efccab744f656cffd0e33f37
  with:
    aws-region: us-east-1
    role-to-assume: ${{ secrets.AWS_S3_VECTORS_CI_ROLE_ARN }}

- name: Run integration tests
  if: success() && steps.aws-auth.outcome == 'success'
  run: hatch run test:integration-cov-append-retry

Prerequisites (maintainer action required)

  1. Create an IAM role with s3vectors:* permissions (scoped to haystack-test-* bucket names)
  2. Configure the role's trust policy for GitHub OIDC (token.actions.githubusercontent.com)
  3. Add the role ARN as a repository secret (e.g. AWS_S3_VECTORS_CI_ROLE_ARN)

@anakin87
Copy link
Copy Markdown
Member

@dotKokott I'll try to take a look in the next few days.

Have you tried the integration yourself in a real-world setting with AWS?

@dotKokott
Copy link
Copy Markdown
Contributor Author

dotKokott commented Apr 15, 2026

@dotKokott I'll try to take a look in the next few days.

Have you tried the integration yourself in a real-world setting with AWS?

I have tried all integration tests and examples on my AWS account.

However I did not try with any large datasets. That might be next thing to validate: does this work as expected with real load.

Copy link
Copy Markdown
Member

@anakin87 anakin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some initial comments.
Will take a better look soon

Comment thread .github/workflows/amazon_s3_vectors.yml
Comment thread integrations/amazon_s3_vectors/README.md
Comment thread integrations/amazon_s3_vectors/tests/test_integration.py
dotKokott added 12 commits May 16, 2026 12:39
Implements issue deepset-ai#2110 - Amazon S3 Vectors document store integration with:

- S3VectorsDocumentStore: full DocumentStore protocol (count, write, filter, delete)
- S3VectorsEmbeddingRetriever: embedding-based retrieval with metadata filtering
- Filter conversion from Haystack format to S3 Vectors filter syntax
- Auto-creation of vector buckets and indexes
- AWS credential support via Secret (or default credential chain)
- 49 unit tests covering store, retriever, filters, and serialization
- README with usage examples and known limitations
…rkflow

- boto3 lower bound set to 1.42.0 (when s3vectors service was added)
- pydoc filename changed to amazon_s3_vectors.md (underscores, matching folder name)
- Quote $GITHUB_OUTPUT in workflow to fix shellcheck SC2086
- Flatten test classes into standalone functions (matching pinecone/qdrant pattern)
- Assert full serialized dict structure in to_dict/from_dict tests
- Use Mock(spec=...) for retriever tests instead of MagicMock+patch
- Verify _embedding_retrieval call args match exactly
- Add test_from_dict_no_filter_policy (backward compat)
- Add test_init_is_lazy
Remove tests that just verify mock plumbing (count, write, delete calling
the mock client). Keep tests that verify our actual logic:
- Serialization roundtrip (full dict structure)
- Score conversion (cosine + euclidean)
- Filter conversion (pure function with real logic)
- Duplicate policy batch checks (SKIP/NONE)
- Document <-> S3 vector conversion
- Input validation

Before: 49 unit tests (many testing mock behavior)
After: 26 unit tests (all testing our code) + 12 integration tests
- Class docstring: top_k cap, dimension limit, metadata limits, float32 only
- write_documents: embedding required, 40KB metadata limit
- _embedding_retrieval: top_k=100 cap, no embeddings in response
- Retriever run: top_k=100, server-side filters, no embeddings returned
…ity, deduplicate retrieval logic

- Replace hand-rolled _apply_filters_in_memory/_document_matches/_compare
  with haystack.utils.filters.document_matches_filter (same utility used by
  InMemoryDocumentStore). Gains NOT operator, nested dotted field paths, and
  date comparison support for free. (-65 lines)
- Deduplicate blob/content reconstruction in _embedding_retrieval() by
  reusing _s3_vector_to_document() + dataclasses.replace() (-20 lines)
- Make filter_documents() warning conditional on filters actually being
  provided (no warning when listing all documents)
Matches the pattern used by the amazon_bedrock workflow:
- top-level id-token: write permission
- AWS_REGION env var
- configure-aws-credentials step (skipped on fork PRs and dependabot)
- integration tests gated on successful auth
Matches the repo convention used across other integrations.
Trim the README down to badges + integration page / changelog links + the
AWS auth note, matching the pattern used by amazon_bedrock, anthropic,
qdrant, and other integrations in this repo. The rich usage / limitations
content is preserved for the haystack.deepset.ai integration page.

Addresses deepset-ai#3149 (comment)
…ments

Move type / embedding validation out of the per-batch loop and into a
single pass before any `put_vectors` call, so a bad input never leaves
the store with a partial write.

Also align with Haystack's duplicate-policy convention: `DuplicatePolicy.FAIL`
is now supported and, together with `DuplicatePolicy.NONE`, raises
`DuplicateDocumentError` (was `DocumentStoreError` for NONE) when any
target id already exists. Unit tests updated accordingly.

Addresses deepset-ai#3149 (comment)
…ion tests

Replace the handcrafted integration suite with Haystack's standard
DocumentStore test contract, following the pgvector / pinecone pattern:

* tests/conftest.py: shared `document_store` fixture. One vector bucket
  per session, one index per test for isolation. Wraps `write_documents`
  and `delete_documents` to (a) inject a default zero embedding for any
  Document missing one, since S3 Vectors requires embeddings, and
  (b) sleep briefly afterwards to absorb eventual consistency.
* test_document_store.py: appends `TestDocumentStore` integration class
  inheriting CountDocumentsTest, WriteDocumentsTest, DeleteDocumentsTest,
  FilterableDocsFixtureMixin. Overrides `assert_documents_are_equal` for
  float32 round-trip tolerance.
* test_filters.py: appends `TestFilters` inheriting FilterDocumentsTest.
  `filter_documents` already routes matching through haystack's
  document_matches_filter, so no operators are skipped.
* test_integration.py: trimmed to the retriever-specific tests not covered
  by the base mixins (embedding retrieval, retriever component,
  to_dict/from_dict roundtrip on a live store).

Coverage goes from 12 handcrafted integration tests to 53 (10 store + 39
filter + 4 retriever).

Addresses deepset-ai#3149 (comment)
@dotKokott dotKokott force-pushed the feature/amazon-s3-vectors-integration branch from c9e8399 to 3eb9a38 Compare May 16, 2026 10:40
Comment on lines +139 to +140
with warnings.catch_warnings():
warnings.simplefilter("ignore")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain why ignoring these warnings?

Comment on lines +280 to +284
# ---------------------------------------------------------------------------
# Integration tests — exercise a real S3 Vectors bucket via the `document_store`
# fixture in conftest.py. The mixins below come from Haystack's test kit so we
# get its standard Document Store contract for free.
# ---------------------------------------------------------------------------
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# ---------------------------------------------------------------------------
# Integration tests — exercise a real S3 Vectors bucket via the `document_store`
# fixture in conftest.py. The mixins below come from Haystack's test kit so we
# get its standard Document Store contract for free.
# ---------------------------------------------------------------------------

Comment on lines +96 to +102
# ---------------------------------------------------------------------------
# Integration tests — run Haystack's full filter contract against a real S3
# Vectors index. `S3VectorsDocumentStore.filter_documents` delegates the
# actual matching to `haystack.utils.filters.document_matches_filter`, so the
# only S3-specific quirk we have to absorb here is the float32 round-trip on
# embeddings.
# ---------------------------------------------------------------------------
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# ---------------------------------------------------------------------------
# Integration tests — run Haystack's full filter contract against a real S3
# Vectors index. `S3VectorsDocumentStore.filter_documents` delegates the
# actual matching to `haystack.utils.filters.document_matches_filter`, so the
# only S3-specific quirk we have to absorb here is the float32 round-trip on
# embeddings.
# ---------------------------------------------------------------------------

from haystack_integrations.document_stores.amazon_s3_vectors import S3VectorsDocumentStore


@pytest.mark.integration
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's put the following tests in test_embedding_retriever.py

Comment on lines +280 to +302
for i in range(0, len(documents), _WRITE_BATCH_SIZE):
batch = documents[i : i + _WRITE_BATCH_SIZE]

# Batch-check for existing documents when needed
existing_ids: set[str] = set()
if policy in (DuplicatePolicy.SKIP, DuplicatePolicy.NONE, DuplicatePolicy.FAIL):
batch_ids = [doc.id for doc in batch]
for j in range(0, len(batch_ids), _GET_BATCH_SIZE):
id_chunk = batch_ids[j : j + _GET_BATCH_SIZE]
response = client.get_vectors(
vectorBucketName=self.vector_bucket_name,
indexName=self.index_name,
keys=id_chunk,
)
for v in response.get("vectors", []):
existing_ids.add(v["key"])

if policy in (DuplicatePolicy.NONE, DuplicatePolicy.FAIL) and existing_ids:
msg = (
f"Document(s) {sorted(existing_ids)} already exist in the document store. "
"Use DuplicatePolicy.OVERWRITE or DuplicatePolicy.SKIP."
)
raise DuplicateDocumentError(msg)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing this check before writing seems expensive to me

  • is there a simpler and faster way to check existing docs ids?
  • if not, I'd only support DuplicatePolicy.OVERWRITE as we do for example in Pinecone

@anakin87
Copy link
Copy Markdown
Member

In addition, running integration tests locally (which are skipped in the CI for PRs from forks), I get several errors. An example:

FAILED tests/test_document_store.py::TestDocumentStore::test_delete_documents -
botocore.errorfactory.ValidationException: An error occurred (ValidationException) when calling the PutVectors operation: 
Invalid record for key '657...': cosine distance does not support vectors with zero norm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

topic:CI type:documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Amazon S3 Vectors (DocStore)

2 participants