Is it possible to use aws bedrock embeddings or Custom Embedding for HybridChunker? #3124

princeGedeon · 2026-03-13T10:57:04Z

To process complex documents containing tables and unstructured tables in the AWS ecosystem, can I use Docling with AWS embeddings via Bedrock?

Answered by dosubot[bot]

2026-03-13T11:00:13Z

dosubot[bot]
bot Mar 13, 2026

Yes, you can use Docling with AWS Bedrock embeddings, but there's an important distinction to understand:

HybridChunker only handles chunking, not embedding generation. It uses a tokenizer to count tokens and place chunk boundaries so that each chunk stays within the limit, but it outputs text chunks, not embeddings. The embedding step happens separately in your RAG pipeline.
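To make the role of the tokenizer concrete, here is a toy sketch of token-budgeted chunking. It is not Docling's algorithm: `split_by_token_budget` and `whitespace_token_count` are hypothetical names made up for illustration, and the whitespace counter is only a stand-in for a real tokenizer. The point it shows is that the token counter decides where chunks end, and the output is plain text.

```python
from typing import Callable, List


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter: one token per whitespace-separated word."""
    return len(text.split())


def split_by_token_budget(
    paragraphs: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> List[str]:
    """Greedily pack paragraphs into chunks that stay within max_tokens.

    A single paragraph larger than the budget is kept whole in this toy version.
    """
    chunks: List[str] = []
    current: List[str] = []
    for paragraph in paragraphs:
        candidate = "\n".join(current + [paragraph])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append("\n".join(current))
            current = [paragraph]
        else:
            current.append(paragraph)
    if current:
        chunks.append("\n".join(current))
    return chunks


paragraphs = ["one two three", "four five", "six seven eight nine"]
chunks = split_by_token_budget(paragraphs, whitespace_token_count, max_tokens=5)
print(chunks)  # a list of strings, no vectors involved
```

Swapping in a different `count_tokens` changes where the boundaries fall, which is why the tokenizer should match the embedding model you plan to call afterwards.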

For AWS Bedrock integration, you have two approaches:

Use HybridChunker for chunking, then embed with Bedrock separately:

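import json
import boto3
from docling_core.transforms.chunker import HybridChunker

chunker = HybridChunker(max_tokens=512)
chunks = list(chunker.chunk(doc))  # doc: your converted DoclingDocument

# Then embed each chunk using your Bedrock client
# (request/response shape assumes Amazon Titan Text Embeddings V2)
bedrock_client = boto3.client("bedrock-runtime")
for chunk in chunks:
    response = bedrock_client.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": chunk.text}),
    )
    embedding = json.loads(response["body"].read())["embedding"]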
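The embedding step can also be wrapped in a small helper that is easy to exercise without AWS credentials. This is a sketch under assumptions: it targets Amazon Titan Text Embeddings V2 (request key `inputText`, response key `embedding`), other Bedrock models use different schemas, and `embed_texts` and `FakeBedrockClient` are hypothetical names introduced here, not part of boto3 or Docling.

```python
import io
import json
from typing import Any, Dict, List

# Assumption: Amazon Titan Text Embeddings V2; adjust for other Bedrock models.
MODEL_ID = "amazon.titan-embed-text-v2:0"


def embed_texts(client: Any, texts: List[str], model_id: str = MODEL_ID) -> List[List[float]]:
    """Return one embedding vector per input text, in input order.

    `client` is anything exposing invoke_model(modelId=..., body=...),
    such as a boto3 "bedrock-runtime" client.
    """
    vectors: List[List[float]] = []
    for text in texts:
        response = client.invoke_model(
            modelId=model_id,
            body=json.dumps({"inputText": text}),
        )
        payload = json.loads(response["body"].read())
        vectors.append(payload["embedding"])
    return vectors


class FakeBedrockClient:
    """Hypothetical offline stand-in that mimics the response shape."""

    def invoke_model(self, modelId: str, body: str) -> Dict[str, Any]:
        text = json.loads(body)["inputText"]
        fake_vector = [float(len(text)), 0.0]  # not a real embedding
        return {"body": io.BytesIO(json.dumps({"embedding": fake_vector}).encode())}


vectors = embed_texts(FakeBedrockClient(), ["abc", "hello"])
print(vectors)
```

Against real AWS you would pass `boto3.client("bedrock-runtime")` as `client` and `[chunk.text for chunk in chunks]` as `texts`.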

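Create a custom tokenizer for Bedrock models by implementing the BaseTokenizer interface, so that chunk sizes are measured against your embedding model's limit:

from docling_core.transforms.chunker import HybridChunker
from docling_core.transforms.chunker.tokenizer.base import BaseTokenizer

class BedrockTokenizer(BaseTokenizer):
    def count_tokens(self, text: str) -> int:
        # Placeholder heuristic (about 4 characters per token), not the model's
        # real tokenizer; replace it with an exact count if one is available.
        return (len(text) + 3) // 4

    def get_max_tokens(self) -> int:
        return 8192  # or your model's limit

    def get_tokenizer(self):
        return self

chunker = HybridChunker(tokenizer=BedrockTokenizer())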

Built-in tokenizers include HuggingFace and OpenAI tiktoken. HybridChunker also handles complex tables well, automatically splitting large tables across chunks while preserving headers.
