feat: MarkdownHeaderSplitter #9660

OGuggenbuehl · 2025-07-29T13:55:52Z

Proposed Changes:

Implement MarkdownHeaderSplitter to split Documents written in .md based on their headers

How did you test it?

unit tests

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
I documented my code
I ran pre-commit hooks and fixed any issue

CLAassistant · 2025-07-29T13:55:59Z

All committers have signed the CLA.

haystack/components/preprocessors/markdown_header_splitter.py

sjrl · 2025-08-19T11:46:09Z

@OGuggenbuehl definitely looks like an interesting approach! I've left an initial set of comments, but to further review I'd appreciate if you could add a set of tests like the ones we have for the DocumentSplitter https://github.com/deepset-ai/haystack/blob/main/test/components/preprocessors/test_document_splitter.py

This will help me be able to review the actual algorithm for splitting since it's easier to understand with examples.

use haystack logging Co-authored-by: Sebastian Husch Lee <[email protected]>

remove temp toc Co-authored-by: Sebastian Husch Lee <[email protected]>

…enbuehl/haystack into feature/md-header-splitter

…dated page number

Co-authored-by: Sebastian Husch Lee <[email protected]>

…enbuehl/haystack into feature/md-header-splitter

Co-authored-by: Sebastian Husch Lee <[email protected]>

vercel · 2025-10-21T08:53:58Z

@OGuggenbuehl is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

…enbuehl/haystack into feature/md-header-splitter

sjrl · 2025-10-21T09:58:13Z

haystack/components/preprocessors/markdown_header_splitter.py

+        :param keep_headers: If True, headers are kept in the content. If False, headers are moved to metadata.
+            Defaults to True.


Does this mean that if keep_headers is True we don't store them in the metadata?

sjrl · 2025-10-21T09:59:51Z

haystack/components/preprocessors/markdown_header_splitter.py

+        :param skip_empty_documents: If True, skip documents with empty content. If False, process empty documents.
+            Defaults to True.


Let's reuse the docstring from DocumentSplitter

Suggested change

:param skip_empty_documents: If True, skip documents with empty content. If False, process empty documents.

Defaults to True.

:param skip_empty_documents: Choose whether to skip documents with empty content. Default is True.

Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text

from non-textual documents.

sjrl · 2025-10-21T10:03:21Z

haystack/components/preprocessors/markdown_header_splitter.py

+            self.secondary_splitter = DocumentSplitter(
+                split_by=self.secondary_split,
+                split_length=self.split_length,
+                split_overlap=self.split_overlap,
+                split_threshold=self.split_threshold,
+            )


I just realized we are missing a warm_up method for this component MarkdownHeaderSplitter which calls the warm_up of DocumentSplitter if secondary_split is provided.

This is needed in the case where the secondary_split of sentence is chosen.

So I'd add a warm up method like

def warm_up(self): """ Warm up the MarkdownHeaderSplitter """ if self.secondary_splitter and not self._is_warmed_up: self.secondary_splitter.warm_up() self._is_warmed_up = True

then add self._is_warmed_up = False in the init method of MarkdownHeaderSplitter

sjrl · 2025-10-21T10:05:00Z

haystack/components/preprocessors/markdown_header_splitter.py

+        for i, match in enumerate(matches):
+            # extract header info
+            header_prefix = match.group(1)
+            header_text = match.group(2).strip()


Could we avoid using strip() here? This would remove any white space characters from the text we probably want to keep.

sjrl · 2025-10-21T10:05:11Z

haystack/components/preprocessors/markdown_header_splitter.py

+            # get content
+            start = match.end()
+            end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
+            content = text[start:end].strip()


Same here let's remove the .strip()

sjrl · 2025-10-21T10:06:17Z

haystack/components/preprocessors/markdown_header_splitter.py

+                header_stack[j] = None
+
+            # prepare header_line if keep_headers
+            header_line = f"{header_prefix} {header_text}"


If we don't use .strip() anywhere then we should be able to do

Suggested change

header_line = f"{header_prefix} {header_text}"

header_line = f"{header_prefix}{header_text}"

I believe

Also if this is only used if self.keep_headers is True we could move it's construction to inside that if statement.

sjrl · 2025-10-21T10:08:17Z

haystack/components/preprocessors/markdown_header_splitter.py

+                if pending_headers:
+                    chunk_content += "\n".join(pending_headers) + "\n"
+                chunk_content += f"{header_line}\n{content}"


Also here we shouldn't just add in our own newline characters. If we avoid using strip earlier we should be able to do

Suggested change

if pending_headers:

chunk_content += "\n".join(pending_headers) + "\n"

chunk_content += f"{header_line}\n{content}"

if pending_headers:

chunk_content += "".join(pending_headers)

chunk_content += f"{header_line}{content}"

sjrl · 2025-10-21T10:11:21Z

haystack/components/preprocessors/markdown_header_splitter.py

+        processed_documents = []
+        for doc in documents:
+            # handle empty documents
+            if not doc.content or not doc.content.strip():
+                if self.skip_empty_documents:
+                    logger.warning("Document ID {doc_id} has an empty content. Skipping this document.", doc_id=doc.id)
+                    continue
+                # keep empty documents
+                processed_documents.append(doc)
+                logger.warning(
+                    "Document ID {doc_id} has an empty content. Keeping this document as per configuration.",
+                    doc_id=doc.id,
+                )
+                continue
+
+            processed_documents.append(doc)
+
+        if not processed_documents:
+            return {"documents": []}
+
+        header_split_docs = self._split_documents_by_markdown_headers(processed_documents)


I think we could refactor this a bit to make the code overall simpler. E.g.

Suggested change

processed_documents = []

for doc in documents:

# handle empty documents

if not doc.content or not doc.content.strip():

if self.skip_empty_documents:

logger.warning("Document ID {doc_id} has an empty content. Skipping this document.", doc_id=doc.id)

continue

# keep empty documents

processed_documents.append(doc)

logger.warning(

"Document ID {doc_id} has an empty content. Keeping this document as per configuration.",

doc_id=doc.id,

)

continue

processed_documents.append(doc)

if not processed_documents:

return {"documents": []}

header_split_docs = self._split_documents_by_markdown_headers(processed_documents)

processed_documents = []

for doc in documents:

# handle empty documents

if not doc.content or not doc.content.strip():

if self.skip_empty_documents:

logger.warning("Document ID {doc_id} has an empty content. Skipping this document.", doc_id=doc.id)

continue

# keep empty documents

processed_documents.append(doc)

logger.warning(

"Document ID {doc_id} has an empty content. Keeping this document as per configuration.",

doc_id=doc.id,

)

continue

else:

header_split_docs = self._split_documents_by_markdown_headers([doc])

final_docs = self._apply_secondary_splitting(header_split_docs) if self.secondary_split else header_split_docs

processed_documents.extend(final_docs)

this way we know we will only run the splitting logic on non-empty documents. Which hopefully means we don't need to worry about handling documents with empty contents within the splitting logic (i.e. all checks for empty strings can be removed)

sjrl · 2025-10-21T11:02:18Z

test/components/preprocessors/test_markdown_header_splitter.py

+    # Should start at 1 and increment at each \f
+    assert page_numbers[0] == 1
+    assert max(page_numbers) == 3


Let's explicitly check the page number of each split

sjrl · 2025-10-21T11:05:56Z

test/components/preprocessors/test_markdown_header_splitter.py

+    for doc in split_docs[:-1]:
+        assert len(doc.content.split()) == 3
+    # The last chunk should have at least 2 words (threshold)
+    assert len(split_docs[-1].content.split()) >= 2


Let's make these tests explicit. So check the actual content of each one.

sjrl · 2025-10-21T11:06:57Z

test/components/preprocessors/test_markdown_header_splitter.py

+    for i in range(1, len(split_docs)):
+        prev_doc = split_docs[i - 1]
+        curr_doc = split_docs[i]
+        if prev_doc.meta["header"] == curr_doc.meta["header"]:  # only check overlap within same header
+            prev_words = prev_doc.content.split()
+            curr_words = curr_doc.content.split()
+            assert prev_words[-2:] == curr_words[:2]


Sorry to be a broken record but let's make this an explicit check for each split. It's easier then for me to review to see if the output is what we expect.

sjrl · 2025-10-21T11:10:40Z

test/components/preprocessors/test_markdown_header_splitter.py

+    split_ids = [doc.meta["split_id"] for doc in split_docs]
+    assert split_ids == list(range(len(split_ids)))


This should be updated once this comment https://github.com/deepset-ai/haystack/pull/9660/files#r2432221323 is addressed

OGuggenbuehl added 3 commits July 11, 2025 16:17

implement md-header-splitter and add tests

a2a4f86

rework md-header splitter to rewrite md-header levels

4337a5b

remove deprecated test

393cd53

github-actions bot added topic:tests type:documentation Improvements on the docs labels Jul 29, 2025

OGuggenbuehl changed the title ~~Feature/md header splitter~~ feat:MarkdownHeaderSplitter Jul 29, 2025

Merge branch 'main' into feature/md-header-splitter

de6b0d9

sjrl self-assigned this Aug 19, 2025