Conversation


@OGuggenbuehl OGuggenbuehl commented Jul 29, 2025

Proposed Changes:

Implement MarkdownHeaderSplitter to split Documents written in .md based on their headers

How did you test it?

unit tests

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue


CLAassistant commented Jul 29, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Jul 29, 2025
@OGuggenbuehl OGuggenbuehl changed the title Feature/md header splitter feat:MarkdownHeaderSplitter Jul 29, 2025
@sjrl sjrl self-assigned this Aug 19, 2025

sjrl commented Aug 19, 2025

@OGuggenbuehl this definitely looks like an interesting approach! I've left an initial set of comments, but to review further I'd appreciate it if you could add a set of tests like the ones we have for the DocumentSplitter: https://github.com/deepset-ai/haystack/blob/main/test/components/preprocessors/test_document_splitter.py

This will help me review the actual splitting algorithm, since it's easier to understand with examples.

@sjrl sjrl changed the title feat:MarkdownHeaderSplitter feat: MarkdownHeaderSplitter Aug 27, 2025

vercel bot commented Oct 21, 2025

@OGuggenbuehl is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

Comment on lines +41 to +42
    :param keep_headers: If True, headers are kept in the content. If False, headers are moved to metadata.
        Defaults to True.

Does this mean that if keep_headers is True we don't store them in the metadata?

Comment on lines +49 to +50
    :param skip_empty_documents: If True, skip documents with empty content. If False, process empty documents.
        Defaults to True.

Let's reuse the docstring from DocumentSplitter

Suggested change:

    - :param skip_empty_documents: If True, skip documents with empty content. If False, process empty documents.
    -     Defaults to True.
    + :param skip_empty_documents: Choose whether to skip documents with empty content. Default is True.
    +     Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text
    +     from non-textual documents.

Comment on lines +63 to +68
    self.secondary_splitter = DocumentSplitter(
        split_by=self.secondary_split,
        split_length=self.split_length,
        split_overlap=self.split_overlap,
        split_threshold=self.split_threshold,
    )

I just realized we are missing a warm_up method for this component. MarkdownHeaderSplitter should call the warm_up of DocumentSplitter if secondary_split is provided.

This is needed in the case where secondary_split="sentence" is chosen.

So I'd add a warm_up method like

    def warm_up(self):
        """
        Warm up the MarkdownHeaderSplitter
        """
        if self.secondary_splitter and not self._is_warmed_up:
            self.secondary_splitter.warm_up()
            self._is_warmed_up = True

then add self._is_warmed_up = False in the init method of MarkdownHeaderSplitter
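For illustration, a minimal self-contained sketch of that pattern (the DocumentSplitter here is a stub standing in for Haystack's real class; names follow the comment above, but this is a sketch, not the final implementation):

```python
class StubDocumentSplitter:
    """Stand-in for haystack's DocumentSplitter, whose warm_up() loads models."""

    def __init__(self):
        self.warmed_up = False

    def warm_up(self):
        self.warmed_up = True


class MarkdownHeaderSplitter:
    def __init__(self, secondary_split=None):
        # only build a secondary splitter when a secondary split is requested
        self.secondary_splitter = StubDocumentSplitter() if secondary_split else None
        self._is_warmed_up = False

    def warm_up(self):
        """Warm up the MarkdownHeaderSplitter."""
        if self.secondary_splitter and not self._is_warmed_up:
            self.secondary_splitter.warm_up()
            self._is_warmed_up = True
```

With the `_is_warmed_up` flag, calling warm_up() twice is a no-op, and a splitter without a secondary split skips the call entirely.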

    for i, match in enumerate(matches):
        # extract header info
        header_prefix = match.group(1)
        header_text = match.group(2).strip()

Could we avoid using strip() here? It would remove whitespace characters from the text that we probably want to keep.
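A toy example of what gets lost (the strings here are made up for illustration):

```python
raw = "Introduction \n\n"
stripped = raw.strip()

# strip() removes the trailing space and blank line, so slices that
# were stripped can no longer be re-joined into the original markdown
assert stripped == "Introduction"
assert stripped + "\n" != raw
```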

        # get content
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        content = text[start:end].strip()

Same here, let's remove the .strip().

    header_stack[j] = None

    # prepare header_line if keep_headers
    header_line = f"{header_prefix} {header_text}"

If we don't use .strip() anywhere, then we should be able to do

Suggested change:

    - header_line = f"{header_prefix} {header_text}"
    + header_line = f"{header_prefix}{header_text}"

I believe.

Also, if this is only used when self.keep_headers is True, we could move its construction inside that if statement.

Comment on lines +129 to +131
    if pending_headers:
        chunk_content += "\n".join(pending_headers) + "\n"
    chunk_content += f"{header_line}\n{content}"

Also here we shouldn't just add in our own newline characters. If we avoid using strip earlier we should be able to do

Suggested change:

    - if pending_headers:
    -     chunk_content += "\n".join(pending_headers) + "\n"
    - chunk_content += f"{header_line}\n{content}"
    + if pending_headers:
    +     chunk_content += "".join(pending_headers)
    + chunk_content += f"{header_line}{content}"
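A tiny round-trip sketch of the idea (toy text and a simplified header regex, not the component's actual code): if each slice keeps its original whitespace, plain concatenation reproduces the source with no manual newline joins.

```python
import re

text = "# Title\nbody line\n\n## Sub\nmore\n"

# find markdown headers; each slice runs from one header to the next
matches = list(re.finditer(r"^#+ .*\n", text, flags=re.MULTILINE))
pieces = []
for i, match in enumerate(matches):
    start = match.start()
    end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
    pieces.append(text[start:end])

# no "\n" was added or removed, so the document survives byte-for-byte
assert "".join(pieces) == text
```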

Comment on lines +295 to +315
    processed_documents = []
    for doc in documents:
        # handle empty documents
        if not doc.content or not doc.content.strip():
            if self.skip_empty_documents:
                logger.warning("Document ID {doc_id} has an empty content. Skipping this document.", doc_id=doc.id)
                continue
            # keep empty documents
            processed_documents.append(doc)
            logger.warning(
                "Document ID {doc_id} has an empty content. Keeping this document as per configuration.",
                doc_id=doc.id,
            )
            continue

        processed_documents.append(doc)

    if not processed_documents:
        return {"documents": []}

    header_split_docs = self._split_documents_by_markdown_headers(processed_documents)
sjrl commented Oct 21, 2025

I think we could refactor this a bit to make the code overall simpler. E.g.

Suggested change:

    - processed_documents = []
    - for doc in documents:
    -     # handle empty documents
    -     if not doc.content or not doc.content.strip():
    -         if self.skip_empty_documents:
    -             logger.warning("Document ID {doc_id} has an empty content. Skipping this document.", doc_id=doc.id)
    -             continue
    -         # keep empty documents
    -         processed_documents.append(doc)
    -         logger.warning(
    -             "Document ID {doc_id} has an empty content. Keeping this document as per configuration.",
    -             doc_id=doc.id,
    -         )
    -         continue
    -     processed_documents.append(doc)
    - if not processed_documents:
    -     return {"documents": []}
    - header_split_docs = self._split_documents_by_markdown_headers(processed_documents)
    + processed_documents = []
    + for doc in documents:
    +     # handle empty documents
    +     if not doc.content or not doc.content.strip():
    +         if self.skip_empty_documents:
    +             logger.warning("Document ID {doc_id} has an empty content. Skipping this document.", doc_id=doc.id)
    +             continue
    +         # keep empty documents
    +         processed_documents.append(doc)
    +         logger.warning(
    +             "Document ID {doc_id} has an empty content. Keeping this document as per configuration.",
    +             doc_id=doc.id,
    +         )
    +         continue
    +     else:
    +         header_split_docs = self._split_documents_by_markdown_headers([doc])
    +         final_docs = self._apply_secondary_splitting(header_split_docs) if self.secondary_split else header_split_docs
    +         processed_documents.extend(final_docs)

This way we know we will only run the splitting logic on non-empty documents, which hopefully means we don't need to worry about handling documents with empty contents within the splitting logic (i.e. all checks for empty strings can be removed).

Comment on lines +281 to +283
    # Should start at 1 and increment at each \f
    assert page_numbers[0] == 1
    assert max(page_numbers) == 3

Let's explicitly check the page number of each split

Comment on lines +268 to +271
    for doc in split_docs[:-1]:
        assert len(doc.content.split()) == 3
    # The last chunk should have at least 2 words (threshold)
    assert len(split_docs[-1].content.split()) >= 2

Let's make these tests explicit by checking the actual content of each one.

Comment on lines +253 to +259
    for i in range(1, len(split_docs)):
        prev_doc = split_docs[i - 1]
        curr_doc = split_docs[i]
        if prev_doc.meta["header"] == curr_doc.meta["header"]:  # only check overlap within same header
            prev_words = prev_doc.content.split()
            curr_words = curr_doc.content.split()
            assert prev_words[-2:] == curr_words[:2]

Sorry to be a broken record, but let's make this an explicit check for each split. It's then easier for me to review whether the output is what we expect.
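For illustration, an "explicit" version might look like this (the contents, header names, and split parameters are hypothetical, not taken from the PR):

```python
# hypothetical output of the splitter with split_length=3, split_overlap=2,
# modeled as plain dicts here instead of Haystack Document objects
split_docs = [
    {"meta": {"header": "Intro"}, "content": "one two three"},
    {"meta": {"header": "Intro"}, "content": "two three four"},
]

# spell out each expected split instead of looping over generic checks,
# so the expected overlap ("two three") is visible at a glance
assert split_docs[0]["meta"]["header"] == "Intro"
assert split_docs[0]["content"] == "one two three"
assert split_docs[1]["meta"]["header"] == "Intro"
assert split_docs[1]["content"] == "two three four"
```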

Comment on lines +232 to +233
    split_ids = [doc.meta["split_id"] for doc in split_docs]
    assert split_ids == list(range(len(split_ids)))

This should be updated once this comment https://github.com/deepset-ai/haystack/pull/9660/files#r2432221323 is addressed
