-
Notifications
You must be signed in to change notification settings - Fork 2.4k
feat: MarkdownHeaderSplitter #9660
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: MarkdownHeaderSplitter #9660
Conversation
|
@OGuggenbuehl definitely looks like an interesting approach! I've left an initial set of comments, but to further review I'd appreciate if you could add a set of tests like the ones we have for the This will help me be able to review the actual algorithm for splitting since it's easier to understand with examples. |
use haystack logging Co-authored-by: Sebastian Husch Lee <[email protected]>
remove temp toc Co-authored-by: Sebastian Husch Lee <[email protected]>
…enbuehl/haystack into feature/md-header-splitter
…dated page number
Co-authored-by: Sebastian Husch Lee <[email protected]>
…enbuehl/haystack into feature/md-header-splitter
Co-authored-by: Sebastian Husch Lee <[email protected]>
Co-authored-by: Sebastian Husch Lee <[email protected]>
|
@OGuggenbuehl is attempting to deploy a commit to the deepset Team on Vercel. A member of the Team first needs to authorize it. |
…enbuehl/haystack into feature/md-header-splitter
| :param keep_headers: If True, headers are kept in the content. If False, headers are moved to metadata. | ||
| Defaults to True. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that if keep_headers is True we don't store them in the metadata?
| :param skip_empty_documents: If True, skip documents with empty content. If False, process empty documents. | ||
| Defaults to True. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's reuse the docstring from DocumentSplitter
| :param skip_empty_documents: If True, skip documents with empty content. If False, process empty documents. | |
| Defaults to True. | |
| :param skip_empty_documents: Choose whether to skip documents with empty content. Default is True. | |
| Set to False when downstream components in the Pipeline (like LLMDocumentContentExtractor) can extract text | |
| from non-textual documents. |
| self.secondary_splitter = DocumentSplitter( | ||
| split_by=self.secondary_split, | ||
| split_length=self.split_length, | ||
| split_overlap=self.split_overlap, | ||
| split_threshold=self.split_threshold, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized we are missing a warm_up method for this component MarkdownHeaderSplitter which calls the warm_up of DocumentSplitter if secondary_split is provided.
This is needed in the case where the secondary_split of sentence is chosen.
So I'd add a warm up method like
def warm_up(self):
"""
Warm up the MarkdownHeaderSplitter
"""
if self.secondary_splitter and not self._is_warmed_up:
self.secondary_splitter.warm_up()
self._is_warmed_up = Truethen add self._is_warmed_up = False in the init method of MarkdownHeaderSplitter
| for i, match in enumerate(matches): | ||
| # extract header info | ||
| header_prefix = match.group(1) | ||
| header_text = match.group(2).strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we avoid using strip() here? This would remove any white space characters from the text we probably want to keep.
| # get content | ||
| start = match.end() | ||
| end = matches[i + 1].start() if i + 1 < len(matches) else len(text) | ||
| content = text[start:end].strip() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here let's remove the .strip()
| header_stack[j] = None | ||
|
|
||
| # prepare header_line if keep_headers | ||
| header_line = f"{header_prefix} {header_text}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't use .strip() anywhere then we should be able to do
| header_line = f"{header_prefix} {header_text}" | |
| header_line = f"{header_prefix}{header_text}" |
I believe
Also if this is only used if self.keep_headers is True we could move it's construction to inside that if statement.
| if pending_headers: | ||
| chunk_content += "\n".join(pending_headers) + "\n" | ||
| chunk_content += f"{header_line}\n{content}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also here we shouldn't just add in our own newline characters. If we avoid using strip earlier we should be able to do
| if pending_headers: | |
| chunk_content += "\n".join(pending_headers) + "\n" | |
| chunk_content += f"{header_line}\n{content}" | |
| if pending_headers: | |
| chunk_content += "".join(pending_headers) | |
| chunk_content += f"{header_line}{content}" |
| processed_documents = [] | ||
| for doc in documents: | ||
| # handle empty documents | ||
| if not doc.content or not doc.content.strip(): | ||
| if self.skip_empty_documents: | ||
| logger.warning("Document ID {doc_id} has an empty content. Skipping this document.", doc_id=doc.id) | ||
| continue | ||
| # keep empty documents | ||
| processed_documents.append(doc) | ||
| logger.warning( | ||
| "Document ID {doc_id} has an empty content. Keeping this document as per configuration.", | ||
| doc_id=doc.id, | ||
| ) | ||
| continue | ||
|
|
||
| processed_documents.append(doc) | ||
|
|
||
| if not processed_documents: | ||
| return {"documents": []} | ||
|
|
||
| header_split_docs = self._split_documents_by_markdown_headers(processed_documents) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could refactor this a bit to make the code overall simpler. E.g.
| processed_documents = [] | |
| for doc in documents: | |
| # handle empty documents | |
| if not doc.content or not doc.content.strip(): | |
| if self.skip_empty_documents: | |
| logger.warning("Document ID {doc_id} has an empty content. Skipping this document.", doc_id=doc.id) | |
| continue | |
| # keep empty documents | |
| processed_documents.append(doc) | |
| logger.warning( | |
| "Document ID {doc_id} has an empty content. Keeping this document as per configuration.", | |
| doc_id=doc.id, | |
| ) | |
| continue | |
| processed_documents.append(doc) | |
| if not processed_documents: | |
| return {"documents": []} | |
| header_split_docs = self._split_documents_by_markdown_headers(processed_documents) | |
| processed_documents = [] | |
| for doc in documents: | |
| # handle empty documents | |
| if not doc.content or not doc.content.strip(): | |
| if self.skip_empty_documents: | |
| logger.warning("Document ID {doc_id} has an empty content. Skipping this document.", doc_id=doc.id) | |
| continue | |
| # keep empty documents | |
| processed_documents.append(doc) | |
| logger.warning( | |
| "Document ID {doc_id} has an empty content. Keeping this document as per configuration.", | |
| doc_id=doc.id, | |
| ) | |
| continue | |
| else: | |
| header_split_docs = self._split_documents_by_markdown_headers([doc]) | |
| final_docs = self._apply_secondary_splitting(header_split_docs) if self.secondary_split else header_split_docs | |
| processed_documents.extend(final_docs) |
this way we know we will only run the splitting logic on non-empty documents. Which hopefully means we don't need to worry about handling documents with empty contents within the splitting logic (i.e. all checks for empty strings can be removed)
| # Should start at 1 and increment at each \f | ||
| assert page_numbers[0] == 1 | ||
| assert max(page_numbers) == 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's explicitly check the page number of each split
| for doc in split_docs[:-1]: | ||
| assert len(doc.content.split()) == 3 | ||
| # The last chunk should have at least 2 words (threshold) | ||
| assert len(split_docs[-1].content.split()) >= 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make these tests explicit. So check the actual content of each one.
| for i in range(1, len(split_docs)): | ||
| prev_doc = split_docs[i - 1] | ||
| curr_doc = split_docs[i] | ||
| if prev_doc.meta["header"] == curr_doc.meta["header"]: # only check overlap within same header | ||
| prev_words = prev_doc.content.split() | ||
| curr_words = curr_doc.content.split() | ||
| assert prev_words[-2:] == curr_words[:2] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to be a broken record but let's make this an explicit check for each split. It's easier then for me to review to see if the output is what we expect.
| split_ids = [doc.meta["split_id"] for doc in split_docs] | ||
| assert split_ids == list(range(len(split_ids))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be updated once this comment https://github.com/deepset-ai/haystack/pull/9660/files#r2432221323 is addressed
Proposed Changes:
Implement MarkdownHeaderSplitter to split Documents written in .md based on their headers
How did you test it?
unit tests
Checklist
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:and added!in case the PR includes breaking changes.