-
Notifications
You must be signed in to change notification settings - Fork 17.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
community: Add custom sitemap URL parameter to GitbookLoader #30549
community: Add custom sitemap URL parameter to GitbookLoader #30549
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Skipped Deployment
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good two minor things to resolve and we can merge!
libs/community/tests/unit_tests/document_loaders/test_gitbook.py
Outdated
Show resolved
Hide resolved
assert paths == ["/page1", "/page2", "/page3"] | ||
|
||
|
||
@patch("requests.get") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@andrasfe I don't know if you're familiar with the responses library, but it's generally a much cleaner way to patch requests.get / patch / post if you're comparing it to using the built in patch methods.
Going to merge as is (not a blocker). But if you have time to re-write w/ responses will be appreciated. It'll make the code much less brittle!
I've observed that the usage of @patch
w/ strings specifying the patched code introduces more technical debt sometimes than it solves. patch object is slightly better since it's easier to catch namespacing changes, but for network requests responses
is better in virtually all cases
libs/community/tests/integration_tests/document_loaders/test_gitbook.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
Description
This PR adds a new
sitemap_url
parameter to theGitbookLoader
class that allows users to specify a custom sitemap URL when loading content from a GitBook site. This is particularly useful for GitBook sites that use non-standard sitemap file names likesitemap-pages.xml
instead of the defaultsitemap.xml
.The standard
GitbookLoader
assumes that the sitemap is located at/sitemap.xml
, but some GitBook instances (including GitBook's own documentation) use different paths for their sitemaps. This parameter makes the loader more flexible and helps users extract content from a wider range of GitBook sites.Issue
Fixes bug 30473 where the
GitbookLoader
would fail to find pages on GitBook sites that use custom sitemap URLs.Dependencies
No new dependencies required.
I've added:
The changes are fully backward compatible, as the parameter is optional with a sensible default.