community: Add custom sitemap URL parameter to GitbookLoader #30549

Merged
merged 6 commits into from
Apr 1, 2025

Conversation

andrasfe
Contributor

Description

This PR adds a new sitemap_url parameter to the GitbookLoader class that allows users to specify a custom sitemap URL when loading content from a GitBook site. This is particularly useful for GitBook sites that use non-standard sitemap file names like sitemap-pages.xml instead of the default sitemap.xml.
The standard GitbookLoader assumes that the sitemap is located at /sitemap.xml, but some GitBook instances (including GitBook's own documentation) use different paths for their sitemaps. This parameter makes the loader more flexible and helps users extract content from a wider range of GitBook sites.
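The fallback behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the loader's actual implementation: `resolve_sitemap_url` and `extract_paths` are hypothetical helper names, and only the `sitemap_url` parameter name and the `/sitemap.xml` default come from this PR's description.

```python
from typing import List, Optional
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET


def resolve_sitemap_url(base_url: str, sitemap_url: Optional[str] = None) -> str:
    """Return the custom sitemap URL if given, else the default /sitemap.xml."""
    if sitemap_url:
        return sitemap_url
    # Normalize the trailing slash so urljoin appends instead of replacing.
    return urljoin(base_url.rstrip("/") + "/", "sitemap.xml")


def extract_paths(sitemap_xml: str) -> List[str]:
    """Collect the URL path of every <loc> entry in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    paths = []
    for element in root.iter():
        # Tags carry the sitemap namespace, e.g. "{http://...}loc".
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            paths.append(urlparse(element.text.strip()).path)
    return paths
```

With these helpers, a site that publishes `sitemap-pages.xml` would be handled by passing that URL explicitly, while callers that pass nothing keep the `/sitemap.xml` default.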

Issue

Fixes bug 30473, where the GitbookLoader failed to find pages on GitBook sites that use custom sitemap URLs.

Dependencies

No new dependencies required.
I've added:

  • Unit tests to verify the parameter works correctly
  • Integration tests to confirm the parameter is properly used with real GitBook sites
  • Updated docstrings with parameter documentation

The changes are fully backward compatible, as the parameter is optional with a sensible default.


@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) labels Mar 29, 2025
@eyurtsev eyurtsev self-assigned this Mar 31, 2025
Collaborator

@eyurtsev eyurtsev left a comment

Looks good! Two minor things to resolve and we can merge.

assert paths == ["/page1", "/page2", "/page3"]


@patch("requests.get")
Collaborator

Thanks!

@andrasfe I don't know if you're familiar with the responses library, but it's generally a much cleaner way to mock requests.get / patch / post than the built-in patch methods.

Going to merge as is (not a blocker). But if you have time to rewrite with responses, it will be appreciated. It'll make the code much less brittle!


I've observed that using @patch with strings specifying the patched code sometimes introduces more technical debt than it solves. patch.object is slightly better since it makes namespacing changes easier to catch, but for network requests responses is better in virtually all cases.

Collaborator

@eyurtsev eyurtsev left a comment

thanks!

@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Apr 1, 2025
@eyurtsev eyurtsev enabled auto-merge (squash) April 1, 2025 16:12
@eyurtsev eyurtsev merged commit 64df60e into langchain-ai:master Apr 1, 2025
19 checks passed
Labels
community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) lgtm PR looks good. Use to confirm that a PR is ready for merging. size:L This PR changes 100-499 lines, ignoring generated files.
2 participants