GitBook loader does not load any pages when Sitemap has nested Sitemaps #30473

Open
5 tasks done
mutje opened this issue Mar 25, 2025 · 1 comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@mutje

mutje commented Mar 25, 2025

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

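from langchain_community.document_loaders import GitbookLoader

loader = GitbookLoader(
    web_page="https://docs.gitbook.com/",
    load_all_paths=True,
)
docs = loader.load()
print(len(docs))  # prints 0 instead of the number of pages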

Error Message and Stack Trace (if applicable)

No response

Description

  • I am trying to fetch all pages of a GitBook documentation site using GitbookLoader with load_all_paths=True.
  • The site's sitemap (for example, the one for GitBook's own documentation) contains references to other sitemaps.
  • Instead of the sub-pages being loaded into docs, docs is an empty list (0 is printed).
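The second bullet can be checked independently of the loader. The following is a minimal sketch, assuming the sitemap follows the sitemaps.org protocol, in which a nested sitemap has a `sitemapindex` root element and a plain page list has a `urlset` root. `sitemap_kind` and the sample XML strings are hypothetical names made up for illustration; they are not part of LangChain.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol, in ElementTree's "{uri}" form.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Illustrative sample: a sitemap index that points at another sitemap.
INDEX_XML = (
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>"
    "</sitemapindex>"
)

# Illustrative sample: a plain sitemap that lists pages directly.
URLSET_XML = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/getting-started</loc></url>"
    "</urlset>"
)


def sitemap_kind(xml_text: str) -> str:
    """Return 'index', 'urlset', or 'unknown' based on the XML root element."""
    root = ET.fromstring(xml_text)
    tag = root.tag.replace(SITEMAP_NS, "")  # strip the namespace prefix
    if tag == "sitemapindex":
        return "index"
    if tag == "urlset":
        return "urlset"
    return "unknown"


print(sitemap_kind(INDEX_XML))
print(sitemap_kind(URLSET_XML))
```

If the text fetched from the site's sitemap.xml is classified as an index, every `loc` entry in it names another sitemap rather than a documentation page.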

The problem can be fixed by changing the web_page assignment in __init__ of gitbook.py to:

if load_all_paths:
    # set web_path to the sitemap if we want to crawl all paths
    web_page = f"{self.base_url}/sitemap-pages.xml"

So perhaps a constructor parameter for providing a custom sitemap URL would be sufficient.
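An alternative to hard-coding sitemap-pages.xml is to follow nested sitemaps until page URLs are reached. The sketch below is one possible approach, not the loader's actual implementation: `collect_page_urls`, the injectable `fetch` callable, and the `FAKE_SITE` data are all hypothetical, and the XML is assumed to follow the sitemaps.org protocol. Passing `fetch` in keeps the example runnable without network access; in real use it could be any function that returns the XML text for a URL.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_page_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    _seen: Optional[Set[str]] = None,
) -> List[str]:
    """Return page URLs from a sitemap, descending into nested sitemaps."""
    seen = _seen if _seen is not None else set()
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    if root.tag.endswith("sitemapindex"):
        # Each <sitemap><loc> entry names another sitemap, not a page.
        urls: List[str] = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text:
                urls.extend(collect_page_urls(loc.text.strip(), fetch, seen))
        return urls
    # A plain <urlset>: each <url><loc> entry is a page.
    return [
        loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text
    ]


# Hypothetical in-memory site used in place of real HTTP requests.
FAKE_SITE = {
    "https://example.com/sitemap.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/sitemap-pages.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/intro</loc></url>"
        "<url><loc>https://example.com/setup</loc></url>"
        "</urlset>"
    ),
}

pages = collect_page_urls("https://example.com/sitemap.xml", FAKE_SITE.__getitem__)
print(pages)
```

Compared with a fixed sitemap-pages.xml path or a custom sitemap URL parameter, this handles sites whose nested sitemaps use other file names, at the cost of extra requests.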

System Info

System Information

OS: Windows
OS Version: 10.0.19045
Python Version: 3.11.9 (tags/v3.11.9:de54cf5, Apr 2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]

Package Information

langchain_core: 0.3.48
langchain: 0.3.21
langchain_community: 0.3.20
langsmith: 0.1.137
langchain_openai: 0.3.10
langchain_text_splitters: 0.3.7

@dosubot dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Mar 25, 2025
@andrasfe
Contributor

PR submitted with fix in line with OP's suggestion + tests
