ZeroxPDFLoader is not compatible with GenericLoader #30455

Open
5 tasks done
pprados opened this issue Mar 24, 2025 · 0 comments
Labels
04 new feature New functionality (use for larger scope enhancements)

Contributor

pprados commented Mar 24, 2025

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

# Fails: only ZeroxPDFLoader exists; no matching parser is provided
from langchain_community.document_loaders.parsers import ZeroxPDFParser

Error Message and Stack Trace (if applicable)

No response

Description

The typical requirements for RAG projects are generally as follows:

  • Import PDF files into a vector database
  • From a directory structure
  • Be able to update the files
  • Without re-importing everything
  • Oh, and don't forget to remove files that are no longer present from the vector database
  • Since the PDF format isn’t great, we also have some files in Word format
  • It’s not just 10 sample documents, but 50,000 with 20 pages each, evolving daily
  • The files are, of course, stored in cloud storage
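
The update requirements above (add new files, skip unchanged ones, remove files that have disappeared) can be sketched without any library. This is only a toy model, assuming content hashes are enough to detect changes; `ToyIndex` and `sync` are hypothetical names for illustration and are not part of LangChain.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    """Hypothetical stand-in for a record manager plus vector store."""

    # Maps each source path to the hash of the content last indexed for it.
    hashes: dict[str, str] = field(default_factory=dict)

    def sync(self, files: dict[str, bytes]) -> dict[str, list[str]]:
        """Bring the index in line with `files` (path -> raw bytes)."""
        report = {"added": [], "updated": [], "skipped": [], "deleted": []}
        for path, data in sorted(files.items()):
            digest = hashlib.sha256(data).hexdigest()
            previous = self.hashes.get(path)
            if previous == digest:
                report["skipped"].append(path)  # unchanged: no re-import
                continue
            report["added" if previous is None else "updated"].append(path)
            self.hashes[path] = digest  # a real system would re-embed here
        # Anything still recorded but absent from `files` has been removed.
        for path in sorted(set(self.hashes) - set(files)):
            del self.hashes[path]
            report["deleted"].append(path)
        return report
```

Calling `sync` twice with the same files reports everything as skipped on the second call, which is the behaviour needed when 50,000 documents change only a little each day.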

In my opinion, the best approach to handle this using LangChain is with code similar to this:

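from langchain_community.document_loaders.blob_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers.generic import MimeTypeBasedParser
from langchain_community.document_loaders.parsers.msword import MsWordParser
from langchain_community.document_loaders.parsers.txt import TextParser
from langchain_core.indexing import index

vector_store = ...    # any VectorStore
record_manager = ...  # any RecordManager; tracks what is already indexed

loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(  # or CloudBlobLoader for cloud storage
        path="mydata/",
        glob="**/*",
        show_progress=True,
    ),
    # Route each blob to a parser according to its MIME type.
    blob_parser=MimeTypeBasedParser(
        handlers={
            "application/pdf": ZeroxPDFParser(),  # IMPOSSIBLE: no such parser exists
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
                MsWordParser(),
        },
        fallback_parser=TextParser(),
    ),
)
index(
    loader.lazy_load(),
    record_manager,
    vector_store,
    batch_size=100,
)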

For this to work, every Loader must also be available as a "Parser" that operates on blobs, so that it can be plugged into MimeTypeBasedParser.
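
To illustrate why a separate parser matters, here is a dependency-free sketch of the loader/parser split. It is an assumption-laden toy rather than LangChain's API: `ToyBlob`, `toy_pdf_parser`, `toy_text_parser`, `parse_by_mimetype` and `toy_pdf_loader` are all hypothetical names, and the "PDF" logic is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class ToyBlob:
    """Hypothetical stand-in for a blob: raw bytes plus a MIME type."""

    data: bytes
    mimetype: str
    source: str


# A parser turns one blob into zero or more text chunks.
Parser = Callable[[ToyBlob], Iterator[str]]


def toy_pdf_parser(blob: ToyBlob) -> Iterator[str]:
    """Placeholder for real PDF logic; it only tags the decoded bytes."""
    yield f"[pdf] {blob.data.decode('utf-8', errors='replace')}"


def toy_text_parser(blob: ToyBlob) -> Iterator[str]:
    """Fallback parser: decode the bytes as plain text."""
    yield blob.data.decode("utf-8", errors="replace")


def parse_by_mimetype(
    blobs: Iterable[ToyBlob],
    handlers: dict[str, Parser],
    fallback: Parser,
) -> Iterator[str]:
    """Dispatch each blob to the parser registered for its MIME type."""
    for blob in blobs:
        parser = handlers.get(blob.mimetype, fallback)
        yield from parser(blob)


def toy_pdf_loader(path: str, data: bytes) -> list[str]:
    """A loader as a thin wrapper: build one blob, delegate to the parser."""
    return list(toy_pdf_parser(ToyBlob(data, "application/pdf", path)))
```

Because the parsing logic lives in a function that takes a blob, it can be used both directly (through the loader wrapper) and inside a MIME-type dispatcher. A loader that reads and parses in one step offers only the first option.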

ZeroxPDFLoader has several limitations:

  • It does not provide a parser.
  • It does not support image conversions.

A PR solves this.

System Info

System Information

OS: Linux
OS Version: #19~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Feb 17 11:51:52 UTC 2
Python Version: 3.12.9 | packaged by Anaconda, Inc. | (main, Feb 6 2025, 18:56:27) [GCC 11.2.0]

Package Information

langchain_core: 0.3.45
langchain: 0.3.20
langchain_community: 0.3.19
langsmith: 0.3.8
langchain_openai: 0.3.8
langchain_tests: 0.3.11
langchain_text_splitters: 0.3.6

Optional packages not installed

langserve

Other Dependencies

aiohttp<4.0.0,>=3.8.3: Installed. No version info available.
async-timeout<5.0.0,>=4.0.0;: Installed. No version info available.
dataclasses-json<0.7,>=0.5.7: Installed. No version info available.
httpx: 0.28.1
httpx-sse<1.0.0,>=0.4.0: Installed. No version info available.
httpx<1,>=0.25.0: Installed. No version info available.
jsonpatch<2.0,>=1.33: Installed. No version info available.
langchain-anthropic;: Installed. No version info available.
langchain-aws;: Installed. No version info available.
langchain-cohere;: Installed. No version info available.
langchain-community;: Installed. No version info available.
langchain-core<1.0.0,>=0.3.34: Installed. No version info available.
langchain-core<1.0.0,>=0.3.41: Installed. No version info available.
langchain-core<1.0.0,>=0.3.42: Installed. No version info available.
langchain-deepseek;: Installed. No version info available.
langchain-fireworks;: Installed. No version info available.
langchain-google-genai;: Installed. No version info available.
langchain-google-vertexai;: Installed. No version info available.
langchain-groq;: Installed. No version info available.
langchain-huggingface;: Installed. No version info available.
langchain-mistralai;: Installed. No version info available.
langchain-ollama;: Installed. No version info available.
langchain-openai;: Installed. No version info available.
langchain-text-splitters<1.0.0,>=0.3.6: Installed. No version info available.
langchain-together;: Installed. No version info available.
langchain-xai;: Installed. No version info available.
langchain<1.0.0,>=0.3.20: Installed. No version info available.
langsmith-pyo3: Installed. No version info available.
langsmith<0.4,>=0.1.125: Installed. No version info available.
langsmith<0.4,>=0.1.17: Installed. No version info available.
numpy<2.0.0,>=1.24.0;: Installed. No version info available.
numpy<3,>=1.26.2: Installed. No version info available.
numpy<3,>=1.26.2;: Installed. No version info available.
openai<2.0.0,>=1.58.1: Installed. No version info available.
orjson: 3.10.15
packaging<25,>=23.2: Installed. No version info available.
pydantic: 2.10.6
pydantic-settings<3.0.0,>=2.4.0: Installed. No version info available.
pydantic<3.0.0,>=2.5.2;: Installed. No version info available.
pydantic<3.0.0,>=2.7.4: Installed. No version info available.
pydantic<3.0.0,>=2.7.4;: Installed. No version info available.
pytest: 7.4.4
pytest-asyncio<1,>=0.20: Installed. No version info available.
pytest-socket<1,>=0.6.0: Installed. No version info available.
pytest<9,>=7: Installed. No version info available.
PyYAML>=5.3: Installed. No version info available.
requests: 2.32.3
requests-toolbelt: 1.0.0
requests<3,>=2: Installed. No version info available.
rich: 12.6.0
SQLAlchemy<3,>=1.4: Installed. No version info available.
syrupy<5,>=4: Installed. No version info available.
tenacity!=8.4.0,<10,>=8.1.0: Installed. No version info available.
tenacity!=8.4.0,<10.0.0,>=8.1.0: Installed. No version info available.
tiktoken<1,>=0.7: Installed. No version info available.
typing-extensions>=4.7: Installed. No version info available.
zstandard: 0.23.0

@dosubot dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Mar 24, 2025
@eyurtsev eyurtsev added 04 new feature New functionality (use for larger scope enhancements) and removed 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Mar 28, 2025