feat: implement local vision parsing and fix docker deployment#888
emontele wants to merge 24 commits into
Conversation
Added documentation for Vision Parsing feature, including its overview, functionality, configuration steps, and fallback mechanism.
11 issues found across 16 files
Confidence score: 2/5
- `pyproject.toml` adds PyMuPDF as a core dependency even though it is AGPLv3, which conflicts with an MIT-licensed distribution and creates immediate legal/compliance risk if merged as-is. Either replace it with a license-compatible library or move it to an optional, clearly isolated component before merging.
- `docker-compose.yml` now defaults to a local build instead of pulling the published image, which can break the documented deployment path and surprise operators during rollout. Restore pull-first behavior (or update docs/scripts in lockstep) before merge.
- `open_notebook/ai/vision_parser.py` currently parses every PDF page with no cap and performs blocking file/encoding work in an async path, so large user-provided PDFs can drive unbounded LLM cost and stall request handling under load. Add a strict max-page limit and move blocking I/O off the event loop before merging.
- `open_notebook/ai/vision_parser.py` and `api/routers/models.py` have correctness gaps (the document handle is not safely closed on exceptions, `response.content` may violate the annotated `str` type, and PUT defaults omit `default_vision_model`), which can lead to resource leaks and inconsistent model-settings behavior. Add explicit close/typing guards and make the PUT/GET model fields symmetric before merge.
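The page-cap and event-loop findings above can be illustrated with a small sketch. It is not the PR's code: `render_page` is a hypothetical blocking callable standing in for the PyMuPDF rasterization step, and `MAX_PAGES = 50`, `plan_pages`, and `render_capped` are assumed names chosen for illustration.

```python
import asyncio
from typing import Callable, List, Tuple

MAX_PAGES = 50  # assumed cap for illustration; the PR's value may differ


def plan_pages(total_pages: int, max_pages: int = MAX_PAGES) -> Tuple[int, bool]:
    """Return (pages_to_process, truncated) for a document of total_pages."""
    page_count = min(total_pages, max_pages)
    return page_count, total_pages > max_pages


async def render_capped(
    render_page: Callable[[int], bytes],  # hypothetical blocking rasterizer
    total_pages: int,
    max_pages: int = MAX_PAGES,
) -> Tuple[List[bytes], bool]:
    """Render at most max_pages pages, each blocking call in a worker thread."""
    page_count, truncated = plan_pages(total_pages, max_pages)
    images: List[bytes] = []
    for index in range(page_count):
        # asyncio.to_thread keeps the event loop free while the page renders.
        images.append(await asyncio.to_thread(render_page, index))
    return images, truncated
```

Returning the `truncated` flag alongside the images leaves it to the caller to decide how to report a capped document, rather than hiding that decision inside the renderer.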
4 issues found across 9 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="open_notebook/ai/vision_parser.py">
<violation number="1" location="open_notebook/ai/vision_parser.py:115">
P1: Hard page cap introduces silent partial-document parsing for PDFs over 50 pages.</violation>
</file>
<file context>
@@ -91,20 +96,25 @@ async def process_pdf_with_vision(file_path: str) -> Optional[str]:
+ with fitz.open(file_path) as doc:
+ total_pages = len(doc)
+
+ page_count = min(total_pages, MAX_PAGES)
+ if total_pages > MAX_PAGES:
+ logger.warning(f"PDF {file_path} has {total_pages} pages. Truncating to {MAX_PAGES} for vision parsing.")
</file context>
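One way to address the silent-truncation concern is to surface the cap to the caller instead of only logging it. The sketch below is an assumption about how that could look, not the PR's implementation: `VisionParseResult` and `finalize_result` are hypothetical names, and the notice wording is illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VisionParseResult:  # hypothetical result type, not part of the PR
    text: str
    parsed_pages: int
    total_pages: int

    @property
    def truncated(self) -> bool:
        """True when fewer pages were parsed than the document contains."""
        return self.parsed_pages < self.total_pages


def finalize_result(page_texts: List[str], total_pages: int) -> VisionParseResult:
    """Join per-page text and append a visible notice when pages were skipped."""
    parsed = len(page_texts)
    text = "\n\n".join(page_texts)
    if parsed < total_pages:
        text += (
            f"\n\n[Truncated: parsed {parsed} of {total_pages} pages "
            "because of the vision-parsing page cap.]"
        )
    return VisionParseResult(text=text, parsed_pages=parsed, total_pages=total_pages)
```

With this shape, downstream code can check `result.truncated` and show a warning in the UI, while the appended notice keeps the stored text honest about being partial.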
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
7 issues found across 14 files (changes from recent commits).
<file name="frontend/src/lib/locales/tr-TR/index.ts">
<violation number="1" location="frontend/src/lib/locales/tr-TR/index.ts:860">
P2: Newly added vision model labels are in English instead of Turkish in the tr-TR locale file, breaking localization consistency.</violation>
</file>
<file name="frontend/src/lib/locales/pt-BR/index.ts">
<violation number="1" location="frontend/src/lib/locales/pt-BR/index.ts:860">
P2: New vision model translation keys are in English instead of Portuguese.</violation>
</file>
<file name="frontend/src/lib/locales/ca-ES/index.ts">
<violation number="1" location="frontend/src/lib/locales/ca-ES/index.ts:860">
P2: Untranslated English strings for new vision model keys in Catalan locale file.</violation>
</file>
<file name="frontend/src/lib/locales/zh-TW/index.ts">
<violation number="1" location="frontend/src/lib/locales/zh-TW/index.ts:860">
P2: Untranslated English strings added to Traditional Chinese locale: `visionModelLabel` and `visionModelDesc` should be localized to match the pattern of all other model-related keys.</violation>
</file>
<file name="frontend/src/lib/locales/pl-PL/index.ts">
<violation number="1" location="frontend/src/lib/locales/pl-PL/index.ts:860">
P2: New vision model localization strings were added in English instead of Polish in the Polish locale file, breaking translation consistency.</violation>
</file>
<file name="frontend/src/lib/locales/ru-RU/index.ts">
<violation number="1" location="frontend/src/lib/locales/ru-RU/index.ts:860">
P2: New vision model UI strings added in English instead of Russian, breaking locale consistency in the ru-RU locale file.</violation>
</file>
<file name="frontend/src/lib/locales/ja-JP/index.ts">
<violation number="1" location="frontend/src/lib/locales/ja-JP/index.ts:860">
P2: New vision model localization keys are in English instead of Japanese, breaking i18n for the Japanese locale.</violation>
</file>
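Untranslated keys like the ones flagged above could be caught before review with a small check. The sketch below assumes the locale files have already been loaded into plain dictionaries; `find_untranslated` is a hypothetical helper, and the English `ttsModelLabel` value is made-up sample data.

```python
from typing import Dict, Iterable, List


def find_untranslated(
    english: Dict[str, str],
    locale: Dict[str, str],
    keys: Iterable[str],
) -> List[str]:
    """Return the keys whose locale value is missing or identical to English."""
    flagged = []
    for key in keys:
        value = locale.get(key)
        if value is None or value == english.get(key):
            flagged.append(key)
    return flagged


# Sample data: the English ttsModelLabel string is an assumption.
en = {"visionModelLabel": "Vision Model", "ttsModelLabel": "Text-to-Speech Model"}
tr = {"visionModelLabel": "Vision Model", "ttsModelLabel": "Metinden Sese Modeli"}
print(find_untranslated(en, tr, en.keys()))  # prints ['visionModelLabel']
```

An identical value is only a hint, since brand names and some technical terms are legitimately the same in every locale, so a check like this would likely need an allow-list.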
<file context>
@@ -857,6 +857,8 @@ export const trTR = {
embeddingModelDesc: "Anlamsal arama ve vektör gömme için kullanılır",
ttsModelLabel: "Metinden Sese Modeli",
ttsModelDesc: "Podcast oluşturma için kullanılır",
+ visionModelLabel: "Vision Model",
+ visionModelDesc: "Used for analyzing images and extracting text from PDFs",
sttModelLabel: "Sesten Metne Modeli",
</file context>
<file context>
@@ -857,6 +857,8 @@ export const caES = {
embeddingModelDesc: "S'usa per a la cerca semàntica i les incrustacions vectorials",
ttsModelLabel: "Model de text a veu",
ttsModelDesc: "S'usa per a la generació de podcasts",
+ visionModelLabel: "Vision Model",
+ visionModelDesc: "Used for analyzing images and extracting text from PDFs",
sttModelLabel: "Model de veu a text",
</file context>
Suggested change:
- visionModelLabel: "Vision Model",
- visionModelDesc: "Used for analyzing images and extracting text from PDFs",
+ visionModelLabel: "Model de visió",
+ visionModelDesc: "S'usa per analitzar imatges i extreure text de PDFs",
<file context>
@@ -857,6 +857,8 @@ export const plPL = {
embeddingModelDesc: "Używany do wyszukiwania semantycznego i osadzeń wektorowych",
ttsModelLabel: "Model syntezy mowy",
ttsModelDesc: "Używany do generowania podcastów",
+ visionModelLabel: "Vision Model",
+ visionModelDesc: "Used for analyzing images and extracting text from PDFs",
sttModelLabel: "Model rozpoznawania mowy",
</file context>
Suggested change:
- visionModelLabel: "Vision Model",
- visionModelDesc: "Used for analyzing images and extracting text from PDFs",
+ visionModelLabel: "Model wizyjny",
+ visionModelDesc: "Używany do analizowania obrazów i wyodrębniania tekstu z PDFów",
<file context>
@@ -857,6 +857,8 @@ export const ruRU = {
embeddingModelDesc: "Используется для семантического поиска и векторных эмбеддингов",
ttsModelLabel: "Модель озвучивания",
ttsModelDesc: "Используется для генерации подкастов",
+ visionModelLabel: "Vision Model",
+ visionModelDesc: "Used for analyzing images and extracting text from PDFs",
sttModelLabel: "Модель распознавания речи",
</file context>
Suggested change:
- visionModelLabel: "Vision Model",
- visionModelDesc: "Used for analyzing images and extracting text from PDFs",
+ visionModelLabel: "Модель компьютерного зрения",
+ visionModelDesc: "Используется для анализа изображений и извлечения текста из PDF",
<file context>
@@ -857,6 +857,8 @@ export const jaJP = {
embeddingModelDesc: "セマンティック検索とベクトルEmbeddingに使用",
ttsModelLabel: "音声合成モデル",
ttsModelDesc: "ポッドキャスト生成に使用",
+ visionModelLabel: "Vision Model",
+ visionModelDesc: "Used for analyzing images and extracting text from PDFs",
sttModelLabel: "音声認識モデル",
</file context>
2 issues found across 17 files (changes from recent commits).
<file name="open_notebook/ai/vision_parser.py">
<violation number="1" location="open_notebook/ai/vision_parser.py:115">
P1: Hard page cap introduces silent partial-document parsing for PDFs over 50 pages.</violation>
</file>
<file name="frontend/src/lib/locales/tr-TR/index.ts">
<violation number="1" location="frontend/src/lib/locales/tr-TR/index.ts:860">
P2: Newly added vision model labels are in English instead of Turkish in the tr-TR locale file, breaking localization consistency.</violation>
</file>
<file name="frontend/src/lib/locales/ca-ES/index.ts">
<violation number="1" location="frontend/src/lib/locales/ca-ES/index.ts:860">
P2: Untranslated English strings for new vision model keys in Catalan locale file.</violation>
</file>
<file name="frontend/src/lib/locales/pl-PL/index.ts">
<violation number="1" location="frontend/src/lib/locales/pl-PL/index.ts:860">
P2: New vision model localization strings were added in English instead of Polish in the Polish locale file, breaking translation consistency.</violation>
</file>
<file name="frontend/src/lib/locales/ru-RU/index.ts">
<violation number="1" location="frontend/src/lib/locales/ru-RU/index.ts:860">
P2: New vision model UI strings added in English instead of Russian, breaking locale consistency in the ru-RU locale file.</violation>
</file>
<file name="frontend/src/lib/locales/ja-JP/index.ts">
<violation number="1" location="frontend/src/lib/locales/ja-JP/index.ts:860">
P2: New vision model localization keys are in English instead of Japanese, breaking i18n for the Japanese locale.</violation>
</file>
<file name="fix_i18n.py">
<violation number="1" location="fix_i18n.py:9">
P2: Missing explicit UTF-8 encoding when reading/writing locale files; risk of Unicode errors with non-ASCII content</violation>
</file>
<file context>
@@ -0,0 +1,31 @@
+locales = glob.glob(f"{locales_dir}/*/index.ts")
+
+for locale_file in locales:
+ with open(locale_file, 'r') as f:
+ content = f.read()
+
</file context>
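The encoding issue above comes from `open()` falling back to the platform's preferred encoding when none is given, which is not always UTF-8 (Windows is the usual example). A minimal sketch of the fix follows; `rewrite_locale_file` is a hypothetical helper and does not reproduce what `fix_i18n.py` actually does.

```python
from pathlib import Path


def rewrite_locale_file(path: Path, old: str, new: str) -> bool:
    """Replace old with new in a locale file, reading and writing as UTF-8.

    Returns True when the file was changed.
    """
    content = path.read_text(encoding="utf-8")
    updated = content.replace(old, new)
    if updated == content:
        return False  # nothing to rewrite; leave the file untouched
    path.write_text(updated, encoding="utf-8")
    return True
```

Passing `encoding="utf-8"` on both the read and the write keeps strings such as `"音声合成モデル"` intact regardless of the machine's locale settings.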
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
…finite loops on 34B models
…vent ChatOllama dropping content
…ection failures on files with frontmatter or HTML
Thanks a lot for the effort here, @emontele 🙏. The local-vision goal is something we very much want. But we're going to close this PR for an architectural reason rather than a code one. Multimodal extraction belongs in
So if you'd like to drive this forward, the most impactful place to contribute is content-core directly: there's already a vision-extractor PR in flight there (lfnovo/content-core#37) that does exactly this (PDF via
Tracking the ON-side design in #331 (multimodal / vision strategy). Closing this in favor of that direction, and genuinely, thank you, this helped sharpen the call.
A couple of side notes for next time regardless: the PR included scratch/debug files (
Description
This PR introduces robust, provider-agnostic multimodal Vision Parsing for images and PDFs, allowing the ingestion engine to "read" complex visual documents using local models (like LLaVA or Gemma via Ollama) or external APIs (GPT-4o, Claude 3.5 Sonnet, Gemini).
Additionally, it resolves critical Docker deployment bugs related to virtual environment caching and Next.js binding.
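As a rough illustration of the image payload shape the vision-parsing description refers to, the sketch below builds a text-plus-image content list in the `image_url` data-URL style using only the standard library. `build_image_content` is a hypothetical helper, and the PNG MIME type is an assumption about what the rasterizer emits.

```python
import base64
from typing import Any, Dict, List


def build_image_content(prompt: str, png_bytes: bytes) -> List[Dict[str, Any]]:
    """Build a text + image content list in the image_url data-URL style."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return [
        {"type": "text", "text": prompt},
        {
            "type": "image_url",
            # Assumes PNG input; adjust the MIME type for other image formats.
            "image_url": {"url": f"data:image/png;base64,{encoded}"},
        },
    ]
```

A list like this would typically be passed as the `content` of a `langchain_core` `HumanMessage`. Whether a given provider integration accepts it unchanged is worth verifying per model.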
Key features & fixes:
- `open_notebook/ai/vision_parser.py` using `PyMuPDF` to rasterize PDFs and inject base64 strings into the standard `langchain_core` `HumanMessage` payload.
- `asyncio.to_thread` to prevent CPU-bound rasterization from freezing the FastAPI async event loop.
- `frontend/src/app/(dashboard)/settings/api-keys/page.tsx` to include `vision` in the UI model selection.
- `ModuleNotFoundError` during Docker build by removing the dummy `__init__.py` hack and using the official `uv sync --no-install-project`.
- `host.docker.internal:host-gateway` to `docker-compose.yml` `extra_hosts` to ensure seamless resolution of local host IPs (e.g., Ollama's `11434`) on Linux machines.
Related Issue
Fixes #
Type of Change
How Has This Been Tested?
- `uv run pytest`)
Test Details:
- `api`, `worker`, and `frontend` Docker containers to verify `uv` installation succeeds and Next.js successfully binds to `0.0.0.0` via `supervisord.conf`.
- `http://host.docker.internal:11434`).
- `nvtop` registered GPU processing activity and the event loop did not block during multi-page rasterization.
Design Alignment
Which design principles does this PR support?
Explanation:
- `HumanMessage` standard (`{"type": "image_url", ...}`). This guarantees automatic compatibility with any multimodal API (OpenAI, Anthropic, Google) without custom wrappers.
- `fitz.open`, `.get_pixmap()`) to separate threads.
Checklist
Code Quality
Testing
- `make ruff` or `ruff check . --fix`
- `make lint` or `uv run python -m mypy .`
Documentation
- `/docs` (if applicable)
Database Changes
- `/migrations`)
Pre-Submission Verification
Before submitting, please verify: