
feat: implement local vision parsing and fix docker deployment#888

Closed
emontele wants to merge 24 commits into lfnovo:main from emontele:main

Conversation

emontele commented Jun 15, 2026

Description

This PR introduces robust, provider-agnostic multimodal Vision Parsing for images and PDFs, allowing the ingestion engine to "read" complex visual documents using local models (like LLaVA or Gemma via Ollama) or external APIs (GPT-4o, Claude 3.5 Sonnet, Gemini).

Additionally, it resolves critical Docker deployment bugs related to virtual environment caching and Next.js binding.

Key features & fixes:

  • Implemented open_notebook/ai/vision_parser.py using PyMuPDF to rasterize PDFs and inject base64 strings into the standard langchain_core HumanMessage payload.
  • Delegated PyMuPDF processing to asyncio.to_thread to prevent CPU-bound rasterization from freezing the FastAPI async event loop.
  • Updated frontend/src/app/(dashboard)/settings/api-keys/page.tsx to include vision in the UI model selection.
  • Fixed ModuleNotFoundError during Docker build by removing the dummy __init__.py hack and using the official uv sync --no-install-project.
  • Added host.docker.internal:host-gateway to docker-compose.yml extra_hosts to ensure seamless resolution of local host IPs (e.g., Ollama's 11434) on Linux machines.
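
As a rough illustration of the payload described in the first bullet, the sketch below base64-encodes image bytes and builds the dictionary-style content blocks that a `langchain_core` `HumanMessage` accepts as its `content`. The helper name `build_vision_content` and the sample bytes are hypothetical and not taken from the PR. The PyMuPDF rasterization step is left out so the block runs on the standard library alone.

```python
import base64


def build_vision_content(prompt: str, png_bytes: bytes) -> list[dict]:
    """Return one text block and one image block in the image_url style.

    Hypothetical helper for illustration; the PR's actual code may differ.
    """
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return [
        {"type": "text", "text": prompt},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encoded}"},
        },
    ]


# Placeholder bytes standing in for a rasterized PDF page.
content = build_vision_content("Transcribe this page.", b"\x89PNG fake page")
print(content[1]["image_url"]["url"][:22])
```

The resulting list would then be passed as `HumanMessage(content=content)` to whichever chat model is configured.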

Related Issue

Fixes #

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

How Has This Been Tested?

  • Tested locally with Docker
  • Tested locally with development setup
  • Added new unit tests
  • Existing tests pass (uv run pytest)
  • Manual testing performed (describe below)

Test Details:

  • Spun up the api, worker, and frontend Docker containers to verify uv installation succeeds and Next.js successfully binds to 0.0.0.0 via supervisord.conf.
  • Uploaded complex scientific PDFs through the web UI with a local LLaVA vision model configured via Ollama (http://host.docker.internal:11434).
  • Confirmed nvtop registered GPU processing activity and the event loop did not block during multi-page rasterization.
  • Verified fallback logic triggers correctly if no vision model is selected.

Design Alignment

Which design principles does this PR support?

  • Privacy First
  • Simplicity Over Features
  • API-First Architecture
  • Multi-Provider Flexibility
  • Extensibility Through Standards
  • Async-First for Performance

Explanation:

  • Privacy First: Fully supports 100% local visual document parsing without leaking sensitive images to the cloud.
  • Multi-Provider Flexibility / Standards: Implemented the vision logic using LangChain's native dictionary-based HumanMessage standard ({"type": "image_url", ...}). This guarantees automatic compatibility with any multimodal API (OpenAI, Anthropic, Google) without custom wrappers.
  • Async-First: Safely offloads synchronous, heavy C-library calls (fitz.open, .get_pixmap()) to separate threads.
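
The offloading pattern named in the Async-First bullet can be sketched with the standard library alone. Here `rasterize_page` is a hypothetical stand-in for the blocking PyMuPDF calls (it only sleeps), and `heartbeat` exists purely to show that the event loop keeps running while the worker thread is busy.

```python
import asyncio
import time


def rasterize_page(page_number: int) -> bytes:
    """Stand-in for a blocking call such as PDF page rendering."""
    time.sleep(0.2)  # simulates work that would otherwise block the loop
    return f"page-{page_number}".encode()


async def heartbeat(ticks: list[int]) -> None:
    """Append a tick every 10 ms for as long as the loop is free to run."""
    while True:
        ticks.append(1)
        await asyncio.sleep(0.01)


async def main() -> tuple[bytes, int]:
    ticks: list[int] = []
    beat = asyncio.create_task(heartbeat(ticks))
    # The blocking call runs in a worker thread; the loop keeps scheduling.
    image = await asyncio.to_thread(rasterize_page, 1)
    beat.cancel()
    try:
        await beat
    except asyncio.CancelledError:
        pass
    return image, len(ticks)


image, tick_count = asyncio.run(main())
print(image, tick_count > 1)
```

Calling `rasterize_page(1)` directly inside `main` would instead hold the loop for the full 0.2 seconds and the heartbeat would not advance. How much a real C-library call benefits depends on whether it releases the GIL, which this sketch does not demonstrate.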

Checklist

Code Quality

  • My code follows PEP 8 style guidelines (Python)
  • My code follows TypeScript best practices (Frontend)
  • I have added type hints to my code (Python)
  • I have added JSDoc comments where appropriate (TypeScript)
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings or errors

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran linting: make ruff or ruff check . --fix
  • I ran type checking: make lint or uv run python -m mypy .

Documentation

  • I have updated the relevant documentation in /docs (if applicable)
  • I have added/updated docstrings for new/modified functions
  • I have updated the API documentation (if API changes were made)
  • I have added comments to complex logic

Database Changes

  • I have created migration scripts for any database schema changes (in /migrations)
  • Migration includes both up and down scripts
  • Migration has been tested locally

Breaking Changes

  • This PR includes breaking changes
  • I have documented the migration path for users
  • I have updated MIGRATION.md (if applicable)

Pre-Submission Verification

Before submitting, please verify:

  • I have read CONTRIBUTING.md
  • I have read DESIGN_PRINCIPLES.md
  • This PR addresses an approved issue that was assigned to me
  • I have not included unrelated changes in this PR
  • My PR title follows conventional commits format (e.g., "feat: add user authentication")


Emanuele Monteleone and others added 2 commits June 15, 2026 21:43
Added documentation for Vision Parsing feature, including its overview, functionality, configuration steps, and fallback mechanism.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor


11 issues found across 16 files

Confidence score: 2/5

  • pyproject.toml adds PyMuPDF as a core dependency even though it is AGPLv3, which conflicts with an MIT-licensed distribution and creates immediate legal/compliance risk if merged as-is — either replace it with a license-compatible library or move it to an optional, clearly isolated component before merging.
  • docker-compose.yml now defaults to local build instead of pulling the published image, which can break the documented deployment path and surprise operators during rollout — restore pull-first behavior (or update docs/scripts in lockstep) before merge.
  • open_notebook/ai/vision_parser.py currently parses every PDF page with no cap and performs blocking file/encoding work in an async path, so user-provided large PDFs can drive unbounded LLM cost and stall request handling under load — add a strict max-page limit and move blocking I/O off the event loop before merging.
  • open_notebook/ai/vision_parser.py and api/routers/models.py have correctness gaps (document handle not safely closed on exceptions, response.content may violate the annotated str type, and PUT defaults omit default_vision_model), which can lead to resource leaks and inconsistent model-settings behavior — add explicit close/typing guards and make PUT/GET model fields symmetric before merge.
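
One of the correctness gaps above is that `response.content` may not be a `str`. In `langchain_core`, message content can be either a string or a list of strings and dictionary blocks, so a small guard like the following could normalize it before returning. The function name `content_to_text` is hypothetical and this is only a sketch of one possible guard, not the fix applied in the PR.

```python
from typing import Any, Union


def content_to_text(content: Union[str, list[Any]]) -> str:
    """Flatten a chat-model response's content into a plain string."""
    if isinstance(content, str):
        return content
    parts: list[str] = []
    for block in content:
        if isinstance(block, str):
            parts.append(block)
        elif isinstance(block, dict) and block.get("type") == "text":
            parts.append(str(block.get("text", "")))
        # Non-text blocks (for example images) are skipped.
    return "".join(parts)


print(content_to_text(["Page 1: ", {"type": "text", "text": "Abstract"}]))
```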


Comment thread pyproject.toml Outdated
Comment thread docker-compose.yml Outdated
Comment thread open_notebook/ai/vision_parser.py
Comment thread open_notebook/ai/vision_parser.py Outdated
Comment thread api/routers/models.py
Comment thread open_notebook/ai/models.py Outdated
Comment thread frontend/src/app/(dashboard)/settings/api-keys/page.tsx Outdated
Comment thread open_notebook/ai/vision_parser.py Outdated
Comment thread open_notebook/ai/vision_parser.py Outdated
Comment thread docker-compose.yml Outdated

@cubic-dev-ai cubic-dev-ai Bot left a comment


4 issues found across 9 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="open_notebook/ai/vision_parser.py">

<violation number="1" location="open_notebook/ai/vision_parser.py:115">
P1: Hard page cap introduces silent partial-document parsing for PDFs over 50 pages.</violation>
</file>


with fitz.open(file_path) as doc:
    total_pages = len(doc)

page_count = min(total_pages, MAX_PAGES)


P1: Hard page cap introduces silent partial-document parsing for PDFs over 50 pages.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At open_notebook/ai/vision_parser.py, line 115:

<comment>Hard page cap introduces silent partial-document parsing for PDFs over 50 pages.</comment>

<file context>
@@ -91,20 +96,25 @@ async def process_pdf_with_vision(file_path: str) -> Optional[str]:
+        with fitz.open(file_path) as doc:
+            total_pages = len(doc)
+            
+        page_count = min(total_pages, MAX_PAGES)
+        if total_pages > MAX_PAGES:
+            logger.warning(f"PDF {file_path} has {total_pages} pages. Truncating to {MAX_PAGES} for vision parsing.")
</file context>
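
One way to make the truncation visible rather than silent is to carry the page counts alongside the parsed text and append a notice when pages were dropped. The sketch below assumes the 50-page cap mentioned in the comment. The names `PagePlan`, `plan_pages`, and `notice` are hypothetical and do not appear in the PR.

```python
from dataclasses import dataclass

MAX_PAGES = 50  # cap mentioned in the review comment


@dataclass(frozen=True)
class PagePlan:
    total_pages: int
    pages_to_parse: int

    @property
    def truncated(self) -> bool:
        return self.pages_to_parse < self.total_pages

    def notice(self) -> str:
        """Text to append to the parsed output so truncation is visible."""
        if not self.truncated:
            return ""
        return (
            f"[Vision parsing covered pages 1-{self.pages_to_parse} "
            f"of {self.total_pages}; remaining pages were not parsed.]"
        )


def plan_pages(total_pages: int, max_pages: int = MAX_PAGES) -> PagePlan:
    return PagePlan(total_pages, min(total_pages, max_pages))


plan = plan_pages(120)
print(plan.truncated, plan.notice())
```

A caller could log the warning as the diff already does and also surface `plan.notice()` to the user, so a partially parsed document is not mistaken for a complete one.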

Comment thread docs/2-CORE-CONCEPTS/VISION_PARSING_BACKUP.md Outdated
Comment thread open_notebook/ai/vision_parser.py Outdated
Comment thread frontend/src/lib/locales/en-US/index.ts Outdated
emontele and others added 5 commits June 15, 2026 23:09
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment


7 issues found across 14 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="frontend/src/lib/locales/tr-TR/index.ts">

<violation number="1" location="frontend/src/lib/locales/tr-TR/index.ts:860">
P2: Newly added vision model labels are in English instead of Turkish in the tr-TR locale file, breaking localization consistency.</violation>
</file>

<file name="frontend/src/lib/locales/pt-BR/index.ts">

<violation number="1" location="frontend/src/lib/locales/pt-BR/index.ts:860">
P2: New vision model translation keys are in English instead of Portuguese</violation>
</file>

<file name="frontend/src/lib/locales/ca-ES/index.ts">

<violation number="1" location="frontend/src/lib/locales/ca-ES/index.ts:860">
P2: Untranslated English strings for new vision model keys in Catalan locale file.</violation>
</file>

<file name="frontend/src/lib/locales/zh-TW/index.ts">

<violation number="1" location="frontend/src/lib/locales/zh-TW/index.ts:860">
P2: Untranslated English strings added to Traditional Chinese locale: `visionModelLabel` and `visionModelDesc` should be localized to match the pattern of all other model-related keys.</violation>
</file>

<file name="frontend/src/lib/locales/pl-PL/index.ts">

<violation number="1" location="frontend/src/lib/locales/pl-PL/index.ts:860">
P2: New vision model localization strings were added in English instead of Polish in the Polish locale file, breaking translation consistency.</violation>
</file>

<file name="frontend/src/lib/locales/ru-RU/index.ts">

<violation number="1" location="frontend/src/lib/locales/ru-RU/index.ts:860">
P2: New vision model UI strings added in English instead of Russian, breaking locale consistency in the ru-RU locale file.</violation>
</file>

<file name="frontend/src/lib/locales/ja-JP/index.ts">

<violation number="1" location="frontend/src/lib/locales/ja-JP/index.ts:860">
P2: New vision model localization keys are in English instead of Japanese, breaking i18n for the Japanese locale.</violation>
</file>
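
Untranslated strings of this kind can be caught mechanically by comparing each locale against the English source. The sketch below uses plain dictionaries with values quoted in this thread. The function name `find_untranslated` is hypothetical, and loading the real `index.ts` files is out of scope here.

```python
def find_untranslated(
    english: dict[str, str], locale: dict[str, str], keys: list[str]
) -> list[str]:
    """Return keys whose locale value is missing or identical to English."""
    return [
        key
        for key in keys
        if key not in locale or locale[key] == english.get(key)
    ]


en_us = {
    "visionModelLabel": "Vision Model",
    "visionModelDesc": "Used for analyzing images and extracting text from PDFs",
}
ca_es = {
    "visionModelLabel": "Model de visió",  # translated
    "visionModelDesc": "Used for analyzing images and extracting text from PDFs",
}
print(find_untranslated(en_us, ca_es, list(en_us)))
```

Identical values are not always mistakes (a brand name or an acronym may legitimately match), so a check like this is better treated as a prompt for review than as a hard failure.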


Comment thread frontend/src/lib/locales/tr-TR/index.ts Outdated
embeddingModelDesc: "Anlamsal arama ve vektör gömme için kullanılır",
ttsModelLabel: "Metinden Sese Modeli",
ttsModelDesc: "Podcast oluşturma için kullanılır",
visionModelLabel: "Vision Model",


P2: Newly added vision model labels are in English instead of Turkish in the tr-TR locale file, breaking localization consistency.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At frontend/src/lib/locales/tr-TR/index.ts, line 860:

<comment>Newly added vision model labels are in English instead of Turkish in the tr-TR locale file, breaking localization consistency.</comment>

<file context>
@@ -857,6 +857,8 @@ export const trTR = {
     embeddingModelDesc: "Anlamsal arama ve vektör gömme için kullanılır",
     ttsModelLabel: "Metinden Sese Modeli",
     ttsModelDesc: "Podcast oluşturma için kullanılır",
+    visionModelLabel: "Vision Model",
+    visionModelDesc: "Used for analyzing images and extracting text from PDFs",
     sttModelLabel: "Sesten Metne Modeli",
</file context>

Comment thread frontend/src/lib/locales/pt-BR/index.ts Outdated
Comment thread frontend/src/lib/locales/ca-ES/index.ts Outdated
Comment on lines +860 to +861
visionModelLabel: "Vision Model",
visionModelDesc: "Used for analyzing images and extracting text from PDFs",


P2: Untranslated English strings for new vision model keys in Catalan locale file.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At frontend/src/lib/locales/ca-ES/index.ts, line 860:

<comment>Untranslated English strings for new vision model keys in Catalan locale file.</comment>

<file context>
@@ -857,6 +857,8 @@ export const caES = {
     embeddingModelDesc: "S'usa per a la cerca semàntica i les incrustacions vectorials",
     ttsModelLabel: "Model de text a veu",
     ttsModelDesc: "S'usa per a la generació de podcasts",
+    visionModelLabel: "Vision Model",
+    visionModelDesc: "Used for analyzing images and extracting text from PDFs",
     sttModelLabel: "Model de veu a text",
</file context>
Suggested change
-    visionModelLabel: "Vision Model",
-    visionModelDesc: "Used for analyzing images and extracting text from PDFs",
+    visionModelLabel: "Model de visió",
+    visionModelDesc: "S'usa per analitzar imatges i extreure text de PDFs",

Comment thread frontend/src/lib/locales/zh-TW/index.ts Outdated
Comment thread frontend/src/lib/locales/pl-PL/index.ts Outdated
Comment on lines +860 to +861
visionModelLabel: "Vision Model",
visionModelDesc: "Used for analyzing images and extracting text from PDFs",


P2: New vision model localization strings were added in English instead of Polish in the Polish locale file, breaking translation consistency.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At frontend/src/lib/locales/pl-PL/index.ts, line 860:

<comment>New vision model localization strings were added in English instead of Polish in the Polish locale file, breaking translation consistency.</comment>

<file context>
@@ -857,6 +857,8 @@ export const plPL = {
     embeddingModelDesc: "Używany do wyszukiwania semantycznego i osadzeń wektorowych",
     ttsModelLabel: "Model syntezy mowy",
     ttsModelDesc: "Używany do generowania podcastów",
+    visionModelLabel: "Vision Model",
+    visionModelDesc: "Used for analyzing images and extracting text from PDFs",
     sttModelLabel: "Model rozpoznawania mowy",
</file context>
Suggested change
-    visionModelLabel: "Vision Model",
-    visionModelDesc: "Used for analyzing images and extracting text from PDFs",
+    visionModelLabel: "Model wizyjny",
+    visionModelDesc: "Używany do analizowania obrazów i wyodrębniania tekstu z PDFów",

Comment thread frontend/src/lib/locales/ru-RU/index.ts Outdated
Comment on lines +860 to +861
visionModelLabel: "Vision Model",
visionModelDesc: "Used for analyzing images and extracting text from PDFs",


P2: New vision model UI strings added in English instead of Russian, breaking locale consistency in the ru-RU locale file.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At frontend/src/lib/locales/ru-RU/index.ts, line 860:

<comment>New vision model UI strings added in English instead of Russian, breaking locale consistency in the ru-RU locale file.</comment>

<file context>
@@ -857,6 +857,8 @@ export const ruRU = {
     embeddingModelDesc: "Используется для семантического поиска и векторных эмбеддингов",
     ttsModelLabel: "Модель озвучивания",
     ttsModelDesc: "Используется для генерации подкастов",
+    visionModelLabel: "Vision Model",
+    visionModelDesc: "Used for analyzing images and extracting text from PDFs",
     sttModelLabel: "Модель распознавания речи",
</file context>
Suggested change
-    visionModelLabel: "Vision Model",
-    visionModelDesc: "Used for analyzing images and extracting text from PDFs",
+    visionModelLabel: "Модель компьютерного зрения",
+    visionModelDesc: "Используется для анализа изображений и извлечения текста из PDF",

Comment thread frontend/src/lib/locales/ja-JP/index.ts Outdated
embeddingModelDesc: "セマンティック検索とベクトルEmbeddingに使用",
ttsModelLabel: "音声合成モデル",
ttsModelDesc: "ポッドキャスト生成に使用",
visionModelLabel: "Vision Model",


P2: New vision model localization keys are in English instead of Japanese, breaking i18n for the Japanese locale.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At frontend/src/lib/locales/ja-JP/index.ts, line 860:

<comment>New vision model localization keys are in English instead of Japanese, breaking i18n for the Japanese locale.</comment>

<file context>
@@ -857,6 +857,8 @@ export const jaJP = {
     embeddingModelDesc: "セマンティック検索とベクトルEmbeddingに使用",
     ttsModelLabel: "音声合成モデル",
     ttsModelDesc: "ポッドキャスト生成に使用",
+    visionModelLabel: "Vision Model",
+    visionModelDesc: "Used for analyzing images and extracting text from PDFs",
     sttModelLabel: "音声認識モデル",
</file context>

@cubic-dev-ai cubic-dev-ai Bot left a comment


2 issues found across 17 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="open_notebook/ai/vision_parser.py">

<violation number="1" location="open_notebook/ai/vision_parser.py:115">
P1: Hard page cap introduces silent partial-document parsing for PDFs over 50 pages.</violation>
</file>

<file name="frontend/src/lib/locales/tr-TR/index.ts">

<violation number="1" location="frontend/src/lib/locales/tr-TR/index.ts:860">
P2: Newly added vision model labels are in English instead of Turkish in the tr-TR locale file, breaking localization consistency.</violation>
</file>

<file name="frontend/src/lib/locales/ca-ES/index.ts">

<violation number="1" location="frontend/src/lib/locales/ca-ES/index.ts:860">
P2: Untranslated English strings for new vision model keys in Catalan locale file.</violation>
</file>

<file name="frontend/src/lib/locales/pl-PL/index.ts">

<violation number="1" location="frontend/src/lib/locales/pl-PL/index.ts:860">
P2: New vision model localization strings were added in English instead of Polish in the Polish locale file, breaking translation consistency.</violation>
</file>

<file name="frontend/src/lib/locales/ru-RU/index.ts">

<violation number="1" location="frontend/src/lib/locales/ru-RU/index.ts:860">
P2: New vision model UI strings added in English instead of Russian, breaking locale consistency in the ru-RU locale file.</violation>
</file>

<file name="frontend/src/lib/locales/ja-JP/index.ts">

<violation number="1" location="frontend/src/lib/locales/ja-JP/index.ts:860">
P2: New vision model localization keys are in English instead of Japanese, breaking i18n for the Japanese locale.</violation>
</file>

<file name="fix_i18n.py">

<violation number="1" location="fix_i18n.py:9">
P2: Missing explicit UTF-8 encoding when reading/writing locale files; risk of Unicode errors with non-ASCII content</violation>
</file>


Comment thread fix_i18n.py
locales = glob.glob(f"{locales_dir}/*/index.ts")

for locale_file in locales:
    with open(locale_file, 'r') as f:


P2: Missing explicit UTF-8 encoding when reading/writing locale files; risk of Unicode errors with non-ASCII content

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At fix_i18n.py, line 9:

<comment>Missing explicit UTF-8 encoding when reading/writing locale files; risk of Unicode errors with non-ASCII content</comment>

<file context>
@@ -0,0 +1,31 @@
+locales = glob.glob(f"{locales_dir}/*/index.ts")
+
+for locale_file in locales:
+    with open(locale_file, 'r') as f:
+        content = f.read()
+        
</file context>
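
A minimal sketch of the fix the comment asks for: pass `encoding="utf-8"` on both the read and the write so the result does not depend on the platform's default encoding. The helper name `rewrite_locale_file` and the replacement performed are hypothetical; only the explicit encoding is the point.

```python
import tempfile
from pathlib import Path


def rewrite_locale_file(path: Path, old: str, new: str) -> None:
    """Replace text in a locale file, reading and writing as UTF-8."""
    content = path.read_text(encoding="utf-8")
    path.write_text(content.replace(old, new), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    locale_file = Path(tmp) / "index.ts"
    locale_file.write_text('ttsModelLabel: "音声合成モデル",\n', encoding="utf-8")
    rewrite_locale_file(locale_file, "ttsModelLabel", "ttsLabel")
    result = locale_file.read_text(encoding="utf-8")

print(result)
```

The same keyword works with the built-in `open(locale_file, "r", encoding="utf-8")` form used in the script under review.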

Comment thread frontend/src/lib/locales/zh-TW/index.ts Outdated
cubic-dev-ai Bot commented Jun 18, 2026

You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment @cubic-dev-ai review.

lfnovo commented Jun 21, 2026

Owner

Thanks a lot for the effort here, @emontele 🙏 — the local-vision goal is something we very much want. But we're going to close this PR for an architectural reason rather than a code one.

Multimodal extraction belongs in content-core, not in Open Notebook. content-core is our extraction layer — it's where "turn this PDF/image/video into text/markdown" lives, behind a provider-agnostic interface. This PR implements that logic inside Open Notebook (open_notebook/ai/vision_parser.py rasterizing PDFs with PyMuPDF and building the vision payload by hand), which duplicates content-core's job and ties Open Notebook to one rasterization path. We want ON to stay thin here: select a vision model, hand the source to content-core, and let content-core do the extraction.

So if you'd like to drive this forward, the most impactful place to contribute is content-core directly — there's already a vision-extractor PR in flight there (lfnovo/content-core#37) that does exactly this (PDF via pdftoppm, video frames via ffmpeg, image description), and the Open Notebook side is just the model-slot wiring (see #791). Once content-core ships vision, ON gets it for free.

Tracking the ON-side design in #331 (multimodal / vision strategy). Closing this in favor of that direction — and genuinely, thank you, this helped sharpen the call.

A couple of side notes for next time regardless: the PR included scratch/debug files (scratch_test_ollama_vision.py, check_models.py, fix_i18n.py, fix_models.py, test.jpg) and mixed in unrelated Docker/supervisord changes. The host.docker.internal Docker fix is actually useful on its own — if you want to resend just that as a small focused PR, it'd help with #644 / #623.

@lfnovo lfnovo closed this Jun 21, 2026

2 participants