feat: vision models #791
Add concurrently and dotenv-cli to frontend dependencies and introduce convenience npm scripts to run local services: "all", "all:no-db", "db", "api", and "worker". package-lock.json was updated to reflect the installed packages and dependency tree.
Introduce default_vision_model across backend, API, and frontend so vision-capable models can be configured and used. Backend: add field to DefaultModels, persist and return it in ModelsService and API router; add ModelManager.get_vision_model and use vision fallback to default chat model where appropriate. Content processing: include vision model provider/name/config (with credential->esperanto config) for image/video/PDF analysis and improve error logging. Frontend: add vision selector to default model UI, update layout (min-w-0), and add localized labels/descriptions for en/ja/pt-BR/zh-CN/zh-TW. Update TypeScript types to include default_vision_model.
Update frontend/package.json to include the --no-sync flag for the "api" and "worker" npm scripts. Both scripts now invoke `uv` with --no-sync (previously invoked without it), causing uv to run without its sync behavior.
Reduce the primary models grid from 4 to 3 columns at large screens and update advanced models to use responsive 2/3 column breakpoints. Add truncate to SelectTrigger and min-w-0 to advanced item containers to prevent overflowing model names and ensure proper text truncation/overflow handling.
# Conflicts:
#   frontend/src/app/(dashboard)/settings/api-keys/page.tsx
Introduce a new 'Vision Models' configuration guide and wire it into user and provider docs. Adds docs/5-CONFIGURATION/vision-models.md explaining how to configure a default Vision Model, required binaries (pdftoppm, ffmpeg), routing/adaptive-sampling behavior, cost considerations, and troubleshooting. Update Adding Sources and AI Providers pages to reference the new guide and clarify that image/PDF/video visual extraction requires a configured Vision Model (and that pure image files are unsupported without one).
4 issues found across 18 files
Confidence score: 3/5
- There is a concrete regression risk in `open_notebook/graphs/source.py`: shared exception handling can cause STT failures to also skip vision model configuration, which may change runtime behavior for source ingestion.
- The documentation issues in `docs/5-CONFIGURATION/vision-models.md` and `docs/3-USER-GUIDE/adding-sources.md` are lower severity but user-facing, and could lead to incorrect setup expectations around Vision Model fallback behavior.
- Given one medium-severity, high-confidence code-path issue (6/10, confidence 9/10), this looks mergeable with caution rather than a low-risk merge.
- Pay close attention to `open_notebook/graphs/source.py`, `docs/5-CONFIGURATION/vision-models.md`, `docs/3-USER-GUIDE/adding-sources.md`, and `frontend/package.json`: exception-path coupling may affect ingestion behavior, docs need alignment on fallback semantics, and dependency scope should avoid production bloat.
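The exception-path coupling noted for `open_notebook/graphs/source.py` can be illustrated with a minimal sketch that gives each model lookup its own `try`/`except`. This is not the project's code: `build_content_state`, `resolve_stt`, `resolve_vision`, and the `stt_*` keys are hypothetical names chosen for illustration. Only the `vision_provider`, `vision_model`, and `vision_config` keys come from the PR description.

```python
import logging
from typing import Any, Callable, Dict, Optional, Tuple

logger = logging.getLogger(__name__)

# Hypothetical shape: (provider, model name, credential config or None).
ModelInfo = Tuple[str, str, Optional[Dict[str, Any]]]


def build_content_state(
    base: Dict[str, Any],
    resolve_stt: Callable[[], Optional[ModelInfo]],
    resolve_vision: Callable[[], Optional[ModelInfo]],
) -> Dict[str, Any]:
    """Populate model settings using one try/except per model type."""
    state = dict(base)

    # Speech-to-text lookup is isolated: a failure is logged and skipped.
    try:
        stt = resolve_stt()
        if stt is not None:
            # The stt_* key names are illustrative, not the project's keys.
            state["stt_provider"], state["stt_model"] = stt[0], stt[1]
            if stt[2]:
                state["stt_config"] = stt[2]
    except Exception:
        logger.exception("Could not resolve speech-to-text model")

    # Vision lookup runs regardless of what happened above.
    try:
        vision = resolve_vision()
        if vision is not None:
            state["vision_provider"], state["vision_model"] = vision[0], vision[1]
            if vision[2]:
                state["vision_config"] = vision[2]
    except Exception:
        logger.exception("Could not resolve vision model")

    return state
```

With separate handlers, a speech-to-text lookup that raises leaves the vision keys intact, which is the behavior the review asks for.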
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="frontend/package.json">
<violation number="1" location="frontend/package.json:45">
P3: Development-only CLI tooling was added to `dependencies` instead of `devDependencies`, which unnecessarily bloats production installs.</violation>
</file>
<file name="docs/5-CONFIGURATION/vision-models.md">
<violation number="1" location="docs/5-CONFIGURATION/vision-models.md:97">
P2: Docs contradict themselves on what happens when Vision Model is unset: fallback to Chat Model vs. always non-vision ingestion.</violation>
</file>
<file name="docs/3-USER-GUIDE/adding-sources.md">
<violation number="1" location="docs/3-USER-GUIDE/adding-sources.md:79">
P2: Docs incorrectly state that vision ingestion requires an explicit Vision Model and that image-only sources remain unsupported otherwise, but the app falls back to the default Chat Model (including multimodal chat models).</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| ## Behavior & Failure Modes | ||
| - **No vision model configured** — images remain unsupported; PDFs use `pdfplumber`; videos use audio-only. Existing behavior preserved. |
P2: Docs contradict themselves on what happens when Vision Model is unset: fallback to Chat Model vs. always non-vision ingestion.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/5-CONFIGURATION/vision-models.md, line 97:
<comment>Docs contradict themselves on what happens when Vision Model is unset: fallback to Chat Model vs. always non-vision ingestion.</comment>
<file context>
@@ -0,0 +1,134 @@
+
+## Behavior & Failure Modes
+
+- **No vision model configured** — images remain unsupported; PDFs use `pdfplumber`; videos use audio-only. Existing behavior preserved.
+- **Vision model fails on a frame/page** — the processor returns an `ExtractionOutput` with a placeholder message rather than aborting the whole source.
+- **Video audio extraction fails** — the visual analysis still completes; the transcript portion is simply omitted.
</file context>
| - **No vision model configured** — images remain unsupported; PDFs use `pdfplumber`; videos use audio-only. Existing behavior preserved. | |
| + **No Vision Model configured** (and no multimodal Chat Model fallback) — images remain unsupported; PDFs use `pdfplumber`; videos use audio-only. Existing behavior preserved. |
| - **Scanned / image-heavy PDFs** — Pages rendered and analyzed page-by-page (replaces standard OCR) | ||
| - **Videos** — Visual frames analyzed alongside the audio transcript | ||
| Requires setting a default **Vision Model** in Settings → API Keys → Default Models. See the [Vision Models guide](../5-CONFIGURATION/vision-models.md) for setup, supported models, and cost guidance. Without a Vision Model configured, image-only files remain unsupported and PDFs/videos fall back to the standard text/audio pipeline. |
P2: Docs incorrectly state that vision ingestion requires an explicit Vision Model and that image-only sources remain unsupported otherwise, but the app falls back to the default Chat Model (including multimodal chat models).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/3-USER-GUIDE/adding-sources.md, line 79:
<comment>Docs incorrectly state that vision ingestion requires an explicit Vision Model and that image-only sources remain unsupported otherwise, but the app falls back to the default Chat Model (including multimodal chat models).</comment>
<file context>
@@ -71,10 +71,17 @@ Sources are the raw materials of your research. This guide covers how to add dif
+- **Scanned / image-heavy PDFs** — Pages rendered and analyzed page-by-page (replaces standard OCR)
+- **Videos** — Visual frames analyzed alongside the audio transcript
+
+Requires setting a default **Vision Model** in Settings → API Keys → Default Models. See the [Vision Models guide](../5-CONFIGURATION/vision-models.md) for setup, supported models, and cost guidance. Without a Vision Model configured, image-only files remain unsupported and PDFs/videos fall back to the standard text/audio pipeline.
+
### What Doesn't Work
</file context>
| Requires setting a default **Vision Model** in Settings → API Keys → Default Models. See the [Vision Models guide](../5-CONFIGURATION/vision-models.md) for setup, supported models, and cost guidance. Without a Vision Model configured, image-only files remain unsupported and PDFs/videos fall back to the standard text/audio pipeline. | |
| If no Vision Model is configured, Open Notebook falls back to the default **Chat Model**. When that chat model is multimodal, image-only files, scanned PDFs, and videos are still supported. |
| @@ -9,7 +9,12 @@ | |||
| "lint": "eslint src/", | |||
P3: Development-only CLI tooling was added to dependencies instead of devDependencies, which unnecessarily bloats production installs.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At frontend/package.json, line 45:
<comment>Development-only CLI tooling was added to `dependencies` instead of `devDependencies`, which unnecessarily bloats production installs.</comment>
<file context>
@@ -37,7 +42,9 @@
"class-variance-authority": "^0.7.1",
"clsx": "^2.1.1",
"cmdk": "^1.1.1",
+ "concurrently": "^9.2.1",
"date-fns": "^4.1.0",
+ "dotenv-cli": "^11.0.0",
</file context>
The PDF vision processor samples pages by design (step=2 for 21-100 page docs) to control cost. That's fine for figure/layout understanding but yields a sampled-and-summarized full_text where embeddings built on it can't cite content from pages that were never read, leaving chat-with-source unable to ground answers in real document passages.

When a PDF was just processed via vision, also pull literal text via content-core's existing pdfplumber extractor and use that as the primary `content`. Move the vision output to metadata.visual_analysis so save_source attaches it as a separate "Visual analysis" insight, keeping the figure/table/diagram descriptions visible to the user without polluting full_text.

Scanned PDFs where pdfplumber returns < 200 chars fall through and keep their vision output unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
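The selection rule in this commit message can be sketched as a small pure function. This is an illustrative reading, not the committed code: `merge_pdf_results` and `MIN_TEXT_CHARS` are hypothetical names, the dict shape is simplified, and stripping whitespace before measuring length is an assumption. The literal-text extraction itself is left out.

```python
from typing import Any, Dict, Optional

# Threshold taken from the commit message ("< 200 chars").
MIN_TEXT_CHARS = 200


def merge_pdf_results(vision_text: str, literal_text: Optional[str]) -> Dict[str, Any]:
    """Choose the primary content for a PDF that was processed via vision."""
    # Assumption: whitespace-only output should not count as extracted text.
    literal = (literal_text or "").strip()

    if len(literal) < MIN_TEXT_CHARS:
        # Likely a scanned PDF: keep the vision output as the content.
        return {"content": vision_text, "metadata": {}}

    # Literal text becomes the primary content; the vision output moves
    # to metadata so it can be attached as a separate insight.
    return {"content": literal, "metadata": {"visual_analysis": vision_text}}
```

Keeping the rule in one function makes the threshold easy to test without rendering any PDF pages.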
Very interesting. This is the kind of feature I wanted in open-notebook: use Gemini, Gemma, or Kimi to analyse images in sources and videos and describe them as text, then use a cheaper or smarter text-only model to read that context, OR use purely multimodal models to read the images/frames directly as the source instead of a text description.
a4f7ae8 to 24bcd86
# Conflicts:
#   frontend/package.json
#   open_notebook/graphs/source.py
Description
Adds first-class support for a default vision model, separate from the default chat model. Users can now select a multimodal model (GPT-4o, Claude 3+, Gemini, etc.) in Settings → API Keys → Default Models that will be used for analyzing images, PDF pages, and video frames during source ingestion. If no Vision Model is configured, the system falls back to the default Chat Model — so existing installs with a multimodal chat model get vision support automatically.
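The fallback rule described here can be sketched in a few lines. This is a simplified stand-in, not the project's `ModelManager`: the `DefaultModels` dataclass below carries only two fields, and `resolve_vision_model_id` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DefaultModels:
    # Simplified stand-in; the real class is assumed to have more fields.
    default_chat_model: Optional[str] = None
    default_vision_model: Optional[str] = None


def resolve_vision_model_id(defaults: DefaultModels) -> Optional[str]:
    """Return the vision model id, or the chat model id when none is set."""
    # An empty string is treated the same as None (slot not configured).
    return defaults.default_vision_model or defaults.default_chat_model
```

Under this rule an install that only has a chat model configured still resolves a model for vision work, and whether that model can actually read images depends on the model itself.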
This PR is the Open Notebook–side wiring for two upstream library PRs:
- `chat_complete`/`achat_complete` across all LLM providers via the OpenAI content-array format, with per-provider translation (Anthropic image blocks, Gemini parts, Ollama `images` field, etc.) and new `encode_image_base64` / `image_to_content_part` / `create_image_message` helpers.
- `pdftoppm` (adaptive sampling), and sample video frames with `ffmpeg` (adaptive fps), all analyzed in parallel through Esperanto. Routes PDFs / images / videos to the vision pipeline whenever `vision_provider` and `vision_model` are both set.

What this PR does end-to-end:
- `open_notebook/ai/models.py`: `DefaultModels.default_vision_model` is now a real (non-commented) `Optional[str]` field. `ModelManager` exposes `get_vision_model()` and a `"vision"` case in `get_default_model()`, both falling back to `default_chat_model` when the vision-specific slot is empty.
- `api/models.py`, `api/models_service.py`, `api/routers/models.py`: `DefaultModelsResponse` round-trips `default_vision_model` through both GET and PUT on `/models/defaults`.
- `open_notebook/graphs/source.py`: `content_process` now passes `vision_provider` / `vision_model` / `vision_config` into `content_state` so content-core picks up the configured model + credential. As a drive-by, the existing speech-to-text path now also forwards full credential config (previously only `provider` + `name`), the same pattern as the new vision path.
- `frontend/src/app/(dashboard)/settings/api-keys/page.tsx`: new "Vision Model" selector in the Default Models grid, with translations for en-US, ja-JP, pt-BR, zh-CN, zh-TW. Minor `min-w-0` fix on the selector grid item to keep long labels from overflowing.
- `docs/5-CONFIGURATION/vision-models.md` covering how vision is delivered (links to both upstream PRs), what gets vision-processed (images / PDFs / videos), routing precedence vs. Docling, adaptive sampling math, configuration walkthrough, system binary requirements (`pdftoppm`, `ffmpeg`), fallback behavior, cost guidance, and troubleshooting. Cross-linked from the Configuration index, the Adding Sources user guide (which now lists images as supported when a Vision Model is set), and the AI Providers index.

Related Issue
Fixes #
Type of Change
How Has This Been Tested?
- `uv run pytest`

Test Details:
- `GET /models/defaults`.
- `default_chat_model` in `ModelManager.get_vision_model()` and in `content_process`.
- `vision_provider`, `vision_model`, and `vision_config` (when the model has a credential attached) are populated on `content_state` before `extract_content` runs.

Design Alignment
Explanation:
The Vision Model is exposed through the same `/models/defaults` API surface and `ModelManager` factory as every other default model type, so both UI and direct API consumers can configure it identically. Decoupling vision from chat lets users pick the best multimodal provider for media without forcing it on text-only chat, reinforcing multi-provider flexibility. All provider-specific image translation is handled upstream in Esperanto, keeping Open Notebook's wiring provider-agnostic.

Checklist
Code Quality
Testing
- `make ruff` or `ruff check . --fix`
- `make lint` or `uv run python -m mypy .`

Documentation
- `/docs` (if applicable)

Docs added/updated:
- `docs/5-CONFIGURATION/vision-models.md` (new)
- `docs/5-CONFIGURATION/index.md` (link added)
- `docs/3-USER-GUIDE/adding-sources.md` (images now listed as supported when Vision Model configured)
- `docs/4-AI-PROVIDERS/index.md` (multimodal section links to the new guide)

Database Changes
- `/migrations`
Screenshots (if applicable)
Additional Context
- `pdftoppm` (poppler) and `ffmpeg`/`ffprobe`. The official Docker image already ships these; from-source users on macOS need `brew install poppler ffmpeg`. Image-only ingestion does not require either.
- `default_chat_model` in two places (`ModelManager.get_default_model("vision")` / `get_vision_model()` and `graphs/source.py::content_process`) so existing installs that already have a multimodal chat model configured get vision support automatically without any settings change.
- `gpt-4o-mini`, `claude-3-5-haiku`, or `gemini-2.0-flash` as a default for routine ingestion, reserving flagship vision models for cases where layout / handwriting / detailed diagrams matter.

Pre-Submission Verification