Skip to content

feat: vision models#791

Open
kevincolten wants to merge 13 commits into
lfnovo:mainfrom
Notebooker-ai:feat/vision-models
Open

feat: vision models#791
kevincolten wants to merge 13 commits into
lfnovo:mainfrom
Notebooker-ai:feat/vision-models

Conversation

@kevincolten

Copy link
Copy Markdown
Contributor

Description

Adds first-class support for a default vision model, separate from the default chat model. Users can now select a multimodal model (GPT-4o, Claude 3+, Gemini, etc.) in Settings → API Keys → Default Models that will be used for analyzing images, PDF pages, and video frames during source ingestion. If no Vision Model is configured, the system falls back to the default Chat Model — so existing installs with a multimodal chat model get vision support automatically.

This PR is the Open Notebook–side wiring for two upstream library PRs:

  • Esperantolfnovo/esperanto#191: adds multimodal (image) input to chat_complete / achat_complete across all LLM providers via the OpenAI content-array format, with per-provider translation (Anthropic image blocks, Gemini parts, Ollama images field, etc.) and new encode_image_base64 / image_to_content_part / create_image_message helpers.
  • content-corelfnovo/content-core#37: adds vision-model–based extractors that describe images directly, render PDF pages with pdftoppm (adaptive sampling), and sample video frames with ffmpeg (adaptive fps), all analyzed in parallel through Esperanto. Routes PDFs / images / videos to the vision pipeline whenever vision_provider and vision_model are both set.

What this PR does end-to-end:

  • Domain (open_notebook/ai/models.py): DefaultModels.default_vision_model is now a real (non-commented) Optional[str] field. ModelManager exposes get_vision_model() and a "vision" case in get_default_model(), both falling back to default_chat_model when the vision-specific slot is empty.
  • API (api/models.py, api/models_service.py, api/routers/models.py): DefaultModelsResponse round-trips default_vision_model through both GET and PUT on /models/defaults.
  • Source ingestion graph (open_notebook/graphs/source.py): content_process now passes vision_provider / vision_model / vision_config into content_state so content-core picks up the configured model + credential. As a drive-by, the existing speech-to-text path now also forwards full credential config (previously only provider + name) — the same pattern as the new vision path.
  • Frontend (frontend/src/app/(dashboard)/settings/api-keys/page.tsx): new "Vision Model" selector in the Default Models grid, with translations for en-US, ja-JP, pt-BR, zh-CN, zh-TW. Minor min-w-0 fix on the selector grid item to keep long labels from overflowing.
  • Docs: new feature guide at docs/5-CONFIGURATION/vision-models.md covering how vision is delivered (links to both upstream PRs), what gets vision-processed (images / PDFs / videos), routing precedence vs. Docling, adaptive sampling math, configuration walkthrough, system binary requirements (pdftoppm, ffmpeg), fallback behavior, cost guidance, and troubleshooting. Cross-linked from the Configuration index, the Adding Sources user guide (which now lists images as supported when a Vision Model is set), and the AI Providers index.

Related Issue

Fixes #

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

How Has This Been Tested?

  • Tested locally with Docker
  • Tested locally with development setup
  • Added new unit tests
  • Existing tests pass (uv run pytest)
  • Manual testing performed (describe below)

Test Details:

  • Selected a vision-capable model in Settings → API Keys → Default Models and verified it persists via GET /models/defaults.
  • Verified the unset case falls back to default_chat_model in ModelManager.get_vision_model() and in content_process.
  • Ingested a source containing images / video / PDF pages and confirmed vision_provider, vision_model, and vision_config (when the model has a credential attached) are populated on content_state before extract_content runs.
  • Spot-checked the new translation keys render correctly across all five locales.

Design Alignment

  • Privacy First
  • Simplicity Over Features
  • API-First Architecture
  • Multi-Provider Flexibility
  • Extensibility Through Standards
  • Async-First for Performance

Explanation:
The Vision Model is exposed through the same /models/defaults API surface and ModelManager factory as every other default model type, so both UI and direct API consumers can configure it identically. Decoupling vision from chat lets users pick the best multimodal provider for media without forcing it on text-only chat — reinforcing multi-provider flexibility. All provider-specific image translation is handled upstream in Esperanto, keeping Open Notebook's wiring provider-agnostic.

Checklist

Code Quality

  • My code follows PEP 8 style guidelines (Python)
  • My code follows TypeScript best practices (Frontend)
  • I have added type hints to my code (Python)
  • I have added JSDoc comments where appropriate (TypeScript)
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings or errors

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran linting: make ruff or ruff check . --fix
  • I ran type checking: make lint or uv run python -m mypy .

Documentation

  • I have updated the relevant documentation in /docs (if applicable)
  • I have added/updated docstrings for new/modified functions
  • I have updated the API documentation (if API changes were made)
  • I have added comments to complex logic

Docs added/updated:

  • docs/5-CONFIGURATION/vision-models.md (new)
  • docs/5-CONFIGURATION/index.md (link added)
  • docs/3-USER-GUIDE/adding-sources.md (images now listed as supported when Vision Model configured)
  • docs/4-AI-PROVIDERS/index.md (multimodal section links to the new guide)

Database Changes

  • I have created migration scripts for any database schema changes (in /migrations)
  • Migration includes both up and down scripts
  • Migration has been tested locally

No migration required — default_vision_model is added as an Optional[str] field on the existing DefaultModels schemaless record; existing rows simply read None until set.

Breaking Changes

  • This PR includes breaking changes
  • I have documented the migration path for users
  • I have updated MIGRATION.md (if applicable)

Screenshots (if applicable)

Additional Context

  • Upstream dependencies: this PR depends on the multimodal input surface in Esperanto (lfnovo/esperanto#191) and the vision extractors in content-core (lfnovo/content-core#37). Both are companion PRs and need to land / be released before this can be merged on their pinned versions.
  • System binaries: PDF and video vision processing shell out to pdftoppm (poppler) and ffmpeg / ffprobe. The official Docker image already ships these; from-source users on macOS need brew install poppler ffmpeg. Image-only ingestion does not require either.
  • Capability detection: none, by design (matching the upstream Esperanto PR). Selecting a non-multimodal model surfaces the provider's API error verbatim during ingestion.
  • Fallback behavior: the vision model resolution intentionally falls back to default_chat_model in two places (ModelManager.get_default_model("vision") / get_vision_model() and graphs/source.py::content_process) so existing installs that already have a multimodal chat model configured get vision support automatically without any settings change.
  • Cost note: vision input is significantly more expensive than text. The new docs guide recommends gpt-4o-mini, claude-3-5-haiku, or gemini-2.0-flash as a default for routine ingestion, reserving flagship vision models for cases where layout / handwriting / detailed diagrams matter.

Pre-Submission Verification

  • I have read CONTRIBUTING.md
  • I have read DESIGN_PRINCIPLES.md
  • This PR addresses an approved issue that was assigned to me
  • I have not included unrelated changes in this PR
  • My PR title follows conventional commits format (e.g., "feat: add user authentication")

Add concurrently and dotenv-cli to frontend dependencies and introduce convenience npm scripts to run local services: "all", "all:no-db", "db", "api", and "worker". package-lock.json was updated to reflect the installed packages and dependency tree.
Introduce default_vision_model across backend, API, and frontend so vision-capable models can be configured and used. Backend: add field to DefaultModels, persist and return it in ModelsService and API router; add ModelManager.get_vision_model and use vision fallback to default chat model where appropriate. Content processing: include vision model provider/name/config (with credential->esperanto config) for image/video/PDF analysis and improve error logging. Frontend: add vision selector to default model UI, update layout (min-w-0), and add localized labels/descriptions for en/ja/pt-BR/zh-CN/zh-TW. Update TypeScript types to include default_vision_model.
Update frontend/package.json to include the --no-sync flag for the "api" and "worker" npm scripts. Both scripts now invoke `uv` with --no-sync (previously invoked without it), causing uv to run without its sync behavior.
Reduce the primary models grid from 4 to 3 columns at large screens and update advanced models to use responsive 2/3 column breakpoints. Add truncate to SelectTrigger and min-w-0 to advanced item containers to prevent overflowing model names and ensure proper text truncation/overflow handling.
# Conflicts:
#	frontend/src/app/(dashboard)/settings/api-keys/page.tsx
Introduce a new 'Vision Models' configuration guide and wire it into user and provider docs. Adds docs/5-CONFIGURATION/vision-models.md explaining how to configure a default Vision Model, required binaries (pdftoppm, ffmpeg), routing/adaptive-sampling behavior, cost considerations, and troubleshooting. Update Adding Sources and AI Providers pages to reference the new guide and clarify that image/PDF/video visual extraction requires a configured Vision Model (and that pure image files are unsupported without one).

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 issues found across 18 files

Confidence score: 3/5

  • There is a concrete regression risk in open_notebook/graphs/source.py: shared exception handling can cause STT failures to also skip vision model configuration, which may change runtime behavior for source ingestion.
  • The documentation issues in docs/5-CONFIGURATION/vision-models.md and docs/3-USER-GUIDE/adding-sources.md are lower severity but user-facing, and could lead to incorrect setup expectations around Vision Model fallback behavior.
  • Given one medium-severity, high-confidence code-path issue (6/10, confidence 9/10), this looks mergeable with caution rather than a low-risk merge.
  • Pay close attention to open_notebook/graphs/source.py, docs/5-CONFIGURATION/vision-models.md, docs/3-USER-GUIDE/adding-sources.md, and frontend/package.json - exception-path coupling may affect ingestion behavior, docs need alignment on fallback semantics, and dependency scope should avoid production bloat.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="frontend/package.json">

<violation number="1" location="frontend/package.json:45">
P3: Development-only CLI tooling was added to `dependencies` instead of `devDependencies`, which unnecessarily bloats production installs.</violation>
</file>

<file name="docs/5-CONFIGURATION/vision-models.md">

<violation number="1" location="docs/5-CONFIGURATION/vision-models.md:97">
P2: Docs contradict themselves on what happens when Vision Model is unset: fallback to Chat Model vs. always non-vision ingestion.</violation>
</file>

<file name="docs/3-USER-GUIDE/adding-sources.md">

<violation number="1" location="docs/3-USER-GUIDE/adding-sources.md:79">
P2: Docs incorrectly state that vision ingestion requires an explicit Vision Model and that image-only sources remain unsupported otherwise, but the app falls back to the default Chat Model (including multimodal chat models).</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.


## Behavior & Failure Modes

- **No vision model configured** — images remain unsupported; PDFs use `pdfplumber`; videos use audio-only. Existing behavior preserved.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Docs contradict themselves on what happens when Vision Model is unset: fallback to Chat Model vs. always non-vision ingestion.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/5-CONFIGURATION/vision-models.md, line 97:

<comment>Docs contradict themselves on what happens when Vision Model is unset: fallback to Chat Model vs. always non-vision ingestion.</comment>

<file context>
@@ -0,0 +1,134 @@
+
+## Behavior & Failure Modes
+
+- **No vision model configured** — images remain unsupported; PDFs use `pdfplumber`; videos use audio-only. Existing behavior preserved.
+- **Vision model fails on a frame/page** — the processor returns an `ExtractionOutput` with a placeholder message rather than aborting the whole source.
+- **Video audio extraction fails** — the visual analysis still completes; the transcript portion is simply omitted.
</file context>
Suggested change
- **No vision model configured** — images remain unsupported; PDFs use `pdfplumber`; videos use audio-only. Existing behavior preserved.
- **No vision model configured** — images remain unsupported; PDFs use `pdfplumber`; videos use audio-only. Existing behavior preserved.
+ **No Vision Model configured** (and no multimodal Chat Model fallback) — images remain unsupported; PDFs use `pdfplumber`; videos use audio-only. Existing behavior preserved.

- **Scanned / image-heavy PDFs** — Pages rendered and analyzed page-by-page (replaces standard OCR)
- **Videos** — Visual frames analyzed alongside the audio transcript

Requires setting a default **Vision Model** in Settings → API Keys → Default Models. See the [Vision Models guide](../5-CONFIGURATION/vision-models.md) for setup, supported models, and cost guidance. Without a Vision Model configured, image-only files remain unsupported and PDFs/videos fall back to the standard text/audio pipeline.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Docs incorrectly state that vision ingestion requires an explicit Vision Model and that image-only sources remain unsupported otherwise, but the app falls back to the default Chat Model (including multimodal chat models).

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/3-USER-GUIDE/adding-sources.md, line 79:

<comment>Docs incorrectly state that vision ingestion requires an explicit Vision Model and that image-only sources remain unsupported otherwise, but the app falls back to the default Chat Model (including multimodal chat models).</comment>

<file context>
@@ -71,10 +71,17 @@ Sources are the raw materials of your research. This guide covers how to add dif
+- **Scanned / image-heavy PDFs** — Pages rendered and analyzed page-by-page (replaces standard OCR)
+- **Videos** — Visual frames analyzed alongside the audio transcript
+
+Requires setting a default **Vision Model** in Settings → API Keys → Default Models. See the [Vision Models guide](../5-CONFIGURATION/vision-models.md) for setup, supported models, and cost guidance. Without a Vision Model configured, image-only files remain unsupported and PDFs/videos fall back to the standard text/audio pipeline.
+
 ### What Doesn't Work
</file context>
Suggested change
Requires setting a default **Vision Model** in Settings → API Keys → Default Models. See the [Vision Models guide](../5-CONFIGURATION/vision-models.md) for setup, supported models, and cost guidance. Without a Vision Model configured, image-only files remain unsupported and PDFs/videos fall back to the standard text/audio pipeline.
If no Vision Model is configured, Open Notebook falls back to the default **Chat Model**. When that chat model is multimodal, image-only files, scanned PDFs, and videos are still supported.

Comment thread open_notebook/graphs/source.py
Comment thread frontend/package.json
@@ -9,7 +9,12 @@
"lint": "eslint src/",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3: Development-only CLI tooling was added to dependencies instead of devDependencies, which unnecessarily bloats production installs.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At frontend/package.json, line 45:

<comment>Development-only CLI tooling was added to `dependencies` instead of `devDependencies`, which unnecessarily bloats production installs.</comment>

<file context>
@@ -37,7 +42,9 @@
     "class-variance-authority": "^0.7.1",
     "clsx": "^2.1.1",
     "cmdk": "^1.1.1",
+    "concurrently": "^9.2.1",
     "date-fns": "^4.1.0",
+    "dotenv-cli": "^11.0.0",
</file context>

The PDF vision processor samples pages by design (step=2 for 21-100
page docs) to control cost. That's fine for figure/layout understanding
but yields a sampled-and-summarized full_text where embeddings built
on it can't cite content from pages that were never read — leaving
chat-with-source unable to ground answers in real document passages.

When a PDF was just processed via vision, also pull literal text via
content-core's existing pdfplumber extractor and use that as the
primary `content`. Move the vision output to metadata.visual_analysis
so save_source attaches it as a separate "Visual analysis" insight,
keeping the figure/table/diagram descriptions visible to the user
without polluting full_text. Scanned PDFs where pdfplumber returns
< 200 chars fall through and keep their vision output unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Markus328

Copy link
Copy Markdown

Very interesting. This is some kind of feature I wanted in open-notebook. Use Gemini, Gemma, Kimi to analyse images in sources and videos, describe it as text and use some text-only but cheap/smarter model to read the context OR use purely multimodal models to read the image/frames directly as source instead of text description.

@kevincolten kevincolten force-pushed the feat/vision-models branch from a4f7ae8 to 24bcd86 Compare June 3, 2026 23:05
# Conflicts:
#	frontend/package.json
#	open_notebook/graphs/source.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants