Extract embedded PDF images as separate files with relative Markdown links by haydaramru · Pull Request #2055 · microsoft/markitdown

haydaramru · 2026-06-02T09:39:01Z

Summary

This PR adds an opt-in PDF image extraction workflow for MarkItDown. When enabled, embedded images from PDFs are saved as separate files under an output directory, and the generated Markdown references them using relative links.

This is intended for RAG, document archival, technical documentation, datasheets, schematics, research papers, and other workflows where inline base64 images make Markdown too large or impractical for indexing.

Example:

markitdown document.pdf --extract-images --output-dir ./output -o ./output/document.md

Output:

output/
├── document.md
└── images/
    ├── page1-image1.png
    └── page1-image2.png

Markdown:

![Image 1 on page 1](images/page1-image1.png)

What changed

CLI

Adds two new CLI options:

--extract-images
--output-dir <directory>

--output-dir is required when --extract-images is used. Existing stdout behavior is preserved: if -o is not provided, Markdown is still printed to stdout, while images are written under <output-dir>/images/.
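
The option dependency described above could be enforced roughly as follows. This is a minimal sketch built on the standard argparse module, not the code from this PR: `build_parser` and `parse_args` are hypothetical helper names, and the real MarkItDown parser defines more options than shown here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical, simplified parser: only the options discussed in this PR.
    parser = argparse.ArgumentParser(prog="markitdown")
    parser.add_argument("filename")
    parser.add_argument("-o", "--output", help="write Markdown here instead of stdout")
    parser.add_argument("--extract-images", action="store_true")
    parser.add_argument("--output-dir", help="directory that will contain images/")
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.extract_images and not args.output_dir:
        # parser.error() prints the usage message and exits with status 2.
        parser.error("--output-dir is required when --extract-images is used")
    return args
```

Because `-o` stays optional, a call such as `parse_args(["document.pdf", "--extract-images", "--output-dir", "./output"])` leaves `args.output` as `None`, which matches the "Markdown to stdout, images to disk" case.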

Python API

Adds constructor-level support:

from markitdown import MarkItDown

md = MarkItDown(extract_images=True, output_dir="./output")
result = md.convert("document.pdf")

Also supports per-call overrides:

md = MarkItDown()
result = md.convert(
    "document.pdf",
    extract_images=True,
    output_dir="./output",
)

If image extraction is enabled without an output directory, MarkItDown raises:

ValueError: output_dir is required when extract_images=True
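
How constructor defaults and per-call overrides might combine before that check can be sketched as below. `resolve_image_options` is a hypothetical helper, and treating `None` as "not provided" is an assumption about the override semantics rather than something this PR states.

```python
from typing import Optional


def resolve_image_options(
    default_extract: bool,
    default_output_dir: Optional[str],
    extract_images: Optional[bool] = None,
    output_dir: Optional[str] = None,
) -> tuple[bool, Optional[str]]:
    """Merge constructor defaults with per-call overrides (None means 'not given')."""
    extract = default_extract if extract_images is None else extract_images
    out_dir = default_output_dir if output_dir is None else output_dir
    if extract and not out_dir:
        # Same message as the error quoted in the PR description.
        raise ValueError("output_dir is required when extract_images=True")
    return extract, out_dir
```

With this shape, a per-call `extract_images=False` can switch extraction off for a single conversion even when the constructor enabled it.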

PDF conversion

The PDF converter now has an opt-in image extraction path. Default PDF conversion remains unchanged unless extract_images=True is explicitly set.

The implementation:

Detects images from pdfplumber page image objects.
Saves images under output_dir/images/.
Uses deterministic filenames:
- page1-image1.png
- page1-image2.png
- page3-image1.jpg
Inserts relative Markdown image links near the image's approximate vertical page position.
Preserves the existing fallback behavior for normal text extraction.
Keeps page cleanup behavior by still calling page.close().
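
The filename scheme and the position-based link placement listed above can be illustrated with plain dictionaries. This is a sketch under stated assumptions, not the converter's actual code: `image_filename` and `interleave_page` are hypothetical names, each text line and image is assumed to carry a `top` coordinate (pdfplumber objects expose one), and numbering images from top to bottom is a guess at the ordering rule.

```python
def image_filename(page_number: int, image_index: int, ext: str = "png") -> str:
    # Deterministic name, e.g. page1-image2.png (both numbers are 1-based).
    return f"page{page_number}-image{image_index}.{ext}"


def interleave_page(page_number: int, text_lines: list[dict], images: list[dict]) -> str:
    """Merge text lines and image links by vertical position.

    text_lines: [{"text": str, "top": float}, ...]
    images:     [{"top": float, "ext": str}, ...]
    """
    # Second tuple element breaks ties: text (0) sorts before an image (1).
    items = [(line["top"], 0, line["text"]) for line in text_lines]
    ordered_images = sorted(images, key=lambda image: image["top"])
    for index, image in enumerate(ordered_images, start=1):
        name = image_filename(page_number, index, image.get("ext", "png"))
        link = f"![Image {index} on page {page_number}](images/{name})"
        items.append((image["top"], 1, link))
    items.sort(key=lambda item: (item[0], item[1]))
    return "\n\n".join(text for _, _, text in items)
```

Sorting on `top` alone is only an approximation of reading order; multi-column pages would need a smarter layout pass.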

Image writing behavior

The converter first tries to save the original embedded image stream bytes when they are already in a recognizable image format, including:

PNG
JPEG
GIF
TIFF
JPEG 2000
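
One common way to recognize these formats is to compare the leading bytes of the stream against well-known file signatures. The sketch below shows that technique; it is not necessarily how this PR detects formats, `sniff_image_extension` is a hypothetical name, and the extensions chosen for TIFF and JPEG 2000 are assumptions.

```python
from typing import Optional

# (signature, extension) pairs for the formats listed above.
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "jp2"),  # JPEG 2000 (JP2 container)
    (b"\xff\x4f\xff\x51", "j2k"),  # JPEG 2000 raw codestream
]


def sniff_image_extension(data: bytes) -> Optional[str]:
    """Return a file extension if the bytes start with a known signature."""
    for signature, ext in _SIGNATURES:
        if data.startswith(signature):
            return ext
    # Unknown stream (for example raw or Flate-decoded pixels): caller should fall back.
    return None
```

A `None` result is the case where a fallback path, such as rendering the image region, would take over.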

If direct stream extraction is not possible, it falls back to cropping the image region from the PDF page and rendering it with pdfplumber.

The current fallback render setting is:

cropped_page.to_image(resolution=600)

This improves output sharpness for diagrams and technical images, but produces larger files and may be slower than lower DPI values such as 150 or 300.
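
The size trade-off follows from simple arithmetic: PDF coordinates are measured in points (1/72 inch), so the rendered pixel dimensions grow linearly with the resolution and the pixel count grows with its square. A rough sketch, where `rendered_size` is a hypothetical helper and pdfplumber's exact rounding may differ:

```python
POINTS_PER_INCH = 72  # default PDF user-space unit: 1 point = 1/72 inch


def rendered_size(width_pt: float, height_pt: float, resolution: int) -> tuple[int, int]:
    """Approximate pixel size of a cropped region rendered at the given DPI."""
    width_px = round(width_pt * resolution / POINTS_PER_INCH)
    height_px = round(height_pt * resolution / POINTS_PER_INCH)
    return width_px, height_px


# A 4 x 3 inch image region (288 x 216 points):
high = rendered_size(288, 216, 600)  # (2400, 1800)
low = rendered_size(288, 216, 150)   # (600, 450)
pixel_ratio = (high[0] * high[1]) / (low[0] * low[1])  # 16.0
```

So the same region rendered at 600 DPI holds 16 times as many pixels as at 150 DPI, and 4 times as many as at 300 DPI, before any file compression.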

Why not inline base64?

Before this feature, PDF/image-heavy Markdown workflows often had two poor options:

truncate data URIs, losing the actual image content;
keep full base64 data URIs, creating very large Markdown files.

Both are problematic for RAG and indexing. Extracted image files with relative links keep Markdown lightweight while preserving visual context.
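
To put a number on the size difference: base64 encodes every 3 bytes as 4 characters, so an inlined image costs about a third more than its raw size, while a relative link costs a few dozen characters. The snippet below is only an illustration; `data_uri_length` is a hypothetical helper and the 300,000-byte image is made-up sample data.

```python
import base64


def data_uri_length(image_bytes: bytes, mime: str = "image/png") -> int:
    """Length in characters of a base64 data URI holding the given bytes."""
    payload = base64.b64encode(image_bytes).decode("ascii")
    return len(f"data:{mime};base64,{payload}")


raw = bytes(300_000)  # stand-in for a 300,000-byte embedded image
inline_chars = data_uri_length(raw)  # 400,000 base64 characters plus the URI prefix
link_chars = len("![Image 1 on page 1](images/page1-image1.png)")
```

The link length does not depend on the image size, which is what keeps the Markdown small enough to index.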

Testing

Added focused coverage for image extraction, relative Markdown links, output_dir validation, CLI help/validation, and existing PDF page cleanup behavior. The full hatch test suite and pre-commit checks were also run and passed.

Manual verification

Example command:

mkdir -p ./coba

markitdown \
  packages/markitdown-ocr/tests/ocr_test_data/pdf_multiple_images.pdf \
  --extract-images \
  --output-dir ./coba \
  -o ./coba/document.md

Expected files:

coba/document.md
coba/images/page1-image1.png
coba/images/page1-image2.png

Example Markdown output:

Document with Multiple Images

![Image 1 on page 1](images/page1-image1.png)

Text between first and second image.

![Image 2 on page 1](images/page1-image2.png)

Final text after all images.

Files changed

Core implementation:

packages/markitdown/src/markitdown/__main__.py
packages/markitdown/src/markitdown/_markitdown.py
packages/markitdown/src/markitdown/converters/_pdf_converter.py

Tests:

packages/markitdown/tests/test_cli_misc.py
packages/markitdown/tests/test_pdf_images.py

Future improvements

Possible follow-up work:

Add a configurable render DPI option, for example --image-dpi 300.
Add custom image directory support, for example --image-dir assets/images.
Add metadata output for page number, image dimensions, bbox, and format.
Add caption detection when nearby text appears to be a figure caption.
Add duplicate image detection to avoid writing identical images repeatedly.
Extend similar extraction behavior to DOCX/PPTX/XLSX if useful.

Related issues

This feature addresses the same workflow need described in Issue #2049 and complements prior work around base64 data URI handling and image descriptions for RAG workflows.

haydaramru · 2026-06-02T09:41:48Z

@microsoft-github-policy-service agree

haydaramru added 2 commits June 2, 2026 16:00

feat: extract embedded PDF images

ba73053

feat: extract embedded PDF images

d31d167

haydaramru wants to merge 2 commits into microsoft:main from haydaramru:feat/pdf-extract-images
