Extract embedded PDF images as separate files with relative Markdown links #2055

Open
haydaramru wants to merge 2 commits into microsoft:main from haydaramru:feat/pdf-extract-images

@haydaramru

Summary

This PR adds an opt-in PDF image extraction workflow for MarkItDown. When enabled, embedded images from PDFs are saved as separate files under an output directory, and the generated Markdown references them using relative links.

This is intended for RAG, document archival, technical documentation, datasheets, schematics, research papers, and other workflows where inline base64 images make Markdown too large or impractical for indexing.

Example:

markitdown document.pdf --extract-images --output-dir ./output -o ./output/document.md

Output:

output/
├── document.md
└── images/
    ├── page1-image1.png
    └── page1-image2.png

Markdown:

![Image 1 on page 1](images/page1-image1.png)

What changed

CLI

Adds two new CLI options:

--extract-images
--output-dir <directory>

--output-dir is required when --extract-images is used. Existing stdout behavior is preserved: if -o is not provided, Markdown is still printed to stdout, while images are written under --output-dir/images/.
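
The flag dependency above can be sketched with argparse. This is a minimal illustration, not the code in `__main__.py`: `build_parser` and `parse_cli` are hypothetical names, and only the flag names are taken from this description.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the flag names come from this PR description.
    parser = argparse.ArgumentParser(prog="markitdown")
    parser.add_argument("filename")
    parser.add_argument("-o", dest="output", default=None)
    parser.add_argument("--extract-images", action="store_true")
    parser.add_argument("--output-dir", default=None)
    return parser


def parse_cli(argv: list[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.extract_images and args.output_dir is None:
        # parser.error prints the usage text and exits with status 2.
        parser.error("--output-dir is required when --extract-images is used")
    return args
```

Validating after `parse_args` keeps `-o` independent of `--output-dir`, so Markdown can still go to stdout while images are written to disk.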

Python API

Adds constructor-level support:

from markitdown import MarkItDown

md = MarkItDown(extract_images=True, output_dir="./output")
result = md.convert("document.pdf")

Also supports per-call overrides:

md = MarkItDown()
result = md.convert(
    "document.pdf",
    extract_images=True,
    output_dir="./output",
)

If image extraction is enabled without an output directory, MarkItDown raises:

ValueError: output_dir is required when extract_images=True
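
One way the constructor defaults and per-call overrides could be combined before validation is sketched below. `resolve_image_options` is a hypothetical helper written for illustration; the actual logic in `_markitdown.py` may be structured differently.

```python
from typing import Optional, Tuple


def resolve_image_options(
    default_extract: bool = False,
    default_output_dir: Optional[str] = None,
    extract_images: Optional[bool] = None,
    output_dir: Optional[str] = None,
) -> Tuple[bool, Optional[str]]:
    """Merge constructor defaults with per-call overrides (hypothetical helper)."""
    # A per-call value of None means "not provided", so the default applies.
    extract = default_extract if extract_images is None else extract_images
    out_dir = default_output_dir if output_dir is None else output_dir
    if extract and not out_dir:
        raise ValueError("output_dir is required when extract_images=True")
    return extract, out_dir
```

Under this sketch, a per-call `extract_images=False` disables extraction even when the constructor enabled it, because only `None` falls back to the default.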

PDF conversion

The PDF converter now has an opt-in image extraction path. Default PDF conversion remains unchanged unless extract_images=True is explicitly set.

The implementation:

  • Detects images from pdfplumber page image objects.
  • Saves images under output_dir/images/.
  • Uses deterministic filenames:
    • page1-image1.png
    • page1-image2.png
    • page3-image1.jpg
  • Inserts relative Markdown image links near the image's approximate vertical page position.
  • Preserves the existing fallback behavior for normal text extraction.
  • Keeps page cleanup behavior by still calling page.close().
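
The naming and placement steps can be sketched as follows. This is a simplified model, not the converter's code: it assumes each text line and each image carries a `top` coordinate measured from the top of the page (as pdfplumber objects do), and `image_filename` and `interleave` are hypothetical helpers.

```python
from typing import List, Tuple


def image_filename(page_number: int, image_index: int, ext: str = "png") -> str:
    # Both numbers are 1-based, matching names such as page1-image2.png.
    return f"page{page_number}-image{image_index}.{ext}"


def interleave(
    page_number: int,
    text_lines: List[Tuple[float, str]],
    image_tops: List[float],
) -> str:
    """Merge text lines and image links by their distance from the page top."""
    items = [(top, 0, text) for top, text in text_lines]
    # Number images top to bottom so the filenames are deterministic.
    for index, top in enumerate(sorted(image_tops), start=1):
        name = image_filename(page_number, index)
        link = f"![Image {index} on page {page_number}](images/{name})"
        items.append((top, 1, link))
    # Sort by vertical position; on ties, text comes before the image link.
    items.sort(key=lambda item: (item[0], item[1]))
    return "\n\n".join(content for _, _, content in items)
```

Sorting by a single vertical coordinate is why the placement is only approximate: images that sit beside text in a multi-column layout end up above or below it in the Markdown.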

Image writing behavior

The converter first tries to save original embedded image stream bytes when they are already recognizable image formats, including:

  • PNG
  • JPEG
  • GIF
  • TIFF
  • JPEG 2000

If direct stream extraction is not possible, it falls back to rendering/cropping the image region from the PDF page using pdfplumber.
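
Recognizing a stream as an existing image format usually comes down to checking its leading bytes. The sketch below shows that idea for the formats listed above; `sniff_image_extension` and the chosen extensions are illustrative assumptions, not the converter's actual function.

```python
from typing import Optional

# Leading byte signatures for the formats listed above.
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "jp2"),  # JP2 container
    (b"\xff\x4f\xff\x51", "j2k"),  # raw JPEG 2000 codestream
]


def sniff_image_extension(data: bytes) -> Optional[str]:
    """Return a file extension if the bytes start with a known signature."""
    for signature, ext in _SIGNATURES:
        if data.startswith(signature):
            return ext
    # Unknown bytes (for example raw pixel data): the caller would fall
    # back to rendering the image region instead.
    return None
```

Returning `None` rather than raising lets the caller treat "not a recognizable image" as the normal trigger for the render/crop fallback.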

The current fallback render setting is:

cropped_page.to_image(resolution=600)

This improves output sharpness for diagrams and technical images, but produces larger files and may be slower than lower DPI values such as 150 or 300.
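
To make the trade-off concrete, the sketch below estimates rendered pixel dimensions for a cropped region, assuming the resolution value is pixels per inch and the page uses the default 72 points per inch. `rendered_pixels` and the 216 x 144 pt region are made up for illustration.

```python
def rendered_pixels(width_pt: float, height_pt: float, resolution: int) -> tuple[int, int]:
    # PDF user space uses 72 points per inch by default.
    scale = resolution / 72
    return round(width_pt * scale), round(height_pt * scale)


# A hypothetical 216 x 144 pt region (3 x 2 inches):
for dpi in (150, 300, 600):
    w, h = rendered_pixels(216, 144, dpi)
    print(f"{dpi} DPI -> {w} x {h} px ({w * h:,} pixels)")
```

Pixel count grows with the square of the resolution, so a 600 DPI render has 4 times the pixels of a 300 DPI render and 16 times those of a 150 DPI render.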

Why not inline base64?

Before this feature, PDF/image-heavy Markdown workflows often had two poor options:

  • truncate data URIs, losing the actual image content;
  • keep full base64 data URIs, creating very large Markdown files.

Both are problematic for RAG and indexing. Extracted image files with relative links keep Markdown lightweight while preserving visual context.
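
As a rough illustration of the size difference, base64 encodes every 3 bytes as 4 characters, so an inline data URI is about a third larger than the image itself. The sketch below compares that with a relative link; the 300 kB payload is a stand-in, not a measurement from a real PDF.

```python
import base64


def data_uri_length(image_bytes: bytes, mime: str = "image/png") -> int:
    """Length of a Markdown image tag that inlines the bytes as a data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return len(f"![image](data:{mime};base64,{encoded})")


payload = bytes(300_000)  # stand-in for a 300 kB embedded image
inline_chars = data_uri_length(payload)
linked_chars = len("![Image 1 on page 1](images/page1-image1.png)")
print(f"inline: {inline_chars:,} characters, linked: {linked_chars} characters")
```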

Testing

Added focused coverage for image extraction, relative Markdown links, output_dir validation, CLI help/validation, and existing PDF page cleanup behavior. The full hatch test suite and pre-commit checks were also run and passed.

Manual verification

Example command:

mkdir -p ./coba

markitdown \
  packages/markitdown-ocr/tests/ocr_test_data/pdf_multiple_images.pdf \
  --extract-images \
  --output-dir ./coba \
  -o ./coba/document.md

Expected files:

coba/document.md
coba/images/page1-image1.png
coba/images/page1-image2.png

Example Markdown output:

Document with Multiple Images

![Image 1 on page 1](images/page1-image1.png)

Text between first and second image.

![Image 2 on page 1](images/page1-image2.png)

Final text after all images.
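
The manual check can also be scripted. The sketch below scans a generated Markdown file for image links and reports any relative target that does not exist on disk; `missing_image_targets` is a hypothetical helper, not part of this PR, and the regex only covers simple `![alt](path)` links.

```python
import re
from pathlib import Path
from typing import List

# Matches ![alt](target) and captures the target up to the first space or ")".
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def missing_image_targets(markdown_path: Path) -> List[str]:
    """Return relative image targets in the Markdown file that do not exist."""
    text = markdown_path.read_text(encoding="utf-8")
    base = markdown_path.parent
    return [
        target
        for target in IMAGE_LINK.findall(text)
        # Skip remote and inline images; only relative files are checked.
        if not target.startswith(("http://", "https://", "data:"))
        and not (base / target).is_file()
    ]
```

For the example above, `missing_image_targets(Path("coba/document.md"))` should return an empty list when both image files were written.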

Files changed

Core implementation:

  • packages/markitdown/src/markitdown/__main__.py
  • packages/markitdown/src/markitdown/_markitdown.py
  • packages/markitdown/src/markitdown/converters/_pdf_converter.py

Tests:

  • packages/markitdown/tests/test_cli_misc.py
  • packages/markitdown/tests/test_pdf_images.py

Future improvements

Possible follow-up work:

  • Add a configurable render DPI option, for example --image-dpi 300.
  • Add custom image directory support, for example --image-dir assets/images.
  • Add metadata output for page number, image dimensions, bbox, and format.
  • Add caption detection when nearby text appears to be a figure caption.
  • Add duplicate image detection to avoid writing identical images repeatedly.
  • Extend similar extraction behavior to DOCX/PPTX/XLSX if useful.

Related issues

This feature addresses the same workflow need described in Issue #2049 and complements prior work around base64 data URI handling and image descriptions for RAG workflows.

@haydaramru
Author

@microsoft-github-policy-service agree
