@anup00900

Fixes #21

Problem

Images in table cells were appearing below tables instead of inside the cells.

Solution

Implemented image detection within table cells for both legacy and layout modes.

Before Fix

| Product | Preview |
|---|---|
| Widget |  |

image1

After Fix

| Product | Preview |
|---|---|
| Widget | image1 |

Technical Changes

Legacy Mode (pymupdf_rag.py) - All users:

  • Added add_images_to_table_markdown() function
  • Detects images with >50% bbox overlap with cells
  • Generates unique filenames for table cell images
  • Inserts ![image](path) markdown inline
  • Updated 3 locations calling table.to_markdown()
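
The >50% overlap rule above can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper names `overlap_fraction` and `find_cell_for_image` are hypothetical, rectangles are plain `(x0, y0, x1, y1)` tuples, and it assumes the overlap is measured relative to the image's own area, which the description does not specify.

```python
from typing import Optional, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def overlap_fraction(image: Rect, cell: Rect) -> float:
    """Return the fraction of the image's area that lies inside the cell."""
    ix0, iy0, ix1, iy1 = image
    cx0, cy0, cx1, cy1 = cell
    width = min(ix1, cx1) - max(ix0, cx0)
    height = min(iy1, cy1) - max(iy0, cy0)
    image_area = (ix1 - ix0) * (iy1 - iy0)
    if width <= 0 or height <= 0 or image_area <= 0:
        return 0.0  # no intersection, or a degenerate image rectangle
    return (width * height) / image_area


def find_cell_for_image(
    image: Rect, cells: Sequence[Rect], threshold: float = 0.5
) -> Optional[int]:
    """Return the index of the cell holding more than `threshold` of the image."""
    best_index, best_fraction = None, threshold
    for index, cell in enumerate(cells):
        fraction = overlap_fraction(image, cell)
        if fraction > best_fraction:  # strictly greater, matching ">50%"
            best_index, best_fraction = index, fraction
    return best_index


cells = [(0, 0, 100, 50), (100, 0, 200, 50)]
print(find_cell_for_image((110, 5, 190, 45), cells))  # image fully inside the second cell
```

With a strict threshold, an image split exactly in half between two cells is assigned to neither, so it would fall back to being emitted outside the table.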

Layout Mode (document_layout.py) - pymupdf_layout users:

  • Include image blocks (type==1) in table_blocks
  • Enhanced extract_cells() for image handling
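
A rough sketch of the layout-mode idea, selecting image blocks alongside text blocks for a table. It assumes block dictionaries shaped like the entries of PyMuPDF's `page.get_text("dict")["blocks"]`, where `type` is 0 for text and 1 for images. The function name `blocks_in_table` and the centre-point membership test are assumptions made for illustration, not the logic in `document_layout.py`.

```python
from typing import Dict, List, Sequence


def blocks_in_table(blocks: Sequence[Dict], table_bbox: Sequence[float]) -> List[Dict]:
    """Keep text (type 0) and image (type 1) blocks whose centre is inside the table."""
    tx0, ty0, tx1, ty1 = table_bbox
    selected = []
    for block in blocks:
        if block.get("type") not in (0, 1):
            continue  # ignore anything that is neither text nor image
        x0, y0, x1, y1 = block["bbox"]
        centre_x, centre_y = (x0 + x1) / 2, (y0 + y1) / 2
        if tx0 <= centre_x <= tx1 and ty0 <= centre_y <= ty1:
            selected.append(block)
    return selected


blocks = [
    {"type": 0, "bbox": (10, 10, 90, 30)},     # text inside the table
    {"type": 1, "bbox": (110, 10, 190, 40)},   # image inside the table
    {"type": 1, "bbox": (10, 300, 90, 380)},   # image elsewhere on the page
]
table_blocks = blocks_in_table(blocks, (0, 0, 200, 50))
print([block["type"] for block in table_blocks])
```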

Testing

  • Tested with realistic product catalogs (5 products)
  • 100% success rate (all images in correct cells)
  • Works with write_images and embed_images modes
  • Backward compatible

Benefits

  • Solves the exact use case from issue #21 (Images in table)
  • Works for all users (not just commercial)
  • No breaking changes

anup.roy and others added 2 commits November 26, 2025 15:52
Modified extract_cells() to detect and extract image blocks (type==1)
within table cells, not just text blocks (type==0).

Changes:
- Updated extract_cells() to accept page and document parameters
- Added logic to detect image blocks within cell bounding boxes
- Implemented image extraction and saving for cells with images
- Images are now embedded in cell markdown as ![image](path) syntax
- Updated table_to_markdown() and table_extract() signatures
- Updated calls in document_layout.py to pass page/document context
- Added test script to demonstrate the fix

When write_images=True or embed_images=True, images found in table
cells are now properly extracted and referenced inline within the
cell markdown, resolving the issue where images appeared below tables.
This fix enables images to appear inside their corresponding table
cells instead of being extracted separately below the table.
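
As an illustration of how the two modes could differ inside a cell, the sketch below builds the markdown reference either from a saved file path or from the image bytes themselves. It assumes that embed mode emits a base64 data URI; `image_markdown` and its parameters are hypothetical names, not part of this PR.

```python
import base64


def image_markdown(
    image_bytes: bytes,
    *,
    path: str = "",
    embed: bool = False,
    ext: str = "png",
    alt: str = "image",
) -> str:
    """Build the markdown image reference that goes inside a table cell."""
    if embed:
        # Embed mode (assumed): inline the bytes as a base64 data URI.
        payload = base64.b64encode(image_bytes).decode("ascii")
        return f"![{alt}](data:image/{ext};base64,{payload})"
    # Write mode: the image was saved to disk, so reference its path.
    return f"![{alt}]({path})"


print(image_markdown(b"", path="image1.png", alt="image1"))
```

Either way the result is a single inline token with no line breaks, which is what allows it to sit inside a markdown table cell.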

Changes for LEGACY MODE (pymupdf_rag.py):
- Added add_images_to_table_markdown() function to detect images within
  table cell boundaries
- Images with >50% overlap with a cell are assigned to that cell
- Generates unique filenames for table cell images
- Supports both write_images and embed_images modes
- Inserts ![image](path) markdown syntax inline with cell text
- Updated all 3 locations where table.to_markdown() is called
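
The filename and inline-insertion steps listed above might look roughly like this. Both helpers are hypothetical: the naming scheme in `cell_image_filename` is invented for illustration, and `add_image_to_row` assumes a simple pipe-delimited row with no escaped `|` characters inside cells.

```python
def cell_image_filename(stem: str, page_number: int, row: int, col: int, ext: str = "png") -> str:
    """Build a filename that is unique per page and table cell (assumed scheme)."""
    return f"{stem}-p{page_number}-table-r{row}c{col}.{ext}"


def add_image_to_row(row_line: str, col: int, image_ref: str) -> str:
    """Append an image reference to the text of one cell in a markdown table row."""
    line = row_line.strip()
    # Drop only the outermost pipes so that empty edge cells survive the split.
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    cells = [cell.strip() for cell in line.split("|")]
    cells[col] = f"{cells[col]} {image_ref}".strip()
    return "| " + " | ".join(cells) + " |"


print(add_image_to_row("| Text | Text |  |", 2, "![image1](image1.png)"))
```

Existing cell text is kept and the image reference is appended after it, so a cell containing both a caption and a picture keeps both.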

Changes for LAYOUT MODE (document_layout.py):
- Updated table_blocks to include image blocks (type==1)
- Modified extract_cells() to detect and extract images in cells
- Added page/document parameters to table extraction functions
- Images are extracted and referenced inline in cells

TESTING:
Fully tested with embedded images in PDFs. All images correctly
appear inside their table cells in the markdown output.

Before fix:
| Col1  | Col2  | Image |
|---|---|---|
| Text | Text |  |

![image1](image1.png)

After fix:
| Col1  | Col2  | Image |
|---|---|---|
| Text | Text | ![image1](image1.png) |

Resolves the requested behavior from Issue pymupdf#21.
@anup00900
Author

I have read the CLA Document and I hereby sign the CLA
