Extracted image from pdf is completely black #1407

YashMistry349 · 2021-11-16T09:23:53Z

YashMistry349
Nov 16, 2021

I am working on image extraction from PDF. The library detects the image on the PDF page correctly, but when I save or display it, I get a completely black image.
For your reference, I am attaching a file that contains the image byte stream extracted from the PDF.

byte_stream.txt

JorjMcKie · 2021-11-16T09:29:05Z

JorjMcKie
Nov 16, 2021
Maintainer

The attachment doesn't help - please provide the document and the code you used for extraction.

0 replies

YashMistry349 · 2021-11-16T09:51:54Z

YashMistry349
Nov 16, 2021
Author

Test.pdf
Code:

import io
import fitz
from PIL import Image
path = 'Test.pdf'
doc = fitz.open(path, filetype="pdf")

page_count = doc.page_count
if page_count:
    for page_no in range(page_count):
        blocks = doc[page_no].getText('dict')['blocks']
        for ind, block in enumerate(blocks):
            if block['type'] == 1:
                try:
                    image = Image.open(io.BytesIO(block['image']))
                    image.save(open(f"test.{block['ext']}", "wb"))
                except Exception as e:
                    print(e)

0 replies

JorjMcKie · 2021-11-16T10:30:11Z

JorjMcKie
Nov 16, 2021
Maintainer

Your way of image extraction is unable to deal with images having an image mask.
Your PDF however has 2 images, each with an image mask:

>>> from pprint import pprint
>>> 
>>> pprint(page.get_images(True))
[(19, 25, 419, 64, 8, 'DeviceRGB', '', 'Img1', 'FlateDecode', 0),
 (20, 26, 419, 64, 8, 'DeviceRGB', '', 'Img10', 'FlateDecode', 0)]
>>>

To extract such images, special code is needed, e.g. for the first one (xref 19, mask xref 25):

pix19 = fitz.Pixmap(doc, 19)
mask = fitz.Pixmap(doc, 25)
pix = fitz.Pixmap(pix19, mask)
pix.save("test.png")  # fully recovered image
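
The same steps can be applied to every masked image in a document. What follows is a minimal sketch, not part of the original answer: it assumes PyMuPDF v1.19.0 or later, takes the mask xref from the second item of each `get_images` entry (0 when the image has no mask), and uses hypothetical helper names and output file names.

```python
def masked_xrefs(image_list):
    """Return (xref, smask) pairs for entries that reference a mask.

    Each entry from page.get_images(full=True) starts with the image xref,
    followed by the xref of its mask (0 if the image has none).
    """
    return [(item[0], item[1]) for item in image_list if item[1] > 0]


def save_masked_images(pdf_path):
    """Recombine each masked image with its mask and save it as a PNG."""
    import fitz  # PyMuPDF; imported here so masked_xrefs() works without it

    doc = fitz.open(pdf_path)
    saved = []
    for page in doc:
        for xref, smask in masked_xrefs(page.get_images(full=True)):
            base = fitz.Pixmap(doc, xref)   # base image without its mask
            mask = fitz.Pixmap(doc, smask)  # the mask image
            pix = fitz.Pixmap(base, mask)   # needs PyMuPDF >= 1.19.0
            name = f"img-{xref}.png"        # hypothetical output name
            pix.save(name)
            saved.append(name)
    return saved
```

With the image list shown above, `masked_xrefs` would yield the pairs (19, 25) and (20, 26). An image used on several pages is simply written more than once under the same name.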

1 reply

YashMistry349 Nov 17, 2021
Author

Hey, thank you for the code snippet, but I am getting an error at pix = fitz.Pixmap(pix19, mask).

Full code:

>>> import fitz
>>> path = './Test.pdf'
>>> doc = fitz.open(path, filetype='pdf')
>>> from pprint import pprint
>>>
>>> for page in doc:
...     pprint(page.get_images(True))
...
[(19, 25, 419, 64, 8, 'DeviceRGB', '', 'Img1', 'FlateDecode', 0),
 (20, 26, 419, 64, 8, 'DeviceRGB', '', 'Img10', 'FlateDecode', 0)]
>>> pix19 = fitz.Pixmap(doc, 19)
>>> mask = fitz.Pixmap(doc, 25)
>>> pix = fitz.Pixmap(pix19, mask)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/yash/git/virtual_enviroments/test/lib/python3.8/site-packages/fitz/fitz.py", line 6467, in __init__
    _fitz.Pixmap_swiginit(self, _fitz.new_Pixmap(*args))
TypeError: Wrong number or type of arguments for overloaded function 'new_Pixmap'.
  Possible C/C++ prototypes are:
    Pixmap::Pixmap(struct Colorspace *,PyObject *,int)
    Pixmap::Pixmap(struct Colorspace *,struct Pixmap *)
    Pixmap::Pixmap(struct Pixmap *,float,float,PyObject *)
    Pixmap::Pixmap(struct Pixmap *,int)
    Pixmap::Pixmap(struct Colorspace *,int,int,PyObject *,int)
    Pixmap::Pixmap(PyObject *)
    Pixmap::Pixmap(struct Document *,int)

What am I doing wrong?

System Specification:
Ubuntu 20.04.3 LTS
Python 3.8.10
PyMuPDF 1.18.17

JorjMcKie · 2021-11-17T06:32:13Z

JorjMcKie
Nov 17, 2021
Maintainer

Sorry, I forgot to mention that you need to upgrade to v1.19.x for this to work.
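
One way to catch this before calling the constructor is to compare the installed version. This is a small sketch with a hypothetical helper name; it assumes the binding version is available as a dotted string in `fitz.VersionBind` (for example "1.18.17").

```python
def supports_mask_pixmap(version):
    """Return True if a PyMuPDF version string is 1.19 or later."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (1, 19)


# Usage with PyMuPDF installed (assumes fitz.VersionBind is a dotted string):
#   import fitz
#   if not supports_mask_pixmap(fitz.VersionBind):
#       raise RuntimeError("fitz.Pixmap(pixmap, mask) needs PyMuPDF 1.19+")
```

Only the major and minor parts are compared, so suffixes in the patch part do not matter.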

4 replies

SummerXXXX Dec 26, 2021

Hi, I'm facing the same issue. I am trying to extract an image from a PDF file, but the image has a black background.
I tried this code:

page.get_images(True)

which returns, for one of the images:
[(1088, 6181, 1010, 485, 8, 'DeviceRGB', '', 'Image158', 'FlateDecode', 0)

Then I copied your example:
pix1088 = fitz.Pixmap(doc, 1088)
print(pix1088.alpha)
mask = fitz.Pixmap(doc, 6181)
print(mask.alpha)
pix = fitz.Pixmap(pix1088, mask)
pix1088.save("test1.png")

but executing fitz.Pixmap(pix1088, mask) raises an error:
RuntimeError: color pixmap must not have an alpha channel

I found that this is because pix1088 contains transparency information: pix1088.alpha = 1 and mask.alpha = 0.

Could you please help me extract the image in this case so that there is no black background? Thanks.

my environment:
windows 10
python 3.9
pymupdf 1.19.3

JorjMcKie Dec 26, 2021
Maintainer

looks awkward, let me have your file / page please

SummerXXXX Dec 26, 2021

test.pdf

the image is in page 4

SummerXXXX Dec 26, 2021

here is my test code:

import fitz
from pprint import pprint

doc = fitz.open("test.pdf")
page = doc.load_page(3)
pprint(page.get_images(True))
pix1088 = fitz.Pixmap(doc, 1088)
print(pix1088.alpha)
mask = fitz.Pixmap(doc, 6181)
print(mask.alpha)
pix = fitz.Pixmap(pix1088, mask)
pix1088.save("test1.png")

JorjMcKie · 2021-12-26T07:55:41Z

JorjMcKie
Dec 26, 2021
Maintainer

Thanks for the file.
I have looked into it: this case is unsupported by (Py-) MuPDF, sorry. If you look at the mask xref, you will find the key /Matte, which means that a special color premultiplication must take place with the entries of this parameter.
This does not work currently.
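
For context, here is a sketch of what a /Matte entry implies. It is based on the soft-mask image rules in the PDF specification, not on PyMuPDF code: a stored color component c' relates to the original c as c' = m + alpha * (c - m), where m is the matte component and alpha the mask value. The helper name and the 8-bit scaling are assumptions made for illustration.

```python
def unmatte(sample, alpha, matte):
    """Undo matte preblending for one 8-bit color component.

    sample: stored component c' (0..255)
    alpha:  mask value for the same pixel (0..255)
    matte:  matte component m, scaled to 0..255
    """
    if alpha == 0:
        # Fully transparent: the original color cannot be recovered,
        # so the matte value is returned as a placeholder.
        return matte
    a = alpha / 255.0
    value = matte + (sample - matte) / a  # c = m + (c' - m) / alpha
    return max(0, min(255, round(value)))
```

For example, with a black matte (0) a stored value of 50 at 20% opacity (mask value 51) corresponds to an original component of 250.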

0 replies

JorjMcKie · 2021-12-26T09:39:22Z

JorjMcKie
Dec 26, 2021
Maintainer

The following may give you a somewhat better result:

pix = fitz.Pixmap(doc, 1088)
mask = fitz.Pixmap(doc, 6181)
pix.set_alpha(mask.samples)

1 reply

JorjMcKie Dec 26, 2021
Maintainer

The Pixmap.set_alpha() method does the same (or similar) thing as the approach that you used. The difference is that it requires a pixmap with an alpha channel, so it is appropriate in your situation.
Method .set_alpha() is my own making, so there may be a way to build logic that can cope with masks having a /Matte definition ...

JorjMcKie · 2021-12-26T21:38:15Z

JorjMcKie
Dec 26, 2021
Maintainer

@SummerXXXX - in the meantime I also tested yet another approach:
The only problem in your case is that the base image has an alpha channel. This prevents applying the mask directly.
But if we first remove that alpha channel, then the method does work with the thus modified base image.
So if you do the following then everything works fine:

pix1088 = fitz.Pixmap(doc,1088)
mask = fitz.Pixmap(doc, 6181)
if pix1088.alpha:
    temp = fitz.Pixmap(pix1088, 0)  # make temp pixmap w/o the alpha
    pix1088 = None  # release storage
    pix1088 = temp
pix = fitz.Pixmap(pix1088, mask)  # now compose final pixmap
pix.save("image1088.png")

This method works with the example file, because all the /Matte (background color) keys have the value [0 0 0], which has zero effect: a normal premultiply will work in this case.

For the next version, I plan a modification which hopefully covers more of these cases.
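
Whether a given mask falls into this harmless category could be checked before composing the pixmaps. The sketch below is an illustration with hypothetical helper names; it assumes `Document.xref_get_key` returns a `(type, value)` pair, with `("null", "null")` for a missing key and an array rendered as a string such as "[0 0 0]".

```python
def matte_is_zero(matte_value):
    """Return True if a /Matte array string such as "[0 0 0]" is all zeros."""
    numbers = matte_value.strip("[] \n").split()
    return bool(numbers) and all(float(n) == 0.0 for n in numbers)


def mask_is_safe_to_apply(doc, smask):
    """Return True if the mask at xref `smask` has no /Matte or an all-zero one.

    `doc` is an open PyMuPDF document.
    """
    kind, value = doc.xref_get_key(smask, "Matte")
    if kind == "null":
        return True  # no /Matte entry at all
    if kind == "array":
        return matte_is_zero(value)
    return False  # e.g. an indirect reference; not resolved in this sketch
```

For the example file, `mask_is_safe_to_apply(doc, 6181)` would be expected to return True if the key is stored as described above.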

1 reply

SummerXXXX Dec 27, 2021

thanks for your help, I will try this

Shilpi261985 · 2025-07-22T12:19:12Z

Shilpi261985
Jul 22, 2025

I am also having the same issue with my code: the images extracted from the PDF have a black background, but I need them to look as they do in the PDF. Code used:

from io import BytesIO

import fitz
import numpy as np
from PIL import Image, ImageStat


def extract_and_save(input_pdf_path, output_pdf_path):
    doc = fitz.open(input_pdf_path)
    image_list = []

    for page_num in range(len(doc)):
        page = doc[page_num]
        images = page.get_images(full=True)

        print(f"Page {page_num + 1}: Found {len(images)} images")

        for img_idx, img in enumerate(images):
            try:
                xref = img[0]
                # Get the image XObject dictionary
                img_dict = doc.xref_object(xref, compressed=True)

                pix = fitz.Pixmap(doc, xref)

                # Try to get the soft mask (transparency mask)
                smask_xref = None
                for line in img_dict.splitlines():
                    if "/SMask" in line:
                        # Extract the xref number after /SMask
                        # Example line: "/SMask 123 0 R"
                        parts = line.strip().split()
                        if len(parts) >= 2 and parts[0] == "/SMask":
                            try:
                                smask_xref = int(parts[1])
                                break
                            except ValueError:
                                pass

                if smask_xref:
                    # Extract main image
                    img_pix = fitz.Pixmap(doc, xref)
                    img_np = np.frombuffer(img_pix.samples, dtype=np.uint8)
                    img_np = img_np.reshape((img_pix.height, img_pix.width, img_pix.n))
                    img_pix = None

                    # Extract mask image
                    mask_pix = fitz.Pixmap(doc, smask_xref)
                    mask_np = np.frombuffer(mask_pix.samples, dtype=np.uint8)
                    mask_np = mask_np.reshape((mask_pix.height, mask_pix.width))
                    mask_pix = None

                    # Combine image + alpha mask into RGBA
                    if img_np.shape[2] == 3:
                        rgba_np = np.dstack((img_np, mask_np))
                    else:
                        rgba_np = img_np  # fallback

                    pil_img = Image.fromarray(rgba_np, mode="RGBA")

                else:
                    # Convert CMYK or grayscale to RGB if needed
                    if pix.n >= 4 or pix.alpha or pix.colorspace != fitz.csRGB:
                        pix = fitz.Pixmap(fitz.csRGB, pix)

                    img_bytes = pix.tobytes("png")
                    #pix = None

                    pil_img = Image.open(BytesIO(img_bytes))

                # Flatten alpha if present
                if pil_img.mode in ("RGBA", "LA"):
                    background = Image.new("RGB", pil_img.size, (255, 255, 255))
                    background.paste(pil_img, mask=pil_img.getchannel("A"))  # Use the last channel as alpha mask
                    pil_img = background
                else:
                    pil_img = pil_img.convert("RGB")

                # Skip small images
                if pil_img.width < 300 or pil_img.height < 300:
                    continue

                # Skip near blank images
                stat = ImageStat.Stat(pil_img)
                if max(stat.stddev) < 1.0:
                    continue

                image_list.append(pil_img)

            except Exception as e:
                print(f"Error processing image {img_idx + 1} on page {page_num + 1}: {e}")

    if image_list:
        image_list[0].save(
            output_pdf_path,
            save_all=True,
            append_images=image_list[1:],
            resolution=100.0
        )
        print(f"Saved {len(image_list)} images into '{output_pdf_path}'")
    else:
        print("No valid images found to save.")

I would appreciate an answer as quickly as possible.
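
One thing that may be worth checking in the code above (an observation, not something confirmed in this thread): `xref_object(xref, compressed=True)` is documented to return the object definition without line breaks or extra spaces, so a search for a line whose first token is exactly `/SMask` may never match, and every image would then take the branch without a mask. The sketch below, with a hypothetical helper name, finds the reference regardless of layout. Alternatively, the second item of each `get_images` entry (`img[1]`) should carry the same xref without any parsing.

```python
import re

# Matches an indirect reference such as "/SMask 25 0 R" anywhere in the text.
SMASK_RE = re.compile(r"/SMask\s+(\d+)\s+\d+\s+R")


def smask_from_object(obj_source):
    """Return the xref after /SMask in an object definition, or 0 if absent."""
    match = SMASK_RE.search(obj_source)
    return int(match.group(1)) if match else 0
```

A returned 0 means no soft mask was found, which matches the convention `get_images` uses for images without a mask.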

0 replies
