Not able to fetch all text data & Not able to extract text, table data in proper format

Hi Team,

I am using Layout Parser and Detectron2 to detect everything except figures (text, titles, lists, and tables) in a PDF, which I first convert to images using pdf2image. I then want to write the detected text, titles, tables, and lists to a .txt file.

**Issues:**
1) The model does not seem to detect all of the text on the page.
2) When writing the extracted data to the .txt file:
      a) I am not able to print the text in the same sequence as it appears in the PDF
      b) I am not able to extract table data in tabular format

Can you please suggest how I can resolve the above issues? Thank you!
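
To make issue 2a concrete, here is a minimal sketch of the ordering I expect for a two-column page: left column top to bottom, then right column top to bottom. It uses plain hypothetical `(x1, y1, x2, y2)` tuples rather than real Layout Parser blocks, and `reading_order` is an illustrative helper name, not a library function. The split point is the same `w / 2 * 1.05` used in my code below.

```python
from typing import List, Tuple

# Hypothetical block coordinates: (x1, y1, x2, y2)
Box = Tuple[float, float, float, float]


def reading_order(boxes: List[Box], page_width: float) -> List[Box]:
    """Order boxes for a two-column page: left column first, then right."""
    split = page_width / 2 * 1.05  # same split point as in the code below
    # Assign each box to a column by the x position of its center
    left = [b for b in boxes if (b[0] + b[2]) / 2 <= split]
    right = [b for b in boxes if (b[0] + b[2]) / 2 > split]
    # sorted() returns a new list, so the result has to be kept
    return sorted(left, key=lambda b: b[1]) + sorted(right, key=lambda b: b[1])


boxes = [
    (600, 50, 900, 100),   # right column, top
    (50, 300, 400, 350),   # left column, lower
    (50, 40, 400, 90),     # left column, top
    (600, 400, 900, 450),  # right column, lower
]
ordered = reading_order(boxes, page_width=1000)
for box in ordered:
    print(box)
```

This only covers a clean two-column layout; pages with full-width titles or tables would need extra handling.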

**Code:**
**Install necessary libraries**
#install detectron2:
!pip install 'git+https://github.com/facebookresearch/detectron2.git@v0.4#egg=detectron2'
#install layoutparser
!pip install layoutparser
!pip install layoutparser[ocr]
#install opencv, numpy, matplotlib
!pip install opencv-python numpy matplotlib
!pip3 install pdf2image
!sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev
!apt-get install poppler-utils
!pip install --upgrade google-cloud-vision
!pip uninstall google-cloud-vision
!pip install google-cloud-vision
!apt install tesseract-ocr
!apt install libtesseract-dev
!pip install pytesseract

import os
import shutil

import cv2
import numpy as np
import layoutparser as lp
from pdf2image import convert_from_path

# Path to the input PDF
pdf_file = '7050X_Q_A.pdf'
# Name of the output text file
output_file = 'output.txt'

# Load the model once; rebuilding it for every page is slow and changes nothing
model3 = lp.models.Detectron2LayoutModel(
    'lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
)

with open(output_file, 'w', encoding='utf-8') as f:
    for i, page_img in enumerate(convert_from_path(pdf_file)):
        img = np.asarray(page_img)

        layout_result3 = model3.detect(img)

        text_blocks = lp.Layout([b for b in layout_result3 if b.type != "Figure"])

        h, w = img.shape[:2]

        left_interval = lp.Interval(0, w / 2 * 1.05, axis='x').put_on_canvas(img)

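        left_blocks = text_blocks.filter_by(left_interval, center=True)
        # Layout.sort returns a new Layout unless inplace=True,
        # so without it the sorted result is discarded
        left_blocks.sort(key=lambda b: b.coordinates[1], inplace=True)

        # right_blocks is a plain list, so list.sort already sorts in place
        right_blocks = [b for b in text_blocks if b not in left_blocks]
        right_blocks.sort(key=lambda b: b.coordinates[1])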

        text_blocks = lp.Layout([b.set(id=idx) for idx, b in enumerate(left_blocks + right_blocks)])
        viz = lp.draw_box(img, text_blocks, box_width=10, show_element_id=True)
        display(viz)

        ocr_agent = lp.TesseractAgent(languages='eng')
        for block in text_blocks:
            # Pad each block slightly so characters at the edges are not cut off
            segment_image = (block
                             .pad(left=5, right=5, top=5, bottom=5)
                             .crop_image(img))

            text = ocr_agent.detect(segment_image)
            block.set(text=text, inplace=True)

        # Write text to the output file
        for txt in text_blocks.get_texts():
            # print(txt, end='\n---\n')
            f.write(txt + '\n---\n')

print("Text extraction completed. Check the output file:", output_file)
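
For issue 2b, one direction I am considering (an assumption on my side, not something I have confirmed works on this PDF) is to take word-level OCR output for a detected table block, group the words into rows by their vertical position, and write each row as a tab-separated line. Below is a minimal sketch using hypothetical `(text, x, y)` word tuples; `words_to_rows` and `row_tol` are illustrative names, and the real word positions would have to come from the OCR engine.

```python
from typing import List, Tuple

# Hypothetical OCR word: (text, x, y) with the top-left corner of the word
Word = Tuple[str, float, float]


def words_to_rows(words: List[Word], row_tol: float = 10.0) -> List[str]:
    """Group words into rows by y position and join each row with tabs."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[2]):
        # Stay in the current row while the word is within row_tol
        # of the y position of that row's first word
        if rows and abs(word[2] - rows[-1][0][2]) <= row_tol:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Inside each row, order the words left to right
    return ["\t".join(w[0] for w in sorted(row, key=lambda w: w[1])) for row in rows]


words = [("Speed", 200, 12), ("Port", 20, 10), ("1", 20, 40), ("10G", 200, 42)]
lines = words_to_rows(words)
for line in lines:
    print(line)
```

This treats every word as its own cell, so multi-word cells would need an additional step that merges words that sit close together horizontally.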

**Environment**
1. Windows
2. Layout Parser & layoutparser[ocr] version 0.3.4
3. PyTorch version: 2.1.0+cu121, installed with:
   !pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
4. google-cloud-vision version 3.5.0
5. google-api-core version 2.11.1
6. Python 3.10.6

Thanks
Reema Jain



Not able to fetch all text data & Not able to extract text, table data in proper format #205
