Description
Hi Team,
I am using Layout Parser and Detectron2 to detect every layout element (text, titles, lists, and tables) except figures in a PDF, which I convert to images using pdf2image. I then want to extract the detected text, titles, tables, and lists into a .txt file.
Issues:
1) The model does not seem to recognize all of the text properly.
2) When writing the extracted data to the .txt file:
a) I am not able to write the text in the same order as it appears in the PDF.
b) I am not able to extract table data in tabular format.
Can you please suggest how I can resolve the above issues? Thank you!
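To make issue 2a concrete, the order I expect for a two-column page is roughly the one produced by the sketch below. It is only an illustration: the `(x1, y1, x2, y2, label)` tuples and the `reading_order` helper are hypothetical stand-ins, not layoutparser objects or API calls.

```python
from typing import List, Tuple

# Hypothetical box format: (x1, y1, x2, y2, label). Not a layoutparser type.
Box = Tuple[float, float, float, float, str]

def reading_order(boxes: List[Box], page_width: float) -> List[Box]:
    """Order boxes left column first, then right column, each top to bottom."""
    mid = page_width / 2
    # Assign each box to a column by the x coordinate of its center.
    left = [b for b in boxes if (b[0] + b[2]) / 2 < mid]
    right = [b for b in boxes if (b[0] + b[2]) / 2 >= mid]
    # sorted() returns a new list, so the result is reassigned here.
    left = sorted(left, key=lambda b: b[1])
    right = sorted(right, key=lambda b: b[1])
    return left + right

boxes = [
    (320, 50, 580, 120, "right-top"),
    (20, 300, 280, 400, "left-bottom"),
    (20, 40, 280, 100, "left-top"),
    (320, 200, 580, 380, "right-bottom"),
]
ordered = [b[4] for b in reading_order(boxes, page_width=600)]
print(ordered)
```

This assumes a clean two-column layout; full-width titles or tables that span both columns would need separate handling.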
Code:
Install necessary libraries
#install detectron2:
!pip install 'git+https://github.com/facebookresearch/[email protected]#egg=detectron2'
#install layoutparser
!pip install layoutparser
!pip install layoutparser[ocr]
##install opencv, numpy, matplotlib
!pip install opencv-python numpy matplotlib
!pip3 install pdf2image
!sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev
!apt-get install poppler-utils
!pip install --upgrade google-cloud-vision
!pip uninstall google-cloud-vision
!pip install google-cloud-vision
!apt install tesseract-ocr
!apt install libtesseract-dev
!pip install pytesseract
import os
from pdf2image import convert_from_path
import shutil
import cv2
import numpy as np
import layoutparser as lp
# Define the PDF path
pdf_file = '7050X_Q_A.pdf'
# Define the output file name
output_file = 'output.txt'
with open(output_file, 'w', encoding='utf-8') as f:
    for i, page_img in enumerate(convert_from_path(pdf_file)):
        img = np.asarray(page_img)
        model3 = lp.models.Detectron2LayoutModel(
            'lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config',
            extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
            label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
        )
        layout_result3 = model3.detect(img)
        # Keep everything except figures
        text_blocks = lp.Layout([b for b in layout_result3 if b.type != "Figure"])
        h, w = img.shape[:2]
        # Split the blocks into left and right columns, then sort each by its top y coordinate
        left_interval = lp.Interval(0, w / 2 * 1.05, axis='x').put_on_canvas(img)
        left_blocks = text_blocks.filter_by(left_interval, center=True)
        left_blocks.sort(key=lambda b: b.coordinates[1])
        right_blocks = [b for b in text_blocks if b not in left_blocks]
        right_blocks.sort(key=lambda b: b.coordinates[1])
        text_blocks = lp.Layout([b.set(id=idx) for idx, b in enumerate(left_blocks + right_blocks)])
        viz = lp.draw_box(img, text_blocks, box_width=10, show_element_id=True)
        display(viz)
        # Run OCR on each detected block
        ocr_agent = lp.TesseractAgent(languages='eng')
        for block in text_blocks:
            segment_image = (block
                             .pad(left=5, right=5, top=5, bottom=5)
                             .crop_image(img))
            text = ocr_agent.detect(segment_image)
            block.set(text=text, inplace=True)
        # Write text to the output file
        for txt in text_blocks.get_texts():
            #print(txt, end='\n---\n')
            f.write(txt + '\n---\n')
print("Text extraction completed. Check the output file:", output_file)
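To make issue 2b concrete, this is roughly the kind of output I am after for a table block. It is only a sketch under assumptions: the `(x, y, text)` word tuples and the `words_to_rows` helper are hypothetical, and the sample values are made up. I assume word-level positions could come from something like pytesseract's `image_to_data`, but I have not verified that on my PDF.

```python
from typing import List, Tuple

# Hypothetical OCR word: (x, y, text), where (x, y) is the word's top-left corner.
Word = Tuple[float, float, str]

def words_to_rows(words: List[Word], row_tol: float = 10.0) -> List[List[str]]:
    """Group words into rows by y position, then order each row left to right."""
    rows: List[List[Word]] = []
    for w in sorted(words, key=lambda w: w[1]):
        # Stay in the current row while the word is within row_tol of the row's first word.
        if rows and abs(w[1] - rows[-1][0][1]) <= row_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    return [[t for _, _, t in sorted(r, key=lambda w: w[0])] for r in rows]

# Made-up sample words from a 2x2 table, deliberately out of order.
words = [
    (200, 12, "Qty"), (10, 10, "Name"),
    (10, 41, "A"), (200, 40, "1"),
]
table = words_to_rows(words)
# Tab-separated text keeps the column structure inside a plain .txt file.
tsv = "\n".join("\t".join(row) for row in table)
print(tsv)
```

This grouping only works when rows are clearly separated vertically; cells with wrapped, multi-line text would need a smarter row assignment.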
Environment
- Windows
- Layout Parser & layoutparser[ocr] version: 0.3.4
- PyTorch version: 2.1.0+cu121 (installed with: !pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121)
- google-cloud-vision version: 3.5.0
- google-api-core version: 2.11.1
- Python version: 3.10.6
Thanks
Reema Jain