2 files changed: +23 -1 lines changed
# ocr-table

This project aims to extract tables from scanned image PDFs using Optical Character Recognition.

+ # Install Requirements
+
+ 1. Tesseract OCR
+ ```sh
+ sudo apt-get install tesseract-ocr
+ ```
+
+ 2. ImageMagick
+ ```sh
+ sudo apt-get install imagemagick
+ ```
+
+ 3. PDF Utilities
+ ```sh
+ sudo apt-get install poppler-utils
+ ```
+
+ 4. Python packages
+ ```sh
+ sudo pip install -r requirements.txt
+ ```
+
# Usage

1. Clear the [pdf/](pdf) folder and copy all your pdf files to be scanned in it.
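
The install steps in this README hunk can be sanity-checked before running the script. A minimal sketch, assuming the three apt packages expose the commands `tesseract`, `convert`, and `pdftotext` on `PATH`; the helper name `missing_tools` is hypothetical and not part of this change:

```sh
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
# (Hypothetical helper, not part of the repository.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Assumed command names: tesseract (tesseract-ocr), convert (imagemagick),
# pdftotext (poppler-utils).
MISSING=$(missing_tools tesseract convert pdftotext)
if [ -n "$MISSING" ]; then
    # Unquoted on purpose so the names are joined onto one line.
    echo "Missing tools:" $MISSING
else
    echo "All required tools found."
fi
```

The check only confirms that the commands resolve; it does not verify versions or the Python packages from `requirements.txt`.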
@@ -21,7 +21,7 @@ for FILEPATH in $BPATH*.pdf; do
    OUTFILE=$OPATH$(basename $FILEPATH).txt
    touch "$OUTFILE" # The text file will be created regardless of whether
    # text is successfully extracted.
-   # First attempt ot use pdftotext to extract embedded text.
+   # First attempt to use pdftotext to extract embedded text.
    echo -n "Attempting pdftotext extraction..."
    pdftotext "$FILEPATH" "$OUTFILE"
    FILESIZE=$(wc -w < "$OUTFILE")
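
This hunk shows only the first extraction attempt followed by a word count. The comment implies a fallback when `pdftotext` finds no embedded text, but the rest of the script is not visible here, so the following is only a sketch of how such a fallback could look. The names `needs_ocr` and `ocr_fallback` are hypothetical, and rasterizing with `pdftoppm` (from poppler-utils) at 300 DPI is an assumption, not necessarily what the script does:

```sh
#!/bin/sh
# Succeed (exit 0) when the text file contains no words, i.e. pdftotext
# found no embedded text and OCR is still needed. (Hypothetical helper.)
needs_ocr() {
    words=$(wc -w < "$1")
    [ "$((words + 0))" -eq 0 ]
}

# Hypothetical OCR fallback: rasterize each page to PNG, then append the
# Tesseract output for every page to the text file.
ocr_fallback() {
    pdf=$1
    out=$2
    tmpdir=$(mktemp -d)
    pdftoppm -r 300 -png "$pdf" "$tmpdir/page"
    for img in "$tmpdir"/page*.png; do
        tesseract "$img" stdout >> "$out"
    done
    rm -rf "$tmpdir"
}

# Possible use inside the loop, after the pdftotext call:
#   if needs_ocr "$OUTFILE"; then
#       ocr_fallback "$FILEPATH" "$OUTFILE"
#   fi
```

Counting words rather than bytes mirrors the `wc -w` call in the hunk: a file that contains only whitespace or form feeds still counts as empty.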