Commit 9400d94

Update requirements
1 parent 4492cd8 commit 9400d94

2 files changed: +23 -1 lines changed


README.md

+22
@@ -1,6 +1,28 @@
 # ocr-table
 This project aims to extract tables from scanned image PDFs using Optical Character Recognition.
 
+# Install Requirements
+
+1. Tesseract OCR
+```sh
+sudo apt-get install tesseract-ocr
+```
+
+2. Imagemagick
+```sh
+sudo apt-get install imagemagick
+```
+
+3. PDF Utilities
+```sh
+sudo apt-get install poppler-utils
+```
+
+4. Python packages
+```sh
+sudo pip install -r requirements.txt
+```
+
 # Usage
 
 1. Clear the [pdf/](pdf) folder and copy all your pdf files to be scanned in it.
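
The install steps above can be checked with a short script before running the extraction. This is a minimal sketch, not part of the commit: the helper name `missing_tools` is hypothetical, and it assumes the three apt packages put `tesseract`, `convert` (ImageMagick 6 naming) and `pdftotext` on the `PATH`.

```sh
#!/bin/sh
# Hypothetical helper (not in the repository): print the names of the
# given commands that cannot be found on PATH, separated by spaces.
missing_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Drop the leading space before printing.
    echo "${missing# }"
}

# Assumed package-to-command mapping:
#   tesseract-ocr -> tesseract
#   imagemagick   -> convert
#   poppler-utils -> pdftotext
result=$(missing_tools tesseract convert pdftotext)
if [ -n "$result" ]; then
    echo "Missing tools: $result"
else
    echo "All required tools found."
fi
```

Running it after the install steps should report nothing missing. On systems that ship ImageMagick 7, the command may be `magick` rather than `convert`, so the list would need adjusting.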

extract_text.sh

+1 -1
@@ -21,7 +21,7 @@ for FILEPATH in $BPATH*.pdf; do
 OUTFILE=$OPATH$(basename $FILEPATH).txt
 touch "$OUTFILE" # The text file will be created regardless of whether
 # text is successfully extracted.
-# First attempt ot use pdftotext to extract embedded text.
+# First attempt to use pdftotext to extract embedded text.
 echo -n "Attempting pdftotext extraction..."
 pdftotext "$FILEPATH" "$OUTFILE"
 FILESIZE=$(wc -w < "$OUTFILE")
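
The hunk shows only the first attempt: run `pdftotext`, then count the words in the output. The rest of the script is not part of this diff, so the following is only a sketch of how a word-count check could drive a fallback to OCR. The function names `has_text` and `extract_one`, the 300 dpi setting, and the use of `pdftoppm` for rasterizing are all assumptions, not the author's code.

```sh
#!/bin/sh
# Hypothetical helper: succeed when the file holds at least one word.
# The substitution is left unquoted on purpose so that any padding
# some wc implementations add around the number is dropped.
has_text() {
    [ $(wc -w < "$1") -gt 0 ]
}

# Hypothetical helper: try embedded text first, then fall back to OCR.
extract_one() {
    pdf=$1
    out=$2
    touch "$out"                # create the output file in any case
    pdftotext "$pdf" "$out"     # first attempt: embedded text
    if has_text "$out"; then
        return 0
    fi
    # Assumed fallback: rasterize each page, then OCR it with Tesseract.
    tmpdir=$(mktemp -d)
    pdftoppm -r 300 -png "$pdf" "$tmpdir/page"
    : > "$out"
    for img in "$tmpdir"/page*.png; do
        [ -e "$img" ] || continue
        tesseract "$img" stdout >> "$out" 2>/dev/null
    done
    rm -rf "$tmpdir"
}
```

A call would look like `extract_one pdf/sample.pdf text/sample.pdf.txt`, where both paths are placeholders. Checking the word count rather than the byte size treats an output file that contains only whitespace or form feeds as empty.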
