Add Sámi tesseract model by titaenstad · Pull Request #11 · tesseract-ocr/tessdata_contrib

titaenstad · 2025-02-28T15:34:03Z

I would like to share the best tesseract model from our work with Sámi OCR for the texts in the collection of the National Library of Norway. The training, validation and test data are unfortunately copyright protected and cannot be shared.

We describe the work in the paper: T Enstad, T Trosterud, MI Røsok, Y Beyer, M Roald. "Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway." Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025).

Paper link: https://hdl.handle.net/10062/107202

stweil · 2025-02-28T16:08:39Z

Nice. It's a pity that parts of your ground truth data are not publicly available. Only free open data allows reproducible or new trainings, not only for Tesseract, but also for other free OCR software.

It would be interesting to repeat the training process based on Latin.traineddata. It might work better than starting with the Norwegian model. And I assume that you did not add a dictionary to your trained model? That might increase the recognition rate further.

titaenstad · 2025-03-05T14:55:48Z

> It would be interesting to repeat the training process based on Latin.traineddata. It might work better than starting with the Norwegian model.

Why do you think that?

> And I assume that you did not add a dictionary to your trained model? That might increase the recognition rate further.

Is this the WORDLIST_FILE parameter to tesstrain? Can you tell me more/link to documentation about this?

stweil · 2025-03-05T16:47:00Z

> > It would be interesting to repeat the training process based on Latin.traineddata. It might work better than starting with the Norwegian model.
>
> Why do you think that?

Latin.traineddata got more training and has a larger character set than nor.traineddata. I expect that its larger character set covers more of the additional characters which are required for Sámi.
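A rough way to check this expectation before retraining is to compare the character sets directly. The following is only a sketch: missing_chars is a hypothetical helper, tmp/Latin.lstm-unicharset is an assumed path, and it assumes the unicharset files were extracted with combine_tessdata -u and use the usual plain-text layout (a count on the first line, then one entry per line starting with the character). The letters in the usage comment are the Northern Sámi letters outside the basic Latin alphabet; other Sámi languages need further letters.

```shell
#!/bin/sh
# Print every character from the argument list that has no entry in the
# given unicharset file. Hypothetical helper, not part of Tesseract.
missing_chars() {
    unicharset=$1
    shift
    for ch in "$@"; do
        # Skip the count on line 1 and compare the first field byte for byte.
        if ! LC_ALL=C awk -v c="$ch" \
            'NR > 1 && $1 == c { found = 1; exit } END { exit !found }' \
            "$unicharset"; then
            printf '%s\n' "$ch"
        fi
    done
}

# Example (paths are assumptions):
#   missing_chars tmp/nor.lstm-unicharset   á č đ ŋ š ŧ ž Á Č Đ Ŋ Š Ŧ Ž
#   missing_chars tmp/Latin.lstm-unicharset á č đ ŋ š ŧ ž Á Č Đ Ŋ Š Ŧ Ž
```

A shorter list for the Latin model would support starting from it, although character coverage alone does not say how well those characters are recognized.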

> > And I assume that you did not add a dictionary to your trained model? That might increase the recognition rate further.
>
> Is this the WORDLIST_FILE parameter to tesstrain? Can you tell me more/link to documentation about this?

You need combine_tessdata, dawg2wordlist and wordlist2dawg.

This process should do the job:

# Create intermediate directory.
mkdir -p tmp

# Extract all components from smi.traineddata.
combine_tessdata -u smi.traineddata tmp/smi.

# Extract all components from nor.traineddata.
combine_tessdata -u nor.traineddata tmp/nor.

# Convert wordlist from DAWG to text.
dawg2wordlist tmp/nor.lstm-unicharset tmp/nor.lstm-word-dawg tmp/nor.lstm-word.txt

Now edit or replace the dictionary tmp/nor.lstm-word.txt to get an optimized dictionary tmp/smi.lstm-word.txt for Sámi.

# Convert wordlist from text to DAWG.
wordlist2dawg tmp/smi.lstm-word.txt tmp/smi.lstm-word-dawg tmp/smi.lstm-unicharset

# Add wordlist DAWG to smi.traineddata.
combine_tessdata -o smi.traineddata tmp/smi.lstm-word-dawg
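
For the "edit or replace the dictionary" step above, one possible starting point is a frequency-filtered word list taken from plain Sámi text. This is a sketch under assumptions, not part of the process itself: make_wordlist is a hypothetical helper, corpus.txt is a placeholder for any UTF-8 text you are allowed to use, and the tokenization is deliberately crude (it splits on ASCII whitespace, punctuation and digits, so hyphenated words are broken apart and case is kept as-is).

```shell
#!/bin/sh
# Print one word per line for every word that occurs at least MIN times
# in a UTF-8 text file. Hypothetical helper, not part of Tesseract.
make_wordlist() {
    corpus=$1
    min=${2:-2}
    # Turn ASCII whitespace, punctuation and digits into line breaks,
    # count the distinct words, then keep those at or above the threshold.
    LC_ALL=C tr -s '[:space:][:punct:][:digit:]' '\n' < "$corpus" |
        LC_ALL=C sort |
        uniq -c |
        awk -v min="$min" 'NF == 2 && $1 + 0 >= min + 0 { print $2 }'
}

# Example (file names are assumptions):
#   make_wordlist corpus.txt 3 > tmp/smi.lstm-word.txt
```

The threshold keeps rare tokens, which are often typos or OCR errors in the source text, out of the dictionary.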

The files tmp/nor.lstm-number-dawg and tmp/nor.lstm-punc-dawg can optionally also be converted to text, optionally edited, then converted back to DAWG for smi and added to smi.traineddata. They might help to recognize numbers and punctuation.
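
The same steps for the number and punctuation components could look like the sketch below. It assumes the tmp/ directory from the process above, reuses the commands in the same form as for the word list, and copies the extracted text unchanged where you would normally edit it. The run and port_dawg helpers are hypothetical; with DRY_RUN=1 the commands are only printed, so they can be reviewed before smi.traineddata is modified.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Port one DAWG component ("number" or "punc") from nor to smi.
port_dawg() {
    kind=$1
    # Convert the Norwegian DAWG to text.
    run dawg2wordlist tmp/nor.lstm-unicharset "tmp/nor.lstm-$kind-dawg" "tmp/nor.lstm-$kind.txt"
    # Edit the text here if needed; this sketch copies it unchanged.
    run cp "tmp/nor.lstm-$kind.txt" "tmp/smi.lstm-$kind.txt"
    # Convert the text back to a DAWG using the smi unicharset.
    run wordlist2dawg "tmp/smi.lstm-$kind.txt" "tmp/smi.lstm-$kind-dawg" tmp/smi.lstm-unicharset
    # Add the DAWG to smi.traineddata.
    run combine_tessdata -o smi.traineddata "tmp/smi.lstm-$kind-dawg"
}

# Example:
#   DRY_RUN=1; port_dawg number; port_dawg punc   # review the commands
#   DRY_RUN=0; port_dawg number; port_dawg punc   # run them
```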

Hint: you can start with a small dictionary which only contains the correct spelling for some of the words which typically have OCR problems. Then you can test whether it has a positive effect. It's even simpler to test a small dictionary directly with tesseract --user-words WORDLIST.
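
A quick before/after comparison for such a test might look like this. It is a sketch only: compare_user_words is a hypothetical helper, the image and word list names are placeholders, and -l smi assumes that smi.traineddata is in the tessdata directory Tesseract searches.

```shell
#!/bin/sh
# Run OCR on one image without and with a user word list and print the
# lines that differ. Hypothetical helper, not part of Tesseract.
compare_user_words() {
    image=$1
    wordlist=$2
    outdir=$(mktemp -d) || return 1
    # Tesseract appends ".txt" to the output base name.
    tesseract "$image" "$outdir/plain" -l smi || return 1
    tesseract "$image" "$outdir/words" -l smi --user-words "$wordlist" || return 1
    # diff exits with 1 when the files differ, which is expected here.
    status=0
    diff "$outdir/plain.txt" "$outdir/words.txt" || status=$?
    rm -rf "$outdir"
    [ "$status" -le 1 ]
}

# Example (file names are assumptions):
#   compare_user_words page.png WORDLIST
```

Empty output means the word list changed nothing for that page; lines starting with ">" show the text produced with the word list.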

titaenstad added 2 commits February 28, 2025 16:16

Add best smi model

adbfed0

Create readme for model

df19e43

titaenstad wants to merge 2 commits into tesseract-ocr:main from Sprakbanken:main
