Add Sámi tesseract model #11
Nice. It's a pity that parts of your ground truth data are not publicly available. Only free open data allows reproducible or new trainings, not only for Tesseract, but also for other free OCR software. It would be interesting to repeat the training process based on Latin.traineddata. It might work better than starting with the Norwegian model. And I assume that you did not add a dictionary to your trained model? That might increase the recognition rate further.
Why do you think that?
Is this the WORDLIST_FILE parameter to tesstrain? Can you tell me more or link to documentation about this?
Latin.traineddata got more training and has a larger character set than nor.traineddata. I expect that its larger character set covers more of the additional characters which are required for Sámi.
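The character-set argument can be checked directly. The following is a minimal sketch, assuming the unicharset has already been extracted from the model as a text file (for example with `combine_tessdata -u`) and that each line after the first starts with the character itself. The function name `missing_characters` and the letter list are illustrative assumptions, not part of the pull request; the list covers only the Northern Sámi letters outside basic Latin.

```python
# Hypothetical helper: report which required characters a unicharset lacks.
# Assumes the plain-text unicharset layout: first line is the entry count,
# each following line begins with the character, then space-separated properties.

# Northern Sámi letters beyond a-z (illustrative; other Sámi languages need more).
NORTHERN_SAMI_LETTERS = "áčđŋšŧž"


def missing_characters(unicharset_text: str, required: str) -> list[str]:
    lines = unicharset_text.splitlines()
    # Skip the count line and collect the first field of every entry.
    entries = {line.split(" ", 1)[0] for line in lines[1:] if line}
    return [ch for ch in required if ch not in entries]


if __name__ == "__main__":
    # Tiny made-up unicharset fragment, only for demonstration.
    sample = "4\nNULL 0 Common 0\na 3 Latin 1\ná 3 Latin 2\nš 3 Latin 3\n"
    print(missing_characters(sample, NORTHERN_SAMI_LETTERS))
```

Running the same check on the unicharsets of both candidate start models would show which one leaves fewer Sámi characters to be added during fine-tuning.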
You need combine_tessdata, dawg2wordlist and wordlist2dawg. This process should do the job:
Now edit or replace the dictionary
The files
Hint: you can start with a small dictionary which only contains the correct spelling for some of the words which typically have OCR problems. Then you can test whether it has a positive effect.
It's even simpler to test a small dictionary directly with
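The original commands did not survive in this copy of the thread. As a rough sketch of how the three tools named above are usually chained for an LSTM model, the helper below only builds the command lines and runs nothing. The function name, the `unpacked` working directory, the `sme.traineddata` file name and the `wordlist.txt` suffix are all assumptions for illustration, not the commenter's exact steps.

```python
from pathlib import Path


def dictionary_update_commands(traineddata: str, workdir: str = "unpacked"):
    """Build (but do not run) the commands for replacing a model's word dictionary.

    Hypothetical helper; file names are derived from the traineddata name.
    """
    lang = Path(traineddata).stem      # "sme" for "sme.traineddata"
    prefix = f"{workdir}/{lang}."      # combine_tessdata -u writes prefix + component name
    unicharset = prefix + "lstm-unicharset"
    word_dawg = prefix + "lstm-word-dawg"
    wordlist = prefix + "wordlist.txt"
    return [
        # 1. Unpack all components of the model.
        ["combine_tessdata", "-u", traineddata, prefix],
        # 2. Convert the existing dictionary to a plain word list
        #    (skip this step if the model ships without a dictionary).
        ["dawg2wordlist", unicharset, word_dawg, wordlist],
        # ... edit or replace the word list file here ...
        # 3. Compile the word list back into a DAWG.
        ["wordlist2dawg", wordlist, word_dawg, unicharset],
        # 4. Overwrite the dictionary component inside the model.
        ["combine_tessdata", "-o", traineddata, word_dawg],
    ]


if __name__ == "__main__":
    for cmd in dictionary_update_commands("sme.traineddata"):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the working directory exists, with a pause between steps 2 and 3 to edit the word list. It is worth working on a copy of the model, since step 4 modifies the traineddata file in place.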
I would like to share the best tesseract model from our work with Sámi OCR for the texts in the collection of the National Library of Norway. The training, validation and test data are unfortunately copyright protected and cannot be shared.
We describe the work in the paper "T Enstad, T Trosterud, MI Røsok, Y Beyer, M Roald. Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway. Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)"
Paper link: https://hdl.handle.net/10062/107202