I'm using the tesseract-ocr package on Ubuntu Linux. I have been using it for a while, and I think I could improve OCR accuracy by restricting recognition to a subset of characters. The characters I need are:
0123456789abcdefghijklmnopqrstuvwxyz
and nothing else, not even capital letters. Can anybody show me how to tell tesseract to match against only this subset of characters?
Thanks,
What you're looking for is the Tesseract whitelist. If you're working in Python through the API, I think this should work (found on the Tesseract Google Group).
Note, I'm not sure which version of Tesseract this is for.
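For reference, here is a minimal sketch of the whitelist idea in Python. It assumes the tesserocr binding (not the python-tesseract package mentioned in this thread), whose PyTessBaseAPI exposes SetVariable. The helper names ocr_with_whitelist and outside_whitelist are made up for illustration. tessedit_char_whitelist is the Tesseract variable that restricts recognition to the given characters. As far as I know, the LSTM engine in Tesseract 4.0 ignored the whitelist and support came back in 4.1, so results may depend on your version.

```python
import string

# The character set from the question: digits plus lowercase ASCII letters.
WHITELIST = string.digits + string.ascii_lowercase


def ocr_with_whitelist(image_path, whitelist=WHITELIST, lang="eng"):
    """Run OCR on image_path, restricting recognition to `whitelist`."""
    # Imported lazily so this module loads even without tesserocr installed.
    # Assumption: tesserocr is the binding in use; it is NOT the
    # python-tesseract package referred to elsewhere in this thread.
    from tesserocr import PyTessBaseAPI

    with PyTessBaseAPI(lang=lang) as api:
        # Set the whitelist before handing the image to the engine.
        api.SetVariable("tessedit_char_whitelist", whitelist)
        api.SetImageFile(image_path)
        return api.GetUTF8Text()


def outside_whitelist(text, whitelist=WHITELIST):
    """Return the sorted non-whitespace characters of `text` not in `whitelist`."""
    return sorted({ch for ch in text if not ch.isspace() and ch not in whitelist})
```

Running outside_whitelist on the OCR output is a quick way to check whether your Tesseract version actually honours the whitelist: an empty list means nothing slipped through.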
From the python-tesseract project page:
So just set your own collection of characters in api.SetVariable.
From the tesseract-ocr project FAQ:
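For the command-line route, here is a minimal sketch that drives the tesseract binary from Python. It assumes a Tesseract build that accepts -c to set a config variable and stdout as the output base (both depend on the installed version), and the helper names build_tesseract_command and run_tesseract are made up for illustration.

```python
import subprocess


def build_tesseract_command(image_path, whitelist, lang="eng"):
    """Build the argument list for a whitelisted tesseract run."""
    return [
        "tesseract",
        image_path,
        "stdout",  # assumption: this build supports writing text to stdout
        "-l", lang,
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def run_tesseract(image_path, whitelist, lang="eng"):
    """Run tesseract and return the recognised text (requires the binary on PATH)."""
    result = subprocess.run(
        build_tesseract_command(image_path, whitelist, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits non-zero
    )
    return result.stdout
```

Passing the arguments as a list (rather than one shell string) means the whitelist needs no quoting or escaping.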