I want to use tesseract to recognize only numbers. The problem is that I have a mixture of numbers and letters, and when I use SetVariable("tessedit_char_whitelist", "0123456789"), tesseract returns a wrong digit for every symbol.
Can I set a threshold value so that tesseract omits the symbols with low resemblance?
NOTE: I set tesseract to recognize only digits so there is no confusion between O and 0.
For tesseract 3, I tried to create a config file according to the FAQ:
Set the whitelist variable BEFORE calling an Init function, or put the setting in a text file called tessdata/configs/digits.
Then it works by using the command:
tesseract imagename outputbase digits
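A minimal sketch of creating the tessdata/configs/digits file, assuming the config file uses one "variable value" pair per line. The TESSDATA_DIR default below is a placeholder, since the real tessdata directory depends on the installation.

```shell
# Placeholder location: point TESSDATA_DIR at the real tessdata directory.
TESSDATA_DIR="${TESSDATA_DIR:-./tessdata}"

# Create the configs directory if it does not exist yet.
mkdir -p "$TESSDATA_DIR/configs"

# Write the whitelist setting: variable name, a space, then the allowed characters.
printf 'tessedit_char_whitelist 0123456789\n' > "$TESSDATA_DIR/configs/digits"
```

With the file in place, the config is selected by passing its name (digits) as the last argument on the tesseract command line.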
If one wants to match 0-9, the whitelist is 0123456789.
Or if one almost wants to match 0-9, but with one or more different characters, the whitelist lists those characters instead.
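The two cases above can be sketched on the command line with the -c option, which sets a variable without a config file. This is an illustration, not necessarily the exact command from the answer. The image name and the extra characters ('-' and '.') are placeholders.

```shell
IMAGE="image.png"   # placeholder input image, not a file from the original post

# Case 1: digits only.
WHITELIST_DIGITS="tessedit_char_whitelist=0123456789"
# Case 2: digits plus a few extra characters (here '-' and '.', chosen as an example).
WHITELIST_EXTENDED="tessedit_char_whitelist=0123456789-."

# Run only when tesseract is installed and the image exists.
if command -v tesseract >/dev/null 2>&1 && [ -f "$IMAGE" ]; then
  tesseract "$IMAGE" stdout -c "$WHITELIST_DIGITS"
  tesseract "$IMAGE" stdout -c "$WHITELIST_EXTENDED"
fi
```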
For tesseract 3, the command is simpler:
tesseract imagename outputbase digits
according to the FAQ. But it doesn't work very well for me. I tried different psm options and found that -psm 6
works best for my case. See man tesseract for details.
What I do is to recognize everything, and when I have the text, I take out all the characters except numbers.
This works pretty well for me.
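A minimal sketch of that post-filtering step, assuming the recognized text arrives on standard input. The function name keep_digits is made up for illustration, and the sample string stands in for real OCR output.

```shell
# Keep only digits and newlines from the text read on standard input.
keep_digits() {
  tr -cd '0-9\n'
}

# Example with a literal string standing in for OCR output; prints 427.
printf 'Total: 42 items, code A7\n' | keep_digits
```

In practice the function would sit at the end of a pipeline, for example: tesseract imagename stdout | keep_digits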
You can instruct tesseract to use only digits, and if that is not accurate enough, then the best chance of getting better results is to go through the training process: http://www.resolveradiologic.com/blog/2013/01/15/training-tesseract/
I did it a bit differently (with tess-two). Maybe it will be useful for somebody.
So you need to initialize the API first.
Then set the following variables
In this way the engine will check only the numbers.