I am trying to interact with the Tesseract API. I am new to image processing and have been struggling with this for the last few days. I have tried simple algorithms and achieved around 70% accuracy, but I want to get above 90%. The problem with the images is that they are at 72 dpi. I also tried increasing the resolution, but did not get good results. The images I am trying to recognize are attached.
Any help would be appreciated, and I am sorry if I am asking something very basic.
EDIT
I forgot to mention that I am trying to do all the processing and recognition within 2-2.5 seconds on a Linux platform, and the text-detection method mentioned in this answer is taking a lot of time. Also, I would prefer not to use a command-line solution; I would rather use Leptonica or OpenCV.
Most of the images are uploaded here.
I have tried the following approaches to binarize the tickets, but with no luck:
- http://www.vincent-net.com/luc/papers/10wiley_morpho_DIAapps.pdf
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.193.6347&rep=rep1&type=pdf
- http://iit.demokritos.gr/~bgat/PatRec2006.pdf
- http://psych.stanford.edu/~jlm/pdfs/Sternberg67.pdf
The tickets have:
- somewhat poor lighting
- non-text areas
- low resolution
I tried feeding the images directly to the Tesseract API, and it gives me about 70% good results in 1 second on average. But I want to increase the accuracy while keeping the time constraint in mind. So far I have tried:
- edge detection
- blob analysis
- binarizing the ticket using adaptive thresholding (sketched below)
Then I fed those binarized images to Tesseract, and the accuracy dropped to 50-60%, even though the binarized images look perfect.
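For reference, the adaptive thresholding step looks roughly like this (a sketch in Python with OpenCV; the file name, block size, and constant are just the values I have been experimenting with):

```python
import cv2

# Load the ticket and convert to grayscale
img = cv2.imread("ticket.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Adaptive thresholding; blockSize and C were chosen by trial and error
binarized = cv2.adaptiveThreshold(
    gray, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY,
    blockSize=31, C=10,
)
cv2.imwrite("ticket_bin.png", binarized)
```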
There are several things you could try:
To improve the accuracy, you should improve the quality of the image for the OCR engine, which means preprocessing the images before feeding them to Tesseract. I suggest investigating OpenCV for this purpose.
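As a minimal sketch of such preprocessing (assuming Python with OpenCV and the pytesseract wrapper; the filter parameters are starting points to tune, not tested values):

```python
import cv2
import pytesseract

# Load and convert to grayscale
img = cv2.imread("ticket.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Denoise while keeping character edges reasonably sharp
gray = cv2.bilateralFilter(gray, d=9, sigmaColor=75, sigmaSpace=75)

# Otsu's global threshold; swap in adaptive thresholding for very uneven light
_, binarized = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

print(pytesseract.image_to_string(binarized))
```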
The main problem with OCR engines is that they are not as good at recognizing characters as we are, so things that are not text sometimes get mistakenly identified as if they were. To prevent this, it is best to detect the areas of text and send only those to Tesseract instead of the full image, as you are doing with image #2.
Another way to extract the text regions of an image is to isolate them with this technique.
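I can't vouch for the exact steps of the linked technique, but a common morphology-based version of the idea looks like the following (kernel sizes and the area threshold are assumptions to tune per image; OpenCV 4 API):

```python
import cv2

img = cv2.imread("ticket.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Morphological gradient highlights character strokes
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
grad = cv2.morphologyEx(gray, cv2.MORPH_GRADIENT, kernel)

# Binarize, then connect horizontally close strokes into text lines
_, bw = cv2.threshold(grad, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
connect = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 1))
connected = cv2.morphologyEx(bw, cv2.MORPH_CLOSE, connect)

# Each sufficiently large contour is a candidate text region to crop for Tesseract
contours, _ = cv2.findContours(connected, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
regions = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 100]
```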
When you get the results from Tesseract, you can improve them by comparing the resulting text to a dictionary.
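A minimal sketch of such a dictionary correction with Python's difflib (the word list here is a placeholder; cutoff controls how aggressive the correction is):

```python
import difflib

DICTIONARY = ["TICKET", "TOTAL", "DATE", "PRICE"]  # placeholder word list

def correct(word, cutoff=0.8):
    """Replace an OCR'd word with its closest dictionary match, if close enough."""
    matches = difflib.get_close_matches(word.upper(), DICTIONARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word

ocr_output = "T0TAL"  # a typical OCR confusion: zero instead of the letter O
print(correct(ocr_output))  # -> TOTAL
```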
Some possible improvements:
- The resolution should be at least 300 dpi.
- Make your illumination more evenly distributed; there are several dark areas that might hurt the results.
- Try to scale your characters to a more uniform size; currently they vary, and some of the letters are even distorted.
- Pre-process the image with thresholding and binarization.
You can do the above with your own code (a rough sketch follows), or Fred's ImageMagick Scripts might help.
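For example, upscaling plus illumination flattening with OpenCV might look like this (the dpi target and blur sigma are assumptions to tune):

```python
import cv2

img = cv2.imread("ticket.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Upscale the 72 dpi input towards the ~300 dpi Tesseract prefers
scale = 300 / 72.0
gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)

# Flatten uneven illumination by dividing by a heavily blurred background estimate
background = cv2.GaussianBlur(gray, (0, 0), sigmaX=25)
normalized = cv2.divide(gray, background, scale=255)

# Final binarization with Otsu's threshold
_, binarized = cv2.threshold(normalized, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("ticket_clean.png", binarized)
```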
I'm not sure if my post is useful for you, because my answer is not about Tesseract. But it is about high accuracy, so I decided it might be interesting for you to see how a paid OCR SDK performs.
These are the results of recognition with the ABBYY Cloud OCR SDK, without any additional settings.
Disclaimer: I work for ABBYY.
You can try ScanTailor (http://scantailor.sourceforge.net/; it also has a CLI interface) to binarize, deskew, and dewarp the images. Scaling the images up might also improve recognition, because Tesseract's recognition profiles were optimized for at least 300 DPI.
Another possibility is to train Tesseract on the fonts that are characteristic of your material (more on this here: https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3).
I don't think a dictionary lookup will help here, because you have mostly numbers.
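Since the content is mostly digits, restricting Tesseract's character set may help more than a dictionary. A sketch via pytesseract (tessedit_char_whitelist is a standard Tesseract variable; the page segmentation mode and scale factor are guesses):

```python
import cv2
import pytesseract

# Load as grayscale and upscale the 72 dpi input roughly to 300 dpi
img = cv2.imread("ticket.jpg", cv2.IMREAD_GRAYSCALE)
img = cv2.resize(img, None, fx=4, fy=4, interpolation=cv2.INTER_CUBIC)

# PSM 6 treats the image as a uniform block of text; the whitelist keeps digits only
config = "--psm 6 -c tessedit_char_whitelist=0123456789"
print(pytesseract.image_to_string(img, config=config))
```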