I am trying to extract information from a range of different receipts using a combination of OpenCV, Tesseract and Keras. The end goal of the project is that I should be able to take a picture of a receipt with a phone and, from that picture, get the store name, payment type (card or cash), amount paid and change tendered.
So far I have applied a few different preprocessing steps to a series of sample receipts using OpenCV, such as removing the background, denoising and converting to a binary image, and I am left with an image such as the following:
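To give an idea of what that stage looks like, here is a simplified sketch of that kind of pipeline (not my exact code; the steps and parameter values are just illustrative):

```python
import cv2

# Illustrative preprocessing: grayscale -> denoise -> binarize.
img = cv2.imread("receipt.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Non-local means denoising; the strength (30) is a placeholder value.
denoised = cv2.fastNlMeansDenoising(gray, None, 30)

# Otsu thresholding produces a binary image suitable for OCR.
_, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("receipt_binary.png", binary)
```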
I am then using Tesseract to perform OCR on the receipt and write the results out to a text file. I have managed to get the OCR to perform at an acceptable level, so I can currently take a picture of a receipt, run my program on it, and get a text file containing all the text on the receipt.
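The OCR step itself amounts to something like this (a sketch using the pytesseract wrapper, which may not match the exact binding in my project):

```python
import pytesseract
from PIL import Image

# Run Tesseract on the preprocessed binary image and dump the raw text.
text = pytesseract.image_to_string(Image.open("receipt_binary.png"))
with open("receipt.txt", "w") as f:
    f.write(text)
```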
My problem is that I don't want all of the text on the receipt; I just want certain pieces of information, such as the parameters I listed above. I am unsure how to go about training a model that will extract the data I need.
Am I correct in thinking that I should use Keras to segment and classify different sections of the image, and then write out only the text from the sections that my model has classified as containing relevant data? Or is there a better solution for what I need to do?
Sorry if this is a stupid question; this is my first OpenCV/machine learning project and I'm pretty far out of my depth. Any constructive criticism would be much appreciated.
My answer isn't as fancy as what's in fashion right now, but I think it works in your case, especially if this is for a product (not for research and publication purposes).
I would implement the paper Text/Graphics Separation Revisited. I have already implemented it in both MATLAB and C++, and I guarantee from your description that it won't take you long. In summary (a rough code sketch follows the steps below):
1. Get all connected components with their stats. You're especially interested in the bounding box of each character.
2. The paper obtains thresholds from histograms over the properties of your connected components, which makes it fairly robust. Using these thresholds (which work surprisingly well) on the geometric properties of your connected components, discard anything that isn't a character.
3. For the remaining characters, take the centroid of each bounding box and group close centroids by your own criteria (height, vertical position, Euclidean distance, etc.). Use the resulting centroid clusters to create rectangular text regions.
4. Associate text regions of the same height and vertical position.
5. Run OCR on your text regions and look for keywords like "Cash". I honestly think you can get away with keeping dictionaries in plain text files, and from having done computer vision for mobile I know your resources are limited (by privacy too).
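Here is a rough sketch of steps 1–4 in Python/OpenCV, just to make it concrete. The geometric thresholds are hard-coded placeholders; the paper derives them from histograms instead, which is what makes it robust:

```python
import cv2

def text_regions(binary):
    """Sketch of steps 1-4: connected components -> geometric filtering ->
    centroid grouping -> rectangular text regions.
    All thresholds below are placeholders, not the paper's values."""
    # Step 1: connected components with stats (invert if text is black on white).
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(255 - binary)

    chars = []
    for i in range(1, n):  # label 0 is the background
        x, y, w, h, area = stats[i]
        aspect = w / float(h)
        # Step 2: discard components whose geometry doesn't look like a character.
        if 5 < h < 100 and 0.1 < aspect < 2.0 and area > 15:
            chars.append((x, y, w, h, centroids[i]))

    # Step 3: group characters whose centroids lie on (roughly) the same line.
    lines = {}
    for x, y, w, h, (cx, cy) in chars:
        key = int(cy // 20)  # crude vertical bucketing, ~20 px line height
        lines.setdefault(key, []).append((x, y, w, h))

    # Step 4: merge each group into one rectangular text region.
    regions = []
    for boxes in lines.values():
        x0 = min(b[0] for b in boxes)
        y0 = min(b[1] for b in boxes)
        x1 = max(b[0] + b[2] for b in boxes)
        y1 = max(b[1] + b[3] for b in boxes)
        regions.append((x0, y0, x1 - x0, y1 - y0))
    return regions
```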
I honestly don't think a neural net will be much better than some kind of keyword matching (e.g. using Levenshtein distance or something similar to add a bit of robustness), because you would need to manually collect and label those words anyway to create your training dataset, so... why not just write them down instead?
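Something this simple usually covers the keyword matching (a sketch using Python's difflib for the fuzzy part; a dedicated Levenshtein library would work just as well, and the keyword-to-field mapping is only an example):

```python
import difflib

# Example dictionary mapping receipt keywords to the fields you care about.
KEYWORDS = {"cash": "payment_type", "card": "payment_type",
            "total": "amount_paid", "change": "change_tendered"}

def match_keywords(ocr_words, cutoff=0.75):
    """Fuzzy-match OCR tokens against a small keyword dictionary to
    tolerate recognition errors like 'CasH' or 'T0TAL'."""
    found = {}
    for word in ocr_words:
        hit = difflib.get_close_matches(word.lower(), list(KEYWORDS),
                                        n=1, cutoff=cutoff)
        if hit:
            found[KEYWORDS[hit[0]]] = word
    return found

# e.g. match_keywords(["CasH", "T0TAL", "12.50"]) flags the payment and total tokens.
```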
That's basically it. You end up with something very fast (especially if you want to run on a phone and can't send images to a server) and it just works. No machine learning needed, so no dataset needed either.
But if this is for school... sorry I was so rude. Please use TensorFlow with 10,000 manually labeled receipt images and natural language processing methods; your professor will be happy.
It's a good idea to use the image, as you will lose the structure of the document if you rely on plain OCR alone. I think you are on the right track. I would segment the bill into headers, total amount and line items, and train an image classifier on those segments. You could then use it to clean up and extract the relevant information you need from the text.
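For example, a small CNN along these lines could classify cropped segments into categories like header / line item / totals (just a sketch; the input size, class set and architecture are arbitrary placeholders):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of a small CNN classifying cropped receipt segments into
# e.g. header / line item / totals / other.
num_classes = 4
model = keras.Sequential([
    keras.Input(shape=(64, 256, 1)),            # grayscale segment crops
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(segment_images, segment_labels, epochs=..., validation_split=0.1)
```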