I've written an application that segments an image based on the text regions within it and extracts those regions as I see fit. What I'm attempting to do is clean the image so that OCR (Tesseract) gives an accurate result. I have the following image as an example:
Running this through Tesseract gives a wildly inaccurate result. However, cleaning up the image (using Photoshop) to look as follows:
gives exactly the result I would expect. The first image is already being run through the following method to clean it up to that point:
import org.opencv.core.*;
import org.opencv.imgproc.Imgproc;

public Mat cleanImage(Mat srcImage) {
    // Stretch the intensities to the full 0-255 range
    Core.normalize(srcImage, srcImage, 0, 255, Core.NORM_MINMAX);
    // Binarize with an automatically chosen (Otsu) threshold
    Imgproc.threshold(srcImage, srcImage, 0, 255, Imgproc.THRESH_BINARY | Imgproc.THRESH_OTSU);
    // Erode once, then dilate 9 times, with the default 3x3 kernel ((-1, -1) = centre anchor)
    Imgproc.erode(srcImage, srcImage, new Mat());
    Imgproc.dilate(srcImage, srcImage, new Mat(), new Point(-1, -1), 9);
    return srcImage;
}
What more can I do to clean the first image so it resembles the second image?
Edit: This is the original image before it's run through the cleanImage function.
I think you need to work more on the pre-processing part, to make the image as clean as you can before calling Tesseract.
My ideas for doing that are the following:
1- Extract the contours from the image (check this and this)
2- Each contour has a width, height and area, so you may filter the contours according to width, height and area (check this and this); you may also reuse part of the contour-analysis code here to filter the contours, and you may further delete the contours that do not look like a "letter or number" contour using template contour matching
3- After filtering the contours, check where the letters and the numbers are in the image; for that you may need to use some text detection method like the one here
4- All you need to do now is remove the non-text areas, and the bad contours, from the image
5- Now you can apply your own binarization method, or let Tesseract do the binarization itself, then call the OCR on the image
I'm not sure these are the best steps to do this; you may use just some of them and that may be enough for you. A rough sketch of the contour-filtering part (steps 1, 2 and 4) follows below.
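For illustration only, here is a minimal OpenCV-for-Java sketch of that contour filtering, assuming a binary white-text-on-black image; all the size and area limits are made-up values you would tune on your own images:

import java.util.ArrayList;
import java.util.List;
import org.opencv.core.*;
import org.opencv.imgproc.Imgproc;

// Keep only the contours whose bounding box and area look character-like.
// Every threshold below is an illustrative guess, not a tuned value.
public Mat keepCharacterLikeContours(Mat binary) {
    List<MatOfPoint> contours = new ArrayList<>();
    Imgproc.findContours(binary.clone(), contours, new Mat(),
            Imgproc.RETR_EXTERNAL, Imgproc.CHAIN_APPROX_SIMPLE);

    Mat cleaned = Mat.zeros(binary.size(), binary.type());
    for (MatOfPoint contour : contours) {
        Rect box = Imgproc.boundingRect(contour);
        double area = Imgproc.contourArea(contour);
        // Characters are of moderate size and not extremely wide
        boolean sizeOk = box.height > 10 && box.height < 100
                && box.width < 2 * box.height;
        boolean areaOk = area > 20 && area < 5000;
        if (sizeOk && areaOk) {
            // Copy the accepted region into the output image
            binary.submat(box).copyTo(cleaned.submat(box));
        }
    }
    return cleaned;
}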
Other ideas:
You may tackle this in different ways; the best idea is to find a way to detect the digit and character locations, using methods like template matching, or a feature-based approach like HOG.
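As a hedged illustration of the template-matching idea: the sketch below assumes you have a cropped template image of a character from a clean scan, and the 0.8 score threshold is an arbitrary example value:

import org.opencv.core.*;
import org.opencv.imgproc.Imgproc;

// Slide a character template over the image and return the best match
// location, or null if the correlation score is too weak.
public Point findBestMatch(Mat image, Mat template) {
    Mat scores = new Mat();
    Imgproc.matchTemplate(image, template, scores, Imgproc.TM_CCOEFF_NORMED);
    Core.MinMaxLocResult mmr = Core.minMaxLoc(scores);
    return mmr.maxVal >= 0.8 ? mmr.maxLoc : null; // 0.8 is a guess
}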
You may also first binarize your image to get a binary image, then apply morphological opening with line structuring elements (one horizontal, one vertical); that will help you detect the edges afterwards, segment the image, and then run the OCR.
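A minimal sketch of that opening step; the 25-pixel kernel length is an assumed, tunable value:

import org.opencv.core.*;
import org.opencv.imgproc.Imgproc;

// Opening with line-shaped kernels keeps long horizontal/vertical runs
// (ruling lines, baselines) and suppresses everything shorter.
public Mat extractLines(Mat binary) {
    Mat horizKernel = Imgproc.getStructuringElement(
            Imgproc.MORPH_RECT, new Size(25, 1));
    Mat vertKernel = Imgproc.getStructuringElement(
            Imgproc.MORPH_RECT, new Size(1, 25));

    Mat horizontal = new Mat();
    Mat vertical = new Mat();
    Imgproc.morphologyEx(binary, horizontal, Imgproc.MORPH_OPEN, horizKernel);
    Imgproc.morphologyEx(binary, vertical, Imgproc.MORPH_OPEN, vertKernel);

    Mat lines = new Mat();
    Core.bitwise_or(horizontal, vertical, lines); // combine both masks
    return lines;
}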
After detecting all the contours in the image, you may also use the Hough transform to detect any kind of line or well-defined curve like this one. That way you can detect the characters that are aligned, then segment the image and run the OCR afterwards.
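Roughly like this; the Canny and Hough parameters are all illustrative guesses to adapt:

import org.opencv.core.*;
import org.opencv.imgproc.Imgproc;

// Detect straight line segments; aligned characters and table rules
// show up as long, mostly horizontal segments.
public void detectLines(Mat binary) {
    Mat edges = new Mat();
    Imgproc.Canny(binary, edges, 50, 150);

    Mat lines = new Mat();
    // rho = 1 px, theta = 1 degree; threshold and lengths are guesses
    Imgproc.HoughLinesP(edges, lines, 1, Math.PI / 180, 50, 30, 10);

    for (int i = 0; i < lines.rows(); i++) {
        double[] l = lines.get(i, 0); // x1, y1, x2, y2
        System.out.printf("line (%.0f,%.0f) -> (%.0f,%.0f)%n",
                l[0], l[1], l[2], l[3]);
    }
}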
Much easier way:
1- Do binarization
2- Apply some morphology operations to separate the contours:
3- Invert the colors in the image (this may come before step 2)
4- Find all contours in the image
5- Delete all the contours whose width is greater than their height, the very small contours, the very large ones, and the non-rectangular ones
Note: you may use text detection methods (or HOG or edge detection) instead of steps 4 and 5
6- Find the large rectangle that contains all the remaining contours in the image
7- You may do some extra pre-processing to enhance the input for Tesseract, and then call the OCR. (I advise you to crop the image and use that as the input to the OCR; I mean crop the yellow rectangle, and do not use the whole image as input, just the yellow rectangle, and that will also improve the results.) A sketch of this whole pipeline follows below.
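A minimal sketch of that pipeline, assuming a dark-text-on-light-background grayscale input; every threshold below is an illustrative guess:

import java.util.ArrayList;
import java.util.List;
import org.opencv.core.*;
import org.opencv.imgproc.Imgproc;

public Mat cropTextRegion(Mat gray) {
    // Steps 1 and 3: binarize and invert so text is white on black
    Mat binary = new Mat();
    Imgproc.threshold(gray, binary, 0, 255,
            Imgproc.THRESH_BINARY_INV | Imgproc.THRESH_OTSU);

    // Step 2: a light opening to separate touching blobs
    Mat kernel = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(3, 3));
    Imgproc.morphologyEx(binary, binary, Imgproc.MORPH_OPEN, kernel);

    // Steps 4 and 5: find contours, drop the non-character-like ones
    List<MatOfPoint> contours = new ArrayList<>();
    Imgproc.findContours(binary.clone(), contours, new Mat(),
            Imgproc.RETR_EXTERNAL, Imgproc.CHAIN_APPROX_SIMPLE);

    // Step 6: grow one rectangle around all surviving bounding boxes
    Rect text = null;
    for (MatOfPoint c : contours) {
        Rect b = Imgproc.boundingRect(c);
        if (b.width > b.height || b.height < 10 || b.height > 100) {
            continue; // wider than tall, too small, or too large
        }
        if (text == null) {
            text = b;
        } else {
            int x = Math.min(text.x, b.x), y = Math.min(text.y, b.y);
            int x2 = Math.max(text.x + text.width, b.x + b.width);
            int y2 = Math.max(text.y + text.height, b.y + b.height);
            text = new Rect(x, y, x2 - x, y2 - y);
        }
    }

    // Step 7: crop just the text rectangle as the OCR input
    return (text == null) ? gray : new Mat(gray, text);
}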
Would that image help you?
The algorithm producing that image would be easy to implement. I am sure that, if you tweak some of its parameters, you can get very good results for that kind of image.
I tested all the images with Tesseract:
Just a little bit of thinking out of the box:
I can see from your original image that it's a rather rigorously preformatted document; it looks like a road tax badge or something like that, right?
If the assumption above is correct, then you could implement a less generic solution: the noise you are trying to get rid of is due to features of the specific document template, and it occurs in specific, known regions of your image. In fact, so does the text.
In that case, one way to go about it is to define the boundaries of the regions where you know there is such "noise", and just white them out (see the sketch below).
Then, follow the rest of the steps you are already following: do the noise reduction that will remove the finest detail (i.e. the background pattern that looks like the safety watermark or hologram in the badge). The result should be clear enough for Tesseract to process without trouble.
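A minimal sketch of the white-out step; the rectangle coordinates here are placeholders you would measure once from your document template:

import org.opencv.core.*;
import org.opencv.imgproc.Imgproc;

// Paint known template regions (borders, logos, watermark bands)
// solid white so they cannot confuse the OCR.
public void whiteOutTemplateRegions(Mat image) {
    // Hypothetical coordinates, measured once from the template
    Rect[] noiseRegions = {
            new Rect(0, 0, 640, 40),   // e.g. a header strip
            new Rect(0, 400, 640, 80)  // e.g. a footer ornament
    };
    for (Rect r : noiseRegions) {
        Imgproc.rectangle(image, new Point(r.x, r.y),
                new Point(r.x + r.width, r.y + r.height),
                new Scalar(255, 255, 255), -1 /* filled */);
    }
}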
Just a thought anyway. Not a generic solution, I acknowledge that, so it depends on what your actual requirements are.
My answer is based on the following assumptions. It's possible that none of them holds in your case.
This is my procedure for extracting the digits:
Threshold the distance-transformed image using the stroke-width (= 8) constraint
Apply a morphological operation to disconnect the joined components
Filter the bounding-box heights and make a guess where the digits are
(thresholded results for stroke-width = 8 and for stroke-width = 10)
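A sketch of the first two steps, assuming a white-on-black binary input and the stroke width quoted above; the kernel size is my guess:

import org.opencv.core.*;
import org.opencv.imgproc.Imgproc;

// Keep only strokes of roughly the expected width by thresholding the
// distance transform, then open to disconnect the surviving blobs.
public Mat filterByStrokeWidth(Mat binary, double strokeWidth) {
    Mat dist = new Mat();
    Imgproc.distanceTransform(binary, dist, Imgproc.DIST_L2, 3);

    // A stroke of width w peaks at about w / 2 in the distance map
    Mat mask = new Mat();
    Imgproc.threshold(dist, mask, strokeWidth / 2.0 - 1, 255,
            Imgproc.THRESH_BINARY);
    mask.convertTo(mask, CvType.CV_8U);

    // Morphological opening to disconnect the remaining components
    Mat kernel = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(3, 3));
    Imgproc.morphologyEx(mask, mask, Imgproc.MORPH_OPEN, kernel);
    return mask;
}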
EDIT
Prepare a mask using the convex hull of the found digit contours
Copy digits region to a clean image using the mask
(masked results for stroke-width = 8 and for stroke-width = 10)
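A hedged sketch of those two mask steps, assuming the digit contours have already been selected:

import java.util.List;
import org.opencv.core.*;
import org.opencv.imgproc.Imgproc;

// Build a mask from the convex hull of each accepted digit contour,
// then copy only the masked pixels onto a clean white page.
public Mat copyDigitsThroughMask(Mat gray, List<MatOfPoint> digitContours) {
    Mat mask = Mat.zeros(gray.size(), CvType.CV_8U);
    for (MatOfPoint contour : digitContours) {
        MatOfInt hullIdx = new MatOfInt();
        Imgproc.convexHull(contour, hullIdx);

        // Turn the hull indices back into an actual polygon
        Point[] pts = contour.toArray();
        int[] idx = hullIdx.toArray();
        Point[] hull = new Point[idx.length];
        for (int i = 0; i < idx.length; i++) {
            hull[i] = pts[idx[i]];
        }
        Imgproc.fillConvexPoly(mask, new MatOfPoint(hull), new Scalar(255));
    }

    Mat clean = new Mat(gray.size(), gray.type(), new Scalar(255));
    gray.copyTo(clean, mask); // copy only pixels inside the hulls
    return clean;
}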
My Tesseract knowledge is a bit rusty, but as I remember you can get a confidence level for the characters. You may be able to filter out noise using this information, if you still happen to detect noisy regions as character bounding boxes.
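If you call Tesseract from Java through Tess4J, something along these lines should expose per-symbol confidences (the 60-point cut-off is an arbitrary example):

import java.awt.image.BufferedImage;
import java.util.List;
import net.sourceforge.tess4j.ITessAPI;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.Word;

// Keep only the recognized symbols whose confidence clears a threshold.
public String readConfidentChars(BufferedImage image) {
    Tesseract tesseract = new Tesseract();
    // tesseract.setDatapath(...) may be required depending on your setup
    List<Word> symbols = tesseract.getWords(
            image, ITessAPI.TessPageIteratorLevel.RIL_SYMBOL);

    StringBuilder text = new StringBuilder();
    for (Word symbol : symbols) {
        if (symbol.getConfidence() >= 60f) { // arbitrary cut-off
            text.append(symbol.getText());
        }
    }
    return text.toString();
}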
C++ Code
EDIT
Java Code
The font size should not be too big or too small; approximately, it should be in the range of 10-12 pt (i.e. a character height roughly above 20 and below 80 pixels). You can downsample the image and try Tesseract again. Also, a few fonts are not trained in Tesseract; the issue may arise if yours is not among the trained fonts.
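A one-line sketch of the downsampling suggestion; the 0.5 scale factor is an arbitrary example, pick whatever brings character heights into the 20-80 px range:

import org.opencv.core.*;
import org.opencv.imgproc.Imgproc;

// Shrink the image so character heights land in Tesseract's sweet spot.
public Mat downsample(Mat image, double scale) {
    Mat smaller = new Mat();
    // INTER_AREA is the usual interpolation choice when shrinking
    Imgproc.resize(image, smaller, new Size(), scale, scale, Imgproc.INTER_AREA);
    return smaller;
}

For example, downsample(image, 0.5) halves both dimensions.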