Remove background noise from image to make text mo

I've written an application that segments an image based on the text regions within it, and extracts those regions as I see fit. What I'm attempting to do is clean the image so OCR (Tesseract) gives an accurate result. I have the following image as an example:

Running this through tesseract gives a widely inaccurate result. However cleaning up the image (using photoshop) to get the image as follows:

Gives exactly the result I would expect. The first image is already being run through the following method to clean it to that point:

 public Mat cleanImage (Mat srcImage) {
    Core.normalize(srcImage, srcImage, 0, 255, Core.NORM_MINMAX);
    Imgproc.threshold(srcImage, srcImage, 0, 255, Imgproc.THRESH_OTSU);
    Imgproc.erode(srcImage, srcImage, new Mat());
    Imgproc.dilate(srcImage, srcImage, new Mat(), new Point(0, 0), 9);
    return srcImage;
}

What more can I do to clean the first image so it resembles the second image?

Edit: This is the original image before it's run through the cleanImage function.

标签： java c++ opencv ocr

5条回答

爷、活的狠高调

2楼-- · 2020-05-11 12:22

I think you need to work more on the pre-processing part to prepare the image to be clear as much as you can before calling the tesseract.

What's my ideas to do that are the following:

1- Extract contours from the image and find contours in the image (check this) and this

2- Each contours have width, height and area, so you may filter the contours according to the width, height and its area (check this and this), plus you may use some part of the contour analysis code here to filter the contours and more you may delete the contours that are not similar to a "letter or number" contour using a template contour matching.

3- After filter the contour you may check where are the letters and the numbers in this image, so you may need to use some text detection methods like here

4- All what you need now if to remove the non-text area, and the contours that are not good from the image

5- Now you can create your binirization method or you may use the tesseract one to do the binirization to the image then call the OCR on the image.

Sure these are the best steps to do this, you may use some of them and it may enough for you.

Other ideas:

You may use different ways to do this the best idea is to find a way to detect the digit and character location using different methods like template matching, or feature based like HOG.
You may first to do binarization to your image and get the binary image, then you need to apply opening with line structural for the horizontal and vertical and this will help you to detect the edges after that and do the segmentation on the image then the OCR.
After detecting all the contours in the image, you also may use Hough transformation to detect any kind of line and defined curve like this one, and in this way you can detect the characters that are a lined so you may segment the image and do the OCR after that.

Much easier way:

1- Do binirization

2- Some morphology operation to separate the contours:

3- Inverse the color in the image (this may be before step 2)

4- Find all contours in the image

5- Delete all the contours that width is more than its height, delete the very small contours, the very large ones, and the not rectangle contours

Note : you may use the text detection methods (or using HOG or edge detection) instead of step 4 and 5

6- Find the large rectangle that contain all the remaining contours in the image

7- You may do some extra pre-processing to enhance the input for the tesseract then you may call the OCR now. (I advice you to crop the image and make it as an input to the OCR [I mean crop the yellow rectangle and do not make the whole image as an input just the yellow rectangle and that will enhance the results also])

0人赞添加讨论(0) 举报

不美不萌又怎样

3楼-- · 2020-05-11 12:23

Would that image help you?

The algorithm producing that image would be easy to implement. I am sure, if you tweak some of its parameters, you can get very good results for that kind of images.

I tested all the images with tesseract:

Original image : Nothing detected
Processed image #1 : Nothing detected
Processed image #2 : 12-14 (exact match)
My processed image : y’1'2-14/j

0人赞添加讨论(0) 举报

Melony?

4楼-- · 2020-05-11 12:23

Just a little bit of thinking out of the box:

I can see from your original image that it's a rather rigorously preformatted document, looks like a road tax badge or something like that, right?

If the assumption above is correct, then you could implement a less generic solution: The noise you are trying to get rid of is due to features of the specific document template, it occurs in specific and known regions of your image. In fact, so does the text.

In that case, one of the ways to go about is define the boundaries of the regions where you know that there is such "noise", and just white them out.

Then, follow the rest of the steps that you are already following: Do the noise reduction that will remove the finest detail (i.e. the background pattern that looks like the safety watermark or hologram in the badge). The result should be clear enough for Tesseract to process without trouble.

Just a thought anyway. Not a generic solution, I acknowledge that, so it depends on what your actual requirements are.

0人赞添加讨论(0) 举报

Melony?

5楼-- · 2020-05-11 12:32

My answer is based on following assumptions. It's possible that none of them holds in your case.

It's possible for you to impose a threshold for bounding box heights in the segmented region. Then you should be able to filter out other components.
You know the average stroke widths of the digits. Use this information to minimize the chance that the digits are connected to other regions. You can use distance transform and morphological operations for this.

This is my procedure for extracting the digits:

Apply Otsu threshold to the image
Take the distance transform
Threshold the distance transformed image using the stroke-width ( = 8) constraint
Apply morphological operation to disconnect
Filter bounding box heights and make a guess where the digits are

stroke-width = 8 stroke-width = 10

EDIT

Prepare a mask using the convexhull of the found digit contours
Copy digits region to a clean image using the mask

stroke-width = 8

stroke-width = 10

My Tesseract knowledge is a bit rusty. As I remember you can get a confidence level for the characters. You may be able to filter out noise using this information if you still happen to detect noisy regions as character bounding boxes.

C++ Code

Mat im = imread("aRh8C.png", 0);
// apply Otsu threshold
Mat bw;
threshold(im, bw, 0, 255, CV_THRESH_BINARY_INV | CV_THRESH_OTSU);
// take the distance transform
Mat dist;
distanceTransform(bw, dist, CV_DIST_L2, CV_DIST_MASK_PRECISE);
Mat dibw;
// threshold the distance transformed image
double SWTHRESH = 8;    // stroke width threshold
threshold(dist, dibw, SWTHRESH/2, 255, CV_THRESH_BINARY);
Mat kernel = getStructuringElement(MORPH_RECT, Size(3, 3));
// perform opening, in case digits are still connected
Mat morph;
morphologyEx(dibw, morph, CV_MOP_OPEN, kernel);
dibw.convertTo(dibw, CV_8U);
// find contours and filter
Mat cont;
morph.convertTo(cont, CV_8U);

Mat binary;
cvtColor(dibw, binary, CV_GRAY2BGR);

const double HTHRESH = im.rows * .5;    // height threshold
vector<vector<Point>> contours;
vector<Vec4i> hierarchy;
vector<Point> digits; // points corresponding to digit contours

findContours(cont, contours, hierarchy, CV_RETR_CCOMP, CV_CHAIN_APPROX_SIMPLE, Point(0, 0));
for(int idx = 0; idx >= 0; idx = hierarchy[idx][0])
{
    Rect rect = boundingRect(contours[idx]);
    if (rect.height > HTHRESH)
    {
        // append the points of this contour to digit points
        digits.insert(digits.end(), contours[idx].begin(), contours[idx].end());

        rectangle(binary, 
            Point(rect.x, rect.y), Point(rect.x + rect.width - 1, rect.y + rect.height - 1),
            Scalar(0, 0, 255), 1);
    }
}

// take the convexhull of the digit contours
vector<Point> digitsHull;
convexHull(digits, digitsHull);
// prepare a mask
vector<vector<Point>> digitsRegion;
digitsRegion.push_back(digitsHull);
Mat digitsMask = Mat::zeros(im.rows, im.cols, CV_8U);
drawContours(digitsMask, digitsRegion, 0, Scalar(255, 255, 255), -1);
// expand the mask to include any information we lost in earlier morphological opening
morphologyEx(digitsMask, digitsMask, CV_MOP_DILATE, kernel);
// copy the region to get a cleaned image
Mat cleaned = Mat::zeros(im.rows, im.cols, CV_8U);
dibw.copyTo(cleaned, digitsMask);

EDIT

Java Code

Mat im = Highgui.imread("aRh8C.png", 0);
// apply Otsu threshold
Mat bw = new Mat(im.size(), CvType.CV_8U);
Imgproc.threshold(im, bw, 0, 255, Imgproc.THRESH_BINARY_INV | Imgproc.THRESH_OTSU);
// take the distance transform
Mat dist = new Mat(im.size(), CvType.CV_32F);
Imgproc.distanceTransform(bw, dist, Imgproc.CV_DIST_L2, Imgproc.CV_DIST_MASK_PRECISE);
// threshold the distance transform
Mat dibw32f = new Mat(im.size(), CvType.CV_32F);
final double SWTHRESH = 8.0;    // stroke width threshold
Imgproc.threshold(dist, dibw32f, SWTHRESH/2.0, 255, Imgproc.THRESH_BINARY);
Mat dibw8u = new Mat(im.size(), CvType.CV_8U);
dibw32f.convertTo(dibw8u, CvType.CV_8U);

Mat kernel = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(3, 3));
// open to remove connections to stray elements
Mat cont = new Mat(im.size(), CvType.CV_8U);
Imgproc.morphologyEx(dibw8u, cont, Imgproc.MORPH_OPEN, kernel);
// find contours and filter based on bounding-box height
final double HTHRESH = im.rows() * 0.5; // bounding-box height threshold
List<MatOfPoint> contours = new ArrayList<MatOfPoint>();
List<Point> digits = new ArrayList<Point>();    // contours of the possible digits
Imgproc.findContours(cont, contours, new Mat(), Imgproc.RETR_CCOMP, Imgproc.CHAIN_APPROX_SIMPLE);
for (int i = 0; i < contours.size(); i++)
{
    if (Imgproc.boundingRect(contours.get(i)).height > HTHRESH)
    {
        // this contour passed the bounding-box height threshold. add it to digits
        digits.addAll(contours.get(i).toList());
    }   
}
// find the convexhull of the digit contours
MatOfInt digitsHullIdx = new MatOfInt();
MatOfPoint hullPoints = new MatOfPoint();
hullPoints.fromList(digits);
Imgproc.convexHull(hullPoints, digitsHullIdx);
// convert hull index to hull points
List<Point> digitsHullPointsList = new ArrayList<Point>();
List<Point> points = hullPoints.toList();
for (Integer i: digitsHullIdx.toList())
{
    digitsHullPointsList.add(points.get(i));
}
MatOfPoint digitsHullPoints = new MatOfPoint();
digitsHullPoints.fromList(digitsHullPointsList);
// create the mask for digits
List<MatOfPoint> digitRegions = new ArrayList<MatOfPoint>();
digitRegions.add(digitsHullPoints);
Mat digitsMask = Mat.zeros(im.size(), CvType.CV_8U);
Imgproc.drawContours(digitsMask, digitRegions, 0, new Scalar(255, 255, 255), -1);
// dilate the mask to capture any info we lost in earlier opening
Imgproc.morphologyEx(digitsMask, digitsMask, Imgproc.MORPH_DILATE, kernel);
// cleaned image ready for OCR
Mat cleaned = Mat.zeros(im.size(), CvType.CV_8U);
dibw8u.copyTo(cleaned, digitsMask);
// feed cleaned to Tesseract

0人赞添加讨论(0) 举报

Emotional °昔

6楼-- · 2020-05-11 12:35

The font size should not be so big or small, approximately it should in range of 10-12 pt(i.e, character height approximately above 20 and less than 80). you can down sample the image and try with tesseract. And few fonts are not trained in tesseract, the issue may arise if it is not in that trained fonts.

0人赞添加讨论(0) 举报

Remove background noise from image to make text mo

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间