Remove background noise from image to make text more clear for OCR

Posted 2020-05-11 12:09

I've written an application that segments an image based on the text regions within it, and extracts those regions as I see fit. What I'm attempting to do is clean the image so OCR (Tesseract) gives an accurate result. I have the following image as an example:

enter image description here

Running this through Tesseract gives a wildly inaccurate result. However, cleaning up the image (using Photoshop) to get it as follows:

enter image description here

Gives exactly the result I would expect. The first image is already being run through the following method to clean it to that point:

 public Mat cleanImage(Mat srcImage) {
    // stretch intensities to the full 0-255 range
    Core.normalize(srcImage, srcImage, 0, 255, Core.NORM_MINMAX);
    // binarize with an automatically chosen (Otsu) threshold
    Imgproc.threshold(srcImage, srcImage, 0, 255, Imgproc.THRESH_OTSU);
    // one erosion, then nine dilations with the default 3x3 kernel
    Imgproc.erode(srcImage, srcImage, new Mat());
    Imgproc.dilate(srcImage, srcImage, new Mat(), new Point(0, 0), 9);
    return srcImage;
}

What more can I do to clean the first image so it resembles the second image?

Edit: This is the original image before it's run through the cleanImage function.

enter image description here

5 Answers
爷、活的狠高调
#2 · 2020-05-11 12:22

I think you need to work more on the pre-processing stage, to get the image as clean as possible before passing it to Tesseract.

My ideas for doing that are the following:

1- Extract the contours from the image (check this and this).

2- Each contour has a width, height and area, so you can filter the contours by width, height and area (check this and this). You can also reuse part of the contour-analysis code here to filter the contours, and additionally delete contours that don't resemble a "letter or number" contour, using template contour matching.

3- After filtering the contours, check where the letters and the numbers are in the image; you may need to use some text detection methods like here.

4- All you need now is to remove the non-text areas and the bad contours from the image.

5- Now you can write your own binarization method, or let Tesseract binarize the image, then run the OCR on it.

You don't necessarily need all of these steps; using just some of them may be enough for you.
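The contour-extraction-and-filter idea in steps 1-4 can be sketched without OpenCV. The following is an illustrative, dependency-free Python/NumPy version (the question's code is Java, and the helper names `component_boxes` / `filter_text_like` and the height limits are invented for this sketch): it labels 4-connected foreground components with a flood fill and keeps only boxes with letter-like proportions.

```python
import numpy as np
from collections import deque

def component_boxes(binary):
    """Label 4-connected foreground pixels and return one bounding box
    (x, y, w, h) per connected component."""
    h, w = binary.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y, x] and not seen[y, x]:
                # breadth-first flood fill from this seed pixel
                queue = deque([(y, x)])
                seen[y, x] = True
                ys, xs = [y], [x]
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            ys.append(ny)
                            xs.append(nx)
                            queue.append((ny, nx))
                boxes.append((min(xs), min(ys),
                              max(xs) - min(xs) + 1, max(ys) - min(ys) + 1))
    return boxes

def filter_text_like(boxes, min_h=2, max_h=80):
    """Keep boxes at least as tall as they are wide (letter/digit shaped)
    and whose height is plausible for a glyph."""
    return [b for b in boxes if b[3] >= b[2] and min_h <= b[3] <= max_h]
```

In the real pipeline, OpenCV's findContours and boundingRect replace the flood fill, but the filtering logic is the same.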

Other ideas:

  • You can approach this in different ways; the best idea is to find a way to detect the digit and character locations using different methods, like template matching, or feature-based methods like HOG.

  • You could first binarize the image to get a binary image, then apply opening with a line structuring element (horizontal and vertical). This will help you detect the edges afterwards, so you can segment the image and then run the OCR.

  • After detecting all the contours in the image, you can also use the Hough transform to detect lines and defined curves like this one. That way you can detect characters that are aligned, then segment the image and run the OCR afterwards.
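The "opening with a line structuring element" from the second bullet can be illustrated in a few lines of NumPy. This is a sketch, not the OpenCV call (`open_horizontal` is a hypothetical helper): opening with a 1 x k horizontal kernel keeps only horizontal runs at least k pixels wide, so thin vertical noise and isolated specks disappear while horizontal strokes survive.

```python
import numpy as np

def _shift_cols(b, dx):
    """Shift a boolean image dx columns right (negative = left), zero-filling."""
    out = np.zeros_like(b)
    if dx > 0:
        out[:, dx:] = b[:, :-dx]
    elif dx < 0:
        out[:, :dx] = b[:, -dx:]
    else:
        out = b.copy()
    return out

def open_horizontal(b, k):
    """Morphological opening with a 1 x k horizontal line structuring element:
    erosion (AND over the window) followed by dilation (OR over the window)."""
    r = k // 2
    eroded = np.ones_like(b)
    for dx in range(-r, r + 1):
        eroded &= _shift_cols(b, dx)
    dilated = np.zeros_like(b)
    for dx in range(-r, r + 1):
        dilated |= _shift_cols(eroded, dx)
    return dilated
```

The equivalent OpenCV call would use getStructuringElement(MORPH_RECT, Size(k, 1)) with morphologyEx(..., MORPH_OPEN, ...).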

Much easier way:

1- Do binarization

enter image description here

2- Some morphology operations to separate the contours:

enter image description here

3- Invert the colors in the image (this may be done before step 2)

enter image description here

4- Find all contours in the image

enter image description here

5- Delete every contour whose width is greater than its height, the very small contours, the very large ones, and the non-rectangular contours

enter image description here

Note: you can use text detection methods (or HOG or edge detection) instead of steps 4 and 5

6- Find the large rectangle that contains all the remaining contours in the image

enter image description here

7- You can do some extra pre-processing to enhance the input before calling the OCR. (I advise you to crop the yellow rectangle and feed only that to the OCR, not the whole image; that will also improve the results.)
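Step 1's binarization is what Imgproc.threshold(..., THRESH_OTSU) computes. For illustration, here is the Otsu criterion written out in plain NumPy — a sketch assuming an 8-bit grayscale input, not the OpenCV implementation:

```python
import numpy as np

def otsu_threshold(gray):
    """Pick the threshold that maximizes between-class variance (Otsu).
    `gray` is a uint8 array; returns the chosen threshold in [0, 255]."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    best_t, best_var = 0, -1.0
    w0 = 0.0    # weight (pixel count) of the background class
    sum0 = 0.0  # intensity sum of the background class
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        m0 = sum0 / w0                  # background mean
        m1 = (sum_all - sum0) / w1      # foreground mean
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(gray):
    """Threshold at the Otsu value: True = foreground."""
    return gray > otsu_threshold(gray)
```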

不美不萌又怎样
#3 · 2020-05-11 12:23

Would that image help you?

enter image description here

The algorithm producing that image would be easy to implement. I am sure that if you tweak some of its parameters, you can get very good results for this kind of image.

I tested all the images with tesseract:

  • Original image : Nothing detected
  • Processed image #1 : Nothing detected
  • Processed image #2 : 12-14 (exact match)
  • My processed image : y’1'2-14/j
Melony?
#4 · 2020-05-11 12:23

Just a little bit of thinking out of the box:

I can see from your original image that it's a rather rigorously preformatted document; it looks like a road tax badge or something like that, right?

If the assumption above is correct, then you could implement a less generic solution: the noise you are trying to get rid of comes from features of the specific document template, so it occurs in specific, known regions of your image. In fact, so does the text.

In that case, one way to go about it is to define the boundaries of the regions where you know there is such "noise", and just white them out.

Then, follow the rest of the steps that you are already following: Do the noise reduction that will remove the finest detail (i.e. the background pattern that looks like the safety watermark or hologram in the badge). The result should be clear enough for Tesseract to process without trouble.
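As a minimal sketch of the whiting-out idea, in Python/NumPy and assuming you have measured the template regions by hand (the coordinates in NOISE_REGIONS below are invented placeholders, not anything from the actual badge):

```python
import numpy as np

# hypothetical (x, y, w, h) regions where the badge artwork lives;
# real values would come from measuring your document template
NOISE_REGIONS = [(0, 0, 40, 12), (5, 30, 30, 8)]

def white_out(gray, regions, white=255):
    """Overwrite known non-text regions with the background colour."""
    out = gray.copy()
    for x, y, w, h in regions:
        out[y:y + h, x:x + w] = white
    return out
```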

Just a thought anyway. Not a generic solution, I acknowledge that, so it depends on what your actual requirements are.

Melony?
#5 · 2020-05-11 12:32

My answer is based on the following assumptions. It's possible that none of them holds in your case.

  • It's possible for you to impose a threshold on bounding-box heights in the segmented region. Then you should be able to filter out other components.
  • You know the average stroke width of the digits. Use this information to minimize the chance that the digits are connected to other regions. You can use the distance transform and morphological operations for this.

This is my procedure for extracting the digits:

  • Apply Otsu threshold to the image

  • Take the distance transform

  • Threshold the distance-transformed image using the stroke-width (= 8) constraint

  • Apply a morphological operation to disconnect components

  • Filter bounding-box heights and make a guess where the digits are

Results for stroke-width = 8 and stroke-width = 10:
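To make the stroke-width thresholding concrete: here is a dependency-free stand-in for OpenCV's distanceTransform, using a two-pass city-block (chamfer) approximation in Python/NumPy. Thresholding the distance map at stroke_width / 2 keeps only the cores of strokes at least that thick, which is what removes the thin background pattern. This is only a sketch; the real answer below uses CV_DIST_L2 with a precise mask.

```python
import numpy as np

def manhattan_dist(binary):
    """City-block distance of each foreground pixel to the nearest
    background pixel (two-pass chamfer approximation)."""
    h, w = binary.shape
    inf = h + w  # larger than any possible distance
    d = np.where(binary, inf, 0).astype(int)
    # forward pass: propagate distances from top-left
    for y in range(h):
        for x in range(w):
            if d[y, x]:
                if y > 0:
                    d[y, x] = min(d[y, x], d[y - 1, x] + 1)
                if x > 0:
                    d[y, x] = min(d[y, x], d[y, x - 1] + 1)
    # backward pass: propagate distances from bottom-right
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            if d[y, x]:
                if y < h - 1:
                    d[y, x] = min(d[y, x], d[y + 1, x] + 1)
                if x < w - 1:
                    d[y, x] = min(d[y, x], d[y, x + 1] + 1)
    return d

def stroke_filter(binary, stroke_width):
    """Keep only pixels deeper than half the expected stroke width,
    mirroring threshold(dist, SWTHRESH/2, ...) in the code below."""
    return manhattan_dist(binary) > stroke_width / 2
```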

EDIT

  • Prepare a mask using the convex hull of the found digit contours

  • Copy the digits region to a clean image using the mask

Results for stroke-width = 8:

and stroke-width = 10:

My Tesseract knowledge is a bit rusty, but as I remember you can get a confidence level per character. You may be able to filter out noise using this information if you still detect noisy regions as character bounding boxes.

C++ Code

Mat im = imread("aRh8C.png", 0);
// apply Otsu threshold
Mat bw;
threshold(im, bw, 0, 255, CV_THRESH_BINARY_INV | CV_THRESH_OTSU);
// take the distance transform
Mat dist;
distanceTransform(bw, dist, CV_DIST_L2, CV_DIST_MASK_PRECISE);
Mat dibw;
// threshold the distance transformed image
double SWTHRESH = 8;    // stroke width threshold
threshold(dist, dibw, SWTHRESH/2, 255, CV_THRESH_BINARY);
Mat kernel = getStructuringElement(MORPH_RECT, Size(3, 3));
// perform opening, in case digits are still connected
Mat morph;
morphologyEx(dibw, morph, CV_MOP_OPEN, kernel);
dibw.convertTo(dibw, CV_8U);
// find contours and filter
Mat cont;
morph.convertTo(cont, CV_8U);

Mat binary;
cvtColor(dibw, binary, CV_GRAY2BGR);

const double HTHRESH = im.rows * .5;    // height threshold
vector<vector<Point>> contours;
vector<Vec4i> hierarchy;
vector<Point> digits; // points corresponding to digit contours

findContours(cont, contours, hierarchy, CV_RETR_CCOMP, CV_CHAIN_APPROX_SIMPLE, Point(0, 0));
for(int idx = 0; idx >= 0; idx = hierarchy[idx][0])
{
    Rect rect = boundingRect(contours[idx]);
    if (rect.height > HTHRESH)
    {
        // append the points of this contour to digit points
        digits.insert(digits.end(), contours[idx].begin(), contours[idx].end());

        rectangle(binary, 
            Point(rect.x, rect.y), Point(rect.x + rect.width - 1, rect.y + rect.height - 1),
            Scalar(0, 0, 255), 1);
    }
}

// take the convexhull of the digit contours
vector<Point> digitsHull;
convexHull(digits, digitsHull);
// prepare a mask
vector<vector<Point>> digitsRegion;
digitsRegion.push_back(digitsHull);
Mat digitsMask = Mat::zeros(im.rows, im.cols, CV_8U);
drawContours(digitsMask, digitsRegion, 0, Scalar(255, 255, 255), -1);
// expand the mask to include any information we lost in earlier morphological opening
morphologyEx(digitsMask, digitsMask, CV_MOP_DILATE, kernel);
// copy the region to get a cleaned image
Mat cleaned = Mat::zeros(im.rows, im.cols, CV_8U);
dibw.copyTo(cleaned, digitsMask);

EDIT

Java Code

Mat im = Highgui.imread("aRh8C.png", 0);
// apply Otsu threshold
Mat bw = new Mat(im.size(), CvType.CV_8U);
Imgproc.threshold(im, bw, 0, 255, Imgproc.THRESH_BINARY_INV | Imgproc.THRESH_OTSU);
// take the distance transform
Mat dist = new Mat(im.size(), CvType.CV_32F);
Imgproc.distanceTransform(bw, dist, Imgproc.CV_DIST_L2, Imgproc.CV_DIST_MASK_PRECISE);
// threshold the distance transform
Mat dibw32f = new Mat(im.size(), CvType.CV_32F);
final double SWTHRESH = 8.0;    // stroke width threshold
Imgproc.threshold(dist, dibw32f, SWTHRESH/2.0, 255, Imgproc.THRESH_BINARY);
Mat dibw8u = new Mat(im.size(), CvType.CV_8U);
dibw32f.convertTo(dibw8u, CvType.CV_8U);

Mat kernel = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(3, 3));
// open to remove connections to stray elements
Mat cont = new Mat(im.size(), CvType.CV_8U);
Imgproc.morphologyEx(dibw8u, cont, Imgproc.MORPH_OPEN, kernel);
// find contours and filter based on bounding-box height
final double HTHRESH = im.rows() * 0.5; // bounding-box height threshold
List<MatOfPoint> contours = new ArrayList<MatOfPoint>();
List<Point> digits = new ArrayList<Point>();    // contours of the possible digits
Imgproc.findContours(cont, contours, new Mat(), Imgproc.RETR_CCOMP, Imgproc.CHAIN_APPROX_SIMPLE);
for (int i = 0; i < contours.size(); i++)
{
    if (Imgproc.boundingRect(contours.get(i)).height > HTHRESH)
    {
        // this contour passed the bounding-box height threshold. add it to digits
        digits.addAll(contours.get(i).toList());
    }   
}
// find the convexhull of the digit contours
MatOfInt digitsHullIdx = new MatOfInt();
MatOfPoint hullPoints = new MatOfPoint();
hullPoints.fromList(digits);
Imgproc.convexHull(hullPoints, digitsHullIdx);
// convert hull index to hull points
List<Point> digitsHullPointsList = new ArrayList<Point>();
List<Point> points = hullPoints.toList();
for (Integer i: digitsHullIdx.toList())
{
    digitsHullPointsList.add(points.get(i));
}
MatOfPoint digitsHullPoints = new MatOfPoint();
digitsHullPoints.fromList(digitsHullPointsList);
// create the mask for digits
List<MatOfPoint> digitRegions = new ArrayList<MatOfPoint>();
digitRegions.add(digitsHullPoints);
Mat digitsMask = Mat.zeros(im.size(), CvType.CV_8U);
Imgproc.drawContours(digitsMask, digitRegions, 0, new Scalar(255, 255, 255), -1);
// dilate the mask to capture any info we lost in earlier opening
Imgproc.morphologyEx(digitsMask, digitsMask, Imgproc.MORPH_DILATE, kernel);
// cleaned image ready for OCR
Mat cleaned = Mat.zeros(im.size(), CvType.CV_8U);
dibw8u.copyTo(cleaned, digitsMask);
// feed cleaned to Tesseract
Emotional °昔
#6 · 2020-05-11 12:35

The font size should not be too big or too small; it should be approximately in the 10-12 pt range (i.e. character height roughly between 20 and 80 px). You can down-sample the image and try Tesseract again. Also, some fonts are not among Tesseract's trained fonts; issues may arise if your font is not one of them.
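The resizing suggestion can be sketched as follows in Python/NumPy (nearest-neighbour only; real code would use Imgproc.resize with interpolation, and target_h = 40 is just an invented midpoint of the 20-80 px range):

```python
import numpy as np

def rescale_for_ocr(gray, char_h, target_h=40):
    """Nearest-neighbour rescale so the estimated character height
    lands near target_h pixels."""
    scale = target_h / float(char_h)
    h, w = gray.shape
    nh = max(1, int(h * scale))
    nw = max(1, int(w * scale))
    # map each output pixel back to its source pixel
    rows = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    return gray[np.ix_(rows, cols)]
```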
