Image cleaning before OCR application

I have been experimenting with PyTesser for the past couple of hours and it is a really nice tool. A couple of things I noticed about the accuracy of PyTesser:

1) File with icons, images and text: 5-10% accurate
2) File with only text (images and icons erased): 50-60% accurate
3) File with stretching (and this is the best part): stretching the file in 2) above on the x or y axis increased the accuracy by 10-20%

So apparently PyTesser does not account for font dimensions or image stretching. Although there is much theory to be read about image processing and OCR, are there any standard image cleanup procedures (apart from erasing icons and images) that need to be done before applying PyTesser or other libraries, irrespective of the language?

...........

Wow, this post is quite old now. I started researching OCR again over the last couple of days. This time I chucked PyTesser and used the Tesseract engine with ImageMagick instead. Coming straight to the point, this is what I found:

1) You can increase the resolution with ImageMagick (there are a bunch of simple shell commands you can use).
2) After increasing the resolution, the accuracy went up by 80-90%.
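
As a rough illustration of point 1, here is a minimal Python sketch that builds (and optionally runs) an ImageMagick resize command. The post does not say which commands were actually used, so the 300% factor, the file names and the helper names `build_upscale_command` and `upscale` are all assumptions for illustration. It also assumes the ImageMagick 6 style `convert` executable; on ImageMagick 7 you would likely pass `executable="magick"` instead.

```python
import shutil
import subprocess


def build_upscale_command(src, dst, percent=300, executable="convert"):
    """Build an ImageMagick command line that enlarges src by `percent`.

    Hypothetical helper: the 300% default is only a starting point to tune.
    """
    if percent <= 0:
        raise ValueError("percent must be positive")
    # `-resize N%` scales width and height by the same percentage.
    return [executable, src, "-resize", f"{percent}%", dst]


def upscale(src, dst, percent=300, executable="convert"):
    """Run the resize command; requires ImageMagick to be installed."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable!r} not found on PATH")
    subprocess.run(
        build_upscale_command(src, dst, percent, executable), check=True
    )


if __name__ == "__main__":
    # Only prints the command, so this runs even without ImageMagick.
    print(" ".join(build_upscale_command("scan.png", "scan_big.png")))
```

The enlarged file can then be passed to Tesseract in place of the original. Which scale factor helps most probably depends on the source image, so it is worth trying a few values.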

So the Tesseract engine is without doubt the best open source OCR engine on the market. No prior image cleaning was required here. The caveat is that it does not work on files with a lot of embedded images, and I couldn't figure out a way to train Tesseract to ignore them. Also, the text layout and formatting in the image make a big difference. It works great with images that contain just text. Hope this helped.

Tags: python image-processing ocr

3 Answers

We Are One

#2 · 2019-03-09 12:46

Not sure if your intent is for commercial use or not, but this works wonders if you're performing OCR on a bunch of similar images.

http://www.fmwconcepts.com/imagemagick/textcleaner/index.php

ORIGINAL

After Pre-Processing with given arguments.


老娘就宠你

#3 · 2019-03-09 12:48

I know it's not a perfect answer, but I'd like to share a video I saw from PyCon 2013 that might be applicable. It's a little short on implementation details, but it might give you some guidance or inspiration on how to solve or improve your problem.

Link to Video

Link to Presentation

And if you do decide to use ImageMagick to pre-process your source images a little, here is a question that points you to nice Python bindings for it.

On a side note, one quite important thing with Tesseract: you need to train it, otherwise it won't be nearly as accurate as it's capable of being.


Viruses.

#4 · 2019-03-09 12:58

As it turns out, the tesseract wiki has an article that answers this question in the best way I can imagine:

Illustrated guide about "Improving the quality of the [OCR] output".
Question "image processing to improve tesseract OCR accuracy" may also be of interest.

(initial answer, just for the record)

I haven't used PyTesser, but I have done some experiments with tesseract (version: 3.02.02).

If you invoke tesseract on a color image, it first applies global Otsu's method to binarize it, and then the actual character recognition is run on the binary (black and white) image.

Image from: http://scikit-image.org/docs/dev/auto_examples/plot_local_otsu.html

Otsu's threshold illustration

As can be seen, 'global Otsu' may not always produce a desirable result.

To better understand what tesseract 'sees', apply Otsu's method to your image and then look at the resulting image.

In conclusion: the most straightforward way to improve the recognition rate is to binarize the images yourself (most likely you will have to find a good threshold by trial and error) and then pass those binarized images to tesseract.
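
To make the binarization step concrete, here is a minimal pure-Python sketch of global Otsu thresholding. It is a textbook version and is not claimed to match tesseract's internal implementation. It assumes the image has already been loaded as a flat list of 8-bit grayscale values (0-255); loading and saving images is left out, and the function names `otsu_threshold` and `binarize` are made up for this example.

```python
def otsu_threshold(pixels):
    """Return the gray level t that maximizes the between-class variance.

    Pixels with value <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_low = 0  # number of pixels with value <= t
    sum_low = 0     # sum of gray levels of those pixels
    for t in range(256):
        weight_low += hist[t]
        if weight_low == 0:
            continue  # no pixels in the lower class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # all pixels are in the lower class
        sum_low += t * hist[t]
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        between_var = weight_low * weight_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_var = between_var
            best_t = t
    return best_t


def binarize(pixels, threshold):
    """Map pixels above the threshold to white (255), the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]


if __name__ == "__main__":
    sample = [10] * 50 + [200] * 50  # synthetic two-level "image"
    t = otsu_threshold(sample)
    print(t, binarize(sample, t)[:3], binarize(sample, t)[-3:])
```

For the trial-and-error approach, call `binarize` with hand-picked thresholds instead of the Otsu value and compare the OCR output for each.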

Somebody was kind enough to publish API docs for tesseract, so it is possible to verify the previous statements about the processing pipeline: ProcessPage -> GetThresholdedImage -> ThresholdToPix -> OtsuThresholdRectToPix
