To improve OCR quality, I need to preprocess my scanned images. Sometimes a single scan contains several separate components at different angles (for instance, a few paper documents scanned at one time). Here is an example:
Is it possible to programmatically split such an image into separate images, one per logical document, for example with a tool like ImageMagick? Are there any existing solutions or techniques for this problem?
In ImageMagick 6, you can blur the image enough that the text in each document merges, then threshold so that each text block becomes one large black region on a white background. Then you can use connected-components to find each separate black region (reported as gray(0)) and its bounding box. Finally, crop the original image once per region using its bounding box values.
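The steps above can be sketched in plain Python on a tiny grid. This is only an illustration of the idea, not the ImageMagick command itself: the toy page, the helper names (`box_blur`, `threshold`, `find_regions`, `crop`), the blur radius, and the threshold level are all assumptions made for the example, and a simple box blur stands in for ImageMagick's blur.

```python
from collections import deque

def box_blur(img, radius):
    """Mean filter over a square window; pixels outside the image count as white (1.0)."""
    h, w = len(img), len(img[0])
    n = (2 * radius + 1) ** 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    total += img[yy][xx] if 0 <= yy < h and 0 <= xx < w else 1.0
            out[y][x] = total / n
    return out

def threshold(img, level):
    """Pixels darker than `level` become black (0), everything else white (1)."""
    return [[0 if v < level else 1 for v in row] for row in img]

def find_regions(mask):
    """Bounding boxes (x0, y0, x1, y1) of the 8-connected black regions, in scan order."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] != 0 or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:  # breadth-first flood fill of one region
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and mask[ny][nx] == 0 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes

def crop(img, box):
    """Cut the inclusive bounding box out of the original image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1 + 1] for row in img[y0:y1 + 1]]

# Toy 16x7 page (1 = white, 0 = ink) holding two hypothetical "documents"
page = [[1] * 16 for _ in range(7)]
for y in (1, 3):
    for x in (1, 3, 5):
        page[y][x] = 0          # "text" of the first document
        page[y + 2][x + 9] = 0  # "text" of the second document

mask = threshold(box_blur(page, 1), 0.95)   # blur, then threshold
boxes = find_regions(mask)                  # one box per merged text block
crops = [crop(page, b) for b in boxes]      # crop the ORIGINAL page
print(boxes)  # [(0, 0, 6, 4), (9, 2, 15, 6)]
```

Without the blur, every ink pixel in this toy page is its own region; after it, each document collapses into a single solid block. The boxes come out one pixel larger than the ink on each side because the blur spreads the ink outward, which is the same padding effect a larger blur has on a real scan.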
Input:
Unix Syntax (adjust the blur to be just large enough to keep the text regions solid black):
Textual Listing:
tmp.png showing the blurred and thresholded regions:
Cropped Images:
No, it will not work well, for several reasons. The second image you provided is much larger than the first, so it would need a much larger blur. It is a JPG and has compression artifacts in it. JPG is not a good format for this, since the image in 'constant' regions is not really constant. The blur will pick up the artifacts, and a different threshold will be needed to remove some of them. In your case, the top of the image has a good-sized artifact that will get caught as an object. Finally, the bounding boxes of your blurred and thresholded text regions overlap even though the regions themselves do not touch. Thus one crop may include text from other regions.
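The last point can be illustrated with a small Python check. It is a sketch with made-up coordinates: the two diagonal pixel runs stand in for rotated text blocks, and the helper names (`bounding_box`, `boxes_overlap`, `regions_touch`) are hypothetical.

```python
def bounding_box(pixels):
    """Axis-aligned bounding box (x0, y0, x1, y1) of a set of (x, y) pixels."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return min(xs), min(ys), max(xs), max(ys)

def boxes_overlap(a, b):
    """True if two inclusive bounding boxes share at least one pixel."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

def regions_touch(p, q):
    """True if some pixel of p is equal or 8-adjacent to some pixel of q."""
    return any(abs(px - qx) <= 1 and abs(py - qy) <= 1
               for px, py in p for qx, qy in q)

# Two parallel diagonal runs, standing in for two rotated text blocks
region_a = {(i, i) for i in range(6)}       # from (0, 0) to (5, 5)
region_b = {(i + 3, i) for i in range(6)}   # same slope, shifted right by 3 px

box_a = bounding_box(region_a)   # (0, 0, 5, 5)
box_b = bounding_box(region_b)   # (3, 0, 8, 5)

print(regions_touch(region_a, region_b))  # False: separate connected components
print(boxes_overlap(box_a, box_b))        # True: a crop of one box includes part of the other
```

Because the crop is an axis-aligned rectangle, any rotated region drags in the empty corners of its box, and a neighbouring region can fall inside those corners.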
Here is my test command to blur and threshold your image: