In the example image (just a reference; my images follow the same pattern), one page has full-width text and another has two side-by-side columns of text.
How can I automatically detect the layout of the document and read the columns one after the other in Python?
I am using Tesseract OCR with PSM 6, but it reads each line straight across both columns, which is wrong.
One way to accomplish this is using morphological operations and contour detection.
With the former you essentially "bleed" all characters into a big chunky blob. With the latter, you locate these blobs in your image and extract the ones that seem interesting (meaning: big enough).
Script used:
Then all you need is to compute the bounding box of each contour and crop it from the original image. Add a bit of margin and feed each crop to Tesseract.
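Roughly, that last part could look like the sketch below. The helper names (`pad_box`, `reading_order`, `ocr_blocks`) and the 10-pixel margin are assumptions of mine, and it presumes the `pytesseract` wrapper and the Tesseract binary are installed. Boxes are `(x, y, w, h)` tuples as returned by `cv2.boundingRect`. The ordering step groups boxes that overlap vertically into one band and reads each band left to right, so a full-width block comes out before the two columns below it.

```python
def pad_box(box, margin, img_w, img_h):
    """Grow an (x, y, w, h) box by `margin` pixels, clamped to the image.

    Returns corner coordinates (x0, y0, x1, y1) ready for slicing.
    """
    x, y, w, h = box
    x0 = max(x - margin, 0)
    y0 = max(y - margin, 0)
    x1 = min(x + w + margin, img_w)
    y1 = min(y + h + margin, img_h)
    return x0, y0, x1, y1


def reading_order(boxes):
    """Order boxes top to bottom, and left to right within the same band."""
    bands = []
    for box in sorted(boxes, key=lambda b: b[1]):
        x, y, w, h = box
        if bands and y < bands[-1]["bottom"]:
            # Starts above the bottom of the current band: same row of columns
            bands[-1]["boxes"].append(box)
            bands[-1]["bottom"] = max(bands[-1]["bottom"], y + h)
        else:
            bands.append({"bottom": y + h, "boxes": [box]})
    return [b for band in bands for b in sorted(band["boxes"], key=lambda b: b[0])]


def ocr_blocks(image, boxes, margin=10):
    """Crop each block from `image` (a NumPy array) and OCR it in reading order."""
    import pytesseract  # assumption: pytesseract and the Tesseract binary are installed

    img_h, img_w = image.shape[:2]
    texts = []
    for box in reading_order(boxes):
        x0, y0, x1, y1 = pad_box(box, margin, img_w, img_h)
        crop = image[y0:y1, x0:x1]
        # PSM 6 (single uniform block of text) is fine here: each crop is one column
        texts.append(pytesseract.image_to_string(crop, config="--psm 6"))
    return texts
```

Because each crop now contains a single column, PSM 6 no longer reads across the gutter.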