I'm using tesseract-ocr-3.01 to scan many forms. The forms all follow a template, so I already know where the regions/rectangles of text are.
Is there a way to pass those regions to tesseract when using the command-line tool?
I found the answer, thanks to this thread.
It seems that tesseract supports the uzn format (used in the UNLV tests).
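A minimal sketch of how that could look from a script, under several assumptions: that each line of a uzn file is `left top width height label` in pixels (the label being a single word), that tesseract picks up a uzn file sharing the image's base name when run with page segmentation mode 4, and that the option is spelled `-psm` as in the 3.x releases (newer releases use `--psm`). The region coordinates and the helper names `uzn_text`, `write_uzn`, `tesseract_command` and `run` are made up for illustration.

```python
import subprocess
from pathlib import Path

# Hypothetical template regions: (left, top, width, height, label), in pixels.
REGIONS = [
    (100, 50, 400, 60, "Name"),
    (100, 150, 400, 60, "Date"),
]


def uzn_text(regions):
    """Build uzn content: one 'left top width height label' line per region."""
    return "".join(
        f"{left} {top} {width} {height} {label}\n"
        for left, top, width, height, label in regions
    )


def write_uzn(image_path, regions):
    """Write the zone file next to the image, sharing its base name."""
    uzn_path = Path(image_path).with_suffix(".uzn")
    uzn_path.write_text(uzn_text(regions), encoding="ascii")
    return uzn_path


def tesseract_command(image_path, output_base):
    """Command line for tesseract 3.x; assumes mode 4 makes it read the uzn file."""
    return ["tesseract", str(image_path), str(output_base), "-psm", "4"]


def run(image_path, regions, output_base):
    """Write the uzn file, then call the tesseract binary found on PATH."""
    write_uzn(image_path, regions)
    subprocess.run(tesseract_command(image_path, output_base), check=True)
```

Calling `run("input.tif", REGIONS, "output")` would write `input.uzn` beside the image and, if the assumptions hold, leave the recognised text in `output.txt`.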
From the thread:
Example: If we have C:\input.tif and C:\input.uzn, we do this:

This may not be an optimal answer, but here goes:
I'm not sure whether the command-line tool has options to specify text regions.
What you can do is use a Tesseract wrapper on another platform (EmguCV has Tesseract built in). You take the scanned image, crop out the text regions, and give them to Tesseract one at a time. This way you also avoid any inaccuracies in Tesseract's page-layout analysis.
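The crop-and-recognise idea can be sketched in Python instead of EmguCV. This is only an illustration under assumptions: Pillow is installed, the `tesseract` binary is on the PATH and can read TIFF files, and regions are given as `(left, top, width, height)` in pixels. The helper names `crop_box` and `ocr_regions` and the region dictionary are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def crop_box(region):
    """Convert (left, top, width, height) to Pillow's (left, upper, right, lower)."""
    left, top, width, height = region
    return (left, top, left + width, top + height)


def ocr_regions(image_path, regions):
    """Crop each named region and run the tesseract CLI on it separately.

    `regions` maps a field name to (left, top, width, height).
    Returns a dict mapping each field name to the recognised text.
    """
    from PIL import Image  # assumption: Pillow is available

    results = {}
    with Image.open(image_path) as page, tempfile.TemporaryDirectory() as tmp:
        for name, region in regions.items():
            crop_path = Path(tmp) / f"{name}.tif"
            page.crop(crop_box(region)).save(crop_path)
            out_base = Path(tmp) / name
            # tesseract writes its result to <out_base>.txt
            subprocess.run(["tesseract", str(crop_path), str(out_base)], check=True)
            text = Path(f"{out_base}.txt").read_text(encoding="utf-8")
            results[name] = text.strip()
    return results
```

For a form template this might be called as `ocr_regions("input.tif", {"name": (100, 50, 400, 60)})`. Because each crop holds only one field, tesseract never has to work out the page layout itself.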
eg.