I have a TIFF file that contains text in columns separated by tabs (4 spaces). But when I extract the text from this TIFF image, I always get a single space between two columns. A sample example:
TIFF image:
col-a    col-b    col-c
Desired output:
col-a    col-b    col-c
But I am getting the following:
col-a col-b col-c
I tried this with multiple images of the same format, but the result is always the same. How do I fix this issue? Can I train Tesseract to understand this?
After a lot of research I found the solution. Here are the steps to follow:
Upgrade Tesseract to 3.04.
Create config.txt in the directory that contains the input image file.
In the config file, define "preserve_interword_spaces".
After the word preserve_interword_spaces, give either 0 or 1 (1 keeps the spaces). Ex:
preserve_interword_spaces 0
or
preserve_interword_spaces 1
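The steps above can be sketched as a small Python script. This is only a rough illustration, assuming Tesseract 3.04 or later is installed and on your PATH; the helper names (write_config, build_command, run_ocr) and the file names in the usage note are made up for the example and are not part of Tesseract.

```python
import subprocess
from pathlib import Path


def write_config(config_path, preserve=True):
    """Write a one-line config file: the variable name, a space, then 0 or 1."""
    value = 1 if preserve else 0
    path = Path(config_path)
    path.write_text(f"preserve_interword_spaces {value}\n", encoding="utf-8")
    return path


def build_command(image_path, output_base, config_path):
    """Build the argument list: tesseract <image> <output base> <config file>."""
    return ["tesseract", str(image_path), str(output_base), str(config_path)]


def run_ocr(image_path, output_base, config_path="config.txt"):
    """Run Tesseract with the config file and return the recognized text."""
    write_config(config_path, preserve=True)
    subprocess.run(build_command(image_path, output_base, config_path), check=True)
    # Tesseract appends ".txt" to the output base for plain-text output.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling run_ocr("input.tif", "out") would then write config.txt, run Tesseract on the hypothetical input.tif, and return the contents of out.txt. The same variable can also be set without a config file by passing -c preserve_interword_spaces=1 on the command line.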
Tesseract compresses consecutive spaces into one. You would need to modify baseapi.cpp to preserve the spaces. The code change can be found in the following posts:
https://groups.google.com/forum/#!searchin/tesseract-ocr/spaces/tesseract-ocr/lGBQiryHcrY/wy5a-L9O3i4J
https://groups.google.com/forum/#!searchin/tesseract-ocr/spaces/tesseract-ocr/9nzPrBZ3118/b3W5GtsFPo0J