Tesseract - ambiguity in space and tab

Posted 2019-01-29 06:15

I have a TIFF file that contains some text separated by tabs (4 spaces). But when I extract the text from this TIFF image, I always get a single space between two columns. An example:

TIFF IMAGE:
col-a    col-b    col-c

desired output:
col-a    col-b    col-c

but I am getting the following:
col-a col-b col-c

I tried this with multiple images of the same format, but the result is always the same. How do I fix this issue? Can I train Tesseract to understand this?

Tags: ocr tesseract
2 answers
叛逆
#2 · 2019-01-29 07:03

After a lot of research I found the solution. Here are the steps to follow:

  1. Upgrade Tesseract to 3.04.

  2. Create config.txt in the directory that holds the input image file.

  3. In the config file, define "preserve_interword_spaces".

  4. After the word preserve_interword_spaces, give either 0 or 1 (1 preserves the spaces between words; 0 is the default and collapses them). Ex:

preserve_interword_spaces 0

or

preserve_interword_spaces 1

  5. Test & Cheers!!!
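For anyone scripting this, here is a minimal Python sketch of the steps above. It writes config.txt and builds the usual `tesseract <image> <outputbase> <configfile>` command line. The file names `sample.tif` and `out`, and the helper names `write_config` and `build_command`, are hypothetical and only for illustration; it also assumes the `tesseract` binary is on your PATH and that your version accepts a config file given as a plain path.

```python
import shutil
import subprocess
from pathlib import Path


def write_config(directory, preserve=True):
    """Write config.txt containing the preserve_interword_spaces setting."""
    config_path = Path(directory) / "config.txt"
    value = 1 if preserve else 0
    config_path.write_text(f"preserve_interword_spaces {value}\n")
    return config_path


def build_command(image_path, output_base, config_path):
    """Build the command line: tesseract <image> <outputbase> <configfile>."""
    return ["tesseract", str(image_path), str(output_base), str(config_path)]


if __name__ == "__main__":
    image = Path("sample.tif")  # hypothetical input image
    if shutil.which("tesseract") and image.exists():
        config = write_config(image.parent)
        # Tesseract writes the recognized text to out.txt
        subprocess.run(build_command(image, "out", config), check=True)
    else:
        print("tesseract or sample.tif not found; nothing was run")
```

If this works as expected, the columns in out.txt should be separated by runs of spaces instead of a single space. Whether the spacing matches the original layout exactly will depend on the image.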
可以哭但决不认输i
#3 · 2019-01-29 07:06

Tesseract compresses consecutive spaces into one. You would need to modify baseapi.cpp to preserve the spaces. The code change can be found in the following posts:

https://groups.google.com/forum/#!searchin/tesseract-ocr/spaces/tesseract-ocr/lGBQiryHcrY/wy5a-L9O3i4J

https://groups.google.com/forum/#!searchin/tesseract-ocr/spaces/tesseract-ocr/9nzPrBZ3118/b3W5GtsFPo0J
