Tesseract - ambiguity in space and tab

Posted 2019-01-29 06:15

I have a TIFF file that contains some text separated by tabs (4 spaces). But when I extract the text from this TIFF image, I always get a single space between two columns. A sample example:

TIFF IMAGE:
col-a    col-b    col-c

desired output:
col-a    col-b    col-c

but I am getting the following:
col-a col-b col-c

I tried this with multiple images of the same format, but the result is always the same. How do I fix this issue? Can I train Tesseract to understand this?

Tags: ocr tesseract

2 answers

叛逆

#2 · 2019-01-29 07:03

After a lot of research I found the solution. Here are the steps to follow:

Upgrade Tesseract to 3.04 or later.
Create config.txt in the directory that contains the input image file.
In the config file, define "preserve_interword_spaces".
After the word preserve_interword_spaces, give either 0 or 1: 0 (the default) collapses runs of spaces into one, and 1 preserves them. Ex:

preserve_interword_spaces 0

preserve_interword_spaces 1

Test & Cheers!!!
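
The steps above can be sketched in a few lines of Python. This is only an illustration, assuming the tesseract executable is on the PATH and that the command takes the form `tesseract <image> <output base> <config file>`; the file names sample.tif and result, and the helper names write_config and build_command, are hypothetical and not part of the original answer.

```python
import shutil
import subprocess
from pathlib import Path


def write_config(path):
    """Write a Tesseract config file that asks for runs of spaces to be kept."""
    config = Path(path)
    config.write_text("preserve_interword_spaces 1\n", encoding="utf-8")
    return config


def build_command(image, output_base, config):
    """Return the argv list: tesseract <image> <output_base> <config file>."""
    return ["tesseract", str(image), str(output_base), str(config)]


if __name__ == "__main__":
    # Hypothetical file names; replace them with your own image and output base.
    config = write_config("config.txt")
    cmd = build_command("sample.tif", "result", config)
    if shutil.which("tesseract") and Path("sample.tif").exists():
        subprocess.run(cmd, check=True)  # Tesseract writes result.txt
    else:
        print("Would run:", " ".join(cmd))
```

Building the argument list separately from running it makes it easy to check the command before Tesseract is installed. The same variable can usually also be passed directly on the command line with `-c preserve_interword_spaces=1` instead of a config file.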


可以哭但决不认输i

#3 · 2019-01-29 07:06

Tesseract collapses consecutive spaces into a single space. To preserve them, you would need to modify baseapi.cpp. The code change can be found in the following posts:

https://groups.google.com/forum/#!searchin/tesseract-ocr/spaces/tesseract-ocr/lGBQiryHcrY/wy5a-L9O3i4J

https://groups.google.com/forum/#!searchin/tesseract-ocr/spaces/tesseract-ocr/9nzPrBZ3118/b3W5GtsFPo0J

