Creating a training image for Tesseract OCR

I'm writing a generator for training images for Tesseract OCR.

When generating a training image for a new font for Tesseract OCR, what are the best values for:

1. The DPI
2. The font size in points
3. Should the font be anti-aliased or not?
4. Should the bounding boxes fit snugly around each character, or leave some margin?

Tags: ocr tesseract

3 Answers

chillily

Reply #2 · 2019-03-16 13:47

A good tool for Tesseract training: http://vietocr.sourceforge.net/training.html

It has several advantages:

The bounding box on each letter can be edited through a GUI.
It automatically creates all the required files.
It automatically combines all the files (freq-dawg, word-dawg, user-words (can be an empty file), Inttemp, Normproto, Pffmtable, Unicharset, DangAmbigs (can be an empty file), shapetable) into a single eng.traineddata file.
The new training data can be used alongside the existing Tesseract file eng.traineddata.


做个烂人

Reply #3 · 2019-03-16 14:01

I found the answer to the 4th question - "Should the bounding boxes fit snugly".

It seems that fitting the rectangles as tightly as possible around each character gives much better results.

For the other questions, 12 pt and 300 dpi should be good enough, as @Yaroslav suggested. I think anti-aliasing is better turned off.
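To make "fitting snugly" concrete, here is a minimal sketch of how a generator might compute a tight box for one glyph and write it as a box-file line. It assumes the glyph is available as a binary bitmap (rows listed top to bottom, 1 = ink); the names tight_bbox and box_file_line are hypothetical helpers, not part of Tesseract. The only Tesseract-specific detail relied on is that box files list "char left bottom right top page" with y measured from the bottom of the image.

```python
from typing import List, Optional, Tuple

Bitmap = List[List[int]]  # rows top-to-bottom, 1 = ink, 0 = background


def tight_bbox(bitmap: Bitmap) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) of the ink pixels in top-down
    coordinates, with right and bottom exclusive. None if the bitmap is blank."""
    rows = [y for y, row in enumerate(bitmap) if any(row)]
    if not rows:
        return None
    cols = [x for x in range(len(bitmap[0])) if any(row[x] for row in bitmap)]
    return cols[0], rows[0], cols[-1] + 1, rows[-1] + 1


def box_file_line(char: str, bbox: Tuple[int, int, int, int],
                  image_height: int, page: int = 0) -> str:
    """Format one box-file line. Box files measure y from the bottom of the
    image, so the top-down coordinates are flipped here."""
    left, top, right, bottom = bbox
    return f"{char} {left} {image_height - bottom} {right} {image_height - top} {page}"


# Illustrative 6x5 glyph with a one-pixel empty margin on every side.
glyph = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

bbox = tight_bbox(glyph)
if bbox is not None:
    print(box_file_line("o", bbox, image_height=len(glyph)))  # o 1 1 5 4 0
```

In a real generator the bitmap would come from whatever library renders the text, and the glyph offset within the full page image would have to be added to the coordinates.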


甜甜的少女心

Reply #4 · 2019-03-16 14:08

The 2nd question is more or less answered here: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Generate_Training_Images

"There is no need to train with multiple sizes. 10 point will do. (An exception to this is very small text. If you want to recognize text with an x-height smaller than about 15 pixels, you should either train it specifically or scale your images before trying to recognize them.)"
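To relate point size and DPI to the 15-pixel x-height threshold quoted above, a rough calculation helps. A point is 1/72 inch, so the em size in pixels is points * dpi / 72. The x-height ratio of 0.5 used below is only an assumed, illustrative value (it differs from font to font), and the function names are made up for this sketch.

```python
def points_to_pixels(points: float, dpi: float) -> float:
    """Convert a size in typographic points (1/72 inch) to pixels at a given DPI."""
    return points * dpi / 72.0


def estimated_x_height_px(font_pt: float, dpi: float,
                          x_height_ratio: float = 0.5) -> float:
    """Estimate the x-height in pixels.

    x_height_ratio is an ASSUMPTION (x-height as a fraction of the em size);
    measure it for the actual font if the result matters.
    """
    return points_to_pixels(font_pt, dpi) * x_height_ratio


def upscale_factor(x_height_px: float, min_x_height_px: float = 15.0) -> float:
    """Factor by which an image would need scaling to reach the minimum
    x-height; 1.0 means it is already large enough."""
    return max(1.0, min_x_height_px / x_height_px)


if __name__ == "__main__":
    for dpi in (150, 300):
        xh = estimated_x_height_px(10, dpi)
        print(f"10 pt at {dpi} dpi: x-height ~ {xh:.1f} px, "
              f"upscale x{upscale_factor(xh):.2f}")
```

Under that assumed ratio, 10 pt text at 300 dpi sits comfortably above the threshold, while the same text at 150 dpi would fall below it.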

Questions 1 and 3: from experience, I've successfully used 300 dpi images and non-anti-aliased fonts. More specifically, I used the following convert parameters on a training PDF, which produced a satisfactory image:

convert -density 300 -depth 8 [input].pdf -background white -flatten +matte -compress none -monochrome [output].tif

But then I tried to add a dotted font to Tesseract, and it only detected the characters properly when I used a 150 dpi image. So I don't think there's a general solution; it depends on the kind of font you're trying to add.
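Since the best density seems to depend on the font, one option is to render the same training PDF at several densities and compare the results. The sketch below only builds the argument lists for the convert command shown above, with the density as a parameter; the file names are hypothetical, and actually running the commands requires ImageMagick to be installed.

```python
import shlex
from typing import List


def convert_command(pdf_path: str, tif_path: str, density: int = 300) -> List[str]:
    """Build the argument list for the convert invocation from the answer,
    with only the density made configurable."""
    return [
        "convert",
        "-density", str(density),
        "-depth", "8",
        pdf_path,
        "-background", "white",
        "-flatten",
        "+matte",
        "-compress", "none",
        "-monochrome",
        tif_path,
    ]


if __name__ == "__main__":
    # Hypothetical input and output names, one output per density to compare.
    for density in (150, 300):
        cmd = convert_command("myfont.pdf", f"myfont_{density}dpi.tif", density)
        print(shlex.join(cmd))
        # To actually run it: subprocess.run(cmd, check=True)
```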

