Tesseract use subset of letters

I'm using the tesseract-ocr package on Ubuntu Linux. I have been using it for a while, and I think that to improve the accuracy of the OCR I only need a subset of characters. The characters I need are:

0123456789abcdefghijklmnopqrstuvwxyz

and only those, not even capital letters. Can anybody give me a hand telling Tesseract to match against only this subset of characters?

Thanks,

Tags: python linux ocr captcha tesseract

2 answers

The star\"

#2 · 2019-05-31 16:35

What you're looking for is the Tesseract whitelist, set through the tessedit_char_whitelist variable. If you're working in Python with the API, I think this should work (found on the Tesseract Google Group):

api.SetVariable("tessedit_char_whitelist", "abcdefghijklmnopqrstuvwxyz0123456789 ");

Note, I'm not sure which version of Tesseract this is for.


唯我独甜

#3 · 2019-05-31 16:38

From the python-tesseract project page:

import tesseract
api = tesseract.TessBaseAPI()
api.Init(".","eng",tesseract.OEM_DEFAULT)
api.SetVariable("tessedit_char_whitelist", "0123456789abcdefghijklmnopqrstuvwxyz")
api.SetPageSegMode(tesseract.PSM_AUTO)

So just set your own collection of characters in api.SetVariable.
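If you call the tesseract binary instead of the API, the same variable can be passed on the command line. This is only a sketch, assuming Tesseract 3.02 or later (where the `-c configvar=value` option is available); `build_tesseract_command` and the file names are made-up names for illustration, and the function only builds the argument list rather than running anything.

```python
import string

# Characters the question asks for: digits followed by lowercase letters.
WHITELIST = string.digits + string.ascii_lowercase


def build_tesseract_command(image_path, output_base, whitelist=WHITELIST):
    """Return the argument list for a tesseract run restricted to `whitelist`.

    Hypothetical helper: it assumes a tesseract binary that accepts
    `-c configvar=value` (Tesseract 3.02 or later).
    """
    return [
        "tesseract",
        image_path,
        output_base,
        "-c",
        "tessedit_char_whitelist=" + whitelist,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so this works
    # even on a machine without tesseract installed.
    print(" ".join(build_tesseract_command("image.tif", "outputbase")))
```

The returned list can be handed to `subprocess.run(...)` once you have checked that your installed version accepts `-c`.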

From the tesseract-ocr project FAQ:

Tesseract 2.03: Use

TessBaseAPI::SetVariable("tessedit_char_whitelist", "0123456789");

BEFORE calling an Init function or put this in a text file called tessdata/configs/digits:

tessedit_char_whitelist 0123456789

and then your command line becomes:

tesseract image.tif outputbase nobatch digits

Warning: Until the old and new config variables get merged, you must have the nobatch parameter too.

Tesseract 3: A digits config file is already created, so just run a tesseract command like this:

tesseract imagename outputbase digits
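The stock digits config only covers 0-9, so for the character set in the question you would probably want your own config file in the same format. A minimal sketch, assuming the one-line `variable value` format shown above; the file name `alnum` and the helper `write_whitelist_config` are made up for illustration, and the location of your tessdata/configs directory depends on your installation.

```python
import os
import string
import tempfile


def write_whitelist_config(configs_dir, name, chars):
    """Write a Tesseract config file that whitelists `chars`.

    Hypothetical helper: `configs_dir` should be your tessdata/configs
    directory, and `name` is the word you later put on the command line.
    """
    path = os.path.join(configs_dir, name)
    with open(path, "w") as config_file:
        # Same format as the digits example: variable name, space, value.
        config_file.write("tessedit_char_whitelist " + chars + "\n")
    return path


if __name__ == "__main__":
    # Demo in a temporary directory so nothing in a real tessdata is touched.
    demo_dir = tempfile.mkdtemp()
    chars = string.digits + string.ascii_lowercase
    config_path = write_whitelist_config(demo_dir, "alnum", chars)
    with open(config_path) as config_file:
        print(config_file.read(), end="")
```

With the file in place under tessdata/configs, the command would follow the same pattern as the digits one, e.g. `tesseract imagename outputbase alnum`.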
