How to ignore special characters in Tesseract OCR

Posted 2019-09-20 10:57

I have extracted text from an image with Tesseract OCR using Java, but the output contains some special characters because the image contains symbols.

I want to ignore all the special characters and display just the text. Is there any way I can do that?

Tags: java ocr tesseract tess4j

1 answer

在下西门庆

#2 · 2019-09-20 11:59

In Tesseract you can set the TessBaseAPI.VAR_CHAR_WHITELIST and TessBaseAPI.VAR_CHAR_BLACKLIST variables to restrict which characters are recognized, which lets you ignore special characters.

The following makes Tesseract recognize only the uppercase letters A-Z and digits:

String whiteList = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
tessBaseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, whiteList);
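
Since the question asks for plain text rather than only capitals, the whitelist probably needs lowercase letters too. A minimal sketch of building such a string follows; the OcrWhitelist class and its alphanumeric method are hypothetical helper names, not part of Tesseract, and the setVariable call is left as a comment because it needs the tessBaseApi object from the snippet above.

```java
final class OcrWhitelist {

    /** Appends every character from first to last (inclusive) to sb. */
    private static void appendRange(StringBuilder sb, char first, char last) {
        for (char c = first; c <= last; c++) {
            sb.append(c);
        }
    }

    /** Builds "A..Za..z0..9" so the whitelist does not have to be typed by hand. */
    static String alphanumeric() {
        StringBuilder sb = new StringBuilder();
        appendRange(sb, 'A', 'Z');
        appendRange(sb, 'a', 'z');
        appendRange(sb, '0', '9');
        return sb.toString();
    }

    public static void main(String[] args) {
        String whiteList = alphanumeric();
        System.out.println(whiteList);
        // With the API object from the answer, this would then be passed as:
        // tessBaseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, whiteList);
    }
}
```

Any punctuation you do want to keep (for example a period or comma) can be appended to the returned string before passing it on.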

The next snippet lets Tesseract recognize everything except ~ and ﬂ:

String blackList = "~ﬂ";
tessBaseApi.setVariable(TessBaseAPI.VAR_CHAR_BLACKLIST, blackList);

Also note that, as mentioned in a Tesseract GitHub issue, you can't blacklist or whitelist characters with the Tesseract 4.0 alpha LSTM engine; instead, you should train the LSTM model on the characters you expect in your images.

Of course, if you want, you can still use the 3.* versions of Tesseract; its tessdata is located here
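
If neither retraining nor switching to a 3.* version is practical, one alternative (not part of Tesseract, and an assumption about what counts as "special") is to filter the recognized string after OCR has run. The sketch below keeps ASCII letters, digits, and whitespace and drops everything else; OcrTextCleaner and stripSpecialCharacters are hypothetical names, and the input string stands in for whatever text your OCR call returns.

```java
import java.util.regex.Pattern;

final class OcrTextCleaner {

    // Matches anything that is NOT an ASCII letter, an ASCII digit, or whitespace.
    private static final Pattern SPECIAL = Pattern.compile("[^A-Za-z0-9\\s]");

    // Matches runs of spaces/tabs, which appear where symbols were removed.
    private static final Pattern BLANK_RUN = Pattern.compile("[ \\t]+");

    /** Removes special characters from OCR output, keeping line breaks intact. */
    static String stripSpecialCharacters(String ocrText) {
        String withoutSymbols = SPECIAL.matcher(ocrText).replaceAll("");
        return BLANK_RUN.matcher(withoutSymbols).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by the OCR engine.
        String ocrText = "Total: ~42 \uFB02 items!";
        System.out.println(stripSpecialCharacters(ocrText));
    }
}
```

Unlike a whitelist, this does not change what Tesseract recognizes; a symbol misread as a letter will still come through, so it works best as a cleanup step rather than a replacement for the variables above.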
