Tesseract thinks my 1's are 7's

Published 2020-07-30 04:13

做自己的国王

Question:

It seems like this is probably a common issue with OCR. Is there a way to tell Tesseract that my 1's are actually 1's?

Hopefully without changing my 7's into 1's in the process.

Note: these are scanned documents and I have no idea what font was used.

Answer 1:

If Tesseract is trainable, try training it manually on the font in your documents. That should solve the problem.

There is another possible solution: add a small validation module that runs after Tesseract. For every character recognized as a 1 or a 7, double-check it with an intensity-based method. For example, find corners (feature points) on the glyph, apply KLT tracking against a 1 template and a 7 template, and see which one gives more successful tracking results. This method is costly, but since it only compares against two small templates, I do not think it will hurt performance much.
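A minimal sketch of such a validation step follows. It is not KLT: to stay dependency-free it swaps in plain normalized cross-correlation between the glyph and each template, which is a simpler intensity-based comparison. The 5x5 templates, the `recheck_digit` name, and the assumption that each glyph has already been cropped and scaled to the template size are all hypothetical; real templates would be cut from the scanned documents.

```python
from math import sqrt

# Hypothetical 5x5 binary templates ('#' = ink). Real ones would be
# cropped from the scans and would be larger.
TEMPLATES = {
    "1": ["..#..",
          ".##..",
          "..#..",
          "..#..",
          ".###."],
    "7": ["#####",
          "....#",
          "...#.",
          "..#..",
          "..#.."],
}

def to_vector(rows):
    """Flatten rows of '#'/'.' characters into a list of 1.0/0.0 intensities."""
    return [1.0 if ch == "#" else 0.0 for row in rows for ch in row]

def ncc(a, b):
    """Normalized cross-correlation of two equal-length intensity vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    denom = sqrt(sum(x * x for x in da) * sum(x * x for x in db))
    if denom == 0:
        return 0.0  # a blank glyph or template carries no information
    return sum(x * y for x, y in zip(da, db)) / denom

def recheck_digit(glyph_rows):
    """Return the best-matching label ('1' or '7') and the score per label."""
    glyph = to_vector(glyph_rows)
    scores = {label: ncc(glyph, to_vector(rows))
              for label, rows in TEMPLATES.items()}
    return max(scores, key=scores.get), scores
```

The check would run only on characters Tesseract reported as 1 or 7, so the extra cost stays proportional to how often those digits appear.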

If neither solution is possible, try to solve it with post-processing. For example, if the field is a student's age, it would not be 78; it is 18, and so on. This approach is crude and not a real solution, but when nothing else is possible you have to do something like it.
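The age example can be sketched as a small rule-based corrector. The plausible range of 5 to 30 and the function names are assumptions made for illustration; the idea is to try every 1/7 swap and accept a correction only when exactly one reading falls inside the range.

```python
from itertools import product

def plausible_age_readings(ocr_text, low=5, high=30):
    """List every reading in [low, high] reachable by swapping 1s and 7s."""
    # Each 1 or 7 may be either digit; every other character stays fixed.
    options = [("1", "7") if ch in "17" else (ch,) for ch in ocr_text]
    candidates = ("".join(combo) for combo in product(*options))
    return [c for c in candidates if c.isdigit() and low <= int(c) <= high]

def correct_age(ocr_text, low=5, high=30):
    """Return a corrected age string, or None if no unique fix exists."""
    # Trust the OCR output when it is already plausible.
    if ocr_text.isdigit() and low <= int(ocr_text) <= high:
        return ocr_text
    readings = plausible_age_readings(ocr_text, low, high)
    # Only correct when the swap is unambiguous (the range bounds are assumed).
    return readings[0] if len(readings) == 1 else None
```

With these assumed bounds, `correct_age("78")` yields `"18"`, while `correct_age("77")` yields `None` because both 11 and 17 would fit, which illustrates why this approach is only a fallback.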