Special characters which are identified as individ

2019-07-26 07:37发布

I was trying to make the google vision OCR regex searchable. I have completed it and works pretty well when the document contains only English characters. But it fails when there is the text of other languages.

It's happening because I have only English characters in google vision word component as follows.

VISION_API_WORD_COUNTERS = "([a-zA-Z0-9]+)|([^a-zA-Z0-9 ])";
VISION_API_WORD_COMPONENTS = "[a-zA-Z0-9]";
VISION_API_NOT_WORD_COMPONENTS = "[^a-zA-Z0-9]";

As I can't include characters from all the languages, I am thinking to include the inverse of above. Something like

VISION_API_WORD_COMPONENTS = "[^*ALL THE SPECIAL CHARACTERS WHICH ARE IDENTIFIED AS WORD BY GOOGLE VISION*]"

for example [^!@#$%^&*()_+=].

So where can I find ALL THE SPECIAL CHARACTERS WHICH ARE IDENTIFIED AS A SEPARATE WORD BY GOOGLE VISION?

Trial and error, keep adding the special characters I find is one option.But that would be my last option.

标签： text google-api ocr google-cloud-vision google-vision

0条回答

Special characters which are identified as individ

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间