Special characters which are identified as individ

2019-07-26 07:36发布

问题:

I was trying to make the google vision OCR regex searchable. I have completed it and works pretty well when the document contains only English characters. But it fails when there is the text of other languages.

It's happening because I have only English characters in google vision word component as follows.

VISION_API_WORD_COUNTERS = "([a-zA-Z0-9]+)|([^a-zA-Z0-9 ])";
VISION_API_WORD_COMPONENTS = "[a-zA-Z0-9]";
VISION_API_NOT_WORD_COMPONENTS = "[^a-zA-Z0-9]";

As I can't include characters from all the languages, I am thinking to include the inverse of above. Something like

VISION_API_WORD_COMPONENTS = "[^*ALL THE SPECIAL CHARACTERS WHICH ARE IDENTIFIED AS WORD BY GOOGLE VISION*]"

for example [^!@#$%^&*()_+=].

So where can I find ALL THE SPECIAL CHARACTERS WHICH ARE IDENTIFIED AS A SEPARATE WORD BY GOOGLE VISION?

Trial and error, keep adding the special characters I find is one option.But that would be my last option.