I am trying to make Google Vision OCR output searchable with regular expressions. It is finished and works pretty well when the document contains only English characters, but it fails when the document contains text in other languages.
This happens because my Google Vision word-component patterns only cover English characters, as follows:
VISION_API_WORD_COUNTERS = "([a-zA-Z0-9]+)|([^a-zA-Z0-9 ])";
VISION_API_WORD_COMPONENTS = "[a-zA-Z0-9]";
VISION_API_NOT_WORD_COMPONENTS = "[^a-zA-Z0-9]";
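To make the failure concrete, here is a minimal sketch in Python. It assumes each regex match is counted as one word, which is my reading of how the pattern is meant to be used; `count_words` is a hypothetical helper, not part of any Google API.

```python
import re

# Pattern copied from above: only ASCII letters and digits form a word.
VISION_API_WORD_COUNTERS = r"([a-zA-Z0-9]+)|([^a-zA-Z0-9 ])"

def count_words(text):
    """Hypothetical helper: count matches, treating each match as one word."""
    return sum(1 for _ in re.finditer(VISION_API_WORD_COUNTERS, text))

english = count_words("hello world")   # two runs of ASCII letters
russian = count_words("привет мир")    # every Cyrillic letter matches on its own
print(english, russian)
```

With the ASCII-only classes, each non-English letter falls into the second alternative and is counted as a separate one-character word, so the counts no longer line up with the words in the text.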
Since I can't list the characters of every language, I am thinking of using the inverse of the above instead. Something like:
VISION_API_WORD_COMPONENTS = "[^*ALL THE SPECIAL CHARACTERS WHICH ARE IDENTIFIED AS WORD BY GOOGLE VISION*]"
for example [^!@#$%^&*()_+=].
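A rough sketch of what that inverse approach could look like in Python. The `SEPARATORS` string is only a placeholder taken from the example above, not Google Vision's actual list, and `tokenize` is a hypothetical helper. I am also assuming whitespace should be excluded from word components, as the literal space is in the original pattern.

```python
import re

# Placeholder separator set: NOT the real list used by Google Vision.
SEPARATORS = "!@#$%^&*()_+="

# Escape the characters so they are safe inside a character class.
_sep = re.escape(SEPARATORS)

# A word component is anything that is neither a separator nor whitespace.
VISION_API_WORD_COMPONENTS = rf"[^{_sep}\s]"
VISION_API_NOT_WORD_COMPONENTS = rf"[{_sep}\s]"
# A word is either a run of word components or a single separator.
VISION_API_WORD_COUNTERS = rf"([^{_sep}\s]+)|([{_sep}])"

def tokenize(text):
    """Hypothetical helper: return each regex match as one token."""
    return [m.group(0) for m in re.finditer(VISION_API_WORD_COUNTERS, text)]

print(tokenize("цена=100$"))
print(tokenize("hello мир!"))
```

Written this way, letters from any script count as word components without being listed, and only the separator set has to be maintained. The open question is still which characters belong in that set.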
So where can I find the full list of special characters that Google Vision identifies as a separate word?
Trial and error, adding each special character as I come across it, is one option, but I would rather keep that as a last resort.