I want a regex to match a simple hashtag like those on Twitter (e.g. #someword). I also want it to recognize non-standard characters (like those in Spanish, Hebrew, or Chinese).
This was my initial regex: (^|\s|\b)(#(\w+))\b
--> but it doesn't recognize non-standard characters, because \w in JavaScript only matches [A-Za-z0-9_].
Then, I tried using XRegExp.js, which worked, but ran too slowly.
Any suggestions for how to do it?
Eventually I found twitter-text.js, which is basically how Twitter solves this problem.
With native JS regexes that don't support Unicode, your only option is to explicitly enumerate the characters that can end the tag and match everything else. For example:
> s = "foo #הַתִּקְוָה. bar"
"foo #הַתִּקְוָה. bar"
> s.match(/#(.+?)(?=[\s.,:,]|$)/)
["#הַתִּקְוָה", "הַתִּקְוָה"]
The [\s.,:,] class should include whitespace, punctuation, and whatever else can be considered a terminating symbol.
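To pull out every tag in a string rather than just the first, the same lookahead idea can be run with the g flag. This is only a sketch: the helper name findHashtags and the exact terminator set [\s.,:;!?] are my own assumptions, so adjust the set to whatever should end a tag in your data.

```javascript
// Hypothetical helper (not from the answer above): collect every hashtag body.
// The terminator class [\s.,:;!?] is an assumed set; extend it as needed.
function findHashtags(text) {
  // Lazy match after '#', stopping just before a terminator or end of input.
  var re = /#(.+?)(?=[\s.,:;!?]|$)/g;
  var tags = [];
  var m;
  while ((m = re.exec(text)) !== null) {
    tags.push(m[1]); // capture group 1 is the tag without the leading '#'
  }
  return tags;
}

var tags = findHashtags("foo #הַתִּקְוָה. bar #mañana, #你好");
console.log(tags);
```

Note that the pattern only requires at least one character after the #, so a lone "# " followed by text would still produce a match; filter those out if they matter.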
#([^#]+)[\s,;]*
Explanation: This regular expression searches for a # followed by one or more non-# characters, followed by zero or more spaces, commas, or semicolons.
var input = "#hasta #mañana #babהַ";
var matches = input.match(/#([^#]+)[\s,;]*/g);
Result:
["#hasta ", "#mañana ", "#babהַ"]
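The trailing spaces in that result come from [^#]+ being greedy: it consumes the whitespace itself before [\s,;]* gets a chance. If you want clean tags, one possible follow-up (a sketch, not part of the answer as posted) is to strip the trailing separators from each match:

```javascript
var input = "#hasta #mañana #babהַ";

// match() returns null when nothing matches, so fall back to an empty array.
var matches = input.match(/#([^#]+)[\s,;]*/g) || [];

// Remove any trailing whitespace, commas, or semicolons from each match.
var cleaned = matches.map(function (m) {
  return m.replace(/[\s,;]+$/, "");
});

console.log(cleaned);
```

Keep in mind that this pattern treats everything up to the next # as part of the tag, so it fits input that consists only of hashtags and separators.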
EDIT - Replaced \b for word boundary