I am having a really hard time figuring out a regular expression (in C#) to validate hashtags. \w
simply isn't enough, as special characters are missing (ä, ö, ø, æ, å for starters, but also a lot of other foreign characters).
I need to support ALL hashtags there are. Mainly from Twitter, but in the future also from other providers.
My best shot (so far) is: ^#[a-zA-Z_0-9\u00C0-\u02AF]+$
(C# regex)
I cannot find any decent documentation from Twitter or anyone else about this, so:
- Does anyone know of any documentation I have missed?
- OR does anyone know which Unicode ranges I should include as valid characters for hashtags?
- AND Can anybody tell me if there is a difference between the support of hashtags on e.g. Twitter, Instagram, Facebook, etc.?
Update: I should note that C# is not the only language I need this in, hence the need for a precise specification.
A Quick-and-Dirty Simplified Approach
Here is a nice read from the Twitter engineering team:
The test cases and other valuable information are located at https://github.com/twitter/twitter-text/blob/master/java/src/test/java/com/twitter/twittertext/RegexTest.java. According to it, a valid hashtag can be written in C# as
See this regex demo
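As a rough illustration of the simplified approach, here is a minimal sketch. It is not the pattern from the twitter-text tests. It assumes a hashtag is # followed by Unicode letters, combining marks, digits, or underscores, with at least one letter somewhere in the tag. The class name HashtagValidator and the sample inputs are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HashtagValidator
{
    // Assumed rule (not taken from twitter-text): '#', then letters, marks,
    // digits, or underscores, and at least one \p{L} somewhere in the tag.
    // \z anchors at the very end, so a trailing newline is not accepted.
    private static readonly Regex Pattern = new Regex(
        @"^#[\p{L}\p{M}\p{N}_]*\p{L}[\p{L}\p{M}\p{N}_]*\z",
        RegexOptions.CultureInvariant);

    public static bool IsValid(string candidate) =>
        candidate != null && Pattern.IsMatch(candidate);
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical sample inputs, chosen only to exercise the pattern.
        var samples = new[] { "#hello", "#smørrebrød", "#привет", "#a_1", "#2016", "#no space", "hello" };
        foreach (var sample in samples)
        {
            Console.WriteLine($"{sample}: {HashtagValidator.IsValid(sample)}");
        }
    }
}
```

With this pattern, #hello, #smørrebrød, #привет, and #a_1 are accepted, while #2016 (digits only), "#no space", and "hello" (no leading #) are rejected. Spelling out the \p{...} categories instead of relying on \w keeps the pattern easier to port to other regex flavors.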
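Since you want to be able to use this in any language, just note that \p{L} matches any Unicode letter, and \w is roughly a combination of \p{L}, _, and \p{N} (any Unicode number). In a regex flavor that lacks these classes, each of them, as well as whitespace, has to be written out as an explicit character class.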
Note there can be issues with diacritic matching in ES5 regex syntax.
UPDATE
twitter-text C# Adaptation
The Java library features the following regex for the hashtags:
Translating into C#:
And here is a testing C# demo:
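For finding hashtags inside running text rather than validating a single string, a much shorter stand-in for the twitter-text pattern can be sketched as follows. This is an assumption-laden simplification, not the library's regex: it treats a # as starting a hashtag only when it is not preceded by a letter, mark, digit, underscore, or &, and it requires at least one letter in the tag. HashtagExtractor and the sample sentence are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HashtagExtractor
{
    // Characters assumed to be allowed inside a tag (a simplification).
    private const string TagChar = @"[\p{L}\p{M}\p{N}_]";

    // The lookbehind rejects '#' glued to a preceding word character or '&'.
    // Group 1 captures the tag text without the leading '#'.
    private static readonly Regex Pattern = new Regex(
        @"(?<![\p{L}\p{M}\p{N}_&])#(" + TagChar + @"*\p{L}" + TagChar + @"*)",
        RegexOptions.CultureInvariant);

    public static List<string> Extract(string text)
    {
        var tags = new List<string>();
        foreach (Match match in Pattern.Matches(text ?? string.Empty))
        {
            tags.Add(match.Groups[1].Value);
        }
        return tags;
    }
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical input: two real tags, one digits-only tag, one '#' inside a word.
        var tags = HashtagExtractor.Extract("Loving #smørrebrød and #C_sharp, not #123 or a#b");
        Console.WriteLine(string.Join(", ", tags));
    }
}
```

For the sample sentence this yields the two tags smørrebrød and C_sharp. The real twitter-text pattern handles many more edge cases, so treat this only as a starting point for experiments.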
JavaScript Hashtag validation
If you use the Twitter JS library, identifying hashtags can be done with a mere: