I am developing a program where I need to filter out words and sentences that contain non-Latin characters. The problem is that I can only find words and sentences made up entirely of Latin characters; I cannot find words and sentences that mix Latin and non-Latin characters. For example, "Hello" is a word made of Latin letters, and I can match it using this code:
Match match = Regex.Match(line.Line, @"[^\u0000-\u007F]+", RegexOptions.IgnoreCase);
if (match.Success)
{
line.Line = match.Groups[1].Value;
}
But I cannot match a word or sentence that mixes in non-Latin letters, for example: "Hellø I am sømthing".
Also, could somebody explain what RegexOptions.None and RegexOptions.IgnoreCase are and what they stand for?
The four "Latin" blocks are (from http://www.fileformat.info/info/unicode/block/index.htm):
Basic Latin U+0000 - U+007F
Latin-1 Supplement U+0080 - U+00FF
Latin Extended-A U+0100 - U+017F
Latin Extended-B U+0180 - U+024F
So a Regex to "include" all of them would be:
Regex.Match(line.Line, @"[\u0000-\u024F]+", RegexOptions.None);
while a Regex to catch anything outside the block would be:
Regex.Match(line.Line, @"[^\u0000-\u024F]+", RegexOptions.None);
Note that I do feel that doing a regex "by block" is a little wrong, especially when you use the Latin blocks, because for example the Basic Latin block contains control characters (like new line, ...), letters (A-Z, a-z), digits (0-9), punctuation (.,;:...), other characters ($@/&...) and so on.
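As a rough illustration of the two patterns above, here is a minimal sketch. The class name LatinBlockDemo and the helper IsLatinOnly are hypothetical names made up for this example. Note that ø is U+00F8, which sits in the Latin-1 Supplement block, so the mixed sentence from the question counts as "Latin" under this definition. Since these patterns contain no capturing groups, the matched text is read from Match.Value rather than Groups[1].

```csharp
using System;
using System.Text.RegularExpressions;

public static class LatinBlockDemo
{
    // Runs of characters inside the four Latin blocks (U+0000 to U+024F).
    private static readonly Regex InsideLatin =
        new Regex(@"[\u0000-\u024F]+", RegexOptions.None);

    // Runs of characters outside those four blocks.
    private static readonly Regex OutsideLatin =
        new Regex(@"[^\u0000-\u024F]+", RegexOptions.None);

    // Hypothetical helper: true when no character falls outside the four blocks.
    public static bool IsLatinOnly(string text)
    {
        return !OutsideLatin.IsMatch(text);
    }

    // Hypothetical helper: the first run of "Latin block" characters, or "" if none.
    public static string FirstLatinRun(string text)
    {
        Match match = InsideLatin.Match(text);
        // No capturing groups in the pattern, so use Value (same as Groups[0]).
        return match.Success ? match.Value : string.Empty;
    }

    public static void Main()
    {
        string mixed = "Hellø I am sømthing";

        Console.WriteLine(IsLatinOnly(mixed));      // True: ø (U+00F8) is in Latin-1 Supplement
        Console.WriteLine(FirstLatinRun(mixed) == mixed); // True: the whole sentence is one run
        Console.WriteLine(IsLatinOnly("Привет"));   // False: Cyrillic is outside U+0000 to U+024F
    }
}
```

Because spaces, digits and punctuation are also part of Basic Latin, a whole sentence matches as a single run; splitting into words would need a separate step.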
For the meaning of RegexOptions.None and RegexOptions.IgnoreCase, from https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions.aspx:
RegexOptions.None: Specifies that no options are set
RegexOptions.IgnoreCase: Specifies case-insensitive matching.
The last one means that if you do Regex.Match(line.Line, @"ABC", RegexOptions.IgnoreCase), it will match ABC, Abc, abc, and so on. This option works even on character ranges: [A-Z] will match both A-Z and a-z. Note that it is probably useless in this case, because the blocks I suggested should contain both the uppercase and the lowercase "variation" of letters that have both an uppercase and a lowercase form.
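To make the difference between the two options concrete, here is a small sketch. The class name IgnoreCaseDemo and the helper IsAllInUpperRange are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IgnoreCaseDemo
{
    // Hypothetical helper: does the whole input consist of characters
    // accepted by the range [A-Z] under the given options?
    public static bool IsAllInUpperRange(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, @"^[A-Z]+$", options);
    }

    public static void Main()
    {
        // RegexOptions.None: matching is case-sensitive.
        Console.WriteLine(Regex.IsMatch("abc", @"ABC", RegexOptions.None));       // False

        // RegexOptions.IgnoreCase: letter case is disregarded.
        Console.WriteLine(Regex.IsMatch("abc", @"ABC", RegexOptions.IgnoreCase)); // True
        Console.WriteLine(Regex.IsMatch("Abc", @"ABC", RegexOptions.IgnoreCase)); // True

        // The same applies to character ranges.
        Console.WriteLine(IsAllInUpperRange("hello", RegexOptions.None));         // False
        Console.WriteLine(IsAllInUpperRange("hello", RegexOptions.IgnoreCase));   // True
    }
}
```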