I'll be getting text from a user that I need to validate is a Chinese character.
Is there any way I can check this?
According to the information on the Unicode website, you can look up the block(s) for Chinese, or for any other script, and then check whether each character falls inside that range. For example:
public bool IsChinese(string text)
{
    // 0x4E00-0x9FFF is the CJK Unified Ideographs block; other Han blocks
    // need their own ranges. Requires using System.Linq;
    return text.Any(c => c >= 0x4E00 && c <= 0x9FFF);
}
As a handy reference, the Unicode Consortium provides a search interface to the Unicode Hàn (漢) Database (Unihan), where you can look up the characters themselves.
You can use a regular expression to match one of the supported named blocks:
// Requires using System.Text.RegularExpressions; and using System.Linq;
private static readonly Regex cjkCharRegex = new Regex(@"\p{IsCJKUnifiedIdeographs}");
// Extension method: it must be declared inside a static class.
public static bool IsChinese(this char c)
{
    return cjkCharRegex.IsMatch(c.ToString());
}
Then, you can use:
if (sometext.Any(z => z.IsChinese()))
    DoSomething();
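As several people have mentioned here, Chinese, Japanese, and Korean characters are encoded together in Unicode, and they are spread over several ranges. https://en.wikipedia.org/wiki/CJK_Compatibility
For simplicity, here is a code sample that checks one coarse range spanning the CJK blocks:
public bool IsChinese(string text)
{
    // A char is a single UTF-16 code unit (at most 0xFFFF), so in practice this
    // tests 0x4E00-0xFFFF. That span also contains non-Han blocks such as
    // Hangul Syllables, so treat the result as a rough filter only.
    return text.Any(c => (uint)c >= 0x4E00 && (uint)c <= 0x2FA1F);
}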
Just check the characters to see whether their code points are in the desired range(s). For example, see this question:
What's the complete range for Chinese characters in Unicode?
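A minimal C# sketch of such a range check, assuming the input may contain characters outside the Basic Multilingual Plane. The names `HanRanges`, `IsHan`, and `ContainsHan` are hypothetical, and the listed blocks are an illustrative subset rather than a complete list, so verify them against the current Unicode charts before relying on them.

```csharp
using System;

public static class HanRanges
{
    // Illustrative subset of Han ideograph blocks (inclusive bounds); not exhaustive.
    private static readonly (int Lo, int Hi)[] Ranges =
    {
        (0x3400, 0x4DBF),   // CJK Unified Ideographs Extension A
        (0x4E00, 0x9FFF),   // CJK Unified Ideographs
        (0xF900, 0xFAFF),   // CJK Compatibility Ideographs
        (0x20000, 0x2A6DF), // CJK Unified Ideographs Extension B
        (0x2F800, 0x2FA1F), // CJK Compatibility Ideographs Supplement
    };

    // True if the code point lies in one of the ranges above.
    public static bool IsHan(int codePoint)
    {
        foreach (var (lo, hi) in Ranges)
        {
            if (codePoint >= lo && codePoint <= hi) return true;
        }
        return false;
    }

    // True if any code point in the text lies in one of the ranges above.
    public static bool ContainsHan(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            int codePoint;
            if (char.IsSurrogatePair(text, i))
            {
                // Combine the two UTF-16 code units into one code point.
                codePoint = char.ConvertToUtf32(text, i);
                i++; // skip the low surrogate
            }
            else
            {
                codePoint = text[i];
            }
            if (IsHan(codePoint)) return true;
        }
        return false;
    }
}
```

Decoding surrogate pairs matters because a C# `char` is a single UTF-16 code unit, so a per-`char` comparison can never match a range above 0xFFFF. To validate that the whole input is Chinese rather than that it merely contains a Chinese character, the same loop could return false on the first code point that fails `IsHan`.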
According to Wikipedia (https://en.wikipedia.org/wiki/CJK_Compatibility), there are several code point ranges. Here is my approach to detecting Chinese characters based on the link above (the code is in F#, but it can easily be converted):
open System

let isChinese (text: string) =
    // Decode whole code points: a single char is a UTF-16 code unit and cannot exceed 0xFFFF.
    seq { 0 .. text.Length - 1 }
    |> Seq.filter (fun i -> not (Char.IsLowSurrogate(text.[i])))
    |> Seq.exists (fun i ->
        let code =
            if Char.IsSurrogatePair(text, i) then Char.ConvertToUtf32(text, i)
            else int text.[i]
        (code >= 0x4E00 && code <= 0x9FFF) ||
        (code >= 0x3400 && code <= 0x4DBF) ||
        (code >= 0x20000 && code <= 0x2CEAF) ||
        (code >= 0x2E80 && code <= 0x31EF) ||
        (code >= 0xF900 && code <= 0xFAFF) ||
        (code >= 0xFE30 && code <= 0xFE4F) ||
        (code >= 0x2F800 && code <= 0x2FA1F)
    )
In Unicode, Chinese, Japanese, and Korean characters are encoded together.
See this FAQ: http://www.unicode.org/faq/han_cjk.html
Chinese characters are distributed across several blocks.
See this wiki page: https://en.wikipedia.org/wiki/CJK_Unified_Ideographs
You will find several CJK character charts on the Unicode website.
For simplicity, you can just check against the minimum and maximum of the Chinese character range, 0x4E00 and 0x2FA1F. Note that this span also contains blocks that are not Han ideographs, such as Hangul Syllables.
This worked for me:
// Requires using System.Globalization;
var isChineseTextPresent = false;
foreach (var character in text)
{
    // OtherLetter (Lo) includes Han ideographs, but also kana, Hangul, Arabic,
    // Hebrew, Thai and others, so this is only a coarse filter.
    if (char.GetUnicodeCategory(character) == UnicodeCategory.OtherLetter)
    {
        isChineseTextPresent = true;
        break;
    }
}
You need to query the Unicode Character Database, which contains information on every Unicode character. There is probably a utility function in C# that can do this for you. Otherwise, you can download the database from the Unicode website.
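One possible built-in helper, sketched under assumptions: the `System.Text.Unicode.UnicodeRanges` class (from the System.Text.Encodings.Web library, so this assumes that library is available in your project) exposes named blocks with a `FirstCodePoint` and a `Length`. It is not a full Unicode Character Database query and it only covers blocks in the Basic Multilingual Plane. The names `UnicodeRangeCheck`, `InRange`, and `ContainsHan` are hypothetical, and the three blocks chosen are an illustrative selection.

```csharp
using System;
using System.Linq;
using System.Text.Unicode;

public static class UnicodeRangeCheck
{
    // Illustrative selection of BMP blocks that hold Han ideographs.
    private static readonly UnicodeRange[] HanBlocks =
    {
        UnicodeRanges.CjkUnifiedIdeographs,
        UnicodeRanges.CjkUnifiedIdeographsExtensionA,
        UnicodeRanges.CjkCompatibilityIdeographs,
    };

    // True if the UTF-16 code unit lies inside the given block.
    private static bool InRange(UnicodeRange range, char c) =>
        c >= range.FirstCodePoint && c < range.FirstCodePoint + range.Length;

    // True if any char of the text lies in one of the selected blocks.
    public static bool ContainsHan(string text) =>
        text.Any(c => HanBlocks.Any(r => InRange(r, c)));
}
```

Using the named blocks avoids hard-coding the numeric bounds, at the cost of missing ideographs in the supplementary planes, which would still need an explicit code point check.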