Is there a way to check whether unicode text is in

2019-01-07 15:01发布

问题:

I'll be getting text from a user that I need to validate is a Chinese character.

Is there any way I can check this?

回答1:

According to the information provided here in unicode website you can find the block of Chinese or any other language and then implement a parser to check if a word is in the range or no. just like

public bool IsChinese(string text)
{
    return text.Any(c => c >= 0x20000 && c <= 0xFA2D);
}

Note that

As a handy reference, the Unicode Consortium here provides a search interface to the Unicode Hàn (漢) Database (Unihan).

The database link I'd provided above is showing you the characters



回答2:

You can use regular expression to match with Supported Named Blocks:

private static readonly Regex cjkCharRegex = new Regex(@"\p{IsCJKUnifiedIdeographs}");
public static bool IsChinese(this char c)
{
    return cjkCharRegex.IsMatch(c.ToString());
}

Then, you can use:

if (sometext.Any(z=>z.IsChinese()))
     DoSomething();


回答3:

As several people mentioned here, in unicode, chinese, japan, and Korean characters are encoded together, and there are several ranges to it. https://en.wikipedia.org/wiki/CJK_Compatibility

For the simplicity, here's a code sample that detects all the CJK range:

public bool IsChinese(string text)
{
    return text.Any(c => (uint)c >= 0x4E00 && (uint)c <= 0x2FA1F);
}


回答4:

Just check the characters to see if the codepoints are in the desired range(s). For exampe, see this question:

What's the complete range for Chinese characters in Unicode?



回答5:

According to the wikipedia (https://en.wikipedia.org/wiki/CJK_Compatibility) there are several character code diapasons. Here is my approach to detect Chinese characters based on link above (code in F#, but it can be easily converted)

 let isChinese(text: string) = 
            text |> Seq.exists (fun c -> 
                let code = int c
                (code >= 0x4E00 && code <= 0x9FFF) ||
                (code >= 0x3400 && code <= 0x4DBF) ||
                (code >= 0x3400 && code <= 0x4DBF) ||
                (code >= 0x20000 && code <= 0x2CEAF) ||
                (code >= 0x2E80 && code <= 0x31EF) ||
                (code >= 0xF900 && code <= 0xFAFF) ||
                (code >= 0xFE30 && code <= 0xFE4F) ||
                (code >= 0xF2800 && code <= 0x2FA1F) 
                )


回答6:

in unicode, chinese, japan, and Korean characters are encoded together.

visit this FAQ: http://www.unicode.org/faq/han_cjk.html

chinese character are distributed in serveral blocks.

visit this wiki: https://en.wikipedia.org/wiki/CJK_Unified_Ideographs

You will find there are serveral cjk character charts in unicode website.

For simplicity, You can just use chinese character minimum and maximum range:

0x4e00 and 0x2fa1f to check.



回答7:

This worked for me:

var charArray = text.ToCharArray();
var isChineseTextPresent = false;


foreach (var character in charArray)
{
    var cat = char.GetUnicodeCategory(character);


    if (cat != UnicodeCategory.OtherLetter)
    {
        continue;
    }


    isChineseTextPresent = true;
    break;
}


回答8:

You need to query the Unicode Character Database, that contain info on every unicode character. There probably is a utility function in C# that can do this for you. Otherwise you can download it off the internet.