How to recognize if a string contains unicode char

2019-01-06 18:55发布

问题:

I have a string and I want to know if it has unicode characters inside or not. (if its fully contains ASCII or not)

How can I achieve that?

Thanks!

回答1:

If my assumptions are correct you wish to know if your string contains any "non-ANSI" characters. You can derive this as follows.

    public void test()
    {
        const string WithUnicodeCharacter = "a hebrew character:\uFB2F";
        const string WithoutUnicodeCharacter = "an ANSI character:Æ";

        bool hasUnicode;

        //true
        hasUnicode = ContainsUnicodeCharacter(WithUnicodeCharacter);
        Console.WriteLine(hasUnicode);

        //false
        hasUnicode = ContainsUnicodeCharacter(WithoutUnicodeCharacter);
        Console.WriteLine(hasUnicode);
    }

    public bool ContainsUnicodeCharacter(string input)
    {
        const int MaxAnsiCode = 255;

        return input.Any(c => c > MaxAnsiCode);
    }

Update

This will detect for extended ASCII. If you only detect for the true ASCII character range (up to 127), then you could potentially get false positives for extended ASCII characters which does not denote Unicode. I have alluded to this in my sample.



回答2:

If a string contains only ASCII characters, a serialization + deserialization step using ASCII encoding should get back the same string so a one liner check in c# could look like..

String s1="testभारत";
bool isUnicode= System.Text.ASCIIEncoding.GetEncoding(0).GetString(System.Text.ASCIIEncoding.GetEncoding(0).GetBytes(s1)) != s1;


回答3:

ASCII defines only character codes in the range 0-127. Unicode is explicitly defined such as to overlap in that same range with ASCII. Thus, if you look at the character codes in your string, and it contains anything that is higher than 127, the string contains Unicode characters that are not ASCII characters.

Note, that ASCII includes only the English alphabet. Thus, if you (for whatever reason) need to apply that same approach to strings that might contain accented characters (Spanish text for example), ASCII is not sufficient and you need to look for another differentiator.

ANSI character set [*] does extends the ASCII characters with the aforementioned accented Latin characters in the range 128-255. However, Unicode does not overlap with ANSI in that range, so technically an Unicode string might contain characters that are not part of ANSI, but have the same character code (specifically in the range 128-159, as you can see from the table I linked to).

As for the actual code to do this, @chibacity answer should work, although you should modify it to cover strict ASCII, because it won't work for ANSI.

[*] Also known as Latin 1 Windows (Win-1252)



回答4:

All C# / VB.NET string datatypes are comprised of Unicode characters.



回答5:

As long as it contains characters, it contains Unicode characters.

From System.String:

Represents text as a series of Unicode characters.

public static bool ContainsUnicodeChars(string text)
{
   return !string.IsNullOrEmpty(text);
}

You normally have to worry about different Unicode encodings when you have to:

  1. Encode a string into a stream of bytes with a particular encoding.
  2. Decode a string from a stream of bytes with a particular encoding.

Once you're into string land though, the encoding that the string was originally represented with, if any, is irrelevant.

Each character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded by using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char object.

Perhaps you might also find these questions relevant:

How can you strip non-ASCII characters from a string? (in C#)

C# Ensure string contains only ASCII

And this article by Jon Skeet: Unicode and .NET



回答6:

This is another solution without using lambda expresions. It is in VB.NET but you can convert it easily to C#:

   Public Function ContainsUnicode(ByVal inputstr As String) As Boolean
        Dim inputCharArray() As Char = inputstr.ToCharArray

        For i As Integer = 0 To inputCharArray.Length - 1
            If CInt(AscW(inputCharArray(i))) > 255 Then Return True
        Next
        Return False
   End Function