-->

How to retrieve the unicode decimal representation

2019-07-31 19:32发布

问题:

I am using visual studio 2010 in c# for converting text into unicodes. Like i have a string abc= "मेरा" . there are 4 characters in this string. i need all the four unicode characters. Please help me.

回答1:

When you write a code like string abc= "मेरा";, you already have it as Unicode (specifically, UTF-16), so you don't have to convert anything. If you want to access the singular characters, you can do that using normal index: e.g. abc[1] is (DEVANAGARI VOWEL SIGN E).

If you want to see the numeric representations of those characters, just cast them to integers. For example

abc.Select(c => (int)c)

gives the sequence of numbers 2350, 2375, 2352, 2366. If you want to see the hexadecimal representation of those numbers, use ToString():

abc.Select(c => ((int)c).ToString("x4"))

returns the sequence of strings "092e", "0947", "0930", "093e".

Note that when I said numeric representations, I actually meant their encoding using UTF-16. For characters in the Basic Multilingual Plane, this is the same as their Unicode code point. The vast majority of used characters lie in BMP, including those 4 Hindi characters presented here.

If you wanted to handle characters in other planes too, you could use code like the following.

byte[] bytes = Encoding.UTF32.GetBytes(abc);

int codePointCount = bytes.Length / 4;

int[] codePoints = new int[codePointCount];

for (int i = 0; i < codePointCount; i++)
    codePoints[i] = BitConverter.ToInt32(bytes, i * 4);

Since UTF-32 encodes all (21-bit) code points directly, this will give you them. (Maybe there is a more straightforward solution, but I haven't found one.)



回答2:

Since a .Net char is a Unicode character (at least, for the BMP code point), you can simply enumerate all characters in a string:

var abc = "मेरा";

foreach (var c in abc)
{
    Console.WriteLine((int)c);
}

resulting in

2350
2375
2352
2366


回答3:

use

System.Text.Encoding.UTF8.GetBytes(abc)

that will return your unicode values.



回答4:

If you are trying to convert files from a legacy encoding into Unicode:

Read the file, supplying the correct encoding of the source files, then write the file using the desired Unicode encoding scheme.

    using (StreamReader reader = new StreamReader(@"C:\MyFile.txt", Encoding.GetEncoding("ISCII")))
    using (StreamWriter writer = new StreamWriter(@"C:\MyConvertedFile.txt", false, Encoding.UTF8))
    {
        writer.Write(reader.ReadToEnd());
    }

If you are looking for a mapping of Devanagari characters to the Unicode code points:

You can find the chart at the Unicode Consortium website here.

Note that Unicode code points are traditionally written in hexidecimal. So rather than the decimal number 2350, the code point would be written as U+092E, and it appears as 092E on the code chart.



回答5:

If you have the string s = मेरा then you already have the answer.

This string contains four code points in the BMP which in UTF-16 are represented by 8 bytes. You can access them by index with s[i], with a foreach loop etc.

If you want the underlying 8 bytes you can access them as so:

string str = @"मेरा";
byte[] arr = System.Text.UnicodeEncoding.GetBytes(str);