vb.net: Encoding byte array into string using Unicode

Posted 2019-09-18 11:03

Question:

I am reading raw data from a source. This raw data is a sequence of bytes, which I store into an array of Bytes that I define as follows in VB.NET:

Dim frame() As Byte

so each element in the above array is in the range [0-255].

I want to decode each of these bytes as ASCII, UTF-8 and Unicode, so I iterate over the byte array (frame) and run the snippet below for each case:

ASCII:

For idxByte As Integer = 0 To Me.frame.Length - 1
    txtRefs(idxByte).Text = String.Format("<{0}>", Encoding.ASCII.GetString(frame, idxByte, 1))
Next

Note: txtRefs is an array of textboxes, and its length is the same as frame.

And similar for the other two encodings:

UTF-8:

For idxByte As Integer = 0 To Me.frame.Length - 1
    txtRefs(idxByte).Text = String.Format("<{0}>", Encoding.UTF8.GetString(frame, idxByte, 1))
Next

Unicode:

For idxByte As Integer = 0 To Me.frame.Length - 1
    txtRefs(idxByte).Text = String.Format("<{0}>", Encoding.Unicode.GetString(frame, idxByte, 1))
Next

ASCII and UTF-8 decoding seem OK, but the Unicode decoding doesn't seem to work, or maybe I don't understand how the Unicode encoding works at all...

For Unicode I get the result below when executing the above loop. Is this correct?

Answer 1:

Encoding.Unicode is UTF-16 LE, so it needs two bytes per code unit to give the correct result, e.g.:

Dim input() As Byte = { 65, 0 }
Dim x = Encoding.Unicode.GetString(input, 0, 1)
Dim y = Encoding.Unicode.GetString(input, 0, 2)
Console.WriteLine("x={0}, y={1}", x, y)

x=�, y=A

However, if your input is one byte per character, you probably don't want to just pass two bytes from your input array. You may instead want to build a two-byte array with a zero as the second (high) byte:

Dim input() As Byte = { 65, 0 }
Dim x = Encoding.Unicode.GetString(input, 0, 1)
Dim y = Encoding.Unicode.GetString(input, 0, 2)
Dim z = Encoding.Unicode.GetString(New Byte() { input(0), 0 })
Console.WriteLine("x={0}, y={1}, z={2}", x, y, z)

x=�, y=A, z=A

Hard to tell without knowing your input and desired output.
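If the goal is just to show each byte as a character, a minimal sketch of that second approach, reusing the frame and txtRefs names from the question (and assuming Imports System.Text), could look like this:

For idxByte As Integer = 0 To frame.Length - 1
    ' Pad the single byte with a zero high byte so it forms one UTF-16 LE code unit.
    Dim pair() As Byte = {frame(idxByte), 0}
    txtRefs(idxByte).Text = String.Format("<{0}>", Encoding.Unicode.GetString(pair))
Next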



Answer 2:

For ASCII, each byte is a code unit, which is a codepoint, which is a character, which is a glyph.

For UTF-8, each byte is a code unit; one or more code units make a codepoint, and one or more codepoints make a glyph.

For UTF-16, each pair of bytes is a code unit; one or more code units make a codepoint, and one or more codepoints make a glyph.

To convert a sequence of bytes, just use one call to GetString on the appropriate Encoding instance. Then you'll be dealing with String, which is a counted sequence of UTF-16 code units.
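As a minimal sketch (the byte values here are only an illustration; they are "Héllo" encoded as UTF-8):

Imports System.Text

Module DecodeDemo
    Sub Main()
        Dim frame() As Byte = {72, 195, 169, 108, 108, 111}
        ' One call decodes the whole buffer into a .NET String (UTF-16 code units).
        Dim text As String = Encoding.UTF8.GetString(frame)
        Console.WriteLine(text)   ' Héllo
    End Sub
End Module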

The built-in Encoding classes use a substitution character ("?" for ASCII, "�" for the UTF encodings) when the bytes don't make sense for the encoding. If you prefer, you can create an Encoding instance with an exception DecoderFallback so you can handle those cases yourself. For example, 0xFF is never a valid ASCII code unit; 0xCD is a valid lead byte in UTF-8, but the sequence 0xCD 0x20 is not valid.
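A minimal sketch of that approach, assuming strict UTF-8 decoding is what's wanted (the byte values are only illustrative):

Imports System.Text

Module StrictDecodeDemo
    Sub Main()
        ' Build a UTF-8 decoder that throws instead of substituting "�".
        Dim strictUtf8 As Encoding = Encoding.GetEncoding(
            "utf-8", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback)

        Dim bytes() As Byte = {&HCD, &H20}   ' 0xCD 0x20 is not a valid UTF-8 sequence
        Try
            Console.WriteLine(strictUtf8.GetString(bytes))
        Catch ex As DecoderFallbackException
            Console.WriteLine("Invalid byte sequence at index {0}", ex.Index)
        End Try
    End Sub
End Module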

Presumably, you want to separate glyphs for display purposes. See TextElementEnumerator.
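A minimal sketch of iterating text elements (grapheme clusters) with TextElementEnumerator; the sample string is only an illustration:

Imports System.Globalization

Module TextElementDemo
    Sub Main()
        ' "é" written as 'e' plus a combining acute accent: two codepoints, one text element.
        Dim s As String = "ae" & ChrW(&H301) & "o"
        Dim it As TextElementEnumerator = StringInfo.GetTextElementEnumerator(s)
        While it.MoveNext()
            Console.WriteLine("<{0}>", it.GetTextElement())
        End While
    End Sub
End Module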