I am reading raw data from a source. This raw data is a sequence of bytes.
I store this sequence of bytes into a byte array that I define as follows in VB.NET:
Dim frame() As Byte
so each element of the array is in the range [0, 255].
I want to interpret each of these bytes as ASCII, UTF-8, and Unicode, so I iterate over the byte array (frame) and run one of the snippets below, depending on the case:
ASCII:
For idxByte As Integer = 0 To Me.frame.Length - 1
    ' Requires Imports System.Text
    txtRefs(idxByte).Text = String.Format("<{0}>", Encoding.ASCII.GetString(frame, idxByte, 1))
Next
Note: txtRefs is an array of TextBoxes, and its length is the same as frame's.
And similarly for the other two encodings:
UTF-8:
For idxByte As Integer = 0 To Me.frame.Length - 1
    txtRefs(idxByte).Text = String.Format("<{0}>", Encoding.UTF8.GetString(frame, idxByte, 1))
Next
Unicode:
For idxByte As Integer = 0 To Me.frame.Length - 1
    txtRefs(idxByte).Text = String.Format("<{0}>", Encoding.Unicode.GetString(frame, idxByte, 1))
Next
The ASCII and UTF-8 decodings seem OK, but the Unicode one does not seem to work, or maybe I am not understanding how Unicode encoding works at all...
For Unicode I get the result below when executing the above loop. Is this correct?
Encoding.Unicode is UTF-16 LE, so it needs two bytes to produce the correct result, e.g.
Dim input() As Byte = { 65, 0 }
Dim x = Encoding.Unicode.GetString(input, 0, 1)
Dim y = Encoding.Unicode.GetString(input, 0, 2)
Console.WriteLine("x={0}, y={1}", x, y)
x=�, y=A
However, if your input is one byte per character, you probably don't want to pass two consecutive bytes from your input array. You may want to create a new two-byte input array whose second byte is zero:
Dim input() As Byte = { 65, 0 }
Dim x = Encoding.Unicode.GetString(input, 0, 1)
Dim y = Encoding.Unicode.GetString(input, 0, 2)
Dim z = Encoding.Unicode.GetString(New Byte() { input(0), 0 })
Console.WriteLine("x={0}, y={1}, z={2}", x, y, z)
x=�, y=A, z=A
It's hard to tell without knowing your input and desired output.
For ASCII, each byte is a code unit, which is a codepoint, which is a character, which is a glyph.
For UTF-8, each byte is a code unit; one to four code units form a codepoint; one or more codepoints form a glyph.
For UTF-16, each pair of bytes is a code unit; one or two code units form a codepoint; one or more codepoints form a glyph (see the sketch below).
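As a minimal illustration of the UTF-16 case (the sample codepoint U+1D11E, MUSICAL SYMBOL G CLEF, is just an assumption for demonstration), a codepoint outside the Basic Multilingual Plane occupies two code units:
Dim clef As String = Char.ConvertFromUtf32(&H1D11E)           ' one codepoint, U+1D11E
Console.WriteLine(clef.Length)                                ' 2: two UTF-16 code units (a surrogate pair)
Console.WriteLine(Encoding.Unicode.GetByteCount(clef))        ' 4: two bytes per code unit
Console.WriteLine(Char.ConvertToUtf32(clef, 0).ToString("X")) ' 1D11E: back to a single codepoint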
To convert a sequence of bytes, just use one call to GetString on the appropriate Encoding instance. Then you'll be dealing with a String, which is a counted sequence of UTF-16 code units.
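For example, a single call decodes the whole buffer (a sketch reusing the frame array from the question):
Dim text As String = Encoding.UTF8.GetString(frame) ' decode the entire byte array in one call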
The built-in Encoding classes use a substitution character ("?" for ASCII, "�" for the Unicode encodings) when the bytes don't make sense for the encoding. If you prefer, you can create an instance with an exception-throwing DecoderFallback so you'll be able to handle those cases yourself. For example, 0xFF is never a valid ASCII code unit; 0xCD is a valid lead byte in UTF-8, but the sequence 0xCD 0x20 is not valid.
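A minimal sketch of the exception-fallback approach (the sample bytes are arbitrary):
' Build an ASCII encoding that throws on invalid bytes instead of substituting "?".
Dim strictAscii As Encoding = Encoding.GetEncoding("us-ascii",
                                                   EncoderFallback.ExceptionFallback,
                                                   DecoderFallback.ExceptionFallback)
Try
    Dim text As String = strictAscii.GetString(New Byte() {65, &HFF}) ' &HFF is never valid ASCII
Catch ex As DecoderFallbackException
    Console.WriteLine("Invalid byte 0x{0:X2} at index {1}", ex.BytesUnknown(0), ex.Index)
End Try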
Presumably, you want to separate glyphs for display purposes. See TextElementEnumerator.
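A short sketch of separating glyphs with it (the combining-accent sample string is just for illustration):
Dim s As String = "e" & ChrW(&H301) & "x" ' 'e' + U+0301 COMBINING ACUTE ACCENT renders as one glyph, then 'x'
Dim it = System.Globalization.StringInfo.GetTextElementEnumerator(s)
While it.MoveNext()
    Console.WriteLine(it.GetTextElement()) ' prints the two-Char "é" element, then "x"
End While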