How do I get a consistent byte representation of s

2018-12-31 01:44发布

How do I convert a string to a byte[] in .NET (C#) without manually specifying a specific encoding?

I'm going to encrypt the string. I can encrypt it without converting, but I'd still like to know why encoding comes to play here.

Also, why should encoding be taken into consideration? Can't I simply get what bytes the string has been stored in? Why is there a dependency on character encodings?

30条回答
大哥的爱人
2楼-- · 2018-12-31 02:13

The closest approach to the OP's question is Tom Blodget's, which actually goes into the object and extracts the bytes. I say closest because it depends on implementation of the String Object.

"Can't I simply get what bytes the string has been stored in?"

Sure, but that's where the fundamental error in the question arises. The String is an object which could have an interesting data structure. We already know it does, because it allows unpaired surrogates to be stored. It might store the length. It might keep a pointer to each of the 'paired' surrogates allowing quick counting. Etc. All of these extra bytes are not part of the character data.

What you want is each character's bytes in an array. And that is where 'encoding' comes in. By default you will get UTF-16LE. If you don't care about the bytes themselves except for the round trip then you can choose any encoding including the 'default', and convert it back later (assuming the same parameters such as what the default encoding was, code points, bug fixes, things allowed such as unpaired surrogates, etc.

But why leave the 'encoding' up to magic? Why not specify the encoding so that you know what bytes you are gonna get?

"Why is there a dependency on character encodings?"

Encoding (in this context) simply means the bytes that represent your string. Not the bytes of the string object. You wanted the bytes the string has been stored in -- this is where the question was asked naively. You wanted the bytes of string in a contiguous array that represent the string, and not all of the other binary data that a string object may contain.

Which means how a string is stored is irrelevant. You want a string "Encoded" into bytes in a byte array.

I like Tom Bloget's answer because he took you towards the 'bytes of the string object' direction. It's implementation dependent though, and because he's peeking at internals it might be difficult to reconstitute a copy of the string.

Mehrdad's response is wrong because it is misleading at the conceptual level. You still have a list of bytes, encoded. His particular solution allows for unpaired surrogates to be preserved -- this is implementation dependent. His particular solution would not produce the string's bytes accurately if GetBytes returned the string in UTF-8 by default.


I've changed my mind about this (Mehrdad's solution) -- this isn't getting the bytes of the string; rather it is getting the bytes of the character array that was created from the string. Regardless of encoding, the char datatype in c# is a fixed size. This allows a consistent length byte array to be produced, and it allows the character array to be reproduced based on the size of the byte array. So if the encoding were UTF-8, but each char was 6 bytes to accommodate the largest utf8 value, it would still work. So indeed -- encoding of the character does not matter.

But a conversion was used -- each character was placed into a fixed size box (c#'s character type). However what that representation is does not matter, which is technically the answer to the OP. So -- if you are going to convert anyway... Why not 'encode'?

查看更多
裙下三千臣
3楼-- · 2018-12-31 02:14

Use:

    string text = "string";
    byte[] array = System.Text.Encoding.UTF8.GetBytes(text);

The result is:

[0] = 115
[1] = 116
[2] = 114
[3] = 105
[4] = 110
[5] = 103
查看更多
伤终究还是伤i
4楼-- · 2018-12-31 02:16

This is a popular question. It is important to understand what the question author is asking, and that it is different from what is likely the most common need. To discourage misuse of the code where it is not needed, I've answered the later first.

Common Need

Every string has a character set and encoding. When you convert a System.String object to an array of System.Byte you still have a character set and encoding. For most usages, you'd know which character set and encoding you need and .NET makes it simple to "copy with conversion." Just choose the appropriate Encoding class.

// using System.Text;
Encoding.UTF8.GetBytes(".NET String to byte array")

The conversion may need to handle cases where the target character set or encoding doesn't support a character that's in the source. You have some choices: exception, substitution or skipping. The default policy is to substitute a '?'.

// using System.Text;
var text = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes("You win €100")); 
                                                      // -> "You win ?100"

Clearly, conversions are not necessarily lossless!

Note: For System.String the source character set is Unicode.

The only confusing thing is that .NET uses the name of a character set for the name of one particular encoding of that character set. Encoding.Unicode should be called Encoding.UTF16.

That's it for most usages. If that's what you need, stop reading here. See the fun Joel Spolsky article if you don't understand what an encoding is.

Specific Need

Now, the question author asks, "Every string is stored as an array of bytes, right? Why can't I simply have those bytes?"

He doesn't want any conversion.

From the C# spec:

Character and string processing in C# uses Unicode encoding. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.

So, we know that if we ask for the null conversion (i.e., from UTF-16 to UTF-16), we'll get the desired result:

Encoding.Unicode.GetBytes(".NET String to byte array")

But to avoid the mention of encodings, we must do it another way. If an intermediate data type is acceptable, there is a conceptual shortcut for this:

".NET String to byte array".ToCharArray()

That doesn't get us the desired datatype but Mehrdad's answer shows how to convert this Char array to a Byte array using BlockCopy. However, this copies the string twice! And, it too explicitly uses encoding-specific code: the datatype System.Char.

The only way to get to the actual bytes the String is stored in is to use a pointer. The fixed statement allows taking the address of values. From the C# spec:

[For] an expression of type string, ... the initializer computes the address of the first character in the string.

To do so, the compiler writes code skip over the other parts of the string object with RuntimeHelpers.OffsetToStringData. So, to get the raw bytes, just create a pointer to the string and copy the number of bytes needed.

// using System.Runtime.InteropServices
unsafe byte[] GetRawBytes(String s)
{
    if (s == null) return null;
    var codeunitCount = s.Length;
    /* We know that String is a sequence of UTF-16 codeunits 
       and such codeunits are 2 bytes */
    var byteCount = codeunitCount * 2; 
    var bytes = new byte[byteCount];
    fixed(void* pRaw = s)
    {
        Marshal.Copy((IntPtr)pRaw, bytes, 0, byteCount);
    }
    return bytes;
}

As @CodesInChaos pointed out, the result depends on the endianness of the machine. But the question author is not concerned with that.

查看更多
呛了眼睛熬了心
5楼-- · 2018-12-31 02:18

C# to convert a string to a byte array:

public static byte[] StrToByteArray(string str)
{
   System.Text.UTF8Encoding  encoding=new System.Text.UTF8Encoding();
   return encoding.GetBytes(str);
}
查看更多
皆成旧梦
6楼-- · 2018-12-31 02:19

The string can be converted to byte array in few different ways, due to the following fact: .NET supports Unicode, and Unicode standardizes several difference encodings called UTFs. They have different lengths of byte representation but are equivalent in that sense that when a string is encoded, it can be coded back to the string, but if the string is encoded with one UTF and decoded in the assumption of different UTF if can be screwed up.

Also, .NET supports non-Unicode encodings, but they are not valid in general case (will be valid only if a limited sub-set of Unicode code point is used in an actual string, such as ASCII). Internally, .NET supports UTF-16, but for stream representation, UTF-8 is usually used. It is also a standard-de-facto for Internet.

Not surprisingly, serialization of string into an array of byte and deserialization is supported by the class System.Text.Encoding, which is an abstract class; its derived classes support concrete encodings: ASCIIEncoding and four UTFs (System.Text.UnicodeEncoding supports UTF-16)

Ref this link.

For serialization to an array of bytes using System.Text.Encoding.GetBytes. For the inverse operation use System.Text.Encoding.GetChars. This function returns an array of characters, so to get a string, use a string constructor System.String(char[]).
Ref this page.

Example:

string myString = //... some string

System.Text.Encoding encoding = System.Text.Encoding.UTF8; //or some other, but prefer some UTF is Unicode is used
byte[] bytes = encoding.GetBytes(myString);

//next lines are written in response to a follow-up questions:

myString = new string(encoding.GetChars(bytes));
byte[] bytes = encoding.GetBytes(myString);
myString = new string(encoding.GetChars(bytes));
byte[] bytes = encoding.GetBytes(myString);

//how many times shall I repeat it to show there is a round-trip? :-)
查看更多
公子世无双
7楼-- · 2018-12-31 02:20

Just to demonstrate that Mehrdrad's sound answer works, his approach can even persist the unpaired surrogate characters(of which many had leveled against my answer, but of which everyone are equally guilty of, e.g. System.Text.Encoding.UTF8.GetBytes, System.Text.Encoding.Unicode.GetBytes; those encoding methods can't persist the high surrogate characters d800 for example, and those just merely replace high surrogate characters with value fffd ) :

using System;

class Program
{     
    static void Main(string[] args)
    {
        string t = "爱虫";            
        string s = "Test\ud800Test"; 

        byte[] dumpToBytes = GetBytes(s);
        string getItBack = GetString(dumpToBytes);

        foreach (char item in getItBack)
        {
            Console.WriteLine("{0} {1}", item, ((ushort)item).ToString("x"));
        }    
    }

    static byte[] GetBytes(string str)
    {
        byte[] bytes = new byte[str.Length * sizeof(char)];
        System.Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    static string GetString(byte[] bytes)
    {
        char[] chars = new char[bytes.Length / sizeof(char)];
        System.Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
        return new string(chars);
    }        
}

Output:

T 54
e 65
s 73
t 74
? d800
T 54
e 65
s 73
t 74

Try that with System.Text.Encoding.UTF8.GetBytes or System.Text.Encoding.Unicode.GetBytes, they will merely replace high surrogate characters with value fffd

Every time there's a movement in this question, I'm still thinking of a serializer(be it from Microsoft or from 3rd party component) that can persist strings even it contains unpaired surrogate characters; I google this every now and then: serialization unpaired surrogate character .NET. This doesn't make me lose any sleep, but it's kind of annoying when every now and then there's somebody commenting on my answer that it's flawed, yet their answers are equally flawed when it comes to unpaired surrogate characters.

Darn, Microsoft should have just used System.Buffer.BlockCopy in its BinaryFormatter

谢谢!

查看更多
登录 后发表回答