How do I convert a `string` to a `byte[]` in .NET (C#) without manually specifying a specific encoding?
I'm going to encrypt the string. I can encrypt it without converting, but I'd still like to know why encoding comes into play here.
Also, why should encoding be taken into consideration? Can't I simply get what bytes the string has been stored in? Why is there a dependency on character encodings?
Well, I've read all the answers, and they were either about using an encoding or, in one case, about serialization that drops unpaired surrogates.
It's bad when the string, for example, comes from SQL Server where it was built from a byte array storing, for example, a password hash. If we drop anything from it, it'll store an invalid hash, and if we want to store it in XML, we want to leave it intact (because the XML writer throws an exception on any unpaired surrogate it finds).
So I use Base64 encoding of byte arrays in such cases, but hey, on the Internet there is only one solution to this in C#, and it has a bug in it and works only one way, so I've fixed the bug and written the reverse procedure. Here you are, future googlers:
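The listing that followed is not part of this text. As a rough stand-in for the general idea (keeping binary data such as a hash as Base64 text rather than forcing it into a string), here is a minimal sketch built on the framework's own `Convert` methods. It is not the author's fixed routine, and the class and method names are made up for illustration.

```csharp
using System;

// Hypothetical helper; the names are illustrative only.
public static class Base64Sketch
{
    // Turn raw bytes (for example a password hash) into plain ASCII text
    // that can be stored in XML or a text column without losing anything.
    public static string ToText(byte[] data) => Convert.ToBase64String(data);

    // Recover exactly the original bytes from the Base64 text.
    public static byte[] FromText(string text) => Convert.FromBase64String(text);
}
```

Because the Base64 alphabet is plain ASCII, the resulting text never contains surrogates, so an XML writer has nothing to reject.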
The key issue is that a character in a .NET string takes 16 bits (one UTF-16 code unit, and some characters need two of them, 32 bits in total), but a byte only has 8 bits to spare. A one-to-one mapping doesn't exist unless you restrict yourself to strings that only contain ASCII characters. `System.Text.Encoding` has lots of ways to map a string to a `byte[]`; you need to pick one that avoids loss of information and that is easy for your client to use when she needs to map the `byte[]` back to a string.
UTF-8 is a popular encoding; it is compact and not lossy.
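As a small illustration of the mapping in both directions, here is a sketch that uses `Encoding.UTF8`. The wrapper class and its method names are assumptions made for the example; only `GetBytes` and `GetString` come from the framework.

```csharp
using System;
using System.Text;

// Hypothetical wrapper; only Encoding.UTF8 is a framework API.
public static class Utf8Sketch
{
    // Map text to bytes. ASCII characters take one byte each,
    // other characters take two to four bytes.
    public static byte[] ToBytes(string text) => Encoding.UTF8.GetBytes(text);

    // Map the bytes back to text. The receiver must use the same encoding.
    public static string FromBytes(byte[] bytes) => Encoding.UTF8.GetString(bytes);
}
```

For well-formed text the round trip is exact: `Utf8Sketch.FromBytes(Utf8Sketch.ToBytes("h\u00E9llo"))` gives back the original five-character string, which occupies six bytes in between.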
A string in .NET represents text as a sequence of UTF-16 code units, so the bytes are encoded in memory in UTF-16 already.
Mehrdad's Answer
You can use Mehrdad's answer, but it does actually use an encoding because chars are UTF-16. It calls `ToCharArray`, which, looking at the source, creates a `char[]` and copies the memory to it directly. Then it copies the data to a byte array that is also allocated. So under the hood it copies the underlying bytes twice and allocates a char array that is not used after the call.
Tom Blodget's Answer
Tom Blodget's answer is 20-30% faster than Mehrdad's since it skips the intermediate step of allocating a char array and copying the bytes to it, but it requires you to compile with the `/unsafe` option. If you absolutely do not want to use encoding, I think this is the way to go. If you put your encryption logic inside the `fixed` block, you don't even need to allocate a separate byte array and copy the bytes to it.
Because that is the proper way to do it. `string` is an abstraction.
Using an encoding could give you trouble if you have 'strings' with invalid characters, but that shouldn't happen. If you are getting data into your string with invalid characters, you are doing it wrong. You should probably be using a byte array or a Base64 encoding to start with.
If you use `System.Text.Encoding.Unicode`, your code will be more resilient. You don't have to worry about the endianness of the system your code will be running on. You don't need to worry if the next version of the CLR will use a different internal character encoding.
I think the question isn't why you want to worry about the encoding, but why you want to ignore it and use something else. Encoding is meant to represent the abstraction of a string in a sequence of bytes. `System.Text.Encoding.Unicode` will give you a little-endian byte order encoding and will perform the same on every system, now and in the future.
Contrary to the answers here, you DON'T need to worry about encoding if the bytes don't need to be interpreted!
Like you mentioned, your goal is, simply, to "get what bytes the string has been stored in".
(And, of course, to be able to re-construct the string from the bytes.)
For those goals, I honestly do not understand why people keep telling you that you need the encodings. You certainly do NOT need to worry about encodings for this.
Just do this instead:
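The listing for this step is missing from the text. Going by the description elsewhere in the thread (a `ToCharArray` call followed by a raw copy into a newly allocated byte array), it was presumably along the lines of the sketch below. The class and method names are assumptions.

```csharp
using System;

// Hypothetical names; a sketch of the raw-copy approach described in the thread.
public static class RawBytesSketch
{
    // Copy the string's in-memory UTF-16 code units into a byte array
    // without going through System.Text.Encoding.
    public static byte[] GetBytes(string str)
    {
        byte[] bytes = new byte[str.Length * sizeof(char)];
        Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    // Rebuild the string from bytes produced by GetBytes on a machine
    // with the same byte order.
    public static string GetString(byte[] bytes)
    {
        char[] chars = new char[bytes.Length / sizeof(char)];
        Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
        return new string(chars);
    }
}
```

Note that the byte order of the result follows the machine the code runs on, which is one reason the bytes should not be interpreted elsewhere.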
As long as your program (or other programs) don't try to interpret the bytes somehow, which you obviously didn't mention you intend to do, then there is nothing wrong with this approach! Worrying about encodings just makes your life more complicated for no real reason.
Additional benefit to this approach:
It doesn't matter if the string contains invalid characters, because you can still get the data and reconstruct the original string anyway!
It will be encoded and decoded just the same, because you are just looking at the bytes.
If you used a specific encoding, though, it would've given you trouble with encoding/decoding invalid characters.
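To make the difference concrete, here is a small sketch that pushes a string containing an unpaired surrogate through both routes. The class and method names are illustrative assumptions; the raw route repeats the `Buffer.BlockCopy` idea so the example stands on its own.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the names are illustrative only.
public static class InvalidCharSketch
{
    // Build "a", an unpaired high surrogate, "b": not valid Unicode text.
    public static string MakeInvalid() => new string(new[] { 'a', (char)0xD800, 'b' });

    // Round trip through UTF-8. The default encoder substitutes U+FFFD
    // for the lone surrogate, so the original code unit is lost.
    public static string RoundTripUtf8(string s) =>
        Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(s));

    // Round trip through a raw memory copy. Every 16-bit code unit is
    // preserved, whether or not it is valid text.
    public static string RoundTripRaw(string s)
    {
        byte[] bytes = new byte[s.Length * sizeof(char)];
        Buffer.BlockCopy(s.ToCharArray(), 0, bytes, 0, bytes.Length);
        char[] chars = new char[bytes.Length / sizeof(char)];
        Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
        return new string(chars);
    }
}
```

With `var s = InvalidCharSketch.MakeInvalid();`, the raw round trip returns a string equal to `s`, while the UTF-8 round trip returns a string in which the surrogate has been replaced.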
It depends on what you want the bytes FOR
This is because, as Tyler so aptly said, "Strings aren't pure data. They also have information." In this case, the information is an encoding that was assumed when the string was created.
Assuming that you have binary data (rather than text) stored in a string
This is based off of OP's comment on his own question, and is the correct question if I understand OP's hints at the use-case.
Storing binary data in strings is probably the wrong approach because of the assumed encoding mentioned above! Whatever program or library stored that binary data in a `string` (instead of a `byte[]` array, which would have been more appropriate) has already lost the battle before it has begun. If they are sending the bytes to you in a REST request/response or anything that must transmit strings, Base64 would be the right approach.
If you have a text string with an unknown encoding
Everybody else answered this incorrect question incorrectly.
If the string looks good as-is, just pick an encoding (preferably one starting with UTF), use the corresponding `System.Text.Encoding.???.GetBytes()` function, and tell whoever you give the bytes to which encoding you picked.
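One way to "tell whoever you give the bytes to" is to hand over the encoding's name together with the bytes. The sketch below is only an illustration of that idea: the class, the method names, and the tuple shape are assumptions, while `WebName`, `GetBytes`, `GetEncoding`, and `GetString` are framework members.

```csharp
using System;
using System.Text;

// Hypothetical helper; the names and the tuple shape are illustrative only.
public static class LabeledBytesSketch
{
    // Encode the text and return the bytes along with the encoding's
    // registered name (for example "utf-8" or "utf-16").
    public static (string EncodingName, byte[] Bytes) Encode(string text, Encoding encoding)
        => (encoding.WebName, encoding.GetBytes(text));

    // Decode using whatever encoding the sender said it picked.
    public static string Decode(string encodingName, byte[] bytes)
        => Encoding.GetEncoding(encodingName).GetString(bytes);
}
```

For example, `LabeledBytesSketch.Encode("A", Encoding.Unicode)` yields the two bytes `0x41, 0x00` (little-endian UTF-16) regardless of the machine, and passing the returned name and bytes to `Decode` gives back `"A"`.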