I have an array of bytes (any length), and I want to encode this array into string using my own base encoder. In .NET
is standard Base64
encoder, but what if I want to encode the array in Base62
, Base53
or Base13
?
Is it even possible to create such universal base encoder?
I know I could do it the simple way, that is, for each byte reserve fixed number of chars (in case of Base62
, that would be 5 chars), and do direct byte->chars encoding, but I would be wasting space, as 5 Base62
chars are able to contain more than 1 byte, but less than 2 bytes.
How should I write such an encoder? Or is there already some class for this?
And please note that I need universal decoder as well, otherwise this is useless to me.
Resources
As the solution is already known (use BigInteger
), I would just like to put here some resources relating the BigInteger
class, as it is not available in .NET 3.5:
Big integers in C#
http://intx.codeplex.com/
https://svn.apache.org/repos/asf/incubator/heraldry/libraries/csharp/openid/trunk/Mono/Mono.Math/BigInteger.cs
http://www.codeproject.com/KB/cs/BigInteger_Library.aspx
http://www.codeproject.com/KB/cs/biginteger.aspx
BASE64 works well, because 64 is a power of 2 (2^6) so each character holds 6 bits of data, and 3 bytes (3 * 8 = 24 bits) can be encoded into 4 characters (4 * 6 = 24). The encoding & decoding can be down merely bit shifting bits.
For bases which do not align with a power of 2 (like your base 62 or Base 53), Then you must treat the message you are trying to encode as one long number and perform divison and modulo operations on it. You'd probably be better off using a Base32 encoding and squandering a bit of bandwidth.
Another example to look at is Ascii85, used in Adobe PostScript and PDF documents. In Ascii85, 5 characters are used to encode 4 bytes. You can figure out the efficiency of this coding as (256^4)/(85^5) = 96.8%. This is the fraction of bit combinations that will actually be used.
So, for whatever new base you would want to use to encode your data, you want to look for a power that will get it just above a power of 256 if you're trying to maximize coding efficiency. This might not be easy for every base. Checking base 53 shows that the best you'll probably get is using 7 bytes to encode 5 bytes (93.6% efficiency), unless you feel like using 88 bytes to encode 63 bytes.
A little late to the party, but...
Because your specification calls for an arbitrary number of bits, you must have an integer type that can work with an arbitrary number of bits. If you can't target .NET 4.0 you'll have to beg, borrow, or steal a BigInteger implementation somewhere (like .NET 4.0 perhaps).
If performance is not an issue, use the BigInteger class in the background. You have a constructor for BigInteger that takes byte array, and you can then manually run loops of division and modulus to get the representation in other non-standard bases.
Also take a look at this.
A post on CodeReview prompted me to create a RadixEncoding class which is able to handle encoding/decoding a byte array to/from a base-N string.
The class can be found in this Q&A thread, along with documentation on (and solutions for) a few edge cases when dealing with BigInteger, endian-ness support, and the class' overall performance
You can get inspiration from C# implementation of Base32 implementation by Michael Giagnocavo.