A couple of days ago I came across this Code Review question about Base-36 encoding a byte array. However, the answers that followed didn't touch on decoding back into a byte array, or on reusing the answer to perform encodings in other bases (radices).
The answer to the linked question uses BigInteger, so as far as implementation goes, the base and its digits could simply be parameterized.
The problem with BigInteger, though, is that it treats our input as an integer, while our input, a byte array, is really just an opaque series of values. Treating it as an integer causes two problems:
- If the byte array ends in a series of zero bytes, e.g. {0xFF,0x7F,0x00,0x00}, those bytes are lost when using the algorithm in the answer (it would only encode {0xFF,0x7F}).
- If the last non-zero byte has its sign bit set, then the zero byte following it is consumed, since it's treated as the BigInteger's sign delimiter. So {0xFF,0xFF,0x00,0x00} would encode only as {0xFF,0xFF,0x00}. Both cases are illustrated in the short sketch after this list.
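To make the two quirks concrete, here is a small standalone sketch (not from the original post) that simply feeds those arrays to BigInteger and prints what comes back out:

```csharp
using System;
using System.Numerics;

class SignAndZeroBytesDemo
{
    static void Main()
    {
        // Trailing zero bytes vanish: BigInteger normalizes the value.
        var a = new BigInteger(new byte[] { 0xFF, 0x7F, 0x00, 0x00 });
        Console.WriteLine(BitConverter.ToString(a.ToByteArray())); // FF-7F

        // Sign bit set in the last non-zero byte: one zero byte is kept
        // as the sign marker, the other is lost.
        var b = new BigInteger(new byte[] { 0xFF, 0xFF, 0x00, 0x00 });
        Console.WriteLine(BitConverter.ToString(b.ToByteArray())); // FF-FF-00
    }
}
```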
How could a .NET programmer use BigInteger to create a reasonably efficient, radix-agnostic encoder with decoding support, the ability to handle endian-ness, and a way to 'work around' the ending zero bytes being lost?
edit [2016/04/19]: If you're fond of exceptions, you may wish to change some of the Decode implementation code to throw InvalidDataException instead of just returning null.

edit [2014/09/14]: I've added a 'HACK' to Encode() to handle cases where the last byte in the input is signed (if you were to convert it to sbyte). The only sane solution I could think of right now is to just Resize() the array by one (sketched just below). Additional unit tests for this case passed, but I didn't rerun the perf code to account for such cases. If you can help it, always have your input to Encode() include a dummy 0 byte at the end to avoid additional allocations.
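The workaround amounts to something like this (a sketch only; the helper name and the little-endian assumption are mine, not the original code):

```csharp
using System;
using System.Numerics;

static class EncodeHack
{
    // Sketch of the 'HACK' described above: if the final (most significant,
    // little-endian) byte has its sign bit set, grow the copy by one zero byte
    // so BigInteger reads the value as non-negative, at the cost of one extra
    // allocation.
    public static BigInteger ToUnsignedBigInteger(byte[] littleEndianBytes)
    {
        byte[] copy = (byte[])littleEndianBytes.Clone();
        if (copy.Length > 0 && (copy[copy.Length - 1] & 0x80) != 0)
            Array.Resize(ref copy, copy.Length + 1);
        return new BigInteger(copy);
    }
}
```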
Usage
I've created a RadixEncoding class (found in the "Code" section) which initializes with three parameters:
- the radix's digit characters as a string (the string's length determines the actual radix),
- the endian-ness of the byte arrays it will be given, and
- whether or not to acknowledge ending zero bytes.
To create a Base-36 encoding, with little-endian input, and with respect given to ending zero bytes:
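(A sketch of what that looks like; the k_base36_digits constant is spelled out here, and the exact constructor signature is an assumption based on the rest of the post.)

```csharp
// Digit characters; the string's length (36) determines the radix.
const string k_base36_digits = "0123456789abcdefghijklmnopqrstuvwxyz";

var base36 = new RadixEncoding(k_base36_digits, EndianFormat.Little,
                               includeTrailingZeros: true);
```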
And then to actually perform encoding/decoding:
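(Roughly like this; Encode and Decode are the method names used throughout the post.)

```csharp
byte[] input = System.Text.Encoding.UTF8.GetBytes("A test 1234");

string encoded = base36.Encode(input);   // radix-36 string
byte[] decoded = base36.Decode(encoded); // original bytes back
```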
Performance
Timed with Diagnostics.Stopwatch, run on an i7 860 @ 2.80GHz. The timing EXE was run by itself, not under a debugger. A sketch of the timing loop follows the numbers below.
The encoding was initialized with the same k_base36_digits string from above, EndianFormat.Little, and with ending zero bytes acknowledged (even though the UTF8 bytes don't have any extra ending zero bytes).
- Encoding the UTF8 bytes of "A test 1234" 1,000,000 times takes 2.6567905 seconds.
- Decoding the same string the same number of times takes 3.3916248 seconds.
- Encoding the UTF8 bytes of "A test 1234. Made slightly larger!" 100,000 times takes 1.1577325 seconds.
- Decoding the same string the same number of times takes 1.244326 seconds.
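The timing loop itself was nothing fancier than this (a sketch of the methodology only, not the original harness):

```csharp
using System;
using System.Diagnostics;
using System.Text;

class PerfSketch
{
    static void Main()
    {
        var base36 = new RadixEncoding("0123456789abcdefghijklmnopqrstuvwxyz",
                                       EndianFormat.Little, includeTrailingZeros: true);
        byte[] bytes = Encoding.UTF8.GetBytes("A test 1234");

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 1000000; i++)
            base36.Encode(bytes);
        sw.Stop();

        Console.WriteLine("Encode x1,000,000: {0} secs", sw.Elapsed.TotalSeconds);
    }
}
```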
Code
If you don't have a CodeContracts generator, you will have to reimplement the contracts with if/throw code.
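The original class isn't reproduced here, so below is a minimal sketch of the approach described above rather than the posted code: names such as includeTrailingZeros and CharsPerByteCount are placeholders of mine, plain if/throw checks stand in for the Code Contracts, and digits are emitted least-significant first.

```csharp
using System;
using System.Numerics;
using System.Text;

public enum EndianFormat { Little, Big }

public class RadixEncoding
{
    readonly string kDigits;        // digit characters; length == radix
    readonly double kBitsPerDigit;  // log2(radix), used to size padded output
    readonly EndianFormat kEndian;
    readonly bool kIncludeTrailingZeros;

    public RadixEncoding(string digits, EndianFormat endian, bool includeTrailingZeros)
    {
        if (digits == null || digits.Length < 2)
            throw new ArgumentException("need at least two digit characters", "digits");

        kDigits = digits;
        kBitsPerDigit = Math.Log(digits.Length, 2);
        kEndian = endian;
        kIncludeTrailingZeros = includeTrailingZeros;
    }

    // How many characters are needed to represent byteCount bytes in this radix.
    int CharsPerByteCount(int byteCount)
    {
        return (int)Math.Ceiling((byteCount * 8) / kBitsPerDigit);
    }

    public string Encode(byte[] bytes)
    {
        if (bytes == null) throw new ArgumentNullException("bytes");

        // Work on a little-endian copy and append one zero byte so BigInteger
        // never treats the value as negative (done unconditionally here for
        // brevity; the post's HACK only does this when the top bit is set).
        byte[] le = (byte[])bytes.Clone();
        if (kEndian == EndianFormat.Big) Array.Reverse(le);
        Array.Resize(ref le, le.Length + 1);

        var value = new BigInteger(le);
        int radix = kDigits.Length;
        var sb = new StringBuilder();

        while (value > 0)
        {
            BigInteger rem;
            value = BigInteger.DivRem(value, radix, out rem);
            sb.Append(kDigits[(int)rem]); // least-significant digit first
        }

        // Pad so trailing zero bytes survive the round trip.
        if (kIncludeTrailingZeros)
            while (sb.Length < CharsPerByteCount(bytes.Length))
                sb.Append(kDigits[0]);

        return sb.ToString();
    }

    public byte[] Decode(string chars)
    {
        if (chars == null) return null;

        int radix = kDigits.Length;
        BigInteger value = BigInteger.Zero;

        // Digits were written least-significant first, so accumulate in reverse.
        for (int i = chars.Length - 1; i >= 0; i--)
        {
            int digit = kDigits.IndexOf(chars[i]);
            if (digit < 0) return null; // invalid character; or throw InvalidDataException
            value = value * radix + digit;
        }

        byte[] le = value.ToByteArray(); // little-endian, may include a 0x00 sign byte

        int byteCount;
        if (kIncludeTrailingZeros)
            byteCount = (int)Math.Floor((chars.Length * kBitsPerDigit) / 8);
        else
        {
            byteCount = le.Length;
            while (byteCount > 0 && le[byteCount - 1] == 0) byteCount--;
        }

        Array.Resize(ref le, byteCount); // trims the sign byte or restores the zeros
        if (kEndian == EndianFormat.Big) Array.Reverse(le);
        return le;
    }
}
```

The extra zero byte appended in Encode() is the sign-bit workaround from the edit note above, and the padding/length math based on log2(radix) is what lets trailing zero bytes survive the round trip.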
Basic Unit Tests
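Something along these lines is what's meant (a sketch using plain asserts rather than a particular test framework; it exercises the round trip plus the two problem arrays from the question):

```csharp
using System;
using System.Text;

class RadixEncodingTests
{
    static void AssertBytesEqual(byte[] expected, byte[] actual)
    {
        if (expected.Length != actual.Length)
            throw new Exception("length mismatch");
        for (int i = 0; i < expected.Length; i++)
            if (expected[i] != actual[i])
                throw new Exception("byte mismatch at index " + i);
    }

    static void Main()
    {
        var base36 = new RadixEncoding("0123456789abcdefghijklmnopqrstuvwxyz",
                                       EndianFormat.Little, includeTrailingZeros: true);

        // Ordinary text round-trips.
        byte[] text = Encoding.UTF8.GetBytes("A test 1234");
        AssertBytesEqual(text, base36.Decode(base36.Encode(text)));

        // The two problem cases from the question: trailing zero bytes,
        // and a sign bit set in the last non-zero byte.
        byte[] zeros  = { 0xFF, 0x7F, 0x00, 0x00 };
        AssertBytesEqual(zeros, base36.Decode(base36.Encode(zeros)));

        byte[] signed = { 0xFF, 0xFF, 0x00, 0x00 };
        AssertBytesEqual(signed, base36.Decode(base36.Encode(signed)));

        Console.WriteLine("All round-trips passed.");
    }
}
```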
Interestingly, I was able to port Kornman's techniques across to Java and got the expected output up to and including base 36. Whereas when running his code from C# using C:\Windows\Microsoft.NET\Framework\v4.0.30319 csc, the output was not as expected.
For example, trying to Base-16 encode the MD5 hashBytes obtained for the String "hello world" below using Kornman's RadixEncoding Encode, I could see that the two characters produced for each byte came out in the wrong order.
Rather than 5eb63bbbe01eeed093cb22bb8f5acdc3
I saw something like e56bb3bb0ee1....
This was on Windows 7.
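For reference, the expected string above is just the conventional big-endian hex of the hash, which can be produced like this (a standalone sketch, unrelated to the RadixEncoding class):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

class Md5Reference
{
    static void Main()
    {
        // Canonical hex of MD5("hello world"), for comparison with the encoder's output.
        byte[] hashBytes = MD5.Create().ComputeHash(Encoding.UTF8.GetBytes("hello world"));
        Console.WriteLine(BitConverter.ToString(hashBytes).Replace("-", "").ToLowerInvariant());
        // prints 5eb63bbbe01eeed093cb22bb8f5acdc3
    }
}
```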
The Java code is below for anyone interested. As mentioned above, it only works up to base 36.