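A few days ago I came across this Code Review question on base-36 encoding of a byte array. However, the answers that followed didn't touch on decoding back into a byte array, or on possibly reusing the answer to perform encodings in different bases (radixes).
The answer to the linked question uses BigInteger. So, as far as implementation goes, the base and its digits could be parameterized.
The problem with BigInteger, though, is that we're treating our input as an assumed integer. However, our input, a byte array, is just an opaque series of values.
- If the byte array ends in a series of zero bytes, e.g. {0xFF,0x7F,0x00,0x00}, those bytes will be lost when using the algorithm in the answer (it would only encode {0xFF,0x7F}).
- If the last non-zero byte has the sign bit set, then the zero byte that follows it is consumed, because it is treated as the BigInteger's sign delimiter. So {0xFF,0xFF,0x00,0x00} would encode only as {0xFF,0xFF,0x00}.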
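Both cases can be reproduced directly, without any encoder involved. The following is a minimal sketch (the class and method names are made up for illustration) that simply round-trips a byte array through System.Numerics.BigInteger, which reads its input as a little-endian two's-complement number:

```csharp
using System;
using System.Numerics;

public static class BigIntegerRoundTripDemo
{
    // Round-trips a byte array through BigInteger (little-endian, two's complement).
    public static byte[] RoundTrip(byte[] bytes)
    {
        return new BigInteger(bytes).ToByteArray();
    }

    public static void Main()
    {
        // Trailing zero bytes carry no numeric value, so none of them survive.
        Console.WriteLine(BitConverter.ToString(
            RoundTrip(new byte[] { 0xFF, 0x7F, 0x00, 0x00 }))); // FF-7F

        // One zero byte is kept so that 0xFFFF is not read back as -1.
        Console.WriteLine(BitConverter.ToString(
            RoundTrip(new byte[] { 0xFF, 0xFF, 0x00, 0x00 }))); // FF-FF-00
    }
}
```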
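How could a .NET programmer use BigInteger to create a reasonably efficient and radix-agnostic encoder, with decoding support, plus the ability to handle endianness, and with the ability to "work around" the loss of ending zero bytes?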
Edit [2016/04/19]: If you're fond of exceptions, you may wish to change some of the Decode implementation code to throw InvalidDataException instead of just returning null.
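As a rough sketch of that change, here is the same kind of digit loop written as a standalone helper that throws instead of returning null. DecodeOrThrow and its digits parameter are hypothetical names used only for this illustration; they are not part of the class below, and the helper ignores the endian and padding options.

```csharp
using System;
using System.IO;
using System.Numerics;

public static class StrictDecodeSketch
{
    // Hypothetical helper: accumulates digits into a BigInteger (most significant
    // digit first) and throws InvalidDataException on an unknown character.
    public static byte[] DecodeOrThrow(string chars, string digits)
    {
        if (chars == null) throw new ArgumentNullException("chars");
        if (digits == null) throw new ArgumentNullException("digits");

        var radix = new BigInteger(digits.Length);
        var value = BigInteger.Zero;
        for (int x = 0; x < chars.Length; x++)
        {
            int i = digits.IndexOf(chars[x]);
            if (i < 0)
                throw new InvalidDataException(
                    string.Format("Invalid character '{0}' at index {1}", chars[x], x));
            value = value * radix + i;
        }
        return value.ToByteArray();
    }

    public static void Main()
    {
        const string digits = "0123456789abcdefghijklmnopqrstuvwxyz";
        Console.WriteLine(BitConverter.ToString(DecodeOrThrow("zz", digits)));
        try { DecodeOrThrow("z!", digits); }
        catch (InvalidDataException e) { Console.WriteLine(e.Message); }
    }
}
```

Whether an exception or a null return is the better contract depends on how often callers expect malformed input.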
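Edit [2014/09/14]: I've added a "HACK" to Encode() to handle cases where the last byte in the input is signed (that is, negative if you were to convert it to sbyte). The only sane solution I can think of right now is to just Resize() the array by one. Additional unit tests for this case passed, but I didn't rerun the perf code to account for such cases. If you can help it, always have your input to Encode() include a dummy 0 byte at the end to avoid the additional allocation.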
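To see why the padding matters, the sketch below compares the value BigInteger assigns to the same bytes with and without the extra zero byte. ToUnsignedValue is a name invented for this example; it mirrors the check used in Encode() but is not part of the class.

```csharp
using System;
using System.Numerics;

public static class SignBytePaddingDemo
{
    // Illustrative helper: if the last (most significant) byte has its top bit set,
    // append a zero byte so BigInteger reads the input as a non-negative value.
    public static BigInteger ToUnsignedValue(byte[] bytes)
    {
        if (bytes.Length > 0 && (bytes[bytes.Length - 1] & 0x80) == 0x80)
            Array.Resize(ref bytes, bytes.Length + 1); // allocates a copy; new byte is 0
        return new BigInteger(bytes);
    }

    public static void Main()
    {
        byte[] input = { 0xFF, 0xFF };
        Console.WriteLine(new BigInteger(input));  // -1
        Console.WriteLine(ToUnsignedValue(input)); // 65535
    }
}
```

Array.Resize allocates a new array, which is the extra allocation that a caller-supplied trailing 0 byte avoids.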
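Usage
I've created a RadixEncoding class (found in the "Code" section) which is initialized with three parameters:
- the radix digits as a string (its length determines the actual radix),
- the assumed byte ordering (endianness) of the input byte arrays,
- and whether or not the user wants the encoding/decoding logic to acknowledge ending zero bytes.
To create a base-36 encoding with little-endian input that does not preserve ending zero bytes (the third argument is false):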
const string k_base36_digits = "0123456789abcdefghijklmnopqrstuvwxyz";
var base36_no_zeros = new RadixEncoding(k_base36_digits, EndianFormat.Little, false);
Then, to actually perform an encode/decode:
const string k_input = "A test 1234";
byte[] input_bytes = System.Text.Encoding.UTF8.GetBytes(k_input);
string encoded_string = base36_no_zeros.Encode(input_bytes);
byte[] decoded_bytes = base36_no_zeros.Decode(encoded_string);
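Performance
Timed with Diagnostics.Stopwatch, run on an i7 860 @ 2.80GHz. The timing EXE was run by itself, not under a debugger.
The encoding was initialized with the same k_base36_digits string as above, EndianFormat.Little, and with ending zero bytes acknowledged (even though the UTF8 bytes don't have any extra ending zero bytes).
Encoding the UTF8 bytes of "A test 1234" 1,000,000 times takes 2.6567905 secs.
Decoding the resulting string the same number of times takes 3.3916248 secs.
Encoding the UTF8 bytes of "A test 1234. Made slightly larger!" 100,000 times takes 1.1577325 secs.
Decoding the resulting string the same number of times takes 1.244326 secs.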
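For anyone wanting to take comparable measurements, a Stopwatch harness could look roughly like the sketch below. This is not the code that produced the numbers above; the Time helper, the warm-up call, and the stand-in workload are all assumptions made for the example, and the lambda would be replaced by a call to Encode or Decode.

```csharp
using System;
using System.Diagnostics;
using System.Numerics;
using System.Text;

public static class TimingSketch
{
    // Runs the action once as a warm-up, then times the given number of iterations.
    public static TimeSpan Time(int iterations, Action action)
    {
        action(); // warm-up, so JIT compilation is not included in the measurement
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            action();
        sw.Stop();
        return sw.Elapsed;
    }

    public static void Main()
    {
        byte[] input = Encoding.UTF8.GetBytes("A test 1234");
        // Stand-in workload; substitute the encoder call being measured.
        TimeSpan elapsed = Time(1000000, () => new BigInteger(input).ToString());
        Console.WriteLine("{0} secs", elapsed.TotalSeconds);
    }
}
```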
Code
If you don't have a CodeContracts generator, you will have to reimplement the contracts with if/throw code.
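As a small illustration of that rewrite, a precondition such as Contract.Requires<ArgumentNullException>(digits != null) becomes an explicit check. DigitsHolder is a throwaway type invented for this sketch, not part of the class below:

```csharp
using System;

public sealed class DigitsHolder
{
    readonly string kDigits;

    public DigitsHolder(string digits)
    {
        // Replaces: Contract.Requires<ArgumentNullException>(digits != null);
        if (digits == null) throw new ArgumentNullException("digits");
        kDigits = digits;
    }

    public int Radix { get { return kDigits.Length; } }

    public static void Main()
    {
        Console.WriteLine(new DigitsHolder("0123456789abcdef").Radix); // 16
        try { new DigitsHolder(null); }
        catch (ArgumentNullException e) { Console.WriteLine(e.ParamName); } // digits
    }
}
```

Postconditions such as Contract.Ensures have no direct if/throw equivalent at the top of a method; they can be dropped or turned into a check just before each return.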
using System;
using System.Collections.Generic;
using System.Numerics;
using Contract = System.Diagnostics.Contracts.Contract;
public enum EndianFormat
{
/// <summary>Least Significant Bit order (lsb)</summary>
/// <remarks>Right-to-Left</remarks>
/// <see cref="BitConverter.IsLittleEndian"/>
Little,
/// <summary>Most Significant Bit order (msb)</summary>
/// <remarks>Left-to-Right</remarks>
Big,
};
/// <summary>Encodes/decodes bytes to/from a string</summary>
/// <remarks>
/// Encoded string is always in big-endian ordering
///
/// <p>Encode and Decode take a <b>includeProceedingZeros</b> parameter which acts as a work-around
/// for an edge case with our BigInteger implementation.
/// MSDN says BigInteger byte arrays are in LSB->MSB ordering. So a byte buffer with zeros at the
/// end will have those zeros ignored in the resulting encoded radix string.
/// If such a loss in precision absolutely cannot occur pass true to <b>includeProceedingZeros</b>
/// and for a tiny bit of extra processing it will handle the padding of zero digits (encoding)
/// or bytes (decoding).</p>
/// <p>Note: doing this for decoding <b>may</b> add an extra byte more than what was originally
/// given to Encode.</p>
/// </remarks>
// Based on the answers from http://codereview.stackexchange.com/questions/14084/base-36-encoding-of-a-byte-array/
public class RadixEncoding
{
const int kByteBitCount = 8;
readonly string kDigits;
readonly double kBitsPerDigit;
readonly BigInteger kRadixBig;
readonly EndianFormat kEndian;
readonly bool kIncludeProceedingZeros;
/// <summary>Numerical base of this encoding</summary>
public int Radix { get { return kDigits.Length; } }
/// <summary>Endian ordering of bytes input to Encode and output by Decode</summary>
public EndianFormat Endian { get { return kEndian; } }
/// <summary>True if we want ending zero bytes to be encoded</summary>
public bool IncludeProceedingZeros { get { return kIncludeProceedingZeros; } }
public override string ToString()
{
return string.Format("Base-{0} {1}", Radix.ToString(), kDigits);
}
/// <summary>Create a radix encoder using the given characters as the digits in the radix</summary>
/// <param name="digits">Digits to use for the radix-encoded string</param>
/// <param name="bytesEndian">Endian ordering of bytes input to Encode and output by Decode</param>
/// <param name="includeProceedingZeros">True if we want ending zero bytes to be encoded</param>
public RadixEncoding(string digits,
EndianFormat bytesEndian = EndianFormat.Little, bool includeProceedingZeros = false)
{
Contract.Requires<ArgumentNullException>(digits != null);
int radix = digits.Length;
kDigits = digits;
kBitsPerDigit = System.Math.Log(radix, 2);
kRadixBig = new BigInteger(radix);
kEndian = bytesEndian;
kIncludeProceedingZeros = includeProceedingZeros;
}
// Number of characters needed for encoding the specified number of bytes
int EncodingCharsCount(int bytesLength)
{
return (int)Math.Ceiling((bytesLength * kByteBitCount) / kBitsPerDigit);
}
// Number of bytes needed for decoding the specified number of characters
int DecodingBytesCount(int charsCount)
{
return (int)Math.Ceiling((charsCount * kBitsPerDigit) / kByteBitCount);
}
/// <summary>Encode a byte array into a radix-encoded string</summary>
/// <param name="bytes">byte array to encode</param>
/// <returns>The bytes encoded into a radix-encoded string</returns>
/// <remarks>If <paramref name="bytes"/> is zero length, returns an empty string</remarks>
public string Encode(byte[] bytes)
{
Contract.Requires<ArgumentNullException>(bytes != null);
Contract.Ensures(Contract.Result<string>() != null);
// Don't really have to do this, our code will build this result (empty string),
// but why not catch the condition before doing work?
if (bytes.Length == 0) return string.Empty;
// if the array ends with zeros, having the capacity set to this will help us know how much
// 'padding' we will need to add
int result_length = EncodingCharsCount(bytes.Length);
// List<> has a(n in-place) Reverse method. StringBuilder doesn't. That's why.
var result = new List<char>(result_length);
// HACK: BigInteger uses the last byte as the 'sign' byte. If that byte's MSB is set,
// we need to pad the input with an extra 0 (ie, make it positive)
if ( (bytes[bytes.Length-1] & 0x80) == 0x80 )
Array.Resize(ref bytes, bytes.Length+1);
var dividend = new BigInteger(bytes);
// IsZero's computation is less complex than evaluating "dividend > 0"
// which invokes BigInteger.CompareTo(BigInteger)
while (!dividend.IsZero)
{
BigInteger remainder;
dividend = BigInteger.DivRem(dividend, kRadixBig, out remainder);
int digit_index = System.Math.Abs((int)remainder);
result.Add(kDigits[digit_index]);
}
if (kIncludeProceedingZeros)
for (int x = result.Count; x < result.Capacity; x++)
result.Add(kDigits[0]); // pad with the character that represents 'zero'
// orientate the characters in big-endian ordering
if (kEndian == EndianFormat.Little)
result.Reverse();
// If we didn't end up adding padding, ToArray will end up returning a TrimExcess'd array,
// so nothing wasted
return new string(result.ToArray());
}
void DecodeImplPadResult(ref byte[] result, int padCount)
{
if (result != null && padCount > 0) // result is null when decoding hit an invalid character
{
int new_length = result.Length + DecodingBytesCount(padCount);
Array.Resize(ref result, new_length); // new bytes will be zero, just the way we want it
}
}
#region Decode (Little Endian)
byte[] DecodeImpl(string chars, int startIndex = 0)
{
var bi = new BigInteger();
for (int x = startIndex; x < chars.Length; x++)
{
int i = kDigits.IndexOf(chars[x]);
if (i < 0) return null; // invalid character
bi *= kRadixBig;
bi += i;
}
return bi.ToByteArray();
}
byte[] DecodeImplWithPadding(string chars)
{
int pad_count = 0;
for (int x = 0; x < chars.Length; x++, pad_count++)
if (chars[x] != kDigits[0]) break;
var result = DecodeImpl(chars, pad_count);
DecodeImplPadResult(ref result, pad_count);
return result;
}
#endregion
#region Decode (Big Endian)
byte[] DecodeImplReversed(string chars, int startIndex = 0)
{
var bi = new BigInteger();
for (int x = (chars.Length-1)-startIndex; x >= 0; x--)
{
int i = kDigits.IndexOf(chars[x]);
if (i < 0) return null; // invalid character
bi *= kRadixBig;
bi += i;
}
return bi.ToByteArray();
}
byte[] DecodeImplReversedWithPadding(string chars)
{
int pad_count = 0;
for (int x = chars.Length - 1; x >= 0; x--, pad_count++)
if (chars[x] != kDigits[0]) break;
var result = DecodeImplReversed(chars, pad_count);
DecodeImplPadResult(ref result, pad_count);
return result;
}
#endregion
/// <summary>Decode a radix-encoded string into a byte array</summary>
/// <param name="radixChars">radix string</param>
/// <returns>The decoded bytes, or null if an invalid character is encountered</returns>
/// <remarks>
/// If <paramref name="radixChars"/> is an empty string, returns a zero length array
///
/// Using <see cref="IncludeProceedingZeros"/> has the potential to return a buffer with an
/// additional zero byte that wasn't in the input. So if a 4 byte buffer was encoded, this could end up
/// returning a 5 byte buffer, with the extra byte being zero.
/// </remarks>
public byte[] Decode(string radixChars)
{
Contract.Requires<ArgumentNullException>(radixChars != null);
if (kEndian == EndianFormat.Big)
return kIncludeProceedingZeros ? DecodeImplReversedWithPadding(radixChars) : DecodeImplReversed(radixChars);
else
return kIncludeProceedingZeros ? DecodeImplWithPadding(radixChars) : DecodeImpl(radixChars);
}
};
Basic unit tests
using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
static bool ArraysCompareN<T>(T[] input, T[] output)
where T : IEquatable<T>
{
if (output.Length < input.Length) return false;
for (int x = 0; x < input.Length; x++)
if(!output[x].Equals(input[x])) return false;
return true;
}
static bool RadixEncodingTest(RadixEncoding encoding, byte[] bytes)
{
string encoded = encoding.Encode(bytes);
byte[] decoded = encoding.Decode(encoded);
return ArraysCompareN(bytes, decoded);
}
[TestMethod]
public void TestRadixEncoding()
{
const string k_base36_digits = "0123456789abcdefghijklmnopqrstuvwxyz";
var base36 = new RadixEncoding(k_base36_digits, EndianFormat.Little, true);
var base36_no_zeros = new RadixEncoding(k_base36_digits, EndianFormat.Little, false);
byte[] ends_with_zero_neg = { 0xFF, 0xFF, 0x00, 0x00 };
byte[] ends_with_zero_pos = { 0xFF, 0x7F, 0x00, 0x00 };
byte[] text = System.Text.Encoding.ASCII.GetBytes("A test 1234");
Assert.IsTrue(RadixEncodingTest(base36, ends_with_zero_neg));
Assert.IsTrue(RadixEncodingTest(base36, ends_with_zero_pos));
Assert.IsTrue(RadixEncodingTest(base36_no_zeros, text));
}
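Interestingly, I was able to port Kornman's technique over to Java and got the expected output up to and including base36. Whereas when running his C# code, compiled with csc from C:\Windows\Microsoft.NET\Framework\v4.0.30319, the output was not as expected.
For example, when trying to base16 encode the MD5 hashBytes obtained for the string "hello world" below using Kornman's RadixEncoding Encode, I could see that within each group of two characters per byte, the characters were in the wrong order.
Rather than 5eb63bbbe01eeed093cb22bb8f5acdc3
I saw something like e56bb3bb0ee1....
This was on Windows 7.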
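One way to see where such output can come from: a BigInteger-based encoder produces digits least-significant first, and the Encode method above only reverses them for EndianFormat.Little. The sketch below (all names invented for the illustration, and not a statement about which settings were actually used) applies that digit loop to the first two bytes of the hash:

```csharp
using System;
using System.Numerics;
using System.Text;

public static class DigitOrderDemo
{
    const string kHexDigits = "0123456789abcdef";

    // Emits base-16 digits least-significant first, reading the bytes as a
    // little-endian unsigned number (as BigInteger does after zero-padding).
    public static string DigitsLeastSignificantFirst(byte[] bytes)
    {
        byte[] padded = new byte[bytes.Length + 1]; // trailing 0 keeps the value non-negative
        Array.Copy(bytes, padded, bytes.Length);
        var value = new BigInteger(padded);
        var sb = new StringBuilder();
        while (!value.IsZero)
        {
            BigInteger remainder;
            value = BigInteger.DivRem(value, 16, out remainder);
            sb.Append(kHexDigits[(int)remainder]);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        byte[] firstTwoHashBytes = { 0x5e, 0xb6 };
        string lsbFirst = DigitsLeastSignificantFirst(firstTwoHashBytes);
        char[] reversed = lsbFirst.ToCharArray();
        Array.Reverse(reversed);
        Console.WriteLine(lsbFirst);             // e56b
        Console.WriteLine(new string(reversed)); // b65e
    }
}
```

Neither ordering equals the per-byte hex string 5eb6, so a radix encoder of this kind is not a drop-in replacement for byte-wise hex formatting. The unreversed form is consistent with the e56bb3bb0ee1.... output described above.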
using System;
using System.Text;

class Program
{
    const string input = "hello world";

    public static void Main(string[] args)
    {
        using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create())
        {
            byte[] inputBytes = System.Text.Encoding.ASCII.GetBytes(input);
            byte[] hashBytes = md5.ComputeHash(inputBytes);
            // Convert the byte array to a hexadecimal string, one byte (two characters) at a time
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < hashBytes.Length; i++)
            {
                // "x2" gives lowercase hex, matching the expected value quoted above
                sb.Append(hashBytes[i].ToString("x2"));
            }
            Console.WriteLine(sb.ToString());
        }
    }
}
The Java code is below for anyone interested. As mentioned above, it only works up to base 36.
import java.math.BigInteger;
private static final char[] BASE16_CHARS = "0123456789abcdef".toCharArray();
private static final BigInteger BIGINT_16 = BigInteger.valueOf(16);
private static final char[] BASE36_CHARS = "0123456789abcdefghijklmnopqrstuvwxyz".toCharArray();
private static final BigInteger BIGINT_36 = BigInteger.valueOf(36);
public static String toBaseX(byte[] bytes, BigInteger base, char[] chars)
{
if (bytes == null) {
return null;
}
final int bitsPerByte = 8;
double bitsPerDigit = Math.log(chars.length) / Math.log(2);
// Number of chars to encode specified bytes
int size = (int) Math.ceil((bytes.length * bitsPerByte) / bitsPerDigit);
StringBuilder sb = new StringBuilder(size);
for (BigInteger value = new BigInteger(bytes); !value.equals(BigInteger.ZERO);) {
BigInteger[] quotientAndRemainder = value.divideAndRemainder(base);
sb.insert(0, chars[Math.abs(quotientAndRemainder[1].intValue())]);
value = quotientAndRemainder[0];
}
return sb.toString();
}