I've got four unsigned 32-bit integers representing an unsigned 128-bit integer, in little endian order:
typedef struct {
unsigned int part[4];
} bigint_t;
I'd like to convert this number into its decimal string representation and output it to a file.
Right now, I'm using a bigint_divmod10
function to divide the number by 10, keeping track of the remainder. I call this function repeatedly, outputting the remainder as a digit, until the number is zero. It's pretty slow. Is this the fastest way to do it? If so, is there a clever way to implement this function that I'm not seeing? I've tried looking at GMP's get_str.c
, but I find it pretty impenetrable.
EDIT: here's the fastest code I was able to come up with for the divmod10 function:
static unsigned uint128_divmod10(uint128 *value)
{
unsigned int a = value->word[3];
unsigned int b = value->word[2];
unsigned int c = value->word[1];
unsigned int d = value->word[0];
unsigned int diva = a / 5;
unsigned int divb = b / 5;
unsigned int divc = c / 5;
unsigned int divd = d / 5;
value->word[3] = diva;
value->word[2] = divb;
value->word[1] = divc;
value->word[0] = divd;
unsigned int moda = a - diva*5;
unsigned int modb = b - divb*5;
unsigned int modc = c - divc*5;
unsigned int modd = d - divd*5;
unsigned int mod = 0;
mod += moda;
unsigned int carryb = mod*858993459;
mod += modb;
if (mod >= 5) {
mod -= 5;
carryb++;
}
unsigned int carryc = mod*858993459;
mod += modc;
if (mod >= 5) {
mod -= 5;
carryc++;
}
unsigned int carryd = mod*858993459;
mod += modd;
if (mod >= 5) {
mod -= 5;
carryd++;
}
uint128_add(value, carryd, 0);
uint128_add(value, carryc, 1);
uint128_add(value, carryb, 2);
if (value->word[0] & 1) {
mod += 5;
}
uint128_shift(value, -1);
return mod;
}
where the add function is defined as:
static void uint128_add(uint128 *value, unsigned int k, unsigned int pos)
{
unsigned int a = value->word[pos];
value->word[pos] += k;
if (value->word[pos] < a) {
// overflow
for (int i=pos+1; i<4; i++) {
value->word[i]++;
if (value->word[i]) {
break;
}
}
}
}
The most immediate speedup will come from inlining the conversion rather than calling functions; it could be as simple as marking
bigint_divmod10()
inline, or using profile-guided optimisation as offered by your compiler.For future reference, instead of implementing a uint128 type, I just used the characters of the string directly. This turned out to be much faster than going from string to uint128 and back.
Lookup table of 8 bits. You can have 4 lookup tables of 256 numbers. First is from 0-256 for LSB bytes, Second table is first table multiplied by 256 and so on.
SO when you need your number sum up numbers from lookup table. When you adding you can add as bunary and go later one pass over each byte to fix owerflows.
Example number 0x12345678 In first lookup table there is under addres (0x78 = 120) so 0x010200 is first number in second table under(0x56=87) is 0x0202000106 (0x56 in dec is 22016) in third table you hou would have 0x03040007080702 and under last lable at 0x12 you have 0x030001090809080808 (this does not fit in 32 bit arithmetic, but that you allredy know)
Then sum up this numbers (as binary bumbers) and go one pass, byte by byte for overflow code in for loop is something like
If we count operations needed for this.
1.(looking in tables and adding) 4 lookup tables. 16 additions (keep in mind that when you do not need to carry about owerflow, becuase they can not ocur)
2. one pass in each step 3 operatins 16 steps to pass.
passimistic upper bound 6*16 = 100 operations.
EDIT:
Here is c++ code, and is 30% faster than naive implementation.
It depends what else you're doing with the numbers. You can trade off a slight loss in space efficiency and a modest loss in efficiency of multiprecision arithmetic in return for very efficient conversion to and from decimal. The key is to do multiprecision arithmetic with a base that is a power of 10 rather than a power of 2.
For example, you might use base 10,000, where you pack one digit into a 16-bit word and you do your arithmetic on digits in 32-bit integers. (If you're on a 64-bit machine you can double that and do base 1,000,000,000.) This kind of code is relatively efficient timewise, although not quite as fast as using the native power of two because you can't take advantage of the carry bit on the hardware. And you can't represent as many integers in the same number of bits. But it's a whiz at converting to and from decimal, because you get to convert the individual digits without any long division.
If you need to represent the full range of numbers from zero to
((1 << 128) - 1)
, you can still do this, but add an extra digit, so your numbers will be bigger.If it turns out you really need the extra space/speed (maybe you're doing a lot of cryptographic 128-bit calculations) then the method of simultanous div/mod by 10 is the fastest method I know. The only other trick is that if small integers are common, you can handle them specially. (That is, if the three most significant 32-bit words are all zero, just use the native division to convert.)
Dave Hanson's C Interfaces and Implementations has a lengthy chapter on multiprecision arithmetic. Dividing a large number by a single digit is a special case that has this efficient implementation:
For full understanding, it really helps to have the book, but the source code is still a lot easier to understand than the GNU source code. And you could easily adapt it to use base 10,000 (it currently uses base 256).
Summary: if your performance bottleneck is conversion to decimal, implement multiprecision arithmetic with a base that is a power of 10. If your machine's native word size is 32 and you are using C code, use 10,000 in a 16-bit word.
If your values are mostly less than
ULLONG_MAX
(18446744073709551615) I'd try to use for themsprintf(buf,"%llu",ullong_val)
. I bet this is rather well optimized in standard library, but parsing of format will take some cycles though.Otherwise I'd create a
bigint_divmod1000000000
(or better name mod10to9) function and use that. It would need 9 times less divides thanbigint_divmod10
.I know this question is old, but I want to contribute as none put a way avoiding the division cycle. This one uses pow2, I haven't tested the benchmark but in theory should be faster than any other, and also could be tweaked in the pow function as well.
Output: 36