I'm trying to convert a hex char
to integer as fast as possible.
This is only one line:
int x = atoi(hex.c_str);
Is there a faster way?
Here, I have tried a more dynamic approach, and it's slightly faster.
int hextoint(char number) {
if (number == '0') {
return 0;
}
if (number == '1') {
return 1;
}
if (number == '2') {
return 2;
}
/*
* 3 through 8
*/
if (number == '9') {
return 9;
}
if (number == 'a') {
return 10;
}
if (number == 'b') {
return 11;
}
if (number == 'c') {
return 12;
}
if (number == 'd') {
return 13;
}
if (number == 'e') {
return 14;
}
if (number == 'f') {
return 15;
}
return -1;
}
This is my favorite hex-to-int code:
It is case-insensitive for letter, i.e will return correct result for "a" and "A".
This question may evidently have different answers on different systems, and in this sense it is ill-posed from the start. For example an i486 has no pipeline and a pentium has no SSE.
Among approaches herein, the answer to this is actually the same or very very very nearly the same on a system with a multi-stage pipeline. Any system without a pipeline will bend towards the lookup table method (LUT), but if memory access is slow the conditional method (CEV), or the bitwise evaluation method (BEV), may profit depending of the speed of xor vs load for the given CPU.
(CEV) decomposes into 2 load effective addresses a comparison and a conditional move from registers which is not prone to mis-prediction. All these commands are pairable in the pentium pipeline. So they actually go in 1-cycle.
The (LUT) decomposes into a mov between registers and mov from a data dependent memory location plus some nops for alignment, and should take the minimum of 1-cycle. As the previous there are only data dependencies.
The (BEV) is a different beast as it actually requires 2 movs + 2 xors + 1 and a conditional mov. These can also be nicely pipelined.
Of course, it is a very rare occasion that it is application critical (maybe Mars Pathfinder is a candidate) to convert just a signle char. Instead one would expect to convert a larger string by actually making a loop and calling that function.
Thus on such a scenario the code that is better vectorizable is the winner. The LUT does not vectorize, and BEV and CEV have better behaviour. In general such micro-optimization does not get you anywhere, write your code and let live (i.e. let the compiler run).
So I have actually built some tests in this sense that are easily reproducible on any system with a c++11 compiler and a random device source,such as any *nix system. If one does not permit vectorization
-O2
CEV/LUT are almost equal, but once-O3
is set the advantage of writing code that is more decomposable shows the difference.Problem: in question is to convert from the set of chars {0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f} to the set of {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}. There are no capital letters under consideration.
The idea is to take advantage of the linearity of the ascii table in segments.
[Simple and easy]: Conditional evaluation -> CEV
[Dirty and complex]: Bitwise evaluation -> BEV
[compile time]: Template conditional evaluation -> TCV
[Lookup table]: Lookup table -> LUT
Among all , the last seems to be the fastest at first look. The second is only at compile time and constant expression.
An exemplary result for str-sizes 16-12384 can be found below ( lower is better )
The average time (100 runs ) along is show. The size of the bubble is the normal error.
The script for running the tests is available.
Tests have been performed for the
conditional
CEV, thebitwise
BEV and thelookup table
LUT on a set of randomly generated strings. The tests are fairly simple and from:Test source code
these are verifiable:
g++ -std=c++11 -O3 -march=native dectohex.cpp -o d2h
taskset -c 0 d2h
As a side note I have seen in practice version 3 to be much faster with older c++98 compilers.
The results with the unordered_map have been removed since they have been just too slow to compare, or at best case may be as fast as the LUT solution.
Results from my personal pc on strings of size 12384/256 and for 100 strings:
The results from a system with GCC 4.9.3 compiled to the metal without the system being loaded on strings of size 256/12384 and for 100 strings
[HOW TO READ THE RESULTS]
The mean is shown on microseconds required to compute the string of the given size.
The total time for each test is given. The mean is computed as the sum/total of timings to compute one string ( no other code in that region but could be vectorized, and that's ok) . The error is the standard deviation of the timings.
The mean tell us what we should expect on average , and the error how much the timings have been following normality. In this case this is a fair measure of error only when it is small ( otherwise we should use something appropriate for positive distributions ). One usually should expect high errors in case of cache miss , processor scheduling, and many other factors.
The code has a unique macro defined to run the tests, permits to define compile time variables to set up the tests, and prints complete information such as:
For example to run the test with a
2sec
pause on a str of size256
for a total of10000
different strings, output timings indouble precision
, and count innanoseconds
the following command compiles and runs the test.Proposed Solutions that Render Faster than the OP's if-else:
Provided that your input strings are always hex numbers you could define a lookup table as an
unordered_map
:constexpr
literal (C++14)Or if you want something more faster instead of an
unordered_map
you could use the new C++14 facilities with user literal types and define your table as a literal type at compile time:Live Demo
Benchmarks:
I ran benchmarks with the code written by Nikos Athanasiou that was posted recently on isocpp.org as a proposed method for C++ micro-benchmarking.
The algorithms that were compared are:
1. OP's original
if-else
:2. Compact if-else, proposed by Christophe:
3. Corrected ternary operator version that handles also capital letter inputs, proposed by g24l:
4. Lookup Table (
unordered_map
):where
table
is the unordered map shown previously.5. Lookup Table (user
constexpr
literal):Where table is user defined literal as shown above.
Experimental Settings
I defined a function that transforms an input hex string to an integer:
I also defined a function that populates a vector of strings with random hex strings:
I created vectors populated with 50000, 100000, 150000, 200000 and 250000 random hex strings respectively. Then for each algorithm I run 100 experiments and averaged the time results.
Compiler was GCC version 5.2 with optimization option
-O3
.Results:
Discussion
From the results we can conclude that for these experimental settings the proposed table method out-performs all the other methods. The if-else method is by far the worst where as the
unordered_map
although it wins the if-else method it is significantly slower than the other proposed methods.CODE
Edit:
Results for method proposed by stgatilov, with bitwise operations:
Edit:
I also tested the original code from g24l against the table method:
Note that this method doesn't handle capital letters
A
,B
,C
,D
,E
andF
.Results:
Still the table method renders faster.
Well, that's a weird question. Converting a single hex char into an integer is so fast, that it is really hard to tell which is faster, because all methods are almost likely faster than the code you write in order to use them =)
I'll assume the following things:
eax
.Solution
Now here are several methods for solving the problem: the first one based on lookup, two based on ternary operator, the last one based on bit operations:
Here are the corresponding assembly listings generated (only the relevant parts are shown):
Analysis
I'll try to estimate number of cycles taken by each approach in throughput sense, which is essentially the time spent per one input number when a lot of numbers are processed at once. Consider a Sandy Bridge architecture as an example.
The
hextoint_lut
function consists of a single memory load, which takes 1 uop on port 2 or 3. Both of these ports are dedicated to memory loads, and they also have address calculation inside, which are capable of doingrax+rcx
with no additional cost. There are two such ports, each can do one uop in a cycle. So supposedly this version would take 0.5 clock time. If we have to load input number from memory, that would require one more memory load per value, so the total cost would be 1 clock.The
hextoint_cond
version has 4 instructions, butcmov
is broken into two separate uops. So there are 5 uops in total, each can be processed on any of the three arithmetic ports 0, 1, and 5. So supposedly it would take 5/3 cycles time. Note that memory load ports are free, so the time would not increase even if you have to load the input value from memory.The
hextoint_cond2
version has 5 instructions. But in a tight loop the constants can be preloaded to registers, so there would be only comparison, cmov and subtraction. They are 4 uops in total, giving 4/3 cycles per value (even with memory read).The
hextoint_bit
version is a solution which is guaranteed to have no branches and lookup, which is handy if you do not want to check always whether your compiler generated a cmov instruction. The first mov is free, since the constant can be preloaded in a tight loop. The rest are 5 arithmetic instructions, which a 5 uops in ports 0, 1, 5. So it should take 5/3 cycles (even with a memory read).Benchmark
I have performed a benchmark for the C++ functions described above. In a benchmark, 64 KB of random data is generated, then each function is run many times on this data. All the results are added to checksum to ensure that compiler does not remove the code. Manual 8x unrolling is used. I have tested on a Ivy Bridge 3.4 Ghz core, which is very similar to Sandy Bridge. Each string of output contains: name of function, total time taken by benchmark, number of cycles per input value, sum of all outputs.
Benchmark code
Clearly, LUT approach takes one cycle per value (as predicted). The other approaches normally take from 2.2 to 2.6 cycles per value. In case of GCC,
hextoint_cond2
is slow because compiler uses cmp+sbb+and magic instead of desired cmov instructions. Also note that by default GCC vectorizes most of the approaches (last paragraph), which provides expectedly faster results than the unvectorizable LUT approach. Note that manual vectorization would give significantly greater boost.Discussion
Note that
hextoint_cond
with ordinary conditional jump instead ofcmov
would have a branch. Assuming random input hex digits, it will be mispredicted almost always. So performance would be terrible, I think.I have analysed throughput performance. But if we have to process tons of input values, then we should definitely vectorize the conversion to get better speed.
hextoint_cond
can be vectorized with SSE in a pretty straightforward way. It allows to process 16 bytes to 16 bytes by using only 4 instructions, taking about 2 cycles I suppose.Note that in order to see any performance difference, you must ensure that all the input values fit into cache (L1 is the best case). If you read the input data from main memory, even
std::atoi
is equally fast with the considered methods =)Also, you should unroll your main loop 4x or even 8x for maximum performance (to remove looping overhead). As you might have already noticed, the speed of both methods highly depends on which operations are surrounding the code. E.g. adding a memory load doubles time taken by the first approach, but does not influence the other approaches.
P.S. Most likely you don't really need to optimize this.
In case you (or someone else) are actually converting an array of values, I made an AVX2 SIMD encoder and decoder that benchmarks ~12x faster than the fastest scalar implementation: https://github.com/zbjornson/fast-hex
The 16 hex values conveniently fit (twice) into a YMM register, so you can use
PSHUFB
to do a parallelized lookup. Decoding is a bit harder and based on bit-wise ops.Assuming that your function is called for a valid hex digit, it will cost in average at least 8 comparison operations (and perhap's 7 jumps). Pretty expensive.
An alternative would be the more compact:
Yet another alternative would be to use a lookup table (trade space against speed), that you would initialise only once, and then access directly:
If you want to convert more than one digit at a time, you could have a look at this question)
Edit: benchmark
Following Ike's observations, Ive written a small informal benchmark (available online here), that you can run on your favourite compiler.
Conclusions: