What's the fastest way to convert hex to integ

2楼-- · 2019-02-02 00:50

This is my favorite hex-to-int code:

inline int htoi(int x) {
    return 9 * (x >> 6) + (x & 017);
}

It is case-insensitive for letter, i.e will return correct result for "a" and "A".

0人赞添加讨论(0) 举报

我欲成王，谁敢阻挡

3楼-- · 2019-02-02 00:51

This question may evidently have different answers on different systems, and in this sense it is ill-posed from the start. For example an i486 has no pipeline and a pentium has no SSE.

The correct question to ask would be: " what is the fastest way to convert a single char hex to dec in X system , e.g. i686 " .

Among approaches herein, the answer to this is actually the same or very very very nearly the same on a system with a multi-stage pipeline. Any system without a pipeline will bend towards the lookup table method (LUT), but if memory access is slow the conditional method (CEV), or the bitwise evaluation method (BEV), may profit depending of the speed of xor vs load for the given CPU.

(CEV) decomposes into 2 load effective addresses a comparison and a conditional move from registers which is not prone to mis-prediction. All these commands are pairable in the pentium pipeline. So they actually go in 1-cycle.

  8d 57 d0                lea    -0x30(%rdi),%edx
  83 ff 39                cmp    $0x39,%edi
  8d 47 a9                lea    -0x57(%rdi),%eax
  0f 4e c2                cmovle %edx,%eax

The (LUT) decomposes into a mov between registers and mov from a data dependent memory location plus some nops for alignment, and should take the minimum of 1-cycle. As the previous there are only data dependencies.

  48 63 ff                movslq %edi,%rdi
  8b 04 bd 00 1d 40 00    mov    0x401d00(,%rdi,4),%eax

The (BEV) is a different beast as it actually requires 2 movs + 2 xors + 1 and a conditional mov. These can also be nicely pipelined.

  89 fa                   mov    %edi,%edx
  89 f8                   mov    %edi,%eax
  83 f2 57                xor    $0x57,%edx
  83 f0 30                xor    $0x30,%eax
  83 e7 40                and    $0x40,%edi
  0f 45 c2                cmovne %edx,%eax

Of course, it is a very rare occasion that it is application critical (maybe Mars Pathfinder is a candidate) to convert just a signle char. Instead one would expect to convert a larger string by actually making a loop and calling that function.

Thus on such a scenario the code that is better vectorizable is the winner. The LUT does not vectorize, and BEV and CEV have better behaviour. In general such micro-optimization does not get you anywhere, write your code and let live (i.e. let the compiler run).

So I have actually built some tests in this sense that are easily reproducible on any system with a c++11 compiler and a random device source,such as any *nix system. If one does not permit vectorization -O2 CEV/LUT are almost equal, but once -O3 is set the advantage of writing code that is more decomposable shows the difference.

To summarised, if you have an old compiler use LUT, if your system is low-end or old consider BEV, otherwise the compiler will outsmart you and you should use CEV.

Problem: in question is to convert from the set of chars {0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f} to the set of {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}. There are no capital letters under consideration.

The idea is to take advantage of the linearity of the ascii table in segments.

[Simple and easy]: Conditional evaluation -> CEV

int decfromhex(int const x)
{
return x<58?x-48:x-87;
}

[Dirty and complex]: Bitwise evaluation -> BEV

int decfromhex(int const x)
{
return 9*(x&16)+( x & 0xf  );
}

[compile time]: Template conditional evaluation -> TCV

template<char n> int decfromhex()
{
  int constexpr x = n;
  return x<58 ? x-48 : x -87;
}

[Lookup table]: Lookup table -> LUT

int decfromhex(char n)
{
static int constexpr x[255]={
           // fill everything with invalid, e.g. -1 except places\
           // 48-57 and 97-102 where you place 0..15 
           };
return x[n];
}

Among all , the last seems to be the fastest at first look. The second is only at compile time and constant expression.

[RESULT] (Please verify): *BEV is the fastest among all and handles lower and upper case letter , but only marginal to CEV which does not handle capital letters. LUT becomes slower than both CEV and BEV as the size of the string increases.

An exemplary result for str-sizes 16-12384 can be found below ( lower is better )

The average time (100 runs ) along is show. The size of the bubble is the normal error.

The script for running the tests is available.

Tests have been performed for the conditional CEV, the bitwise BEV and the lookup table LUT on a set of randomly generated strings. The tests are fairly simple and from:

Test source code

these are verifiable:

A local copy of the input string is placed in a local buffer every time.
A local copy of the results is kept that is then copied to the heap for every string test
Duration only for the time operating on the string is extracted
uniform approach, there is no complicated machinery and wrap/around code that is fitted for other cases.
no sampling the entire timing sequence is used
CPU preheating is executed
Sleep between tests occur to permit marshal the code such that one test does not take advantage of the previous test.
Compilation is performed with g++ -std=c++11 -O3 -march=native dectohex.cpp -o d2h
Launch with taskset -c 0 d2h
No thread dependencies or multithreading
The results are actually being used, to avoid any type of loop optimization

As a side note I have seen in practice version 3 to be much faster with older c++98 compilers.

[BOTTOM LINE]: use CEV without fear, unless you know your variables at compile time where you could use version TCV. The LUT should only be used after significant performance per use case evaluation, and probably with older compilers. Another case is when your set is larger i.e. {0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f,A,B,C,D,E,F} . This can be achieved as well. Finally if you are permormance hungry use BEV .

The results with the unordered_map have been removed since they have been just too slow to compare, or at best case may be as fast as the LUT solution.

Results from my personal pc on strings of size 12384/256 and for 100 strings:

 g++ -DS=2 -DSTR_SIZE=256 -DSET_SIZE=100 -DUNITS=nanoseconds -O3 -std=c++11 -march=native dectohex.cpp -o d2h && taskset -c 0 ./d2h
sign: -2709
-------------------------------------------------------------------
(CEV) Total: 185568 nanoseconds - mean: 323.98 nanoseconds  error: 88.2699 nanoseconds
(BEV) Total: 185568 nanoseconds - mean: 337.68 nanoseconds  error: 113.784 nanoseconds
(LUT) Total: 229612 nanoseconds - mean: 667.89 nanoseconds  error: 441.824 nanoseconds
-------------------------------------------------------------------


g++ -DS=2 -DSTR_SIZE=12384 -DSET_SIZE=100 -DUNITS=nanoseconds -O3 -std=c++11 -march=native hextodec.cpp -o d2h && taskset -c 0 ./h2d

-------------------------------------------------------------------
(CEV) Total: 5539902 nanoseconds - mean: 6229.1 nanoseconds error: 1052.45 nanoseconds
(BEV) Total: 5539902 nanoseconds - mean: 5911.64 nanoseconds    error: 1547.27 nanoseconds
(LUT) Total: 6346209 nanoseconds - mean: 14384.6 nanoseconds    error: 1795.71 nanoseconds
-------------------------------------------------------------------
Precision: 1 ns

The results from a system with GCC 4.9.3 compiled to the metal without the system being loaded on strings of size 256/12384 and for 100 strings

g++ -DS=2 -DSTR_SIZE=256 -DSET_SIZE=100 -DUNITS=nanoseconds -O3 -std=c++11 -march=native dectohex.cpp -o d2h && taskset -c 0 ./d2h
sign: -2882
-------------------------------------------------------------------
(CEV) Total: 237449 nanoseconds - mean: 444.17 nanoseconds  error: 117.337 nanoseconds
(BEV) Total: 237449 nanoseconds - mean: 413.59 nanoseconds  error: 109.973 nanoseconds
(LUT) Total: 262469 nanoseconds - mean: 731.61 nanoseconds  error: 11.7507 nanoseconds
-------------------------------------------------------------------
Precision: 1 ns


g++ -DS=2 -DSTR_SIZE=12384 -DSET_SIZE=100 -DUNITS=nanoseconds -O3 -std=c++11 -march=native dectohex.cpp -o d2h && taskset -c 0 ./d2h
sign: -137532
-------------------------------------------------------------------
(CEV) Total: 6834796 nanoseconds - mean: 9138.93 nanoseconds    error: 144.134 nanoseconds
(BEV) Total: 6834796 nanoseconds - mean: 8588.37 nanoseconds    error: 4479.47 nanoseconds
(LUT) Total: 8395700 nanoseconds - mean: 24171.1 nanoseconds    error: 1600.46 nanoseconds
-------------------------------------------------------------------
Precision: 1 ns

[HOW TO READ THE RESULTS]

The mean is shown on microseconds required to compute the string of the given size.

The total time for each test is given. The mean is computed as the sum/total of timings to compute one string ( no other code in that region but could be vectorized, and that's ok) . The error is the standard deviation of the timings.

The mean tell us what we should expect on average , and the error how much the timings have been following normality. In this case this is a fair measure of error only when it is small ( otherwise we should use something appropriate for positive distributions ). One usually should expect high errors in case of cache miss , processor scheduling, and many other factors.

The code has a unique macro defined to run the tests, permits to define compile time variables to set up the tests, and prints complete information such as:

g++ -DS=2 -DSTR_SIZE=64 -DSET_SIZE=1000 -DUNITS=nanoseconds -O3 -std=c++11 -march=native dectohex.cpp -o d2h && taskset -c 0 ./d2h
sign: -6935
-------------------------------------------------------------------
(CEV) Total: 947378 nanoseconds - mean: 300.871 nanoseconds error: 442.644 nanoseconds
(BEV) Total: 947378 nanoseconds - mean: 277.866 nanoseconds error: 43.7235 nanoseconds
(LUT) Total: 1040307 nanoseconds - mean: 375.877 nanoseconds    error: 14.5706 nanoseconds
-------------------------------------------------------------------

For example to run the test with a 2sec pause on a str of size 256 for a total of 10000 different strings, output timings in double precision, and count in nanoseconds the following command compiles and runs the test.

g++ -DS=2 -DSTR_SIZE=256 -DSET_SIZE=10000 -DUTYPE=double -DUNITS=nanoseconds -O3 -std=c++11 -march=native dectohex.cpp -o d2h && taskset -c 0 ./d2h

0人赞添加讨论(0) 举报

倾城　Initia

4楼-- · 2019-02-02 00:55

Proposed Solutions that Render Faster than the OP's if-else:

Unordered Map Lookup Table

Provided that your input strings are always hex numbers you could define a lookup table as an unordered_map:

std::unordered_map<char, int> table {
{'0', 0}, {'1', 1}, {'2', 2},
{'3', 3}, {'4', 4}, {'5', 5},
{'6', 6}, {'7', 7}, {'8', 8},
{'9', 9}, {'a', 10}, {'A', 10},
{'b', 11}, {'B', 11}, {'c', 12},
{'C', 12}, {'d', 13}, {'D', 13},
{'e', 14}, {'E', 14}, {'f', 15},
{'F', 15}, {'x', 0}, {'X', 0}};

int hextoint(char number) {
  return table[(std::size_t)number];
}

Lookup Table as user constexpr literal (C++14)

Or if you want something more faster instead of an unordered_map you could use the new C++14 facilities with user literal types and define your table as a literal type at compile time:

struct Table {
  long long tab[128];
  constexpr Table() : tab {} {
    tab['1'] = 1;
    tab['2'] = 2;
    tab['3'] = 3;
    tab['4'] = 4;
    tab['5'] = 5;
    tab['6'] = 6;
    tab['7'] = 7;
    tab['8'] = 8;
    tab['9'] = 9;
    tab['a'] = 10;
    tab['A'] = 10;
    tab['b'] = 11;
    tab['B'] = 11;
    tab['c'] = 12;
    tab['C'] = 12;
    tab['d'] = 13;
    tab['D'] = 13;
    tab['e'] = 14;
    tab['E'] = 14;
    tab['f'] = 15;
    tab['F'] = 15;
  }
  constexpr long long operator[](char const idx) const { return tab[(std::size_t) idx]; } 
} constexpr table;

constexpr int hextoint(char number) {
  return table[(std::size_t)number];
}

Live Demo

Benchmarks:

I ran benchmarks with the code written by Nikos Athanasiou that was posted recently on isocpp.org as a proposed method for C++ micro-benchmarking.

The algorithms that were compared are:

1. OP's original if-else:

long long hextoint3(char number) {
  if(number == '0') return 0;
  if(number == '1') return 1;
  if(number == '2') return 2;
  if(number == '3') return 3;
  if(number == '4') return 4;
  if(number == '5') return 5;
  if(number == '6') return 6;
  if(number == '7') return 7;
  if(number == '8') return 8;
  if(number == '9') return 9;
  if(number == 'a' || number == 'A') return 10;
  if(number == 'b' || number == 'B') return 11;
  if(number == 'c' || number == 'C') return 12;
  if(number == 'd' || number == 'D') return 13;
  if(number == 'e' || number == 'E') return 14;
  if(number == 'f' || number == 'F') return 15;
  return 0;
}

2. Compact if-else, proposed by Christophe:

long long hextoint(char number) {
  if (number >= '0' && number <= '9') return number - '0';
  else if (number >= 'a' && number <= 'f') return number - 'a' + 0x0a;
  else if (number >= 'A' && number <= 'F') return number - 'A' + 0X0a;
  else return 0;
}

3. Corrected ternary operator version that handles also capital letter inputs, proposed by g24l:

long long hextoint(char in) {
  int const x = in;
  return (x <= 57)? x - 48 : (x <= 70)? (x - 65) + 0x0a : (x - 97) + 0x0a;
}

4. Lookup Table (unordered_map):

long long hextoint(char number) {
  return table[(std::size_t)number];
}

where table is the unordered map shown previously.

5. Lookup Table (user constexpr literal):

long long hextoint(char number) {
  return table[(std::size_t)number];
}

Where table is user defined literal as shown above.

Experimental Settings

I defined a function that transforms an input hex string to an integer:

long long hexstrtoint(std::string const &str, long long(*f)(char)) {
  long long ret = 0;
  for(int j(1), i(str.size() - 1); i >= 0; --i, j *= 16) {
    ret += (j * f(str[i]));
  }
  return ret;
}

I also defined a function that populates a vector of strings with random hex strings:

std::vector<std::string>
populate_vec(int const N) {
  random_device rd;
  mt19937 eng{ rd() };
  uniform_int_distribution<long long> distr(0, std::numeric_limits<long long>::max() - 1);
  std::vector<std::string> out(N);
  for(int i(0); i < N; ++i) {
    out[i] = int_to_hex(distr(eng));
  }
  return out;
}

I created vectors populated with 50000, 100000, 150000, 200000 and 250000 random hex strings respectively. Then for each algorithm I run 100 experiments and averaged the time results.

Compiler was GCC version 5.2 with optimization option -O3.

Results:

Discussion

From the results we can conclude that for these experimental settings the proposed table method out-performs all the other methods. The if-else method is by far the worst where as the unordered_map although it wins the if-else method it is significantly slower than the other proposed methods.

CODE

Edit:

Results for method proposed by stgatilov, with bitwise operations:

long long hextoint(char x) {
    int b = uint8_t(x);
    int maskLetter = (('9' - b) >> 31);
    int maskSmall = (('Z' - b) >> 31);
    int offset = '0' + (maskLetter & int('A' - '0' - 10)) + (maskSmall & int('a' - 'A'));
    return b - offset;
}

Edit:

I also tested the original code from g24l against the table method:

long long hextoint(char in) {
  long long const x = in;
  return x < 58? x - 48 : x - 87;
}

Note that this method doesn't handle capital letters A, B, C, D, E and F.

Results:

Still the table method renders faster.

0人赞添加讨论(0) 举报

Solution

Now here are several methods for solving the problem: the first one based on lookup, two based on ternary operator, the last one based on bit operations:

int hextoint_lut(char x) {
    static char lut[256] = {???};
    return lut[uint8_t(x)];
}

int hextoint_cond(char x) {
    uint32_t dig = x - '0';
    uint32_t alp = dig + ('0' - 'a' + 10);
    return dig <= 9U ? dig : alp;
}
int hextoint_cond2(char x) {
    uint32_t offset = (uint8_t(x) <= uint8_t('9') ? '0' : 'a' - 10);
    return uint8_t(x) - offset;
}

int hextoint_bit(char x) {
    int b = uint8_t(x);
    int mask = (('9' - b) >> 31);
    int offset = '0' + (mask & int('a' - '0' - 10));
    return b - offset;
}

Here are the corresponding assembly listings generated (only the relevant parts are shown):

;hextoint_lut;
movsx   eax, BYTE PTR [rax+rcx]   ; just load the byte =)

;hextoint_cond;
sub edx, 48                       ; subtract '0'
cmp edx, 9                        ; compare to '9'
lea eax, DWORD PTR [rdx-39]       ; add ('0' - 'a' + 10)
cmovbe  eax, edx                  ; choose between two cases in branchless way

;hextoint_cond2;                  ; (modified slightly)
mov eax, 48                       
mov edx, 87                       ; set two offsets to registers
cmp ecx, 57                       ; compare with '9'
cmovbe  edx, eax                  ; choose one offset
sub ecx, edx                      ; subtract the offset

;hextoint_bit;
mov ecx, 57                       ; load '9'
sub ecx, eax                      ; get '9' - x
sar ecx, 31                       ; convert to mask if negative
and ecx, 39                       ; set to 39 (for x > '9')
sub eax, ecx                      ; subtract 39 or 0
sub eax, 48                       ; subtract '0'

Analysis

I'll try to estimate number of cycles taken by each approach in throughput sense, which is essentially the time spent per one input number when a lot of numbers are processed at once. Consider a Sandy Bridge architecture as an example.

The hextoint_lut function consists of a single memory load, which takes 1 uop on port 2 or 3. Both of these ports are dedicated to memory loads, and they also have address calculation inside, which are capable of doing rax+rcx with no additional cost. There are two such ports, each can do one uop in a cycle. So supposedly this version would take 0.5 clock time. If we have to load input number from memory, that would require one more memory load per value, so the total cost would be 1 clock.

The hextoint_cond version has 4 instructions, but cmov is broken into two separate uops. So there are 5 uops in total, each can be processed on any of the three arithmetic ports 0, 1, and 5. So supposedly it would take 5/3 cycles time. Note that memory load ports are free, so the time would not increase even if you have to load the input value from memory.

The hextoint_cond2 version has 5 instructions. But in a tight loop the constants can be preloaded to registers, so there would be only comparison, cmov and subtraction. They are 4 uops in total, giving 4/3 cycles per value (even with memory read).

The hextoint_bit version is a solution which is guaranteed to have no branches and lookup, which is handy if you do not want to check always whether your compiler generated a cmov instruction. The first mov is free, since the constant can be preloaded in a tight loop. The rest are 5 arithmetic instructions, which a 5 uops in ports 0, 1, 5. So it should take 5/3 cycles (even with a memory read).

Benchmark

I have performed a benchmark for the C++ functions described above. In a benchmark, 64 KB of random data is generated, then each function is run many times on this data. All the results are added to checksum to ensure that compiler does not remove the code. Manual 8x unrolling is used. I have tested on a Ivy Bridge 3.4 Ghz core, which is very similar to Sandy Bridge. Each string of output contains: name of function, total time taken by benchmark, number of cycles per input value, sum of all outputs.

Benchmark code

MSVC2013 x64 /O2:
hextoint_lut: 0.741 sec, 1.2 cycles  (check: -1022918656)
hextoint_cond: 1.925 sec, 3.0 cycles  (check: -1022918656)
hextoint_cond2: 1.660 sec, 2.6 cycles  (check: -1022918656)
hextoint_bit: 1.400 sec, 2.2 cycles  (check: -1022918656)

GCC 4.8.3 x64 -O3 -fno-tree-vectorize
hextoint_lut: 0.702 sec, 1.1 cycles  (check: -1114112000)
hextoint_cond: 1.513 sec, 2.4 cycles  (check: -1114112000)
hextoint_cond2: 2.543 sec, 4.0 cycles  (check: -1114112000)
hextoint_bit: 1.544 sec, 2.4 cycles  (check: -1114112000)

GCC 4.8.3 x64 -O3
hextoint_lut: 0.702 sec, 1.1 cycles  (check: -1114112000)
hextoint_cond: 0.717 sec, 1.1 cycles  (check: -1114112000)
hextoint_cond2: 0.468 sec, 0.7 cycles  (check: -1114112000)
hextoint_bit: 0.577 sec, 0.9 cycles  (check: -1114112000)

Clearly, LUT approach takes one cycle per value (as predicted). The other approaches normally take from 2.2 to 2.6 cycles per value. In case of GCC, hextoint_cond2 is slow because compiler uses cmp+sbb+and magic instead of desired cmov instructions. Also note that by default GCC vectorizes most of the approaches (last paragraph), which provides expectedly faster results than the unvectorizable LUT approach. Note that manual vectorization would give significantly greater boost.

Discussion

Note that hextoint_cond with ordinary conditional jump instead of cmov would have a branch. Assuming random input hex digits, it will be mispredicted almost always. So performance would be terrible, I think.

I have analysed throughput performance. But if we have to process tons of input values, then we should definitely vectorize the conversion to get better speed. hextoint_cond can be vectorized with SSE in a pretty straightforward way. It allows to process 16 bytes to 16 bytes by using only 4 instructions, taking about 2 cycles I suppose.

Note that in order to see any performance difference, you must ensure that all the input values fit into cache (L1 is the best case). If you read the input data from main memory, even std::atoi is equally fast with the considered methods =)

Also, you should unroll your main loop 4x or even 8x for maximum performance (to remove looping overhead). As you might have already noticed, the speed of both methods highly depends on which operations are surrounding the code. E.g. adding a memory load doubles time taken by the first approach, but does not influence the other approaches.

P.S. Most likely you don't really need to optimize this.

0人赞添加讨论(0) 举报