I'd like to try writing an atoi implementation using SIMD instructions, to be included in RapidJSON (a C++ JSON reader/writer library). It currently has some SSE2 and SSE4.2 optimizations in other places.
If it's a speed gain, multiple atoi
results can be done in parallel. The strings are originally coming from a buffer of JSON data, so a multi-atoi function will have to do any required swizzling.
The algorithm I came up with is the following:
- I can initialize a vector of length N in the following fashion: [10^N..10^1]
- I convert each character in the buffer to an integer and place them in another vector.
- I take each number in the significant digits vector and multiply it by the matching number in the numbers vector and sum the results.
I'm targeting x86 and x86-64 architectures.
I know that AVX2 supports three operand Fused Multiply-Add so I'll be able to perform Sum = Number * Significant Digit + Sum.
That's where I got so far.
Is my algorithm correct? Is there a better way?
Is there a reference implementation for atoi using any SIMD instructions set?
The algorithm and its implementation is finished now. It's complete and (moderately) tested (Updated for less constant memory usage and tolerating plus-char).
The properties of this code are as follows:
int
anduint
, fromMIN_INT=-2147483648
toMAX_INT=2147483647
and fromMIN_UINT=0
toMAX_UINT=4294967295
'-'
char indicates a negative number (as sensible), a leading'+'
char is ignored0 = -0
About this implementation:
as
) using.intel_syntax noprefix
at the beginningThe approach of the algorithm:
And the sample implementation in GNU Assembler with intel syntax:
The result of Intel-IACA Throughput Analysis for Haswell 32-bit:
The result of Intel-IACA Latency Analysis for Haswell 32-bit:
An alternative handling suggested in comments by Peter Cordes is replacing the last two
add+xor
instructions by animul
. This concentration of OpCodes is likely to be superior. Unfortunately IACA doesn't support that instruction and throws a! - instruction not supported, was not accounted in Analysis
comment. Nevertheless, although I like the reduction of OpCodes and reduction from (2uops, 2c latency) to (1 uop, 3c latency - "worse latency, but still one m-op on AMD"), I prefer to leave it to the implementer which way to choose. I haven't checked if the following code is sufficient for parsing any number. It is just mentioned for completeness and code modifications in other parts may be necessary (especially handling positive numbers).The alternative may be replacing the last two lines with:
I would approach this problem like this:
'0'
from each character.9
.1000
,100
,10
,1
) and multiply with that.9
was found in step four, goto step 2.You could simplify the algorithm by zeroing out all entries beginning with the wrong one in step 5 instead of shifting and then dividing by an appropriate power of ten in the end.
Please keep in mind that this algorithm reads past the end of the string and is thus not a drop-in replacement for
atoi
.