Can anyone beat the performance of my integer to std::string code, linked below?
There are already several questions that explain how to convert an integer into a std::string
in C++, such as this one, but none of the solutions provided are efficient.
Here is compile-ready code for some common methods to compete against:
- The "C++ way", using stringstream: http://ideone.com/jh3Sa
- sprintf, which SO-ers usually recommend to the performance conscious: http://ideone.com/82kwR
Contrary to popular belief, boost::lexical_cast
has its own implementation (white paper) and does not use stringstream
and numeric insertion operators. I'd really like to see its performance compared, because this other question suggests that it's miserable.
And my own contribution, which is competitive on desktop computers, and demonstrates an approach that runs at full speed on embedded systems as well, unlike algorithms dependent on integer modulo:
- Ben's algorithms: http://ideone.com/SsEUW
If you want to use that code, I'll make it available under a simplified BSD license (commercial use allowed, attribution required). Just ask.
Finally, the function ltoa
is non-standard but widely available.
- ltoa version, for anyone who has a compiler that provides it (ideone doesn't): http://ideone.com/T5Wim
I'll post my performance measurements as an answer shortly.
Rules for algorithms
- Provide code for a conversion of at least 32-bit signed and unsigned integers into decimal.
- Produce output as a
std::string
. - No tricks that are incompatible with threading and signals (for example, static buffers).
- You may assume an ASCII character set.
- Make sure to test your code on
INT_MIN
on a two's complement machine where the absolute value is not representable. - Ideally, the output should be character-for-character identical with the canonical C++ version using
stringstream
, http://ideone.com/jh3Sa, but anything that is clearly understandable as the correct number is ok, too. - NEW: Although you can use whatever compiler and optimizer options (except completely disabled) you want for the comparison, the code needs to also compile and give correct results under at least VC++ 2010 and g++.
Hoped-for Discussion
Besides better algorithms, I'd also like to get some benchmarks on several different platforms and compilers (let's use MB/s throughput as our standard unit of measure). I believe that the code for my algorithm (I know the sprintf
benchmark takes some shortcuts -- now fixed) is well-defined behavior by the standard, at least under the ASCII assumption, but if you see any undefined behavior or inputs for which the output is invalid, please point that out.
Conclusions:
Different algorithms perform for g++ and VC2010, likely due to the different implementations of std::string
on each. VC2010 clearly does a better job with NRVO, getting rid of return-by-value helped only on gcc.
Code was found that outperforms sprintf
by an order of magnitude. ostringstream
falls behind by a factor of 50 and more.
The winner of the challenge is user434507 who produces code that runs 350% of the speed of my own on gcc. Further entries are closed due to the whims of the SO community.
The current (final?) speed champions are:
- For gcc: user434507, at 8 times faster than
sprintf
: http://ideone.com/0uhhX - For Visual C++: Timo, at 15 times faster than
sprintf
: http://ideone.com/VpKO3
Modification to user434507's solution. Modified to use character array instead of C++ string. Runs a bit faster. Also moved the check for 0 lower in the code...as this never happens for my particular case. Move it back if it's more common for your case.
Here's my little attempt of this fun puzzle.
Instead of using lookup tables, I wanted the compiler to figure it all out. In this case in particular - if you read Hackers' Delight, you see how divide and modulo work -- which makes it very possible to optimize that using SSE/AVX instructions.
Performance benchmark
As for speed, my benchmark here tells me it's 1,5 times faster than the work of Timo (on my Intel Haswell it runs on approximately 1 GB/s).
Things you could consider a cheat
As for the not-making-a-std-string cheat that I use -- of course I took that into consideration for my benchmark of Timo's method as well.
I do use an intrinsic: BSR. If you like, you can also use DeBruijn tables instead - which is one of the things I wrote quite a bit about on my 'fastest 2log' post. Of course, this does have a performance penalty (*well... if you're doing a lot of itoa operations you can actually make a faster BSR but I guess that's not fair...).
The way it works
First thing to do is figure out how much memory we need. This is basically a 10log, which can be implemented in a number of smart ways. See the frequently quoted "Bit Twiddling Hacks" for details.
Next thing to do is to execute the numeric output. I use template recursion for this, so the compiler will figure it out.
I use 'modulo' and 'div' right next to each other. If you read Hacker's Delight, you will notice that the two are closely related, so if you have one answer, you probably have the other as well. I figured the compiler can figure out the details... :-)
The code
Getting the number of digits using a (modified) log10:
Getting yourself the string:
This will blow up on systems that disallow unaligned memory accesses (in which case, the first unaligned assignment via
*(short*)
would cause a segfault), but should work very nicely otherwise.One important thing to do is to minimize the use of
std::string
. (Ironic, I know.) In Visual Studio, for example, most calls to methods of std::string are not inlined, even if you specify /Ob2 in compiler options. So even something as trivial as a call tostd::string::clear()
, which you might expect to be very fast, can take 100 clockticks when linking CRT as a static library, and as much as 300 clockticks when linking as a DLL.For the same reason, returning by reference is better because it avoids an assignment, a constructor and a destructor.
I can't test under VS, but this seems to be faster than your code for g++, about 10%. It could probably be tuned, the decision values chosen are guesses. int only, sorry.
Benchmark data for the code provided in the question:
On ideone (gcc 4.3.4):
Core i7, Windows 7 64-bit, 8 GB RAM, Visual C++ 2010 32-bit:
cl /Ox /EHsc
Core i7, Windows 7 64-bit, 8 GB RAM, Visual C++ 2010 64-bit:
cl /Ox /EHsc
Core i7, Windows 7 64-bit, 8 GB RAM, cygwin gcc 4.3.4:
g++ -O3
edit: I was gonna add my own answer, but the question was was closed so I'm adding it here. :) I wrote my own algorithm and managed to get a decent improvement over Ben's code, though I only tested it in MSVC 2010. I also made a benchmark of all the implementations presented so far, using the same testing setup that was in Ben's original code. -- Timo
Intel Q9450, Win XP 32bit, MSVC 2010
cl /O2 /EHsc
-
Ah, awesome challenge by the way... I've had a lot of fun with this.
I have two algorithms to submit (code is at the bottom if you feel like skipping to it). In my comparisons I require that the function return a string and that it can handle int and unsigned int. Comparing things that don't construct a string to those that do doesn't really make sense.
The first one is a fun implementation that doesn't use any precomputed lookup tables or explicit division/modulo. This one is competitive with the others with gcc and with all but Timo's on msvc (for a good reason that I explain below). The second algorithm is my actual submission for highest performance. In my tests it beats all the others on both gcc and msvc.
I think I know why some of the results on MSVC are very good. std::string has two relevant constructors
std::string(char* str, size_t n)
and
std::string(ForwardIterator b, ForwardIterator e)
gcc does the same thing for both of them... that is it uses the second to implement the first. The first constructor can be implemented significantly more efficiently than that and MSVC does so. The side benefit of this is that in some cases (like my fast code and Timo's code) the string constructor can be inlined. In fact, just switching between these constructors in MSVC is almost a 2x difference for my code.
My performance testing results:
Code Sources:
- Voigt
- Timo
- ergosys
- user434507
- user-voigt-timo
- hopman-fun
- hopman-fast
gcc 4.4.5 -O2 on Ubuntu 10.10 64-bit, Core i5
MSVC 2010 64-bit /Ox on Windows 7 64-bit, Core i5
Here are some results and a testing/timing framework on ideone
http://ideone.com/XZRqp
Note that ideone is a 32-bit environment. Both of my algorithms suffer from that, but hopman_fast is at least still competetive.
Note that for those the two or so that don't construct a string I added the following function template:
Now for my code...first the fun one:
And then the fast one: