I'm trying to convert some code from Python to C++ in an effort to gain a little bit of speed and sharpen my rusty C++ skills. Yesterday I was shocked when a naive implementation of reading lines from stdin was much faster in Python than C++ (see this). Today, I finally figured out how to split a string in C++ with merging delimiters (similar semantics to python's split()), and am now experiencing deja vu! My C++ code takes much longer to do the work (though not an order of magnitude more, as was the case for yesterday's lesson).
Python Code:
#!/usr/bin/env python
from __future__ import print_function
import time
import sys
count = 0
start_time = time.time()
dummy = None
for line in sys.stdin:
dummy = line.split()
count += 1
delta_sec = int(time.time() - start_time)
print("Python: Saw {0} lines in {1} seconds. ".format(count, delta_sec), end='')
if delta_sec > 0:
lps = int(count/delta_sec)
print(" Crunch Speed: {0}".format(lps))
else:
print('')
C++ Code:
#include <iostream>
#include <string>
#include <sstream>
#include <time.h>
#include <vector>
using namespace std;
void split1(vector<string> &tokens, const string &str,
const string &delimiters = " ") {
// Skip delimiters at beginning
string::size_type lastPos = str.find_first_not_of(delimiters, 0);
// Find first non-delimiter
string::size_type pos = str.find_first_of(delimiters, lastPos);
while (string::npos != pos || string::npos != lastPos) {
// Found a token, add it to the vector
tokens.push_back(str.substr(lastPos, pos - lastPos));
// Skip delimiters
lastPos = str.find_first_not_of(delimiters, pos);
// Find next non-delimiter
pos = str.find_first_of(delimiters, lastPos);
}
}
void split2(vector<string> &tokens, const string &str, char delim=' ') {
stringstream ss(str); //convert string to stream
string item;
while(getline(ss, item, delim)) {
tokens.push_back(item); //add token to vector
}
}
int main() {
string input_line;
vector<string> spline;
long count = 0;
int sec, lps;
time_t start = time(NULL);
cin.sync_with_stdio(false); //disable synchronous IO
while(cin) {
getline(cin, input_line);
spline.clear(); //empty the vector for the next line to parse
//I'm trying one of the two implementations, per compilation, obviously:
// split1(spline, input_line);
split2(spline, input_line);
count++;
};
count--; //subtract for final over-read
sec = (int) time(NULL) - start;
cerr << "C++ : Saw " << count << " lines in " << sec << " seconds." ;
if (sec > 0) {
lps = count / sec;
cerr << " Crunch speed: " << lps << endl;
} else
cerr << endl;
return 0;
//compiled with: g++ -Wall -O3 -o split1 split_1.cpp
Note that I tried two different split implementations. One (split1) uses string methods to search for tokens and is able to merge multiple tokens as well as handle numerous tokens (it comes from here). The second (split2) uses getline to read the string as a stream, doesn't merge delimiters, and only supports a single delimeter character (that one was posted by several StackOverflow users in answers to string splitting questions).
I ran this multiple times in various orders. My test machine is a Macbook Pro (2011, 8GB, Quad Core), not that it matters much. I'm testing with a 20M line text file with three space-separated columns that each look similar to this: "foo.bar 127.0.0.1 home.foo.bar"
Results:
$ /usr/bin/time cat test_lines_double | ./split.py
15.61 real 0.01 user 0.38 sys
Python: Saw 20000000 lines in 15 seconds. Crunch Speed: 1333333
$ /usr/bin/time cat test_lines_double | ./split1
23.50 real 0.01 user 0.46 sys
C++ : Saw 20000000 lines in 23 seconds. Crunch speed: 869565
$ /usr/bin/time cat test_lines_double | ./split2
44.69 real 0.02 user 0.62 sys
C++ : Saw 20000000 lines in 45 seconds. Crunch speed: 444444
What am I doing wrong? Is there a better way to do string splitting in C++ that does not rely on external libraries (i.e. no boost), supports merging sequences of delimiters (like python's split), is thread safe (so no strtok), and whose performance is at least on par with python?
Edit 1 / Partial Solution?:
I tried making it a more fair comparison by having python reset the dummy list and append to it each time, as C++ does. This still isn't exactly what the C++ code is doing, but it's a bit closer. Basically, the loop is now:
for line in sys.stdin:
dummy = []
dummy += line.split()
count += 1
The performance of python is now about the same as the split1 C++ implementation.
/usr/bin/time cat test_lines_double | ./split5.py
22.61 real 0.01 user 0.40 sys
Python: Saw 20000000 lines in 22 seconds. Crunch Speed: 909090
I still am surprised that, even if Python is so optimized for string processing (as Matt Joiner suggested), that these C++ implementations would not be faster. If anyone has ideas about how to do this in a more optimal way using C++, please share your code. (I think my next step will be trying to implement this in pure C, although I'm not going to trade off programmer productivity to re-implement my overall project in C, so this will just be an experiment for string splitting speed.)
Thanks to all for your help.
Final Edit/Solution:
Please see Alf's accepted answer. Since python deals with strings strictly by reference and STL strings are often copied, performance is better with vanilla python implementations. For comparison, I compiled and ran my data through Alf's code, and here is the performance on the same machine as all the other runs, essentially identical to the naive python implementation (though faster than the python implementation that resets/appends the list, as shown in the above edit):
$ /usr/bin/time cat test_lines_double | ./split6
15.09 real 0.01 user 0.45 sys
C++ : Saw 20000000 lines in 15 seconds. Crunch speed: 1333333
My only small remaining gripe is regarding the amount of code necessary to get C++ to perform in this case.
One of the lessons here from this issue and yesterday's stdin line reading issue (linked above) are that one should always benchmark instead of making naive assumptions about languages' relative "default" performance. I appreciate the education.
Thanks again to all for your suggestions!
You're making the mistaken assumption that your chosen C++ implementation is necessarily faster than Python's. String handling in Python is highly optimized. See this question for more: Why do std::string operations perform poorly?
As a guess, Python strings are reference counted immutable strings, so that no strings are copied around in the Python code, while C++
std::string
is a mutable value type, and is copied at the smallest opportunity.If the goal is fast splitting, then one would use constant time substring operations, which means only referring to parts of the original string, as in Python (and Java, and C#…).
The C++
std::string
class has one redeeming feature, though: it is standard, so that it can be used to pass strings safely and portably around where efficiency is not a main consideration. But enough chat. Code -- and on my machine this is of course faster than Python, since Python's string handling is implemented in C which is a subset of C++ (he he):Disclaimer: I hope there aren't any bugs. I haven't tested the functionality, but only checked the speed. But I think, even if there is a bug or two, correcting that won't significantly affect the speed.
I think the following code is better, using some C++17 and C++14 features:
The choice of container:
std::vector
.Assuming the initial size of allocated internal array is 1, and the ultimate size is N, you will allocate and deallocate for log2(N) times, and you will copy the (2 ^ (log2(N) + 1) - 1) = (2N - 1) times. As pointed out in Is the poor performance of std::vector due to not calling realloc a logarithmic number of times?, this can have a poor performance when the size of vector is unpredictable and could be very large. But, if you can estimate the size of it, this'll be less a problem.
std::list
.For every push_back, the time it consumed is a constant, but it'll probably takes more time than std::vector on individual push_back. Using a per-thread memory pool and a custom allocator can ease this problem.
std::forward_list
.Same as std::list, but occupy less memory per element. Require a wrapper class to work due to the lack of API push_back.
std::array
.If you can know the limit of growth, then you can use std::array. Of cause, you can't use it directly, since it doesn't have the API push_back. But you can define a wrapper, and I think it's the fastest way here and can save some memory if your estimation is quite accurate.
std::deque
.This option allows you to trade memory for performance. There'll be no (2 ^ (N + 1) - 1) times copy of element, just N times allocation, and no deallocation. Also, you'll has constant random access time, and the ability to add new elements at both ends.
According to std::deque-cppreference
or you can use combo of these:
std::vector< std::array<T, 2 ^ M> >
This is similar to std::deque, the difference is just this container doesn't support to add element at the front. But it is still faster in performance, due to the fact that it won't copy the underlying std::array for (2 ^ (N + 1) - 1) times, it'll just copy the pointer array for (2 ^ (N - M + 1) - 1) times, and allocating new array only when the current is full and doesn't need to deallocate anything. By the way, you can get constant random access time.
std::list< std::array<T, ...> >
Greatly ease the pressure of memory framentation. It will only allocate new array when the current is full, and does not need to copy anything. You will still have to pay the price for an additional pointer conpared to combo 1.
std::forward_list< std::array<T, ...> >
Same as 2, but cost the same memory as combo 1.
I suspect that this is related to buffering on sys.stdin in Python, but no buffering in the C++ implementation.
See this post for details on how to change the buffer size, then try the comparison again: Setting smaller buffer size for sys.stdin?
I suspect that this is because of the way
std::vector
gets resized during the process of a push_back() function call. If you try usingstd::list
orstd::vector::reserve()
to reserve enough space for the sentences, you should get a much better performance. Or you could use a combination of both like below for split1():EDIT: The other obvious thing I see is that Python variable
dummy
gets assigned each time but not modified. So it's not a fair comparison against C++. You should try modifying your Python code to bedummy = []
to initialize it and then dodummy += line.split()
. Can you report the runtime after this?EDIT2: To make it even more fair can you modify the while loop in C++ code to be:
I'm not providing any better solutions (at least performance-wise), but some additional data that could be interesting.
Using
strtok_r
(reentrant variant ofstrtok
):Additionally using character strings for parameters, and
fgets
for input:And, in some cases, where destroying the input string is acceptable:
The timings for these are as follows (including my results for the other variants from the question and the accepted answer):
As we can see, the solution from the accepted answer is still fastest.
For anyone who would want to do further tests, I also put up a Github repo with all the programs from the question, the accepted answer, this answer, and additionally a Makefile and a script to generate test data: https://github.com/tobbez/string-splitting.