I'm a beginner C++ programmer working on a small C++ project for which I have to process a number of relatively large XML files and remove the XML tags out of them. I've succeeded doing so using the C++0x regex library. However, I'm running into some performance issues. Just reading in the files and executing the regex_replace function over its contents takes around 6 seconds on my PC. I can bring this down to 2 by adding some compiler optimization flags. Using Python, however, I can get it done it less than 100 milliseconds. Obviously, I'm doing something very inefficient in my C++ code. What can I do to speed this up a bit?
My C++ code:
std::regex xml_tags_regex("<[^>]*>");
for (std::vector<std::string>::iterator it = _files.begin(); it !=
_files.end(); it++) {
std::ifstream file(*it);
file.seekg(0, std::ios::end);
size_t size = file.tellg();
std::string buffer(size, ' ');
file.seekg(0);
file.read(&buffer[0], size);
buffer = regex_replace(buffer, xml_tags_regex, "");
file.close();
}
My Python code:
regex = re.compile('<[^>]*>')
for filename in filenames:
with open(filename) as f:
content = f.read()
content = regex.sub('', content)
P.S. I don't really care about processing the complete file at once. I just found that reading a file line by line, word by word or character by character slowed it down considerably.
C++11 regex replace is indeed rather slow, as of yet, at least. PCRE performs much better in terms of pattern matching speed, however, PCRECPP provides very limited means for regular expression based substitution, citing the man page:
This is really poor, compared to Perl's 's' command. That is why I wrote my own C++ wrapper around PCRE that handles regular expression based substitution in a fashion that is close to Perl's 's', and also supports 16- and 32-bit character strings: PCRSCPP:
I wrote a simple speed test code, which stores a 10x copy of file "move.sh" and tests regex performance on resulting string:
(note that
std
andpcrscpp
regular expressions do the same here, the trailing newline in expression forpcrscpp
is due tostd::regex_replace
not stripping newlines despitestd::regex_constants::format_no_copy
)and launched it on a large (20.9 MB) shell move script:
As you can see, PCRSCPP is more than 2x faster. And I expect this gap to grow with pattern complexity increase, since PCRE deals with complicated patterns much better. I originally wrote a wrapper for myself, but I think it can be useful for others too.
Regards, Alex
I don't think you're doing anything "wrong" per-say, the C++ regex library just isn't as fast as the python one (for this use case at this time at least). This isn't too surprising, keeping in mind the python regex code is all C/C++ under the hood as well, and has been tuned over the years to be pretty fast as that's a fairly important feature in python, so naturally it is going to be pretty fast.
But there are other options in C++ for getting things faster if you need. I've used PCRE ( http://pcre.org/ ) in the past with great results, though I'm sure there are other good ones out there these days as well.
For this case in particular however, you can also achieve what you're after without regexes, which in my quick tests yielded a 10x performance improvement. For example, the following code scans your input string copying everything to a new buffer, when it hits a
<
it starts skipping over characters until it sees the closing>