C++ regex segfault on long sequences

Posted 2019-02-04 19:56

Question:

I was parsing a Stack Overflow dump and came across this seemingly innocent question with a small, almost invisible detail: it has 22311 spaces at the end of its text.

I'm using std::regex (somehow it works better for me than boost::regex) to replace every run of consecutive whitespace with a single space, like this:

std::regex space_regex("\\s+", std::regex::optimize);
...
std::regex_replace(out, in, in + strlen(in), space_regex, " ");

A SIGSEGV showed up, so I began to investigate.

Test code:

#include <regex>
...
std::regex r("\\s+",  std::regex::optimize);
const char* bomb2 = "Small text\n\nwith several\n\nlines.";
std::string test(bomb2);
for (auto i = 0; i < N; ++i) test += " ";

std::string out = std::regex_replace(test.c_str(), r, " ");
std::cout << out << std::endl;

With gcc 5.3.0 and

$ g++ -O3 -std=c++14 regex-test.cpp -o regex-test.out

the maximum N before SIGSEGV shows up is 21818 (for this particular string), and with

$ g++ -O0 -std=c++14 regex-test.cpp -o regex-test.out

it's 12180.

'OK, let's try clang, it's trending and aims to replace gcc.' Never have I been so wrong. With -O0, clang (v. 3.7.1) crashes on 9696 spaces, fewer than gcc but not by much, yet with -O3 and even with -O2 it crashes on ZERO spaces.

The crash dump shows a huge stack trace (35k frames) of recursive calls to

std::__detail::_Executor<char*, std::allocator<std::__cxx11::sub_match<char*> >, std::__cxx11::regex_traits<char>, true>::_M_dfs

Question 1: Is this a bug? If so, should I report it?

Question 2: Is there a smart way to work around the problem (other than increasing the system stack size, trying other regex libraries, or writing my own function to replace whitespace)?


Amendment: a bug report has been created for libstdc++.

Answer 1:

Is this a bug? If so, should I report it?

Yes, this is a bug.

cout << '"' << regex_replace("Small text\n\nwith several\n\nlines." + string(22311, ' '), regex("\\s+", regex::optimize), " ") << '"' << endl;
  • Runs fine with libc++: http://coliru.stacked-crooked.com/a/f9ee5438745a5b22
  • Runs fine with Visual Studio 2015, you can test by copying and running the code at: http://webcompiler.cloudapp.net/
  • Fails with libstdc++: http://coliru.stacked-crooked.com/a/3f4bbe5c46b6b627

But this is only a bug in libstdc++, so feel free to report it here: https://gcc.gnu.org/bugzilla/buglist.cgi?product=gcc&component=libstdc%2B%2B&resolution=---

Is there a smart way to overcome the problem?

If you're asking for a new regex that works: I've tried a handful of different versions, and all of them fail on libstdc++. So if you want to solve this with a regex, you'll need to compile against libc++.

But honestly, if you're using a regex to strip duplicate whitespace: "Now you have two problems."

A better solution could use adjacent_find, which runs fine with libstdc++ as well:

const auto func = [](const char a, const char b){ return isspace(a) && isspace(b); };

for(auto it = adjacent_find(begin(test), end(test), func); it != end(test); it = adjacent_find(it, end(test), func)) {
    *it = ' ';
    it = test.erase(next(it), find_if_not(next(it), end(test), [](const auto& i) { return isspace(i); }));
}

This will produce the same result your regex would:

"Small text with several lines. "

But if you're going for simplicity, you could also use unique:

test.resize(distance(test.begin(), unique(test.begin(), test.end(), [](const auto& a, const auto& b) { return isspace(a) && isspace(b); })));

This keeps the first whitespace character of each run, so the newlines survive:

"Small text
with several
lines. "
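
If the regex-equivalent output is wanted while keeping the simplicity of unique, one possible sketch is to normalize every whitespace character to a space first and only then collapse the runs. The helper name collapse_whitespace is hypothetical and not part of the answer above; the cast to unsigned char is there because calling std::isspace with a negative char value is undefined behaviour.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not from the original answer): collapses every run of
// whitespace into a single ' ' without touching std::regex.
std::string collapse_whitespace(std::string s)
{
    // Taking the argument as unsigned char avoids undefined behaviour in
    // std::isspace for negative char values.
    const auto is_space = [](unsigned char c) { return std::isspace(c) != 0; };

    // Step 1: turn every whitespace character ('\n', '\t', ...) into ' '.
    std::replace_if(s.begin(), s.end(), is_space, ' ');

    // Step 2: keep only the first space of each run of spaces.
    s.erase(std::unique(s.begin(), s.end(),
                        [](char a, char b) { return a == ' ' && b == ' '; }),
            s.end());
    return s;
}
```

Calling collapse_whitespace(test) on the string from the question should give "Small text with several lines. ". Both passes are plain loops over the string, so no recursion is involved regardless of how long the run of spaces is.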



Answer 2:

Question 2 (smart way to overcome the problem)

Not really smart, but you can repeat a bounded replace until the string stops shrinking.

An example:

#include <regex>
#include <iostream>
#include <string>

int main()
 {
   constexpr int N = 22311;

   //std::regex r("\\s+");
   std::regex r("\\s{2,100}");

   const char* bomb2 = "Small text\n\nwith several\n\nlines.";

   std::string test(bomb2);

   for (auto i = 0; i < N; ++i)
      test += " ";

   std::string out = test;

   std::size_t  preSize;

   do
    {
      preSize = out.size();

      out = std::regex_replace(out, r, " ");
    }
   while ( out.size() < preSize );

   std::cout << '\"' << out << '\"' << std::endl;

   return 0;
 }
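
One thing to keep in mind with the loop above: \s{2,100} never matches a single whitespace character, so a lone '\n' or '\t' stays as it is, unlike with \s+. A possible variation, sketched below under the assumption that \s{1,100} is handled the same way by the library, also turns single whitespace characters into a space. The helper name collapse_whitespace_bounded is hypothetical, and whether a bounded quantifier avoids the deep recursion on a given libstdc++ version should be checked rather than assumed.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: repeats a bounded replace until the string stops
// shrinking, as in the example above, but with a lower bound of 1.
std::string collapse_whitespace_bounded(std::string text)
{
    // Assumption: at most 100 whitespace characters are consumed per match.
    static const std::regex r("\\s{1,100}");

    std::size_t pre_size;
    do {
        pre_size = text.size();
        // Each pass replaces every chunk of 1..100 whitespace characters
        // with one space, so long runs shrink by up to 100x per pass.
        text = std::regex_replace(text, r, " ");
    } while (text.size() < pre_size);

    return text;
}
```

The loop stops on the first pass that does not shorten the string, which is also the pass where every remaining whitespace run is already a single space.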