why c++11 regex (libc++ implementation) is so slow

2020-06-03 06:25发布

I compared with Linux C regex library,

#include <iostream>
#include <chrono>
#include <regex.h>

int main()
{
    const int count = 100000;

    regex_t exp;
    int rv = regcomp(&exp, R"_(([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?)_", REG_EXTENDED);
    if (rv != 0) {
            std::cout << "regcomp failed with " << rv << std::endl;
    }

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < count; i++)
    {
            regmatch_t match;
            const char *sz = "http://www.abc.com";

            if (regexec(&exp, sz, 1, &match, 0) == 0) {
    //              std::cout << sz << " matches characters " << match.rm_so << " - " << match.rm_eo << std::endl;
            } else {
    //              std::cout << sz << " does not match" << std::endl;
            }
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

    std::cout << elapsed.count() << std::endl;

    return 0;
}

The result is roughly 60-70 milliseconds on my testing machine.

Then I used libc++'s library,

#include <iostream>
#include <chrono>
#include <regex>


int main()
{
        const int count = 100000;

        std::regex rgx(R"_(([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/[^ ]*)?)_", std::regex_constants::extended);
        auto start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < count; i++)
        {
                std::cmatch match;
                const char sz[] = "http://www.abc.com";

                if (regex_search(sz, match, rgx)) {
                } else {
                }
        }
        auto end = std::chrono::high_resolution_clock::now();
        auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

        std::cout << "regex_search: " << elapsed.count() << std::endl;


        start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < count; i++)
        {
                const char sz[] = "http://www.abc.com";

                if (regex_match(sz, rgx)) {
                } else {
                }
        }
        end = std::chrono::high_resolution_clock::now();
        elapsed = std::chrono::duration_cast<std::chrono::microseconds>(end - start);

        std::cout << "regex_match: " << elapsed.count() << std::endl;

        return 0;
}

The result is roughly 2 seconds for both regex_search & regex_match. This is about 30 times slower than C's regex.h library.

Is there anything wrong with my comparison? Is C++'s regex library not suitable for high performance case?

I can understand it's slow because there's no optimization in c++'s regex library yet, but 30 times slower is just too much.

Thanks.


Hi all,

Thanks for answering.

Sorry for my mistake I was using [] for C too but later I changed, and forgot to change C++ code.

I made two changes,

  1. I moved const char sz[] out of the loop for both C & C++.
  2. I compiled it with -O2 (I wasn't using any optimization before), C library's implementation is still around 60 milliseconds, but libc++'s regex now gives a number says, 1 second for regex_search, and 150 milliseconds for regex_match.

This is still a bit slow, but not as much as the original comparison.

标签: c++ c regex
3条回答
狗以群分
2楼-- · 2020-06-03 06:51

sz and match are loop invariant, you should move them to before (in both cases for sz).

In the second case sz is an initialised array instead of a pointer to a constant literal - that is an unfair and unnecessary difference. That said, if you move the declaration to before the loop as suggested, that should make little or no difference.

Although regex_search() is overloaded for const const char* that may internally cause construction of a std::string, to avoid that possibility you should test it with:

const std::string sz( "http://www.abc.com" ) ;

(again before the loop).

So test:

std::cmatch match;
const char* = "http://www.abc.com";
for (int i = 0; i < count; i++)
{
    if (regex_search(sz, match, rgx)) {
    } else {
    }
}

and

std::cmatch match;
const std::string sz( "http://www.abc.com" )
for (int i = 0; i < count; i++)
{
    if (regex_search(sz, match, rgx)) {
    } else {
    }
}
查看更多
混吃等死
3楼-- · 2020-06-03 06:55

The following two lines do not do the same thing!

const char  sz1[] = "http://www.abc.com";
const char* sz2   = "http://www.abc.com";

That's already enough to make it an unfair test.

查看更多
老娘就宠你
4楼-- · 2020-06-03 06:58

If you take a look at http://llvm.org/svn/llvm-project/libcxx/trunk/include/regex you'll see this implementation of regex_match is layered atop regex_search, and all overloads extract sub-expression match positions even if only into local temporaries that are thrown away. regex_search uses a vector of __state objects that have .resize() called on them so are presumably vectors too - all heap allocations and unnecessary when the subexpression matches aren't wanted, but would need to be tracked to support \1 etc in perl-style extensions to regular expressions: the old regcomp/regexec C functions didn't provide those extended features never have to do this extra work. Of course it would be nice if the clang implementation checked the regular expression's need for tracking matches during compilation and called leaner, faster functions to match when possible, but I guess they're just starting with support for the general case.

查看更多
登录 后发表回答