boost string matching DFA

2020-03-04 08:33发布

问题:

Given a string I have to test whether that ends with a known set of suffixes. Now as the the of suffixes are not very small and every word in the document has to be checked against that list of known suffixes. Every character in the word and suffix is char32_t. As a naive iterative matching will be expensive. Though most of the suffixes are not sub suffix or prefix of another suffix, most of them are constructed with a small set of characters. Most of the checks will be a miss rather than being a hit.

So I want to build a DFA of the suffixes to minimize the cost of miss. I can manually parse the unicode codepoints and create a DFA using boost-graph. But is there any existing library that will build that for me ?

Will a huge regex containing all the suffixes be less expensive than the DFA, because regex search also builds a DFA for matching in the similar way ? But I want to know which suffix was matched when there is a hit. In regex case I'll need to perform another linear search to get that (I cannot label the vertices of the regex's internal DFA). Also I need unicode regex. Just putting all the suffixes with a | will be as expensive as a linear search I guess. I think I need to check the common characters and create the regex accordingly with lookahed and lookbacks. Isn't it the same difficulty that I need to face to construct the DFA manually ?

I am using utf-32 for random access. However it is not a problem to switch to utf-8 if I can solve it easily with that. I will reverse the string and the pattern right to left.

回答1:

Have you considered Spirit? Of course you didn't specify how you detect suffixes in context (do you require them at the end, do you require some grammar preceding it etc.) but you could do something like this:

    x3::symbols<Char> sym;
    sym += "foo", "bar", "qux";

It builds a Trie, which is pretty effective. It can parse any kind of input iterator (including streams if you are so inclined). Just add a bit of magic constrain for contextual requirements, like end-of-input:

bool has_suffix(string_view sv) {
    return parse(sv.cbegin(), sv.cend(), x3::seek[suffix >> x3::eoi]);
}

If you even wish to return the text value of the string, simply do this:

string_view get_suffix(string_view sv) {
    boost::iterator_range<string_view::const_iterator> output;
    parse(sv.cbegin(), sv.cend(), x3::seek[x3::raw[suffix >> x3::eoi]], output);
    return {output.begin(), output.size()};
}

Spirit leaves you a lot of freedom to surround with smarts, dynamically add/remove symbols, e.g. use no_case with the Trie etc.

Full Demo

Using X3 (c++14)

Live On Coliru

#include <boost/spirit/home/x3.hpp>
#include <string_view>
#include <cstdint>

namespace Demo {
    using Char = char32_t;
    using string_view = std::basic_string_view<Char>;

    namespace x3 = boost::spirit::x3;

    static auto const suffix = [] {
        x3::symbols<Char> sym;
        sym += "foo", "bar", "qux";

        return sym; // x3::no_case[sym];
    }();

    bool has_suffix(string_view sv) {
        return parse(sv.cbegin(), sv.cend(), x3::seek[suffix >> x3::eoi]);
    }

    string_view get_suffix(string_view sv) {
        boost::iterator_range<string_view::const_iterator> output;
        parse(sv.cbegin(), sv.cend(), x3::seek[x3::raw[suffix >> x3::eoi]], output);
        return {output.begin(), output.size()};
    }
}

#include <iostream>
#include <iomanip>

int main() {
    using namespace Demo;

    auto widen = [](string_view sv) { return std::wstring(sv.begin(), sv.end()); };
    std::wcout << std::boolalpha;

    for (string_view testcase : { U"nope", U"lolbar you betqux" }) {
        std::wcout 
            << widen(testcase) 
            << L" -> " << has_suffix(testcase)
            << L" (" << widen(get_suffix(testcase))
            << L")\n";
    }
}

Prints

nope -> false ()
lolbar you betqux -> true (qux)

Spirit Qi Version

A literal port: Live On Coliru

A C++11 only version: Live On Coliru

And a C++03 version for the really retro programming experience: Live On Coliru