Efficient matching of text messages against thousa

2019-07-14 06:55发布

I am solving a problem where I have text message to match with thousands of regular expressions of the form

<some string> {0 or 300 chars} <some string> {0 or 300 chars}

e.g.

"on"[ \t\r]*(.){0,300}"."[ \t\r]*(.){0,300}"from"

or a real example can be

"Dear"[ \t\r]*"Customer,"[ \t\r]*"Your"[ \t\r]*"package"[ \t\r]*(.){0,80}[ \t\r]*"is"[ \t\r]*"out"[ \t\r]*"for"[ \t\r]*"delivery"[ \t\r]*"via"(.){0,80}[ \t\r]*"Courier,"[ \t\r]*(.){0,80}[ \t\r]*"on"(.){0,80}"."[ \t\r]*"Delivery"[ \t\r]*"will"[ \t\r]*"be"[ \t\r]*"attempted"[ \t\r]*"in"[ \t\r]*"5"[ \t\r]*"wkg"[ \t\r]*"days."

To start with I used Java's regular expression engine. I matched input string with one regular expression at a time. This process was too slow. I found that Java's regex engine compiles regular expression into NFA (Non Deterministic Finite Automata) which can get slow due to catastrophic backtracking. Therefore I thought of converting regular expressions into DFA (Deterministic Finite Automata) using flex-lexer so as to compile hundreds of regex into one DFA and thus I would get match result in O(n), n is the length of input string. But due to fixed repetition count in regex, flex is taking forever to compile see here.

May be I am doing it all wrong. Is there any better way to do this? One way that I could think of is to convert fixed repetition count to indefinite repetition (the star operator) as follows

"on"[ \t\r]*(.)*"."[ \t\r]*(.)*"from"

This regex compiles without problem and it takes only milliseconds. If input string matches with this rule I know that constant strings from rule ("on", "." and "from") are present in input string. Now iff flex supported named group of regular expressions, I could simply count number of characters in those groups and verify but flex is not meant for this purpose.

Question -- Is there any way to solve this problem efficiently?

1条回答
在下西门庆
2楼-- · 2019-07-14 07:23

The problem is every other part of the regex is (.){0,80}:

"Dear"[ \t\r]*"Customer,"[ \t\r]*"Your"[ \t\r]*"package"[ \t\r]*
(.){0,80}
[ \t\r]*"is"[ \t\r]*"out"[ \t\r]*"for"[ \t\r]*"delivery"[ \t\r]*"via"
(.){0,80}
[ \t\r]*"Courier,"[ \t\r]*
(.){0,80}
[ \t\r]*"on"
(.){0,80}"."
[ \t\r]*"Delivery"[ \t\r]*"will"[ \t\r]*"be"[ \t\r]*"attempted"[ \t\r]*"in"[ \t\r]*"5"[ \t\r]*"wkg"[ \t\r]*"days."

The regex is slow when the next word doesn't appear exactly 80 characters after the last one. It needs to backtrack to see if 79 would work. Or 78. Or 77... It's not all or nothing, (as you seem to believe it is; 80 or 0 characters would be .{80}?).

The engine is simply more optimized to deal with .* because it is more common

Depending on where things are in the string, you may be able to get better performance with the lazy .{0,80}?. But this isn't a great solution.

I think the answer here is to use more than one regex.

You can find the index the match happened at, and then compare compare it to see if it came before or after where the phrase before first matched.

It gets more complicated things can match in multiple areas AND you need for each match to be no more than x characters away. In that case, you would just need to collect multiple matches and change the math a little.

查看更多
登录 后发表回答