I am solving a problem where I have text message to match with thousands of regular expressions of the form
<some string> {0 or 300 chars} <some string> {0 or 300 chars}
e.g.
"on"[ \t\r]*(.){0,300}"."[ \t\r]*(.){0,300}"from"
or a real example can be
"Dear"[ \t\r]*"Customer,"[ \t\r]*"Your"[ \t\r]*"package"[ \t\r]*(.){0,80}[ \t\r]*"is"[ \t\r]*"out"[ \t\r]*"for"[ \t\r]*"delivery"[ \t\r]*"via"(.){0,80}[ \t\r]*"Courier,"[ \t\r]*(.){0,80}[ \t\r]*"on"(.){0,80}"."[ \t\r]*"Delivery"[ \t\r]*"will"[ \t\r]*"be"[ \t\r]*"attempted"[ \t\r]*"in"[ \t\r]*"5"[ \t\r]*"wkg"[ \t\r]*"days."
To start with I used Java's regular expression engine. I matched input string with one regular expression at a time. This process was too slow. I found that Java's regex engine compiles regular expression into NFA (Non Deterministic Finite Automata) which can get slow due to catastrophic backtracking. Therefore I thought of converting regular expressions into DFA (Deterministic Finite Automata) using flex-lexer so as to compile hundreds of regex into one DFA and thus I would get match result in O(n), n is the length of input string. But due to fixed repetition count in regex, flex is taking forever to compile see here.
May be I am doing it all wrong. Is there any better way to do this? One way that I could think of is to convert fixed repetition count to indefinite repetition (the star operator) as follows
"on"[ \t\r]*(.)*"."[ \t\r]*(.)*"from"
This regex compiles without problem and it takes only milliseconds. If input string matches with this rule I know that constant strings from rule ("on", "." and "from")
are present in input string. Now iff flex supported named group of regular expressions, I could simply count number of characters in those groups and verify but flex is not meant for this purpose.
Question -- Is there any way to solve this problem efficiently?