How to implement regular expression NFA with chara

2020-02-11 08:39发布

问题:

When you read such posts as Regex: NFA and Thompson's algorithm everything looks rather straightforward until you realize in real life you need not only direct characters like "7" or "b", but also:

[A-Z]
[^_]
.

namely character classes (or ranges). And thus my question -- how to build NFA using character ranges? Using meta-characters like "not A", "anything else" and then computing overlapping ranges? This would lead to using tree-like structure when using final automaton, instead of just a table.

Update: please assume non-trivial in size (>>256) alphabet.

I am asking about NFA, but later I would like to convert NFA to DFA as well.

回答1:

The simplest approach would be:

  1. Use segments as labels for transitions in both NFA and DFA. For example, range [a-z] would be reperesented as segment [97, 122]; single character 'a' would become [97,97]; and any character '.' would become [minCode, maxCode].

  2. Each negated range [^a-z] would result in two transitions from starting state to next state. In this example two transitions [minCode, 96] and [123, maxCode] should be created.

  3. When range is represented by enumerating all possible characters [abcz], either transition per character should be created, or the code migh first group characters into ranges to optimize the number of transitions. So the [abcz] would become [a-c]|z. Thus two transitions instead of four.

This should be enough for NFA. However the classical power set construction to transform NFA to DFA will not work when there are transitions with intersecting character ranges. To solve this issue only one additional generalization step is required. Once a set of all input symbols created, in our case it will be a set of segments, it should be transformed into a set of non-intersecting segments. This can be done in time O(n*Log(n)), where n is a number of segments in a set using priority equeue (PQ) in which segments are ordered by the left component. Example:

  Procedure DISJOIN:
   Input  <- [97, 99] [97, 100] [98, 108]
   Output -> [97, 97] [98, 99], [100, 100], [101, 108]

Step 2. To create new transitions from a "set state" the algorithm should be modified as following:

   for each symbol in DISJOIN(input symbols)
     S <- empty set of symbols
     T <- empty "set state"
     for each state in "set state"
      for each transition in state.transitions
        I <- intersection(symbol, transition.label) 
        if (I is not empty)
        {
           Add I to the set S
           Add transition.To to the T
        }       

     for each segement from DISJOIN(S)
        Create transition from "set state" to T

To speed up matching when searching for a transition and input symbol C, transitions per state might be sorted by segments and binary search applied.



标签: regex dfa nfa