I'm building a simple scanner. Suppose I have the following tokens defined for my language:
!, !=, !==, <, <<, {
Now I can specify them using a regular expression:
!=?=? | { | <<?
Then I used http://hackingoff.com to build an NFA and a DFA. Each machine can now determine whether an input is in the language of the regexp or not. But my program is a sequence of tokens, not a single token:
!!=!<!==<<!{
My question is: how should I use the machines to split the string into tokens? I'm interested in the approach rather than the implementation.
The most common rule is "maximal munch", which always selects the longest possible token.
In general, the algorithm to scan a single token using a DFA is as follows: (To tokenise an input, this algorithm is repeated until the end of the input is reached, with each scan starting at the input cursor left by the previous scan.)
Set the DFA state to the start state. Then, for each input character in sequence:
if the DFA has a transition on the character, move to the indicated state. If that state is an accepting state, record the current input cursor and the state number.
otherwise, rewind the input to the last recorded accept position and return the recorded state number.
Here, the accepting state number is used to indicate which token was encountered. In practice, it is common to associate each accepting state with a token code, because some token types will have more than one accepting state.
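As a sketch of the above, here is a maximal-munch scanner for the question's token set (`!`, `!=`, `!==`, `<`, `<<`, `{`). The transition table, state numbers and token codes are a hypothetical hand encoding of the DFA, not output from any particular tool:

```python
# DFA for the tokens !, !=, !==, <, <<, {
# TRANSITIONS[state][char] -> next state; state 0 is the start state.
TRANSITIONS = {
    0: {'!': 1, '<': 4, '{': 6},
    1: {'=': 2},
    2: {'=': 3},
    4: {'<': 5},
}

# Each accepting state carries a token code.
ACCEPTING = {1: 'BANG', 2: 'NEQ', 3: 'NSEQ', 4: 'LT', 5: 'SHL', 6: 'LBRACE'}

def scan_token(text, pos):
    """Scan one token starting at pos using maximal munch.

    Returns (token_code, lexeme, new_pos)."""
    state = 0
    last_accept = None          # (token code, cursor just past the lexeme)
    i = pos
    while i < len(text) and text[i] in TRANSITIONS.get(state, {}):
        state = TRANSITIONS[state][text[i]]
        i += 1
        if state in ACCEPTING:  # record the most recent accept position
            last_accept = (ACCEPTING[state], i)
    if last_accept is None:
        raise ValueError(f"lexical error at position {pos}")
    code, end = last_accept     # rewind to the last recorded accept
    return code, text[pos:end], end

def tokenise(text):
    """Repeat the single-token scan until the input is exhausted."""
    pos, tokens = 0, []
    while pos < len(text):
        code, lexeme, pos = scan_token(text, pos)
        tokens.append((code, lexeme))
    return tokens

print(tokenise("!!=!<!==<<!{"))
# -> [('BANG', '!'), ('NEQ', '!='), ('BANG', '!'), ('LT', '<'),
#     ('NSEQ', '!=='), ('SHL', '<<'), ('BANG', '!'), ('LBRACE', '{')]
```

Note how maximal munch resolves the ambiguity in the question's example input: at `!==` the scanner keeps consuming past the shorter matches `!` and `!=` because a longer accept is still reachable.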
It is not always necessary to use the backtracking algorithm described above. In some cases, analysis of the DFA will reveal that the last accepting state was always at the immediately preceding input position. However, many languages require backtracking.
For example, a language in which `.` and `...` are both tokens but `..` is not (such as C) must backtrack on the input `..1`, which should be tokenised as `.`, `.1`. In the tokenisation of this input, the first scan will accept the first `.`, continue with the second `.` (which is not an accepting state) and then fail to find a transition on the input `1`. It will then report that it found a `.` (the recorded accepted token) and reset the input cursor to the second character position, so that the next scan will see the token `.1`.
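The same backtracking can be demonstrated with a small self-contained scanner. The DFA below is a hypothetical encoding for just three token types (`.`, `...`, and fractional numbers like `.1`); state `2`, reached after `..`, is deliberately not accepting:

```python
DIGITS = {d: 4 for d in '0123456789'}

# States: 0 start; 1 after '.' (DOT); 2 after '..' (NOT accepting);
# 3 after '...' (ELLIPSIS); 4 inside a number like '.1' (NUMBER).
TRANSITIONS = {
    0: {'.': 1},
    1: dict({'.': 2}, **DIGITS),
    2: {'.': 3},
    4: dict(DIGITS),
}
ACCEPTING = {1: 'DOT', 3: 'ELLIPSIS', 4: 'NUMBER'}

def tokenise(text):
    pos, tokens = 0, []
    while pos < len(text):
        state, i, last_accept = 0, pos, None
        while i < len(text) and text[i] in TRANSITIONS.get(state, {}):
            state = TRANSITIONS[state][text[i]]
            i += 1
            if state in ACCEPTING:
                last_accept = (ACCEPTING[state], i)
        if last_accept is None:
            raise ValueError(f"lexical error at position {pos}")
        code, end = last_accept      # rewind past the dead-end '..'
        tokens.append((code, text[pos:end]))
        pos = end
    return tokens

print(tokenise("..1"))   # -> [('DOT', '.'), ('NUMBER', '.1')]
```

On `..1` the first scan records an accept after the first `.`, advances through the non-accepting `..` state, fails on `1`, and rewinds: exactly the behaviour described above.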