I'm trying to use the Earley parser in NLTK to parse sentences such as:
If date is before 12/21/2010 then serial = 10
To do this, I'm trying to write a CFG, but the problem is that I need a general format for dates and integers as terminals, rather than specific values. Is there any way to specify the right-hand side of a production rule as a regular expression, which would allow this kind of processing?
Something like:
S -> '[0-9]+'
which would handle all integers.
For this to work, you'll need to tokenize the date so that each digit and slash is a separate token.
The output is:
This also affords some flexibility, since days and months are allowed to be single-digit.
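As a rough sketch of what this approach might look like (the grammar, the rule names `DATE`, `INT`, `NUM`, and `DIGIT`, and the helper functions `tokenize` and `parse` are my own assumptions for illustration, not code from the original answer): each digit is its own terminal, and a recursive `NUM` rule stands in for `[0-9]+`.

```python
import re

# Hypothetical grammar in NLTK's CFG string format. Every terminal is a
# keyword, a single digit, or a single symbol, so no regex is needed.
GRAMMAR = """
S -> 'if' 'date' 'is' 'before' DATE 'then' 'serial' '=' INT
DATE -> NUM '/' NUM '/' NUM
INT -> NUM
NUM -> DIGIT | DIGIT NUM
DIGIT -> '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
"""

def tokenize(sentence):
    """Split into lowercase words, single digits, and single symbols."""
    return re.findall(r"[a-z]+|[0-9]|[^\sa-z0-9]", sentence.lower())

def parse(sentence):
    """Return all parse trees for the sentence (requires NLTK 3)."""
    # Imported lazily so tokenize() can be used without NLTK installed.
    from nltk import CFG
    from nltk.parse import EarleyChartParser

    parser = EarleyChartParser(CFG.fromstring(GRAMMAR))
    return list(parser.parse(tokenize(sentence)))

if __name__ == "__main__":
    for tree in parse("If date is before 12/21/2010 then serial = 10"):
        print(tree)
```

Because `NUM` accepts one or more digits, this sketch does not check that a date is valid (it would accept `99/99/1`); tightening that would mean spelling out the allowed digit counts in the `DATE` rule.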