I'm using PLY as my lexer. My token specifications are the following:
t_WHILE = r'while'
t_THEN = r'then'
t_ID = r'[a-zA-Z_][a-zA-Z0-9_]*'
t_NUMBER = r'\d+'
t_LESSEQUAL = r'<='
t_ASSIGN = r'='
t_ignore = r' \t'
When I try to tokenize the following string:
"while n <= 0 then h = 1"
it gives the following output:
LexToken(ID,'while',1,0)
LexToken(ID,'n',1,6)
LexToken(LESSEQUAL,'<=',1,8)
LexToken(NUMBER,'0',1,11)
LexToken(ID,'hen',1,14) ------> PROBLEM!
LexToken(ID,'h',1,18)
LexToken(ASSIGN,'=',1,20)
LexToken(NUMBER,'1',1,22)
It doesn't recognize the THEN token; instead it takes "hen" as an identifier.
Any ideas?
The reason this doesn't work is the way PLY prioritizes token matches: for tokens defined as plain strings, the longest regular expression is tried first, so the ID pattern is tested before the shorter WHILE and THEN patterns and swallows the keywords.
The easiest way to prevent this problem is to match identifiers and reserved words with the same rule, and select the appropriate token type based on the matched text. The following code is similar to an example in the PLY documentation.
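A sketch of that approach, modeled on the reserved-words example in the PLY documentation and reusing the question's token names:

```python
# Reserved words live in a lookup table rather than having their own
# regex rules, so they can never be shadowed by the ID pattern.
reserved = {
    'while': 'WHILE',
    'then':  'THEN',
}

tokens = ['ID', 'NUMBER', 'LESSEQUAL', 'ASSIGN'] + list(reserved.values())

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    # If the matched text is a reserved word, reclassify the token;
    # otherwise it stays an ordinary identifier.
    t.type = reserved.get(t.value, 'ID')
    return t
```

Because one rule matches both keywords and identifiers, "then" can never be split apart; the table lookup alone decides whether the lexeme is a keyword.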
PLY sorts tokens declared as plain strings in order of decreasing regular-expression length, but tokens declared as functions keep the order in which they are defined. From the docs: when building the master regular expression, all tokens defined by functions are added first, in the same order as they appear in the lexer file; tokens defined by strings are added afterwards, sorted by decreasing regular-expression length.
So, an alternative solution would be simply to specify the tokens you want prioritized as functions, instead of strings, like so:
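A sketch of that variant, again with the question's token names (the point is the rule ordering, not the full grammar):

```python
# Function rules are added to the master regex in definition order,
# so WHILE and THEN are tried before the string-defined ID rule.
def t_WHILE(t):
    r'while'
    return t

def t_THEN(t):
    r'then'
    return t

# String rules are added after all function rules, sorted by
# decreasing regex length.
t_ID        = r'[a-zA-Z_][a-zA-Z0-9_]*'
t_NUMBER    = r'\d+'
t_LESSEQUAL = r'<='
t_ASSIGN    = r'='
t_ignore    = ' \t'   # plain string: ignore spaces and tabs
```

One caveat with this ordering: an identifier that merely starts with a keyword (say, "whilenot") would be split into WHILE plus an ID, which is why the lookup-table approach is generally preferred.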
This way WHILE and THEN will be the first rules to be added, and you get the behaviour you expected.
As a side note, you were using

r' \t'

(a raw string) for t_ignore, so Python kept the backslash and the letter 't' as two separate characters instead of interpreting \t as a tab. Since t_ignore treats its value as a set of individual characters to skip, your lexer was silently ignoring spaces, backslashes, and every letter 't' in the input, which is exactly why the leading 't' of "then" disappeared and "hen" was tokenized as an ID. It should be a plain string instead:

t_ignore = ' \t'
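To see the difference concretely (plain Python, no PLY needed):

```python
# In a raw string, \t is two characters: a backslash and the letter 't'.
raw   = r' \t'
plain = ' \t'

print(list(raw))    # [' ', '\\', 't']
print(list(plain))  # [' ', '\t']
```

Because t_ignore skips each character of the string individually, the raw version drops every 't' in the source, not just tabs.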