Parsing fortran-style .op. operators

Posted 2019-02-20 22:18

Question:

I'm trying to write an ANTLR4 grammar for a Fortran-inspired DSL. I'm having difficulty with the good old classic ".op." operators:

if (1.and.1) then

where both "1"s should be interpreted as integers. I looked at the OpenFortranParser for insight, but I can't make sense of it.

Initially, I had suitable definitions for INTEGER and REAL in my lexer. Consequently, the first "1" above always parsed as a REAL, no matter what I tried. I tried moving things into the parser, and got it to the point where I could reliably recognize the ".and." along with numbers around it as appropriately INTEGER or REAL.

if (1.and.1)   # INT/INT
if (1..and..1) # REAL/REAL

...etc...

I of course want to recognize variable-names in such statements:

if (a.and.b)

and have an appropriate rule for ID. In the small grammar below, however, any string that also appears as a quoted literal (e.g., 'and', 'if', and all the single-character numeric suffixes) is not accepted as an ID, and I get an error; any other ID-conforming string is accepted:

if (a.and.b)  # errs, as 'b' is valid INTEGER suffix
if (a.and.c)  # OK

Any insights into this behavior, or better suggestions on how to parse the .op. operators in Fortran, would be greatly appreciated. Thanks!

grammar Foo;

start  : ('if' expr | ID)+ ;

DOT : '.' ;

DIGITS: [0-9]+;

ID : [a-zA-Z0-9][a-zA-Z0-9_]* ;

andOp : DOT 'and' DOT ;

SIGN : [+-];

expr     
    : ID
    | expr andOp expr
    | numeric
    | '(' expr ')'
    ;

integer : DIGITS ('q'|'Q'|'l'|'L'|'h'|'H'|'b'|'B'|'i'|'I')? ;

real    
    : DIGITS DOT DIGITS? (('e'|'E') SIGN? DIGITS)? ('d' | 'D')?
    |        DOT DIGITS  (('e'|'E') SIGN? DIGITS)? ('d' | 'D')?
    ;

numeric : integer | real;

EOLN  : '\r'? '\n' -> skip;

WS    :  [ \t]+ -> skip;   

Answer 1:

To disambiguate DOT, add a lexer rule with a predicate just before the DOT rule.

DIT : DOT { isDIT() }? ;
DOT : '.' ;

Change the andOp rule to use the new token:

andOp : DIT 'and' DIT ;

Then add the predicate method:

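@lexer::members {

// True when the current '.' opens or closes a ".and." operator.
public boolean isDIT() {
    int offset = _tokenStartCharIndex;
    // Interval bounds are inclusive, so [offset, offset+4] spans five characters.
    String ahead = _input.getText(Interval.of(offset, offset + 4));
    // Guard the look-behind so a '.' near the start of input cannot produce a negative index.
    String behind = offset >= 4 ? _input.getText(Interval.of(offset - 4, offset)) : "";
    return ".and.".equals(ahead) || ".and.".equals(behind);
}

}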
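
The lookaround itself can be tried outside ANTLR. The following is a minimal sketch on a plain String, assuming only the .and. operator is of interest; the class and method names (DotLookaround, isOperatorDot) are hypothetical and not part of the ANTLR runtime:

```java
// Standalone sketch of the lookaround check; no ANTLR dependency.
final class DotLookaround {
    // Assumed operator spelling; other .op. forms would need their own entries.
    private static final String OP = ".and.";

    /** Returns true if the '.' at index dot opens or closes ".and." in text. */
    static boolean isOperatorDot(String text, int dot) {
        if (dot < 0 || dot >= text.length() || text.charAt(dot) != '.') {
            return false;
        }
        // Opening dot: the operator starts at this position.
        boolean opens = text.startsWith(OP, dot);
        // Closing dot: the operator ends here, so it starts OP.length() - 1 earlier.
        int start = dot - (OP.length() - 1);
        boolean closes = start >= 0 && text.startsWith(OP, start);
        return opens || closes;
    }
}
```

With "1.and.1" both dots count as operator dots, while in "1..and..1" only the inner pair does, which leaves the outer dots free to belong to the REAL literals.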

But that is not really the source of your current problem. The quoted literals in the integer parser rule ('q', 'b', 'i', and so on) implicitly define lexer tokens, and those implicit tokens take precedence over the explicit ID rule. That is why 'b' is never matched as an ID.

Change it to

integer : INT ;

INT:  DIGITS ('q'|'Q'|'l'|'L'|'h'|'H'|'b'|'B'|'i'|'I')? ;

and the lexer will figure out the rest.
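
To illustrate why moving the suffixes into INT helps, here is a toy matcher. It is not ANTLR; it only imitates the "longest match wins, earliest rule wins ties" selection under simplifying assumptions. The rule tables model just the rules relevant to the tie, INT is assumed to be declared before ID, and the names TieBreakDemo and firstToken are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TieBreakDemo {
    /** Returns the name of the rule that wins for the token at the start of input. */
    static String firstToken(Map<String, Pattern> rules, String input) {
        String winner = null;
        int best = 0;
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // Only a strictly longer match replaces the winner, so earlier rules win ties.
            if (m.lookingAt() && m.end() > best) {
                best = m.end();
                winner = rule.getKey();
            }
        }
        return winner;
    }

    /** Original grammar: the literal 'b' in a parser rule acts as a token ahead of ID. */
    static Map<String, Pattern> withImplicitLiteral() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("'b'", Pattern.compile("b"));
        rules.put("ID", Pattern.compile("[a-zA-Z0-9][a-zA-Z0-9_]*"));
        return rules;
    }

    /** Revised grammar: the suffix lives inside INT, so no bare 'b' token exists. */
    static Map<String, Pattern> withIntRule() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("INT", Pattern.compile("[0-9]+[qQlLhHbBiI]?"));
        rules.put("ID", Pattern.compile("[a-zA-Z0-9][a-zA-Z0-9_]*"));
        return rules;
    }
}
```

In the first table a lone "b" ties between the literal and ID and goes to the literal, while "c" has no competitor and becomes an ID. In the second table "b" can only be an ID, and "1b" goes to INT because it is listed first.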



Tags: antlr4