can an element contain attribute as parsed by pars

Posted 2019-03-04 00:06


Question:

I am following this tutorial and have successfully replicated its behavior, except that I am using ANTLR 4.7 instead of the 4.5 that the tutorial used.

I am trying to build a DSL for an expense tracker.

I was wondering whether each element can have attributes.

E.g. this is what it looks like now

This is the code for todo.g4, as seen at https://github.com/simkimsia/learn-antlr-web-js/blob/master/todo.g4

grammar todo;

elements
    : (element|emptyLine)* EOF
    ;

element
    : '*' ( ' ' | '\t' )* CONTENT NL+
    ;

emptyLine
    : NL
    ;

NL
    : '\r' | '\n' 
    ;

CONTENT
    : [a-zA-Z0-9_][a-zA-Z0-9_ \t]*
    ;

That is, each element would also carry two attributes: a payee and an amount. To keep it simple, every entry will use the same sentence structure so that parsing is easier.

The format will be pay [payee] [amount]

For example: pay Acme Corp 123,789.45

Here the payee is Acme Corp and the amount is 12378945, an integer expressing the amount in cents.

Another example: pay Banana Inc 700

Here the payee is Banana Inc and the amount is 70000, again an integer expressing the amount in cents.
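The conversion from the amount text to an integer number of cents can be sketched in plain Java. This is only an illustration of the arithmetic described above, independent of ANTLR; the class name AmountToCents and the method toCents are hypothetical names, and the sketch assumes that commas are purely thousands separators and that amounts have at most two decimal places.

```java
import java.math.BigDecimal;

public class AmountToCents {
    // Converts amount text such as "123,789.45" or "700" to a whole number of cents.
    // Assumption: commas are thousands separators and carry no other meaning.
    static long toCents(String amountText) {
        // Remove the separators, then parse exactly (no binary floating point involved).
        BigDecimal amount = new BigDecimal(amountText.replace(",", ""));
        // movePointRight(2) multiplies by 100; longValueExact() throws an
        // ArithmeticException if a fraction of a cent would be lost (e.g. "1.234").
        return amount.movePointRight(2).longValueExact();
    }

    public static void main(String[] args) {
        System.out.println(toCents("123,789.45")); // prints 12378945
        System.out.println(toCents("700"));        // prints 70000
    }
}
```

BigDecimal is used here rather than double so that values such as 123789.45 are not subject to rounding error before being scaled to cents.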

I am guessing I need to change todo.g4 and then regenerate the parser.

Can an element have other attributes? If so, how do I get started?

UPDATE

These are my attempts, most recent first:

I just figured out how to use grun and TestRig. Thanks @Raven for that tip.

Latest attempt: my latest expense.g4 (the only difference from the earlier attempt is the NUMBER rule)

grammar expense;

payments: (payment NL)* ;  
payment: PAY receiver amount=NUMBER ;  
receiver: surname=ID (lastname=ID)? ;  

PAY: 'pay' ;
NUMBER: ([0-9]+(','[0-9]+)*)('.'[0-9]*)?;
ID: [a-zA-Z0-9_]+ ;
NL: '\n' | '\r\n' ;  
WS: [\t ]+ -> skip ;

Earlier attempt: This is my expense.g4

grammar expense;

payments: (payment NL)* ;  
payment: PAY receiver amount=NUMBER ;  
receiver: surname=ID (lastname=ID)? ;  

PAY: 'pay' ;
NUMBER: [0-9]+ (',' [0-9]+)+ ('.' [0-9]+)? ;  
ID: [a-zA-Z0-9_]+ ;
NL: '\n' | '\r\n' ;  
WS: [\t ]+ -> skip ;
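To see what the change to the NUMBER rule buys, the two rules can be transcribed into java.util.regex patterns and tried against sample amounts. This is a rough check only: the patterns are hand transcriptions, the class name NumberRuleCheck is hypothetical, and a whole-string regex match is not the same thing as ANTLR's longest-match tokenization, where ID also competes for the same characters.

```java
import java.util.regex.Pattern;

public class NumberRuleCheck {
    // Transcription of the earlier rule: [0-9]+ (',' [0-9]+)+ ('.' [0-9]+)?
    static final Pattern EARLIER = Pattern.compile("[0-9]+(,[0-9]+)+(\\.[0-9]+)?");
    // Transcription of the latest rule: ([0-9]+(','[0-9]+)*)('.'[0-9]*)?
    static final Pattern LATEST = Pattern.compile("[0-9]+(,[0-9]+)*(\\.[0-9]*)?");

    public static void main(String[] args) {
        String[] samples = {"123,789.45", "700", "456789.00", "123,456"};
        for (String s : samples) {
            // matches() requires the whole string to fit the pattern.
            System.out.println(s
                    + " earlier=" + EARLIER.matcher(s).matches()
                    + " latest=" + LATEST.matcher(s).matches());
        }
    }
}
```

The earlier pattern requires at least one comma group, so it rejects 700 and 456789.00; the latest pattern accepts all four samples.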

Earlier attempt: https://github.com/simkimsia/learn-antlr-web-js/commit/728813ac275a3f2ad16d7f51ce15fcc27d40045b#commitcomment-25127606

Earlier attempt: https://github.com/simkimsia/learn-antlr-web-js/commit/0c32aec6ffb4b4275db86d54e9788058a2ce8759#commitcomment-25125695

Answer 1:

Situation as of October 24, 2017, 19:00 UTC+1.

Your grammar works perfectly. I ran a full test in Java.

File Expense.g4 :

grammar Expense;

payments
@init {System.out.println("Expense last update 1853");}
    : (payment NL)*
    ;

payment
    : PAY receiver amount=NUMBER
      {System.out.println("Payement found " + $amount.text + " to " + $receiver.text);}
    ;

receiver
    : surname=ID (lastname=ID)?
    ; 

PAY    : 'pay' ;
NUMBER : ([0-9]+(','[0-9]+)*)('.'[0-9]*)? ;
ID     : [a-zA-Z0-9_]+ ;
NL     : '\n' | '\r\n' ;  
WS     : [\t ]+ -> channel(HIDDEN) ; // keep the spaces (without spaces ==> paydeltaco98)

File ExpenseMyListener.java :

public class ExpenseMyListener extends ExpenseBaseListener {
    ExpenseParser parser;
    public ExpenseMyListener(ExpenseParser parser) { this.parser = parser; }

    public void exitPayments(ExpenseParser.PaymentsContext ctx) {
        System.out.println(">>> in ExpenseMyListener for paymentsss");
        System.out.println(">>> there are " + ctx.payment().size() + " elements in the list of payments");
        for (int i = 0; i < ctx.payment().size(); i++) {
            System.out.println(ctx.payment(i).getText());
        }
    }

    public void exitPayment(ExpenseParser.PaymentContext ctx) {
        System.out.println(">>> in ExpenseMyListener for payment");
        System.out.println(parser.getTokenStream().getText(ctx));
    }
}

File test_expense.java :

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.ParserRuleContext;
import org.antlr.v4.runtime.tree.*;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.IOException;

public class test_expense {
    public static void main(String[] args) throws IOException {
        ANTLRInputStream input = new ANTLRFileStream(args[0]);
        ExpenseLexer lexer = new ExpenseLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        ExpenseParser parser = new ExpenseParser(tokens);
        ParseTree tree = parser.payments();
        System.out.println("---parsing ended");
        ParseTreeWalker walker = new ParseTreeWalker();
        ExpenseMyListener my_listener = new ExpenseMyListener(parser);
        System.out.println(">>>> about to walk");
        walker.walk(my_listener, tree);
    }
}

Input file top.text :

pay Acme Corp 123,456
pay Banana Inc 456789.00
pay charlie pte 123,456.89
pay delta co 98

Execution :

$ export CLASSPATH=".:/usr/local/lib/antlr-4.6-complete.jar"
$ alias
alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Expense.g4 
$ javac Ex*.java
$ javac test_expense.java 
$ grun Expense payments -tokens -diagnostics top.text
[@0,0:2='pay',<'pay'>,1:0]
[@1,3:3=' ',<WS>,channel=1,1:3]
[@2,4:7='Acme',<ID>,1:4]
[@3,8:8=' ',<WS>,channel=1,1:8]
[@4,9:12='Corp',<ID>,1:9]
...
[@32,90:89='<EOF>',<EOF>,5:0]
Expense last update 1853
Payement found 123,456 to Acme Corp
Payement found 456789.00 to Banana Inc
Payement found 123,456.89 to charlie pte
Payement found 98 to delta co

$ java test_expense top.text 
Expense last update 1853
Payement found 123,456 to Acme Corp
Payement found 456789.00 to Banana Inc
Payement found 123,456.89 to charlie pte
Payement found 98 to delta co
---parsing ended
>>>> about to walk
>>> in ExpenseMyListener for payment
pay Acme Corp 123,456
>>> in ExpenseMyListener for payment
pay Banana Inc 456789.00
>>> in ExpenseMyListener for payment
pay charlie pte 123,456.89
>>> in ExpenseMyListener for payment
pay delta co 98
>>> in ExpenseMyListener for paymentsss
>>> there are 4 elements in the list of payments
payAcmeCorp123,456
payBananaInc456789.00
paycharliepte123,456.89
paydeltaco98

Answer 2:

I'm not entirely sure what exactly you want but for the provided examples this grammar should do the job:

payments: (payment NL)* ;  
payment: PAY receiver amount=NUMBER ;  
receiver: surname=ID (lastname=ID)? ;  

PAY: 'pay' ;
NUMBER: [0-9]+ (',' [0-9]+)* ('.' [0-9]+)? ;  // '*' so that amounts without a comma, such as 700, also match
ID: [a-zA-Z0-9_]+ ;
NL: '\n' | '\r\n' ;  
WS: [\t ]+ -> skip ;

If this is what you were asking for I will add some more explanation if needed...

Answer 3:

"I am guessing I need to change the todo.g4 and then re generate the parser."

Of course, regenerate the parser after each change to the grammar. For me it's:

$ a4 Question.g4
$ javac Q*.java
$ grun Question elements -tokens -diagnostics t.text

where

$ alias
alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
alias grun='java org.antlr.v4.gui.TestRig'

The more specific the content you try to describe, the more ambiguity problems you may face. For example, suppose you have two rules:

payment   : 'pay' [payee] [amount]
free_text : ... any character ...

Consider the following content :

* pay Federico Tomassetti 10 € for the tutorial

The prefix * pay Federico Tomassetti 10 is ambiguous and can be matched by both rules, but the line will ultimately be parsed as free text, because the trailing € for the tutorial does not satisfy payment.

If later you change the payment rule to accept more info after the amount :

payment   : 'pay' [payee] [amount] payment_info

the above content will be matched by payment (in case of ambiguity, ANTLR chooses the first alternative). The good news is that ANTLR 4 is very good at disambiguating: it looks ahead as far as it needs to, up to the end of the file if necessary.
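The "payment first, free text as fallback" idea can be imitated in plain Java as a rough analogy. This sketch uses a regular expression, not ANTLR's prediction algorithm; the class name LineClassifier, the method classify, and the pattern itself are hypothetical stand-ins for the payment rule (the word pay, one or two words, then an amount).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineClassifier {
    // Hypothetical stand-in for the payment rule: 'pay', one or two words, then an amount.
    static final Pattern PAYMENT = Pattern.compile(
            "pay\\s+([A-Za-z]\\w*(?:\\s+[A-Za-z]\\w*)?)\\s+([0-9]+(?:,[0-9]+)*(?:\\.[0-9]+)?)");

    // Tries the payment alternative first. The whole content must match;
    // anything else falls through to the free-text alternative.
    static String classify(String content) {
        Matcher m = PAYMENT.matcher(content.trim());
        if (m.matches()) {
            return "payment to " + m.group(1) + " of " + m.group(2);
        }
        return "free text";
    }

    public static void main(String[] args) {
        System.out.println(classify("pay Acme Corp 123,789.45"));
        System.out.println(classify("pay Federico Tomassetti 10 € for the tutorial"));
        System.out.println(classify("write a tutorial"));
    }
}
```

The second sample starts like a payment, but because the text after the amount does not fit the pattern, the whole line is classified as free text, which mirrors the behavior described above.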

For ambiguous tokens and precedence rules, read the posts of these last three weeks; a lot has been said.

Mixing Raven's grammar with yours, this is one possible solution :

File Question.g4

grammar Question;

elements
@init {System.out.println("Question last update 1432");}
    : ( element | emptyLine )* EOF
    ;

element
    : '*' content NL
    ;

content
    : payment   //{System.out.println("Payement found " + $payment.text);}
    | free_text {System.out.println("Free text found " + $free_text.text);}
    ;

payment
    : PAY receiver amount=NUMBER
      {System.out.println("Payement found " + $amount.text + " to " + $receiver.text);}
    ;

receiver
    : surname=WORD ( lastname=WORD )?
    ;  

free_text
    : ( WORD | PAY | NUMBER )+
    ;

emptyLine
    : NL
    ;

PAY    : 'pay' ;
WORD   : LETTER ( LETTER | DIGIT | '_' )* ;
NUMBER : DIGIT+ ( ',' DIGIT+ )? ( '.' DIGIT+ )? ;  

NL  : [\r\n]
    | '\r\n' 
    ;
//WS  : [ \t]+ -> skip ; // $payment.text => payAcmeCorp123,789.45
WS  : [ \t]+ -> channel(HIDDEN) ; // spaces are needed to nicely display $payment.text

fragment DIGIT  : [0-9] ;
fragment LETTER : [a-zA-Z] ;

File t.text

* play with ANTLR 4
* write a tutorial
* pay Acme Corp 123,789.45
* pay Banana Inc 700
* pay Federico Tomassetti 10 € for the tutorial

Execution :

$ grun Question elements -tokens -diagnostics t.text
line 5:29 token recognition error at: '€'
[@0,0:0='*',<'*'>,1:0]
[@1,1:1=' ',<WS>,channel=1,1:1]
[@2,2:5='play',<WORD>,1:2]
[@3,6:6=' ',<WS>,channel=1,1:6]
[@4,7:10='with',<WORD>,1:7]
[@5,11:11=' ',<WS>,channel=1,1:11]
[@6,12:16='ANTLR',<WORD>,1:12]
[@7,17:17=' ',<WS>,channel=1,1:17]
[@8,18:18='4',<NUMBER>,1:18]
[@9,19:19='\n',<NL>,1:19]
[@10,20:20='*',<'*'>,2:0]
[@11,21:21=' ',<WS>,channel=1,2:1]
[@12,22:26='write',<WORD>,2:2]
[@13,27:27=' ',<WS>,channel=1,2:7]
[@14,28:28='a',<WORD>,2:8]
[@15,29:29=' ',<WS>,channel=1,2:9]
[@16,30:37='tutorial',<WORD>,2:10]
[@17,38:38='\n',<NL>,2:18]
...
[@56,136:135='<EOF>',<EOF>,7:0]
Question last update 1432
Free text found play with ANTLR 4
Free text found write a tutorial
line 3:26 reportAttemptingFullContext d=2 (content), input='pay Acme Corp 123,789.45
'
...
Payement found 700 to Banana Inc
Free text found pay Federico Tomassetti 10  for the tutorial

As you can see, the € symbol is not recognized: the lexer reports a token recognition error and the symbol is dropped from the free text. You may need a CONTENT rule similar to FIELDTEXT here, and then you get into trouble ...

Federico's Mega tutorial is a good start. For nitty-gritty details, see The Definitive ANTLR 4 Reference or the online doc from www.antlr.org.