Antlr4 doesn't correctly recognizes unicode ch

2019-07-19 06:30发布

I've very simple grammar which tries to match 'é' to token E_CODE. I've tested it using TestRig tool (with -tokens option), but parser can't correctly match it. My input file was encoded in UTF-8 without BOM and I've used ANTLR version 4.4. Could somebody else also check this ? I got this output on my console:
line 1:0 token recognition error at: 'Ă'

grammar Unicode;

stat:EOF;  
E_CODE: '\u00E9' | 'é';

标签: antlr4
2条回答
爷的心禁止访问
2楼-- · 2019-07-19 06:48

Your grammar file is not saved in utf8 format. Utf8 is default format that antlr accept as input grammar file, according with terence Parr book.

查看更多
兄弟一词,经得起流年.
3楼-- · 2019-07-19 07:05

I tested the grammar:

grammar Unicode;

stat: E_CODE* EOF;

E_CODE: '\u00E9' | 'é';

as follows:

UnicodeLexer lexer = new UnicodeLexer(new ANTLRInputStream("\u00E9é"));
UnicodeParser parser = new UnicodeParser(new CommonTokenStream(lexer));
System.out.println(parser.stat().getText());

and the following got printed to my console:

éé<EOF>

Tested with 4.2 and 4.3 (4.4 isn't in Maven Central yet).

EDIT

Looking at the source I see TestRig takes an optional -encoding param. Have you tried setting it?

查看更多
登录 后发表回答