ANTLR4 doesn't correctly recognize Unicode characters

Posted 2019-07-19 06:30

I have a very simple grammar that tries to match 'é' to the token E_CODE. I tested it with the TestRig tool (using the -tokens option), but the parser can't match it correctly. My input file was encoded in UTF-8 without a BOM, and I used ANTLR version 4.4. Could somebody else check this as well? I got this output on my console:
line 1:0 token recognition error at: 'Ă'

grammar Unicode;

stat:EOF;  
E_CODE: '\u00E9' | 'é';

Tags: antlr4

2 answers

爷的心禁止访问

#2 · 2019-07-19 06:48

Your grammar file is not saved in UTF-8. According to Terence Parr's book, UTF-8 is the default encoding that ANTLR accepts for input grammar files.
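
One way to test whether a file really is UTF-8 is to decode its bytes strictly and see whether the decoder objects. The following is a minimal sketch using only the JDK; the class name `Utf8Check` and the helper `isValidUtf8` are hypothetical names introduced for illustration, and the byte arrays stand in for the contents of a grammar or input file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ANTLR.
public class Utf8Check {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 'é' saved as UTF-8 is the two bytes C3 A9.
        byte[] utf8 = "\u00E9".getBytes(StandardCharsets.UTF_8);
        // 'é' saved as ISO-8859-1 is the single byte E9, which is a truncated UTF-8 sequence.
        byte[] latin1 = "\u00E9".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin1)); // false
    }
}
```

To check a real file, the same helper could be fed the result of `Files.readAllBytes(Paths.get(...))`. A file that contains only ASCII also passes, so the check can rule UTF-8 out but cannot prove which encoding the editor intended.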


兄弟一词,经得起流年.

#3 · 2019-07-19 07:05

I tested the grammar:

grammar Unicode;

stat: E_CODE* EOF;

E_CODE: '\u00E9' | 'é';

as follows:

UnicodeLexer lexer = new UnicodeLexer(new ANTLRInputStream("\u00E9é"));
UnicodeParser parser = new UnicodeParser(new CommonTokenStream(lexer));
System.out.println(parser.stat().getText());

and the following got printed to my console:

éé<EOF>

Tested with 4.2 and 4.3 (4.4 isn't in Maven Central yet).

EDIT

Looking at the source, I see that TestRig takes an optional -encoding parameter. Have you tried setting it?
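
The 'Ă' in the reported error is what one would expect if the UTF-8 bytes of 'é' were decoded with a single-byte platform default charset. The sketch below illustrates this with the JDK alone. The choice of windows-1250 is an assumption (the question does not say which default charset was in effect), and `MojibakeDemo`, `decode`, and `codePoints` are hypothetical names. Code points are printed in hex so the result does not depend on the console's own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of ANTLR.
public class MojibakeDemo {
    // Decodes raw bytes with the named charset, as a reader with that encoding would.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Formats each UTF-16 code unit as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Bytes of an input file that contains only 'é', saved as UTF-8: C3 A9.
        byte[] fileBytes = "\u00E9".getBytes(StandardCharsets.UTF_8);

        // Decoded as UTF-8, the lexer would see the single character é.
        System.out.println(codePoints(decode(fileBytes, "UTF-8")));        // U+00E9

        // Decoded as windows-1250 (assumed default), it would see Ă followed by ©.
        System.out.println(codePoints(decode(fileBytes, "windows-1250"))); // U+0102 U+00A9
    }
}
```

If this is what happened, neither U+0102 nor U+00A9 matches E_CODE, which is consistent with a token recognition error at 'Ă', and passing the file's actual encoding through the -encoding option would be the natural thing to try.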
