I have a very simple grammar that tries to match 'é' to the token E_CODE.
I tested it with the TestRig tool (using the -tokens option), but the parser doesn't match it correctly.
My input file is encoded in UTF-8 without a BOM, and I'm using ANTLR version 4.4.
Could somebody else check this as well? I got this output on my console:
line 1:0 token recognition error at: 'Ă'
This is the grammar:
grammar Unicode;
stat:EOF;
E_CODE: '\u00E9' | 'é';
I tested the grammar:
grammar Unicode;
stat: E_CODE* EOF;
E_CODE: '\u00E9' | 'é';
as follows:
UnicodeLexer lexer = new UnicodeLexer(new ANTLRInputStream("\u00E9é"));
UnicodeParser parser = new UnicodeParser(new CommonTokenStream(lexer));
System.out.println(parser.stat().getText());
and the following got printed to my console:
éé<EOF>
Tested with 4.2 and 4.3 (4.4 isn't in Maven Central yet).
EDIT
Looking at the source, I see that TestRig takes an optional -encoding
parameter. Have you tried setting it?
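The 'Ă' in the error message is consistent with the UTF-8 bytes of 'é' (0xC3 0xA9) being decoded with a single-byte Central European charset, which would explain why an explicit -encoding UTF-8 might help. The following is only a sketch of that effect in plain Java, without ANTLR. The class name EncodingDemo and the method reinterpret are made up for illustration, and windows-1250 is an assumption about the platform default on the asker's machine, not something the question states.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Encodes the text as UTF-8, then decodes those same bytes
    // with the named charset (hypothetical helper for illustration).
    static String reinterpret(String text, String charsetName) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 'é' written as an escape so this file's own encoding does not matter.
        String e = "\u00E9";

        // Decoding with the right charset round-trips to the single character.
        String right = reinterpret(e, "UTF-8");
        System.out.println("UTF-8 decode, length: " + right.length());

        // Assumption: the platform default was windows-1250.
        // The two UTF-8 bytes then become two separate characters,
        // the first of which is U+0102 ('Ă').
        String wrong = reinterpret(e, "windows-1250");
        System.out.println("windows-1250 decode, length: " + wrong.length());
        System.out.println("first char: U+0"
                + Integer.toHexString(wrong.charAt(0)).toUpperCase());
    }
}
```

If that is what happens here, the lexer never sees 'é' at all, only 'Ă' followed by '©', and neither matches E_CODE.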
Your grammar file is not saved as UTF-8.
According to Terence Parr's book, UTF-8 is the default encoding that ANTLR accepts for an input grammar file.