I have an ANTLR 4 lexer grammar. It has many rules for words, but I also want it to create an Unknown token for any word that it cannot match with the other rules. I have something like this:
Whitespace : [ \t\n\r]+ -> skip;
Punctuation : [.,:;?!];
// Other rules here
Unknown : .+? ;
Now the generated lexer catches '~' as Unknown, but it creates three '~' Unknown tokens for the input '~~~' instead of a single '~~~' token. What should I do to tell the lexer to generate a single word token for consecutive unknown characters? I also tried "Unknown: . ;" and "Unknown : .+ ;" with no results.
EDIT: In current ANTLR versions, .+? now catches the remaining words, so this problem seems to be resolved.
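A non-greedy .+? at the end of a lexer rule will always match a single character. But .+ will consume as much as possible, which was illegal at the end of a rule in ANTLR v3 (v4 probably as well). What you can do is just match a single char in the lexer (Unknown : . ;) and "glue" these together in the parser, with a parser rule that matches one or more consecutive Unknown tokens (for example, unknown : Unknown+ ;).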
EDIT
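Ah, I see. Then you could override the lexer's nextToken() method so that consecutive single-character Unknown tokens are merged into one token before the parser sees them. With that in place, an input such as '~~~' should come out as a single Unknown token.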
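A minimal sketch of the merging idea behind such a nextToken() override, written as plain Java so it runs without the ANTLR runtime. The Tok class, the token type constants and the UnknownMerger name are hypothetical stand-ins, not ANTLR API. In a generated ANTLR 4 lexer the same logic would presumably sit in an overridden nextToken() (for example inside @lexer::members), call super.nextToken() where this sketch calls rawNext(), and build the merged token with new CommonToken(Unknown, text).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;

class UnknownMerger {
    // Hypothetical token types; a generated lexer would supply its own constants.
    static final int EOF = -1;
    static final int PUNCTUATION = 1;
    static final int UNKNOWN = 2;

    // Hypothetical stand-in for an ANTLR Token: just a type and its text.
    static final class Tok {
        final int type;
        final String text;
        Tok(int type, String text) { this.type = type; this.text = text; }
    }

    private final Iterator<Tok> raw;                      // single-character tokens from the "lexer"
    private final Queue<Tok> pending = new ArrayDeque<>(); // lookahead token read while merging

    UnknownMerger(List<Tok> rawTokens) { this.raw = rawTokens.iterator(); }

    // Plays the role of super.nextToken(): returns EOF once the input is exhausted.
    private Tok rawNext() {
        return raw.hasNext() ? raw.next() : new Tok(EOF, "<EOF>");
    }

    // Merges a run of consecutive UNKNOWN tokens into one UNKNOWN token.
    Tok nextToken() {
        if (!pending.isEmpty()) {
            return pending.poll();        // hand out the token that ended the previous run
        }
        Tok next = rawNext();
        if (next.type != UNKNOWN) {
            return next;                  // known tokens pass through unchanged
        }
        StringBuilder text = new StringBuilder();
        while (next.type == UNKNOWN) {
            text.append(next.text);
            next = rawNext();
        }
        pending.add(next);                // first non-UNKNOWN token, returned on the next call
        return new Tok(UNKNOWN, text.toString());
    }

    // Convenience helper: drains the source and returns every token before EOF.
    static List<Tok> mergeAll(List<Tok> rawTokens) {
        UnknownMerger merger = new UnknownMerger(rawTokens);
        List<Tok> out = new ArrayList<>();
        for (Tok t = merger.nextToken(); t.type != EOF; t = merger.nextToken()) {
            out.add(t);
        }
        return out;
    }
}
```

Feeding mergeAll three UNKNOWN '~' tokens, a PUNCTUATION '.' and one UNKNOWN '#' yields three tokens: UNKNOWN '~~~', PUNCTUATION '.', UNKNOWN '#'. One thing to watch in a real lexer: a token built only from a type and a text carries no line or position information, so you may want to copy that over from the first token of the run.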