antlr4 multiline string parsing

Posted 2019-07-26 20:22

Question:

If I have a ONELINE_STRING fragment rule in an ANTLR4 lexer that matches a simple quoted string on one line, how can I write a more general STRING lexer rule that concatenates adjacent ONELINE_STRINGs (i.e., ones separated only by whitespace and/or comments), but only when each of them starts on a different line?

i.e.,

"foo" "bar" 

would be parsed as two STRING tokens, "foo" followed by "bar"

while:

"foo"
"bar"

would be seen as one STRING token: "foobar"

For clarification: in general I want the parser to see adjacent strings as separate tokens, and to ignore whitespace and comments. The exception is that if the last non-whitespace sub-token on a line is a string, and the first non-whitespace sub-token on the next line is also a string, the two should be concatenated into one long string. This allows very long strings to be specified without putting the whole thing on one line. It would be straightforward if I wanted all adjacent string sub-tokens concatenated, as in C, but I only want concatenation when the string sub-tokens start on different lines.

This concatenation should be invisible to any parser rule that uses a string, which is why I thought the rule belonged in the lexer rather than the parser. I am not wholly opposed to doing it in the parser, though: every parser rule that would have referred to a STRING token would then refer to a parser rule named string instead.
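The rule described above can be sketched independently of ANTLR. The following is a minimal plain-Java illustration, not the grammar itself: `LineAwareConcat`, `Str`, and `merge` are hypothetical names, and the input list is assumed to hold string fragments that are already known to be adjacent (separated only by whitespace or comments), each tagged with the line on which it starts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LineAwareConcat {
    /** Hypothetical stand-in for a string token: unquoted text plus the 1-based line it starts on. */
    static final class Str {
        final String text;
        final int line;
        Str(String text, int line) { this.text = text; this.line = line; }
    }

    /** Merges a fragment into its predecessor only when it starts on a later line. */
    static List<Str> merge(List<Str> adjacent) {
        List<Str> out = new ArrayList<>();
        int lastLine = -1; // start line of the most recently seen fragment
        for (Str s : adjacent) {
            if (!out.isEmpty() && s.line > lastLine) {
                // Different line: fold this fragment into the previous string.
                Str prev = out.remove(out.size() - 1);
                out.add(new Str(prev.text + s.text, prev.line));
            } else {
                // First fragment, or same line as the previous one: keep it separate.
                out.add(s);
            }
            lastLine = s.line;
        }
        return out;
    }

    public static void main(String[] args) {
        // "foo" starts on line 1 and "bar" on line 2, so they are merged.
        for (Str s : merge(Arrays.asList(new Str("foo", 1), new Str("bar", 2)))) {
            System.out.println(s.text);
        }
    }
}
```

Tracking the start line of the latest fragment, rather than of the merged string, means a third fragment on the same line as the second is kept separate.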

Sample1:

"desc" "this sample will parse as two strings."

Sample2 (note, 'output' is a keyword in the language):

output "this is a very long line that I've explicitly made so that it does not "
       "easily fit on just one line, so it gets split up into separate ones for "
       "ease of reading, but the  parser should see it all as one long string. "
       "This example will parse as if the output command had been followed by "
       "only a single string, even though it is composed of multiple string "
       "fragments, all of which should be invisible to the parser.%n";

Both of these examples should be accepted as valid by the parser. The former is an example of a declaration, while the latter is an example of an imperative statement in the language.

Addendum:

I had originally thought that this would need to be done in the lexer. Newlines are supposed to be ignored by the parser, like all other whitespace, yet a multiline string is sensitive to the presence of newlines, and I did not think the parser could perceive them.

However, I have been thinking that it may be possible to keep ONELINE_STRING as a lexer rule and add a general 'string' parser rule that detects adjacent ONELINE_STRING tokens. A predicate between the strings would check whether the next ONELINE_STRING token starts on a different line than the previous one; if so, the rule would invisibly concatenate them, so that the resulting text is indistinguishable from a string specified all on one line. I am unsure of the logistics of how this would be implemented, however.

Okay, I have it.

The string recognizer has to live in the parser, as some of you have suggested. The trick is to use lexer modes, so that the pieces of each string literal reach the parser as separate tokens.

So in the Lexer file I have this:

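// Default mode: an opening quote switches the lexer into StringMode.
BEGIN_STRING : '"' -> pushMode(StringMode);

mode StringMode;
// A closing quote returns to the previous mode.
END_STRING : '"' -> popMode;
// Any single character other than a line break, '%', or '"'.
STRING_LITERAL_TEXT : ~[\r\n%"];
// Escape sequences: each action replaces the token text with the character it denotes.
STRING_LITERAL_ESCAPE_QUOTE : '%"' { setText("\""); };
STRING_LITERAL_ESCAPE_PERCENT : '%%' { setText("%"); };
STRING_LITERAL_ESCAPE_NEWLINE : '%n' { setText("\n"); };
// Consumes no characters: fires when a line break or EOF comes before the closing quote.
UNTERMINATED_STRING : { _input.LA(1) == '\n' || _input.LA(1) == '\r' || _input.LA(1) == EOF }? -> popMode;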
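To make the effect of the three escape rules concrete, here is a plain-Java sketch of the same decoding applied to the body of a string literal (the text between the quotes). It does not use ANTLR; `StringEscapes.decode` is a hypothetical helper. Copying an unrecognized '%' through unchanged is an assumption made for this sketch, since the StringMode rules above have no token that matches a lone '%'.

```java
final class StringEscapes {
    /** Decodes %" as a quote, %% as a percent sign, and %n as a newline. */
    static String decode(String body) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '%' && i + 1 < body.length()) {
                char next = body.charAt(i + 1);
                if (next == '"') { out.append('"'); i++; continue; }
                if (next == '%') { out.append('%'); i++; continue; }
                if (next == 'n') { out.append('\n'); i++; continue; }
            }
            // Ordinary character (or an unrecognized '%', kept as-is by assumption).
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("100%% done, she said %\"ok%\"%n"));
    }
}
```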

And in the parser file I have this:

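// Concatenates consecutive string literals, but only while each one
// starts on a later line than the one before it.
string returns [String text] locals [int line]
    : a=stringLiteral { $line = $a.line; $text = $a.text; }
      ( // Predicate: the next token must start on a later line.
        { _input.LT(1) != null && _input.LT(1).getLine() > $line }?
        a=stringLiteral { $line = $a.line; $text += $a.text; }
      )*
    ;

// One quoted literal; returns its decoded text and the line of its opening quote.
stringLiteral returns [int line, String text]
    : BEGIN_STRING { $text = ""; }
      ( a=( STRING_LITERAL_TEXT
          | STRING_LITERAL_ESCAPE_NEWLINE
          | STRING_LITERAL_ESCAPE_QUOTE
          | STRING_LITERAL_ESCAPE_PERCENT
          ) { $text += $a.text; }
      )*
      stringEnd { $line = $BEGIN_STRING.line; }
    ;

// Labeled alternatives distinguish a closed literal from an unterminated one.
stringEnd
    : END_STRING          # string_finish
    | UNTERMINATED_STRING # string_hang
    ;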

The string rule thus concatenates adjacent string literals as long as they start on different lines. The stringEnd rule still needs an event handler (for example, a listener on the string_hang alternative) so that the parser can report a syntax error when a string literal is not terminated correctly. The string is otherwise treated as if it had been closed correctly.

Answer 1:

EDIT: Sorry, I had not read your requirements fully. The following approach would concatenate the strings in both examples, not only in the desired one. I have to think about it...

The simplest way would be to do this in the parser. I see no reason why it would have to be done in the lexer.

multiString : singleString +;
singleString : ONELINE_STRING; 


ONELINE_STRING: ...; // no fragment!
WS : ... -> skip;
Comment : ... -> skip;


Answer 2:

As already mentioned, the (IMO) better way would be to handle this inside the parser. But here's a way to handle it in the lexer:

STRING
 : SINGLE_STRING ( LINE_CONTINUATION SINGLE_STRING )*
 ;

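// Named IGNORED rather than HIDDEN: newer ANTLR 4 releases reject a token
// whose name collides with the reserved constant HIDDEN.
IGNORED
 : ( SPACE | LINE_BREAK | COMMENT ) -> channel(HIDDEN)
 ;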

// Line breaks are excluded so that a single string cannot span lines.
fragment SINGLE_STRING
 : '"' ~["\r\n]* '"'
 ;

fragment LINE_CONTINUATION
 : ( SPACE | COMMENT )* LINE_BREAK ( SPACE | COMMENT )*
 ;

fragment SPACE
 : [ \t]
 ;

fragment LINE_BREAK
 : [\r\n]
 | '\r\n'
 ;

// '*' rather than '+', so that an empty comment ("//" alone) also matches.
fragment COMMENT
 : '//' ~[\r\n]*
 ;

Tokenizing the input:

"a" "b"

"c"
"d"

"e"

"f"

would create the following 5 STRING tokens (the whitespace tokens on the hidden channel are not shown):

  • "a"
  • "b"
  • "c"\n"d"
  • "e"
  • "f"

However, if the token includes a comment:

"c" // comment 
"d"

then you would need to strip the "// comment" part from the token text yourself at a later stage. The lexer cannot skip a substring inside a token or put it on a different channel.
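One possible shape for that later stage is sketched below in plain Java, without ANTLR. `MergedStringToken.contents` is a hypothetical helper that takes the raw text of a merged STRING token and returns only the characters inside the quotes, dropping whitespace and `//` comments between the fragments. It assumes, as in the grammar above, that strings have no escape mechanism for an embedded quote.

```java
final class MergedStringToken {
    /** Returns the concatenated contents of all quoted fragments in the token text. */
    static String contents(String tokenText) {
        StringBuilder out = new StringBuilder();
        boolean inString = false;
        int i = 0;
        while (i < tokenText.length()) {
            char c = tokenText.charAt(i);
            if (inString) {
                // Inside quotes: copy everything up to the closing quote.
                if (c == '"') {
                    inString = false;
                } else {
                    out.append(c);
                }
                i++;
            } else if (c == '"') {
                inString = true;
                i++;
            } else if (tokenText.startsWith("//", i)) {
                // Comment between fragments: skip to the end of the line.
                while (i < tokenText.length()
                        && tokenText.charAt(i) != '\n'
                        && tokenText.charAt(i) != '\r') {
                    i++;
                }
            } else {
                i++; // whitespace or line break between fragments
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(contents("\"c\" // comment \n\"d\""));
    }
}
```

Because the scan tracks whether it is inside quotes, a `//` that appears within a string fragment is kept as ordinary text.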



Tags: antlr antlr4