Using Antlr for parsing data from a never-ending stream

Posted 2019-03-28 03:07

Question:

Is Antlr suitable for parsing data from streams that don't have an EOF right after the text to parse? From what I observe, the lexer does not emit the current token until the first character of the next token is received. On top of that, the parser seems not to finish a rule until the first token of the next rule is received. Here is a simple grammar I tried:

fox: 'quick' 'brown' 'fox' '\r'? '\n' ;

Then I used the generated parser with UnbufferedCharStream and UnbufferedTokenStream:

  CharStream input = new UnbufferedCharStream(is);
  MyLexer lex = new MyLexer(input);
  lex.setTokenFactory(new CommonTokenFactory(true));
  TokenStream tokens = new UnbufferedTokenStream(lex);
  MyParser parser = new MyParser(tokens);
  MyParser.FoxContext fox = parser.fox();

when the stream gets 'quick' - nothing happens.

when 'b' comes in - entering rule 'fox'

then 'roun' - nothing (2 tokens are in the stream - neither of them is known to the lexer yet!)

only after 'f' the listener visits the first token: 'quick'

then - nothing on 'ox'

on new line (unix): visit token 'brown'

Now the stream has all data (4 tokens), but only 2 tokens are recognized.

I found that in order to push those tokens through the system, the stream has to emit 2 more tokens, that is, any tokens known to the grammar. They could be 2 extra newlines, or, say, 'fox' and 'brown'. Only then are the tokens 'fox' and '\n' visited, the parser exits rule 'fox', and parsing finishes.

Is that a bug or a feature? Is there a way to eliminate that lag?

Thanks!

Answer 1:

The ANTLR 4 book was originally going to contain an example of parsing a streaming input, but I argued against it due to the severe complications that inevitably arise from using an adaptive, unlimited-lookahead parser for something like this.

ANTLR 4 has no guaranteed lookahead bound (and no way to tell it to look for or even attempt to enforce one), so any implementation that operates on a blocking stream has the possibility of deadlock without returning information about the parse leading up to that point. I wouldn't even entertain the possibility of parsing a streaming input unless I saw an intermediate buffer in place first.

  1. Take all available (or previously unparsed) input and place it in a String or char[].
  2. Create an ANTLRInputStream for the buffer.
  3. Attempt to lex/parse this stream, which will have an implicit EOF on the end.

The result of the parse will tell you whether to discard the results to that point, or hold on to them to retry when more data is available:

  • If no syntax error occurs, the input was successfully parsed, and you can parse the next section of input when it becomes available later.

  • If a syntax error is reported before the EOF token is consumed, then a syntax error appears in the actual input so you'll want to handle it (report it to the user, etc...).

  • If a syntax error is reported at the point where the EOF token is consumed then additional input may resolve the problem - ignore the results of the current parse, and then retry once more data is available from the input stream.
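The steps and outcomes above can be sketched as a retry loop. This is a minimal, self-contained illustration, not ANTLR code: the class name `RetryParseDemo` and the toy `tryParse` recognizer (which only accepts the phrase from the question's grammar) are stand-ins for a real call into a generated ANTLR parser with an error listener attached.

```java
import java.util.Arrays;
import java.util.List;

public class RetryParseDemo {
    enum Result { OK, SYNTAX_ERROR, NEED_MORE_INPUT }

    // Toy stand-in for an ANTLR parse of the 'fox' rule: succeeds on the
    // complete phrase, reports NEED_MORE_INPUT when the buffer is a strict
    // prefix of it (i.e. the only error is at the implicit EOF), and
    // SYNTAX_ERROR otherwise.
    static Result tryParse(String buffer) {
        String expected = "quick brown fox\n";
        if (buffer.equals(expected)) return Result.OK;
        if (expected.startsWith(buffer)) return Result.NEED_MORE_INPUT;
        return Result.SYNTAX_ERROR;
    }

    public static void main(String[] args) {
        // Chunks as they might arrive from a non-terminating stream.
        List<String> chunks = Arrays.asList("quick br", "own fo", "x\n");
        StringBuilder unparsed = new StringBuilder();
        for (String chunk : chunks) {
            unparsed.append(chunk);                   // 1. accumulate unparsed input
            Result r = tryParse(unparsed.toString()); // 2-3. parse buffer with implicit EOF
            if (r == Result.OK) {
                System.out.println("parsed: " + unparsed.toString().trim());
                unparsed.setLength(0);                // success: discard, await next section
            } else if (r == Result.SYNTAX_ERROR) {
                System.out.println("syntax error in: " + unparsed);
                break;                                // real error: report it
            }
            // NEED_MORE_INPUT: keep the buffer and retry on the next chunk
        }
    }
}
```

With a real ANTLR parser, distinguishing the last two cases means checking whether the reported syntax error occurred at the EOF token or earlier.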



Answer 2:

I think you're using the unbuffered streams correctly and what you see is the expected, desired result of using those streams. But I think you may have expectations of them that they aren't obligated to meet.

Below is test code for us to poke with sticks. I'm using System.in for the input, so I modified the grammar to account for the newline characters between the word tokens.

Streaming.g

grammar Streaming;

fox   : 'quick' NL 'brown' NL 'fox' NL DONE NL;
DONE  : 'done';
NL    : '\r'? '\n';

StreamingTest.java

import org.antlr.v4.runtime.CommonToken;
import org.antlr.v4.runtime.CommonTokenFactory;
import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.UnbufferedCharStream;
import org.antlr.v4.runtime.UnbufferedTokenStream;
import org.antlr.v4.runtime.tree.TerminalNode;

public class StreamingTest {
    public static void main(String[] args) throws Exception {
        lex();
        parse();
    }

    private static void lex() {
        System.out.println("-> Reading from lexer:");
        UnbufferedCharStream input = new UnbufferedCharStream(System.in);
        StreamingLexer lexer = new StreamingLexer(input);
        lexer.setTokenFactory(new CommonTokenFactory(true));

        Token t;

        //read each token until hitting input "done"
        while ((t = lexer.nextToken()).getType() != StreamingLexer.DONE){
            if (t.getText().trim().length() == 0){
                System.out.println("-> " + StreamingLexer.tokenNames[t.getType()]);
            } else { 
                System.out.println("-> " + t.getText());
            }
        }
    }

    private static void parse() {
        System.out.println("-> Reading from parser:");
        UnbufferedCharStream input = new UnbufferedCharStream(System.in);
        StreamingLexer lexer = new StreamingLexer(input);
        lexer.setTokenFactory(new CommonTokenFactory(true));

        StreamingParser parser = new StreamingParser(new UnbufferedTokenStream<CommonToken>(lexer));
        parser.addParseListener(new StreamingBaseListener(){
            @Override
            public void visitTerminal(TerminalNode t) {
                if (t.getText().trim().length() == 0){
                    System.out.println("-> " + StreamingLexer.tokenNames[t.getSymbol().getType()]);
                } else { 
                    System.out.println("-> " + t.getText());
                }
            }
        });

        parser.fox();
    }
}

Below is a mix of the input and output as they're provided to/received from the lexer and parser in the program above. Each line of output is prefixed with ->. I'll explain why things are the way they are after that.

Input & Output

-> Reading from lexer:
quick
-> quick
brown
-> NL
-> brown
fox
-> NL
-> fox
done
-> NL
-> Reading from parser:
quick
brown
-> quick
-> NL
fox
-> brown
-> NL
done
-> fox
-> NL

-> done

-> NL

The first thing I notice is that the lexer immediately received quick NL for input, but only provided a token for quick. The reason for this discrepancy is that the UnbufferedCharStream reads ahead one more character (even though it has a perfectly good NL token ready for me!) because it won't sit on an empty look-ahead character buffer. Alas, the unbuffered stream is buffered. According to the Javadoc comment in the class itself:

"Unbuffered" here refers to fact that it doesn't buffer all data, not that's it's on demand loading of char.

This extra read translates into waiting on the stream for more input, which explains why the lexer is one token behind for the rest of the input.

Now on to the parser. Why does it lag two tokens behind where the lexer lags one? Simple: because UnbufferedTokenStream won't sit on an empty look-ahead buffer either. But it can't create that next token until a) it has a spare token from the lexer and b) the lexer's UnbufferedCharStream reads its own look-ahead character. In effect, it's the lexer's one-character "lag" plus a one-token "lag."
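That one-item lag can be demonstrated without ANTLR at all. The toy classes below are hypothetical, not part of the ANTLR runtime; they just model the policy of never sitting on an empty look-ahead buffer, and count how many items the source has to supply before the first item can be consumed.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class LookaheadLagDemo {
    // Iterator that counts how many items have been pulled from the "stream".
    static class CountingSource implements Iterator<String> {
        private final Iterator<String> inner;
        int pulls = 0;
        CountingSource(List<String> items) { this.inner = items.iterator(); }
        public boolean hasNext() { return inner.hasNext(); }
        public String next() { pulls++; return inner.next(); }
    }

    // Toy model of a stream that, like UnbufferedCharStream, never sits on an
    // empty look-ahead buffer: it is primed with one item up front, and every
    // consume() pulls the NEXT item before returning the current one.
    static class OneAheadStream {
        private final Iterator<String> source;
        private String lookahead;
        OneAheadStream(Iterator<String> source) {
            this.source = source;
            this.lookahead = source.hasNext() ? source.next() : null; // priming read
        }
        String consume() {
            String current = lookahead;
            // On a blocking stream, this read is what makes the caller wait
            // for item N+1 before it can act on item N.
            lookahead = source.hasNext() ? source.next() : null;
            return current;
        }
    }

    public static void main(String[] args) {
        CountingSource src = new CountingSource(Arrays.asList("quick", "brown", "fox"));
        OneAheadStream stream = new OneAheadStream(src);
        System.out.println(stream.consume() + " (pulled so far: " + src.pulls + ")");
        // -> quick (pulled so far: 2) : getting "quick" forced a read of "brown"
    }
}
```

Stack two of these (a token stream on top of a character stream) and the priming reads compound, which is exactly the two-deep lag seen in the parser run above.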

It appears that getting "lag-free," data-on-demand streams in ANTLR v4 means writing your own. But it seems to me that the existing streams work as expected.


Is Antlr suitable for parsing data from streams that don't have EOF right after the text to parse?

I can't answer that with confidence for ANTLR 4 yet. It seems easy enough to write a token stream that doesn't buffer ahead until it's needed (override UnbufferedTokenStream's consume to skip calling sync), but the character stream gets called by classes that do their own reading ahead regardless of anyone's buffering. Or so it seems. I'll keep digging into this as best I can, but it may require learning the official way to do this.



Answer 3:

Apparently the root of the issue is not in the Unbuffered*Streams. It's in the interpreters, such as the LexerATNSimulator.execATN() method. That method drives the lexer as a state machine, moving from one token to the next only once the first character of the next token has been consumed. A similar algorithm is used in ParserATNSimulator, which deals with the tokens recognized by the lexer. That is what causes the double lag.

So now I'm pretty confident that ANTLR 4, as currently implemented, cannot be used for parsing continuous, interactive data. This is unlike Flex/Bison, where the lexer returns a token as soon as the last character that can possibly match it has been consumed. As a result, the parse() function returns right when the portion of data matching the grammar arrives. That makes it possible to read exactly the amount of data determined by the data structure, when the size is not defined in some other way.