How to detect a partial (unfinished) token and join it

Posted 2019-09-05 18:16

Question:

I am writing a toy terminal in which I use Flex to parse the normal text and control sequences that I get from the tty. One detail of the Cocoa machinery is that it reads from the tty in chunks of 1024 bytes, so any token described in my .lex file can end up broken into two parts: some bytes of the token are the last bytes of one 1024-byte chunk and the remaining bytes are the very first bytes of the next chunk.

So I need to somehow:

  1. Detect the situation in which a token is split between two 1024-byte chunks.
  2. Remember the first part of the token.
  3. When the second 1024-byte chunk arrives, restore that first part by putting it in front of the second chunk.

I am completely new to Flex, so I am looking for the right way to accomplish this.


I have created a deliberately simple lexer to support this discussion.

My question about this demo is:

How can I detect that the last "FO" (an unfinished "FOO") is actually an unfinished token, that is, that it does not violate my grammar but just needs its "O" from the next chunk of input?

Answer 1:

You should let flex do the reading. It is designed to work that way; it will do all the buffering necessary, including the case where a token is split between two (or more) input buffers.

If you cannot simply read from stdin using the standard fread function, then you can change the way the flex-generated scanner gets its input by redefining the macro YY_INPUT. See the "Generated Scanner" chapter of the flex manual for a description of this macro.
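A minimal sketch of such a redefinition, assuming the tty is available as a plain POSIX file descriptor. The names `tty_fd` and `read_tty_chunk` are hypothetical and not part of flex; only `YY_INPUT` and its three arguments come from flex itself. In a real .lex file these lines would sit inside `%{ ... %}` in the definitions section.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical: descriptor of the tty, set by the application before scanning. */
int tty_fd = -1;

/* Hypothetical helper: reads up to max_size bytes from tty_fd.
 * Returns the number of bytes read, or 0 on end of input or error.
 * A read interrupted by a signal (EINTR) is retried. */
size_t read_tty_chunk(char *buf, size_t max_size)
{
    ssize_t n;

    do {
        n = read(tty_fd, buf, max_size);
    } while (n < 0 && errno == EINTR);

    return n > 0 ? (size_t)n : 0;
}

/* flex calls YY_INPUT whenever its buffer runs dry. A result of 0
 * (flex's YY_NULL) means end of input. */
#define YY_INPUT(buf, result, max_size) \
    do { (result) = read_tty_chunk((buf), (size_t)(max_size)); } while (0)
```

With this in place the scanner pulls bytes itself, so a token that straddles two reads is completed by the next call to `YY_INPUT` and no manual joining is needed. Note that `read` blocks until data is available, which may or may not suit an event-driven application.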



Answer 2:

I have accepted @rici's answer as the correct one because it gave me an important hint: redefine the macro YY_INPUT.

In this answer I just want to share some details for newbies like me.

I used "How to make YY_INPUT point to a string rather than stdin in Lex & Yacc (Solaris)" as an example of a custom YY_INPUT, and this made my artificial example work correctly with partial tokens.

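To make Flex work correctly with partial tokens, the bytes handed to it must not include a string's terminating '\0': the reported count has to cover only real input, so that from the scanner's point of view consecutive chunks form one continuous ("endless") stream. Here is the function that my redefined YY_INPUT calls: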

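#include <string.h>

// Alternates between two chunks; the middle FOO is split across them.
int readInputForLexer(char *buffer, int *numBytesRead, int maxBytesToRead) {
    static int Flip = 0;

    (void)maxBytesToRead; // unused: this demo assumes room for at least 6 bytes

    if ((Flip++ % 2) == 0) {
        memcpy(buffer, "FOO F", 5);  // copy only the 5 input bytes
        *numBytesRead = 5; // IMPORTANT: this is 5, not 6, to cut off \0
    } else {
        memcpy(buffer, "OO FOO", 6); // copy only the 6 input bytes
        *numBytesRead = 6; // IMPORTANT: this is 6, not 7, to cut off \0
    }

    return 0;
}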

In this example Flex glues the partial token F-OO into a complete one: FOO.
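The effect of the byte count can be checked without running Flex at all. The sketch below only models the byte stream a scanner would accumulate from two chunks; `feed_chunk` and `build_stream` are hypothetical names made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical chunk source: copies chunk i into dst and returns the byte
 * count to report. With include_nul set it also counts the terminating
 * '\0' as input, which is the mistake being illustrated. */
static size_t feed_chunk(int i, char *dst, int include_nul)
{
    static const char *chunks[] = { "FOO F", "OO FOO" };
    size_t len = strlen(chunks[i]);

    memcpy(dst, chunks[i], len + 1);   /* the copy always includes '\0' */
    return include_nul ? len + 1 : len;
}

/* Appends both chunks to stream, each one starting where the previous
 * reported count ended. Returns the total number of bytes reported.
 * stream must have room for at least 13 bytes. */
size_t build_stream(char *stream, int include_nul)
{
    size_t total = 0;

    for (int i = 0; i < 2; i++)
        total += feed_chunk(i, stream + total, include_nul);

    return total;
}
```

With `include_nul` set to 0 the stream is the 11 bytes `FOO FOO FOO`, so the middle token is whole. With it set to 1 the stream is 13 bytes long and a `'\0'` sits between `F` and `OO`, so no rule for `FOO` can match across the chunk boundary.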

As @rici pointed out in his comment, the correct way to stop scanning is to set *numBytesRead = 0.
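A sketch of a finite input source that ends this way, assuming the same three-argument shape as the function above. The name `readChunkOrStop`, the sample input, and the chunk size of 5 are all made up for illustration, and the function assumes `maxBytesToRead` is positive.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 5  /* hypothetical: small chunks so that tokens get split */

/* Hands out the sample input in chunks of at most CHUNK_SIZE bytes.
 * Once everything has been delivered, every further call reports
 * 0 bytes, which the scanner treats as end of input. */
int readChunkOrStop(char *buffer, int *numBytesRead, int maxBytesToRead)
{
    static const char input[] = "FOO FOO FOO";
    static size_t pos = 0;

    size_t n = (sizeof input - 1) - pos;   /* bytes left, excluding '\0' */
    if (n > CHUNK_SIZE)
        n = CHUNK_SIZE;
    if (n > (size_t)maxBytesToRead)
        n = (size_t)maxBytesToRead;

    memcpy(buffer, input + pos, n);
    pos += n;
    *numBytesRead = (int)n;                /* 0 once the input is exhausted */
    return 0;
}
```

Successive calls deliver `FOO F`, `OO FO`, `O`, and then 0 bytes. After a 0-byte result the generated scanner calls `yywrap()` to decide whether scanning is really finished.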

See also another answer by @rici to a similar SO question: "Flex, continuous scanning stream (from socket). Did I miss something using yywrap()?".

See my example for further details.