How can I modify the text of tokens in a CommonTok

Posted 2019-03-21 10:36

Question:

I'm trying to learn ANTLR and at the same time use it for a current project.

I've gotten to the point where I can run the lexer on a chunk of code and output it to a CommonTokenStream. This is working fine, and I've verified that the source text is being broken up into the appropriate tokens.

Now, I would like to be able to modify the text of certain tokens in this stream, and display the now modified source code.

For example I've tried:

import org.antlr.runtime.*;
import java.util.*;

public class LexerTest
{
    public static final int IDENTIFIER_TYPE = 4;

    public static void main(String[] args)
    {
        String input = "public static void main(String[] args) { int myVar = 0; }";
        CharStream cs = new ANTLRStringStream(input);

        JavaLexer lexer = new JavaLexer(cs);
        CommonTokenStream tokens = new CommonTokenStream();
        tokens.setTokenSource(lexer);

        int size = tokens.size();
        for(int i = 0; i < size; i++)
        {
            Token token = (Token) tokens.get(i);
            if(token.getType() == IDENTIFIER_TYPE)
            {
                token.setText("V");
            }
        }
        System.out.println(tokens.toString());
    }  
}

I'm trying to set the text of every identifier token to the string literal "V".

  1. Why are my changes to the token's text not reflected when I call tokens.toString()?

  2. How am I supposed to know the various token type IDs? I stepped through with my debugger and saw that the ID for the IDENTIFIER tokens was "4" (hence my constant at the top). But how would I have known that otherwise? Is there some other way of mapping token type IDs to token names?


EDIT:

One thing that is important to me: I want the tokens to keep their original start and end character positions. That is, I don't want the positions to reflect the new text after the variable names are changed to "V". This way I still know where each token was in the original source text.

Answer 1:

ANTLR has a way to do this in its grammar file.

Let's say you're parsing input consisting of numbers and strings delimited by commas. A grammar would look like this:

grammar Foo;

parse
  :  value ( ',' value )* EOF
  ;

value
  :  Number
  |  String
  ;

String
  :  '"' ( ~( '"' | '\\' ) | '\\\\' | '\\"' )* '"'
  ;

Number
  :  '0'..'9'+
  ;

Space
  :  ( ' ' | '\t' ) {skip();}
  ;

This should all look familiar to you. Let's say you want to wrap square brackets around all integer values. Here's how to do that:

grammar Foo;

options {output=template; rewrite=true;} 

parse
  :  value ( ',' value )* EOF
  ;

value
  :  n=Number -> template(num={$n.text}) "[<num>]" 
  |  String
  ;

String
  :  '"' ( ~( '"' | '\\' ) | '\\\\' | '\\"' )* '"'
  ;

Number
  :  '0'..'9'+
  ;

Space
  :  ( ' ' | '\t' ) {skip();}
  ;

As you can see, I've added some options at the top and a rewrite rule (everything after the ->) after the Number alternative in the value parser rule.

Now to test it all, compile and run this class:

import org.antlr.runtime.*;

public class FooTest {
  public static void main(String[] args) throws Exception {
    String text = "12, \"34\", 56, \"a\\\"b\", 78";
    System.out.println("parsing: "+text);
    ANTLRStringStream in = new ANTLRStringStream(text);
    FooLexer lexer = new FooLexer(in);
    CommonTokenStream tokens = new TokenRewriteStream(lexer); // Note: a TokenRewriteStream!
    FooParser parser = new FooParser(tokens);
    parser.parse();
    System.out.println("tokens: "+tokens.toString());
  }
}

which produces:

parsing: 12, "34", 56, "a\"b", 78
tokens: [12],"34",[56],"a\"b",[78]


Answer 2:

In ANTLR 4 there is a new facility using parse tree listeners and TokenStreamRewriter (note the name difference) that can be used to observe or transform trees. (The replies suggesting TokenRewriteStream apply to ANTLR 3 and will not work with ANTLR 4.)

In ANTLR 4 an XXXBaseListener class is generated for you with callbacks for entering and exiting each non-terminal node in the grammar (e.g. enterClassDeclaration() ).

You can use the Listener in two ways:

1) As an observer - By simply overriding the methods to produce arbitrary output related to the input text - e.g. override enterClassDeclaration() and output a line for each class declared in your program.

2) As a transformer using TokenStreamRewriter to modify the original text as it passes through. To do this you use the rewriter to make modifications (adding, deleting, or replacing tokens) in the callback methods, and you use the rewriter at the end to output the modified text.

See the following files from the ANTLR 4 book for an example of how to do transformations:

https://github.com/mquinn/ANTLR4/blob/master/book_code/tour/InsertSerialIDListener.java

and

https://github.com/mquinn/ANTLR4/blob/master/book_code/tour/InsertSerialID.java



Answer 3:

The other given example of changing the text in the lexer works well if you want to globally replace the text in all situations. Often, however, you only want to replace a token's text in certain situations.

Using the TokenRewriteStream gives you the flexibility of changing the text only in certain contexts.

This can be done using a subclass of the token stream class you were using. Instead of using the CommonTokenStream class you can use the TokenRewriteStream.

So you'd have the TokenRewriteStream consume the lexer and then you'd run your parser.

In your grammar typically you'd do the replacement like this:

/** Convert "int foo() {...}" into "float foo();" */
function
:
{
    RefTokenWithIndex t(LT(1));  // copy the location of the token you want to replace
    engine.replace(t, "float");
}
type id:ID LPAREN (formalParameter (COMMA formalParameter)*)? RPAREN
    block[true]
;

Here we've replaced the token int that we matched with the text float. The location information is preserved but the text it "matches" has been changed.

To check your token stream afterwards, you would use the same code as before.
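
To make the "location is preserved, text is changed" idea concrete, here is a minimal sketch in plain Java. It is not ANTLR's implementation and uses no ANTLR classes: Tok, RewriteBuffer, and the toy tokenize() method are hypothetical names made up for illustration. The only assumption it models is that a rewrite stream records replacements next to the token list and applies them when the text is rendered, so the tokens themselves (and their original offsets) are never touched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RewriteSketch {
    /** Hypothetical token type: remembers its text and its offsets in the original input. */
    static final class Tok {
        final String text;
        final int start; // inclusive offset of the first character
        final int stop;  // inclusive offset of the last character

        Tok(String text, int start, int stop) {
            this.text = text;
            this.start = start;
            this.stop = stop;
        }
    }

    /** Hypothetical stand-in for a rewrite stream: replacements live beside the tokens. */
    static final class RewriteBuffer {
        private final List<Tok> tokens;
        private final Map<Integer, String> replacements = new TreeMap<>();

        RewriteBuffer(List<Tok> tokens) {
            this.tokens = tokens;
        }

        /** Record a replacement for the token at the given index; the token is not modified. */
        void replace(int index, String newText) {
            replacements.put(index, newText);
        }

        /** Build the output text, using the replacement where one was recorded. */
        String render() {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < tokens.size(); i++) {
                sb.append(replacements.getOrDefault(i, tokens.get(i).text));
            }
            return sb.toString();
        }
    }

    /** Toy tokenizer: runs of letters/digits form one token, every other character is its own token. */
    static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int j = i + 1;
            if (Character.isLetterOrDigit(input.charAt(i))) {
                while (j < input.length() && Character.isLetterOrDigit(input.charAt(j))) {
                    j++;
                }
            }
            out.add(new Tok(input.substring(i, j), i, j - 1));
            i = j;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> tokens = tokenize("int myVar = 0;");
        RewriteBuffer buffer = new RewriteBuffer(tokens);
        buffer.replace(2, "V"); // index 2 is the token "myVar"

        System.out.println(buffer.render()); // prints: int V = 0;

        Tok t = tokens.get(2);
        System.out.println(t.text + " @ " + t.start + ".." + t.stop); // prints: myVar @ 4..8
    }
}
```

The point of keeping replacements in a side table is exactly what the question's EDIT asks for: the rendered text shows "V", while the token still reports the offsets it had in the original source.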