
Python AST from ANTLR Parse Tree?

Posted 2020-07-17 05:46

Question:

I found an ANTLRv4 Python3 grammar, but it generates a parse tree, which generally has many useless nodes.

I'm looking for a known package to get a Python AST from that parse tree.

Does something like this exist?

EDIT: Clarification regarding use of the Python ast package: my project is in Java and I need to parse Python files.

EDIT 2: By 'AST' I mean http://docs.python.org/2/library/ast.html#abstract-grammar, while by 'parse tree' I mean http://docs.python.org/2/reference/grammar.html.

Answer 1:

The following could be a start:

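import java.util.ArrayList;
import java.util.List;

import org.antlr.v4.runtime.Token;
import org.antlr.v4.runtime.tree.ParseTree;

public class AST {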

    private final Object payload;

    private final List<AST> children;

    public AST(ParseTree tree) {
        this(null, tree);
    }

    private AST(AST ast, ParseTree tree) {
        this(ast, tree, new ArrayList<AST>());
    }

    private AST(AST parent, ParseTree tree, List<AST> children) {

        this.payload = getPayload(tree);
        this.children = children;

        if (parent == null) {
            walk(tree, this);
        }
        else {
            parent.children.add(this);
        }
    }

    public Object getPayload() {
        return payload;
    }

    public List<AST> getChildren() {
        return new ArrayList<>(children);
    }

    private Object getPayload(ParseTree tree) {
        if (tree.getChildCount() == 0) {
            return tree.getPayload();
        }
        else {
            String ruleName = tree.getClass().getSimpleName().replace("Context", "");
            return Character.toLowerCase(ruleName.charAt(0)) + ruleName.substring(1);
        }
    }

    private static void walk(ParseTree tree, AST ast) {

        if (tree.getChildCount() == 0) {
            new AST(ast, tree);
        }
        else if (tree.getChildCount() == 1) {
            walk(tree.getChild(0), ast);
        }
        else if (tree.getChildCount() > 1) {

            for (int i = 0; i < tree.getChildCount(); i++) {

                AST temp = new AST(ast, tree.getChild(i));

                if (!(temp.payload instanceof Token)) {
                    walk(tree.getChild(i), temp);
                }
            }
        }
    }

    @Override
    public String toString() {

        StringBuilder builder = new StringBuilder();

        AST ast = this;
        List<AST> firstStack = new ArrayList<>();
        firstStack.add(ast);

        List<List<AST>> childListStack = new ArrayList<>();
        childListStack.add(firstStack);

        while (!childListStack.isEmpty()) {

            List<AST> childStack = childListStack.get(childListStack.size() - 1);

            if (childStack.isEmpty()) {
                childListStack.remove(childListStack.size() - 1);
            }
            else {
                ast = childStack.remove(0);
                String caption;

                if (ast.payload instanceof Token) {
                    Token token = (Token) ast.payload;
                    caption = String.format("TOKEN[type: %s, text: %s]",
                            token.getType(), token.getText().replace("\n", "\\n"));
                }
                else {
                    caption = String.valueOf(ast.payload);
                }

                String indent = "";

                for (int i = 0; i < childListStack.size() - 1; i++) {
                    indent += (childListStack.get(i).size() > 0) ? "|  " : "   ";
                }

                builder.append(indent)
                        .append(childStack.isEmpty() ? "'- " : "|- ")
                        .append(caption)
                        .append("\n");

                if (ast.children.size() > 0) {
                    List<AST> children = new ArrayList<>();
                    for (int i = 0; i < ast.children.size(); i++) {
                        children.add(ast.children.get(i));
                    }
                    childListStack.add(children);
                }
            }
        }

        return builder.toString();
    }
}

and can be used to create an AST for the input "f(arg1='1')\n" as follows:

public static void main(String[] args) {

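    // Note: ANTLRInputStream is deprecated as of ANTLR 4.7; CharStreams.fromString(...) replaces it there.
    Python3Lexer lexer = new Python3Lexer(new ANTLRInputStream("f(arg1='1')\n"));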
    Python3Parser parser = new Python3Parser(new CommonTokenStream(lexer));

    ParseTree tree = parser.file_input();
    AST ast = new AST(tree);

    System.out.println(ast);
}

which would print:

'- file_input
   |- stmt
   |  |- small_stmt
   |  |  |- atom
   |  |  |  '- TOKEN[type: 35, text: f]
   |  |  '- trailer
   |  |     |- TOKEN[type: 47, text: (]
   |  |     |- arglist
   |  |     |  |- test
   |  |     |  |  '- TOKEN[type: 35, text: arg1]
   |  |     |  |- TOKEN[type: 53, text: =]
   |  |     |  '- test
   |  |     |     '- TOKEN[type: 36, text: '1']
   |  |     '- TOKEN[type: 48, text: )]
   |  '- TOKEN[type: 34, text: \n]
   '- TOKEN[type: -1, text: ]

I realize this still contains nodes you might not want, but you could even add a set of token types you'd like to exclude. Feel free to hack away!

Here is a Gist containing a version of the code above with the proper import statements and some JavaDocs and inline comments.



Answer 2:

The Python subproject of the Eclipse DLTK project implements a custom Python AST model in Java. It is built from an ANTLR v3 AST, but it should not be too difficult to adapt it to build from an ANTLR v4 parse tree.

The Eclipse PyDev project presumably also implements a Java-based AST for Python source. Note that the layout of the source tree in both projects should be quite similar.

Naturally, you should check the licenses before using code from these sources, just to be sure.



Answer 3:

I found a workaround:

Use Jython and ast (thanks @delnan for leading me there). Or, do everything you need directly in Python code, and just spit out the results back to Java.

PythonInterpreter interpreter = new PythonInterpreter();
interpreter.exec("import ast");
PyObject o = interpreter.eval(
    "ast.dump(ast.parse('f(arg1=\\'1\\')', 'filename', 'eval'))" + "\n");
System.out.print(o.toString());

Output is

Expression(body=Call(func=Name(id='f', ctx=Load()), args=[], keywords=[keyword(arg='arg1', value=Str(s='1'))], starargs=None, kwargs=None))

This doesn't strictly answer the question, and might not be applicable for all users, so I'm leaving this answer unselected.
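
The "do everything in Python and hand the results back to Java" variant could look roughly like the sketch below. It is an illustration rather than part of the original answer: it assumes a CPython 3 interpreter run as a separate process (not Jython), and the helper names node_to_dict and source_to_json are made up for this example. The script turns the AST into JSON, which a Java program could read from the process's standard output with any JSON library.

```python
import ast
import json


def node_to_dict(node):
    """Recursively convert an ast node into plain dicts, lists and scalars.

    Hypothetical helper for illustration; not part of the ast module.
    """
    if isinstance(node, ast.AST):
        result = {"type": type(node).__name__}
        for name, value in ast.iter_fields(node):
            result[name] = node_to_dict(value)
        return result
    if isinstance(node, list):
        return [node_to_dict(item) for item in node]
    # JSON-friendly scalars pass through unchanged.
    if node is None or isinstance(node, (str, int, float, bool)):
        return node
    # Anything else (bytes, complex, Ellipsis) is stored as its repr.
    return repr(node)


def source_to_json(source, mode="eval"):
    """Parse Python source and return its AST serialized as a JSON string."""
    tree = ast.parse(source, "<input>", mode)
    return json.dumps(node_to_dict(tree))


if __name__ == "__main__":
    # The Java side would read this line from the subprocess output.
    print(source_to_json("f(arg1='1')"))
```

The exact node names in the JSON depend on the Python version that runs the script (for example, string literals are Str nodes in older versions and Constant nodes in newer ones), so the Java consumer should be written against one fixed interpreter version.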



Answer 4:

ANTLR4 can generate a visitor, which you can use to traverse the parse tree and to construct an AST. Python has an ast package, so this should not be a problem (if you're using Python).

I have written a toy Python interpreter in Python 3 using ANTLR4 (as a part of my study). Visitor code is located in /tinypy/AST/builder/, so you can get an idea of how it's done.
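
To give a rough idea of the visitor approach without pulling in the ANTLR runtime, here is a minimal, self-contained sketch. Node, Leaf and AstBuilder are hypothetical stand-ins written for this example (they are not the ANTLR API and not the tinypy code), and the rule names only loosely follow the Python3 grammar. The visitor dispatches on the rule name, passes single-child rules through by default, and builds nodes from Python's ast package for the rules it cares about.

```python
import ast


class Node:
    """Hypothetical stand-in for a parse-tree rule node (not the ANTLR API)."""

    def __init__(self, rule, *children):
        self.rule = rule
        self.children = list(children)


class Leaf:
    """Hypothetical stand-in for a terminal node wrapping one token."""

    def __init__(self, kind, text):
        self.kind = kind
        self.text = text


class AstBuilder:
    """Visitor that turns the toy parse tree into ast nodes."""

    def visit(self, node):
        if isinstance(node, Leaf):
            return self.visit_leaf(node)
        # Dispatch on the rule name; unknown rules fall back to visit_children.
        method = getattr(self, "visit_" + node.rule, self.visit_children)
        return method(node)

    def visit_children(self, node):
        # Default: visit all children and drop punctuation (None results).
        results = [r for r in map(self.visit, node.children) if r is not None]
        # A rule with a single meaningful child is passed through unchanged.
        return results[0] if len(results) == 1 else results

    def visit_leaf(self, leaf):
        if leaf.kind == "NAME":
            return ast.Name(id=leaf.text, ctx=ast.Load())
        if leaf.kind == "STRING":
            return ast.Constant(value=ast.literal_eval(leaf.text))
        return None  # parentheses, '=', commas, ...

    def visit_argument(self, node):
        if len(node.children) == 3:  # NAME '=' test
            name_leaf, _, value = node.children
            return ast.keyword(arg=name_leaf.text, value=self.visit(value))
        return self.visit(node.children[0])  # positional argument

    def visit_arglist(self, node):
        return [self.visit(c) for c in node.children if isinstance(c, Node)]

    def visit_atom_expr(self, node):
        # atom_expr: atom trailer, where trailer is '(' arglist? ')'
        func = self.visit(node.children[0])
        arglists = [c for c in node.children[1].children if isinstance(c, Node)]
        items = self.visit(arglists[0]) if arglists else []
        return ast.Call(
            func=func,
            args=[i for i in items if not isinstance(i, ast.keyword)],
            keywords=[i for i in items if isinstance(i, ast.keyword)],
        )


# Hand-built parse tree for: f(arg1='1')
tree = Node(
    "atom_expr",
    Node("atom", Leaf("NAME", "f")),
    Node(
        "trailer",
        Leaf("OPEN_PAREN", "("),
        Node(
            "arglist",
            Node(
                "argument",
                Leaf("NAME", "arg1"),
                Leaf("ASSIGN", "="),
                Node("test", Node("atom", Leaf("STRING", "'1'"))),
            ),
        ),
        Leaf("CLOSE_PAREN", ")"),
    ),
)

call = AstBuilder().visit(tree)
print(ast.dump(call))
```

With a real ANTLR4 grammar, the generated visitor base class would play the role of AstBuilder, with one visit method per grammar rule instead of the name-based dispatch used here.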