Parsing MathML Operators using ANTLR

Posted 2019-08-09 07:46

I'm trying to parse Presentation MathML and build an AST using ANTLR. I have most of the tags supported and I can build nodes for specific constructs.

I'm having trouble with the operators. On this page:

http://www.w3.org/TR/MathML3/appendixc.html

There is a list of the operators, the form each one takes by default (prefix, infix or postfix) and a priority value, which gives the precedence of the operator.

I could take each operator code, add it to my lexer, and then write rules for unary, binary and postfix expressions based on the precedence, just as I would write the expressions for C or some other programming language.

The problem is that the operator tags can contain a 'form' attribute which can take the value 'prefix', 'infix' and 'postfix', which changes the tree structure. I can't see the attributes until the parser stage though.
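For reference, here is a minimal sketch (in Python, independent of ANTLR) of how the effective form could be resolved once the attributes are visible. It follows the MathML rule that an explicit `form` attribute wins and that the default otherwise depends on the operator's position in its `mrow`. The `(tag, text, attributes)` tuple layout and the name `resolve_form` are hypothetical, and the sketch ignores the spec's handling of space-like siblings and of embellished operators.

```python
from typing import Sequence, Tuple

# Hypothetical node shape for this sketch: (tag, text, attributes).
Node = Tuple[str, str, dict]

VALID_FORMS = ("prefix", "infix", "postfix")


def resolve_form(children: Sequence[Node], index: int) -> str:
    """Return the effective form of the <mo> at children[index]."""
    tag, _text, attrs = children[index]
    if tag != "mo":
        raise ValueError("children[index] is not an <mo> node")

    # An explicit form attribute overrides the positional default.
    explicit = attrs.get("form")
    if explicit in VALID_FORMS:
        return explicit

    # Positional default: first child of a multi-child mrow is prefix,
    # last child is postfix, anything else is infix.
    if len(children) > 1:
        if index == 0:
            return "prefix"
        if index == len(children) - 1:
            return "postfix"
    return "infix"
```

With the form known, the lookup key for the operator dictionary becomes the pair (operator text, form) rather than the text alone.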

Additionally, an operator tag can contain natural language that acts as an operator, in which case I can't deduce the precedence and so can't build a correct tree.

Would it be possible to ignore the operator precedence at the parser stage, just load the expressions in as a list of nodes and then re-write the tree at the semantic stage, using a tree walker? I'd have the attribute values at this stage and I hold a dictionary of known operators and their precedence/priority.
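To make the idea concrete, here is a minimal sketch (in Python, not ANTLR) of the rewrite step: precedence climbing over a flat list of nodes, driven by a priority dictionary. It assumes the list strictly alternates operand, operator, operand, treats every operator as infix and left-associative, and gives unknown operators (such as natural-language ones) the lowest priority. The names `PRIORITY`, `DEFAULT_PRIORITY` and `build_tree` are hypothetical, and the priority values are placeholders to be replaced by the ones from the operator dictionary.

```python
# Placeholder priorities for this sketch; real values should come from
# the MathML operator dictionary.
PRIORITY = {"=": 260, "+": 275, "-": 275, "*": 390}

# Assumption: operators missing from the dictionary bind loosest.
DEFAULT_PRIORITY = 0


def priority(op_text):
    return PRIORITY.get(op_text, DEFAULT_PRIORITY)


def build_tree(items, pos=0, min_priority=0):
    """Precedence climbing over a flat operand/operator list.

    items alternates operand, ("mo", text), operand, ...
    Returns (tree, next_position). Operator applications are encoded
    as ("apply", op_text, left, right).
    """
    left = items[pos]
    pos += 1
    while pos < len(items):
        op_text = items[pos][1]
        p = priority(op_text)
        if p < min_priority:
            break  # this operator belongs to an enclosing call
        # Requiring a strictly higher priority on the right-hand side
        # makes operators of equal priority left-associative.
        right, pos = build_tree(items, pos + 1, p + 1)
        left = ("apply", op_text, left, right)
    return left, pos


if __name__ == "__main__":
    flat = [("mi", "a"), ("mo", "+"), ("mi", "b"), ("mo", "*"), ("mi", "c")]
    tree, _ = build_tree(flat)
    print(tree)
```

Because the grammar only has to produce the flat list, the hundreds of operators live in the dictionary rather than in the production rules.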

This is a major decision point for my project, because I have to work out what is feasible before I continue.

EDIT

I have the following MathML expression...

<math>
<mrow>
<mi>a</mi>
<mo>+</mo>
<mi>b</mi>
<mo>+</mo>
<mi>c</mi>
</mrow>
</math>

I can build two different trees...

[image: diagram of the first tree]

or...

[image: diagram of the second tree]

The second one encodes the associativity of the '+' operator in the tree, and this is what we usually do for programming languages.

But there are hundreds of operators in the specification and so I would have a very large grammar, and lots of alternatives in my production rules.

Natural language can also be used for operators (although it really shouldn't be)...

<math>
<mrow>
<mo>there exists</mo>
<mi>x</mi>
<mo>in</mo>
<mi>S</mi>
</mrow>
</math>

So what I'm asking is: what is the best way to encode the operators in the tree? I'm trying to convert Presentation MathML to Content MathML, so I need to analyse the semantics of the presentation markup to decide what it means mathematically.

Is there a way to convert the first tree to the second one in a Tree Grammar phase?
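Setting the tree grammar syntax aside, the conversion itself can be pictured as a bottom-up walk that folds each flat row from the left. The sketch below is in Python rather than ANTLR; the `Node` class and `nest_left` function are hypothetical stand-ins for the AST, and the sketch deliberately ignores precedence and prefix/postfix operators, leaving untouched any row that does not strictly alternate operand, MO, operand.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str                      # e.g. "MATH", "MROW", "MI", "MO"
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def nest_left(node: Node) -> Node:
    """Return a copy of the tree with flat MROWs folded into left-nested MO nodes."""
    kids = [nest_left(child) for child in node.children]

    is_flat_infix_row = (
        node.kind == "MROW"
        and len(kids) >= 3
        and len(kids) % 2 == 1
        and all(k.kind == "MO" for k in kids[1::2])
    )
    if not is_flat_infix_row:
        return Node(node.kind, node.text, kids)

    # Left fold: ((a + b) + c) + ...
    acc = kids[0]
    for op, rhs in zip(kids[1::2], kids[2::2]):
        acc = Node("MO", op.text, [acc, rhs])
    return Node("MROW", node.text, [acc])
```

For the `a + b + c` row this yields an MROW whose single child is an MO node with the inner `a + b` MO node on the left and `c` on the right.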

EDIT

I have the following MathML and the generated tree...

<math>
<mrow>
<mi>a</mi>
<mo>+</mo>
<mi>b</mi>
<mo>+</mo>
<mi>c</mi>
</mrow>
</math>

[image: diagram of the generated tree]

Here is a simple tree grammar I want to use to find any MO nodes that are in-between other nodes, e.g. MI...

tree grammar SimpleReWriter;

options 
{
  tokenVocab = MathML;
  ASTLabelType = CommonTree;
  output = AST;
  backtrack = true;
  language = CSharp3;
  filter = true; // use pattern matching
  rewrite = true;
}

topdown:   findInfix; // look for infix operators

findInfix : ^(MROW left=.+ MO right=.+) -> ^(MROW ^(MO $left $right));

My program crashes inside the SimpleReWriter class with the error message: Operation is not valid due to the current state of the object.

My tree grammar works when there is only a single + between nodes, but when there is a sequence of more than one, it crashes.
