I'm trying to parse Presentation MathML and build an AST using ANTLR. I have most of the tags supported and I can build nodes for specific constructs.
I'm having trouble with the operators. This page:
http://www.w3.org/TR/MathML3/appendixc.html
lists the operators, the form each one appears in by default (prefix, infix or postfix) and a priority value, which gives the precedence of the operator.
I could take each operator code, add it to my lexer, and then write rules for unary, binary and postfix expressions based on the precedence, just as I would write the expressions for C or some other programming language.
The problem is that an operator tag can carry a 'form' attribute with the value 'prefix', 'infix' or 'postfix', which changes the tree structure. I can't see the attributes until the parser stage, though.
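To make the 'form' issue concrete, here is a minimal sketch of how the form of an operator could be resolved once the attributes are visible. It assumes the rule from the MathML spec that an explicit 'form' attribute wins, and otherwise the position inside the mrow decides (first child means prefix, last child means postfix, anything else infix). The node layout (dicts with "tag", "text" and "attrs" keys) and the name resolve_form are hypothetical, and the sketch does not skip space-like elements as the spec asks.

```python
def resolve_form(children, index):
    """Return 'prefix', 'infix' or 'postfix' for the <mo> at children[index].

    children is a hypothetical flat list of the nodes inside one <mrow>,
    each a dict such as {"tag": "mo", "text": "+", "attrs": {}}.
    """
    node = children[index]

    # An explicit form attribute always overrides the positional default.
    explicit = node["attrs"].get("form")
    if explicit in ("prefix", "infix", "postfix"):
        return explicit

    # Positional defaults only apply when the row has more than one child.
    if len(children) > 1:
        if index == 0:
            return "prefix"
        if index == len(children) - 1:
            return "postfix"
    return "infix"


# Illustrative row: <mo>-</mo><mi>x</mi><mo>!</mo>
row = [
    {"tag": "mo", "text": "-", "attrs": {}},
    {"tag": "mi", "text": "x", "attrs": {}},
    {"tag": "mo", "text": "!", "attrs": {}},
]
forms = [resolve_form(row, 0), resolve_form(row, 2)]
```

Because this needs both the attribute and the position among siblings, it fits naturally after parsing rather than in the lexer.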
Additionally, an operator tag can contain natural language that acts as an operator, so I can't deduce the precedence, and therefore can't build a correct tree.
Would it be possible to ignore operator precedence at the parser stage, just load each expression as a flat list of nodes, and then rewrite the tree at the semantic stage using a tree walker? At that stage I'd have the attribute values, and I hold a dictionary of known operators and their precedence/priority.
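The kind of post-parse rewrite I have in mind could look roughly like the sketch below, which uses precedence climbing over a flat list. It is only an illustration under several assumptions: the list strictly alternates operand, infix operator, operand; nodes are plain strings rather than AST nodes; the priority numbers are placeholders rather than the values from the spec; and DEFAULT_PRIORITY is an assumed fallback for operators that are not in the table, such as natural-language text. All names here are hypothetical.

```python
# Hypothetical excerpt of an operator table: (text, form) -> priority.
# The numbers are placeholders; real values would come from the spec's table.
PRIORITY = {("=", "infix"): 5, ("+", "infix"): 10, ("*", "infix"): 20}
DEFAULT_PRIORITY = 1  # assumed fallback for operators missing from the table


def priority(op):
    """Look up the infix priority of an operator, with a fallback."""
    return PRIORITY.get((op, "infix"), DEFAULT_PRIORITY)


def build_tree(items):
    """Turn a flat list such as ["a", "+", "b", "*", "c"] into nested
    (operator, left, right) tuples using precedence climbing."""
    pos = 0

    def parse(min_priority):
        nonlocal pos
        left = items[pos]  # an operand is expected at this position
        pos += 1
        # Keep consuming operators that bind at least as tightly as required.
        while pos < len(items) and priority(items[pos]) >= min_priority:
            op = items[pos]
            pos += 1
            # Requiring a strictly higher priority on the right-hand side
            # makes operators of equal priority left-associative.
            right = parse(priority(op) + 1)
            left = (op, left, right)
        return left

    return parse(0)


flat_sum = build_tree(["a", "+", "b", "+", "c"])
mixed = build_tree(["a", "+", "b", "*", "c"])
words = build_tree(["x", "in", "S"])
```

With this approach the grammar itself would never need to know about individual operators; only the table would grow.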
This is a major decision point for me, because I have to settle on an approach before I can continue.
EDIT
I have the following MathML expression...
<math>
<mrow>
<mi>a</mi>
<mo>+</mo>
<mi>b</mi>
<mo>+</mo>
<mi>c</mi>
</mrow>
</math>
I can build two different trees...
or...
The second one encodes the associativity of the '+' operator in the tree, and this is what we usually do for programming languages.
But there are hundreds of operators in the specification, so I would end up with a very large grammar and lots of alternatives in my production rules.
Natural language can also be used for operators (although it really shouldn't be)...
<math>
<mrow>
<mo>there exists</mo>
<mi>x</mi>
<mo>in</mo>
<mi>S</mi>
</mrow>
</math>
So my question is: what is the best way to encode the operators in the tree? I'm trying to convert Presentation MathML to Content MathML, so I need to analyse the semantics of the presentation to decide what it means mathematically.
Is there a way to convert the first tree into the second one in a tree grammar phase?
EDIT
I have the following MathML and the generated tree...
<math>
<mrow>
<mi>a</mi>
<mo>+</mo>
<mi>b</mi>
<mo>+</mo>
<mi>c</mi>
</mrow>
</math>
Here is a simple tree grammar I want to use to find any MO nodes that are in between other nodes, e.g. MI...
tree grammar SimpleReWriter;

options
{
    tokenVocab = MathML;
    ASTLabelType = CommonTree;
    output = AST;
    backtrack = true;
    language = CSharp3;
    filter = true; // use pattern matching
    rewrite = true;
}

topdown: findInfix; // look for infix operators

findInfix : ^(MROW left=.+ MO right=.+) -> ^(MROW ^(MO $left $right));
My program crashes inside the SimpleReWriter class with the error message: Operation is not valid due to the current state of the object.
My tree grammar works if there is only a single + between nodes, but it crashes when there is a sequence of more than one.