I'm trying to flesh out a wikitext-to-HTML translator in ANTLR 3, but I keep getting stuck.
Do you know of a working example that I can inspect? I tried the MediaWiki ANTLR grammar and the Wiki Creole grammar, but I can't get them to generate the lexer & parser in ANTLR 3.
Here are the links to two grammars I've tried using:
- http://www.mediawiki.org/wiki/Markup_spec/ANTLR
- http://www.wikicreole.org/wiki/EBNFGrammarForCreole1.0
I can't get either of these two to generate my Java lexer and parser (I'm using ANTLR 3 as an Eclipse plugin). The MediaWiki grammar takes a very long time to build and then at some point throws an OutOfMemory exception. The other one has errors in it which I don't know how to debug.
EDIT: Okay I've got a very basic grammar:
grammar wikitext;
options {
//output = AST;
//ASTLabelType = CommonTree;
output = template;
language = Java;
}
document: line (NL line?)*;
line: horizontal_line | list | heading | paragraph;
/* horizontal line */
horizontal_line: HRLINE;
/* lists */
list: unordered_list | ordered_list;
unordered_list: '*'+ content;
ordered_list: '#'+ content;
/* Headings */
heading: heading1 | heading2 | heading3 | heading4 | heading5 | heading6;
heading1: H1 plain H1;
heading2: H2 plain H2;
heading3: H3 plain H3;
heading4: H4 plain H4;
heading5: H5 plain H5;
heading6: H6 plain H6;
/* Paragraph */
paragraph: content;
content: (formatted | link)+;
/* links */
link: external_link | internal_link;
external_link: '[' external_link_uri ('|' external_link_title)? ']';
internal_link: '[[' internal_link_ref ('|' internal_link_title)? ']]' ;
external_link_uri: CHARACTER+;
external_link_title: plain;
internal_link_ref: plain;
internal_link_title: plain;
/* bold & italic */
formatted: bold_italic | bold | italic | plain;
bold_italic: BOLD_ITALIC plain BOLD_ITALIC;
bold: BOLD plain BOLD;
italic: ITALIC plain ITALIC;
/* Plain text */
plain: (CHARACTER | SPACE)+;
/**
* LEXER RULES
* --------------------------------------------------------------------------
*/
HRLINE: '---' '-'+;
H1: '=';
H2: '==';
H3: '===';
H4: '====';
H5: '=====';
H6: '======';
BOLD_ITALIC: '\'\'\'\'\'';
BOLD: '\'\'\'';
ITALIC: '\'\'';
NL: '\r'?'\n';
CHARACTER : '!' | '"' | '#' | '$' | '%' | '&'
| '*' | '+' | ',' | '-' | '.' | '/'
| ':' | ';' | '?' | '@' | '\\' | '^' | '_' | '`' | '~'
| '0'..'9' | 'A'..'Z' |'a'..'z'
| '\u0080'..'\u7fff'
| '(' | ')'
| '\'' | '<' | '>' | '=' | '[' | ']' | '|'
;
SPACE: ' ' | '\t';
It's not clear to me, though, how one would go about outputting HTML. I've been looking into StringTemplate, but I don't understand how to structure my templates, specifically which template goes where in the grammar. Can you help me with a short example?
Okay, after your EDIT, I have a couple of recommendations.
Like I said in the comments, writing a grammar for such a language is nearly impossible. At least, trying to do so in one go, that is. The only way I see this working would be to do this with multiple parsers where the first "parsing-stage" would parse the wiki-source very "coarsely". For example: a table would be tokenized as:
TABLE : '{|' .* '|}'
and then you'd create another parser that parses this table properly. Doing it in one parser will result in quite a few ambiguities in your parser rules IMO.
About emitting HTML code, the "proper" way to do this is indeed with StringTemplate, but given the fact that you're rather new to ANTLR itself, I'd keep things simple. You could create a StringBuilder attribute in your parser class that would collect all your HTML code as you parse your source file. You can embed code in ANTLR rules by wrapping it with { and }.
Here's a quick demo:
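A minimal sketch of the two ideas above in plain Java, rather than as an ANTLR grammar: a coarse first stage that cuts out whole tables, and a StringBuilder that collects HTML while the rest is scanned. The class name WikiHtmlSketch, the regex-based scanning, and the tiny subset of markup it handles are all assumptions made for illustration; in a generated parser the same append calls would sit inside the embedded actions of the rules.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical hand-written stand-in for a generated parser class (not ANTLR output).
class WikiHtmlSketch {
    // Collects all HTML while the source is scanned.
    private final StringBuilder html = new StringBuilder();

    // Stage 1 (coarse): a whole "{| ... |}" table is treated as one chunk.
    private static final Pattern TABLE =
            Pattern.compile("\\{\\|.*?\\|\\}", Pattern.DOTALL);
    // "== Title ==": one to six '=' on both sides of the text.
    private static final Pattern HEADING = Pattern.compile("(={1,6})(.+?)\\1");
    // Four or more dashes, like the HRLINE rule in the question.
    private static final Pattern HR = Pattern.compile("-{4,}");

    String translate(String source) {
        html.setLength(0);
        Matcher table = TABLE.matcher(source);
        int pos = 0;
        while (table.find()) {
            emitLines(source.substring(pos, table.start()));
            emitTable(table.group());
            pos = table.end();
        }
        emitLines(source.substring(pos));
        return html.toString();
    }

    // Placeholder: a second, table-specific parser would handle rawTable here.
    private void emitTable(String rawTable) {
        html.append("<table><!-- left to a table-specific parser --></table>\n");
    }

    // Stage 2: line-level constructs outside tables.
    private void emitLines(String text) {
        for (String line : text.split("\r?\n")) {
            Matcher heading = HEADING.matcher(line);
            if (line.isEmpty()) {
                continue;
            } else if (HR.matcher(line).matches()) {
                html.append("<hr/>\n");
            } else if (heading.matches()) {
                int level = heading.group(1).length();
                html.append("<h").append(level).append('>')
                    .append(inline(heading.group(2).trim()))
                    .append("</h").append(level).append(">\n");
            } else {
                html.append("<p>").append(inline(line)).append("</p>\n");
            }
        }
    }

    // Longest quote run first, so ''''' is not mistaken for ''' or ''.
    // No HTML escaping is done in this sketch.
    private static String inline(String text) {
        return text.replaceAll("'''''(.+?)'''''", "<b><i>$1</i></b>")
                   .replaceAll("'''(.+?)'''", "<b>$1</b>")
                   .replaceAll("''(.+?)''", "<i>$1</i>");
    }

    public static void main(String[] args) {
        String source = "== Title ==\n----\nSome '''bold''' and ''italic'' text\n{|\n| cell\n|}";
        System.out.print(new WikiHtmlSketch().translate(source));
    }
}
```

Running main prints an h2 heading, an hr, one paragraph with the bold and italic parts tagged, and a placeholder table element. Keeping the table as an opaque chunk in the first pass is what keeps the line-level rules free of table syntax.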
From that grammar, you generate a parser and lexer:
and then create a little class to test your parser:
and then compile all your source files:
and finally, run your main class
which will print the following to the console:
But, again, if you are free to choose a different language to parse, I'd do that and forget about parsing this horrible Wiki-thing.
Anyway, whatever you do: best of luck!