I need to identify (potentially nested) capture groups within regular expressions and create a tree. The particular target is Java-1.6 and I'd ideally like Java code. A simple example is:
"(a(b|c)d(e(f*g))h)"
which would be parsed to
"a(b|c)d(e(f*g))h"
... "b|c"
... "e(f*g)"
... "f*g"
The solution should ideally account for count expressions, quantifiers, etc and levels of escaping. However if this is not easy to find a simpler approach might suffice as we can limit the syntax used.
EDIT. To clarify. I want to parse the regular expression string itself. To do so I need to know the BNF or equivalent for Java 1.6 regexes. I am hoping someone has already done this.
A byproduct of a result would be that the process would test for validity of the regex.
Consider stepping up to an actual parser/lexer:
http://www.antlr.org/wiki/display/ANTLR3/FAQ+-+Getting+Started
It looks complicated, but if your language is fairly simple, it's fairly straightforward. And if it's not, doing it in regexes will probably make your life hell :)
I came up with a partial solution using an XML tool (XOM, http://www.xom.nu) to hold the tree. First the code, then an example parse. First the escaped characters (\ , ( and ) ) are de-escaped (here I use BS, LB and RB), then remaining brackets are translated to XML tags, then the XML is parsed and the characters re-escaped. What is needed further is a BNF for Java 1.6 regexes doe quantifiers such as ?:, {d,d} and so on.
public static Element parseRegex(String regex) throws Exception {
regex = regex.replaceAll("\\\\", "BS");
regex.replaceAll("BS\\(", "LB");
regex.replaceAll("BS\\)", "RB");
regex = regex.replaceAll("\\(", "<bracket>");
regex.replaceAll("\\)", "</bracket>");
Element regexX = new Builder().build(new StringReader(
"<regex>"+regex+"</regex>")).getRootElement();
extractCaptureGroupContent(regexX);
return regexX;
}
private static String extractCaptureGroupContent(Element regexX) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < regexX.getChildCount(); i++) {
Node childNode = regexX.getChild(i);
if (childNode instanceof Text) {
Text t = (Text)childNode;
String s = t.getValue();
s = s.replaceAll("BS", "\\\\").replaceAll("LB",
"\\(").replaceAll("RB", "\\)");
t.setValue(s);
sb.append(s);
} else {
sb.append("("+extractCaptureGroupContent((Element)childNode)+")");
}
}
String capture = sb.toString();
regexX.addAttribute(new Attribute("capture", capture));
return capture;
}
example:
@Test
public void testParseRegex2() throws Exception {
String regex = "(.*(\\(b\\))c(d(e)))";
Element regexElement = ParserUtil.parseRegex(regex);
CMLUtil.debug(regexElement, "x");
}
gives:
<regex capture="(.*((b))c(d(e)))">
<bracket capture=".*((b))c(d(e))">.*
<bracket capture="(b)">(b)</bracket>c
<bracket capture="d(e)">d
<bracket capture="e">e</bracket>
</bracket>
</bracket>
</regex>