I need to identify (potentially nested) capture groups within regular expressions and create a tree. The particular target is Java-1.6 and I'd ideally like Java code. A simple example is:
"(a(b|c)d(e(f*g))h)"
which would be parsed to
"a(b|c)d(e(f*g))h"
... "b|c"
... "e(f*g)"
... "f*g"
The solution should ideally account for count expressions, quantifiers, etc., and levels of escaping. However, if such a solution is not easy to find, a simpler approach might suffice, as we can limit the syntax used.
EDIT: To clarify, I want to parse the regular expression string itself. To do so I need to know the BNF or equivalent for Java 1.6 regexes. I am hoping someone has already done this.
A useful byproduct would be that the parse also tests the validity of the regex.
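For the validity check alone, the JDK's own compiler can be used without any hand-written grammar. The sketch below is only an illustration: the class and method names (RegexValidity, captureGroupCount) are made up for this example, and it reports just the number of capturing groups, not their nesting.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexValidity {
    /**
     * Returns the number of capturing groups in the regex,
     * or -1 if java.util.regex refuses to compile it.
     */
    static int captureGroupCount(String regex) {
        try {
            // groupCount() does not depend on the input being matched
            return Pattern.compile(regex).matcher("").groupCount();
        } catch (PatternSyntaxException e) {
            return -1; // hypothetical convention for "invalid regex"
        }
    }

    public static void main(String[] args) {
        System.out.println(captureGroupCount("(a(b|c)d(e(f*g))h)")); // prints 4
        System.out.println(captureGroupCount("(a(b|c)d"));           // prints -1
    }
}
```

This gives validity and a group count, but building the tree still requires scanning the regex string itself.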
Consider stepping up to an actual parser/lexer: http://www.antlr.org/wiki/display/ANTLR3/FAQ+-+Getting+Started
It looks complicated, but if your language is fairly simple, it's fairly straightforward. And if it's not, doing it in regexes will probably make your life hell :)
I came up with a partial solution using an XML tool (XOM, http://www.xom.nu) to hold the tree. First the code, then an example parse. The escaped characters (\ , ( and ) ) are first de-escaped (here I use BS, LB and RB), then the remaining brackets are translated to XML tags, then the XML is parsed and the characters re-escaped. What is still needed is a BNF for Java 1.6 regexes that covers constructs such as ?:, {d,d} and so on.
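As a rough illustration of the same idea without the XML round trip (this is not the XOM code referred to above), a single scan with a stack of open groups can build the tree directly. The names GroupTree, Node and parse are invented for this sketch. It assumes Java 1.6 syntax, where every group starting with (? is non-capturing, and it skips escapes, \Q...\E quoting and character classes; it does not handle a literal ] placed first in a class, or comments mode.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class GroupTree {
    /** One group: its body without the outer parentheses, plus nested capturing groups. */
    static class Node {
        final boolean capturing;
        final int start;                       // index of the first body character
        String body = "";
        final List<Node> children = new ArrayList<Node>();
        Node(boolean capturing, int start) { this.capturing = capturing; this.start = start; }
    }

    /** Scans the regex once; the returned root node holds the whole expression. */
    static Node parse(String regex) {
        Node root = new Node(true, 0);
        root.body = regex;
        Deque<Node> open = new ArrayDeque<Node>();
        open.push(root);
        int classDepth = 0;                    // > 0 while inside [...]
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') {
                if (regex.startsWith("\\Q", i)) {
                    int end = regex.indexOf("\\E", i + 2);
                    i = (end < 0) ? regex.length() : end + 1; // skip quoted text
                } else {
                    i++;                       // skip the escaped character
                }
            } else if (c == '[') {
                classDepth++;                  // Java allows nested classes
            } else if (c == ']' && classDepth > 0) {
                classDepth--;
            } else if (classDepth > 0) {
                // parentheses inside a character class are literals
            } else if (c == '(') {
                // assumption: in Java 1.6 every "(?" group is non-capturing
                open.push(new Node(!regex.startsWith("(?", i), i + 1));
            } else if (c == ')') {
                Node done = open.pop();
                if (open.isEmpty()) {
                    throw new IllegalArgumentException("Unmatched ')' at index " + i);
                }
                done.body = regex.substring(done.start, i);
                if (done.capturing) {
                    open.peek().children.add(done);
                } else {
                    // hoist groups found inside a non-capturing group
                    open.peek().children.addAll(done.children);
                }
            }
        }
        if (open.size() != 1) {
            throw new IllegalArgumentException("Unclosed '('");
        }
        return root;
    }

    /** Appends one line per group, indented by nesting depth. */
    static void dump(Node n, String indent, StringBuilder out) {
        out.append(indent).append(n.body).append('\n');
        for (Node child : n.children) {
            dump(child, indent + "... ", out);
        }
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        dump(parse("(a(b|c)d(e(f*g))h)"), "", out);
        System.out.print(out);
    }
}
```

For the expression from the question, the root has one child with body a(b|c)d(e(f*g))h, which in turn has the children b|c and e(f*g), the latter containing f*g. Quantifiers and count expressions are left inside the bodies untouched, so this only recovers the group structure, not a full grammar.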
example:
gives: