-->

Code to parse capture groups in regular expressions

Posted 2019-05-31 01:37

Question:

I need to identify (potentially nested) capture groups within regular expressions and build a tree from them. The target is Java 1.6, and ideally I'd like Java code. A simple example is:

"(a(b|c)d(e(f*g))h)"

which would be parsed to

"a(b|c)d(e(f*g))h"
... "b|c"
... "e(f*g)"
     ... "f*g"

The solution should ideally account for count expressions, quantifiers, etc., and for levels of escaping. However, if such a solution is not easy to find, a simpler approach might suffice, as we can limit the syntax used.

EDIT: To clarify, I want to parse the regular expression string itself. To do so I need to know the BNF or equivalent for Java 1.6 regexes. I am hoping someone has already done this.

A useful byproduct is that the process would also test the validity of the regex.

Answer 1:

Consider stepping up to an actual parser/lexer: http://www.antlr.org/wiki/display/ANTLR3/FAQ+-+Getting+Started

It looks complicated, but if your language is fairly simple, it's fairly straightforward. And if it's not, doing it in regexes will probably make your life hell :)
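
If a full ANTLR grammar is more than you need, a hand-written scanner can be a lighter middle ground. The sketch below is not ANTLR and not a complete Java regex grammar: it only tracks backslash escapes, \Q...\E quoting, character classes and parentheses. It assumes Java 1.6 syntax, where every group that opens with (? is non-capturing, and it does not validate quantifiers. The names GroupScanner, Group, scan and render are made up for this illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class GroupScanner {

    /** A node of the tree: the body of one capture group plus its nested groups. */
    static final class Group {
        final int start;          // index just after the opening '('
        final boolean capturing;  // false for (?...) constructs
        String body = "";
        final List<Group> children = new ArrayList<Group>();

        Group(int start, boolean capturing) {
            this.start = start;
            this.capturing = capturing;
        }
    }

    /** Returns a root node (the whole regex) whose children are the top-level capture groups. */
    static Group scan(String regex) {
        Deque<Group> open = new ArrayDeque<Group>();
        Group root = new Group(0, true);
        root.body = regex;
        open.push(root);
        int classDepth = 0; // > 0 while inside [...]; Java allows nested classes
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') {
                if (i + 1 < regex.length() && regex.charAt(i + 1) == 'Q') {
                    // \Q...\E quotes everything up to \E (or to the end of the regex)
                    int end = regex.indexOf("\\E", i + 2);
                    i = (end < 0) ? regex.length() : end + 1;
                } else {
                    i++; // skip the escaped character
                }
            } else if (c == '[') {
                classDepth++;
            } else if (c == ']' && classDepth > 0) {
                classDepth--;
            } else if (classDepth > 0) {
                // brackets inside a character class are literals: ignore them
            } else if (c == '(') {
                boolean capturing = i + 1 >= regex.length() || regex.charAt(i + 1) != '?';
                open.push(new Group(i + 1, capturing));
            } else if (c == ')') {
                if (open.size() == 1) {
                    throw new IllegalArgumentException("Unmatched ')' at index " + i);
                }
                Group closed = open.pop();
                Group parent = open.peek();
                if (closed.capturing) {
                    closed.body = regex.substring(closed.start, i);
                    parent.children.add(closed);
                } else {
                    // a non-capturing group passes its capture groups up to its parent
                    parent.children.addAll(closed.children);
                }
            }
        }
        if (open.size() != 1 || classDepth != 0) {
            throw new IllegalArgumentException("Unclosed group or character class");
        }
        return root;
    }

    /** Appends one quoted line per capture group, indenting nested groups. */
    static void render(Group g, String indent, StringBuilder out) {
        for (Group child : g.children) {
            out.append(indent).append('"').append(child.body).append('"').append('\n');
            render(child, indent + "    ", out);
        }
    }
}
```

Calling render(scan("(a(b|c)d(e(f*g))h)"), "", out) on the question's example lists the four group bodies, each nested group indented under its parent. An explicit stack is used instead of recursion so that unbalanced brackets can be reported with an index.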



Answer 2:

I came up with a partial solution that uses an XML tool (XOM, http://www.xom.nu) to hold the tree. First the code, then an example parse. The escaped characters (\, ( and )) are first replaced by placeholders (here BS, LB and RB), then the remaining brackets are translated to XML tags, then the XML is parsed and the placeholders are translated back. What is still needed is a BNF for Java 1.6 regexes that covers constructs such as ?: and {d,d} quantifiers.

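import java.io.StringReader;
import nu.xom.Attribute;
import nu.xom.Builder;
import nu.xom.Element;
import nu.xom.Node;
import nu.xom.Text;

public static Element parseRegex(String regex) throws Exception {
    // Replace backslashes and escaped brackets with placeholders so that only
    // unescaped brackets remain (breaks if the regex contains a literal "BS", "LB" or "RB").
    regex = regex.replaceAll("\\\\", "BS");
    regex = regex.replaceAll("BS\\(", "LB");
    regex = regex.replaceAll("BS\\)", "RB");
    // Turn the remaining brackets into XML tags (the regex must not contain '<' or '&').
    regex = regex.replaceAll("\\(", "<bracket>");
    regex = regex.replaceAll("\\)", "</bracket>");
    Element regexX = new Builder().build(new StringReader(
         "<regex>"+regex+"</regex>")).getRootElement();
    extractCaptureGroupContent(regexX);
    return regexX;
}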

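// Rebuilds the text of each group bottom-up and stores it in a "capture" attribute.
private static String extractCaptureGroupContent(Element regexX) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < regexX.getChildCount(); i++) {
        Node childNode = regexX.getChild(i);
        if (childNode instanceof Text) {
            // Translate the placeholders back. In a replacement string "\\(" yields
            // a plain "(", so escaped brackets come back without their backslash.
            Text t = (Text)childNode;
            String s = t.getValue();
            s = s.replaceAll("BS", "\\\\").replaceAll("LB",
                        "\\(").replaceAll("RB", "\\)");
            t.setValue(s);
            sb.append(s);
        } else {
            // A nested <bracket> element: recurse and restore its parentheses.
            sb.append("("+extractCaptureGroupContent((Element)childNode)+")");
        }
    }
    String capture = sb.toString();
    regexX.addAttribute(new Attribute("capture", capture));
    return capture;
}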

example:

@Test
public void testParseRegex2() throws Exception {
    String regex = "(.*(\\(b\\))c(d(e)))";
    Element regexElement = ParserUtil.parseRegex(regex);
    CMLUtil.debug(regexElement, "x");
}

gives:

<regex capture="(.*((b))c(d(e)))">
  <bracket capture=".*((b))c(d(e))">.*
    <bracket capture="(b)">(b)</bracket>c
    <bracket capture="d(e)">d
      <bracket capture="e">e</bracket>
    </bracket>
  </bracket>
</regex>
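
The question mentions a validity test as a byproduct. One way to cross-check a tree like the one above is to let java.util.regex compile the expression and report how many capture groups it sees; that number should equal the number of bracket elements (four in this example). This is only a sketch, and the class name RegexCheck and the method captureGroupCount are hypothetical helpers, not part of XOM or of the code above.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class RegexCheck {

    /** Returns the number of capture groups, or -1 if the regex does not compile. */
    static int captureGroupCount(String regex) {
        try {
            // groupCount() excludes group 0 (the whole match)
            return Pattern.compile(regex).matcher("").groupCount();
        } catch (PatternSyntaxException e) {
            return -1;
        }
    }
}
```

A mismatch between this count and the number of bracket elements would point to a construct the placeholder approach does not handle, such as a non-capturing (?:...) group.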