Regular expression to detect semi-colon terminated-第2页回答

In my Python application, I need to write a regular expression that matches a C++ for or while loop that has been terminated with a semi-colon (;). For example, it should match this:

for (int i = 0; i < 10; i++);

... but not this:

for (int i = 0; i < 10; i++)

This looks trivial at first glance, until you realise that the text between the opening and closing parenthesis may contain other parenthesis, for example:

for (int i = funcA(); i < funcB(); i++);

I'm using the python.re module. Right now my regular expression looks like this (I've left my comments in so you can understand it easier):

# match any line that begins with a "for" or "while" statement:
^\s*(for|while)\s*
\(  # match the initial opening parenthesis
    # Now make a named group 'balanced' which matches a balanced substring.
    (?P<balanced>
        # A balanced substring is either something that is not a parenthesis:
        [^()]
        | # …or a parenthesised string:
        \( # A parenthesised string begins with an opening parenthesis
            (?P=balanced)* # …followed by a sequence of balanced substrings
        \) # …and ends with a closing parenthesis
    )*  # Look for a sequence of balanced substrings
\)  # Finally, the outer closing parenthesis.
# must end with a semi-colon to match:
\s*;\s*

This works perfectly for all the above cases, but it breaks as soon as you try and make the third part of the for loop contain a function, like so:

for (int i = 0; i < 10; doSomethingTo(i));

I think it breaks because as soon as you put some text between the opening and closing parenthesis, the "balanced" group matches that contained text, and thus the (?P=balanced) part doesn't work any more since it won't match (due to the fact that the text inside the parenthesis is different).

In my Python code I'm using the VERBOSE and MULTILINE flags, and creating the regular expression like so:

REGEX_STR = r"""# match any line that begins with a "for" or "while" statement:
^\s*(for|while)\s*
\(  # match the initial opening parenthesis
    # Now make a named group 'balanced' which matches
    # a balanced substring.
    (?P<balanced>
        # A balanced substring is either something that is not a parenthesis:
        [^()]
        | # …or a parenthesised string:
        \( # A parenthesised string begins with an opening parenthesis
            (?P=balanced)* # …followed by a sequence of balanced substrings
        \) # …and ends with a closing parenthesis
    )*  # Look for a sequence of balanced substrings
\)  # Finally, the outer closing parenthesis.
# must end with a semi-colon to match:
\s*;\s*"""

REGEX_OBJ = re.compile(REGEX_STR, re.MULTILINE| re.VERBOSE)

Can anyone suggest an improvement to this regular expression? It's getting too complicated for me to get my head around.

标签： c++ python regex parsing recursion

9条回答

春风洒进眼中

2楼-- · 2019-01-01 07:58

This is a great example of using the wrong tool for the job. Regular expressions do not handle arbitrarily nested sub-matches very well. What you should do instead is use a real lexer and parser (a grammar for C++ should be easy to find) and look for unexpectedly empty loop bodies.

0人赞添加讨论(0) 举报

牵手、夕阳

3楼-- · 2019-01-01 08:02

Try this regexp

^\s*(for|while)\s*
\(
(?P<balanced>
[^()]*
|
(?P=balanced)
\)
\s*;\s

I removed the wrapping \( \) around (?P=balanced) and moved the * to behind the any not paren sequence. I have had this work with boost xpressive, and rechecked that website (Xpressive) to refresh my memory.

0人赞添加讨论(0) 举报

十年一品温如言

4楼-- · 2019-01-01 08:04

As Frank suggested, this is best without regex. Here's (an ugly) one-liner:

match_string = orig_string[orig_string.index("("):len(orig_string)-orig_string[::-1].index(")")]

Matching the troll line est mentioned in his comment:

orig_string = "for (int i = 0; i < 10; doSomethingTo(\"(\"));"
match_string = orig_string[orig_string.index("("):len(orig_string)-orig_string[::-1].index(")")]

returns (int i = 0; i < 10; doSomethingTo("("))

This works by running through the string forward until it reaches the first open paren, and then backward until it reaches the first closing paren. It then uses these two indices to slice the string.

0人赞添加讨论(0) 举报

上一页 1 2

Regular expression to detect semi-colon terminated

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间