Regular expression to detect semi-colon terminated

Posted 2019-01-01 07:32

荒废的爱情

Question:

In my Python application, I need to write a regular expression that matches a C++ for or while loop that has been terminated with a semi-colon (;). For example, it should match this:

for (int i = 0; i < 10; i++);

... but not this:

for (int i = 0; i < 10; i++)

This looks trivial at first glance, until you realise that the text between the opening and closing parentheses may contain other parentheses, for example:

for (int i = funcA(); i < funcB(); i++);

I'm using Python's re module. Right now my regular expression looks like this (I've left my comments in so you can understand it more easily):

# match any line that begins with a "for" or "while" statement:
^\s*(for|while)\s*
\(  # match the initial opening parenthesis
    # Now make a named group 'balanced' which matches a balanced substring.
    (?P<balanced>
        # A balanced substring is either something that is not a parenthesis:
        [^()]
        | # …or a parenthesised string:
        \( # A parenthesised string begins with an opening parenthesis
            (?P=balanced)* # …followed by a sequence of balanced substrings
        \) # …and ends with a closing parenthesis
    )*  # Look for a sequence of balanced substrings
\)  # Finally, the outer closing parenthesis.
# must end with a semi-colon to match:
\s*;\s*

This works perfectly for all the above cases, but it breaks as soon as you try and make the third part of the for loop contain a function, like so:

for (int i = 0; i < 10; doSomethingTo(i));

I think it breaks because as soon as you put some text between the opening and closing parentheses, the "balanced" group matches that contained text, and thus the (?P=balanced) part doesn't work any more since it won't match (due to the fact that the text inside the parentheses is different).
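The behaviour suspected here can be checked in isolation. In Python's re module, (?P=name) is a backreference: it matches the exact text the named group captured and does not re-run the group's pattern. A minimal sketch, with a made-up pattern and helper name chosen only for illustration:

```python
import re

# (?P=word) must repeat the exact text captured by (?P<word>...),
# not merely something that the sub-pattern \w+ could match.
PAIR = re.compile(r"(?P<word>\w+)-(?P=word)$")


def is_repeated_pair(text):
    """Return True if text has the form 'X-X' for some word X."""
    return PAIR.match(text) is not None


print(is_repeated_pair("abc-abc"))  # same text on both sides
print(is_repeated_pair("abc-xyz"))  # both sides fit \w+, but the text differs
```

So a backreference cannot stand for "another balanced substring". That would need a recursive pattern, which the standard re module does not offer.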

In my Python code I'm using the VERBOSE and MULTILINE flags, and creating the regular expression like so:

import re

REGEX_STR = r"""# match any line that begins with a "for" or "while" statement:
^\s*(for|while)\s*
\(  # match the initial opening parenthesis
    # Now make a named group 'balanced' which matches
    # a balanced substring.
    (?P<balanced>
        # A balanced substring is either something that is not a parenthesis:
        [^()]
        | # …or a parenthesised string:
        \( # A parenthesised string begins with an opening parenthesis
            (?P=balanced)* # …followed by a sequence of balanced substrings
        \) # …and ends with a closing parenthesis
    )*  # Look for a sequence of balanced substrings
\)  # Finally, the outer closing parenthesis.
# must end with a semi-colon to match:
\s*;\s*"""

REGEX_OBJ = re.compile(REGEX_STR, re.MULTILINE | re.VERBOSE)

Can anyone suggest an improvement to this regular expression? It's getting too complicated for me to get my head around.

Answer 1:

You could write a little, very simple routine that does it, without using a regular expression:

Set a position counter pos so that it points to just before the opening bracket after your for or while.
Set an open-brackets counter openBr to 0.
Now keep incrementing pos, reading the character at each position; increment openBr when you see an opening bracket and decrement it when you see a closing bracket. That increments it once at the beginning, for the first opening bracket in "for (", increments and decrements it some more for any brackets in between, and sets it back to 0 when your for bracket closes.
So, stop when openBr is 0 again.

The stopping position is the closing bracket of for(...). Now you can check whether a semicolon follows.
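The steps above can be sketched in Python roughly as follows. The function name and the small regex used to find the loop keyword are assumptions made for this example; only the counting logic comes from the description.

```python
import re

# Hypothetical helper: locates "for (" or "while (" at the start of a statement.
LOOP_HEAD = re.compile(r"\s*(?:for|while)\s*\(")


def ends_with_semicolon(statement):
    """Return True if statement is a for/while loop whose closing
    parenthesis is followed by a semi-colon."""
    head = LOOP_HEAD.match(statement)
    if head is None:
        return False
    pos = head.end() - 1          # index of the opening bracket
    open_br = 0
    while pos < len(statement):
        ch = statement[pos]
        if ch == "(":
            open_br += 1
        elif ch == ")":
            open_br -= 1
        if open_br == 0:
            break                 # the bracket of for(...) has closed
        pos += 1
    else:
        return False              # ran off the end: brackets never balanced
    # Only whitespace may separate the closing bracket from the semi-colon.
    return statement[pos + 1:].lstrip().startswith(";")


print(ends_with_semicolon("for (int i = 0; i < 10; doSomethingTo(i));"))
print(ends_with_semicolon("for (int i = 0; i < 10; i++)"))
```

This sketch counts every bracket it sees, including any inside string literals or comments, so an argument such as "(" in a string would still throw the counter off.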

Answer 2:

This is the kind of thing you really shouldn't do with a regular expression. Just parse the string one character at a time, keeping track of opening/closing parentheses.

If this is all you're looking for, you definitely don't need a full-blown C++ grammar lexer/parser. If you want practice, you can write a little recursive-descent parser, but even that's a bit much for just matching parentheses.

Answer 3:

This is a great example of using the wrong tool for the job. Regular expressions do not handle arbitrarily nested sub-matches very well. What you should do instead is use a real lexer and parser (a grammar for C++ should be easy to find) and look for unexpectedly empty loop bodies.

Answer 4:

I wouldn't even pay attention to the contents of the parens.

Just match any line that starts with for and ends with a semi-colon:

^\t*for.+;$

Unless you've got for statements split over multiple lines, that will work fine?
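One way to see what this pattern does and does not catch is to run it over a few lines. The sample lines and the helper name below are made up for illustration; MULTILINE is assumed so that ^ and $ apply to each line.

```python
import re

# The pattern from this answer, applied line by line via MULTILINE.
SIMPLE = re.compile(r"^\t*for.+;$", re.MULTILINE)


def simple_matches(source):
    """Return every line that the simple pattern flags."""
    return [m.group(0) for m in SIMPLE.finditer(source)]


code = (
    "for (int i = 0; i < 10; i++);\n"
    "for (int i = 0; i < 10; i++)\n"
    "\tfor (int i = 0; i < 10; doSomethingTo(i));\n"
    "for (;;) x = 1;\n"
)
for line in simple_matches(code):
    print(repr(line))
```

The first and third lines are flagged and the second is not, as intended. The last line is flagged too, although its loop has a body on the same line, and the pattern as written covers neither while loops nor lines indented with spaces.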

Answer 5:

Try this regexp:

^\s*(for|while)\s*
\(
(?P<balanced>
[^()]*
|
(?P=balanced)
\)
\s*;\s

I removed the wrapping \( \) around (?P=balanced) and moved the * to behind the any-not-paren sequence. I have had this work with Boost Xpressive, and rechecked that website (Xpressive) to refresh my memory.

Answer 6:

Greg is absolutely correct. This kind of parsing cannot be done with regular expressions. I suppose it is possible to build some horrendous monstrosity that would work for many cases, but then you'll just run across a case where it doesn't.

You really need to use more traditional parsing techniques. For example, it's pretty simple to write a recursive descent parser to do what you need.
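As a rough illustration of the recursive descent idea, the sketch below follows the grammar spelled out in the question's comments: a balanced run is a sequence of items, and an item is either a non-parenthesis character or a parenthesised balanced run. All names are hypothetical, and it assumes one statement per string.

```python
def parse_balanced(text, pos):
    """Consume a run of balanced items starting at pos.

    item := any character except '(' and ')'  |  '(' balanced ')'
    Returns the index of the first character that is not part of the run.
    """
    while pos < len(text):
        ch = text[pos]
        if ch == "(":
            pos = parse_balanced(text, pos + 1)   # recurse into the group
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parenthesis")
            pos += 1                              # step over the ')'
        elif ch == ")":
            break                                 # the caller owns this ')'
        else:
            pos += 1
    return pos


def loop_is_terminated(statement):
    """True if a for/while statement is closed by ')' and then ';'."""
    stripped = statement.lstrip()
    for keyword in ("for", "while"):
        if stripped.startswith(keyword):
            rest = stripped[len(keyword):].lstrip()
            break
    else:
        return False                              # not a loop statement
    if not rest.startswith("("):
        return False
    try:
        end = parse_balanced(rest, 1)             # parse the loop header
    except ValueError:
        return False
    if end >= len(rest) or rest[end] != ")":
        return False
    return rest[end + 1:].lstrip().startswith(";")


print(loop_is_terminated("for (int i = 0; i < 10; doSomethingTo(i));"))
print(loop_is_terminated("for (int i = 0; i < 10; i++)"))
```

The recursion does the job that the backreference in the question was meant to do: each nested group is parsed by the same rule, whatever text it contains.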

Answer 7:

I don't know that regex would handle something like that very well. Try something like this:

line = line.Trim();
if (line.StartsWith("for") && line.EndsWith(";")) {
    // your code here
}
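Since the question is about Python, the same check might look like this with the built-in str methods. The function name is made up for the example.

```python
def looks_terminated(line):
    """Trim the line, then test how it starts and how it ends."""
    line = line.strip()
    return line.startswith("for") and line.endswith(";")


print(looks_terminated("  for (int i = 0; i < 10; i++);  "))
print(looks_terminated("for (int i = 0; i < 10; i++)"))
```

Note that any statement beginning with the letters for, such as format(x);, passes this test as well, and while loops are not covered.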

Answer 8:

Another thought that ignores parentheses and treats the for as a construct holding three semicolon-delimited values:

for\s*\([^;]+;[^;]+;[^;]+\)\s*;

This option works even when split over multiple lines (once MULTILINE is enabled), but it assumes that for ( ... ; ... ; ... ) is the only valid construct, so it wouldn't work with a for ( x in y ) construct, or other deviations.

It also assumes that there are no function calls with semi-colons in their arguments, such as:

for ( var i = 0; i < ListLen('a;b;c',';') ; i++ );

Whether this is a likely case depends on what you're actually doing this for.
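A small sketch of how this pattern behaves on the cases discussed in this thread. The helper name and sample strings are made up for illustration.

```python
import re

# The three-part pattern from this answer.
THREE_PARTS = re.compile(r"for\s*\([^;]+;[^;]+;[^;]+\)\s*;")


def has_three_part_match(source):
    """Return True if the pattern matches anywhere in source."""
    return THREE_PARTS.search(source) is not None


samples = [
    "for (int i = 0; i < 10; doSomethingTo(i));",           # terminated
    "for (int i = 0; i < 10; i++)",                         # not terminated
    "for (int i = 0;\n     i < 10;\n     i++);",            # split over lines
    "for ( var i = 0; i < ListLen('a;b;c',';') ; i++ );",   # ';' in arguments
    "for (int i = 0; i < 10; i++)\n    foo(i);",            # body is a call
]
for text in samples:
    print(has_three_part_match(text), repr(text))
```

The semi-colons inside the string arguments defeat the pattern, as described above. The last sample shows a further caveat: because [^;]+ can run across the closing parenthesis and the line break, a loop whose body is a single call statement is reported as terminated.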

Answer 9:

As Frank suggested, this is best done without regex. Here's an (ugly) one-liner:

match_string = orig_string[orig_string.index("("):len(orig_string)-orig_string[::-1].index(")")]

Matching the troll line est mentioned in his comment:

orig_string = "for (int i = 0; i < 10; doSomethingTo(\"(\"));"
match_string = orig_string[orig_string.index("("):len(orig_string)-orig_string[::-1].index(")")]

returns (int i = 0; i < 10; doSomethingTo("("))

This works by running through the string forward until it reaches the first open paren, and then backward until it reaches the first closing paren. It then uses these two indices to slice the string.
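The same slice can be written a little more readably with str.index and str.rindex, and the position of the last closing paren also gives a place to test for the semi-colon. This is a sketch under the assumption that each string holds exactly one loop statement; the function names are made up.

```python
def outer_parens(text):
    """Slice from the first '(' to the last ')', both inclusive."""
    start = text.index("(")
    end = text.rindex(")")
    return text[start:end + 1]


def semicolon_follows(text):
    """True if only whitespace separates the last ')' from a ';'."""
    return text[text.rindex(")") + 1:].lstrip().startswith(";")


orig_string = "for (int i = 0; i < 10; doSomethingTo(\"(\"));"
print(outer_parens(orig_string))
print(semicolon_follows(orig_string))
```

Because it relies on the last closing paren in the string, a loop followed by a call on the same line, for example for (;;) f();, would be reported as terminated.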

Tags: c++ python regex parsing recursion
