Python: How to match nested parentheses with regex

2019-01-01 07:28发布

问题:

I\'m trying to match a mathematical-expression-like string, that have nested parentheses.

import re

p = re.compile(\'\\(.+\\)\')
str = \'(((1+0)+1)+1)\'
print p.findall(s)

[\'(((1+0)+1)+1)\']

I wanted it to match all the enclosed expressions, such as (1+0), ((1+0)+1)...
I don\'t even care if it matches unwanted ones like (((1+0), I can take care of those.

Why it\'s not doing that already, and how can I do it?

回答1:

The regular expression tries to match as much of the text as possible, thereby consuming all of your string. It doesn\'t look for additional matches of the regular expression on parts of that string. That\'s why you only get one answer.

The solution is to not use regular expressions. If you are actually trying to parse math expressions, use a real parsing solutions. If you really just want to capture the pieces within parenthesis, just loop over the characters counting when you see ( and ) and increment a decrement a counter.



回答2:

As others have mentioned, regular expressions are not the way to go for nested constructs. I\'ll give a basic example using pyparsing:

import pyparsing # make sure you have this installed

thecontent = pyparsing.Word(pyparsing.alphanums) | \'+\' | \'-\'
parens     = pyparsing.nestedExpr( \'(\', \')\', content=thecontent)

Here\'s a usage example:

>>> parens.parseString(\"((a + b) + c)\")

Output:

(                          # all of str
 [
  (                        # ((a + b) + c)
   [
    (                      #  (a + b)
     [\'a\', \'+\', \'b\'], {}   
    ),                     #  (a + b)      [closed]
    \'+\',
    \'c\'
   ], {}
  )                        # ((a + b) + c) [closed]
 ], {}  
)                          # all of str    [closed]

(With newlining/indenting/comments done manually)

Edit: Modified to eliminate unnecessary Forward, as per Paul McGuire\'s suggestions.

To get the output in nested list format:

res = parens.parseString(\"((12 + 2) + 3)\")
res.asList()

Output:

[[[\'12\', \'+\', \'2\'], \'+\', \'3\']]


回答3:

There is a new regular engine module being prepared to replace the existing one in Python. It introduces a lot of new functionality, including recursive calls.

import regex

s = \'aaa(((1+0)+1)+1)bbb\'

result = regex.search(r\'\'\'
(?<rec> #capturing group rec
 \\( #open parenthesis
 (?: #non-capturing group
  [^()]++ #anyting but parenthesis one or more times without backtracking
  | #or
   (?&rec) #recursive substitute of group rec
 )*
 \\) #close parenthesis
)
\'\'\',s,flags=regex.VERBOSE)


print(result.captures(\'rec\'))

Output:

[\'(1+0)\', \'((1+0)+1)\', \'(((1+0)+1)+1)\']

Related bug in regex: http://code.google.com/p/mrab-regex-hg/issues/detail?id=78



回答4:

Regex languages aren\'t powerful enough to matching arbitrarily nested constructs. For that you need a push-down automaton (i.e., a parser). There are several such tools available, such as PLY.

Python also provides a parser library for its own syntax, which might do what you need. The output is extremely detailed, however, and takes a while to wrap your head around. If you\'re interested in this angle, the following discussion tries to explain things as simply as possible.

>>> import parser, pprint
>>> pprint.pprint(parser.st2list(parser.expr(\'(((1+0)+1)+1)\')))
[258,
 [327,
  [304,
   [305,
    [306,
     [307,
      [308,
       [310,
        [311,
         [312,
          [313,
           [314,
            [315,
             [316,
              [317,
               [318,
                [7, \'(\'],
                [320,
                 [304,
                  [305,
                   [306,
                    [307,
                     [308,
                      [310,
                       [311,
                        [312,
                         [313,
                          [314,
                           [315,
                            [316,
                             [317,
                              [318,
                               [7, \'(\'],
                               [320,
                                [304,
                                 [305,
                                  [306,
                                   [307,
                                    [308,
                                     [310,
                                      [311,
                                       [312,
                                        [313,
                                         [314,
                                          [315,
                                           [316,
                                            [317,
                                             [318,
                                              [7,
                                               \'(\'],
                                              [320,
                                               [304,
                                                [305,
                                                 [306,
                                                  [307,
                                                   [308,
                                                    [310,
                                                     [311,
                                                      [312,
                                                       [313,
                                                        [314,
                                                         [315,
                                                          [316,
                                                           [317,
                                                            [318,
                                                             [2,
                                                              \'1\']]]]],
                                                         [14,
                                                          \'+\'],
                                                         [315,
                                                          [316,
                                                           [317,
                                                            [318,
                                                             [2,
                                                              \'0\']]]]]]]]]]]]]]]],
                                              [8,
                                               \')\']]]]],
                                          [14,
                                           \'+\'],
                                          [315,
                                           [316,
                                            [317,
                                             [318,
                                              [2,
                                               \'1\']]]]]]]]]]]]]]]],
                               [8, \')\']]]]],
                           [14, \'+\'],
                           [315,
                            [316,
                             [317,
                              [318, [2, \'1\']]]]]]]]]]]]]]]],
                [8, \')\']]]]]]]]]]]]]]]],
 [4, \'\'],
 [0, \'\']]

You can ease the pain with this short function:

def shallow(ast):
    if not isinstance(ast, list): return ast
    if len(ast) == 2: return shallow(ast[1])
    return [ast[0]] + [shallow(a) for a in ast[1:]]

>>> pprint.pprint(shallow(parser.st2list(parser.expr(\'(((1+0)+1)+1)\'))))
[258,
 [318,
  \'(\',
  [314,
   [318, \'(\', [314, [318, \'(\', [314, \'1\', \'+\', \'0\'], \')\'], \'+\', \'1\'], \')\'],
   \'+\',
   \'1\'],
  \')\'],
 \'\',
 \'\']

The numbers come from the Python modules symbol and token, which you can use to build a lookup table from numbers to names:

map = dict(token.tok_name.items() + symbol.sym_name.items())

You could even fold this mapping into the shallow() function so you can work with strings instead of numbers:

def shallow(ast):
    if not isinstance(ast, list): return ast
    if len(ast) == 2: return shallow(ast[1])
    return [map[ast[0]]] + [shallow(a) for a in ast[1:]]

>>> pprint.pprint(shallow(parser.st2list(parser.expr(\'(((1+0)+1)+1)\'))))
[\'eval_input\',
 [\'atom\',
  \'(\',
  [\'arith_expr\',
   [\'atom\',
    \'(\',
    [\'arith_expr\',
     [\'atom\', \'(\', [\'arith_expr\', \'1\', \'+\', \'0\'], \')\'],
     \'+\',
     \'1\'],
    \')\'],
   \'+\',
   \'1\'],
  \')\'],
 \'\',
 \'\']


回答5:

Stack is the best tool for the job: -

import re
def matches(line, opendelim=\'(\', closedelim=\')\'):
    stack = []

    for m in re.finditer(r\'[{}{}]\'.format(opendelim, closedelim), line):
        pos = m.start()

        if line[pos-1] == \'\\\\\':
            # skip escape sequence
            continue

        c = line[pos]

        if c == opendelim:
            stack.append(pos+1)

        elif c == closedelim:
            if len(stack) > 0:
                prevpos = stack.pop()
                # print(\"matched\", prevpos, pos, line[prevpos:pos])
                yield (prevpos, pos, len(stack))
            else:
                # error
                print(\"encountered extraneous closing quote at pos {}: \'{}\'\".format(pos, line[pos:] ))
                pass

    if len(stack) > 0:
        for pos in stack:
            print(\"expecting closing quote to match open quote starting at: \'{}\'\"
                  .format(line[pos-1:]))

In the client code, since the function is written as a generator function simply use the for loop pattern to unroll the matches: -

line = \'(((1+0)+1)+1)\'
for openpos, closepos, level in matches(line):
    print(line[openpos:closepos], level)

This test code produces following on my screen, noticed the second param in the printout indicates the depth of the parenthesis.

1+0 2
(1+0)+1 1
((1+0)+1)+1 0


回答6:

From a linked answer:

From the LilyPond convert-ly utility (and written/copyrighted by myself, so I can show it off here):

def paren_matcher (n):
    # poor man\'s matched paren scanning, gives up
    # after n+1 levels.  Matches any string with balanced
    # parens inside; add the outer parens yourself if needed.
    # Nongreedy.
    return r\"[^()]*?(?:\\(\"*n+r\"[^()]*?\"+r\"\\)[^()]*?)*?\"*n

convert-ly tends to use this as paren_matcher (25) in its regular expressions which is likely overkill for most applications. But then it uses it for matching Scheme expressions.

Yes, it breaks down after the given limit, but the ability to just plug it into regular expressions still beats the \"correct\" alternatives supporting unlimited depth hands-down in usability.



回答7:

Balanced pairs (of parentheses, for example) is an example of a language that cannot be recognized by regular expressions.

What follows is a brief explanation of the math for why that is.

Regular expressions are a way of defining finite state automata (abbreviated FSM). Such a device has a finite amount of possible state to store information. How that state can be used is not particularly restricted, but it does mean that there are an absolute maximum number of distinct positions it can recognize.

For example, the state can be used for counting, say, unmatched left parentheses. But because the amount of state for that kind of counting must be completely bounded, then a given FSM can count to a maximum of n-1, where n is the number of states the FSM can be in. If n is, say, 10, then the maximum number of unmatched left parenthesis the FSM can match is 10, until it breaks. Since it\'s perfectly possible to have one more left parenthesis, there is no possible FSM that can correctly recognize the complete language of matched parentheses.

So what? Suppose you just pick a really large n? The problem is that as a way of describing FSM, regular expressions basically describe all of the transitions from one state to another. Since for any N, an FSM would need 2 state transitions (one for matching a left parenthesis, and one for matching right), the regular expression itself must grow by at least a constant factor multiple of n

By comparison, the next better class of languages, (context free grammars) can solve this problem in a totally compact way. Here\'s an example in BNF

expression ::= `(` expression `)` expression
           |    nothing
 


回答8:

You should write a proper parser for parsing such expression (e.g. using pyparsing). Regular expressions are not an appropriate tool for writing decent parsers.



回答9:

You can use regexps, but you need to do the recursion yourself. Something like the following does the trick (if you only need to find, as your question says, all the expressions enclosed into parentheses):

import re

def scan(p, string):
    found = p.findall(string)
    for substring in found:
        stripped = substring[1:-1]
        found.extend(scan(p, stripped))
    return found

p = re.compile(\'\\(.+\\)\')
string = \'(((1+0)+1)+1)\'
all_found = scan(p, string)
print all_found

This code, however, does not match the \'correct\' parentheses. If you need to do that you will be better off with a specialized parser.



回答10:

Here is a demo for your question, though it is clumsy, while it works

import re s = \'(((1+0)+1)+1)\'

def getContectWithinBraces( x , *args , **kwargs):
    ptn = r\'[%(left)s]([^%(left)s%(right)s]*)[%(right)s]\' %kwargs
    Res = []
    res = re.findall(ptn , x)
    while res != []:
        Res = Res + res
        xx = x.replace(\'(%s)\' %Res[-1] , \'%s\')
        res = re.findall(ptn, xx)
        print(res)
        if res != []:
            res[0] = res[0] %(\'(%s)\' %Res[-1])
    return Res

getContectWithinBraces(s , left=\'\\(\\[\\{\' , right = \'\\)\\]\\}\')


回答11:

my solution is that: define a function to extract content within the outermost parentheses, and then you call that function repeatedly until you get the content within the innermost parentheses.

def get_string_inside_outermost_parentheses(text):
    content_p = re.compile(r\"(?<=\\().*(?=\\))\")
    r = content_p.search(text)
    return r.group() 

def get_string_inside_innermost_parentheses(text):
    while \'(\' in text:
        text = get_string_inside_outermost_parentheses(text)
    return text


回答12:

I believe this function may suit your need, I threw this together fast so feel free to clean it up a bit. When doing nests its easy to think of it backwards and work from there =]

def fn(string,endparens=False):
    exp = []
    idx = -1
    for char in string:
        if char == \"(\":
            idx += 1
            exp.append(\"\")
        elif char == \")\":
            idx -= 1
            if idx != -1:
                exp[idx] = \"(\" + exp[idx+1] + \")\"
        else:
            exp[idx] += char
    if endparens:
        exp = [\"(\"+val+\")\" for val in exp]
    return exp


回答13:

Many posts suggest that for nested braces, REGEX IS NOT THE WAY TO DO IT. SIMPLY COUNT THE BRACES: For example, see: Regular expression to detect semi-colon terminated C++ for & while loops

Here is a complete python sample to iterate through a string and count braces:

# decided for nested braces to not use regex but brace-counting
import re, string

texta = r\'\'\'
nonexistent.\\note{Richard Dawkins, \\textit{Unweaving the Rainbow: Science, Delusion
and the Appetite for Wonder} (Boston: Houghton Mifflin Co., 1998), pp. 302, 304,
306-309.} more text and more.

 This is a statistical fact, not a
guess.\\note{Zheng Wu, \\textit{Cohabitation: An Alternative Form
of Family Living} (Ontario, Canada: Oxford University Press,
2000), p. 149; \\hbox{Judith} Treas and Deirdre Giesen, ``Title
and another title,\'\'
\\textit{Journal of Marriage and the Family}, February 2000,
p.\\,51}

more and more text.capitalize
\'\'\'
pos = 0
foundpos = 0
openBr = 0 # count open braces
while foundpos <> -1:
    openBr = 0
    foundpos = string.find(texta, r\'\\note\',pos)
    # print \'foundpos\',foundpos
    pos = foundpos + 5
    # print texta[pos]
    result = \"\"
    while foundpos > -1 and openBr >= 0:
        pos = pos + 1
        if texta[pos] == \"{\":
            openBr = openBr + 1
        if texta[pos] == \"}\":
            openBr = openBr - 1
        result = result + texta[pos]
    result = result[:-1] # drop the last } found.
    result = string.replace(result,\'\\n\', \' \') # replace new line with space
    print result