Python: How to match nested parentheses with regex

2019-01-01 07:40发布

I'm trying to match a mathematical-expression-like string, that have nested parentheses.

import re

p = re.compile('\(.+\)')
str = '(((1+0)+1)+1)'
print p.findall(s)

['(((1+0)+1)+1)']

I wanted it to match all the enclosed expressions, such as (1+0), ((1+0)+1)...
I don't even care if it matches unwanted ones like (((1+0), I can take care of those.

Why it's not doing that already, and how can I do it?

13条回答
萌妹纸的霸气范
2楼-- · 2019-01-01 07:44

Regex languages aren't powerful enough to matching arbitrarily nested constructs. For that you need a push-down automaton (i.e., a parser). There are several such tools available, such as PLY.

Python also provides a parser library for its own syntax, which might do what you need. The output is extremely detailed, however, and takes a while to wrap your head around. If you're interested in this angle, the following discussion tries to explain things as simply as possible.

>>> import parser, pprint
>>> pprint.pprint(parser.st2list(parser.expr('(((1+0)+1)+1)')))
[258,
 [327,
  [304,
   [305,
    [306,
     [307,
      [308,
       [310,
        [311,
         [312,
          [313,
           [314,
            [315,
             [316,
              [317,
               [318,
                [7, '('],
                [320,
                 [304,
                  [305,
                   [306,
                    [307,
                     [308,
                      [310,
                       [311,
                        [312,
                         [313,
                          [314,
                           [315,
                            [316,
                             [317,
                              [318,
                               [7, '('],
                               [320,
                                [304,
                                 [305,
                                  [306,
                                   [307,
                                    [308,
                                     [310,
                                      [311,
                                       [312,
                                        [313,
                                         [314,
                                          [315,
                                           [316,
                                            [317,
                                             [318,
                                              [7,
                                               '('],
                                              [320,
                                               [304,
                                                [305,
                                                 [306,
                                                  [307,
                                                   [308,
                                                    [310,
                                                     [311,
                                                      [312,
                                                       [313,
                                                        [314,
                                                         [315,
                                                          [316,
                                                           [317,
                                                            [318,
                                                             [2,
                                                              '1']]]]],
                                                         [14,
                                                          '+'],
                                                         [315,
                                                          [316,
                                                           [317,
                                                            [318,
                                                             [2,
                                                              '0']]]]]]]]]]]]]]]],
                                              [8,
                                               ')']]]]],
                                          [14,
                                           '+'],
                                          [315,
                                           [316,
                                            [317,
                                             [318,
                                              [2,
                                               '1']]]]]]]]]]]]]]]],
                               [8, ')']]]]],
                           [14, '+'],
                           [315,
                            [316,
                             [317,
                              [318, [2, '1']]]]]]]]]]]]]]]],
                [8, ')']]]]]]]]]]]]]]]],
 [4, ''],
 [0, '']]

You can ease the pain with this short function:

def shallow(ast):
    if not isinstance(ast, list): return ast
    if len(ast) == 2: return shallow(ast[1])
    return [ast[0]] + [shallow(a) for a in ast[1:]]

>>> pprint.pprint(shallow(parser.st2list(parser.expr('(((1+0)+1)+1)'))))
[258,
 [318,
  '(',
  [314,
   [318, '(', [314, [318, '(', [314, '1', '+', '0'], ')'], '+', '1'], ')'],
   '+',
   '1'],
  ')'],
 '',
 '']

The numbers come from the Python modules symbol and token, which you can use to build a lookup table from numbers to names:

map = dict(token.tok_name.items() + symbol.sym_name.items())

You could even fold this mapping into the shallow() function so you can work with strings instead of numbers:

def shallow(ast):
    if not isinstance(ast, list): return ast
    if len(ast) == 2: return shallow(ast[1])
    return [map[ast[0]]] + [shallow(a) for a in ast[1:]]

>>> pprint.pprint(shallow(parser.st2list(parser.expr('(((1+0)+1)+1)'))))
['eval_input',
 ['atom',
  '(',
  ['arith_expr',
   ['atom',
    '(',
    ['arith_expr',
     ['atom', '(', ['arith_expr', '1', '+', '0'], ')'],
     '+',
     '1'],
    ')'],
   '+',
   '1'],
  ')'],
 '',
 '']
查看更多
几人难应
3楼-- · 2019-01-01 07:46

There is a new regular engine module being prepared to replace the existing one in Python. It introduces a lot of new functionality, including recursive calls.

import regex

s = 'aaa(((1+0)+1)+1)bbb'

result = regex.search(r'''
(?<rec> #capturing group rec
 \( #open parenthesis
 (?: #non-capturing group
  [^()]++ #anyting but parenthesis one or more times without backtracking
  | #or
   (?&rec) #recursive substitute of group rec
 )*
 \) #close parenthesis
)
''',s,flags=regex.VERBOSE)


print(result.captures('rec'))

Output:

['(1+0)', '((1+0)+1)', '(((1+0)+1)+1)']

Related bug in regex: http://code.google.com/p/mrab-regex-hg/issues/detail?id=78

查看更多
谁念西风独自凉
4楼-- · 2019-01-01 07:47

You can use regexps, but you need to do the recursion yourself. Something like the following does the trick (if you only need to find, as your question says, all the expressions enclosed into parentheses):

import re

def scan(p, string):
    found = p.findall(string)
    for substring in found:
        stripped = substring[1:-1]
        found.extend(scan(p, stripped))
    return found

p = re.compile('\(.+\)')
string = '(((1+0)+1)+1)'
all_found = scan(p, string)
print all_found

This code, however, does not match the 'correct' parentheses. If you need to do that you will be better off with a specialized parser.

查看更多
只若初见
5楼-- · 2019-01-01 07:48

As others have mentioned, regular expressions are not the way to go for nested constructs. I'll give a basic example using pyparsing:

import pyparsing # make sure you have this installed

thecontent = pyparsing.Word(pyparsing.alphanums) | '+' | '-'
parens     = pyparsing.nestedExpr( '(', ')', content=thecontent)

Here's a usage example:

>>> parens.parseString("((a + b) + c)")

Output:

(                          # all of str
 [
  (                        # ((a + b) + c)
   [
    (                      #  (a + b)
     ['a', '+', 'b'], {}   
    ),                     #  (a + b)      [closed]
    '+',
    'c'
   ], {}
  )                        # ((a + b) + c) [closed]
 ], {}  
)                          # all of str    [closed]

(With newlining/indenting/comments done manually)

Edit: Modified to eliminate unnecessary Forward, as per Paul McGuire's suggestions.

To get the output in nested list format:

res = parens.parseString("((12 + 2) + 3)")
res.asList()

Output:

[[['12', '+', '2'], '+', '3']]
查看更多
君临天下
6楼-- · 2019-01-01 07:50

Balanced pairs (of parentheses, for example) is an example of a language that cannot be recognized by regular expressions.

What follows is a brief explanation of the math for why that is.

Regular expressions are a way of defining finite state automata (abbreviated FSM). Such a device has a finite amount of possible state to store information. How that state can be used is not particularly restricted, but it does mean that there are an absolute maximum number of distinct positions it can recognize.

For example, the state can be used for counting, say, unmatched left parentheses. But because the amount of state for that kind of counting must be completely bounded, then a given FSM can count to a maximum of n-1, where n is the number of states the FSM can be in. If n is, say, 10, then the maximum number of unmatched left parenthesis the FSM can match is 10, until it breaks. Since it's perfectly possible to have one more left parenthesis, there is no possible FSM that can correctly recognize the complete language of matched parentheses.

So what? Suppose you just pick a really large n? The problem is that as a way of describing FSM, regular expressions basically describe all of the transitions from one state to another. Since for any N, an FSM would need 2 state transitions (one for matching a left parenthesis, and one for matching right), the regular expression itself must grow by at least a constant factor multiple of n

By comparison, the next better class of languages, (context free grammars) can solve this problem in a totally compact way. Here's an example in BNF

expression ::= `(` expression `)` expression
           |    nothing
 
查看更多
无与为乐者.
7楼-- · 2019-01-01 07:52

Many posts suggest that for nested braces, REGEX IS NOT THE WAY TO DO IT. SIMPLY COUNT THE BRACES: For example, see: Regular expression to detect semi-colon terminated C++ for & while loops

Here is a complete python sample to iterate through a string and count braces:

# decided for nested braces to not use regex but brace-counting
import re, string

texta = r'''
nonexistent.\note{Richard Dawkins, \textit{Unweaving the Rainbow: Science, Delusion
and the Appetite for Wonder} (Boston: Houghton Mifflin Co., 1998), pp. 302, 304,
306-309.} more text and more.

 This is a statistical fact, not a
guess.\note{Zheng Wu, \textit{Cohabitation: An Alternative Form
of Family Living} (Ontario, Canada: Oxford University Press,
2000), p. 149; \hbox{Judith} Treas and Deirdre Giesen, ``Title
and another title,''
\textit{Journal of Marriage and the Family}, February 2000,
p.\,51}

more and more text.capitalize
'''
pos = 0
foundpos = 0
openBr = 0 # count open braces
while foundpos <> -1:
    openBr = 0
    foundpos = string.find(texta, r'\note',pos)
    # print 'foundpos',foundpos
    pos = foundpos + 5
    # print texta[pos]
    result = ""
    while foundpos > -1 and openBr >= 0:
        pos = pos + 1
        if texta[pos] == "{":
            openBr = openBr + 1
        if texta[pos] == "}":
            openBr = openBr - 1
        result = result + texta[pos]
    result = result[:-1] # drop the last } found.
    result = string.replace(result,'\n', ' ') # replace new line with space
    print result
查看更多
登录 后发表回答