I'm trying to match a mathematical-expression-like string, that have nested parentheses.
import re
p = re.compile('\(.+\)')
str = '(((1+0)+1)+1)'
print p.findall(s)
['(((1+0)+1)+1)']
I wanted it to match all the enclosed expressions, such as (1+0), ((1+0)+1)...
I don't even care if it matches unwanted ones like (((1+0), I can take care of those.
Why it's not doing that already, and how can I do it?
Regex languages aren't powerful enough to matching arbitrarily nested constructs. For that you need a push-down automaton (i.e., a parser). There are several such tools available, such as PLY.
Python also provides a parser library for its own syntax, which might do what you need. The output is extremely detailed, however, and takes a while to wrap your head around. If you're interested in this angle, the following discussion tries to explain things as simply as possible.
You can ease the pain with this short function:
The numbers come from the Python modules
symbol
andtoken
, which you can use to build a lookup table from numbers to names:You could even fold this mapping into the
shallow()
function so you can work with strings instead of numbers:There is a new regular engine module being prepared to replace the existing one in Python. It introduces a lot of new functionality, including recursive calls.
Output:
Related bug in
regex
: http://code.google.com/p/mrab-regex-hg/issues/detail?id=78You can use regexps, but you need to do the recursion yourself. Something like the following does the trick (if you only need to find, as your question says, all the expressions enclosed into parentheses):
This code, however, does not match the 'correct' parentheses. If you need to do that you will be better off with a specialized parser.
As others have mentioned, regular expressions are not the way to go for nested constructs. I'll give a basic example using pyparsing:
Here's a usage example:
Output:
(With newlining/indenting/comments done manually)
Edit: Modified to eliminate unnecessary
Forward
, as per Paul McGuire's suggestions.To get the output in nested list format:
Output:
Balanced pairs (of parentheses, for example) is an example of a language that cannot be recognized by regular expressions.
What follows is a brief explanation of the math for why that is.
Regular expressions are a way of defining finite state automata (abbreviated FSM). Such a device has a finite amount of possible state to store information. How that state can be used is not particularly restricted, but it does mean that there are an absolute maximum number of distinct positions it can recognize.
For example, the state can be used for counting, say, unmatched left parentheses. But because the amount of state for that kind of counting must be completely bounded, then a given FSM can count to a maximum of n-1, where n is the number of states the FSM can be in. If n is, say, 10, then the maximum number of unmatched left parenthesis the FSM can match is 10, until it breaks. Since it's perfectly possible to have one more left parenthesis, there is no possible FSM that can correctly recognize the complete language of matched parentheses.
So what? Suppose you just pick a really large n? The problem is that as a way of describing FSM, regular expressions basically describe all of the transitions from one state to another. Since for any N, an FSM would need 2 state transitions (one for matching a left parenthesis, and one for matching right), the regular expression itself must grow by at least a constant factor multiple of n
By comparison, the next better class of languages, (context free grammars) can solve this problem in a totally compact way. Here's an example in BNF
Many posts suggest that for nested braces, REGEX IS NOT THE WAY TO DO IT. SIMPLY COUNT THE BRACES: For example, see: Regular expression to detect semi-colon terminated C++ for & while loops
Here is a complete python sample to iterate through a string and count braces: