I'm writing a Python regex that looks through a text document for quoted strings (quotes from airline pilots, recorded on black boxes). I started by trying to write a regex with the following rules:
Return what is between quotes.
If it opens with a single quote, only return it if it closes with a single quote.
If it opens with a double quote, only return it if it closes with a double quote.
For instance, I don't want to match "hi there' or 'hi there", but I do want to match "hi there" and 'hi there'.
I use a testing page which contains things like:
CA "Runway 18, wind 230 degrees, five knots, altimeter 30."
AA "Roger that"
18:24:10 [flap lever moving into detent]
ST: "Some passenger's pushing a switch. May I?"
So I decided to start simple:
re.findall('("|\').*?\\1', page)
########## /("|').*?\1/ <-- raw regex I think I'm going for.
This regex acts very unexpectedly.
I thought it would:
- ( " | ' ) Match EITHER a single OR a double quote, save as back reference \1.
- .*? Match non-greedy wildcard.
- \1 Match whatever it finds in back reference \1 (step one).
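The three steps above can be checked on single strings. This is only a sketch: the sample strings and the name `pattern` are made up for illustration, and it uses `re.search` rather than `re.findall` so that the match object can be inspected directly.

```python
import re

# Same pattern as above, written as a raw triple-quoted string so that
# neither quote character nor the backslash needs escaping.
pattern = re.compile(r"""("|').*?\1""")

# Matching quotes: group(0) is the whole match, group(1) is only the
# opening quote character captured by the parentheses.
m = pattern.search('AA "Roger that"')
whole_match = m.group(0)
opening_quote = m.group(1)
print(whole_match)
print(opening_quote)

# Mismatched quotes: the back reference \1 cannot be satisfied,
# so search() returns None in both cases.
mismatch_a = pattern.search("""say "hi there'""")
mismatch_b = pattern.search("""say 'hi there\"""")
print(mismatch_a, mismatch_b)
```

If this behaves as expected, the pattern itself follows the rules listed earlier, and the whole quoted string is available as `group(0)` of each match.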
Instead, it returns a list containing only the quote characters, never the text between them.
['"', '"', "'", "'"]
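A minimal reproduction, assuming the four sample lines shown earlier are stored in a variable named `page` (my full test page is longer, so its output differs). The Python documentation for `re.findall` says that when the pattern contains one group, the list holds that group's text rather than the whole match. For comparison, the sketch also collects `group(0)` from `re.finditer`.

```python
import re

# Only the four sample lines from the question; the real page is longer.
page = '''CA "Runway 18, wind 230 degrees, five knots, altimeter 30."
AA "Roger that"
18:24:10 [flap lever moving into detent]
ST: "Some passenger's pushing a switch. May I?"
'''

pattern = re.compile(r"""("|').*?\1""")

# With one capturing group, findall returns the group's text for each
# match, which here is just the opening quote character.
groups_only = pattern.findall(page)
print(groups_only)

# finditer yields match objects; group(0) is the full matched text,
# including the surrounding quotes.
full_matches = [m.group(0) for m in pattern.finditer(page)]
print(full_matches)
```

On these four lines, `groups_only` contains three double-quote characters, while `full_matches` contains the three quoted strings. The apostrophe in "passenger's" does not start a match of its own because it is consumed inside the double-quoted match.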
I'm really confused because the equivalent (afaik) regex works just fine in Vim.
\("\|'\).\{-}\1
My question is this:
Why does it return only what is inside the parentheses as the match? Is this a flaw in my understanding of back references? If so, then why does it work in Vim?
And how do I write the regex I'm looking for in Python?
Thank you for your help!