I have a large log file, and I want to extract a multi-line string between two strings: 'start' and 'end'.
The following is a sample from the input file:
start spam
start rubbish
start wait for it...
profit!
here end
start garbage
start second match
win. end
The desired solution should print:
start wait for it...
profit!
here end
start second match
win. end
I tried a simple regex, but it returned everything from 'start spam' onward. How should this be done?
Edit: Additional info on real-life computational complexity:
- actual file size: 2GB
- occurrences of 'start': ~ 12 M, evenly distributed
- occurrences of 'end': ~800, near the end of the file.
Do it with code - basic state machine:
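A minimal sketch of such a state machine, assuming (as in the sample) that 'start' always begins a line and 'end' always finishes one. The function name extract_blocks and its parameters are illustrative, not taken from the original answer:

```python
def extract_blocks(lines, start="start", end="end"):
    """Yield each block running from the most recent 'start' line to the next 'end' line."""
    buffer = None  # None means we are not inside a candidate block
    for line in lines:
        if line.startswith(start):
            # A newer 'start' line discards whatever was collected so far.
            buffer = [line]
        elif buffer is not None:
            buffer.append(line)
        # Assumption: the end marker is the last thing on its line.
        if buffer is not None and line.rstrip("\r\n").endswith(end):
            yield "".join(buffer)
            buffer = None
```

Because it accepts any iterable of lines, an open file object can be passed directly (for example, for block in extract_blocks(f): print(block, end="")). The file is then read in a single pass, and only the current candidate block is held in memory, which should matter for a 2GB input.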
This is tricky to do because, by default, the re module does not look at overlapping matches. There is a newer third-party regex module, available from PyPI, that allows for overlapping matches: https://pypi.python.org/pypi/regex
You'd want to use something like
If you're stuck with Python 2.x or something else that doesn't have regex, it's still possible with some trickery. One brilliant person solved it here: Python regex find all overlapping matches?
Once you have all possible overlapping (non-greedy, I imagine) matches, just determine which one is shortest, which should be easy.
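One way this could look with only the standard re module, as a sketch: the zero-width lookahead with a capture group is the usual trick for collecting overlapping matches, and the grouping by end position is an assumption about what "shortest" means here. The name shortest_blocks is hypothetical:

```python
import re

# A lookahead consumes nothing, so a candidate is captured at every 'start',
# even when the candidates overlap one another.
OVERLAPPING = re.compile(r"(?=(start.*?end))", re.S)

def shortest_blocks(text):
    """Return, for each 'end', the shortest 'start ... end' candidate reaching it."""
    latest_start = {}  # end offset -> start offset of the shortest candidate
    for match in OVERLAPPING.finditer(text):
        begin, stop = match.span(1)
        # Candidates arrive in order of increasing start offset, so the last
        # one stored for a given end offset is the shortest.
        latest_start[stop] = begin
    return [text[begin:stop] for stop, begin in sorted(latest_start.items())]
```

Note that every candidate scans forward to the nearest 'end' and the whole text has to be in memory, so on a file with millions of 'start' occurrences and few 'end' occurrences this is likely to be slow.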
This regex should match what you want:
Use the re.findall method and the single-line modifier re.S to get all the occurrences in a multi-line string. See a test here.
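As a sketch of this approach, assuming a pattern of the form start(?:(?!start).)*?end (one regex that fits the description, not necessarily the exact one intended above). The names BLOCK and find_blocks are illustrative:

```python
import re

# (?:(?!start).)*? consumes one character at a time, but only while the
# next characters do not begin another 'start'. A match therefore always
# begins at the last 'start' before an 'end'.
BLOCK = re.compile(r"start(?:(?!start).)*?end", re.S)

def find_blocks(text):
    """Return every 'start ... end' span that contains no inner 'start'."""
    return BLOCK.findall(text)
```

With re.S the dot also matches newlines, which is what lets a single match span several lines of the log.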
You could do (?s)start.*?(?=end|start)(?:end)?, then filter out everything not ending in "end".
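A short sketch of that two-step idea using the pattern above with re.findall. The names CANDIDATE and find_blocks_filtered are hypothetical:

```python
import re

# Each candidate stops at the next 'end' or the next 'start', whichever
# comes first; the trailing 'end' is consumed only when it is present.
CANDIDATE = re.compile(r"(?s)start.*?(?=end|start)(?:end)?")

def find_blocks_filtered(text):
    """Keep only the candidates that actually ran up to an 'end'."""
    return [chunk for chunk in CANDIDATE.findall(text) if chunk.endswith("end")]
```

On the sample input the unfiltered candidates also include 'start spam', 'start rubbish' and 'start garbage' (each followed by a newline); the filter discards those because they stop at the next 'start' instead of at an 'end'.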