Find shortest matches between two strings

2020-01-22 13:46发布

I have a large log file, and I want to extract a multi-line string between two strings: start and end.

The following is sample from the inputfile:

start spam
start rubbish
start wait for it...
    profit!
here end
start garbage
start second match
win. end

The desired solution should print:

start wait for it...
    profit!
here end
start second match
win. end

I tried a simple regex but it returned everything from start spam. How should this be done?

Edit: Additional info on real-life computational complexity:

  • actual file size: 2GB
  • occurrences of 'start': ~ 12 M, evenly distributed
  • occurences of 'end': ~800, near the end of the file.

标签: python regex
4条回答
仙女界的扛把子
2楼-- · 2020-01-22 13:57

Do it with code - basic state machine:

open = False
tmp = []
for ln in fi:
    if 'start' in ln:
        if open:
            tmp = []
        else:
            open = True

    if open:
        tmp.append(ln)

    if 'end' in ln:
        open = False
        for x in tmp:
            print x
        tmp = []
查看更多
▲ chillily
3楼-- · 2020-01-22 14:04

This is tricky to do because by default, the re module does not look at overlapping matches. Newer versions of Python have a new regex module that allows for overlapping matches.

https://pypi.python.org/pypi/regex

You'd want to use something like

regex.findall(pattern, string, overlapped=True)

If you're stuck with Python 2.x or something else that doesn't have regex, it's still possible with some trickery. One brilliant person solved it here:

Python regex find all overlapping matches?

Once you have all possible overlapping (non-greedy, I imagine) matches, just determine which one is shortest, which should be easy.

查看更多
再贱就再见
4楼-- · 2020-01-22 14:05

This regex should match what you want:

(start((?!start).)*?end)

Use re.findall method and single-line modifier re.S to get all the occurences in a multi-line string:

re.findall('(start((?!start).)*?end)', text, re.S)

See a test here.

查看更多
Melony?
5楼-- · 2020-01-22 14:14

You could do (?s)start.*?(?=end|start)(?:end)?, then filter out everything not ending in "end".

查看更多
登录 后发表回答