Find shortest matches between two strings

I have a large log file, and I want to extract a multi-line string between two strings: start and end.

The following is sample from the inputfile:

start spam
start rubbish
start wait for it...
    profit!
here end
start garbage
start second match
win. end

The desired solution should print:

start wait for it...
    profit!
here end
start second match
win. end

I tried a simple regex but it returned everything from start spam. How should this be done?

Edit: Additional info on real-life computational complexity:

actual file size: 2GB
occurrences of 'start': ~ 12 M, evenly distributed
occurences of 'end': ~800, near the end of the file.

标签： python regex

4条回答

仙女界的扛把子

2楼-- · 2020-01-22 13:57

Do it with code - basic state machine:

open = False
tmp = []
for ln in fi:
    if 'start' in ln:
        if open:
            tmp = []
        else:
            open = True

    if open:
        tmp.append(ln)

    if 'end' in ln:
        open = False
        for x in tmp:
            print x
        tmp = []

0人赞添加讨论(0) 举报

▲ chillily

3楼-- · 2020-01-22 14:04

This is tricky to do because by default, the re module does not look at overlapping matches. Newer versions of Python have a new regex module that allows for overlapping matches.

https://pypi.python.org/pypi/regex

You'd want to use something like

regex.findall(pattern, string, overlapped=True)

If you're stuck with Python 2.x or something else that doesn't have regex, it's still possible with some trickery. One brilliant person solved it here:

Python regex find all overlapping matches?

Once you have all possible overlapping (non-greedy, I imagine) matches, just determine which one is shortest, which should be easy.

0人赞添加讨论(0) 举报

再贱就再见

4楼-- · 2020-01-22 14:05

This regex should match what you want:

(start((?!start).)*?end)

Use re.findall method and single-line modifier re.S to get all the occurences in a multi-line string:

re.findall('(start((?!start).)*?end)', text, re.S)

See a test here.

0人赞添加讨论(0) 举报

Melony?

5楼-- · 2020-01-22 14:14

You could do (?s)start.*?(?=end|start)(?:end)?, then filter out everything not ending in "end".

0人赞添加讨论(0) 举报

Find shortest matches between two strings

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间