How would one write a regular expression to use in python to split paragraphs?

A paragraph break is defined by 2 line breaks (\n). Any number of spaces/tabs may appear together with the line breaks, and it should still be considered a paragraph break.

I am using Python, so the solution can use Python's extended regular expression syntax (it can make use of the (?P...) constructs).

Examples:

the_str = 'paragraph1\n\nparagraph2'
# splitting should yield ['paragraph1', 'paragraph2']

the_str = 'p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'
# should yield ['p1', 'p2\t\n\tstill p2', 'p3']

the_str = 'p1\n\n\n\tp2'
# should yield ['p1', '\n\tp2']

The best I could come up with is: r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', i.e.

import re
paragraphs = re.split(r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', the_str)

but that is ugly. Anything better?

EDIT:

Suggestions rejected:

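r'\s*?\n\s*?\n\s*?' -> That makes example 2 fail: the trailing lazy \s*? matches nothing, so the whitespace before p3 is left in the result ('\tp3'). Making the quantifiers greedy does not help, because \s includes \n and the pattern would then accept paragraph breaks with more than 2 \ns, which makes example 3 fail.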

Tags: python regex parsing text split

4 Answers

霸刀☆藐视天下

#2 · 2019-06-21 06:32

Almost the same, but using non-greedy quantifiers and taking advantage of the \s whitespace class.

\s*?\n\s*?\n\s*?
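
A quick way to see how this pattern behaves is to run it on examples 2 and 3 from the question. The sketch below does that; the greedy variant (GREEDY) is not part of the answer and is included here only for comparison.

```python
import re

LAZY = r'\s*?\n\s*?\n\s*?'   # the pattern suggested above
GREEDY = r'\s*\n\s*\n\s*'    # assumption: same pattern with greedy quantifiers

example2 = 'p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'
example3 = 'p1\n\n\n\tp2'

lazy2 = re.split(LAZY, example2)      # ['p1', 'p2\t\n\tstill p2', '\tp3']
lazy3 = re.split(LAZY, example3)      # ['p1', '\n\tp2']
greedy2 = re.split(GREEDY, example2)  # ['p1', 'p2\t\n\tstill p2', 'p3']
greedy3 = re.split(GREEDY, example3)  # ['p1', 'p2']

for result in (lazy2, lazy3, greedy2, greedy3):
    print(result)
```

The lazy form leaves the tab in front of p3 because its final \s*? is satisfied by matching nothing; the greedy form consumes that tab but also swallows the third \n in example 3.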

聊天终结者

#3 · 2019-06-21 06:37

Unfortunately there's no nice way to write "space but not a newline".
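
Whether it counts as nice is debatable, but one idiom sometimes used for "whitespace except newline" is the double-negated class [^\S\n]. The sketch below assumes that any whitespace other than \n may pad a paragraph break, which is somewhat broader than the question's [ \t\r\f\v] (for str patterns \s also covers other Unicode whitespace). The names BREAK and split_paragraphs are made up for this example.

```python
import re

# [^\S\n] reads as "not (non-whitespace or newline)",
# i.e. any whitespace character other than \n.
BREAK = re.compile(r'[^\S\n]*\n[^\S\n]*\n[^\S\n]*')

def split_paragraphs(text):
    """Split text on two line breaks padded by non-newline whitespace."""
    return BREAK.split(text)

print(split_paragraphs('p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'))
```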

I think the best you can do is add some space with the x (verbose) modifier and try to factor out the ugliness a bit, but whether that is an improvement is questionable: (?x) (?: [ \t\r\f\v]* \n ){2} [ \t\r\f\v]*
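
A minimal sketch of the verbose-mode idea, using re.VERBOSE instead of the inline (?x) flag and greedy quantifiers so that blanks after the second line break are consumed. The names PARAGRAPH_BREAK and split_paragraphs are invented for this example.

```python
import re

# In verbose mode, whitespace outside character classes is ignored and
# "#" starts a comment, so the pattern can be laid out and annotated.
PARAGRAPH_BREAK = re.compile(r'''
    (?: [ \t\r\f\v]* \n ){2}   # two line breaks, each optionally preceded by blanks
    [ \t\r\f\v]*               # blanks before the text of the next paragraph
''', re.VERBOSE)

def split_paragraphs(text):
    """Split text wherever PARAGRAPH_BREAK matches."""
    return PARAGRAPH_BREAK.split(text)

print(split_paragraphs('p1\n\n\n\tp2'))
```

The group is non-capturing on purpose: re.split would otherwise include the captured text in the result list.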

You could also try creating a subrule just for the character class and interpolating it three times.
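
The subrule idea could look roughly like this: name the character class once and interpolate it three times. BLANKS, PARAGRAPH_BREAK and split_paragraphs are hypothetical names chosen for the sketch, and the raw f-string assumes Python 3.6 or later.

```python
import re

# The "subrule": horizontal whitespace as defined in the question.
BLANKS = r'[ \t\r\f\v]*'

# Interpolate the subrule around the two mandatory line breaks.
PARAGRAPH_BREAK = re.compile(rf'{BLANKS}\n{BLANKS}\n{BLANKS}')

def split_paragraphs(text):
    """Split text on two line breaks padded by optional blanks."""
    return PARAGRAPH_BREAK.split(text)

print(split_paragraphs('p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'))
```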

女痞

#4 · 2019-06-21 06:44

Not a regexp but really elegant:

from itertools import groupby

def paragraph(lines):
    # Group consecutive lines by whether they are whitespace-only;
    # every run of non-blank lines is yielded as one paragraph.
    for group_separator, line_iteration in groupby(lines.splitlines(True), key=str.isspace):
        if not group_separator:
            yield ''.join(line_iteration)

for p in paragraph('p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'):
    print(repr(p))

'p1\n'
'p2\t\n\tstill p2\t   \n'
'\tp3'

It's up to you to strip the output as you need it of course.
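
As a sketch of what that stripping could look like, the variant below calls strip() on each joined group. The name stripped_paragraphs is made up for this example.

```python
from itertools import groupby

def stripped_paragraphs(text):
    """Yield paragraphs with leading and trailing whitespace removed."""
    # keepends=True so whitespace-only lines such as '\n' are kept and
    # str.isspace can recognise them as separators.
    for is_blank, group in groupby(text.splitlines(True), key=str.isspace):
        if not is_blank:
            yield ''.join(group).strip()

print(list(stripped_paragraphs('p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3')))
# ['p1', 'p2\t\n\tstill p2', 'p3']
```

Note that this treats any run of blank lines as a single separator, so for example 3 from the question ('p1\n\n\n\tp2') it yields ['p1', 'p2'] rather than ['p1', '\n\tp2'].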

Inspired by the famous "Python Cookbook" ;-)

Summer. ? 凉城

#5 · 2019-06-21 06:55

Are you trying to deduce the structure of a document in plain text? Are you doing what docutils does?

You might be able to simply use the Docutils parser rather than roll your own.
