How to match a paragraph using regex

Posted 2019-01-24 15:08

Question:

I have been struggling with Python regex for a while, trying to match paragraphs within a text, but I haven't been successful. I need to obtain the start and end positions of the paragraphs.

An example of a text:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod
tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At
vero eos et accusam et justo duo dolores et ea rebum. 

Stet clita kasd gubergren,
no sea takimata sanctus est Lorem ipsum dolor sit amet.

Ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod
tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At
vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren,
no sea takimata sanctus est Lorem ipsum dolor sit amet.

In this example, I want to match each of the paragraphs starting with Lorem, Stet and Ipsum separately (without the empty lines). Does anyone have any idea how to do this?

Answer 1:

You can split on a double newline like this:

paragraphs = re.split(r"\n\n", DATA)

Edit: To capture the paragraphs as matches, so you can get their start and end points, do this:

for match in re.finditer(r'(?s)((?:[^\n][\n]?)+)', DATA):
    print(match.start(), match.end())

# Prints:
# 0 214
# 215 298
# 299 589
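
In case the trailing newline should not be part of each span, here is a minimal sketch of a variant. The helper name paragraph_spans and the sample string are made up for illustration, and the sketch assumes that paragraphs are separated by completely empty lines (a line holding only spaces would still count as text here).

```python
import re

def paragraph_spans(text):
    """Return (start, end) offsets for each run of consecutive non-empty lines."""
    # One non-empty line, optionally continued by single newlines
    # that are each followed by another non-empty line.
    pattern = re.compile(r'[^\n]+(?:\n[^\n]+)*')
    return [(m.start(), m.end()) for m in pattern.finditer(text)]

sample = "First line\nsecond line\n\nNext paragraph\n"
for start, end in paragraph_spans(sample):
    # The slice holds the paragraph text without the newline that follows it.
    print(start, end, repr(sample[start:end]))
```

Because the pattern never ends on a newline, sample[start:end] gives the paragraph text directly.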


Answer 2:

Splitting is one way; you can also match the paragraphs with a regular expression, like this:

paragraphs = re.findall('(.+?\n\n|.+?$)', TEXT, re.DOTALL)

The .+? is a lazy match: it matches the shortest substring that still lets the whole regex succeed. A greedy .+ would instead match the whole string.

So here we want to find a sequence of characters (.+?) that ends with a blank line (\n\n) or the end of the string ($). The re.DOTALL flag makes the dot match newlines as well, because a paragraph may span several lines with no blank line between them.
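
To make the lazy/greedy difference concrete, here is a small sketch. It uses a slightly different pattern from the one above (the blank line is kept outside the capturing group, so it is not part of the result), and the sample text is made up.

```python
import re

text = "one\ntwo\n\nthree\n\nfour"

# Lazy: each match stops at the first blank line, or at the end of the string.
lazy = re.findall(r'(.+?)(?:\n\n|$)', text, re.DOTALL)

# Greedy: '.+' runs to the end of the string first, so there is a single match.
greedy = re.findall(r'(.+)(?:\n\n|$)', text, re.DOTALL)

print(lazy)    # ['one\ntwo', 'three', 'four']
print(greedy)  # ['one\ntwo\n\nthree\n\nfour']
```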



Answer 3:

Which newline sequence does your text use? Supposing it is '\r\n', you can match the paragraphs starting with Lorem like this:

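import re

# DOTALL lets '.' cross line breaks; the lazy '.*?' stops at the first blank line
pattern = re.compile(r'(?:\A|(?<=\r\n\r\n))Lorem.*?(?=\r\n\r\n|\Z)', re.DOTALL)
text = '...'    # your source text
matchlist = pattern.findall(text)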

matchlist will then contain all the paragraphs that start with Lorem. The same approach works for the other two words.
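
If the newline convention is not known in advance, one option is to accept both '\n' and '\r\n' and filter the paragraphs afterwards. This is only a sketch: the names PARAGRAPH and paragraphs_starting_with and the sample string are invented for illustration.

```python
import re

# A paragraph is a run of non-empty lines joined by single line breaks;
# '\r?\n' accepts both Unix and Windows line endings.
PARAGRAPH = re.compile(r'[^\r\n]+(?:\r?\n[^\r\n]+)*')

def paragraphs_starting_with(text, word):
    """Return (start, end) offsets of the paragraphs that begin with `word`."""
    return [
        (m.start(), m.end())
        for m in PARAGRAPH.finditer(text)
        if m.group().startswith(word)
    ]

sample = "Lorem a\r\nb\r\n\r\nStet c\r\n\r\nLorem d"
print(paragraphs_starting_with(sample, "Lorem"))
```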



Answer 4:

Try

^(.+?)\n\s*\n

or

^(.+?)\r\n\s*\r\n

Just don't forget to append an extra newline at the end of the text; otherwise the last paragraph has no terminator to match.
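
Here is a sketch of how the first pattern might be used from Python. Two things are assumptions on my part: the flags (re.MULTILINE so that ^ matches at every line start, re.DOTALL so that the dot can cross line breaks), and that the sample text has no trailing newline, so a full blank line ("\n\n") is appended.

```python
import re

text = "alpha\nbeta\n\ngamma\n   \ndelta"

# Give the final paragraph a terminator so the pattern can match it too.
padded = text + "\n\n"

# '\s*' between the two newlines also accepts separator lines holding only spaces.
pattern = re.compile(r'^(.+?)\n\s*\n', re.MULTILINE | re.DOTALL)

# Group 1 is the paragraph itself, without the blank line that ends it.
paragraphs = [(m.start(1), m.end(1), m.group(1)) for m in pattern.finditer(padded)]
for start, end, body in paragraphs:
    print(start, end, repr(body))
```

The offsets refer to padded, but since the padding is only added at the end they are valid for text as well.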



Answer 5:

I tried to use the recommended regex with the default Java regex engine. That gave me a StackOverflowError several times, so in the end I rewrote the regex and optimized it a little more.

This works fine for me in Java:

(?s)(.*?[^\:\-\,])(?:$|\n{2,})

This also handles the end of the document when there is no trailing newline, and it joins a paragraph whose last line ends with ':', '-' or ',' to the next paragraph.

To keep trailing blanks (spaces or tabs) from breaking the feature described above, I strip them beforehand with the following regex:

(?m)\p{Blank}+$
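
Since the question is about Python, here is a rough port of the two steps. It is a sketch only: Python's re is not the Java engine, [ \t] stands in for the blank class, and the sample text is made up.

```python
import re

raw = "Shopping list:  \n\nmilk, eggs\t\n\nDone"

# Step 1: strip trailing spaces and tabs from every line.
cleaned = re.sub(r'(?m)[ \t]+$', '', raw)

# Step 2: a paragraph ends at a blank line or at the end of the text,
# but only if its last character is not ':', '-' or ','.
pattern = re.compile(r'(?s)(.*?[^:\-,])(?:$|\n{2,})')
paragraphs = [m.group(1) for m in pattern.finditer(cleaned)]

print(paragraphs)  # ['Shopping list:\n\nmilk, eggs', 'Done']
```

Without step 1, the two spaces after the colon would end the first paragraph early, because its last character would then be a space rather than ':'.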