How would one write a regular expression to use in python to split paragraphs?
A paragraph is defined by 2 linebreaks (\n). But one can have any amount of spaces/tabs together with the line breaks, and it still should be considered as a paragraph.
I am using Python, so the solution can use Python's extended regular expression syntax (it can make use of (?P...) stuff).
Examples:
the_str = 'paragraph1\n\nparagraph2'
# splitting should yield ['paragraph1', 'paragraph2']
the_str = 'p1\n\t\np2\t\n\tstill p2\t \n \n\tp3'
# should yield ['p1', 'p2\t\n\tstill p2', 'p3']
the_str = 'p1\n\n\n\tp2'
# should yield ['p1', '\n\tp2']
The best I could come up with is: r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', i.e.
import re
paragraphs = re.split(r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', the_str)
but that is ugly. Anything better?
EDIT:
Suggestions rejected:
r'\s*?\n\s*?\n\s*?' -> That would make examples 2 and 3 fail, since \s includes \n, so it would allow paragraph breaks with more than 2 \n characters.
Almost the same, but using non-greedy quantifiers and taking advantage of the whitespace sequence.
Unfortunately there's no nice way to write "space but not a newline".
I think the best you can do is add some space with the x modifier and try to factor out the ugliness a bit, but that's questionable:
(?x) (?: [ \t\r\f\v]*? \n ){2} [ \t\r\f\v]*?
You could also try creating a subrule just for the character class and interpolating it three times.
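A minimal sketch of that subrule idea, checked against the three examples from the question. The names HSPACE, PARAGRAPH_BREAK and split_paragraphs are made up for illustration. It uses greedy quantifiers rather than the non-greedy ones above, on the assumption that surrounding spaces and tabs should be swallowed by the separator: a non-greedy quantifier at the very end of a pattern matches as little as possible, so it would leave trailing whitespace attached to the next paragraph.

```python
import re

# Hypothetical names, for illustration only.
# HSPACE: any run of whitespace characters other than a newline.
HSPACE = r'[ \t\r\f\v]*'

# Two newlines, each optionally preceded by spaces/tabs, followed by
# optional spaces/tabs. The character class is interpolated three times.
PARAGRAPH_BREAK = re.compile(r'(?:' + HSPACE + r'\n){2}' + HSPACE)


def split_paragraphs(text):
    """Split text on paragraph breaks as defined in the question."""
    return PARAGRAPH_BREAK.split(text)


if __name__ == '__main__':
    for sample in ('paragraph1\n\nparagraph2',
                   'p1\n\t\np2\t\n\tstill p2\t \n \n\tp3',
                   'p1\n\n\n\tp2'):
        print(split_paragraphs(sample))
```

This is the same pattern as in the question, just with the repeated character class named once, so whether it counts as less ugly is a matter of taste.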
Not a regexp but really elegant:
It's up to you to strip the output as you need it of course.
Inspired from the famous "Python Cookbook" ;-)
Are you trying to deduce the structure of a document in plain text? Are you doing what docutils does?
You might be able to simply use the Docutils parser rather than roll your own.