So, I would like to split this text into sentences.
s = "You! Are you Tom? I am Danny."
so I get:
["You!", "Are you Tom?", "I am Danny."]
That is, I want to split the text by the regex '[.!\?]'
without removing the delimiters. What is the most Pythonic way to achieve this in Python?
I am aware of these questions:
JS string.split() without removing the delimiters
Python split() without removing the delimiter
But my problem has several delimiters (.?!), which complicates it.
You can use re.findall with the regex '.*?[.!\?]'; the lazy quantifier *? makes sure each match stops at the first delimiter it reaches.

The easiest way is to use nltk. It will return a list of all your sentences without losing the delimiters.
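A minimal sketch of the re.findall suggestion above, run on the string from the question. The `strip()` call is an addition here, since every match after the first starts with the space that follows the previous delimiter; the backslash before `?` is dropped because it is not needed inside a character class.

```python
import re

s = "You! Are you Tom? I am Danny."

# .*? is lazy, so each match ends at the first ., ! or ? it reaches.
# Matches after the first keep their leading space, hence the strip().
sentences = [m.strip() for m in re.findall(r".*?[.!?]", s)]

print(sentences)
```

Note that any text after the last delimiter is not matched by this pattern, and that `.` does not match newlines unless `re.DOTALL` is passed.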
If you prefer to use a split method rather than a match, one solution is to split with a capturing group.
Filter removes any empty strings.
This will work even if there are no spaces between sentences, or if you need to catch a trailing sentence that ends with a different punctuation sign, such as a Unicode ellipsis (or has none at all).
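The original snippet for this approach is not shown, so the following is only a sketch of how a split on a capturing group might look. The pattern and the helper name `split_keep` are assumptions made for illustration.

```python
import re

# Hypothetical pattern: capture a run of non-delimiters plus one delimiter,
# and consume (without capturing) any whitespace that follows.
SENTENCE = re.compile(r"([^.!?]*[.!?])\s*")


def split_keep(text):
    # re.split returns the captured groups along with the text between
    # matches; filter(None, ...) drops the empty strings in between.
    return list(filter(None, SENTENCE.split(text)))


print(split_keep("You! Are you Tom? I am Danny."))
# No spaces between sentences, and a trailing sentence with no delimiter.
print(split_keep("You!Are you Tom?I am Danny"))
```

Because the text after the last match is returned by `re.split` as well, a trailing sentence with no closing punctuation is kept rather than dropped.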
It is even possible to keep your regex as is (with the escaping corrected and parentheses added).
Then merge the even and odd elements and remove the extra spaces.
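One way this merge step could be written, as a sketch: splitting on the parenthesized character class leaves text at the even indices and delimiters at the odd ones, so the two slices can be zipped back together. The variable names are illustrative.

```python
import re

s = "You! Are you Tom? I am Danny."

# The capturing group makes re.split keep the delimiters in the result:
# text pieces sit at even indices, delimiters at odd indices.
parts = re.split(r"([.!?])", s)

# Pair each text piece with the delimiter that followed it, then trim spaces.
sentences = [
    (text + delim).strip()
    for text, delim in zip(parts[0::2], parts[1::2])
]

print(sentences)
```

Since `zip` stops at the shorter slice, any text left over after the final delimiter is dropped in this version.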
If Python supported split by zero-length matches, you could achieve this by matching an empty string preceded by one of the delimiters:
Demo: https://regex101.com/r/ZLDXr1/1
Unfortunately, Python's re.split did not support splitting on zero-length matches before version 3.7. The solution may still be useful on older versions of other tools, and in other languages that support lookbehinds.
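For reference, `re.split` does accept patterns that can match an empty string from Python 3.7 onward. A sketch of the lookbehind split under that assumption is shown below; the stripping and filtering step is an addition to tidy the pieces.

```python
import re

s = "You! Are you Tom? I am Danny."

# Requires Python 3.7+: split at the empty position right after ., ! or ?
parts = re.split(r"(?<=[.!?])", s)

# The pieces keep their leading spaces, and the match at the very end of the
# string yields an empty piece, so strip and drop blanks.
sentences = [p.strip() for p in parts if p.strip()]

print(sentences)
```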
However, based on your sample input and output, you actually need to split on the whitespace preceded by one of the delimiters. So the regex would be:
Demo: https://regex101.com/r/ZLDXr1/2
Python demo: https://ideone.com/z6nZi5
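The exact pattern from the demos is not reproduced here. A plausible form, assumed for this sketch, is a lookbehind for a delimiter followed by `\s+`. The second call shows what happens when the sentences are not separated by whitespace.

```python
import re

# Assumed pattern: one or more whitespace characters that come right after
# one of the delimiters. The lookbehind itself consumes nothing.
BOUNDARY = re.compile(r"(?<=[.!?])\s+")

spaced = BOUNDARY.split("You! Are you Tom? I am Danny.")
unspaced = BOUNDARY.split("You!Are you Tom?I am Danny.")

print(spaced)
print(unspaced)
```

With no whitespace between sentences there is nothing to split on, so the second input comes back as a single element.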
If the spaces are optional, the re.findall solution suggested by @Psidom is the best one, I believe.

Strictly speaking, you don't want to split on '!?.', but rather on the whitespace that follows those characters. The following will work:
This splits on whitespace, but only if it is preceded by either a ., !, or ? character.
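A sketch of the whitespace split just described, assuming `\s+` as the whitespace pattern so that runs of spaces and newlines are handled too. The sample text is a variation on the question's string.

```python
import re

text = "You!  Are you Tom?\nI am Danny."

# Split on whitespace only where the previous character is ., ! or ?
# Spaces inside a sentence are not preceded by a delimiter, so they stay.
sentences = re.split(r"(?<=[.!?])\s+", text)

print(sentences)
```

Python's `re` module requires lookbehinds to be fixed-width; a single character class satisfies that.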