Splitting on regex without removing delimiters

Posted 2020-02-02 00:31

So, I would like to split this text into sentences.

s = "You! Are you Tom? I am Danny."

so I get:

["You!", "Are you Tom?", "I am Danny."]

That is, I want to split the text on the regex '[.!\?]' without removing the delimiters. What is the most Pythonic way to achieve this in Python?

I am aware of these questions:

JS string.split() without removing the delimiters

Python split() without removing the delimiter

But my case involves several delimiters (., ?, !), which complicates things.

5 answers
We Are One
#2 · 2020-02-02 00:49

You can use re.findall with the regex .*?[.!\?]; the lazy quantifier *? makes each match stop at the first delimiter it reaches:

import re

s = """You! Are you Tom? I am Danny."""
re.findall(r'.*?[.!\?]', s)
# ['You!', ' Are you Tom?', ' I am Danny.']
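
The leading spaces in that output can be stripped in the same pass. Here is a minimal sketch; the helper name split_sentences is made up for illustration. Note that with this pattern any text after the last delimiter is dropped.

```python
import re

def split_sentences(text):
    # Hypothetical helper: lazily match up to each delimiter,
    # then trim the whitespace left over from the sentence gaps.
    return [m.strip() for m in re.findall(r'.*?[.!?]', text)]

s = "You! Are you Tom? I am Danny."
print(split_sentences(s))
# ['You!', 'Are you Tom?', 'I am Danny.']
```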
老娘就宠你
#3 · 2020-02-02 00:52

The easiest way is to use NLTK.

import nltk

s = "You! Are you Tom? I am Danny."
nltk.sent_tokenize(s)  # requires NLTK's Punkt tokenizer data to be installed

It returns a list of all your sentences without losing the delimiters.

我想做一个坏孩纸
#4 · 2020-02-02 01:00

If you prefer to use the split method rather than match, one solution is to split with a capturing group:

splitted = list(filter(None, re.split(r'(.*?[\.!\?])', s)))

filter(None, ...) removes any empty strings.

This works even if there are no spaces between sentences, and it also keeps a trailing sentence that ends with a different punctuation mark, such as a Unicode ellipsis, or with none at all.

It is even possible to keep your regex as is (after correcting the escaping and adding parentheses):

splitted = list(filter(None, re.split(r'([\.!\?])', s)))

Then merge the even- and odd-indexed elements and strip the extra spaces; see:

Python split() without removing the delimiter
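
The merge step could look roughly like the sketch below. The function name split_keep_delims is hypothetical, and the code assumes the plain delimiter pattern with one capturing group, so re.split returns text chunks and delimiters alternately.

```python
import re

def split_keep_delims(text):
    # Hypothetical helper. With one capturing group, re.split returns
    # [text, delim, text, delim, ..., tail], always an odd number of items.
    parts = re.split(r'([.!?])', text)

    # Pair each text chunk (even index) with the delimiter after it (odd index).
    sentences = [(chunk + delim).strip()
                 for chunk, delim in zip(parts[0::2], parts[1::2])]

    # The last item is whatever follows the final delimiter; keep it if non-blank.
    tail = parts[-1].strip()
    if tail:
        sentences.append(tail)
    return sentences

s = "You! Are you Tom? I am Danny."
print(split_keep_delims(s))
# ['You!', 'Are you Tom?', 'I am Danny.']
```

Unlike the filter-based one-liner, this keeps the pairing explicit, so a trailing fragment without punctuation is kept as its own item.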

Melony?
#5 · 2020-02-02 01:03

Where splitting on zero-length matches is supported, you can achieve this by matching an empty string preceded by one of the delimiters:

(?<=[.!?])

Demo: https://regex101.com/r/ZLDXr1/1

Unfortunately, Python versions before 3.7 do not support splitting on zero-length matches (re.split gained that ability in 3.7). The solution is still useful in other languages that support lookbehinds.
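
For reference, since Python 3.7 re.split accepts patterns that can match the empty string, so the lookbehind can be tried directly on newer interpreters. A minimal sketch, assuming Python 3.7 or later:

```python
import re

s = "You! Are you Tom? I am Danny."

# Requires Python 3.7+: the lookbehind matches the empty position
# right after each delimiter, and re.split cuts the string there.
pieces = re.split(r'(?<=[.!?])', s)

# The match at the very end of the text leaves an empty last piece,
# and the others keep their leading space, so filter and trim.
sentences = [p.strip() for p in pieces if p.strip()]

print(sentences)
# ['You!', 'Are you Tom?', 'I am Danny.']
```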

However, judging by your input/output samples, you actually need to split on the whitespace preceded by one of the delimiters. So the regex would be:

(?<=[.!?])\s+

Demo: https://regex101.com/r/ZLDXr1/2

Python demo: https://ideone.com/z6nZi5

If the spaces are optional, the re.findall solution suggested by @Psidom is the best one, I believe.

唯我独甜
#6 · 2020-02-02 01:07

Strictly speaking, you don't want to split on '!?.', but rather on the whitespace that follows those characters. The following will work:

>>> import re
>>> s = "You! Are you Tom? I am Danny."
>>> re.split(r'(?<=[.!?])\s+', s)
['You!', 'Are you Tom?', 'I am Danny.']

This splits on whitespace, but only if it is preceded by either a ., !, or ? character.
