How can I split at word boundaries with regexes?

Posted 2020-04-10 01:55

I'm trying to do this:

import re
sentence = "How are you?"
print(re.split(r'\b', sentence))

The result is:

[u'How are you?']

I want something like [u'How', u'are', u'you', u'?']. How can this be achieved?

Tags: python regex nlp
2 answers
放荡不羁爱自由
Reply #2 · 2020-04-10 02:33
import re
split = re.findall(r"[\w']+|[.,!?;]", "How are you?")
print(split)

Output:

['How', 'are', 'you', '?']


Regex Explanation:

"[\w']+|[.,!?;]"

    1st Alternative: [\w']+
        [\w']+ match a single character present in the list below
            Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
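            \w match any word character ([a-zA-Z0-9_] in ASCII mode; for str patterns in Python 3 it also matches Unicode letters and digits unless re.ASCII is set)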
            ' the literal character '
    2nd Alternative: [.,!?;]
        [.,!?;] match a single character present in the list below
            .,!?; a single character in the list .,!?; literally
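
To see how the two alternatives behave on other input, here is a small sketch. The helper name tokenize and the extra sample sentences are made up for illustration; only the pattern comes from the answer above.

```python
import re

# Pattern from the answer: a run of word characters/apostrophes,
# or a single character from the listed punctuation marks.
TOKEN_RE = re.compile(r"[\w']+|[.,!?;]")

def tokenize(text):
    """Return the word and punctuation tokens found in text (hypothetical helper)."""
    return TOKEN_RE.findall(text)

print(tokenize("How are you?"))
print(tokenize("Don't panic; it's fine."))
# ':' '(' and ')' are in neither alternative, so findall skips them.
print(tokenize("Wait: what (now)?"))
```

Note that any character not covered by either alternative is dropped silently, so the punctuation list may need extending for real text.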
疯言疯语
Reply #3 · 2020-04-10 02:41

Unfortunately, before Python 3.7, re.split could not split on empty matches, and \b only ever matches the empty string.

To get around this, you would need to use findall instead of split.
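
For reference, here is a sketch of what split does on newer interpreters. As far as I know, from Python 3.7 onward re.split accepts patterns that can match the empty string, so splitting on \b does cut the sentence, though it also returns the whitespace pieces and an empty leading string. The filtering step is my own addition.

```python
import re

sentence = "How are you?"

# Assumes Python 3.7+, where re.split splits on empty matches such as \b.
pieces = re.split(r'\b', sentence)
print(pieces)

# Drop empty and whitespace-only pieces to keep just words and punctuation.
tokens = [p for p in pieces if p.strip()]
print(tokens)
```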

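Actually, \b just means a word boundary: a position with a word character on one side and no word character on the other.

It is equivalent to (?<=\w)(?!\w)|(?<!\w)(?=\w). The shorter form (?<=\w)(?=\W)|(?<=\W)(?=\w) is close, but it misses boundaries at the very start or end of the string, because \W needs an actual character to match.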
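
A quick way to check an equivalence like this is to list the positions where each pattern matches. In this sketch the helper match_positions is hypothetical, and the negative-lookaround spelling is my own addition for comparison. The \W-based spelling needs a real character on each side, so it should skip position 0.

```python
import re

sentence = "How are you?"

def match_positions(pattern, text):
    """List the start index of every match of pattern in text (hypothetical helper)."""
    return [m.start() for m in re.finditer(pattern, text)]

# Built-in word boundary.
builtin = match_positions(r'\b', sentence)
# Negative lookarounds also succeed at the start and end of the string.
lookaround = match_positions(r'(?<=\w)(?!\w)|(?<!\w)(?=\w)', sentence)
# \W requires an actual non-word character, so string edges are missed.
strict = match_positions(r'(?<=\w)(?=\W)|(?<=\W)(?=\w)', sentence)

print(builtin)
print(lookaround)
print(strict)
```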

That means the following code would work (note that it keeps the runs of whitespace as tokens too):

import re
sentence = "How are you?"
print(re.findall(r'\w+|\W+', sentence))
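
A short sketch of what this returns, plus a variant that leaves the whitespace out. The second pattern is my own assumption about the output the question asks for, not part of this answer.

```python
import re

sentence = "How are you?"

# The answer's pattern: alternating runs of word and non-word characters.
with_spaces = re.findall(r'\w+|\W+', sentence)
print(with_spaces)

# Variant: skip whitespace, keep runs of other non-word characters.
without_spaces = re.findall(r'\w+|[^\w\s]+', sentence)
print(without_spaces)
```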