Splitting a string into words and punctuation

Posted 2019-01-02 18:30

Question:

I'm trying to split a string up into words and punctuation, adding the punctuation to the list produced by the split.

For instance:

>>> c = "help, me"
>>> print c.split()
['help,', 'me']

What I really want the list to look like is:

['help', ',', 'me']

So, I want the string split at whitespace with the punctuation split from the words.

I've tried to parse the string first and then run the split:

>>> separatedPunctuation = ""
>>> for character in c:
...     if character in ".,;!?":
...         outputCharacter = " %s" % character
...     else:
...         outputCharacter = character
...     separatedPunctuation += outputCharacter
...
>>> print separatedPunctuation
help , me
>>> print separatedPunctuation.split()
['help', ',', 'me']

This produces the result I want, but is painfully slow on large files.

Is there a way to do this more efficiently?

Answer 1:

This is more or less the way to do it:

>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']

The trick is not to think about where to split the string, but about what to include in the tokens.

Caveats:

  • The underscore (_) is considered an inner-word character. Replace \w if you don't want that (see the sketch after this list).
  • This will not work with (single) quotes in the string.
  • Put any additional punctuation marks you want to use in the right half of the regular expression.
  • Anything not explicitly mentioned in the regex is silently dropped.
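For example, here are two variants of the pattern above (the sample text is my own illustration): the first spells out the character class so underscores are excluded, the second appends a catch-all \S alternative so stray characters become their own tokens instead of being dropped.

>>> import re
>>> text = "Hello, I'm a (weird) string!"
>>> re.findall(r"[A-Za-z0-9']+|[.,!?;]", text)   # explicit class, no underscores; parens still dropped
['Hello', ',', "I'm", 'a', 'weird', 'string', '!']
>>> re.findall(r"[\w']+|[.,!?;]|\S", text)       # \S catches everything else as single tokens
['Hello', ',', "I'm", 'a', '(', 'weird', ')', 'string', '!']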


Answer 2:

Here is a Unicode-aware version:

re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
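A quick illustration (the sample text is my own; Python 3 shown, where re.UNICODE is already the default for str patterns, so passing it is harmless):

>>> import re
>>> re.findall(r"\w+|[^\w\s]", "Aujourd'hui, mon résumé est prêt!", re.UNICODE)
['Aujourd', "'", 'hui', ',', 'mon', 'résumé', 'est', 'prêt', '!']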



Answer 3:

In Perl-style regular expression syntax, \b matches a word boundary. This should come in handy for a regex-based split.

Edit: I have been informed by hop that "empty matches" do not work in the split function of Python's re module. I will leave this here as information for anyone else getting stumped by this "feature".
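For what it's worth, re.split() did eventually gain support for zero-width matches in Python 3.7, so the \b idea works on modern versions (older ones either ignored such matches or raised ValueError); you still have to clean up the empty strings and whitespace:

>>> import re
>>> re.split(r'\b', 'help, me')     # Python 3.7+
['', 'help', ', ', 'me', '']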



Answer 4:

Here's my entry.

I have my doubts about how well this will hold up in terms of efficiency, or whether it catches all cases (note the "!!!" grouped together; this may or may not be a good thing).

>>> import re
>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"
>>> l = [item for item in map(str.strip, re.split(r"(\W+)", s)) if len(item) > 0]
>>> l
['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']

One obvious optimization would be to compile the regex beforehand (using re.compile) if you're going to do this on a line-by-line basis.
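A minimal sketch of that optimization (the file name and the per-line processing are placeholders):

import re

separator = re.compile(r"(\W+)")   # compiled once, reused for every line

with open("input.txt") as f:       # "input.txt" is just a stand-in
    for line in f:
        tokens = [t.strip() for t in separator.split(line) if t.strip()]
        print(tokens)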



Answer 5:

Here's a minor update to your implementation. If your trying to doing anything more detailed I suggest looking into the NLTK that le dorfier suggested.

This might only be a little faster, since ''.join() is used in place of +=, which is generally faster for building strings.

import string

d = "Hello, I'm a string!"

result = []
word = ''

for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            word = ''.join([word, char])
    else:
        if word:
            result.append(word)
            word = ''

# flush the last word if the string doesn't end in whitespace
if word:
    result.append(word)

print result
['Hello', ',', "I'm", 'a', 'string', '!']


Answer 6:

I think you can find all the help you can imagine in NLTK, especially since you are using Python. There's a good, comprehensive discussion of this issue in its tutorial.



Answer 7:

I came up with a way to tokenize all words and \W+ patterns using \b, without hardcoding a punctuation set:

>>> import re
>>> sentence = 'Hello, world!'
>>> tokens = [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', sentence)]
>>> tokens
['Hello', ',', 'world', '!']

Here .*?\S.*? matches the shortest run containing at least one non-space character, and $ is added so the last token is still matched when it's a punctuation symbol at the end of the string.

Note the following though -- this will group punctuation that consists of more than one symbol:

>>> print [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"Oh no", she said')]
['Oh', 'no', '",', 'she', 'said']

Of course, you can find and split such groups with:

>>> for token in [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"You can", she said')]:
...     print re.findall(r'(?:\w+|\W)', token)

['You']
['can']
['"', ',']
['she']
['said']
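If you'd rather have one flat list directly, the two steps combine into a nested comprehension:

>>> tokens = re.findall(r'\b.*?\S.*?(?:\b|$)', '"You can", she said')
>>> [u for t in tokens for u in re.findall(r'(?:\w+|\W)', t.strip())]
['You', 'can', '"', ',', 'she', 'said']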


Answer 8:

Try this:

string_big = "One of Python's coolest features is the string format operator  This operator is unique to strings"
my_list = []
x = len(string_big)
position_of_space = 0
while position_of_space < x:
    for i in range(position_of_space, x):
        if string_big[i] == ' ':
            break
    else:
        i = x  # no space left; take the rest of the string
    word = string_big[position_of_space:i]
    if word:  # skip empty tokens from consecutive spaces
        print word
        my_list.append(word)
    position_of_space = i + 1

print my_list
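Note that, apart from the guard against empty tokens, this loop reproduces what the built-in str.split() already does:

>>> string_big.split()
['One', 'of', "Python's", 'coolest', 'features', 'is', 'the', 'string', 'format', 'operator', 'This', 'operator', 'is', 'unique', 'to', 'strings']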


Answer 9:

If you are going to work in English (or some other common language), you can use NLTK (there are many other tools for this, such as FreeLing).

import nltk
sentence = "help, me"
nltk.word_tokenize(sentence)
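For reference, word_tokenize needs the Punkt tokenizer models downloaded once (the exact resource name can vary between NLTK versions), and the call then returns exactly the list the question asks for:

>>> import nltk
>>> # nltk.download('punkt')   # one-time setup, if the models are missing
>>> nltk.word_tokenize("help, me")
['help', ',', 'me']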


Answer 10:

Have you tried using a regex?

http://docs.python.org/library/re.html#re-syntax


By the way, why do you need the "," in the second list at all? You already know that one comes after each piece of text, i.e. [0] "," [1] "," and so on, so if you want the "," you can just add it after each item when you use the array.


