Question:
I'm trying to split a string up into words and punctuation, adding the punctuation to the list produced by the split.
For instance:
>>> c = "help, me"
>>> print c.split()
['help,', 'me']
What I really want the list to look like is:
['help', ',', 'me']
So, I want the string split at whitespace with the punctuation split from the words.
I've tried to parse the string first and then run the split:
>>> separatedPunctuation = ""
>>> for character in c:
...     if character in ".,;!?":
...         outputCharacter = " %s" % character
...     else:
...         outputCharacter = character
...     separatedPunctuation += outputCharacter
>>> print separatedPunctuation
help , me
>>> print separatedPunctuation.split()
['help', ',', 'me']
This produces the result I want, but is painfully slow on large files.
Is there a way to do this more efficiently?
Answer 1:
This is more or less the way to do it:
>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']
The trick is not to think about where to split the string, but what to include in the tokens.
Caveats:
- The underscore (_) is considered an inner-word character. Replace \w if you don't want that.
- This will not work with (single) quotes in the string.
- Put any additional punctuation marks you want to use in the right half of the regular expression (see the sketch after this list).
- Anything not explicitly mentioned in the re is silently dropped.
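For instance, a variant of the same pattern that also treats double quotes and parentheses as punctuation (just a sketch; pick whatever marks you need):
>>> import re
>>> re.findall(r"[\w']+|[.,!?;\"()-]", 'He said, "help me" (please)!')
['He', 'said', ',', '"', 'help', 'me', '"', '(', 'please', ')', '!']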
Answer 2:
Here is a Unicode-aware version:
re.findall(r"\w+|[^\w\s]", text, re.UNICODE)
The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']); the second catches individual non-word characters, ignoring whitespace.
Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.
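For instance, a quick illustrative check of that single-quote behaviour:
>>> import re
>>> re.findall(r"\w+|[^\w\s]", "I'm a string!", re.UNICODE)
['I', "'", 'm', 'a', 'string', '!']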
Answer 3:
In Perl-style regular expression syntax, \b matches a word boundary. This should come in handy for doing a regex-based split.
edit: I have been informed by hop that "empty matches" do not work in the split function of Python's re module. I will leave this here as information for anyone else getting stumped by this "feature".
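For what it's worth, this changed in Python 3.7: re.split now accepts patterns that can match an empty string, so a \b-based split does work there. A minimal sketch, assuming Python 3.7 or later:
>>> import re
>>> re.split(r'\b', 'help, me')
['', 'help', ', ', 'me', '']
>>> [t.strip() for t in re.split(r'\b', 'help, me') if t.strip()]
['help', ',', 'me']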
Answer 4:
Here's my entry.
I have my doubts about how well this holds up efficiency-wise, or whether it catches all cases (note the "!!!" grouped together; this may or may not be a good thing).
>>> import re
>>> import string
>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"
>>> l = [item for item in map(string.strip, re.split(r"(\W+)", s)) if len(item) > 0]
>>> l
['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']
>>>
One obvious optimization would be to compile the regex beforehand (using re.compile) if you're going to be doing this on a line-by-line basis.
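A minimal sketch of that optimization (the input file name is just a placeholder):
import re
import string

token_re = re.compile(r"(\W+)")  # compiled once, reused for every line

with open("input.txt") as f:  # hypothetical input file
    for line in f:
        tokens = [item for item in map(string.strip, token_re.split(line)) if len(item) > 0]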
Answer 5:
Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into the NLTK that le dorfier suggested.
This might only be a little faster, since ''.join() is used in place of +=; joining is known to be faster than repeated concatenation.
import string

d = "Hello, I'm a string!"

result = []
word = ''

for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            # punctuation: flush the current word, then emit the mark itself
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            word = ''.join([word, char])
    else:
        # whitespace ends the current word
        if word:
            result.append(word)
            word = ''

# flush a trailing word, in case the string doesn't end in punctuation or whitespace
if word:
    result.append(word)

print result
['Hello', ',', "I'm", 'a', 'string', '!']
Answer 6:
I think you can find all the help you can imagine in the NLTK, especially since you are using Python. There's a good comprehensive discussion of this issue in the tutorial.
Answer 7:
I came up with a way to tokenize all words and \W+ patterns using \b, which doesn't need hardcoding:
>>> import re
>>> sentence = 'Hello, world!'
>>> tokens = [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', sentence)]
>>> tokens
['Hello', ',', 'world', '!']
Here .*?\S.*? is a pattern matching anything that is not a space, and $ is added to match the last token in a string if it's a punctuation symbol.
Note the following though -- this will group punctuation that consists of more than one symbol:
>>> print [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"Oh no", she said')]
['Oh', 'no', '",', 'she', 'said']
Of course, you can find and split such groups with:
>>> for token in [t.strip() for t in re.findall(r'\b.*?\S.*?(?:\b|$)', '"You can", she said')]:
... print re.findall(r'(?:\w+|\W)', token)
['You']
['can']
['"', ',']
['she']
['said']
Answer 8:
Try this:
string_big = "One of Python's coolest features is the string format operator This operator is unique to strings"
my_list = []
x = len(string_big)
poistion_ofspace = 0

while poistion_ofspace < x:
    # scan forward to the next space (or run off the end of the string)
    for i in range(poistion_ofspace, x):
        if string_big[i] == ' ':
            break
        else:
            continue
    # note: the slice includes the character at i, so each token keeps its trailing space
    print string_big[poistion_ofspace:(i+1)]
    my_list.append(string_big[poistion_ofspace:(i+1)])
    poistion_ofspace = i+1

print my_list
Answer 9:
If you are going to work in English (or some other common languages), you can use NLTK (there are many other tools to do this such as FreeLing).
import nltk
sentence = "help, me"
nltk.word_tokenize(sentence)
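This should return ['help', ',', 'me']. If the tokenizer data isn't installed yet, run nltk.download('punkt') once first.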
Answer 10:
Have you tried using a regex?
http://docs.python.org/library/re.html#re-syntax
By the way, why do you need the "," in the second list? You know that one comes after each piece of text, i.e.
[0]
","
[1]
","
So if you want to add the "," you can just do it after each iteration when you use the array, as in the sketch below.
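A minimal sketch of that idea (it assumes, as the answer does, that a "," follows every word, which real text usually doesn't guarantee):
# split on whitespace with the commas stripped out...
words = "help, me".replace(",", "").split()

# ...then re-insert a "," after each word when consuming the list
with_commas = []
for w in words:
    with_commas.append(w)
    with_commas.append(",")

print with_commas[:-1]   # drop the trailing "," -> ['help', ',', 'me']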