Split a string into a list, leaving accented chars

Posted 2019-05-27 10:33

If I have the string:

"O João foi almoçar :) ." 

how do I best split it into a list of words in Python, like so:

['O','João', 'foi', 'almoçar', ':)']

?

Thanks :)

Sofia

2 answers
放荡不羁爱自由
#2 · 2019-05-27 10:45
>>> import string
>>> s = "O João foi almoçar :) ."
>>> [i for i in s.split(' ') if i not in string.punctuation]
['O', 'João', 'foi', 'almoçar', ':)']
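One caveat: `i not in string.punctuation` is a substring test, so a token such as `()` or `./` would also be dropped, because it happens to appear inside `string.punctuation`. A minimal sketch of a stricter variant, assuming you only want to drop tokens that are exactly one punctuation character (the names `PUNCT` and `split_words` are hypothetical, not from the answer):

```python
import string

# A set of single characters, so membership is an exact match,
# not a substring search as with the plain string.
PUNCT = set(string.punctuation)

def split_words(s):
    # split() with no argument also collapses runs of whitespace
    return [tok for tok in s.split() if tok not in PUNCT]

print(split_words("O João foi almoçar :) ."))
```

With a set, `:)` is kept because it is not a single punctuation character, while the lone `.` is still removed.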
爷、活的狠高调
#3 · 2019-05-27 10:56

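If each punctuation mark is its own space-separated token, as in your example, then it's easy:

>>> import string
>>> list(filter(lambda s: s not in string.punctuation, "O João foi almoçar :) .".split()))
['O', 'João', 'foi', 'almoçar', ':)']

(Python 2 displays the same list with escaped UTF-8 bytes, e.g. 'Jo\xc3\xa3o'; the accented characters are still intact.)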

If this is not the case, you can define a dictionary of smileys like this (you'll need to add more):

d = {':)': '<HAPPY_SMILEY>', ':(': '<SAD_SMILEY>'}

and then replace each instance of a smiley with its placeholder, which contains none of the punctuation we are about to strip (we'll consider < and > not to be punctuation):

for smiley, placeholder in d.items():
    s = s.replace(smiley, placeholder)

Which gets us to "O João foi almoçar <HAPPY_SMILEY> .".

We then strip punctuation:

s = ''.join(c for c in s if c not in '.,!')

Which gives us "O João foi almoçar <HAPPY_SMILEY> " (the trailing space is harmless, since split() discards it later).

We then restore the smileys:

for smiley, placeholder in d.items():
    s = s.replace(placeholder, smiley)

Which we then split:

s = s.split()

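Giving us our final result: ['O', 'João', 'foi', 'almoçar', ':)'] (Python 2 displays the accented words as escaped UTF-8 bytes, e.g. 'Jo\xc3\xa3o', but the characters are preserved).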

Putting it all together into a function:

def split_special(s):
    # Map each smiley to a placeholder that contains none of the stripped punctuation
    d = {':)': '<HAPPY_SMILEY>', ':(': '<SAD_SMILEY>'}
    # Protect the smileys
    for smiley, placeholder in d.items():
        s = s.replace(smiley, placeholder)
    # Strip the punctuation characters
    s = ''.join(c for c in s if c not in '.,!')
    # Restore the smileys
    for smiley, placeholder in d.items():
        s = s.replace(placeholder, smiley)
    return s.split()
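As an alternative to the placeholder round trip, a regular expression can pick out smileys and words directly, which also copes with punctuation glued to a word. This is a different technique from the one above, sketched under the assumption of Python 3, where `\w` in a str pattern matches accented letters. The pattern and the names `TOKEN_RE` and `tokenize` are hypothetical, and the smiley set is deliberately small, so you would need to extend it.

```python
import re
import unicodedata

# Hypothetical pattern: try a small set of smileys first, then runs of word characters.
TOKEN_RE = re.compile(r"[:;]-?[()DP]|\w+")

def tokenize(s):
    # NFC-normalise so that "a" + combining tilde becomes the single letter "ã",
    # which \w matches; a bare combining mark would otherwise split the word.
    s = unicodedata.normalize("NFC", s)
    # No capture groups in the pattern, so findall returns the full matches.
    return TOKEN_RE.findall(s)

print(tokenize("O João foi almoçar:)."))
```

Anything that matches neither alternative, such as the final full stop, is simply skipped, so no separate stripping step is needed.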