Remove punctuation from Unicode formatted strings

I have a function that removes punctuation from a list of strings:

import re

def strip_punctuation(input):
    x = 0
    for word in input:
        input[x] = re.sub(r'[^A-Za-z0-9 ]', "", input[x])
        x += 1
    return input

I recently modified my script to use Unicode strings so I could handle other non-Western characters. This function breaks when it encounters these special characters and just returns empty Unicode strings. How can I reliably remove punctuation from Unicode formatted strings?

Tags: python unicode

4 Answers

路过你的时光

#2 · 2019-01-01 09:32

You could use the unicode.translate() method (Python 2):

import unicodedata
import sys

tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                      if unicodedata.category(unichr(i)).startswith('P'))
def remove_punctuation(text):
    return text.translate(tbl)

You could also use r'\p{P}', which is supported by the third-party regex module:

import regex as re

def remove_punctuation(text):
    return re.sub(ur"\p{P}+", "", text)
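
If installing the third-party regex module is not an option, a rough stdlib-only alternative for Python 3 is sketched below. It is not equivalent to \p{P}: it removes everything that is neither a word character nor whitespace, so symbols such as '$' are removed too, underscores are kept, and, as far as I know, combining marks used by some scripts can be dropped as well. The name remove_non_word is made up for this sketch.

```python
import re

# Assumption: Python 3, where str patterns are Unicode-aware by default.
# [^\w\s] matches any character that is neither a word character nor whitespace.
NON_WORD = re.compile(r'[^\w\s]')

def remove_non_word(text):
    """Drop every character that is not a word character or whitespace."""
    return NON_WORD.sub('', text)

samples = ['Hello, world!', '你好，世界！', 'snake_case $5']
cleaned = [remove_non_word(s) for s in samples]
```

If only true punctuation should go, the category-based approaches in this thread are the safer choice.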

公子世无双

#3 · 2019-01-01 09:34

A slightly shorter version based on Daenyth's answer:

import unicodedata

def strip_punctuation(text):
    """
    >>> strip_punctuation(u'something')
    u'something'

    >>> strip_punctuation(u'something.,:else really')
    u'somethingelse really'
    """
    punctuation_cats = set(['Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po'])
    return ''.join(x for x in text
                   if unicodedata.category(x) not in punctuation_cats)

input_data = [u'something', u'something, else', u'nothing.']
without_punctuation = map(strip_punctuation, input_data)

闭嘴吧你

#4 · 2019-01-01 09:45

If you want to use J.F. Sebastian's solution in Python 3:

import unicodedata
import sys

tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                      if unicodedata.category(chr(i)).startswith('P'))
def remove_punctuation(text):
    return text.translate(tbl)
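
As a quick usage check of the Python 3 table above, the sketch below applies it to a few sample strings. The samples are made up for illustration, and the range uses sys.maxunicode + 1 so that the last code point is also considered. It builds a new list instead of modifying the input in place.

```python
import sys
import unicodedata

# Map every code point in a punctuation category (P*) to None,
# which tells str.translate() to delete it.
PUNCT_TABLE = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

def remove_punctuation(text):
    return text.translate(PUNCT_TABLE)

# Hypothetical sample data mixing Latin and non-Latin text.
samples = ['Hello, world!', '你好，世界！', '¿Qué tal?']
cleaned = [remove_punctuation(s) for s in samples]
```

The table is built once at import time, so each call afterwards is a single translate() pass over the string.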

高级女魔头

#5 · 2019-01-01 09:52

You can iterate through the string using the unicodedata module's category function to determine if the character is punctuation.

For possible outputs of category, see unicode.org's doc on General Category Values

from unicodedata import category as cat

def strip_punctuation(word):
    # Keep only the characters whose category is not punctuation (P*).
    return "".join(char for char in word if not cat(char).startswith('P'))

filtered = [strip_punctuation(word) for word in input]
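
To see which General Category values come back in practice, a small Python 3 sketch such as the following can help. The sample characters are chosen only for illustration. Note that characters like '$' and '+' are classed as symbols (S*), not punctuation (P*), so a filter based on 'P' leaves them in place.

```python
import unicodedata

# Look up the General Category of a few sample characters.
sample_chars = [',', '，', '—', '«', '$', '+', 'a', '字']
categories = {ch: unicodedata.category(ch) for ch in sample_chars}

for ch, cat_name in categories.items():
    print(repr(ch), cat_name)

# Characters a 'P'-prefix filter would remove.
punctuation_only = [ch for ch, c in categories.items() if c.startswith('P')]
```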

Additionally, make sure that you're handling encodings and types correctly. This presentation is a good place to start: http://bit.ly/unipain
