I have a function that removes punctuation from a list of strings:
import re

def strip_punctuation(input):
    x = 0
    for word in input:
        input[x] = re.sub(r'[^A-Za-z0-9 ]', "", input[x])
        x += 1
    return input
I recently modified my script to use Unicode strings so I could handle non-Western characters. The function breaks when it encounters these characters and just returns empty Unicode strings. How can I reliably remove punctuation from Unicode strings?
You could use the unicode.translate() method:

You could also use r'\p{P}', which is supported by the regex module:

A slightly shorter version, based on Daenyth's answer:
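The snippet for this shorter version is not shown here. As a rough sketch only, assuming it filters characters with unicodedata.category (the approach described in Daenyth's answer), a compact Python 3 form might look like the following. The function and parameter names are illustrative, not taken from the original answer.

```python
import unicodedata

def strip_punctuation(words):
    """Return a new list with Unicode punctuation removed from each string.

    Assumption: "punctuation" means any character whose Unicode general
    category starts with "P" (Pc, Pd, Ps, Pe, Pi, Pf, Po).
    """
    return [
        "".join(ch for ch in word
                if not unicodedata.category(ch).startswith("P"))
        for word in words
    ]
```

Unlike the function in the question, this returns a new list instead of modifying the input in place, and it leaves letters from any script untouched.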
If you want to use J.F. Sebastian's solution in Python 3:
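The code for this answer is missing here, so the following is only a sketch of how a translate-based approach could look in Python 3, where str.translate() accepts a dict mapping code points to None for deletion. The names PUNCT_TABLE and remove_punctuation are hypothetical.

```python
import sys
import unicodedata

# Build the deletion table once: every code point whose general category
# starts with "P" (punctuation) maps to None, which str.translate() removes.
PUNCT_TABLE = dict.fromkeys(
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
)

def remove_punctuation(text):
    """Return text with all Unicode punctuation characters deleted."""
    return text.translate(PUNCT_TABLE)
```

Building the table scans the whole Unicode range, so it is worth doing once at import time and reusing it for every string.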
You can iterate through the string, using the unicodedata module's category function to determine whether each character is punctuation.

For the possible outputs of category, see unicode.org's documentation on General Category Values.

Additionally, make sure that you're handling encodings and types correctly. This presentation is a good place to start: http://bit.ly/unipain
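To get a feel for what the category function mentioned above returns, a small helper like this can be handy when deciding which categories to strip. The name describe is made up for illustration.

```python
import unicodedata

def describe(text):
    """Pair each character with its Unicode general category code."""
    return [(ch, unicodedata.category(ch)) for ch in text]

# Punctuation categories all start with "P"; note that currency and math
# symbols fall under "S" categories and are not treated as punctuation.
for ch, cat in describe("a1 ,-($+"):
    print(repr(ch), cat)
```

If symbols such as "$" or "+" should also be removed, the filter would need to check for categories starting with "S" as well as "P".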