I have a function that removes punctuation from a list of strings:
import re

def strip_punctuation(input):
    x = 0
    for word in input:
        # Keeps only ASCII letters, digits and spaces; everything else is removed
        input[x] = re.sub(r'[^A-Za-z0-9 ]', "", input[x])
        x += 1
    return input
I recently modified my script to use Unicode strings so I could handle other non-Western characters. This function breaks when it encounters these special characters and just returns empty Unicode strings. How can I reliably remove punctuation from Unicode formatted strings?
You could use the unicode.translate() method:
import unicodedata
import sys

# Map every code point in a punctuation category (P*) to None, i.e. "delete"
tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                    if unicodedata.category(unichr(i)).startswith('P'))

def remove_punctuation(text):
    return text.translate(tbl)
You could also use r'\p{P}', which is supported by the regex module:
import regex as re

def remove_punctuation(text):
    return re.sub(ur"\p{P}+", "", text)
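If installing the third-party regex module is not an option, a rough stand-in with the standard re module in Python 3 is to remove everything that is neither a word character nor whitespace. This is only a sketch and not equivalent to \p{P}: it also strips symbols such as $ and +, it keeps the underscore, and as far as I know it can drop combining marks that some scripts need. The function name remove_non_word is made up for this example.

```python
import re

def remove_non_word(text):
    # In Python 3, \w is Unicode-aware for str patterns, so accented
    # letters and CJK characters survive. [^\w\s] matches anything that
    # is neither a word character nor whitespace.
    return re.sub(r'[^\w\s]', '', text)

print(remove_non_word('héllo, wörld!'))
print(remove_non_word('snake_case: ok?'))
```

If underscores or symbols matter for your data, the category-based approaches in the other answers give finer control.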
If you want to use J.F. Sebastian's solution in Python 3:
import unicodedata
import sys

# Same table as above; range/chr replace Python 2's xrange/unichr
tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                    if unicodedata.category(chr(i)).startswith('P'))

def remove_punctuation(text):
    return text.translate(tbl)
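For a quick sanity check of the table approach in Python 3, something like the following should work. The sample strings and the name PUNCT_TABLE are made up for illustration. Building the table scans every code point once, so it is worth doing at import time rather than on every call.

```python
import sys
import unicodedata

# Keys are the ordinals of all punctuation characters (category P*);
# the value None tells str.translate() to delete the character.
PUNCT_TABLE = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

def remove_punctuation(text):
    return text.translate(PUNCT_TABLE)

# Hypothetical sample data mixing ASCII and non-ASCII punctuation
samples = ['Hello, world!', '¿Qué tal?', '日本語、テスト。']
cleaned = [remove_punctuation(s) for s in samples]
print(cleaned)  # ['Hello world', 'Qué tal', '日本語テスト']
```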
You can iterate through the string, using the unicodedata module's category function to determine whether each character is punctuation. For the possible outputs of category, see unicode.org's doc on General Category Values.
from unicodedata import category as cat

def strip_punctuation(word):
    # Keep only the characters whose category is NOT punctuation (P*)
    return "".join(char for char in word if not cat(char).startswith('P'))

filtered = [strip_punctuation(word) for word in input]
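One thing to watch for: several ASCII characters that people tend to think of as punctuation are classified by Unicode as symbols (category S*), not punctuation. For example, $ is Sc and + and = are Sm, so a check on 'P' alone leaves them in place. If you want those gone as well, a minimal variation could look like this (the function name is just for this example):

```python
from unicodedata import category

def strip_punctuation_and_symbols(word):
    # Drop characters whose general category starts with P (punctuation)
    # or S (symbols: math, currency, modifier, other).
    return "".join(ch for ch in word if category(ch)[0] not in "PS")

print(category("$"), category("+"), category("!"))  # Sc Sm Po
print(strip_punctuation_and_symbols("a+b=c!"))      # abc
```

Whether symbols should be removed depends on your data, so treat this as an option rather than a drop-in replacement.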
Additionally, make sure that you're handling encodings and types correctly. This presentation is a good place to start: http://bit.ly/unipain
A slightly shorter version based on Daenyth's answer:
import unicodedata

def strip_punctuation(text):
    """
    >>> strip_punctuation(u'something')
    u'something'

    >>> strip_punctuation(u'something.,:else really')
    u'somethingelse really'
    """
    punctuation_cats = set(['Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po'])
    return ''.join(x for x in text
                   if unicodedata.category(x) not in punctuation_cats)

input_data = [u'something', u'something, else', u'nothing.']
without_punctuation = map(strip_punctuation, input_data)
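Note that the snippet above is written for Python 2. In Python 3, map() returns a lazy iterator instead of a list, and the doctest output no longer carries the u prefix. A possible Python 3 adaptation, with the category set hoisted into a module-level constant (the name PUNCTUATION_CATS is my own choice), might be:

```python
import unicodedata

# The seven punctuation general categories defined by Unicode
PUNCTUATION_CATS = frozenset(['Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po'])

def strip_punctuation(text):
    """
    >>> strip_punctuation('something.,:else really')
    'somethingelse really'
    """
    return ''.join(ch for ch in text
                   if unicodedata.category(ch) not in PUNCTUATION_CATS)

input_data = ['something', 'something, else', 'nothing.']
# A list comprehension yields a real list, unlike map() in Python 3
without_punctuation = [strip_punctuation(s) for s in input_data]
print(without_punctuation)  # ['something', 'something else', 'nothing']
```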