My goal is to remove all symbols from a string while preserving Unicode letters (alphabetic characters from any language). Suppose I have the following string:
carbon copolymers—III❏£\n12- Géotechnique\n
I want to remove the —, ❏ and £ characters between copolymers and \n. Based on what I read elsewhere, I thought I could use a regex that removes everything outside a chosen range of Unicode code points. The text in my file ranges from Latin to Russian and other scripts. However, the regex I've written below doesn't help.
>>> import re
>>> s = u'carbon copolymers—III❏£\n12- Géotechnique\n'
>>> re.sub(ur'[^\u0020-\u00FF\n]+', ' ', s)
There seem to be two problems with this approach:
1) Whatever Unicode range I pick still includes some symbols.
2) Sometimes, for a reason I don't understand, the returned result is completely different from what I expect.
Here's the result of the code above:
carbon copolymers\xe2\x80\x94III\n12- G\xc3\xa9otechnique\n
>>> print u'carbon copolymers\xe2\x80\x94III\n12- G\xc3\xa9otechnique\n'
carbon copolymersâIII
12- Géotechnique
Do you know a better way of doing this? Is there a full list of all symbols? Do you have any ideas other than regex?
Thank you
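
To make problem 1 concrete: £ is U+00A3, which falls inside \u0020-\u00FF, so a range-based pattern like the one above cannot tell it apart from a letter such as é (U+00E9). One non-regex direction is to classify each character by its Unicode general category with the standard unicodedata module. The following is only a minimal sketch under assumptions: strip_symbols and its keep parameter are hypothetical names invented for this illustration, and treating every "P" (punctuation) and "S" (symbol) category as removable is an assumption about what counts as a symbol here.

```python
import unicodedata


def strip_symbols(text, keep=u''):
    """Replace punctuation/symbol characters with a space.

    Hypothetical helper: 'keep' lists characters that should survive
    even though Unicode classifies them as punctuation or symbols.
    """
    out = []
    for ch in text:
        # General category is a two-letter code, e.g. 'Lu' (uppercase letter),
        # 'Pd' (dash punctuation), 'Sc' (currency symbol), 'So' (other symbol).
        category = unicodedata.category(ch)
        if category[0] in ('P', 'S') and ch not in keep:
            out.append(u' ')
        else:
            out.append(ch)
    return u''.join(out)


# Same sample string as in the question, written with escapes so the
# source file encoding does not matter.
s = u'carbon copolymers\u2014III\u274f\u00a3\n12- G\u00e9otechnique\n'

cleaned = strip_symbols(s)
cleaned_keep_hyphen = strip_symbols(s, keep=u'-')
```

Note that the ASCII hyphen in "12-" is also classified as punctuation, so it is replaced unless it is passed in keep. Letters, digits, spaces and newlines are left untouched because their categories start with L, N, Z and C.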