Remove all symbols while preserving string consist

Posted 2019-09-11 06:27

Question:

My goal is to remove all symbols from a string while preserving Unicode letters (alphabetical characters from any language). Suppose I have the following string:

carbon copolymers—III❏£\n12- Géotechnique\n

I want to remove the —, ❏ and £ characters between copolymers and \n. I looked at a related post and thought I should use a regex that removes all symbols, given the correct Unicode character ranges. The characters in my text file range from Latin to Russian and beyond. However, the regex code I've written below doesn't help.

>>> import re
>>> s = u'carbon copolymers—III❏£\n12- Géotechnique\n'
>>> re.sub(ur'[^\u0020-\u00FF\n]+', ' ', s)

There seem to be two problems with this method:

1) Unicode ranges that contain letters still include some symbols.

2) Sometimes, for a reason I can't identify, the returned result is totally different from what it is supposed to be.

Here's the result of the code above:

carbon copolymers\xe2\x80\x94III\n12- G\xc3\xa9otechnique\n
>>> print u'carbon copolymers\xe2\x80\x94III\n12- G\xc3\xa9otechnique\n'
carbon copolymersâIII
12- Géotechnique 
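
The first problem can be made concrete with the standard library's unicodedata module. The sketch below assumes Python 3 (the session above is Python 2), and the sample list is chosen only for illustration. It prints the Unicode general category of each character, showing that the U+0020 to U+00FF range used in the regex contains symbols such as £ alongside letters such as é.

```python
import unicodedata

# Characters from the example string, plus two more from the U+0020..U+00FF range.
samples = ['é', '£', '—', '❏', '©', 'Æ']

for ch in samples:
    # category() returns a two-letter Unicode general category, e.g. 'Ll' or 'Sc'.
    print(f'U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}')

# Letters have a category starting with 'L'; symbols start with 'S'.
in_latin1_range = [ch for ch in samples if 0x20 <= ord(ch) <= 0xFF]
symbols_in_range = [ch for ch in in_latin1_range
                    if unicodedata.category(ch).startswith('S')]
print(symbols_in_range)
```

The general categories starting with S (Sm, Sc, Sk, So) and P are the closest thing to a full list of symbols and punctuation, so filtering by category may be more reliable than filtering by code point range.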

Do you know a better way of doing this? Is there a full list of all symbols? Do you have any ideas other than regex?

Thank you

Answer 1:

I think I found a good solution (more than 99% robust, I believe) to the problem:

Well here's our new, horrific string:

s = u'carbon҂ ҉ copolymers—⿴٬ٯ٪III❏£\n12-ः׶ Ǣ ܊ܔ ۩۝۞ء܅۵Géotechnique▣ऀ\n'

And here's the resulting string:

u'carbon    copolymers   \u066f III  \n      \u01e2  \u0714    \u0621  G\xe9otechnique  \n'

All the remaining characters/words are in fact alphabetical characters from different languages. Done with almost no effort!

Here's the solution:

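import re
# Keep letters and whitespace; turn every other character into a space.
s = ''.join([c if c.isalpha() or c.isspace() else ' ' for c in s])
# Blank out the ASCII and Latin-1 punctuation/symbol ranges, then collapse runs of spaces.
s = re.sub(ur'[\u0020-\u0040]+|[\u005B-\u0060]+|[\u007B-\u00BF]+', ' ', s)
s = re.sub(r'[ ]+', ' ', s)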

Printed, the final string is:

carbon copolymers ٯ III  
Ǣ ܔ ء Géotechnique
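
For readers on Python 3, the three steps above can be sketched as a single function. This is a port under assumptions, not the answer's original code: the name strip_symbols is hypothetical, the ur'' prefix (Python 2 only) is replaced by a raw string, and every str is taken to be Unicode already.

```python
import re


def strip_symbols(text: str) -> str:
    """Hypothetical helper: keep letters and whitespace, drop everything else."""
    # Step 1: keep alphabetic characters and whitespace; replace the rest with a space.
    text = ''.join(c if c.isalpha() or c.isspace() else ' ' for c in text)
    # Step 2: blank out the ASCII / Latin-1 ranges used in the answer.
    # Python 3's re module understands \uXXXX escapes inside a raw str pattern.
    text = re.sub(r'[\u0020-\u0040]+|[\u005B-\u0060]+|[\u007B-\u00BF]+', ' ', text)
    # Step 3: collapse runs of spaces into a single space (newlines are untouched).
    return re.sub(r' +', ' ', text)


result = strip_symbols('carbon copolymers—III❏£\n12- Géotechnique\n')
print(repr(result))
```

With the string from the question, result is 'carbon copolymers III \n Géotechnique\n'. Note that digits are dropped as well, since str.isalpha() is false for them, and that combining marks (for example an accent stored as a separate code point) are not letters either, so normalizing the input to NFC with unicodedata.normalize first may help.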