-->

Stripping everything but alphanumeric chars from a string in Python

Posted 2018-12-31 17:09

Question:

What is the best way to strip all non alphanumeric characters from a string, using Python?

The solutions presented in the PHP variant of this question will probably work with some minor adjustments, but don't seem very 'pythonic' to me.

For the record, I don't just want to strip periods and commas (and other punctuation), but also quotes, brackets, etc.

Answer 1:

I just timed some functions out of curiosity. In these tests I'm removing non-alphanumeric characters from the string string.printable (part of the built-in string module).

$ python -m timeit -s \
    "import string" \
    "''.join(ch for ch in string.printable if ch.isalnum())"
10000 loops, best of 3: 57.6 usec per loop

$ python -m timeit -s \
    "import string" \
    "filter(str.isalnum, string.printable)"
10000 loops, best of 3: 37.9 usec per loop

$ python -m timeit -s \
    "import re, string" \
    "re.sub('[\W_]', '', string.printable)"
10000 loops, best of 3: 27.5 usec per loop

$ python -m timeit -s \
    "import re, string" \
    "re.sub('[\W_]+', '', string.printable)"
100000 loops, best of 3: 15 usec per loop

$ python -m timeit -s \
    "import re, string; pattern = re.compile('[\W_]+')" \
    "pattern.sub('', string.printable)"
100000 loops, best of 3: 11.2 usec per loop
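The same comparison can be scripted with the timeit module instead of the shell. This is only a minimal sketch, assuming Python 3 (where filter() returns a lazy iterator, so the result has to be joined back into a string). The function names and the loop count are illustrative, and the timings will differ by machine and Python version.

```python
import re
import string
import timeit

PATTERN = re.compile(r'[\W_]+')  # precompiled, as in the last run above


def strip_genexp(s):
    # Generator expression with str.isalnum.
    return ''.join(ch for ch in s if ch.isalnum())


def strip_filter(s):
    # Python 3: filter() is lazy, so join the result back into a str.
    return ''.join(filter(str.isalnum, s))


def strip_regex(s):
    # Remove runs of non-word characters and underscores.
    return PATTERN.sub('', s)


if __name__ == '__main__':
    data = string.printable
    loops = 10000  # arbitrary choice for this sketch
    for func in (strip_genexp, strip_filter, strip_regex):
        seconds = timeit.timeit(lambda: func(data), number=loops)
        print('%-14s %.2f usec per loop' % (func.__name__, seconds / loops * 1e6))
```

All three functions return the same string for string.printable; only their speed differs.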


Answer 2:

Regular expressions to the rescue:

import re
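re.sub(r'\W+', '', your_string)

By Python's definition, \W is equivalent to [^a-zA-Z0-9_] for ASCII matching, so the pattern removes everything except letters, digits, and _. Note that underscores survive; use [\W_]+ to strip them too.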
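A small illustration of how the underscore is treated, using a made-up sample string:

```python
import re

sample = "snake_case: 100% (ok)!"  # hypothetical input for illustration

# \W+ leaves the underscore in place because _ counts as a word character.
keeps_underscore = re.sub(r'\W+', '', sample)

# Adding _ to the character class strips it as well.
strictly_alnum = re.sub(r'[\W_]+', '', sample)

print(keeps_underscore)  # snake_case100ok
print(strictly_alnum)    # snakecase100ok
```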



Answer 3:

Use the str.translate() method.

Presuming you will be doing this often:

(1) Once, create a string containing all the characters you wish to delete:

delchars = ''.join(c for c in map(chr, range(256)) if not c.isalnum())

(2) Whenever you want to scrunch a string:

scrunched = s.translate(None, delchars)  # Python 2 form: str.translate(table, deletechars)

The setup cost probably compares favourably with re.compile; the marginal cost is way lower:

C:\junk>\python26\python -mtimeit -s"import string;d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s=string.printable" "s.translate(None,d)"
100000 loops, best of 3: 2.04 usec per loop

C:\junk>\python26\python -mtimeit -s"import re,string;s=string.printable;r=re.compile(r'[\W_]+')" "r.sub('',s)"
100000 loops, best of 3: 7.34 usec per loop

Note: Using string.printable as benchmark data gives the pattern '[\W_]+' an unfair advantage; all the non-alphanumeric characters are in one bunch. In typical data there would be more than one substitution to do:

C:\junk>\python26\python -c "import string; s = string.printable; print len(s),repr(s)"
100 '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

Here's what happens if you give re.sub a bit more work to do:

C:\junk>\python26\python -mtimeit -s"d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s='foo-'*25" "s.translate(None,d)"
1000000 loops, best of 3: 1.97 usec per loop

C:\junk>\python26\python -mtimeit -s"import re;s='foo-'*25;r=re.compile(r'[\W_]+')" "r.sub('',s)"
10000 loops, best of 3: 26.4 usec per loop
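The s.translate(None, delchars) call above is the Python 2 form (the timings were taken on Python 2.6). In Python 3, str.translate takes a single table, which can be built with str.maketrans. A rough equivalent, as a sketch only: the names DELETE_TABLE and scrunch are illustrative, and like the answer it only considers the first 256 code points.

```python
import string

# Assumption: as in the answer, only code points 0..255 are candidates for deletion.
delchars = ''.join(c for c in map(chr, range(256)) if not c.isalnum())

# Python 3: the third argument of str.maketrans lists characters to delete.
DELETE_TABLE = str.maketrans('', '', delchars)


def scrunch(s):
    # Characters outside the table (code points above 255) pass through unchanged.
    return s.translate(DELETE_TABLE)


print(scrunch('foo-' * 3))  # foofoofoo
```

The table is built once and reused, which mirrors the one-time setup in step (1).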


Answer 4:

You could try:

print(''.join(ch for ch in some_string if ch.isalnum()))


Answer 5:

>>> import re
>>> string = "Kl13@£$%[};'\""
>>> pattern = re.compile(r'\W')
>>> string = re.sub(pattern, '', string)
>>> print(string)
Kl13


Answer 6:

How about:

def ExtractAlphanumeric(InputString):
    from string import ascii_letters, digits
    return "".join([ch for ch in InputString if ch in (ascii_letters + digits)])

This works by using a list comprehension to produce a list of the characters in InputString that are present in the combined ascii_letters and digits strings. It then joins the list into a string.
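One consequence of filtering against ascii_letters and digits is that non-ASCII letters are dropped, whereas str.isalnum() keeps them. A short sketch of the difference, assuming Python 3; the function names and the frozenset lookup are illustrative choices, not part of the answer above.

```python
from string import ascii_letters, digits

# Built once; a set gives constant-time membership tests per character.
ASCII_ALNUM = frozenset(ascii_letters + digits)


def extract_ascii_alphanumeric(text):
    """Keep only ASCII letters and digits."""
    return ''.join(ch for ch in text if ch in ASCII_ALNUM)


def extract_unicode_alphanumeric(text):
    """Keep whatever str.isalnum() accepts, including non-ASCII letters."""
    return ''.join(ch for ch in text if ch.isalnum())


sample = 'caf\u00e9 42!'  # 'café 42!' written with an escape for clarity
print(extract_ascii_alphanumeric(sample))    # caf42
print(extract_unicode_alphanumeric(sample))  # café42
```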



Answer 7:

As a spin-off from some other answers here, I offer a really simple and flexible way to define a set of characters that you want to limit a string's content to. In this case, I'm allowing alphanumerics PLUS dash and underscore. Just add or remove characters from my PERMITTED_CHARS as suits your use case.

PERMITTED_CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-"
someString = "".join(c for c in someString if c in PERMITTED_CHARS)


Answer 8:

for char in my_string:
    if not char.isalnum():
        my_string = my_string.replace(char, "")


Answer 9:

# Note: isalpha() keeps letters only and drops digits; use isalnum() to keep both.
sent = "".join(e for e in sent if e.isalpha())


Tags: python string