What is the best way to strip all non-alphanumeric characters from a string, using Python?
The solutions presented in the PHP variant of this question will probably work with some minor adjustments, but don't seem very 'pythonic' to me.
For the record, I don't just want to strip periods and commas (and other punctuation), but also quotes, brackets, etc.
I just timed some functions out of curiosity. In these tests I'm removing non-alphanumeric characters from the string string.printable (part of the built-in string module).
$ python -m timeit -s \
"import string" \
"''.join(ch for ch in string.printable if ch.isalnum())"
10000 loops, best of 3: 57.6 usec per loop
$ python -m timeit -s \
"import string" \
"filter(str.isalnum, string.printable)"
10000 loops, best of 3: 37.9 usec per loop
$ python -m timeit -s \
"import re, string" \
"re.sub('[\W_]', '', string.printable)"
10000 loops, best of 3: 27.5 usec per loop
$ python -m timeit -s \
"import re, string" \
"re.sub('[\W_]+', '', string.printable)"
100000 loops, best of 3: 15 usec per loop
$ python -m timeit -s \
"import re, string; pattern = re.compile('[\W_]+')" \
"pattern.sub('', string.printable)"
100000 loops, best of 3: 11.2 usec per loop
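One caveat when rerunning the filter variant today: in Python 3, filter() returns a lazy iterator rather than a string, so the result has to be joined back together. A minimal sketch, assuming Python 3 (keep_alnum is an illustrative name, not part of the timings above):

```python
import string


def keep_alnum(text):
    # filter() yields the matching characters lazily in Python 3,
    # so join them back into a single str.
    return "".join(filter(str.isalnum, text))


if __name__ == "__main__":
    print(keep_alnum(string.printable))
```

The timings quoted above were taken on Python 2, so this variant may rank differently on a current interpreter.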
Regular expressions to the rescue:
import re
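re.sub(r'\W+', '', your_string)
By Python definition, '\W' == [^a-zA-Z0-9_] (for ASCII matching), so it matches everything except digits, letters and _. Because _ is not matched, underscores are left in place; use '[\W_]+' if they should be removed as well.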
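To make the underscore behaviour concrete, here is a small sketch comparing the two patterns. The function names are illustrative only, and the re.ASCII variant is included on the assumption that you are on Python 3, where \W is Unicode-aware for str patterns by default:

```python
import re


def strip_non_word(text):
    # \W does not match "_", so underscores survive.
    return re.sub(r"\W+", "", text)


def strip_non_alnum(text):
    # Adding "_" to the character class removes underscores too.
    return re.sub(r"[\W_]+", "", text)


def strip_non_ascii_alnum(text):
    # With re.ASCII, only [a-zA-Z0-9_] count as word characters,
    # so accented letters are stripped as well.
    return re.sub(r"[\W_]+", "", text, flags=re.ASCII)


if __name__ == "__main__":
    sample = "snake_case, 42!"
    print(strip_non_word(sample))    # snake_case42
    print(strip_non_alnum(sample))   # snakecase42
```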
Use the str.translate() method.
Presuming you will be doing this often:
(1) Once, create a string containing all the characters you wish to delete:
delchars = ''.join(c for c in map(chr, range(256)) if not c.isalnum())
(2) Whenever you want to scrunch a string:
scrunched = s.translate(None, delchars)
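The two-argument form s.translate(None, delchars) is the Python 2 byte-string API. In Python 3, str.translate takes a single mapping from code points to replacements, where None means "delete". A minimal sketch under that assumption (DELETE_TABLE and scrunch are illustrative names, and limiting the table to range(256) simply mirrors the steps above):

```python
# (1) Once, build a table that maps each unwanted code point to None.
DELETE_TABLE = {
    code: None for code in range(256) if not chr(code).isalnum()
}


def scrunch(text):
    # (2) Whenever needed, delete every character listed in the table.
    # Characters outside range(256) are not in the table and pass through.
    return text.translate(DELETE_TABLE)


if __name__ == "__main__":
    print(scrunch("foo-bar_baz 42!"))  # foobarbaz42
```

Note that in Python 3 str.isalnum() is Unicode-aware, so accented Latin-1 letters such as "é" are kept by this table.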
The setup cost probably compares favourably with re.compile; the marginal cost is way lower:
C:\junk>\python26\python -mtimeit -s"import string;d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s=string.printable" "s.translate(None,d)"
100000 loops, best of 3: 2.04 usec per loop
C:\junk>\python26\python -mtimeit -s"import re,string;s=string.printable;r=re.compile(r'[\W_]+')" "r.sub('',s)"
100000 loops, best of 3: 7.34 usec per loop
Note: Using string.printable as benchmark data gives the pattern '[\W_]+' an unfair advantage; all the non-alphanumeric characters are in one bunch ... in typical data there would be more than one substitution to do:
C:\junk>\python26\python -c "import string; s = string.printable; print len(s),repr(s)"
100 '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
Here's what happens if you give re.sub a bit more work to do:
C:\junk>\python26\python -mtimeit -s"d=''.join(c for c in map(chr,range(256)) if not c.isalnum());s='foo-'*25" "s.translate(None,d)"
1000000 loops, best of 3: 1.97 usec per loop
C:\junk>\python26\python -mtimeit -s"import re;s='foo-'*25;r=re.compile(r'[\W_]+')" "r.sub('',s)"
10000 loops, best of 3: 26.4 usec per loop
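To rerun this comparison on your own interpreter, something like the following sketch could be used. It assumes Python 3 (where the deletion table is a dict rather than a string of characters), and the function names are illustrative; the numbers above come from Python 2.6, so the results may well differ.

```python
import re
import timeit

# Python 3 style deletion table: code point -> None means "delete".
DELETE_TABLE = {
    code: None for code in range(256) if not chr(code).isalnum()
}
PATTERN = re.compile(r"[\W_]+")


def with_translate(text):
    return text.translate(DELETE_TABLE)


def with_regex(text):
    return PATTERN.sub("", text)


if __name__ == "__main__":
    # Same input as above: many separate substitutions for the regex.
    sample = "foo-" * 25
    for func in (with_translate, with_regex):
        seconds = timeit.timeit(lambda: func(sample), number=100000)
        print("%s: %.3f s for 100000 calls" % (func.__name__, seconds))
```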
You could try:
print ''.join(ch for ch in some_string if ch.isalnum())
>>> import re
>>> string = "Kl13@£$%[};'\""
>>> pattern = re.compile(r'\W')
>>> string = re.sub(pattern, '', string)
>>> print string
Kl13
How about:
def ExtractAlphanumeric(InputString):
    from string import ascii_letters, digits
    return "".join([ch for ch in InputString if ch in (ascii_letters + digits)])
This works by using a list comprehension to produce a list of the characters in InputString if they are present in the combined ascii_letters and digits strings. It then joins the list together into a string.
As a spin-off from some other answers here, I offer a really simple and flexible way to define a set of characters that you want to limit a string's content to. In this case, I'm allowing alphanumerics PLUS dash and underscore. Just add or remove characters from my PERMITTED_CHARS as suits your use case.
PERMITTED_CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-"
someString = "".join(c for c in someString if c in PERMITTED_CHARS)
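A small variation on the same idea, offered as a sketch rather than a replacement: the whitelist is built from the string module's constants and stored in a frozenset, so each membership test is a hash lookup instead of a scan of the string. The function name keep_permitted is hypothetical.

```python
import string

# Whitelist: ASCII alphanumerics plus underscore and dash.
PERMITTED_CHARS = frozenset(string.ascii_letters + string.digits + "_-")


def keep_permitted(text, permitted=PERMITTED_CHARS):
    """Return text with every character outside `permitted` removed."""
    return "".join(c for c in text if c in permitted)


if __name__ == "__main__":
    print(keep_permitted("my file (v2).txt"))  # myfilev2txt
```

Passing a different set as the second argument changes the allowed characters without touching the function.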
for char in my_string:
    if not char.isalnum():
        my_string = my_string.replace(char, "")
# Note: isalpha() keeps letters only, so digits are removed too; use isalnum() to keep them.
sent = "".join(e for e in sent if e.isalpha())