-->

How do you filter a string such that only characte

2019-01-24 00:22发布

问题:

Imagine a string, like 'Agh#$%#%2341- -!zdrkfd' and I only wish to perform some operating on it such that only the lowercase letters are returned (as an example), which in this case would bring 'ghzdrkfd'.

How do you do this in Python? The obvious way would be to create a list, of characters, 'a' through 'z', then iterate over the characters in my string and build a new string, character by character, of those in my list only. This seems primitive.

I was wondering if regular expressions are appropriate. Replacing unwanted characters seems problematic and I tend to prefer whitelisting over blacklisting. The .match function does not seem appropriate. I have looked over the appropriate page on the Python site, but have not found a method which seems to fit.

If regular expressions are not appropriate and the correct approach is looping, is there a simple function which "explodes" a string into a list? Or am I just hitting another for loop there?

回答1:

If you are looking for efficiency. Using the translate function is the fastest you can get.

It can be used to quickly replace characters and/or delete them.

import string
delete_table  = string.maketrans(
    string.ascii_lowercase, ' ' * len(string.ascii_lowercase)
)
table = string.maketrans('', '')

"Agh#$%#%2341- -!zdrkfd".translate(table, delete_table)

In python 2.6: you don't need the second table anymore

import string
delete_table  = string.maketrans(
    string.ascii_lowercase, ' ' * len(string.ascii_lowercase)
)
"Agh#$%#%2341- -!zdrkfd".translate(None, delete_table)

This is method is way faster than any other. Of course you need to store the delete_table somewhere and use it. But even if you don't store it and build it every time, it is still going to be faster than other suggested methods so far.

To confirm my claims here are the results:

for i in xrange(10000):
    ''.join(c for c in s if c.islower())

real    0m0.189s
user    0m0.176s
sys 0m0.012s

While running the regular expression solution:

for i in xrange(10000):
    re.sub(r'[^a-z]', '', s)

real    0m0.172s
user    0m0.164s
sys 0m0.004s

[Upon request] If you pre-compile the regular expression:

r = re.compile(r'[^a-z]')
for i in xrange(10000):
    r.sub('', s)

real    0m0.166s
user    0m0.144s
sys 0m0.008s

Running the translate method the same number of times took:

real    0m0.075s
user    0m0.064s
sys 0m0.012s


回答2:

s = 'Agh#$%#%2341- -!zdrkfd'  
print ''.join(c for c in s if c.islower())

String objects are iterable; there is no need to "explode" the string into a list. You can put whatever condition you want in the list comprehension, and it will filter characters accordingly.

You could also implement this using a regex, but this will only hide the loop. The regular expressions library will still have to loop through the characters of the string in order to filter them.



回答3:

Using a regular expression is easy enough, especially for this scenario:

>>> import re
>>> s = 'ASDjifjASFJ7364'
>>> re.sub(r'[^a-z]+', '', s)
'jifj'

If you plan on doing this many times, it is best to compile the regular expression before hand:

>>> import re
>>> s = 'ASDjifjASFJ7364'
>>> r = re.compile(r'[^a-z]+')
>>> r.sub('', s)
'jifj'


回答4:

>>> s = 'Agh#$%#%2341- -!zdrkfd'
>>> ''.join(i for i in s if  i in 'qwertyuiopasdfghjklzxcvbnm')
'ghzdrkfd'


回答5:

s = 'ASDjifjASFJ7364'
s_lowercase = ''.join(filter(lambda c: c.islower(), s))
print s_lowercase #print 'jifj'


回答6:

Here's one solution if you are specifically interested in working on strings:

 s = 'Agh#$%#%2341- -!zdrkfd'
 lowercase_chars = [chr(i) for i in xrange(ord('a'), ord('z') + 1)]
 whitelist = set(lowercase_chars)
 filtered_list = [c for c in s if c in whitelist]

The whitelist is actually a set (not a list) for efficiency.

If you need a string, use join():

filtered_str = ''.join(filtered_list)

filter() is a more generic solution. From the documentation (http://docs.python.org/library/functions.html):

filter(function, iterable)

Construct a list from those elements of iterable for which function returns true. iterable may be either a sequence, a container which supports iteration, or an iterator. If iterable is a string or a tuple, the result also has that type; otherwise it is always a list. If function is None, the identity function is assumed, that is, all elements of iterable that are false are removed.

This would be one way of using filter():

filtered_list = filter(lambda c: c.islower(), s)


回答7:

A more generic and understandable solution to take an inputstring and filter it against a whitelist of characters :

inputstring = "Agh#$%#%2341- -!zdrkfd"
whitelist = "abcdefghijklmnopqrstuvwxyz"
remove = inputstring.translate(None, whitelist)
result = inputstring.translate(None, remove)
print result

This prints

ghzdrkfd

The first string.translate removes all characters in the whitelist from the inputstring. This gives us the characters we want to remove. The second string.translate call removes those from the inputstring and produces the desired result.



回答8:

I'd use a regex. For lowercase match [a-z].



回答9:

import string
print "".join([c for c in "Agh#$%#%2341- -!zdrkfd" if c in string.lowercase])


回答10:

import string

print filter(string.lowercase.__contains__, "lowerUPPER")
print filter("123".__contains__, "a1b2c3")