Keeping only certain characters in a string using

Posted 2019-02-13 23:44

Question:

In my program I have a string like this:

ag ct oso gcota

Using Python, my goal is to get rid of the whitespace and keep only the a, t, c, and g characters. I understand how to get rid of the whitespace (I'm just using line = line.replace(" ", "")). But how can I get rid of the characters I don't need when they could be any other letter of the alphabet?

Answer 1:

A very elegant and fast way is to use regular expressions:

import re

line = 'ag ct oso gcota'
# [^atcg] matches any character that is NOT a, t, c, or g
line = re.sub('[^atcg]', '', line)

# line is now 'agctgcta'
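
If the set of characters to keep is not fixed, the same idea can be wrapped in a small helper. This is only a sketch: keep_only is a hypothetical name, not something from the answer, and it assumes allowed is a non-empty string.

```python
import re

def keep_only(text, allowed):
    """Return text with every character not in `allowed` removed."""
    # re.escape guards against characters that are special inside a
    # character class, such as ']' , '^' or '-'.
    pattern = re.compile('[^' + re.escape(allowed) + ']')
    return pattern.sub('', text)

result = keep_only('ag ct oso gcota', 'atcg')
print(result)
```

Note that the match is case-sensitive, so uppercase A, T, C, and G would be removed unless they are added to allowed.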


Answer 2:

I might do something like:

chars_i_want = set('atcg')
final_string = ''.join(c for c in start_string if c in chars_i_want)

This is probably the easiest way to do this.
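
For reference, here is a runnable version of the snippet above. The value of start_string is taken from the question; treat the rest as a minimal sketch rather than the only way to write it.

```python
start_string = 'ag ct oso gcota'  # sample input from the question

# A set gives constant-time membership tests.
chars_i_want = set('atcg')

# Keep each character only if it is one of the wanted ones.
final_string = ''.join(c for c in start_string if c in chars_i_want)
print(final_string)
```

The membership test is case-sensitive, so uppercase letters are dropped unless they are added to chars_i_want.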


Another option would be to use str.translate to do the work (this is the Python 2 form; in Python 3, str.translate takes a single table argument):

import string
chars_to_remove = string.printable.translate(None,'acgt')
final_string = start_string.translate(None,chars_to_remove)
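
The two-argument call above only works on Python 2. A rough Python 3 equivalent, assuming the same sample input as the question, can build the table with str.maketrans, whose third argument lists characters to delete:

```python
import string

start_string = 'ag ct oso gcota'  # sample input from the question

# Every printable ASCII character except the four we want to keep.
chars_to_remove = ''.join(c for c in string.printable if c not in 'acgt')

# With three arguments, str.maketrans maps each character in the
# third argument to None, which tells str.translate to delete it.
table = str.maketrans('', '', chars_to_remove)

final_string = start_string.translate(table)
print(final_string)
```

Characters outside string.printable (for example, accented letters) are not in the table, so they pass through unchanged.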

I'm not sure which would perform better. It'd need to be timed via timeit to know definitively.


Update: Timings!

import re
import string

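def test_re(s,regex=re.compile('[^atgc]')):
    # NOTE: the arguments are reversed (should be regex.sub('', s)), so this
    # always returns ''; see Answer 3. The test_re timing below reflects this bug.
    return regex.sub(s,'')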

def test_join1(s,chars_keep=set('atgc')):
    return ''.join(c for c in s if c in chars_keep)

def test_join2(s,chars_keep=set('atgc')):
    """ list-comp is faster, but less 'idiomatic' """
    return ''.join([c for c in s if c in chars_keep])

def translate(s,chars_to_remove = string.printable.translate(None,'acgt')):
    return s.translate(None,chars_to_remove)

import timeit

s = 'ag ct oso gcota'
for func in "test_re","test_join1","test_join2","translate":
    print func,timeit.timeit('{0}(s)'.format(func),'from __main__ import s,{0}'.format(func))

Sadly (for me), regex wins on my machine:

test_re 0.901512145996
test_join1 6.00346088409
test_join2 3.66561293602
translate 1.0741918087


Answer 3:

Did people test mgilson's test_re() function before upvoting? The arguments to regex.sub() are reversed, so it was doing the substitution on an empty string and always returned an empty string.
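
A quick check makes the problem visible. The signature of a compiled pattern's method is sub(repl, string), so passing the input first makes it the replacement text and the empty string the subject:

```python
import re

regex = re.compile('[^atgc]')
s = 'ag ct oso gcota'

# Reversed: repl=s, string='' -> nothing to search, result is ''.
wrong = regex.sub(s, '')

# Correct: repl='', string=s -> unwanted characters are deleted.
right = regex.sub('', s)

print(repr(wrong), repr(right))
```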

I work in Python 3.4, where str.translate() takes only one argument, a mapping such as a dict. Because there is overhead in building this dict, I moved it out of the function. To be fair, I also moved the regex compilation out of the function (this didn't make a noticeable difference).

import re
import string

regex=re.compile('[^atgc]')

chars_to_remove = string.printable.translate({ ord('a'): None, ord('c'): None, ord('g'): None, ord('t'): None })
cmap = {}
for c in chars_to_remove:
    cmap[ord(c)] = None

def test_re(s):
    return regex.sub('',s)

def test_join1(s,chars_keep=set('atgc')):
    return ''.join(c for c in s if c in chars_keep)

def test_join2(s,chars_keep=set('atgc')):
    """ list-comp is faster, but less 'idiomatic' """
    return ''.join([c for c in s if c in chars_keep])

def translate(s):
    return s.translate(cmap)

import timeit

s = 'ag ct oso gcota'
for func in "test_re","test_join1","test_join2","translate":
    print(func,timeit.timeit('{0}(s)'.format(func),'from __main__ import s,{0}'.format(func)))

Here are the timings:

test_re 3.3141989699797705
test_join1 2.4452173250028864
test_join2 2.081048655003542
translate 1.9390292020107154

It's too bad str.translate() doesn't have an option to control what happens to characters that aren't in the map. The current behavior is to keep them, but there could just as well be an option to remove them, for cases where the characters we want to keep are far fewer than the ones we want to remove (oh hello, Unicode).
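
One possible workaround, offered as a sketch rather than something from the thread: str.translate looks up each code point by indexing the table, so a dict subclass whose __missing__ returns None deletes every character that is not explicitly mapped. KeepOnly is a hypothetical name.

```python
class KeepOnly(dict):
    """Translation table that keeps the given characters and deletes the rest."""

    def __init__(self, chars):
        # Map each wanted character to itself.
        super().__init__((ord(c), c) for c in chars)

    def __missing__(self, key):
        # Called for any code point not in the table;
        # None tells str.translate to delete that character.
        return None

table = KeepOnly('acgt')
cleaned = 'ag ct oso gcota'.translate(table)
cleaned_unicode = 'ag\u00e9 ct\u2206 gcota'.translate(table)
print(cleaned, cleaned_unicode)
```

This handles non-ASCII input without enumerating every character to remove, at the cost of a Python-level call for each unmapped character.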