Filtering a FASTA file based on sequence with BioP

2019-07-15 08:37发布

I have a fasta file. From that file, I need to get the only sequences containing GTACAGTAGG and CAACGGTTTTGCC at the end and/or start of the sequence and put them in a new fasta file. So here's an example:

>m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7/2516_3269
***GTACAGTAGG***GTACACACAGAACGCGACAAGGCCAGGCGCTGGAGGAACTCCAGCAGCTAGATGCAAGCGACTA
TCAGAGCGTTGGGTCCAGAACGAAGAACAGTCACTCAAGACTGCTTT***CAACGGTTTTGCC***

>m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7/3312_3597
CGCGGCATCGAATTAATACGACTCACTATAGGTTTTTTTATTGGTATTTTCAGTTAGATTCTTTCTTCTTAGAGGGTACA
GAGAAAGGGAGAAAATAGCTACAGACATGGGAGTGAAAGGTAGGAAGAAGAGCGAAGCAGACATTATTCA

>m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7/3708_4657
***CAACGGTTTTGCC***ACAAGATCAGGAACATAAGTCACCAGACTCAATTCATCCCCATAAGACCTCGGACCTCTCA
ATCCTCGAATTAGGATGTTCTCCCCATGGCGTACGGTCTATCAGTATATAAACCTGACATACTATAAAAAAGTATACCAT
TCTTATCATGTACAGTAGG***GTACAGTAGG***

>m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7/4704_5021
***GTACAGTAGG***GTGGGAGAGATGGCAGAAAGGCAGAAAGGAGAAAGATTCAGGATAACTCTCCTGGAGGGGCGAG
GTGCCATTCCCTGTGGTCACTTATTCTAAAGGCCCCAACCCTTCAAC***CAACGGTTTTGCC***

>m121012_054644_42133_c100390582550000001523038311021245_s1_p0/8/4223_4358
AAATATTGGGTCAAAGAACCGTTACTTTTCTTATATATGCGGCGCGAGGTTTTATATACTGATAAGAACCTACGCCATGG
GACATCTAATTCAGAGGGAAGAAGGTCCATGTCTGTTTGGATGAAATTGAGTCTG

(* added for highlighting)

I need some way to get the only sequences containing GTACAGTAGG and CAACGGTTTTGCC at the end and/or start of the sequences and get them out in a new fasta file. I'm very new to this. I'm not even sure if it can be done. Thanks in advance for any help you can give.

4条回答
我欲成王,谁敢阻挡
2楼-- · 2019-07-15 09:33

Here's one way to do it in Biopython:

from Bio import SeqIO

source = 'fasta_file_name.fa'
outfile = 'filtered.fa'
sub1 ='GTACAGTAGG'
sub2 = 'CAACGGTTTTGCC'

def seq_check(seq, sub1, sub2):
    # basically a function to check whether seq starts or ends with sub1 or sub2
    return seq.startswith(sub1) or seq.startswith(sub2) or \
        seq.endswith(sub1) or seq.endswith(sub2)

seqs = SeqIO.parse(source, 'fasta')
filtered = (seq for seq in seqs if seq_check(seq.seq, sub1, sub2))
SeqIO.write(filtered, outfile, 'fasta')

We're using generator expression (the line that begins with 'filtered') so the filtering is done as the program's reading through the source file. This has the advantage of being memory-efficient.

We're also creating a new function to check the starts and ends of the sequence to make the program more readable. In theory, you could do the check inside the generator expression, but that would make the line unecessarily long.

Hope that helps :).

查看更多
啃猪蹄的小仙女
3楼-- · 2019-07-15 09:37

Python has a method built in to strings called startswith(), and also one called endswith(). So you could test to see if it starts with one and ends with the other, and vice versa.

查看更多
男人必须洒脱
4楼-- · 2019-07-15 09:41

While checking that a string has a certain sequence at the end and/or beginning is indeed a very simple task for Python (refer to str.startswith and str.endswith), this does not address the problem of extracting the sequence from the FASTA file as a string. There are certain issues here, e.g. the sequences must be separated from their annotations and also they can span multiple lines. So applying the string methods directly to the lines of the file will not produce the desired result.

This is why you need an actual parser for the FASTA format. The parser will process a FASTA file and give you annotations and sequences as Python strings. BioPython indeed provides one, and you can do something like this:

from Bio import SeqIO
def filter_records(source, substrings):
    for rec in source:
        if any(rec.seq.startswith(sub) or rec.seq.endswith(sub)
                 for sub in substrings):
             yield rec

substrings = ('GTACAGTAGG', 'CAACGGTTTTGCC')
SeqIO.write(filter_recors(SeqIO.parse('my.fasta', 'fasta'),
            'filtered.fasta', 'fasta')

I can also recommend using pyteomics (a Python-based proteomics microframework I'm involved in developing) for manipulating FASTA files:

from pyteomics import fasta
def filter_fasta(source, substrings):
    for descr, seq in source:
        if any(seq.startswith(sub) or seq.endswith(sub) for sub in substrings):
            yield descr, seq

substrings = ('GTACAGTAGG', 'CAACGGTTTTGCC')
fasta.write(filter_fasta(fasta.read('my.fasta'), substrings), 'filtered.fasta')
查看更多
疯言疯语
5楼-- · 2019-07-15 09:43

Probably not the best way, depending on size of your sequences, but this will get the job done.

import re
data_file ="location_of_fasta_file"
sequence = ''
Valid = False
for line in open(data_file):
    line = line.rstrip()
    if re.match("^>",line):
        if re.findall('^GTACAGTAGG',sequence) or re.findall('GTACAGTAGG$',sequence) or re.findall('^CAACGGTTTTGCC',sequence) or re.findall('CAACGGTTTTGCC$',sequence):
            print header_line
            print sequence
        header_line=line
        sequence = ''
        continue
    else:
        sequence += line
# below is needed to allow printing of final sequence which is not followed by a new fasta entry
if re.findall('^GTACAGTAGG',sequence) or re.findall('GTACAGTAGG$',sequence) or re.findall('^CAACGGTTTTGCC',sequence) or re.findall('CAACGGTTTTGCC$',sequence):
    print header_line
    print sequence
查看更多
登录 后发表回答