Extract sequences from a FASTA file based on entri

2020-02-29 23:55发布

I have two files.

File 1: a FASTA file with gene sequences, formated like this example:

>PITG_00002 | Phytophthora infestans T30-4 conserved hypothetical protein (426 nt)
ATGCATCGCTCGGGTTCCGCACGGAAAGCCCAAGGTCTGGGATTACGGGGTGGTGGTCGG
TTACACTTGGAATAACCTCGCAAATTCAGAATCTCTACAGGCTACGTTCGCGGATGGAAC
>PITG_00003 | Phytophthora infestans T30-4 protein kinase (297 nt)
ATGACGGCTGGGGTCGGTACGCCCTACTGGATCGCACCGGAGATTCTTGAAGGCAAACGG
TACACTGAGCAAGCGGATATTTACTCGTTCGGAGTGGTTTTATCCGAGCTGGACACGTGC
AAGATGCCGTTCTCTGACGTCGTTACGGCAGAGGGAAAGAAACCCAAACCAGTTCAGATC
>PITG_00004 | Phytophthora infestans T30-4 protein kinase, putative (1969 nt)
ATGCGCGTGTCTGGTCTCCTTTCAATTCTTGCAGCCACTTTGACCACGGCCCAAGACTAC

File 2: A simple text file with JUST the accession identification of the gene. Like so.

PITG_00003
PITG_00005
PITG_00023

Every entry in File 2 is somewhere in File 1, but not every entry in File 1 is in File 2. I need to remove all the entries from File 1 that are not in File 2. I feel like there must be something in the biopython module that could help me, I just don't know what. For instance, I originally thought that I could extract just the accessions from my FASTA file using the SeqIO.parse function, but this really just lands me with two files of accession numbers. I don't know how to selectively extract the accessions that are in the other file. Maybe like reading all the entries from File 2 into a dictionary and then associated that entry with its matching entry in File 1 and use SeqIO.parse to extract the whole sequence...But I really don't know....Any help anyone could give me is extremely appreciated!

1条回答
不美不萌又怎样
2楼-- · 2020-03-01 00:11

Try this:

f2 = open('accessionids.txt','r')
f1 = open('fasta.txt','r')
f3 = open('fasta_parsed.txt','w')

AI_DICT = {}
for line in f2:
    AI_DICT[line[:-1]] = 1

skip = 0
for line in f1:
    if line[0] == '>':
        _splitline = line.split('|')
        accessorIDWithArrow = _splitline[0]
        accessorID = accessorIDWithArrow[1:-1]
        # print accessorID
        if accessorID in AI_DICT:
            f3.write(line)
            skip = 0
        else:
            skip = 1
    else:
        if not skip:
            f3.write(line)

f1.close()
f2.close()
f3.close()

To briefly explain what's going on here... accessionids.txt is your File 2, whereas fasta.txt is your File 1. Obviously you'll need to replace these filenames with your actual filenames within the code.

First, we create a dictionary (sometimes referred to as a hash or associative array) and for every Accession ID in File 2 we create an entry where the key is the Accession ID and the value is set to 1 (not that the value really matters in this case).

Next we look in File 1 and again look at each line in that file. If the line in the file starts with > then we know that it contains an Accession ID. We take that line and split it along the | since every line with an Accession ID will have a | in the string. Next, take the first part of the split as specified by _splitline[0]. We use accessorIDWithArrow[1:-1] to chop off the first and last characters in the string which are the > symbol in the front and a blank space in the rear.

At this point, accessorID now contains the Accession ID in the format that we expect from File 2.

Next, we check if the dictionary we created and populated earlier has this Accession ID defined as a key. If it does, we immediately write the line with the Accession ID to a new file, fasta_parsed.txt, and set/reset the skip 'flag' variable to 0. The else statement containing the if not skip segment will then allow subsequent lines associated with the Accession ID that we found to be printed to the fasta_parsed.txt file.

For Accession ID from File 1 not found in the dictionary (not in File 2), we don't write the line to fasta_parsed.txt and we set the skip flag to 0. Thus, until another Accession ID is found in File 1 that exists in File 2, all subsequent lines will be skipped.

查看更多
登录 后发表回答