I am trying to parse a large fasta file and I am encountering out of memory errors. Some suggestions to improve the data handling would be appreciated. Currently the program correctly prints out the names however partially through the file I get a MemoryError
Here is the generator
def readFastaEntry( fp ):
name = ""
seq = ""
for line in fp:
if line.startswith( ">" ):
tmp = []
tmp.append( name )
tmp.append( seq )
name = line
seq = ""
yield tmp
else:
seq = seq.join( line )
and here is the caller stub more will be added after this part works
fp = open( sys.argv[1], 'r' )
for seq in readFastaEntry( fp ) :
print seq[0]
For those not fimilar with the fasta format here is an example
>1 (PB2)
AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATAC
TCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCC
TGCACTCAGGATGAAGTGGATGATG
>2 (PB1)
AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACC
ACATTTCCCTATACTGGAGACCCTCC
each entry starts with a ">" stating the name etc then the next N lines are data. There is no defined ending of the data other than the next line having a ">" at the beginning.
Without having a great understanding of what you are doing, I would have written the code like this:
This gathers up the data after a starting line up to the next starting line. Making
seq
an array means that you minimize the string joining until the last possible moment. Yielding a tuple makes more sense than a list.Have you considered using BioPython. They have a sequence reader that can read fasta files. And if you are interested in coding one yourself, you can take a look at BioPython's code.
Edit: Code added
A pyparsing parser for this format is only a few lines long. See the annotations in the following code:
Prints: