I have a FASTA file that can easily be parsed by SeqIO.parse.

I am interested in extracting sequence IDs and sequence lengths. I used these lines to do it, but I feel it's way too heavy (two iterations, conversions, etc.):
from Bio import SeqIO
import pandas as pd

# parse the FASTA file twice: once for IDs, once for lengths
identifiers = [seq_record.id for seq_record in SeqIO.parse("sequence.fasta", "fasta")]
lengths = [len(seq_record.seq) for seq_record in SeqIO.parse("sequence.fasta", "fasta")]

# convert the lists to pandas Series
s1 = pd.Series(identifiers, name='ID')
s2 = pd.Series(lengths, name='length')

# gather the Series into a DataFrame and use the ID column as the index
Qfasta = pd.DataFrame(dict(ID=s1, length=s2)).set_index(['ID'])
I could do it with only one iteration, but then I get a dict:

records = SeqIO.parse(fastaFile, 'fasta')

and I somehow can't get DataFrame.from_dict to work...
My goal is to iterate over the FASTA file once, and collect the IDs and sequence lengths into a DataFrame as I go.
Here is a short FASTA file for those who want to help.
David has given you a nice answer on the pandas side. On the Biopython side, you don't need to use SeqRecord objects via Bio.SeqIO if all you want is the record identifiers and their sequence lengths - this should be faster:
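A minimal sketch of that approach, assuming Bio.SeqIO.FastaIO's SimpleFastaParser (which yields plain title/sequence string pairs instead of SeqRecord objects) and the same sequence.fasta file as in the question:

from Bio.SeqIO.FastaIO import SimpleFastaParser

identifiers = []
lengths = []
with open("sequence.fasta") as fasta_file:               # closes the handle cleanly
    for title, sequence in SimpleFastaParser(fasta_file):
        identifiers.append(title.split(None, 1)[0])      # first word of the title line is the ID
        lengths.append(len(sequence))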
You're spot on - you definitely shouldn't be parsing the file twice, and storing the data in a dictionary is a waste of computing resources when you'll just be converting it to numpy arrays later. SeqIO.parse() returns a generator, so you can iterate record-by-record, building a list like so:
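Something along these lines (a sketch using the same sequence.fasta file; the resulting lists can then be fed into the pandas code from the question):

from Bio import SeqIO

identifiers = []
lengths = []
with open("sequence.fasta") as fasta_file:                  # closes the handle cleanly
    for seq_record in SeqIO.parse(fasta_file, "fasta"):     # generator: one record at a time
        identifiers.append(seq_record.id)
        lengths.append(len(seq_record.seq))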
See Peter Cock's answer for a more efficient way of parsing just IDs and sequences from a FASTA file.
The rest of your code looks pretty good to me. However, if you really want to optimize for use with pandas, you can read below:

On minimizing memory usage
Consulting the source of pandas.Series, we can see that its data is stored internally as a numpy ndarray:
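A quick illustration of this (not the pandas source itself, just a check you can run):

import pandas as pd

s = pd.Series(['seq1', 'seq2', 'seq3'], name='ID')
print(type(s.values))   # <class 'numpy.ndarray'> - the Series data is backed by an ndarray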
If you make identifiers an ndarray, it can be used directly in a Series without constructing a new array - the copy parameter (default False) prevents a new ndarray from being created when one isn't needed. By storing your identifiers in a plain list, you force Series to coerce that list into an ndarray.
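For instance (a sketch; whether the copy is actually avoided can depend on your pandas version and the dtype involved):

import numpy as np
import pandas as pd

ids = np.array(['seq1', 'seq2', 'seq3'])
s = pd.Series(ids, name='ID', copy=False)   # reuses the existing ndarray rather than building a new one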
Avoid initializing lists
If you know in advance exactly how many sequences you have (and how long the longest ID will be), you could initialize an empty ndarray to hold the identifiers like so:
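For example (a sketch; the record count and the maximum ID length are assumptions you would have to supply yourself):

import numpy as np

num_seqs = 50       # assumed: known number of records in the file
max_id_len = 20     # assumed: length of the longest ID
identifiers = np.empty(num_seqs, dtype='S{:d}'.format(max_id_len))  # fixed-width byte strings
lengths = np.empty(num_seqs, dtype=int)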
Of course, it's pretty hard to know exactly how many sequences you'll have, or what the largest ID is, so it's easiest to just let numpy convert from an existing list. However, this is technically the fastest way to store your data for use in pandas.