I have an input_file.fa file like this (FASTA format):
> header1 description
data data
data
>header2 description
more data
data
data
I want to read in the file one chunk at a time, so that each chunk contains one header and the corresponding data, e.g. block 1:
> header1 description
data data
data
Of course I could just read in the whole file like this and split on ">":
with open("input_file.fa") as f:
    for block in f.read().split(">"):
        pass
But I want to avoid reading the whole file into memory, because the files are often large.
I can read in the file line by line of course:
with open("input_file.fa") as f:
    for line in f:
        pass
But ideally what I want is something like this:
with open("input_file.fa", newline=">") as f:
    for block in f:
        pass
But I get an error:
ValueError: illegal newline value: >
I've also tried using the csv module, but with no success.
I did find this post from 3 years ago, which provides a generator-based solution to this issue, but it doesn't seem very compact. Is this really the only/best solution? It would be neat if the generator could be created in a single line rather than a separate function, something like this pseudocode:
with open("input_file.fa") as f:
    blocks = magic_generator_split_by_>
    for block in blocks:
        pass
If this is impossible, then I guess you could consider my question a duplicate of the other post, but if that is so, I hope people can explain to me why the other solution is the only one. Many thanks.
One approach that I like is to use itertools.groupby together with a simple key function that tells header lines apart from data lines.
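A minimal sketch of that idea (the helper name fasta_blocks and the in-memory sample standing in for your file are my own illustration):

```python
from io import StringIO
from itertools import groupby

def fasta_blocks(handle):
    """Yield (header, sequence) pairs one at a time using groupby."""
    header = None
    # The key function marks header lines; groupby then alternates
    # between header groups and data groups.
    for is_header, group in groupby(handle, key=lambda line: line.startswith(">")):
        if is_header:
            header = next(group).strip()
        else:
            # The join is only needed to treat the section as one string;
            # otherwise just iterate over `group` line by line.
            yield header, "".join(line.strip() for line in group)

# Sample data standing in for open("input_file.fa")
fasta = StringIO(">header1 description\ndata data\ndata\n>header2 description\nmore data\n")
blocks = list(fasta_blocks(fasta))
```

Since groupby consumes the file handle lazily, only one group's lines are materialized at a time.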
You need the join only if you have to handle the contents of a whole section as a single string; if you are only interested in reading the lines one by one, you can simply iterate over each group directly.

A more general solution is to write a generator function that yields one group at a time. This way you store only one group at a time in memory.
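A sketch of such a generator (names and sample data are mine, not from the linked post); only the current block is ever held in memory:

```python
from io import StringIO

def read_blocks(handle):
    """Yield one block (header line plus its data lines) at a time."""
    block = []
    for line in handle:
        if line.startswith(">") and block:
            yield block           # a new header closes the previous block
            block = []
        block.append(line.rstrip("\n"))
    if block:
        yield block               # don't forget the last block

# Sample data standing in for open("input_file.fa")
fasta = StringIO(">header1 description\ndata data\ndata\n>header2 description\nmore data\n")
blocks = list(read_blocks(fasta))
```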
This reads the file line by line, and you get back the blocks via the yield statement. It is lazy, so you don't have to worry about a large memory footprint.

For FASTA formats in general, I would recommend using the Biopython package.
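A sketch with Biopython (a third-party package, pip install biopython; the in-memory sample stands in for your file). SeqIO.parse returns a lazy iterator of records, so large files are handled one record at a time:

```python
from io import StringIO
from Bio import SeqIO  # third-party: pip install biopython

# Sample data standing in for open("input_file.fa")
fasta = StringIO(">seq1 first sequence\nACGT\nACGT\n>seq2 second sequence\nTTTT\n")

# SeqIO.parse yields one SeqRecord at a time; each record has
# .id (first word of the header) and .seq (the joined sequence lines)
records = list(SeqIO.parse(fasta, "fasta"))
```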