Background:
Python 2.6.6 on Linux. First part of a DNA sequence analysis pipeline.
I want to read a possibly gzipped file from mounted remote storage (LAN). If it is gzipped, I want to gunzip it to a stream (i.e. using gunzip FILENAME -c). If the first character of the stream (file) is "@", the entire stream should be routed into a filtering program that takes input on standard input; otherwise it should be piped directly to a file on local disk. I'd like to minimize the number of file reads/seeks from remote storage (a single pass through the file should be possible, shouldn't it?).
Contents of an example input file, first four lines corresponding to one record in FASTQ format:
@I328_1_FC30MD2AAXX:8:1:1719:1113/1
GTTATTATTATAATTTTTTACCGCATTTATCATTTCTTCTTTATTTTCATATTGATAATAAATATATGCAATTCG
+I328_1_FC30MD2AAXX:8:1:1719:1113/1
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhahhhhhhfShhhYhhQhh]hhhhffhU\UhYWc
Files that should not be piped into the filtering program contain records that look like this (first two lines corresponding to one record in FASTA format):
>I328_1_FC30MD2AAXX:8:1:1719:1113/1
GTTATTATTATAATTTTTTACCGCATTTATCATTTCTTCTTTATTTTCATATTGATAATAAATATATGCAATTCG
Some made up semi-pseudo code effort to visualize what I want to do (I know this isn't possible the way I've written it). I hope it makes some sense:
if gzipped:
    gunzip = Popen(["gunzip", "-c", "remotestorage/file.gz"], stdout=PIPE)
    if gunzip.stdout.peek(1) == "@":  # This isn't possible
        fastq = True
    else:
        fastq = False
if fastq:
    filter = Popen(["filter", "localstorage/outputfile.fastq"], stdin=gunzip.stdout).communicate()
else:
    # Send the gunzipped stream to another file
Disregard the fact that the code won't run as written and that it has no error handling etc.; all of that is already in my other code. I just want help with peeking into the stream or finding a way around that. It would be great if you could call gunzip.stdout.peek(1), but I realize that's not possible.
What I've tried so far:
I figured subprocess.Popen might help me achieve this, and I've tried a lot of different ideas, among them using some kind of io.BufferedRandom() object to write the stream to, but I can't figure out how that would work. I know pipes are non-seekable, but a workaround might be to read the first character of the gunzip stream, then create a new stream into which you first write an "@" or ">" (depending on the file contents) and then stuff the rest of the gunzip.stdout stream. This new stream would then be fed into the filter's Popen stdin.
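To make that workaround idea concrete, here is a minimal sketch of what I imagine, under some assumptions: the helper names (is_fastq_start, copy_with_prefix, run_pipeline) and the 64 KiB chunk size are made up for illustration, "filter" and the paths are the placeholders from my pseudocode above, and there is no error handling. The point is only that the one byte consumed from the pipe is written back out first and the rest is copied in fixed-size chunks, so the source is read exactly once and memory use stays bounded.

```python
import shutil
from subprocess import Popen, PIPE

CHUNK_SIZE = 64 * 1024  # arbitrary choice; bounds memory use per copy step


def is_fastq_start(first_byte):
    """FASTQ records start with "@"; FASTA records start with ">"."""
    return first_byte == b"@"


def copy_with_prefix(prefix, src, dst, chunk_size=CHUNK_SIZE):
    """Write the already-consumed prefix, then stream the rest of src to dst."""
    dst.write(prefix)
    shutil.copyfileobj(src, dst, chunk_size)


def run_pipeline(gz_path, fastq_out, other_out):
    """Hypothetical wiring: gunzip -> (filter | local file), single pass."""
    gunzip = Popen(["gunzip", "-c", gz_path], stdout=PIPE)
    first = gunzip.stdout.read(1)  # consumed from the pipe; cannot be pushed back
    if is_fastq_start(first):
        # "filter" is the placeholder program name; it reads standard input
        flt = Popen(["filter", fastq_out], stdin=PIPE)
        copy_with_prefix(first, gunzip.stdout, flt.stdin)
        flt.stdin.close()  # signal end of input to the filter
        flt.wait()
    else:
        with open(other_out, "wb") as out:
            copy_with_prefix(first, gunzip.stdout, out)
    gunzip.stdout.close()
    return gunzip.wait()
```

The drawback of this approach is that all the data passes through the Python process instead of going straight from gunzip's stdout to the filter's stdin, which is exactly what I was hoping peek would let me avoid.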
Note that the files might be several times larger than available memory. I do not want to read the source file from remote storage more than once, and I want to avoid any unnecessary file access.
Any ideas are welcome! Please ask me questions so I can clarify if I didn't make it clear enough.