I have data that always comes in blocks of four lines in the following format (called FASTQ):
@SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
+SRR018006.2016 GA2:6:1:20:650 length=36
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
@SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR018006.19405469 GA2:6:100:1793:611 length=36
7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/
Is there a simple sed/awk/bash way to convert them into this format (called FASTA):
>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
In principle, we want to extract the first two lines of each block of four and replace the leading @ with >.
Here's the solution to the "skip every other line" part of the problem that I just learned from SO:
If all that needs to be done is change one @ to >, then I reckon will do the job.
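For illustration, here is a minimal sed sketch of that idea. It assumes GNU sed (the first~step address form is a GNU extension) and records of exactly four lines each; the function name fastq_to_fasta_sed is made up for this example.

```shell
# Hypothetical helper: convert strictly 4-line FASTQ records to FASTA.
# Requires GNU sed: "1~4" matches lines 1, 5, 9, ... (headers),
# "2~4" matches lines 2, 6, 10, ... (sequences).
fastq_to_fasta_sed() {
    # On header lines, swap the leading @ for > and print;
    # on sequence lines, print unchanged; drop everything else (-n).
    sed -n '1~4s/^@/>/p;2~4p' "$@"
}
```

It could be called as fastq_to_fasta_sed reads.fastq > reads.fasta (file names are placeholders). Addressing by line number rather than by a leading @ matters, because a quality line may also begin with @.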
Something like:
should work.
Just awk; no need for other tools.
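A minimal awk-only sketch along these lines, assuming every record is exactly four lines (no wrapped sequences); the function name fastq_to_fasta_awk is hypothetical:

```shell
# Hypothetical helper: keep lines 1 and 2 of every 4-line FASTQ record.
fastq_to_fasta_awk() {
    awk '
        NR % 4 == 1 { sub(/^@/, ">"); print }   # header: replace leading @ with >
        NR % 4 == 2 { print }                   # sequence: copy unchanged
    ' "$@"
}
```

For example, fastq_to_fasta_awk data > data.fasta, where data is a placeholder input file. Lines 3 and 4 of each record (the + line and the quality string) match neither rule and are dropped.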
See fastq2fasta.pl in http://www.ringtail.tsl.ac.uk/david-studholme/scripts/
below
where data is your data file. I've received:
I know I'm way in the future, but for the benefit of googlers:
You may want to use fastq_to_fasta from the fastx toolkit. It will keep the @ sign, though. It will also remove lines with Ns unless you tell it not to.