I'm using bash on a Mac, with GNU grep installed via MacPorts. I'm trying to query a FASTA file (DNA sequences, with the sequence ID on one line and the DNA sequence on the following line) with grep to output a subset of the file, based on a list of query strings stored in a second file. Currently I have the query list, which is single words separated by newlines, and the FASTA file, and I am using the command
grep -A1 -f query_list.txt initial_file.fasta > query_subset.fasta
This almost produces the output I'm after, but in the output file there is a double dash on its own line after each ID/sequence pair that matches a string in the query file. I'm not sure why this is happening. I've tried removing the dashes with sed
sed 's/\n--\n/\n' query_subset.fasta > final.fasta
but that doesn't work. If I use the same find and replace in TextWrangler, it works fine.
As an example, the files look like this:
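One likely reason the sed attempt fails: sed reads its input one line at a time and strips the trailing newline before applying the expression, so a pattern spanning `\n--\n` never has anything to match. A minimal sketch of a line-based alternative is below. It assumes that no real sequence line consists of exactly two dashes, and the scratch directory name `sed_demo` and the sample input are only for illustration.

```shell
# Scratch directory so the demonstration does not touch real data (hypothetical name).
sed_demo=$(mktemp -d)

# Sample input mimicking the grep output described above.
printf '%s\n' '>SpeciesA' 'ACGTGATCGATCGAT' '--' \
              '>SpeciesC' 'ACGTACGATCTTCAGT' '--' > "$sed_demo/query_subset.fasta"

# Delete every line that consists of exactly "--" and nothing else.
# The address /^--$/ is checked against each line in turn; d drops matching lines.
sed '/^--$/d' "$sed_demo/query_subset.fasta" > "$sed_demo/final.fasta"

cat "$sed_demo/final.fasta"
```

The same expression should work with both GNU sed and the BSD sed shipped with macOS, since it uses only a basic address and the `d` command.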
query_list.txt
SpeciesA
SpeciesC
initial_file.fasta
>SpeciesA
ACGTGATCGATCGAT
>SpeciesB
ACGGGTCTTAGTATCG
>SpeciesC
ACGTACGATCTTCAGT
>SpeciesD
ACGTTCAGTCAGTTCAG
query_subset.fasta
>SpeciesA
ACGTGATCGATCGAT
--
>SpeciesC
ACGTACGATCTTCAGT
--
I need this to be done on the command line, as I'm trying to build it into a script to automate some sample processing.
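For context, the double dash looks like the group separator that grep prints between non-adjacent blocks of output whenever context options such as -A are used. A command-line sketch of two ways to avoid it follows. The first relies on the GNU grep option `--no-group-separator`, so it assumes the `grep` on the PATH really is the GNU build. The second filters the separator lines afterwards and assumes no sequence line is exactly `--`. The scratch directory `grep_demo` is a hypothetical name, and the files are recreated from the example above.

```shell
# Scratch directory holding copies of the example files (hypothetical name).
grep_demo=$(mktemp -d)

printf '%s\n' 'SpeciesA' 'SpeciesC' > "$grep_demo/query_list.txt"
printf '%s\n' '>SpeciesA' 'ACGTGATCGATCGAT' \
              '>SpeciesB' 'ACGGGTCTTAGTATCG' \
              '>SpeciesC' 'ACGTACGATCTTCAGT' \
              '>SpeciesD' 'ACGTTCAGTCAGTTCAG' > "$grep_demo/initial_file.fasta"

# Option 1 (GNU grep only): ask grep not to print the separator at all.
grep -A1 --no-group-separator -f "$grep_demo/query_list.txt" \
    "$grep_demo/initial_file.fasta" > "$grep_demo/query_subset.fasta"

# Option 2 (any grep): keep the original command and drop lines that are exactly "--".
grep -A1 -f "$grep_demo/query_list.txt" "$grep_demo/initial_file.fasta" \
    | grep -v '^--$' > "$grep_demo/final.fasta"

cat "$grep_demo/query_subset.fasta"
```

Either form can be dropped into a script in place of the original grep line. Option 1 is shorter, while option 2 does not depend on which grep implementation is installed.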
Any input is greatly appreciated! Cheers, Tris