Grep output adds extra dashes and newlines

Posted 2019-07-28 18:05

Question:

I'm running GNU grep (installed with MacPorts) in bash on a Mac. I'm trying to query a FASTA file (DNA sequences, with the sequence ID on one line and the DNA sequence on the following line) with grep, to output a subset of the file based on a list of query strings. The list has one word per line, and I'm using the command

grep -A1 -f query_list.txt initial_file.fasta > query_subset.fasta

This almost produces the output I'm after, but in the output file there is a double dash on its own line after each sequence set that matches a string in the query file. I'm not sure why this is happening. I've tried removing them with sed

sed 's/\n--\n/\n' query_subset.fasta > final.fasta

but that doesn't work. If I use the same find and replace in TextWrangler, it works fine.

As an example, the files look like this:

query_list.txt

SpeciesA
SpeciesC

initial_file.fasta

>SpeciesA
ACGTGATCGATCGAT
>SpeciesB
ACGGGTCTTAGTATCG
>SpeciesC
ACGTACGATCTTCAGT
>SpeciesD
ACGTTCAGTCAGTTCAG

query_subset.fasta

>SpeciesA
ACGTGATCGATCGAT
--
>SpeciesC
ACGTACGATCTTCAGT
--

I need to do this on the command line because I'm building it into a script to automate some sample processing.

Any input is greatly appreciated! Cheers, Tris

Answer 1:

grep -A1 -f query_list.txt initial_file.fasta | sed '/^--/d' > final.fasta

or

grep -A1 -f query_list.txt initial_file.fasta | grep -v '^--' > final.fasta
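
Since the question mentions GNU grep, another possibility is to stop grep from printing the separator in the first place. The sketch below assumes GNU grep, where --no-group-separator is available as an extension (BSD grep as shipped with macOS may not accept it). The demo_ file names are made up for illustration and simply recreate the example data from the question.

```shell
# Recreate the example inputs from the question (demo_ names are hypothetical).
printf '%s\n' 'SpeciesA' 'SpeciesC' > demo_query_list.txt
printf '%s\n' \
    '>SpeciesA' 'ACGTGATCGATCGAT' \
    '>SpeciesB' 'ACGGGTCTTAGTATCG' \
    '>SpeciesC' 'ACGTACGATCTTCAGT' \
    '>SpeciesD' 'ACGTTCAGTCAGTTCAG' > demo_initial_file.fasta

# Assumption: GNU grep. --no-group-separator suppresses the "--" lines
# that -A/-B/-C context output normally places between groups.
grep -A1 --no-group-separator -f demo_query_list.txt demo_initial_file.fasta > demo_final.fasta

cat demo_final.fasta
```

With this option no second sed or grep pass is needed, at the cost of tying the script to GNU grep.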


Answer 2:

According to the man pages:

-A NUM, --after-context=NUM Print NUM lines of trailing context after matching lines. Places a line containing -- between contiguous groups of matches.
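
A small experiment can make the man page wording more concrete. As far as I can tell, the separator is only printed when there is a gap between two groups of output lines, so adjacent groups, or a single group, produce no -- line. The sketch below uses a made-up file name (demo_groups.fasta) holding the example data from the question.

```shell
# Example data from the question, written to a hypothetical demo file.
printf '%s\n' \
    '>SpeciesA' 'ACGTGATCGATCGAT' \
    '>SpeciesB' 'ACGGGTCTTAGTATCG' \
    '>SpeciesC' 'ACGTACGATCTTCAGT' \
    '>SpeciesD' 'ACGTTCAGTCAGTTCAG' > demo_groups.fasta

# Two groups with a gap between them (SpeciesB is skipped):
# a "--" line is printed between the groups.
grep -A1 -e 'SpeciesA' -e 'SpeciesC' demo_groups.fasta

# Two groups that touch each other: the output lines are consecutive
# in the input, so no "--" line is printed.
grep -A1 -e 'SpeciesA' -e 'SpeciesB' demo_groups.fasta

# One pattern with one match: a single group, so no "--" line.
grep -A1 -e 'SpeciesC' demo_groups.fasta
```

This would also account for a separate grep call per query word producing clean output: each call only ever prints one group.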

I tried the following and it worked:

cat query_list.txt | xargs -I {} grep -A1 {} initial_file.fasta > query_subset.fasta

I'm not quite sure why the input patterns are treated differently when they come from stdin, so it's probably better to just strip off the offending lines:

grep -A1 -f query_list.txt initial_file.fasta | grep -v '^--$' > query_subset.fasta
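
One caveat when stripping the separator: how tightly the pattern is anchored matters if the sequences can contain dashes, for example gap characters in an aligned FASTA file. The sketch below is only an illustration with made-up data and file names (demo_aligned.fasta and friends); it compares a pattern that matches -- anywhere on a line with one that matches only a line consisting of exactly --.

```shell
# Hypothetical aligned sequences containing gap characters, plus one
# grep-style "--" separator line in the middle.
printf '%s\n' \
    '>SpeciesA' 'ACG--TGATCGAT' \
    '--' \
    '>SpeciesC' '--ACGTACGATCTT' > demo_aligned.fasta

# Unanchored: removes every line containing "--", including both
# sequence lines, so only the two header lines survive.
grep -v -e '--' demo_aligned.fasta > demo_unanchored.fasta

# Anchored to the whole line: removes only the separator line and
# keeps both headers and both sequences.
grep -v '^--$' demo_aligned.fasta > demo_anchored.fasta

cat demo_unanchored.fasta
cat demo_anchored.fasta
```

For unaligned sequences made up only of A, C, G and T the two patterns give the same result, so this only matters when dashes can appear in the data.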


Tags: grep