I am having problems with grep and awk. I think it's because my input file contains text that looks like code.
The input file contains ID names and looks like this:
SNORD115-40
MIR432
RNU6-2
The reference file looks like this:
Ensembl Gene ID HGNC symbol
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000266661
ENSG00000243133
ENSG00000207447 RNU6-2
I want to match the ID names from my source file with my reference file and print out the corresponding ensg ID numbers so that the output file looks like this:
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2
I have tried this loop:
exec < source.file
while read line
do
grep -w $line reference.file > outputfile
done
I've also tried playing around with the reference file using awk:
awk 'NF == 2 {print $0}' reference.file
awk 'NF > 2 {print $0}' reference.file
but I only get one of the grep'd IDs.
Any suggestions or easier ways of doing this would be great.
$ fgrep -f source.file reference.file
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2
fgrep is equivalent to grep -F:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by
newlines, any of which is to be matched. (-F is specified by
POSIX.)
The -f option is for taking PATTERN from a file:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing. (-f is
specified by POSIX.)
As noted in the comments, this can produce false positives if an ID in reference.file contains an ID in source.file as a substring. You can construct a more definitive pattern for grep on the fly with sed:
grep -f <(sed 's/.*/ &$/' source.file) reference.file
But this way the patterns are interpreted as regular expressions rather than fixed strings, so an ID containing a regex metacharacter such as . or * could match unintended lines (although this may be OK if the IDs only contain alphanumeric characters). The better way, though (thanks to @sidharthcnadhan), is to use the -w option:
-w, --word-regexp
Select only those lines containing matches that form whole
words. The test is that the matching substring must either be
at the beginning of the line, or preceded by a non-word
constituent character. Similarly, it must be either at the end
of the line or followed by a non-word constituent character.
Word-constituent characters are letters, digits, and the
underscore.
So the final answer to your question is:
grep -Fwf source.file reference.file
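To see the difference between -F alone and -F combined with -w, here is a small sketch. The scratch files, the ID MIR4321 and the number ENSG00000000001 are made up for illustration and are not taken from the question's data:

```shell
# Scratch files in a temporary directory; names and contents are illustrative.
tmp=$(mktemp -d)
printf '%s\n' 'MIR432' > "$tmp/source.file"
printf '%s\n' 'ENSG00000207793 MIR432' 'ENSG00000000001 MIR4321' > "$tmp/reference.file"

# Fixed strings only: MIR432 is also found inside MIR4321.
plain=$(grep -F -f "$tmp/source.file" "$tmp/reference.file")

# Fixed strings plus whole-word matching: only the exact ID is reported.
whole=$(grep -Fwf "$tmp/source.file" "$tmp/reference.file")

printf 'grep -F:\n%s\n\ngrep -Fw:\n%s\n' "$plain" "$whole"
rm -r "$tmp"
```

Keep in mind that -w treats the hyphen as a non-word character, so a shorter ID (say, a hypothetical RNU6) would still match a reference line ending in RNU6-2. If that matters for your data, comparing whole fields is safer.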
This will do the trick:
$ awk 'NR==FNR{a[$0];next}$NF in a{print}' input reference
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2
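For anyone unfamiliar with the NR==FNR idiom, here is the same one-liner spread out with comments. This is only a sketch: the temporary files are recreated from the question's sample data, and the array name ids is my own choice:

```shell
# Recreate the sample data from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'SNORD115-40' 'MIR432' 'RNU6-2' > "$tmp/input"
printf '%s\n' 'Ensembl Gene ID HGNC symbol' \
    'ENSG00000199537 SNORD115-40' 'ENSG00000207793 MIR432' \
    'ENSG00000266661' 'ENSG00000243133' \
    'ENSG00000207447 RNU6-2' > "$tmp/reference"

matched=$(awk '
    # NR == FNR is true only while the first file is being read:
    # store each ID as an array key, then skip to the next line.
    NR == FNR { ids[$0]; next }
    # Second file: print the line if its last field is a stored ID.
    $NF in ids { print }
' "$tmp/input" "$tmp/reference")

printf '%s\n' "$matched"
rm -r "$tmp"
```

Because the last field is compared as a whole string, partial matches (one ID being a substring of another) are not reported.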
This was a nice bash-ish try. The problem is that you overwrite the result file on every iteration of the loop, so only the last match survives. Use >> instead of >, or move the > after done:
grep -w "$line" reference.file >> outputfile
or
done > outputfile
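Put together, the second variant might look like the sketch below. The sample files are recreated from the question in a scratch directory, and the quoting, read -r and the blank-line check are my additions rather than part of the original loop:

```shell
# Recreate the question's sample files in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'SNORD115-40' 'MIR432' 'RNU6-2' > "$tmp/source.file"
printf '%s\n' 'Ensembl Gene ID HGNC symbol' \
    'ENSG00000199537 SNORD115-40' 'ENSG00000207793 MIR432' \
    'ENSG00000266661' 'ENSG00000243133' \
    'ENSG00000207447 RNU6-2' > "$tmp/reference.file"

# Redirect once, after done, so every grep writes to the same open file.
while IFS= read -r line; do
    # Skip blank lines in the ID file.
    [ -n "$line" ] || continue
    grep -w -- "$line" "$tmp/reference.file"
done < "$tmp/source.file" > "$tmp/outputfile"

result=$(cat "$tmp/outputfile")
printf '%s\n' "$result"
rm -r "$tmp"
```

Redirecting after done also means the output file is truncated once per run, whereas >> keeps appending to whatever an earlier run left behind.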
But I would prefer Lev's solution as it starts an external process only once.
If you want to solve it in pure bash, you could try this:
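# Load the IDs into an array, one element per whitespace-separated word.
ID=($(<IDfile))
while read -r; do
    for id in "${ID[@]}"; do
        # Quoting "$id" makes it match literally; the line must end with the ID.
        [[ $REPLY =~ [[:space:]]"$id"$ ]] && echo "$REPLY" && break
    done
done <RefFile >outputfile
cat outputfile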
Output:
ENSG00000199537 SNORD115-40
ENSG00000207793 MIR432
ENSG00000207447 RNU6-2
Newer bash (4.0 and later) supports associative arrays, which can be used to simplify and speed up the lookup of a key:
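declare -A ID
# Store every ID as a key of the associative array.
for i in $(<IDfile); do ID[$i]=1; done
while read -r v; do
    # Capture the last field and print the line if that field is a known ID.
    [[ $v =~ [[:space:]]([^[:space:]]+)$ && ${ID[${BASH_REMATCH[1]}]} = 1 ]] && echo "$v"
done <RefFile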