Working in a Linux shell environment, how can I accomplish the following?
Text file 1 contains:
1
2
3
4
5
Text file 2 contains:
6
7
1
2
3
4
I need to extract the entries in file 2 which are not in file 1. So '6' and '7' in this example.
How do I do this from the command line?
Many thanks!
I was wondering which of the following solutions was the "fastest" for "larger" files:
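As a rough sketch only, the non-awk tools named in the summary below (grep -Fxf, comm, join) might be invoked as follows on the example from the question. The exact command lines, the temporary directory, and the file names file1/file2 are assumptions for illustration, not necessarily the commands that were benchmarked.

```shell
# Use byte-wise collation so sort, comm and join agree on ordering.
export LC_ALL=C

# Hypothetical inputs reproducing the example from the question.
dir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5   > "$dir/file1"
printf '%s\n' 6 7 1 2 3 4 > "$dir/file2"

# grep: -F fixed strings, -x whole-line match, -f read patterns from file1,
# -v invert the match, i.e. print lines of file2 that match none of them.
grep -vFxf "$dir/file1" "$dir/file2"

# comm and join both require sorted input.
sort "$dir/file1" > "$dir/file1.sorted"
sort "$dir/file2" > "$dir/file2.sorted"

# comm: -13 suppresses column 1 (only in file1) and column 3 (in both),
# leaving the lines found only in file2.
comm -13 "$dir/file1.sorted" "$dir/file2.sorted"

# join: -v 2 prints the lines of file2 that could not be paired with file1.
# Note: join compares only the first whitespace-separated field by default,
# so this matches the others only for single-field lines like these.
join -v 2 "$dir/file1.sorted" "$dir/file2.sorted"
```

Each of the three commands prints 6 and 7 for this input. The grep variant keeps the original order of file2, while comm and join emit the lines in sorted order.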
Results of my benchmarks in short:
- grep -Fxf: it's much slower (2-4 times in my tests).
- comm is slightly faster than join.
- On already-sorted files, comm and join are much faster than awk1 + awk2. (Of course, the awk solutions do not assume sorted files.)
- Real (wall-clock) run times are lower for comm, probably because it uses more threads. CPU times are lower for awk1 + awk2.

For the sake of brevity I omit full details. However, I assume that anyone interested can contact me or just repeat the tests. Roughly, the setup was
Typical results of fastest runs
BTW, for the awkies: It seems that a[$0]=1 is faster than a[$0]++, and (!($0 in a)) is faster than (!a[$0]). So, for an awk solution I suggest:
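A minimal sketch of an awk one-liner that applies both observations (plain assignment a[$0]=1 and the membership test !($0 in a)). The wrapper function name lines_only_in_second, the temporary directory, and the file names are hypothetical and only there to make the example self-contained.

```shell
# Hypothetical helper: print the lines of "$2" that do not occur in "$1".
lines_only_in_second() {
    # While reading the first file (FNR==NR), store each line as an array key.
    # For the second file, print a line only if it is not a key in the array.
    awk 'FNR==NR { a[$0]=1; next } !($0 in a)' "$1" "$2"
}

# Example data from the question.
tmp=$(mktemp -d)
printf '%s\n' 1 2 3 4 5   > "$tmp/file1"
printf '%s\n' 6 7 1 2 3 4 > "$tmp/file2"

lines_only_in_second "$tmp/file1" "$tmp/file2"   # prints 6 and 7
```

Using ($0 in a) tests for the key without creating it, whereas a[$0] creates an empty entry for every line looked up, so the array grows with file2 as well. One caveat of the FNR==NR idiom: if the first file is empty, the condition also holds for the second file and nothing is printed.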