Working in a Linux/shell environment, how can I accomplish the following:
text file 1 contains:
1
2
3
4
5
text file 2 contains:
6
7
1
2
3
4
I need to extract the entries in file 2 which are not in file 1. So '6' and '7' in this example.
How do I do this from the command line?
many thanks!
Using some lesser-known utilities:
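For example, assuming the two files are simply named file1 and file2 (comm needs sorted input, hence the sort step), a sketch:

    sort file1 > file1.sorted
    sort file2 > file2.sorted

    # print only the lines that appear in file2 but not in file1
    comm -1 -3 file1.sorted file2.sorted

For the example files this prints 6 and 7.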
This will output duplicates, so if 1 appears 3 times in file2 but only 2 times in file1, the unmatched occurrence of 1 will still be printed. If this is not what you want, pipe the output from sort through uniq before writing it to a file, as shown below. There are lots of utilities in the GNU coreutils package that allow for all sorts of text manipulations.
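A sketch of that deduplicated variant, with the same assumed file names:

    sort file1 | uniq > file1.sorted
    sort file2 | uniq > file2.sorted

    comm -1 -3 file1.sorted file2.sorted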
Here's another awk solution:
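For instance (a sketch; the file names file1 and file2 are assumed, and the exact one-liner may differ):

    # first pass (file1): remember every line; second pass (file2): print lines not remembered
    awk 'NR == FNR { seen[$0]; next } !($0 in seen)' file1 file2

Unlike the comm approach, this does not require the files to be sorted.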
How about:
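Based on the explanation below, a sketch of such a diff pipeline (assuming the files are named file_1 and file_2):

    # '>' marks lines present only in file_2; cut strips the two-character "> " prefix
    diff file_1 file_2 | grep '^>' | cut -c 3-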
This would print the entries in file_2 which are not in file_1. For the opposite result, one just has to replace '>' with '<'. 'cut' removes the first two characters added by 'diff', which are not part of the original content.
The files don't even need to be sorted.
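A sketch of the awk one-liner that the following explanation walks through (again assuming the files are named file1 and file2):

    awk 'FNR == NR { a[$0]++; next } !a[$0]' file1 file2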
Explanation of how the code works: while awk is reading file1 it records every line it sees; once it moves on to file2 it prints only the lines it has not recorded.

Explanation of details:

- FNR is the current file's record number
- NR is the current overall record number from all input files
- FNR==NR is true only when we are reading file1
- $0 is the current line of text
- a[$0] is a hash with the key set to the current line of text
- a[$0]++ tracks that we've seen the current line of text
- !a[$0] is true only when we have not seen the line text

with grep:
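For instance (a sketch; -F, -x and -v make grep match whole lines literally and invert the match, while -f reads the patterns from file1):

    # print the lines of file2 that do not match any line of file1
    grep -F -x -v -f file1 file2

A plain grep -v -f file1 file2 also works for this example, but without -F and -x it treats the lines of file1 as regular expressions and matches substrings.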
If you are really set on doing this from the command line, this site (search for "no duplicates found") has an awk example that searches for duplicates. It may be a good starting point to look at that. However, I'd encourage you to use Perl or Python for this. Basically, the flow of the program would be: read every entry of file 1 into a list, then walk through file 2 and print each entry that does not appear in that list.
This isn't the most elegant way of doing this, since it has an O(n^2) time complexity, but it will do the job.
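A minimal Python sketch of that flow (hypothetical file names file1 and file2; the linear scan of the list is what makes it O(n^2)):

    # read every entry of file 1 into a list
    entries_in_file1 = []
    with open("file1") as f1:
        for line in f1:
            entries_in_file1.append(line.rstrip("\n"))

    # print each entry of file 2 that is not in that list
    with open("file2") as f2:
        for line in f2:
            entry = line.rstrip("\n")
            if entry not in entries_in_file1:
                print(entry)

Swapping the list for a set would make the lookups constant-time, but the simple version does the job.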