extracting unique values between 2 sets/files

2019-01-16 10:40发布

Working in linux/shell env, how can I accomplish the following:

text file 1 contains:

1
2
3
4
5

text file 2 contains:

6
7
1
2
3
4

I need to extract the entries in file 2 which are not in file 1. So '6' and '7' in this example.

How do I do this from the command line?

many thanks!

7条回答
Anthone
2楼-- · 2019-01-16 10:57

Using some lesser-known utilities:

sort file1 > file1.sorted
sort file2 > file2.sorted
comm -1 -3 file1.sorted file2.sorted

This will output duplicates, so if there is 1 3 in file1, but 2 in file2, this will still output 1 3. If this is not what you want, pipe the output from sort through uniq before writing it to a file:

sort file1 | uniq > file1.sorted
sort file2 | uniq > file2.sorted
comm -1 -3 file1.sorted file2.sorted

There are lots of utilities in the GNU coreutils package that allow for all sorts of text manipulations.

查看更多
我命由我不由天
3楼-- · 2019-01-16 10:59

here's another awk solution

$ awk 'FNR==NR{a[$0]++;next}(!($0 in a))' file1 file2
6
7
查看更多
混吃等死
4楼-- · 2019-01-16 11:00

How about:

diff file_1 file_2 | grep '^>' | cut -c 3-

This would print the entries in file_2 which are not in file_1. For the opposite result one just has to replace '>' with '<'. 'cut' removes the first two characters added by 'diff', that are not part of the original content.

The files don't even need to be sorted.

查看更多
SAY GOODBYE
5楼-- · 2019-01-16 11:01
$ awk 'FNR==NR {a[$0]++; next} !a[$0]' file1 file2
6
7

Explanation of how the code works:

  • If we're working on file1, track each line of text we see.
  • If we're working on file2, and have not seen the line text, then print it.

Explanation of details:

  • FNR is the current file's record number
  • NR is the current overall record number from all input files
  • FNR==NR is true only when we are reading file1
  • $0 is the current line of text
  • a[$0] is a hash with the key set to the current line of text
  • a[$0]++ tracks that we've seen the current line of text
  • !a[$0] is true only when we have not seen the line text
  • Print the line of text if the above pattern returns true, this is the default awk behavior when no explicit action is given
查看更多
成全新的幸福
6楼-- · 2019-01-16 11:16

with grep:

grep -F -x -v -f file_1 file_2 
查看更多
冷血范
7楼-- · 2019-01-16 11:17

If you are really set on doing this from the command line, this site (search for "no duplicates found") has an awk example that searches for duplicates. It may be a good starting point to look at that.

However, I'd encourage you to use Perl or Python for this. Basically, the flow of the program would be:

findUniqueValues(file1, file2){
    contents1 = array of values from file1
    contents2 = array of values from file2
    foreach(value2 in contents2){
        found=false
        foreach(value1 in contents1){
            if (value2 == value1) found=true
        }
        if(!found) print value2
    }
}

This isn't the most elegant way of doing this, since it has a O(n^2) time complexity, but it will do the job.

查看更多
登录 后发表回答