Working in a Linux/shell environment, how can I accomplish the following:
text file 1 contains:
1
2
3
4
5
text file 2 contains:
6
7
1
2
3
4
I need to extract the entries in file 2 which are not in file 1. So '6' and '7' in this example.
How do I do this from the command line?
many thanks!
Using some lesser-known utilities:
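For example, assuming the two files are simply named file1 and file2 (comm needs sorted input, hence the sort step), a sketch:

    sort file1 > file1.sorted
    sort file2 > file2.sorted

    # print only the lines that appear in file2 but not in file1
    comm -1 -3 file1.sorted file2.sorted

For the example files this prints 6 and 7.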
This will output duplicates, so if 1 appears 3 times in file2 but only 2 times in file1, the unmatched occurrence of 1 will still be printed. If this is not what you want, pipe the output from sort through uniq before writing it to a file, as shown below. There are lots of utilities in the GNU coreutils package that allow for all sorts of text manipulations.
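A sketch of that deduplicated variant, with the same assumed file names:

    sort file1 | uniq > file1.sorted
    sort file2 | uniq > file2.sorted

    comm -1 -3 file1.sorted file2.sorted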
Here's another awk solution:
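For instance (a sketch; the file names file1 and file2 are assumed, and the exact one-liner may differ):

    # first pass (file1): remember every line; second pass (file2): print lines not remembered
    awk 'NR == FNR { seen[$0]; next } !($0 in seen)' file1 file2

Unlike the comm approach, this does not require the files to be sorted.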
How about:
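Based on the explanation below, a sketch of such a diff pipeline (assuming the files are named file_1 and file_2):

    # '>' marks lines present only in file_2; cut strips the two-character "> " prefix
    diff file_1 file_2 | grep '^>' | cut -c 3-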
This would print the entries in file_2 which are not in file_1. For the opposite result, one just has to replace '>' with '<'. 'cut' removes the first two characters added by 'diff', which are not part of the original content.
The files don't even need to be sorted.
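A sketch of the awk one-liner that the following explanation walks through (again assuming the files are named file1 and file2):

    awk 'FNR == NR { a[$0]++; next } !a[$0]' file1 file2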
Explanation of how the code works: while awk is reading file1 it records every line it sees; once it moves on to file2 it prints only the lines it has not recorded.

Explanation of details:

- FNR is the current file's record number
- NR is the current overall record number from all input files
- FNR==NR is true only when we are reading file1
- $0 is the current line of text
- a[$0] is a hash with the key set to the current line of text
- a[$0]++ tracks that we've seen the current line of text
- !a[$0] is true only when we have not seen the line text

with grep:
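For instance (a sketch; -F, -x and -v make grep match whole lines literally and invert the match, while -f reads the patterns from file1):

    # print the lines of file2 that do not match any line of file1
    grep -F -x -v -f file1 file2

A plain grep -v -f file1 file2 also works for this example, but without -F and -x it treats the lines of file1 as regular expressions and matches substrings.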
If you are really set on doing this from the command line, this site (search for "no duplicates found") has an awk example that searches for duplicates. It may be a good starting point to look at that. However, I'd encourage you to use Perl or Python for this. Basically, the flow of the program would be: read every entry of file 1 into a list, then walk through file 2 and print each entry that does not appear in that list.
This isn't the most elegant way of doing this, since it has an O(n^2) time complexity, but it will do the job.
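A minimal Python sketch of that flow (hypothetical file names file1 and file2; the linear scan of the list is what makes it O(n^2)):

    # read every entry of file 1 into a list
    entries_in_file1 = []
    with open("file1") as f1:
        for line in f1:
            entries_in_file1.append(line.rstrip("\n"))

    # print each entry of file 2 that is not in that list
    with open("file2") as f2:
        for line in f2:
            entry = line.rstrip("\n")
            if entry not in entries_in_file1:
                print(entry)

Swapping the list for a set would make the lookups constant-time, but the simple version does the job.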