I have two files, A (nodes_to_delete) and B (nodes_to_keep). Each file has many lines of numeric ids.

I want the list of numeric ids that are in nodes_to_delete but NOT in nodes_to_keep, i.e. the set difference A \ B (illustrated at http://mathworld.wolfram.com/images/equations/SetDifference/Inline1.gif).

Doing it within a PostgreSQL database is unreasonably slow. Any neat way to do it in bash using Linux CLI tools?

UPDATE: This would seem to be a Pythonic job, but the files are really, really large. I have solved some similar problems using uniq, sort and some set theory techniques. This was about two or three orders of magnitude faster than the database equivalents.

Tags: bash file-io set-difference

5 Answers

何必那么认真

#2 · 2019-01-30 22:38

Use comm. It compares two sorted files line by line.

Using the example setup described further down, the following command answers the OP's question. It prints the lines that are unique to deleteNodes, i.e. not in keepNodes:
comm -1 -3 <(sort keepNodes) <(sort deleteNodes)
Explanation: -1 hides the lines unique to keepNodes and -3 hides the lines common to both files, so only the lines unique to deleteNodes remain.
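
For the files named in the question, the same idea might look like the sketch below. It assumes one id per line; the sample ids, the scratch directory and the .sorted file names are made up for illustration. Each file is sorted once into its own file, and LC_ALL=C is set so that sort and comm agree on a byte-wise collation, which is usually also faster than locale-aware sorting.

```shell
# Work in a scratch directory so the sample files do not clobber real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: one numeric id per line, unsorted.
printf '%s\n' 17 4 100 23 > nodes_to_delete
printf '%s\n' 23 8 4 > nodes_to_keep

# Use the same byte-wise collation for sort and comm.
export LC_ALL=C

# Sort each file once; -u also drops duplicate ids.
sort -u nodes_to_delete > nodes_to_delete.sorted
sort -u nodes_to_keep > nodes_to_keep.sorted

# -2 hides ids unique to nodes_to_keep, -3 hides ids present in both,
# leaving only the ids unique to nodes_to_delete.
result=$(comm -23 nodes_to_delete.sorted nodes_to_keep.sorted)
printf '%s\n' "$result"
```

The result comes out in lexical order (100 before 17 here); pipe it through sort -n if numeric order matters.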

example setup

We'll use keepNodes and deleteNodes as unsorted input files.

$ cat > keepNodes <(echo bob; echo amber;)
$ cat > deleteNodes <(echo bob; echo ann;)

By default, without options, comm prints three columns:

unique_to_FILE1
    unique_to_FILE2
        lines_appear_in_both

This is a bare-bones example of comm without options. Note the three columns.

$ comm <(sort keepNodes) <(sort deleteNodes)
amber
    ann
        bob

Suppressing column output

Suppress column 1, 2 or 3 with -1, -2 or -3. Note that when a column is hidden, the remaining columns shift left to fill the gap.

$ comm -1 <(sort keepNodes) <(sort deleteNodes)
ann
    bob
$ comm -2 <(sort keepNodes) <(sort deleteNodes)
amber
    bob
$ comm -3 <(sort keepNodes) <(sort deleteNodes)
amber
    ann
$ comm -1 -3 <(sort keepNodes) <(sort deleteNodes)
ann
$ comm -2 -3 <(sort keepNodes) <(sort deleteNodes)
amber
$ comm -1 -2 <(sort keepNodes) <(sort deleteNodes)
bob

If you forget to sort the input, comm reports it:

comm: file 1 is not in sorted order
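
A related pitfall with numeric ids, sketched below with made-up values: comm expects the order that plain sort produces under the current locale, not the numeric order that sort -n produces. Sorting the inputs numerically can therefore trigger the same message.

```shell
# Hypothetical ids chosen so that numeric and lexical order differ.
ids=$(printf '%s\n' 9 10 2)

# Numeric order: 2, 9, 10. This is NOT the order comm expects.
numeric=$(printf '%s\n' "$ids" | LC_ALL=C sort -n)

# Lexical (byte-wise) order: 10, 2, 9. This is what comm expects
# when it runs under the same LC_ALL=C setting.
lexical=$(printf '%s\n' "$ids" | LC_ALL=C sort)

printf 'sort -n:\n%s\n' "$numeric"
printf 'sort:\n%s\n' "$lexical"
```

If the final list is wanted in numeric order, one option is to sort lexically for comm and apply sort -n only to its output.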

混吃等死

#3 · 2019-01-30 22:45

The comm command does that.

▲ chillily

#4 · 2019-01-30 22:45

Somebody showed me how to do exactly this in sh a couple of months ago, and then I couldn't find it for a while... and while looking I stumbled onto your question. Here it is:

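# Lines that appear in either file.
set_union () {
   sort "$1" "$2" | uniq
}

# Lines in $1 that are not in $2 (assumes $1 has no repeated lines).
set_difference () {
   sort "$1" "$2" "$2" | uniq -u
}

# Lines in exactly one file (assumes neither file has repeated lines).
set_symmetric_difference () {
   sort "$1" "$2" | uniq -u
}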
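
Why this works: uniq -u prints only the lines that occur exactly once in sorted input. Listing $2 twice guarantees that every line of $2 occurs at least twice, so only lines exclusive to $1 survive. That reasoning breaks if $1 itself contains a repeated line, because the repeat is dropped as well. Below is a sketch of a duplicate-tolerant variant; the function name set_difference_dedup and the sample files are made up for illustration.

```shell
# Duplicate-tolerant variant of the uniq -u trick (hypothetical helper name).
# $1 is de-duplicated first, so an id repeated inside $1 is not mistaken
# for an id shared with $2.
set_difference_dedup () {
    sort -u "$1" | sort - "$2" "$2" | uniq -u
}

# Made-up sample files in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 5 7 7 9 > "$workdir/a"   # 7 appears twice
printf '%s\n' 9 3 > "$workdir/b"

result=$(set_difference_dedup "$workdir/a" "$workdir/b")
printf '%s\n' "$result"
```

With the same sample files, the plain sort "$1" "$2" "$2" | uniq -u form would print only 5, because the repeated 7 is not unique after sorting.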

兄弟一词,经得起流年.

#5 · 2019-01-30 22:45

comm was specifically designed for this kind of use case, but it requires sorted input.

awk is arguably a better tool for this: finding the set difference is fairly straightforward, it doesn't require sorted input, and it offers additional flexibility.

awk 'NR == FNR { a[$0]; next } !($0 in a)' nodes_to_keep nodes_to_delete
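
How the one-liner works: NR == FNR is true only while awk reads the first file, so every line of nodes_to_keep becomes a key of the array. For the second file, a line is printed only if it is not a key. The sketch below spells this out with made-up ids in a scratch directory (the array is renamed keep for readability) and shows that the output keeps the original order of nodes_to_delete, duplicates included.

```shell
# Made-up sample files in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 23 8 4 > "$workdir/nodes_to_keep"
printf '%s\n' 17 4 100 23 17 > "$workdir/nodes_to_delete"

result=$(awk '
    NR == FNR { keep[$0]; next }   # first file: remember each id as a key
    !($0 in keep)                  # second file: print ids not remembered
' "$workdir/nodes_to_keep" "$workdir/nodes_to_delete")
printf '%s\n' "$result"
```

Two things to keep in mind: all ids of nodes_to_keep are held in memory, and if nodes_to_keep is empty, NR == FNR also holds for the second file, so nothing is printed.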

Perhaps, for example, you'd like to find the difference only among lines that represent non-negative integers:

awk -v r='^[0-9]+$' 'NR == FNR && $0 ~ r {
    a[$0]
    next
} $0 ~ r && !($0 in a)' nodes_to_keep nodes_to_delete

做自己的国王

#6 · 2019-01-30 22:54

Maybe you need a better way to do it in Postgres. I'd bet that you won't find a faster way to do it using flat files. You should be able to do a simple inner join, and assuming that both id columns are indexed, that should be very fast.

bash, Linux: Set difference between two text files
