I have two files A
-nodes_to_delete
and B
-nodes_to_keep
. Each file has a many lines with numeric ids.
I want to have the list of numeric ids that are in nodes_to_delete
but NOT in nodes_to_keep
, e.g. alt text http://mathworld.wolfram.com/images/equations/SetDifference/Inline1.gif.
Doing it within a PostgreSQL database is unreasonably slow. Any neat way to do it in bash using Linux CLI tools?
UPDATE: This would seem to be a Pythonic job, but the files are really, really large. I have solved some similar problems using uniq
, sort
and some set theory techniques. This was about two or three orders of magnitude faster than the database equivalents.
use
comm
- it will compare two sorted files line by lineexample setup
We'll use keepNodes and deleteNodes. They're are used as unsorted input.
By default without arguments, comm prints 3 columns
This is a barebones example of
comm
without arguments. Note the three columns.Suppressing column output
Suppress column 1, 2 or 3 with -N; note that when a column is hidden, the whitespace shrinks up.
It will fail gracefully when you forget to sort
comm: file 1 is not in sorted order
The comm command does that.
Somebody showed me how to do exactly this in sh a couple months ago, and then I couldn't find it for a while... and while looking I stumbled onto your question. Here it is :
comm
was specifically designed for this kind of use case, but it requires sorted input.awk
is arguably a better tool for this as it's fairly straight forward to find set difference, doesn't requiresort
, and offers additional flexibility.Perhaps, for example, you'd like to only find the difference in lines that represent non-negative numbers:
Maybe you need a better way to do it in postgres, I can pretty much bet that you won't find a faster way to do it using flat files. You should be able to do a simple inner join and assuming that both id cols are indexed that should be very fast.