I need to compare two directory structures with around one billion files each (directory depth up to 20 levels).
I found the usual diff -r /location/one /location/two too slow.
Is there a multithreaded implementation of diff? Or is it doable by combining shell scripting with diff? If so, how?
Your disk is gonna be the bottleneck.
Unless you are working on tmpfs, you will probably only lose speed. That said:
find . -mindepth 1 -maxdepth 1 -type d -print0 |
xargs -0 -P4 -I DIRNAME diff -EwburqN "DIRNAME/" "/tmp/othertree/DIRNAME/"
should do a pretty decent job of comparing trees (in this case . to /tmp/othertree).
It has a flaw right now, in that it won't detect top-level directories in othertree that don't exist in the current directory (.). I leave that as an exercise for the reader, though you could easily repeat the comparison in reverse.
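For the missing-directory flaw, a minimal sketch (not part of the original command) is to compare just the top-level directory names of both trees with comm before running any diff. The function names list_top_dirs and only_in_one_tree are made up for illustration, and the sketch assumes directory names contain no newlines and that find supports -mindepth/-maxdepth (GNU and BSD find do).

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not from any tool.
# Assumption: directory names contain no newline characters.

# Print the names of the immediate subdirectories of "$1", sorted bytewise.
list_top_dirs() {
    ( cd "$1" && find . -mindepth 1 -maxdepth 1 -type d |
        sed 's|^\./||' | LC_ALL=C sort )
}

# Report top-level directories that exist in only one of the two trees.
only_in_one_tree() {
    a_list=$(mktemp) && b_list=$(mktemp) || return 1
    list_top_dirs "$1" > "$a_list"
    list_top_dirs "$2" > "$b_list"
    # comm -23: lines unique to the first list; comm -13: unique to the second.
    LC_ALL=C comm -23 "$a_list" "$b_list" |
        while IFS= read -r d; do printf 'only in %s: %s\n' "$1" "$d"; done
    LC_ALL=C comm -13 "$a_list" "$b_list" |
        while IFS= read -r d; do printf 'only in %s: %s\n' "$2" "$d"; done
    rm -f "$a_list" "$b_list"
}

# Usage (paths are placeholders):
#   only_in_one_tree . /tmp/othertree
```

This only looks at the first level, so it is cheap even on huge trees; anything deeper is still left to diff itself.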
The argument -P4 to xargs specifies that you want at most 4 concurrent processes.
Also have a look at the xjobs utility, which does a better job of separating the output. I think with GNU xargs (as shown) you cannot drop the -q option, because the output of the concurrent diffs would get intermixed.
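If you do want full diffs without interleaved output, one possible workaround is to let every job write to its own file. The following is a rough sketch under a few assumptions: parallel_tree_diff is a hypothetical name, the output directory layout (one NAME.diff file per top-level directory) is my own choice, and find/xargs need to support -print0, -0 and -P (GNU and BSD versions do). The whitespace-related flags from the command above are left out for brevity.

```shell
#!/bin/sh
# Hypothetical helper: diff each top-level directory of SRC against DST
# in parallel, writing one output file per directory so nothing interleaves.
# Usage: parallel_tree_diff SRC DST OUTDIR [JOBS]
parallel_tree_diff() {
    src=$1 dst=$2 out=$3 jobs=${4:-4}
    mkdir -p "$out" || return 1
    # List top-level directories relative to SRC, NUL-separated.
    ( cd "$src" && find . -mindepth 1 -maxdepth 1 -type d -print0 ) |
    xargs -0 -P "$jobs" -n 1 sh -c '
        # $1=SRC  $2=DST  $3=OUTDIR  $4=directory name appended by xargs
        name=${4#./}
        diff -urN "$1/$name" "$2/$name" > "$3/$name.diff"
        # diff exits 1 when it finds differences; only >1 is a real error.
        [ $? -le 1 ]
    ' sh "$src" "$dst" "$out"
}

# Usage (paths are placeholders):
#   parallel_tree_diff . /tmp/othertree /tmp/diff-out 4
```

An empty NAME.diff then means that subtree matched. As with the one-liner, this does not cover files sitting directly in the top-level directory, and the disk is still likely to be the limiting factor.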