diff folders recursively vs. multithreading

Posted 2019-05-12 16:47

Question:

I need to compare two directory structures with around one billion files each (directory depth up to 20 levels).

I found the usual diff -r /location/one /location/two to be slow.

Is there a multithreaded implementation of diff? Or is it doable by combining shell tools with diff? If so, how?

Answer 1:

Your disk is gonna be the bottleneck.

Unless you are working on tmpfs, you will probably only lose speed. That said:

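# -mindepth 1 keeps . itself out of the list; otherwise one job would diff the whole tree.
# Files that sit directly in the top level are therefore not compared here.
find . -mindepth 1 -maxdepth 1 -type d -print0 |
    xargs -0 -P4 -I DIRNAME diff -EwburqN "DIRNAME/" "/tmp/othertree/DIRNAME/"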

should do a pretty decent job of comparing trees (in this case . to /tmp/othertree).

It has a flaw right now, in that it won't detect top-level directories in othertree that don't exist in the current directory (.). I leave that as an exercise for the reader, though you could easily repeat the comparison in reverse.
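One way to cover that gap without a second full comparison is to list the top-level directories that exist in the other tree but not in the first. This is only a minimal sketch: missing_top_dirs is a hypothetical helper name, it relies on the same find extensions (-mindepth, -maxdepth) as the command above, and it assumes directory names contain no newlines.

```shell
# Hypothetical helper: print the names of top-level directories that exist
# in tree $2 but have no counterpart in tree $1.
# Assumes directory names do not contain newlines.
missing_top_dirs() {
    find "$2" -mindepth 1 -maxdepth 1 -type d | while IFS= read -r d; do
        name=${d##*/}                       # strip everything up to the last slash
        [ -d "$1/$name" ] || printf '%s\n' "$name"
    done
}
```

For the example above that would be missing_top_dirs . /tmp/othertree. Swapping the two arguments reports the directories that exist only in the first tree.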

The argument -P4 to xargs specifies that you want at most 4 concurrent processes.

Also have a look at the xjobs utility, which does a better job of separating the output. I think with GNU xargs (as shown) you cannot drop the -q option, because the output of the concurrent diffs would get intermixed (?).
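If you want full diffs while staying with xargs, one possible workaround is to send each job's output to its own file, so that concurrent diffs have nothing shared to interleave on. The following is a sketch under assumptions, not a tested recipe for a billion files: parallel_tree_diff and its argument order are made up for illustration, and it needs an xargs with -0, -P and -I (GNU or BSD).

```shell
# Hypothetical helper: diff each top-level directory of tree $1 against its
# counterpart in tree $2, running up to $4 diffs at once (default 4).
# Each job writes to its own file under $3, so outputs cannot interleave.
parallel_tree_diff() {
    src=$1 other=$2 outdir=$3 jobs=${4:-4}
    mkdir -p "$outdir" || return 1
    ( cd "$src" && find . -mindepth 1 -maxdepth 1 -type d -print0 ) |
        xargs -0 -P "$jobs" -I DIRNAME sh -c '
            name=${1#./}
            diff -urN "$2/$name" "$3/$name" > "$4/$name.diff"
            # diff exits 1 when the trees differ; only >1 is a real error
            [ $? -le 1 ]
        ' sh DIRNAME "$src" "$other" "$outdir"
}
```

Called as parallel_tree_diff . /tmp/othertree /tmp/diff-out 4, this leaves one .diff file per top-level directory in /tmp/diff-out (a made-up output location); an empty file means no differences were found in that directory. The directory name is passed as a positional argument rather than spliced into the sh -c script, so unusual names are not interpreted as shell code.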