I have two files; let's call them md5s1.txt and md5s2.txt. Both contain the output of a
find . -type f -print0 | xargs -0 md5sum | sort > md5s.txt
command run in different directories. Many files were renamed, but their content stayed the same, so they should have the same md5sum. I want to generate a diff like
diff md5s1.txt md5s2.txt
but it should compare only the first 32 characters of each line, i.e. only the md5sum, not the filename. Lines with equal md5sum should be considered equal. The output should be in normal diff format.
Easy starter:
diff <(cut -d' ' -f1 md5s1.txt) <(cut -d' ' -f1 md5s2.txt)
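For example, with two small made-up input files (the hashes and file names below are purely illustrative), renamed-but-identical content compares equal and only the genuinely different checksum shows up:

```shell
# Two hypothetical checksum lists, already sorted as in the question.
printf '%s\n' \
  '5d41402abc4b2a76b9719d911017c592  a/hello.txt' \
  'd41d8cd98f00b204e9800998ecf8427e  a/empty.txt' > md5s1.txt
printf '%s\n' \
  '6f5902ac237024bdd0c176cb93063dc4  b/other.txt' \
  'd41d8cd98f00b204e9800998ecf8427e  b/renamed-empty.txt' > md5s2.txt

# Compare only the checksum column; output is in normal diff format.
diff <(cut -d' ' -f1 md5s1.txt) <(cut -d' ' -f1 md5s2.txt)
# 1c1
# < 5d41402abc4b2a76b9719d911017c592
# ---
# > 6f5902ac237024bdd0c176cb93063dc4
```

The renamed empty file (same hash on both sides) produces no diff output, as desired.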
Also, consider just
diff -EwburqN folder1/ folder2/
which compares the directory trees directly: recursively, reporting only which files differ, ignoring whitespace changes, and treating files present on only one side as empty.
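One caveat, shown here with throwaway directories (all names below are made up): this compares files by path, not by checksum, so a renamed file is reported as a differing pair rather than being matched by content.

```shell
# Two toy directories whose only file has identical content but a new name.
mkdir -p folder1 folder2
echo 'same content' > folder1/a.txt
echo 'same content' > folder2/renamed.txt   # identical content, new name
diff -EwburqN folder1/ folder2/
# reports both a.txt and renamed.txt as differing (the rename is not detected)
```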
Compare only the md5 column by running diff on <(cut -c -32 md5sums.sort.XXX), and tell diff to print just the line numbers of added or removed lines, using --old/new-line-format='%dn'$'\n'. Pipe this into ed md5sums.sort.XXX so it prints only those lines from the md5sums.sort.XXX file.
diff \
--new-line-format='%dn'$'\n' \
--old-line-format='' \
--unchanged-line-format='' \
<(cut -c -32 md5sums.sort.old) \
<(cut -c -32 md5sums.sort.new) \
| ed -s md5sums.sort.new \
> files-added
diff \
--new-line-format='' \
--old-line-format='%dn'$'\n' \
--unchanged-line-format='' \
<(cut -c -32 md5sums.sort.old) \
<(cut -c -32 md5sums.sort.new) \
| ed -s md5sums.sort.old \
> files-removed
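As a sanity check, here is a toy run of the files-added half with made-up three-character "hashes" (so the cut width is 3 rather than 32); the -s flag suppresses the byte count ed would otherwise print into the output:

```shell
# Hypothetical sorted checksum lists: hash "bbb" was removed, "ccc" was added.
printf '%s\n' 'aaa  old/a.txt' 'bbb  old/b.txt' > md5sums.sort.old
printf '%s\n' 'aaa  new/a.txt' 'ccc  new/c.txt' > md5sums.sort.new

# diff emits the new-file line numbers of added hashes; ed prints those lines.
diff \
    --new-line-format='%dn'$'\n' \
    --old-line-format='' \
    --unchanged-line-format='' \
    <(cut -c -3 md5sums.sort.old) \
    <(cut -c -3 md5sums.sort.new) \
| ed -s md5sums.sort.new
# ccc  new/c.txt
```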
The problem with ed is that it loads the entire file into memory, which can be a problem if you have a lot of checksums. Instead of piping the output of diff into ed, pipe it into the following command, which uses much less memory.
diff … | (
lnum=0;
while read -r lprint; do
while [ "$lnum" -lt "$lprint" ]; do IFS= read -r line <&3; ((lnum++)); done;
echo "$line";
done
) 3<md5sums.sort.XXX
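If you would rather not manage the fd-3 bookkeeping by hand, the same line selection can be sketched with awk, which keeps only the (small) list of line numbers in memory while streaming the big file. The file names below are illustrative stand-ins; in the pipeline above you would pass - (stdin) as awk's first file instead of linenums.

```shell
# Stand-ins for the diff output (line numbers) and the big checksum file.
printf '2\n4\n' > linenums
printf '%s\n' l1 l2 l3 l4 > big.txt

# First pass (NR==FNR) records the wanted line numbers; second pass prints
# only those lines of the big file.
awk 'NR==FNR { want[$1]; next } FNR in want' linenums big.txt
# l2
# l4
```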
If you are looking for duplicate files, fdupes can do this for you (it takes the directories to scan as arguments):
$ fdupes --recurse folder1/ folder2/
On Ubuntu you can install it with
$ sudo apt-get install fdupes