I need to work with large files and must find the differences between two of them. I don't need the differing content itself, just the number of differences.
To find the number of differing rows I came up with:
diff --suppress-common-lines --speed-large-files -y File1 File2 | wc -l
It works, but is there a better way to do it? And how can I count the exact number of differences (with standard tools like bash, diff, awk, sed, or some old version of perl)?
If using Linux/Unix, what about
comm -23 file1 file2
to print lines in file1 that aren't in file2,
comm -23 file1 file2 | wc -l
to count them, and similarly comm -13 ... for lines that are only in file2? Note that comm expects both files to be sorted.
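A quick demonstration of the comm approach; the file names and contents below are invented for the demo:

```shell
# Sample sorted files (contents are made up); comm requires sorted input.
printf 'a\nb\nc\n' > file1
printf 'b\nc\nd\n' > file2
only_in_1=$(comm -23 file1 file2 | wc -l)   # lines only in file1 ("a")
only_in_2=$(comm -13 file1 file2 | wc -l)   # lines only in file2 ("d")
echo "$only_in_1 $only_in_2"
```

If the files aren't already sorted, run them through sort first; comm's output is meaningless on unsorted input.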
Since every output line that differs starts with a < or > character, I would suggest this:
diff file1 file2 | grep "^[<>]" | wc -l
By using only ^< or ^> in the pattern you can count differences in only one of the files. If you want to count the number of lines that are different, use this:
diff file1 file2 | grep "^<" | wc -l
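A runnable sketch of the grep-based counts (file names and contents are invented for the demo):

```shell
# Sample files (contents are made up for illustration).
printf 'a\nb\nc\n' > old.txt
printf 'a\nx\nc\nd\n' > new.txt
# Every differing line, from both files ("< b", "> x", "> d"):
total=$(diff old.txt new.txt | grep -c '^[<>]')
# Only lines coming from the first file ("< b"):
left=$(diff old.txt new.txt | grep -c '^<')
echo "$total $left"   # -> 3 1
```

grep -c counts matching lines itself, so the trailing wc -l can be dropped.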
Doesn't John's answer double count the different lines?
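It does for changed lines: a single modified line appears in diff output as both a < line and a > line, so counting both sides yields 2 for one differing line. A minimal demonstration (file names are invented):

```shell
# One changed line shows up on both sides of the diff output.
printf 'same\nold\n' > a.txt
printf 'same\nnew\n' > b.txt
both=$(diff a.txt b.txt | grep -c '^[<>]')   # counts the change twice
one=$(diff a.txt b.txt | grep -c '^<')       # counts it once
echo "$both $one"   # -> 2 1
```

Counting only one side avoids the double count for changed lines, but misses lines that exist only in the other file.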
If you're dealing with files with analogous content that should stay aligned line-for-line (like CSV files describing similar things), and you want to find, say, the 2 differences between two such files, you could implement the comparison line-by-line in Python.
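A minimal sketch of such a line-by-line comparison in Python; the function name, file names, and the choice to count extra trailing lines as differences are my own assumptions:

```python
from itertools import zip_longest

def count_line_diffs(path1, path2):
    """Return the number of line positions where the two files differ.

    Extra lines in the longer file each count as one difference.
    """
    with open(path1) as f1, open(path2) as f2:
        return sum(1 for a, b in zip_longest(f1, f2) if a != b)
```

Since lines are compared by position, this only makes sense when both files are sorted (or otherwise aligned) the same way.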
I believe the correct solution is in this answer, that is:
diff -u File1 File2 | grep "^[+-]" | wc -l
minus 2 for the two file-name lines at the top of the diff listing. Unified format is probably a bit faster than side-by-side format.
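A runnable sketch of the unified-diff approach, with the subtraction of the two header lines done in shell arithmetic (file names and contents are invented):

```shell
# Sample files (contents are made up for illustration).
printf 'a\nb\nc\n' > left.txt
printf 'a\nx\nc\nd\n' > right.txt
# grep -c matches the --- and +++ header lines too, hence the - 2.
raw=$(diff -u left.txt right.txt | grep -c '^[+-]')
count=$((raw - 2))
echo "$count"   # -> 3
```

Here one changed line still contributes both a - and a + line, so 3 covers the change to b plus the added d.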