I have two files (let's say a.txt and b.txt), both of which have a list of names. I have already run sort on both the files.
Now I want to find lines from a.txt which are not present in b.txt.
(I spent a lot of time finding the answer to this question, so I am documenting it for future reference.)
I am not sure why it has been said that diff should not be used. I would use it to compare the two files and then output only the lines that are in the left file but not in the right one. diff flags such lines with < so it suffices to grep for that symbol at the beginning of the line.
If the files are not sorted yet, you can sort them first and compare the sorted output.
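A minimal sketch of the diff approach described above, assuming made-up file contents. The names in the sample data and the a.sorted / b.sorted files are hypothetical; the sorted copies are written to files so the sketch runs in any POSIX shell, though in bash process substitution would do the same job.

```shell
# Hypothetical sample data (not from the original post).
workdir=$(mktemp -d)
printf 'carol\nalice\nbob\n' > "$workdir/a.txt"
printf 'dave\nbob\n' > "$workdir/b.txt"

# Sort both inputs so diff sees them in the same order.
sort "$workdir/a.txt" > "$workdir/a.sorted"
sort "$workdir/b.txt" > "$workdir/b.sorted"

# Keep only the lines diff marks with "<" (present in the left file only),
# then strip the "< " marker.
only_in_a=$(diff "$workdir/a.sorted" "$workdir/b.sorted" | grep '^<' | sed 's/^< //')
printf '%s\n' "$only_in_a"
# prints:
# alice
# carol

rm -rf "$workdir"
```

The sed step is optional; without it each reported line keeps its leading "< " marker.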
The simple answer did not work for me because I didn't realize that comm matches line for line, so duplicate lines in one file will be printed as not existing in the other. For example, if file1 contained:
And file2 contained:
Then comm -13 file1 file2 would output:
In my case, I wanted to know only that every string in file2 existed in file1, regardless of how many times that line occurred in each file.
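To make the line-for-line matching concrete, here is a small sketch. The file contents below are invented for illustration and are not the ones from the original post.

```shell
# Hypothetical contents: file2 has "apple" twice, file1 only once.
workdir=$(mktemp -d)
printf 'apple\nbanana\n' > "$workdir/file1"
printf 'apple\napple\nbanana\n' > "$workdir/file2"

# -13 suppresses the "only in file1" and "in both" columns,
# leaving lines that comm considers unique to file2.
only_in_file2=$(comm -13 "$workdir/file1" "$workdir/file2")
printf '%s\n' "$only_in_file2"
# prints:
# apple

rm -rf "$workdir"
```

The second "apple" in file2 has no partner line in file1, so comm reports it even though the string itself does occur in file1.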
Solution 1: use the -u (unique) flag to sort:
comm -13 <(sort -u file1) <(sort -u file2)
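A sketch of Solution 1 on invented data. The file contents and the file1.uniq / file2.uniq names are assumptions, and the deduplicated copies are written to files instead of using process substitution so the snippet also runs in a plain POSIX shell.

```shell
# Hypothetical contents: duplicates in file2, plus one string missing from file1.
workdir=$(mktemp -d)
printf 'apple\nbanana\n' > "$workdir/file1"
printf 'apple\napple\nbanana\ncherry\ncherry\n' > "$workdir/file2"

# sort -u sorts and drops repeated lines in one step.
sort -u "$workdir/file1" > "$workdir/file1.uniq"
sort -u "$workdir/file2" > "$workdir/file2.uniq"

# Only strings that never occur in file1 remain, each listed once.
missing=$(comm -13 "$workdir/file1.uniq" "$workdir/file2.uniq")
printf '%s\n' "$missing"
# prints:
# cherry

rm -rf "$workdir"
```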
Solution 2: (the first "working" answer I found) from unix.stackexchange:
fgrep -v -f file1 file2
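A sketch of the same idea with grep -F, which is the current spelling of fgrep. The sample data is invented, and the -x flag is my addition rather than part of the command above: without it, patterns match as substrings, so a line in file2 is dropped whenever it merely contains a line from file1.

```shell
# Hypothetical contents: "anna" in file2 contains the string "ann" from file1.
workdir=$(mktemp -d)
printf 'ann\nbob\n' > "$workdir/file1"
printf 'anna\nbob\ncarl\n' > "$workdir/file2"

# Substring matching: "anna" is filtered out because it contains "ann".
substring_result=$(grep -F -v -f "$workdir/file1" "$workdir/file2")
printf '%s\n' "$substring_result"
# prints:
# carl

# Whole-line matching with -x: only exact duplicates of file1 lines are removed.
whole_line_result=$(grep -F -x -v -f "$workdir/file1" "$workdir/file2")
printf '%s\n' "$whole_line_result"
# prints:
# anna
# carl

rm -rf "$workdir"
```

Unlike comm, this does not require either file to be sorted.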
Note that if file2 contains duplicate lines that don't exist at all in file1, fgrep will output each of the duplicate lines. Also note that my totally non-scientific tests on a single laptop for a single (fairly large) dataset showed Solution 1 (using comm) to be almost 5 times faster than Solution 2 (using fgrep).
The command you have to use is not diff but comm:
comm -23 a.txt b.txt
By default, comm outputs 3 columns: left-only, right-only, both. The -1, -2 and -3 switches suppress these columns.
So, -23 hides the right-only and both columns, showing the lines that appear only in the first (left) file.
If you want to find lines that appear in both, you can use -12, which hides the left-only and right-only columns, leaving you with just the both column.
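The column switches can be tried out on a pair of small files. This is only a sketch: the names in a.txt and b.txt are made up, and both files are written already sorted, which comm expects.

```shell
# Hypothetical, already sorted sample data.
workdir=$(mktemp -d)
printf 'alice\nbob\ncarol\n' > "$workdir/a.txt"
printf 'bob\ncarol\ndave\n' > "$workdir/b.txt"

# -23: hide right-only and both, keep lines only in a.txt.
only_a=$(comm -23 "$workdir/a.txt" "$workdir/b.txt")
printf '%s\n' "$only_a"
# prints:
# alice

# -13: hide left-only and both, keep lines only in b.txt.
only_b=$(comm -13 "$workdir/a.txt" "$workdir/b.txt")
printf '%s\n' "$only_b"
# prints:
# dave

# -12: hide left-only and right-only, keep lines present in both files.
both=$(comm -12 "$workdir/a.txt" "$workdir/b.txt")
printf '%s\n' "$both"
# prints:
# bob
# carol

rm -rf "$workdir"
```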