I have two large files (sets of filenames), with roughly 30,000 lines in each. I am trying to find a fast way of finding lines in file1 that are not present in file2.
For example, if this is file1:
line1
line2
line3
And this is file2:
line1
line4
line5
Then my result/output should be:
line2
line3
This works:
grep -v -f file2 file1
But it is very, very slow when used on my large files.
I suspect there is a good way to do this using diff, but the output should be just the lines and nothing else, and I cannot seem to find a switch for that.
Can anyone help me find a fast way of doing this, using bash and basic Linux binaries?
EDIT: To follow up on my own question, this is the best way I have found so far using diff:
diff file2 file1 | grep '^>' | sed 's/^>\ //'
Surely, there must be a better way?
The way I usually do this is using the --suppress-common-lines flag, though note that this only works if you do it in side-by-side format:

diff -y --suppress-common-lines file1.txt file2.txt
The comm command (short for "common") may be useful:
comm - compare two sorted files line by line
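For example, to keep only the lines unique to file1 (a sketch; -2 and -3 suppress the columns of lines unique to file2 and lines common to both, and comm expects sorted input, hence the sort calls):

comm -23 <(sort file1) <(sort file2)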
The man page is actually quite readable for this.

I found that for me using a normal for loop with an if statement worked perfectly, along the lines of the sketch below.
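A sketch of that kind of loop, assuming it reads file1 and checks each line against file2 with grep (-x matches whole lines, -F treats the pattern literally, -q suppresses output):

while IFS= read -r line; do
    if ! grep -qxF -- "$line" file2; then
        printf '%s\n' "$line"
    fi
done < file1

Note that this re-scans file2 once per line of file1, so it will not be fast on 30,000-line inputs.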
You can use Python:
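A minimal sketch of such a script, assuming the idea is to build a set from file2 and then stream file1 against it (roughly linear in the total number of lines):

#!/usr/bin/env python3
# Collect the lines of file2 into a set for O(1) membership tests.
with open("file2") as f:
    exclude = {line.rstrip("\n") for line in f}

# Print every line of file1 that is not in that set.
with open("file1") as f:
    for line in f:
        line = line.rstrip("\n")
        if line not in exclude:
            print(line)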
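join can also report the lines of one file that have no match in the other; presumably something along these lines, assuming GNU join and sorted input (-v 1 prints the unpairable lines from the first file, and an empty argument to -t makes join treat the whole line as the join field):

join -v 1 -t '' <(sort file1) <(sort file2)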
The -t makes sure that it compares the whole line, if you had a space in some of the lines.

If you're short of "fancy tools", e.g. in some minimal Linux distribution, there is a solution with just cat, sort and uniq:
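Presumably the idea is something like the following sketch: append file2 twice, so that any line present in file2 occurs at least twice in the combined stream, and uniq -u (print only lines that occur exactly once) then keeps just the lines unique to file1.

cat file1 file2 file2 | sort | uniq -u

(Note that duplicate lines within file1 itself would also be dropped by this approach.)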
Test:
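For example, recreating the two sample files from the question:

printf 'line1\nline2\nline3\n' > file1
printf 'line1\nline4\nline5\n' > file2
cat file1 file2 file2 | sort | uniq -u

This should print line2 and line3, matching the expected output above.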
This is also relatively fast, compared to grep.