I have two large files (sets of filenames), with roughly 30,000 lines in each. I am trying to find a fast way of finding lines in file1 that are not present in file2.
For example, if this is file1:
line1
line2
line3
And this is file2:
line1
line4
line5
Then my result/output should be:
line2
line3
This works:
grep -v -f file2 file1
But it is very, very slow when used on my large files.
I suspect there is a good way to do this using diff(), but the output should be just the lines, nothing else, and I cannot seem to find a switch for that.
Can anyone help me find a fast way of doing this, using bash and basic linux binaries?
EDIT: To follow up on my own question, this is the best way I have found so far using diff():
diff file2 file1 | grep '^>' | sed 's/^>\ //'
Surely, there must be a better way?
You can achieve this by controlling the formatting of the old/new/unchanged lines in GNU diff output; see the command sketched below. The input files should be sorted for this to work. With bash (and zsh) you can sort in-place with process substitution <( ). In the commands below, new and unchanged lines are suppressed, so only changed lines (i.e. the removed lines, in your case) are output. You may also use a few diff options that other solutions don't offer, such as -i to ignore case, or various whitespace options (-E, -b, -w, etc.) for less strict matching.
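A minimal sketch of the command described above, assuming GNU diff (only the old-line format is left at its default, so just the lines unique to file1 are printed):
diff --new-line-format="" --unchanged-line-format="" file1 file2
And with the inputs sorted in place via process substitution:
diff --new-line-format="" --unchanged-line-format="" <(sort file1) <(sort file2)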
Explanation
The options --new-line-format, --old-line-format and --unchanged-line-format let you control the way diff formats the differences, similar to printf format specifiers. These options format new (added), old (removed) and unchanged lines respectively. Setting one to empty "" prevents output of that kind of line. If you are familiar with unified diff format, you can partly recreate it with:
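For example (a sketch, assuming GNU diff):
diff --old-line-format="-%L" --unchanged-line-format=" %L" --new-line-format="+%L" file1 file2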
The %L specifier is the line in question, and we prefix each with "+", "-" or " ", like diff -u (note that it only outputs differences; it lacks the ---, +++ and @@ lines at the top of each grouped change). You can also use this to do other useful things, like numbering each line with %dn.
The diff method (along with the other suggestions, comm and join) only produces the expected output with sorted input, though you can use <(sort ...) to sort in place. Here's a simple awk (nawk) script (inspired by the scripts linked to in Konsolebox's answer) which accepts arbitrarily ordered input files and outputs the missing lines in the order they occur in file1.
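A sketch of such a script, assuming gawk or nawk (the array names ll1 and ss2 follow the description below):
gawk '
# first file (file1): store every line, indexed by line number
NR == FNR { ll1[FNR] = $0; nl1 = FNR; next }
# second file (file2): remember each line content as an array key
{ ss2[$0]++ }
END {
    # print the file1 lines whose content never appeared in file2
    for (ll = 1; ll <= nl1; ll++)
        if (!(ll1[ll] in ss2)) print ll1[ll]
}
' file1 file2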
This stores the entire contents of file1, line by line, in a line-number indexed array ll1[], and the entire contents of file2, line by line, in a line-content indexed associative array ss2[]. After both files are read, iterate over ll1 and use the in operator to determine if the line from file1 is present in file2. (This will have different output to the diff method if there are duplicates.) In the event that the files are sufficiently large that storing them both causes a memory problem, you can trade CPU for memory by storing only file1 and deleting matches along the way as file2 is read.
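A sketch of that memory-saving variant (again assuming gawk or nawk; ll1 and ss1 are as described in the next paragraph):
gawk '
# file1: store each line twice, indexed by line number and by content
NR == FNR { ll1[FNR] = $0; ss1[$0] = FNR; nl1 = FNR; next }
# file2: delete any matching file1 line as soon as it is seen
($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0] }
END {
    # whatever is left in ll1 was never matched; print it in the original order
    for (ll = 1; ll <= nl1; ll++)
        if (ll in ll1) print ll1[ll]
}
' file1 file2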
The above stores the entire contents of file1 in two arrays, one indexed by line number (ll1[]), one indexed by line content (ss1[]). Then, as file2 is read, each matching line is deleted from ll1[] and ss1[]. At the end, the remaining lines from file1 are output, preserving the original order. In this case, with the problem as stated, you can also divide and conquer using GNU split (filtering is a GNU extension), with repeated runs on chunks of file1, reading file2 completely each time:
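A sketch of that approach, assuming GNU split with its --filter extension and assuming the first gawk program above has been saved to a script file (hypothetically named linesnotin.awk):
split -l 20000 --filter='gawk -f linesnotin.awk - file2' < file1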
Note the use and placement of - (meaning stdin) on the gawk command line. This is provided by split from file1, in chunks of 20000 lines per invocation. For users on non-GNU systems, there is almost certainly a GNU coreutils package you can obtain, including on OSX as part of the Apple Xcode tools, which provides GNU diff and awk, though only a POSIX/BSD split rather than a GNU version.
Like konsolebox suggested, the poster's grep solution actually works great (fast) if you simply add the -F option, to treat the patterns as fixed strings instead of regular expressions. I verified this on a pair of ~1000-line file lists I had to compare. With -F it took 0.031 s (real), while without it, it took 2.278 s (real), when redirecting grep output to wc -l. These tests also included the -x switch, which is a necessary part of the solution in order to ensure total accuracy in cases where file2 contains lines that match part of, but not all of, one or more lines in file1. So a solution that does not require the inputs to be sorted, is fast, flexible (case sensitivity, etc.) and also (I think) works on any POSIX system is:
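A sketch of that combined command (the original grep from the question with -F and -x added):
grep -F -x -v -f file2 file1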
Using fgrep, or adding the -F option to grep, could help. But for faster calculations you could use Awk.
You could try one of these Awk methods:
http://www.linuxquestions.org/questions/programming-9/grep-for-huge-files-826030/#post4066219
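One common Awk method of that kind is a sketch like the following (not necessarily the exact script behind the link); reading file2 first builds a set of its lines, and the second pass prints only the file1 lines missing from that set:
awk 'NR == FNR { seen[$0] = 1; next } !($0 in seen)' file2 file1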
What's the speed of a sort and diff?