I have two files, file1.txt and file2.txt. file1.txt has about 14K lines and file2.txt has about 2 billion. file1.txt has a single field f1 per line, while file2.txt has 3 fields, f1 through f3, delimited by |.

I want to find all lines from file2.txt where f1 of file1.txt matches f2 of file2.txt (or anywhere on the line if we don't want to spend extra time splitting the values of file2.txt).
file1.txt (about 14K lines, not sorted):
foo1
foo2
...
bar1
bar2
...
file2.txt (about 2 billion lines, not sorted):
date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...
Output expected:
date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...
Here is what I have tried and it seems to be taking several hours to run:
fgrep -F -f file1.txt file2.txt > file.matched
I wonder if there is a better and faster way of doing this operation with the common Unix commands or with a small script.
Can you give join a try? The files must be sorted first, though...

Small update: by using LC_ALL=C in front of join, things really speed up, as can be seen in the benchmark of Håkon Hægland.

PS1: I have my doubts whether join can be faster than grep -f ...
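For reference, that approach looks roughly like this (a sketch with assumed options; file2.txt is joined on its second |-delimited field, and -o keeps the original column order):

# Sort both files on the join field, then join them (sketch, not the exact commands).
LC_ALL=C sort file1.txt > file1.sorted
LC_ALL=C sort -t '|' -k2,2 file2.txt > file2.sorted
LC_ALL=C join -t '|' -1 1 -2 2 -o 2.1,2.2,2.3 file1.sorted file2.sorted > file.matched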
A small piece of Perl code solved the problem. This is the approach taken: store the keys from file1.txt in a hash; read file2.txt line by line, parse and extract the second field; print the line if that field is found in the hash. Here is the code:
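The actual script is linked from the benchmark answer below; a minimal one-liner sketch of the same hash-lookup approach (assumed code, not the script itself) is:

perl -F'\|' -lane '
    if (@ARGV) { $words{$_} = 1; next }   # still reading file1.txt: remember each key
    print if exists $words{$F[1]};        # reading file2.txt: print lines whose 2nd field is a key
' file1.txt file2.txt > file.matched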
I ran the Perl script with 14K lines in file1.txt and 1.3M lines in file2.txt. It finished in about 13 seconds, producing 126K matches. Here is the time output for the same:

I ran @Inian's awk code as well. It was way slower than the Perl solution, since it loops over all 14K patterns for each line in file2.txt - which is really expensive. It aborted after processing 592K records of file2.txt, having produced 40K matched lines. Here is how long it took:
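For reference, a looping lookup of that kind has roughly this shape (a sketch, not necessarily @Inian's exact code):

awk 'FNR == NR { patterns[$0]; next }                              # first file: collect the 14K words
     { for (p in patterns) if (index($0, p)) { print; break } }    # second file: scan every line against every word
' file1.txt file2.txt > file.matched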
I also ran @Inian's other awk solution, which eliminates the looping issue. awk is very impressive here, given that we didn't have to write an entire program to do it.
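That non-looping variant is essentially a single hash lookup on the second field, roughly (again a sketch, not the exact code):

awk -F'|' 'FNR == NR { words[$0]; next }   # first file: store the words as array keys
           $2 in words                     # second file: print lines whose 2nd field is a stored key
' file1.txt file2.txt > file.matched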
I ran @oliv's Python code as well. It took about 15 hours to complete the job, and it looked like it produced the right results. Building a huge regex isn't as efficient as using a hash lookup. Here is the time output:

I tried to follow the suggestion to use parallel. However, it failed with an fgrep: memory exhausted error, even with very small block sizes.

What surprised me was that fgrep was totally unsuitable for this. I aborted it after 22 hours, by which point it had produced about 100K matches. I wish fgrep had an option to force the content of -f file to be kept in a hash, just like what the Perl code did.

I didn't check the join approach - I didn't want the additional overhead of sorting the files. Also, given fgrep's poor performance, I don't believe join would have done better than the Perl code.

Thanks everyone for your attention and responses.
Here is a Perl solution that uses Inline::C to speed up the search for matching fields in the large file:

The search() subroutine is implemented in pure C using perlapi to look up keys in the small-file dictionary %words:

search.c:

Tests indicate that it is approximately 3 times faster than the fastest pure Perl solution presented here (see method zdim2 in my other answer).

IMHO, grep is a good tool, highly optimised for a huge file2.txt, but maybe not for so many patterns to search for. I suggest combining all the strings of file1.txt into a single huge regexp like:

\|bar1|bar2|foo1|foo2\|
And of course LANG=C may help. Please give feedback or send your files so I can test myself.
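Such a regexp file could be built from file1.txt and used roughly like this (assumed commands; a sketch, not necessarily the author's script):

# Join all file1.txt words into one alternation, wrapped in \| ... \| as suggested above.
printf '\\|%s\\|\n' "$(paste -sd '|' file1.txt)" > regexp1.txt
# Search file2.txt with the single big pattern; LANG=C avoids locale overhead.
LANG=C grep -E -f regexp1.txt file2.txt > file.matched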
Here is a Python solution using sets -- roughly equivalent to a Perl key only hash or awk array in concept.
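The full script is linked from the benchmark below; a minimal sketch of the set-based idea (assumed code, not the author's exact script):

python3 - <<'EOF' > file.matched
# Build a set of the file1.txt words, then stream file2.txt and print lines
# whose second |-delimited field is in the set.
with open('file1.txt') as f:
    words = set(line.rstrip('\n') for line in f)
with open('file2.txt') as f:
    for line in f:
        fields = line.rstrip('\n').split('|')
        if len(fields) > 1 and fields[1] in words:
            print(line, end='')
EOF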
When I run the Python solution on files of similar size, it runs in about 8 seconds, the same speed as the corresponding awk solution. Both the Python and awk solutions here do a full string match only, not a partial regex-style match. Since the awk solution is fast and POSIX compliant, it is the better answer.
I have tried to do a comparison between some of the methods presented here.
First I created a Perl script to generate the input files file1.txt and file2.txt. In order to compare some of the solutions, I made sure that the words from file1.txt could only appear in the second field of file2.txt. Also, to be able to use the join solution presented by @GeorgeVasiliou, I sorted file1.txt and file2.txt. Currently I generated the input files based on only 75 random words (taken from https://www.randomlists.com/random-words). Only 5 of these 75 words were used in file1.txt; the remaining 70 words were used to fill up the fields in file2.txt. It might be necessary to increase the number of words substantially to get realistic results (according to the OP, the original file1.txt contained 14000 words). In the tests below I used a file2.txt with 1000000 (1 million) lines. The script also generates the file regexp1.txt required by the grep solution of @BOC.

gen_input_files.pl:
Next, I created a subfolder solutions with all the test cases:

Here, the out.txt files are the output from the greps for each solution, and the run.sh scripts run the solution for the given test case.

Notes on the different solutions
BOC1: First solution presented by @BOC
BOC2: Second solution suggested by @BOC:
codeforester: Accepted Perl solution by @codeforester (see source)
codeforester_orig: Original solution presented by @codeforester:
dawg: Python solution using dictionary and split line proposed by @dawg (see source)
gregory1: Solution using Gnu Parallel suggested by @gregory. See the note below regarding how to choose $block_size.
hakon1: Perl solution provided by @HåkonHægland (see source). This solution requires compilation of the C extension the first time the code is run. It does not require recompilation when file1.txt or file2.txt changes. Note: The time used to compile the C extension at the initial run is not included in the run times presented below.
ikegami: Solution using an assembled regexp and grep -P as given by @ikegami. Note: The assembled regexp was written to a separate file regexp_ikegami.txt, so the runtime of generating the regexp is not included in the comparison below. This is the code used:
inian1: First solution by @Inian using match()
inian2: Second solution by @Inian using index()
inian3: Third solution by @Inian checking only the $2 field:
inian4: 4th solution by @Inian (basically the same as codeforester_orig with LC_ALL):
inian5: 5th solution by @Inian (same as inian1 but with LC_ALL):
inian6: Same as inian3 but with LC_ALL=C. Thanks to @GeorgeVasiliou for the suggestion.
jjoao: Compiled flex-generated C code as proposed by @JJoao (see source). Note: Recompilation of the executable must be done each time file1.txt changes. The time used to compile the executable is not included in the run times presented below.
oliv: Python script provided by @oliv (see source)
Vasiliou: Using join as suggested by @GeorgeVasiliou:
Vasiliou2: Same as Vasiliou but with LC_ALL=C.
zdim: Using the Perl script provided by @zdim (see source). Note: This uses the regexp search version (instead of the split-line solution).
zdim2: Same as zdim except that it uses the split function instead of a regexp search for the field in file2.txt.

Notes
I experimented a little bit with Gnu parallel (see the gregory1 solution above) to determine the optimal block size for my CPU. I have 4 cores, and currently it seems that the optimal choice is to divide the file (file2.txt) into 4 equal-sized chunks and run a single job on each of the 4 processors. More testing might be needed here. So for the first test case, where file2.txt is 20M, I set $block_size to 5M (see the gregory1 solution above), whereas for the more realistic case presented below, where file2.txt is 268M, a $block_size of 67M was used.
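Such an invocation looks roughly like this (a sketch assuming --pipepart is used to split file2.txt; it is not necessarily @gregory's exact command):

# Split file2.txt into $block_size chunks and run fgrep on each chunk in parallel.
block_size=67M
parallel --pipepart -a file2.txt --block "$block_size" \
    LC_ALL=C fgrep -F -f file1.txt > file.matched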
The solutions BOC1, BOC2, codeforester_orig, inian1, inian4, inian5, and gregory1 all used loose matching, meaning that the words from file1.txt did not have to match exactly in field #2 of file2.txt; a match anywhere on the line was accepted. Since this behavior made it more difficult to compare them with the other methods, some modified methods were also introduced. The first two methods, called BOC1B and BOC2B, used a modified regexp1.txt file. The lines in the original regexp1.txt were of the form \|foo1|foo2|...|fooN\| which would match the words at any field boundary. The modified file, regexp1b.txt, anchored the match to field #2 exclusively by using the form ^[^|]*\|foo1|foo2|...|fooN\| instead.

The rest of the modified methods, codeforester_origB, inian1B, inian4B, inian5B, and gregory1B, used a modified file1.txt. Instead of a literal word per line, the modified file file1b.txt used one regex per line (anchored to field #2 in the same way), and in addition, fgrep -f was replaced by grep -E -f for these methods.

Running the tests
Here is the script used for running all the tests. It uses the Bash time command to record the time spent on each script. Note that the time command returns three different times, called real, user, and sys. At first I used user + sys, but I realized that this was incorrect when using the Gnu parallel command, so the time reported below is now the real part returned by time. See this question for more information about the different times returned by time.
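In bash terms, capturing the real time of a single case amounts to something like this (an illustration only, not the run_all.pl driver itself):

# The 'real' field is the wall-clock time reported in the results below.
{ time ./run.sh > out.txt; } 2> time.out
grep '^real' time.out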
The first test is run with file1.txt containing 5 lines and file2.txt containing 1000000 lines. Here are the first 52 lines of the run_all.pl script; the rest of the script is available here.

run_all.pl
Results
Here is the output from running the tests:
Summary
[Results obtained by @Vasiliou are shown in the middle column.]
A more realistic test case
I then created a more realistic case with file1.txt having 100 words and file2.txt having 10 million lines (268Mb file size). I extracted 1000 random words from the dictionary at /usr/share/dict/american-english using shuf -n1000 /usr/share/dict/american-english > words.txt, then extracted 100 of these words into file1.txt, and then constructed file2.txt the same way as described above for the first test case. Note that the dictionary file was UTF-8 encoded, and I stripped away all non-ASCII characters from words.txt.
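A sketch of those preparation steps in shell (only the shuf command above is quoted from the text; the rest is an assumed equivalent):

shuf -n1000 /usr/share/dict/american-english > words.txt
LC_ALL=C grep -v '[^ -~]' words.txt > words_ascii.txt   # keep only pure-ASCII words (one way to drop the non-ASCII entries)
head -n100 words_ascii.txt > file1.txt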
Then I ran the test without the three slowest methods from the previous case, i.e. inian1, inian2, and inian5 were left out. Here are the new results:

Note
The grep-based solutions were looking for a match on the whole line, so in this case they contained some false matches: the methods codeforester_orig, BOC1, BOC2, gregory1, inian4, and oliv extracted 1,087,609 lines out of 10,000,000 lines, whereas the other methods extracted the correct 997,993 lines from file2.txt.

Notes
I tested this on my Ubuntu 16.10 laptop (Intel Core i7-7500U CPU @ 2.70GHz)
The whole benchmark study is available here.