I have two files, file1.txt and file2.txt. file1.txt has about 14K lines and file2.txt has about 2 billion. file1.txt has a single field f1 per line, while file2.txt has 3 fields, f1 through f3, delimited by |.
I want to find all lines from file2.txt where f1 of file1.txt matches f2 of file2.txt (or anywhere on the line, if we don't want to spend extra time splitting the values of file2.txt).
file1.txt (about 14K lines, not sorted):
```
foo1
foo2
...
bar1
bar2
...
```
file2.txt (about 2 billion lines, not sorted):
```
date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...
```
Output expected:
```
date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...
```
Here is what I have tried, and it seems to take several hours to run:

```
fgrep -F -f file1.txt file2.txt > file.matched
```
I wonder if there is a better and faster way of doing this operation using common Unix commands or a small script.
Did you try Awk? That could speed things up a bit:
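Something along these lines, assuming the usual hash-and-scan pattern (load the file1.txt keys into an array, then run match() against every line of file2.txt; the exact one-liner is a sketch):

```
awk 'FNR == NR { hash[$1]; next }    # first file: remember every key
     { for (i in hash)               # second file: try each key in turn
         if (match($0, i)) { print; break } }' file1.txt file2.txt
```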
(or) using the index() function in Awk, as suggested by comments from Benjamin W. below, (or) a more direct regex match, as suggested by Ed Morton in comments, is all you need. Both variants are sketched below:
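Again assuming the same skeleton (index() does a plain substring search with no regex engine involved; ~ applies each key as a regex):

```
awk 'FNR == NR { hash[$1]; next }
     { for (i in hash) if (index($0, i)) { print; break } }' file1.txt file2.txt

awk 'FNR == NR { hash[$1]; next }
     { for (i in hash) if ($0 ~ i) { print; break } }' file1.txt file2.txt
```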
I'm guessing this will be faster, but I'm not exactly sure on files with a million+ entries. The problem here is the possibility of a match anywhere along the line. Had the same been confined to one particular column (e.g. $2 alone), a faster approach could be:
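A sketch (with the field separator set to |, a plain hash lookup on $2 replaces the per-key loop entirely):

```
awk -F'|' 'FNR == NR { hash[$1]; next } $2 in hash' file1.txt file2.txt
```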
Also, you could speed things up by playing with the locale set on your system. Paraphrasing from this wonderful answer by Stéphane Chazelas on the subject, you could speed things up pretty quickly by passing the locale LC_ALL=C to the command being run locally. On any GNU based system, the locale defaults to an internationalized setting (run locale to see the current values).
With one variable, LC_ALL, you can set all LC_ type variables at once to a specified locale. Simply put, when using the locale C, everything defaults to the server's base Unix/Linux language, ASCII. Basically, when you grep something, by default your locale is internationalized and set to UTF-8, which can represent every character in the Unicode character set so as to display any of the world's writing systems, currently more than 110,000 unique characters, whereas with ASCII each character is encoded in a single byte and its character set comprises no more than 128 unique characters. So it translates to this: when using grep on a file encoded in the UTF-8 character set, it needs to match each character against any of the hundred thousand unique characters, but just 128 in ASCII. So use your fgrep as:
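That is, the question's own command prefixed with the C locale:

```
LC_ALL=C fgrep -F -f file1.txt file2.txt > file.matched
```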
Also, the same can be adapted to Awk: since it uses a regex match with the match($0,i) call, setting the C locale could speed up the string match there as well.
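For example, reusing the match() sketch from above:

```
LC_ALL=C awk 'FNR == NR { hash[$1]; next }
     { for (i in hash) if (match($0, i)) { print; break } }' file1.txt file2.txt
```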
Using flex:
1: build the flex processor:
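For example, a small awk script can generate the scanner source from file1.txt (a sketch; it assumes the keys contain no regex metacharacters, and the generated rule echoes lines whose second |-delimited field is one of the keys while a catch-all rule discards everything else):

```
awk 'NR == 1 { printf "%%%%\n[^|\\n]*\\|(%s", $0 }    # open the rule with the first key
     NR > 1  { printf "|%s", $0 }                      # alternate in the remaining keys
     END     { print ")\\|.*\\n ECHO;\n.*\\n ;\n%%" }  # echo matches, swallow the rest
    ' file1.txt > a.fl
```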
2: compile it
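For example:

```
flex -Ca -F a.fl      # table options that trade memory for scanning speed
cc -O lex.yy.c -lfl   # libfl supplies main(); the binary lands in ./a.out
```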
3: and run
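For example, writing the matches to file.matched as in the question:

```
./a.out < file2.txt > file.matched
```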
Compiling (cc ...) is a slow process; this approach will pay off only for cases of a stable file1.txt.
(On my machine) a "100 in 10_000_000" search test ran 3 times faster with this approach than with LC_ALL=C fgrep...
Assumptions:
1. You want to run this search on just your local workstation.
2. You have multiple cores/CPUs to take advantage of a parallel search.
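A sketch of the invocation with GNU parallel, whose --pipepart option splits file2.txt into blocks and feeds each block to its own fgrep:

```
parallel --pipepart -a file2.txt --block 10M fgrep -F -f file1.txt
```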
Some further tweaks depending on the context:
A. Disable NLS with LANG=C (this is mentioned already in another answer).
B. Set a max number of matches with the -m flag.
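Combined, that might look like this (the -m value of 100 is an illustrative guess, and it caps matches per block rather than globally):

```
LANG=C parallel --pipepart -a file2.txt --block 10M fgrep -F -m 100 -f file1.txt
```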
Note: I'm guessing that file2 is ~4GB and the 10M block size is ok, but you may need to optimize the block size to get the fastest run.
You can also use Perl for this:
Please note that this will hog memory, so your machine/server had better have some.
Script output: the script will produce its final output in a file named output_comp.

Script:
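A minimal sketch of the idea, assuming the filenames from the question: slurp the ~14K keys of file1.txt into a hash, stream file2.txt line by line, and write lines whose second field is a known key to output_comp:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Load the ~14K keys; the hash is what eats the memory.
open my $kf, '<', 'file1.txt' or die "file1.txt: $!";
my %wanted;
while (my $k = <$kf>) { chomp $k; $wanted{$k} = 1 }
close $kf;

open my $in,  '<', 'file2.txt'   or die "file2.txt: $!";
open my $out, '>', 'output_comp' or die "output_comp: $!";
while (my $line = <$in>) {
    # Split out only the first two fields; f2 is the one we test.
    my (undef, $f2) = split /\|/, $line, 3;
    print {$out} $line if defined $f2 and $wanted{$f2};
}
close $in;
close $out;
```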
Thanks.