Fastest way to find lines of a file from another larger file

Posted 2019-01-03 02:43

I have two files, file1.txt and file2.txt. file1.txt has about 14K lines and file2.txt has about 2 billion lines. file1.txt has a single field f1 per line, while file2.txt has 3 fields, f1 through f3, delimited by |.

I want to find all lines from file2.txt where f1 of file1.txt matches f2 of file2.txt (or anywhere on the line if we don't want to spend extra time splitting the values of file2.txt).

file1.txt (about 14K lines, not sorted):

foo1
foo2
...
bar1
bar2
...

file2.txt (about 2 billion lines, not sorted):

date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...

Output expected:

date1|foo1|number1
date2|foo2|number2
...
date1|bar1|number1
date2|bar2|number2
...

Here is what I have tried and it seems to be taking several hours to run:

fgrep -F -f file1.txt file2.txt > file.matched

I wonder if there is a better and faster way of doing this operation with common Unix commands or with a small script.

16 Answers
贪生不怕死 · 2019-01-03 03:44

Did you try awk? It could speed things up a bit:

awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt

(or) using the index() function in awk, as suggested in the comments by Benjamin W. below:

awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (index($0,i)) {print; break}}' file1.txt FS='|' file2.txt

(or) a more direct regex match, as suggested by Ed Morton in the comments:

awk 'FNR==NR{hash[$1]; next}{for (i in hash) if ($0~i) {print; break}}' file1.txt FS='|' file2.txt

is all you need. I'm guessing this will be faster, but I'm not exactly sure on files with a million+ entries. The problem here is that the match could be anywhere along the line. Had the match been confined to a particular column (e.g. $2 alone), a faster approach could be

awk 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt
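
For example, on the sample files above (treating the ... lines as placeholders), that exact-field lookup would print something like:

$ awk 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt
date1|foo1|number1
date2|foo2|number2
date1|bar1|number1
date2|bar2|number2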

Also, you could speed things up by playing with the locale set on your system. Paraphrasing this wonderful answer by Stéphane Chazelas on the subject, you can often speed things up considerably by passing the locale LC_ALL=C to the command being run locally.

On any GNU-based system, the defaults for the locale are:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

With the single variable LC_ALL, you can set all the LC_ type variables at once to a specified locale:

$ LC_ALL=C locale
LANG=en_US.UTF-8
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"       
LC_ALL=C

So what does this impact?

Simply put, when using the locale C, the tools default to the server's base Unix/Linux language, ASCII. By default your locale is internationalized and set to UTF-8, which can represent every character in the Unicode character set in order to display any of the world's writing systems, currently more than 110,000 unique characters, whereas ASCII encodes each character as a single byte and its character set comprises no more than 128 unique characters.

So it translates to this: when you grep a file encoded in the UTF-8 character set, each character potentially has to be matched against any of a hundred thousand unique characters, but only against 128 in ASCII. So use your fgrep as

LC_ALL=C fgrep -F -f file1.txt file2.txt
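
A quick way to compare on your own data is to time the default and the C locale runs side by side (no timings are shown here; they depend entirely on your files and hardware):

$ time fgrep -F -f file1.txt file2.txt > /dev/null
$ time LC_ALL=C fgrep -F -f file1.txt file2.txt > /dev/null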

The same can be adapted for the awk approach: since it uses a regex match with the match($0,i) call, setting the C locale could speed up the string match there as well.

LC_ALL=C awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt
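
If the value to match is known to be confined to $2, as in the sample data, the locale trick and the exact-field hash lookup can be combined; this is only a sketch merging the two commands already shown above:

LC_ALL=C awk 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt
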
等我变得足够好 · 2019-01-03 03:46

Using flex:

1: build the flex processor:

$ awk 'NR==1{ printf "%%%%\n\n.*\\|(%s",$0 } 
            { printf "|%s",$0 } 
       END  { print ")\\|.*\\n ECHO;\n.*\\n ;\n%%\n" }' file1.txt > a.fl
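
To make the generated scanner concrete: for the sample file1.txt above (treating the ... lines as placeholders), the a.fl produced by that awk step would look roughly like this (the first value appears twice because line 1 is printed by both awk blocks, which is harmless in an alternation):

%%

.*\|(foo1|foo1|foo2|bar1|bar2)\|.*\n ECHO;
.*\n ;
%%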

2: compile it

$ flex -Ca -F a.fl ; cc -O lex.yy.c -lfl

3: and run

$ ./a.out < file2.txt > out

Compiling (cc ...) is a slow process; this approach will only pay off for cases where file1.txt is stable.

(On my machine) the time taken to run a "100 in 10_000_000" search test with this approach was 3 times faster than LC_ALL=C fgrep...

Viruses. · 2019-01-03 03:48

Assumptions:

1. You want to run this search on just your local workstation.
2. You have multiple cores/CPUs to take advantage of a parallel search.

parallel --pipepart -a file2.txt --block 10M fgrep -F -f file1.txt

Some further tweaks, depending on the context:

A. Disable NLS with LANG=C (this is already mentioned in another answer).
B. Set a maximum number of matches with fgrep's -m flag.

A combined sketch is shown below.
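
For example (only a sketch; the block size and the output file name file.matched are taken from the question, and -m is left out because the question wants every match; add -m N to the fgrep part if you only need the first N matches per block):

LANG=C parallel --pipepart -a file2.txt --block 10M fgrep -F -f file1.txt > file.matched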

Note: I'm guessing that file2 is ~4GB and the 10M block size is ok, but you may need to optimize the block size to get the fastest run.

做个烂人 · 2019-01-03 03:48

You can also use Perl for this:

Please note that this will hog memory, so your machine/server had better have some.

Sample Data:

%_STATION@gaurav * /root/ga/pl> head file1.txt file2.txt
==> file1.txt <==
foo1
foo2
...
bar1
bar2
...

==> file2.txt <==
date1|foo1|number1
date2|foo2|number2
date3|foo3|number3
...
date1|bar1|number1
date2|bar2|number2
date3|bar3|number3
%_STATION@gaurav * /root/ga/study/pl>

Script output: the script will produce its final output in a file named output_comp.

%_STATION@gaurav * /root/ga/pl> ./comp.pl  file1.txt file2.txt ; cat output_comp
date1|bar1|number1
date2|bar2|number2
date2|foo2|number2
date1|foo1|number1
%_STATION@gaurav * /root/ga/pl>

Script:

%_STATION@gaurav * /root/ga/pl> cat comp.pl
#!/usr/bin/perl

use strict ;
use warnings ;
use Data::Dumper ;

my ($file1,$file2) = @ARGV ;
my $output = "output_comp" ;
my %hash ;    # This will store main comparison data.
my %tmp ;     # This will store already selected results, to be skipped.
die "Need 2 files!\n" if @ARGV != 2 ;

# Read all files at once and use their name as the key.
for (@ARGV) {
  open FH, "<$_" or die "Cannot open $_\n" ;
  while  (my $line = <FH>) {chomp $line ;$hash{$_}{$line} = "$line"}
  close FH ;
}

# Now we churn through the data and compare to generate
# the sorted output in the output file.
open FH, ">>$output" or die "Cannot open outfile!\n" ;
foreach my $k1 (keys %{$hash{$file1}}){
  foreach my $k2 (keys %{$hash{$file2}}){
    # Does this file2 line ($k2) contain the file1 value ($k1) as a literal string?
    if ($k2 =~ m/\Q$k1\E/) {
      if (!defined $tmp{"$hash{$file2}{$k2}"}) {
        print FH "$hash{$file2}{$k2}\n" ;
        $tmp{"$hash{$file2}{$k2}"} = 1 ;
      }
    }
  }
}
close FH  ;
%_STATION@gaurav * /root/ga/pl>

Thanks.
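
If memory does become a problem, a minimal sketch of a lower-memory variation (not part of the script above) is to keep only file1.txt in a hash and stream file2.txt line by line; assuming the match should be against the second |-delimited field, as in the question, a one-liner would be roughly:

perl -F'\|' -lane 'if (@ARGV) { $want{$F[0]} = 1; next } print if exists $want{$F[1]}' file1.txt file2.txt > output_comp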
