Question:
I know of efficient ways to look for one string in a file (KMP), or for several strings in a file (a trie).
But for years now I've been wondering whether there is a way (and occasionally thinking it impossible) to search multiple files for multiple strings.
Say I have a million files, and I want to answer queries like "find the files that contain the strings 'banana', 'motorboat' and 'the white fox'". What would be an efficient algorithm? Is there one?
Of course, such a search can be done in time linear in the size of the files to be searched. But that seems infeasible for a large number of large files.
The existence of Google seems to indicate that there actually is a very fast algorithm for this. Maybe even one where the cost of each query depends only on the size of the query, and not on the size of the text database (of course, such an algorithm would involve some pre-processing of the input files).
I think there must be such an algorithm (Google does it!), but my searches turned up nothing.
Answer 1:
Parallel Programming
At large scale this is definitely a task for parallel programming: distribute the files to different computing units, let them search, then gather the results. This is essentially what Google does; for example, they once tackled a translation problem by combining a thousand commodity PCs. (Although they may use other hardware for the actual Google search results.) You can read popular articles about this on the internet.
"MapReduce" as one concept
Google, for example, invented a paradigm called MapReduce, which they described in a whitepaper. It basically boils down to mapping input to output (widely distributed) in a first step, then reducing all the partial results into one main result in a second step.
One could implement the search like this (a minimal sketch follows below):
- map: Distribute the documents together with the keyword to search for. If the search word is found in the current file, return the filename from the computing node; otherwise return nothing.
- reduce: Gather all filenames in a list from all nodes.
(This is practically the same as the "distributed grep" problem they presented in their paper.)
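For illustration, here is a minimal single-machine sketch of such a map/reduce pair in Python; the function names and the extension to several keywords are my own assumptions, and a real framework would distribute the map calls across the nodes holding the files:

    def map_search(filename, keywords):
        # map step: emit the filename if every keyword occurs in this file,
        # otherwise emit nothing (an empty list).
        with open(filename, encoding="utf-8", errors="ignore") as fh:
            text = fh.read()
        return [filename] if all(k in text for k in keywords) else []

    def reduce_search(partial_results):
        # reduce step: gather all emitted filenames from all nodes into one list.
        return [name for partial in partial_results for name in partial]

    # Hypothetical usage:
    # keywords = ["banana", "motorboat", "the white fox"]
    # partials = [map_search(f, keywords) for f in ["a.txt", "b.txt"]]
    # matches = reduce_search(partials)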
The problem of finding out whether a given string exists in a given text is well studied under the name "string matching"; see for example the Rabin-Karp algorithm or the Knuth-Morris-Pratt algorithm (just to give you something to start with). So the implementation of map is fairly easy.
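For illustration, a rough Rabin-Karp sketch in Python (the base and modulus are arbitrary choices for the example; in practice you would simply use the language's built-in substring search, which is already efficient):

    def rabin_karp(text, pattern, base=256, mod=1_000_003):
        # Returns the index of the first occurrence of pattern in text, or -1.
        n, m = len(text), len(pattern)
        if m == 0 or m > n:                    # empty patterns not handled here
            return -1
        high = pow(base, m - 1, mod)           # weight of the leading character
        p_hash = t_hash = 0
        for i in range(m):
            p_hash = (p_hash * base + ord(pattern[i])) % mod
            t_hash = (t_hash * base + ord(text[i])) % mod
        for i in range(n - m + 1):
            # Only compare the strings when the rolling hashes agree.
            if p_hash == t_hash and text[i:i + m] == pattern:
                return i
            if i < n - m:                      # roll the hash one character forward
                t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
        return -1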
For the distribution of files one can use a lot of different techniques. If one wants a proper view of what is possible with distributed filesystems, one could read up on the Google File System (GFS), e.g. in the corresponding whitepaper.
reduce pretty much does nothing, so this is really easy.
Finished.
That is the greatest advantage of the MapReduce paradigm: once you understand how map and reduce combine into one result, the two functions are fairly easy to implement. If a MapReduce framework has already been implemented, you do not have to worry at all about the parallelism of the computation -- which can otherwise cause severe headaches.
Other concepts
This is definitely not the only possible concept.
- It is possible to vary the hardware you use (independent PCs, as MapReduce assumes, or something more like a supercomputer with dozens of CPUs).
- It is possible to vary the distributed (or non-distributed) filesystem you use.
- It is possible to vary the programming language, which can also make a huge difference.
If you are interested in this field of study you will find lots of other possibilities, and I am sure many more will come up in the near future, as distributed systems are on the rise more than ever. I hope I could provide some insight into what is possible, what to watch out for, and even a direction for how one could implement this right away.
Answer 2:
(The question is worded fairly broadly. Any efficient solution highly depends on the specific assumptions made. For the sake of discussion, I will make some assumptions you didn't explicitly mention.)
Model
Let's say...
- There are f files,
- w words in total in those files,
- d unique words (d is the minimal size of a dictionary required to cover all files),
- q words in the query, and
- r files are in the result set of the query.
I'll assume that q << d << f << w (i.e. each variable is 'orders of magnitude smaller' than its successor), and further that q is essentially constant, i.e. O(1). I'll also assume that you're primarily interested in minimizing the amortized computation time measured in O(f) or O(w), that you're willing to invest more memory for less computation time, and that you expect to get queries fairly often.
Note that the runtime of the algorithm cannot be better than O(r), since we need to output each file belonging to the result set.
Algorithm
Create an index based on a hashmap from words to sets of files as follows:
from collections import defaultdict

index = defaultdict(set)        # word -> set of files containing that word
for file in files:
    for word in file:           # assumes iterating a file yields its words
        index[word].add(file)
This code runs in O(w), which is minimal (since you need to look at the complete input at least once). To find all files that contain all of the words in query, run:
# Start from the query word with the fewest matching files, then intersect.
wordWithLeastFilesMatching = min(query, key=lambda word: len(index[word]))
result = set(index[wordWithLeastFilesMatching])
for word in query:
    result = result.intersection(index[word])
return result
This code's runtime is essentially determined by the q set intersections it needs to perform. In a typical case, each of the sets is probably O(log(f)) large and the overlap of the individual sets is moderate. In this case, the computation takes O(log(f)).
In the worst case however, each of the sets is O(f) large, even though the overlap (and therefore r) is small. In this case, the computation would still take O(f).
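To make the pseudocode above concrete, here is one possible end-to-end version; the whitespace tokenization and the file handling are simplifying assumptions of mine:

    from collections import defaultdict

    def build_index(filenames):
        # word -> set of filenames containing that word; built in O(w).
        index = defaultdict(set)
        for name in filenames:
            with open(name, encoding="utf-8", errors="ignore") as fh:
                for line in fh:
                    for word in line.split():
                        index[word].add(name)
        return index

    def query_index(index, query_words):
        # Start from the rarest word's file set, then intersect with the rest.
        if not query_words:
            return set()
        ordered = sorted(query_words, key=lambda w: len(index.get(w, ())))
        result = set(index.get(ordered[0], set()))
        for word in ordered[1:]:
            result &= index.get(word, set())
        return result

    # e.g. query_index(build_index(["a.txt", "b.txt"]), ["banana", "motorboat"])

Note that a multi-word phrase such as "the white fox" would be looked up word by word here; confirming that the words are actually adjacent would need an extra verification pass over the candidate files.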
Answer 3:
As no-one else has answered, I will start the ball rolling with my simplistic ideas and hopefully someone clever will help further.
Ok, first off, this is readily parallelisable simply by splitting the 1,000,000 files across a number of servers: if, say, you had 4 servers, the first 250,000 files could be searched independently of the remaining files.
Then each server could run something like this, assuming your documents end in ".txt":
#!/bin/bash
find . -name "*.txt" | while IFS= read a
do
   grep -l banana "$a" | while IFS= read b
   do
      grep -l motorboat "$b" | while IFS= read c
      do
         grep -l "the white fox" "$c"
      done
   done
done
Performance could be improved by searching for rarer words before commonly occurring words.
Also, you could use awk and pass in all 3 search patterns and quit as soon as they had all been found rather than continuing processing till end of file.
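For illustration, here is that early-exit idea as a small Python sketch (rather than awk); the point is just to stop reading a file once every pattern has been seen:

    def file_matches(filename, patterns):
        # Stop scanning as soon as every pattern has been found in the file.
        remaining = set(patterns)
        with open(filename, encoding="utf-8", errors="ignore") as fh:
            for line in fh:
                remaining = {p for p in remaining if p not in line}
                if not remaining:
                    return True
        return False

    # e.g. file_matches("a.txt", ["banana", "motorboat", "the white fox"])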
Of course, if you are going to be doing multiple, repeated queries, it becomes worth spending more time up front to load the files into an efficient structure, such as a hash. So, if your input files contained the word "motorboat" then there would be an entry in your hash for that and it would be very quick to test if the file contained that word merely by testing for presence within a hash. This could then prune the files that needed to go into a method like the one outlined above and massively improve the performance.
So, the following code will parse all ".txt" files and note, for each word, which files it occurs in. Then, when a search needs to be made, you can simply pass the search terms in, find the files that contain the words (not necessarily adjacent to each other), and pass that list of files to the script above:
#!/usr/bin/perl
use strict;
use warnings;

my %words;

# Load all files ending in ".txt"
my @files = <*.txt>;
foreach my $file (@files) {
    print "Loading: $file\n";
    open my $fh, '<', $file or die "Could not open $file";
    while (my $line = <$fh>) {
        chomp $line;
        foreach my $str (split /\s+/, $line) {
            $words{$str}{$file} = 1;
        }
    }
    close($fh);
}

foreach my $str1 (keys %words) {
    print "Word: \"$str1\" is in : ";
    foreach my $str2 (keys %{ $words{$str1} }) {
        print "$str2 ";
    }
    print "\n";
}
The output for the little test files I have created is as follows:
./go
Loading: a.txt
Loading: b.txt
Loading: c.txt
Loading: d.txt
Word: "the" is in : c.txt d.txt
Word: "motorboat" is in : b.txt d.txt
Word: "white" is in : c.txt d.txt
Word: "banana" is in : c.txt d.txt a.txt
Word: "fox" is in : c.txt d.txt
Answer 4:
You could, for example, periodically serialize each file into a trie, then deserialize each trie as needed by the search and run queries against all the tries.
It would be very fast, but would of course require you to have a process constantly updating the tries of the files. I am pretty sure Google also keeps indexing its data somehow, and you have to make some trade-offs: increased performance at the cost of memory in this case.
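A minimal sketch of that idea, assuming one trie per file built from its words and pickled to disk (all names here are illustrative, not a real indexing API):

    import pickle

    def build_trie(words):
        # Nested-dict trie; the "$" key marks the end of a stored word.
        root = {}
        for word in words:
            node = root
            for ch in word:
                node = node.setdefault(ch, {})
            node["$"] = True
        return root

    def trie_contains(root, word):
        node = root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

    def save_trie(root, path):
        with open(path, "wb") as fh:
            pickle.dump(root, fh)

    def load_trie(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)

    # e.g. save_trie(build_trie(open("a.txt").read().split()), "a.trie")
    #      trie_contains(load_trie("a.trie"), "banana")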
Answer 5:
Break the text in each file into a set of lexemes and capture the text matching each lexeme. Build a reverse index from each lexeme to the set of matching files. For each search term, convert it to its lexeme and return the matching captured text from each file.
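Read literally, that could look roughly like this (lower-cased whitespace tokens standing in for lexemes; purely a sketch of my own):

    from collections import defaultdict

    def build_lexeme_index(filenames):
        # lexeme -> list of (file, captured original token) pairs.
        index = defaultdict(list)
        for name in filenames:
            with open(name, encoding="utf-8", errors="ignore") as fh:
                for token in fh.read().split():
                    index[token.lower()].append((name, token))
        return index

    def lookup(index, term):
        # Convert the search term to its lexeme and return the captured matches.
        return index.get(term.lower(), [])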