I know of efficient ways to look for one string in a file (KMP), or for multiple strings in a file (a trie).
But, for years now, I've been wondering if there is a way (and occasionally thinking it impossible) to search multiple files for multiple strings.
Say I have a million files, and I want to answer queries like "find the files that contain the strings 'banana', 'motorboat' and 'the white fox'". What would be an efficient algorithm? Is there one?
Of course, it is possible to do such a search in time linear in the total size of the files to be searched. But that seems quite infeasible for a large number of big files. The existence of Google seems to indicate that there actually is a very fast algorithm for this, maybe even one where each query depends only on the size of the query and not on the size of the text database (of course, such an algorithm would involve some pre-processing of the input files).
I think there must be such an algorithm (Google does it!), but my searches found nothing.
Break the text in each file into a set of lexemes and capture the text matching each lexeme. Reverse-index each lexeme to the set of matching files. For each search term, convert it to a lexeme and return each matching captured text in each file.
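A rough Python sketch of that approach, treating lowercased, punctuation-stripped tokens as the lexemes (just one possible choice of lexer), could look like this:

```
from collections import defaultdict

def to_lexeme(token):
    # Placeholder normalization: lowercase and strip surrounding punctuation.
    return token.lower().strip(".,;:!?\"'")

def index_files(files):
    """Reverse index: lexeme -> {file: [captured original tokens]}."""
    index = defaultdict(lambda: defaultdict(list))
    for path in files:
        with open(path, errors="ignore") as fh:
            for token in fh.read().split():
                index[to_lexeme(token)][path].append(token)
    return index

def search(index, term):
    """For one single-word search term, return the captured text per matching file."""
    return {path: tokens for path, tokens in index.get(to_lexeme(term), {}).items()}
```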
As no-one else has answered, I will start the ball rolling with my simplistic ideas and hopefully someone clever will help further.
Ok, first off, this is readily parallelisable simply by splitting the 1,000,000 files across a number of servers: if, say, you had 4 servers, the first 250,000 files could be searched independently of the remaining files.
Then each server could run something like this, assuming your documents end in ".txt":
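(The idea sketched in Python rather than as a shell command; the search terms and the current-directory layout are just examples.)

```
from pathlib import Path

TERMS = ["banana", "motorboat", "the white fox"]   # example query

def files_containing_all(root=".", terms=TERMS):
    """Return every .txt file under `root` whose text contains all the terms."""
    matches = []
    for path in Path(root).rglob("*.txt"):
        text = path.read_text(errors="ignore")
        if all(term in text for term in terms):
            matches.append(path)
    return matches

if __name__ == "__main__":
    for path in files_containing_all():
        print(path)
```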
Performance could be improved by searching for rarer words before commonly occurring words.
Also, you could use awk and pass in all 3 search patterns and quit as soon as they had all been found rather than continuing processing till end of file.
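In that spirit, a small Python sketch of an early-exit scan of a single file (assuming, as grep or awk would, that no search phrase spans a line break) could be:

```
def contains_all(path, terms):
    """Stream the file line by line and stop as soon as every term has been seen."""
    remaining = set(terms)            # terms not yet seen
    with open(path, errors="ignore") as fh:
        for line in fh:
            remaining = {t for t in remaining if t not in line}
            if not remaining:         # all terms found: quit early
                return True
    return False
```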
Of course, if you are going to be doing multiple, repeated queries, it becomes worth spending more time up front to load the files into an efficient structure, such as a hash. So, if your input files contained the word "motorboat" then there would be an entry in your hash for that and it would be very quick to test if the file contained that word merely by testing for presence within a hash. This could then prune the files that needed to go into a method like the one outlined above and massively improve the performance.
So, the following code will parse all ".txt" files and note, for each word, which files it appears in. When a search needs to be made, you can simply pass the search terms in, find the files that contain all the words (not necessarily adjacent to each other), and pass that list of files to the script above:
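(A Python sketch of that idea; the whitespace tokenization and the tiny command-line interface are illustrative assumptions.)

```
import sys
from collections import defaultdict
from pathlib import Path

def build_word_index(root="."):
    """For each word, note which .txt files it occurs in."""
    index = defaultdict(set)
    for path in Path(root).rglob("*.txt"):
        for word in path.read_text(errors="ignore").split():
            index[word].add(str(path))
    return index

if __name__ == "__main__":
    index = build_word_index()
    terms = sys.argv[1:]                                  # e.g. banana motorboat
    file_sets = [index.get(term, set()) for term in terms]
    candidates = set.intersection(*file_sets) if file_sets else set()
    # These candidate files can then be handed to the full scan above.
    for path in sorted(candidates):
        print(path)
```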
The output for the little test files I have created is as follows:
Parallel Programming
At large scale this is definitely a task for parallel programming: distribute the files to different computing units, let them search, then gather the results. This is actually what Google does; for example, they once solved a translation problem by combining thousands of commodity PCs. (Although they might be using other hardware for real Google search results.) You can read popular articles about this on the internet.
"MapReduce" as one concept
Google, for example, invented a paradigm called MapReduce, which they wrote down in a whitepaper. It basically boils down to mapping input to output (widely distributed) in a first step, then reducing all the little results into one main result in a second step.
One could implement the search like this:
(This is practically the same as the "distributed grep" problem they presented in their paper.)
The problem of finding out whether a given string exists in a given text is well studied under the name "string matching"; see for example the Rabin-Karp algorithm or the Knuth-Morris-Pratt algorithm (just to get your hands on something). So the implementation of map is fairly easy.
For distributing the files one can use a lot of different techniques. If one wants to get a proper view of what is possible with distributed filesystems, one could read up on the Google File System (GFS), e.g. in the corresponding whitepaper.
reduce pretty much does nothing, so this is really easy.
Finished.
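As a toy illustration only (not Google's actual interface), the map/reduce split for this search might be sketched in Python like so, with the built-in substring check standing in for a proper string-matching algorithm such as Rabin-Karp or Knuth-Morris-Pratt:

```
def map_phase(filename, text, terms):
    """map: emit the filename if this document contains every search term."""
    if all(term in text for term in terms):
        yield filename

def reduce_phase(mapped_results):
    """reduce: merely collect the emitted filenames into one result list."""
    return [name for partial in mapped_results for name in partial]

def distributed_grep(documents, terms):
    # Toy driver: a real MapReduce framework would run the map calls
    # distributed across many machines and handle the shuffle itself.
    mapped = (map_phase(name, text, terms) for name, text in documents.items())
    return reduce_phase(mapped)
```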
That is the biggest advantage of the MapReduce paradigm: once you understand how map and reduce combine into one result, it is fairly easy to implement those two functions. If a MapReduce framework has already been implemented, one does not have to worry at all about the parallelism of the computation, which can otherwise cause severe headaches.
Other concepts
This is definitely not the only possible concept.
If you are interested in this field of study you will find lots of other possibilities, and I am sure many more will come up in the near future as distributed systems are on the rise. But I hope I could provide some insight into what is possible, what to watch out for, and even a direction for how one could implement this right away.
If you could, for example, periodically serialize each file into a trie, you could then deserialize each trie as needed by a search and perform queries on all the tries?
It would be very fast, but would of course require you to have a process constantly updating the tries for the files. I am pretty sure Google also keeps indexing its data somehow, and you have to make some trade-offs: increased performance at the cost of memory in this case.
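A rough Python sketch of that scheme, using a plain nested-dict trie over the words of each file and pickle for the periodic serialization (both just illustrative choices), might look like this:

```
import pickle
from pathlib import Path

END = "$"   # marker meaning "a word ends at this node"

def build_trie(path):
    """Build a nested-dict trie of the words in one file."""
    trie = {}
    for word in Path(path).read_text(errors="ignore").split():
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return trie

def serialize_trie(path):
    """Run periodically: dump the file's trie next to the file."""
    with open(str(path) + ".trie", "wb") as fh:
        pickle.dump(build_trie(path), fh)

def trie_contains(trie, word):
    """Query one (deserialized) trie for a single word."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node
```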
(The question is worded fairly broadly. Any efficient solution depends heavily on the specific assumptions made. For the sake of discussion, I will make some assumptions you didn't explicitly mention.)
Model
Let's say there are f files, w words in total in those files, d unique words (d is the minimal size of a dictionary required to cover all files), q words in the query, and r files in the result set of the query. I'll assume that q << d << f << w (i.e. each variable is 'orders of magnitudes smaller' than its successor), and further that q is essentially constant, i.e. O(1). I'll also assume that you're primarily interested in minimizing the amortized computation time measured in O(f) or O(w), that you're willing to invest more memory for less computation time, and that you expect to get queries fairly often.

Note that the runtime of the algorithm cannot be better than O(r), since we need to output each file belonging to the result set.

Algorithm
Create an index based on a hashmap from words to sets of files as follows:
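(A minimal Python sketch, assuming a list of file paths called files and naive whitespace tokenization.)

```
from collections import defaultdict

def build_index(files):
    """Hashmap from each word to the set of files containing it."""
    index = defaultdict(set)
    for path in files:
        with open(path, errors="ignore") as fh:
            for word in fh.read().split():    # naive whitespace tokenization
                index[word].add(path)
    return index
```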
This code runs in O(w), which is minimal (since you need to look at the complete input at least once). To find all files that contain all of the words in query, run:
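(Sketched here in Python; sorting by set size and the early exit are optional optimizations, not part of the analysis below.)

```
def run_query(index, query):
    """Return the set of files that contain every word in `query`."""
    # Intersect the per-word file sets, starting from the smallest one.
    words = sorted(query, key=lambda w: len(index.get(w, set())))
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
        if not result:                        # nothing can match any more
            break
    return result
```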
This code's runtime is essentially determined by the q set intersections it needs to perform. In a typical case, each of the sets is probably O(log(f)) large and the overlap of the individual sets is moderate. In this case, the computation takes O(log(f)).

In the worst case however, each of the sets is O(f) large, even though the overlap (and therefore r) is small. In this case, the computation would still take O(f).