How do search engines merge results from an inverted index?
For example, if I searched for the words "dog" and "bat", the inverted index would give me two huge lists of every document that contains each word.
I doubt that a search engine walks through these lists one document at a time looking for documents that appear in both. What is done algorithmically to make this merging process blazing fast?
Since inverted indices are ordered by docId, they can be merged very fast. [If one word's list starts at docId 23 and the second at docId 100001, you can immediately fast-forward to docId 100001 or higher in the first list as well.]
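For instance, here is a minimal sketch of that kind of skip-ahead intersection, assuming each postings list is simply a sorted int[] of docIds held in memory (the array names and the gallopTo helper are made up for illustration; a real index adds skip pointers, compression and an on-disk layout on top of this):

```java
import java.util.ArrayList;
import java.util.List;

public class PostingsIntersect {
    // Intersect two postings lists that are sorted by docId.
    // Whenever the current docIds differ, fast-forward the list that is
    // behind with a galloping (exponential) search instead of stepping
    // one document at a time.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                result.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i = gallopTo(a, i, b[j]);   // skip ahead in the first list
            } else {
                j = gallopTo(b, j, a[i]);   // skip ahead in the second list
            }
        }
        return result;
    }

    // Return the first index >= from whose value is >= target:
    // double the step size until we overshoot, then binary search.
    static int gallopTo(int[] list, int from, int target) {
        int step = 1;
        int hi = from;
        while (hi < list.length && list[hi] < target) {
            hi += step;
            step *= 2;
        }
        int lo = Math.max(from, hi - step / 2);
        hi = Math.min(hi, list.length);
        while (lo < hi) {                    // plain binary search in [lo, hi)
            int mid = (lo + hi) >>> 1;
            if (list[mid] < target) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        int[] dog = {3, 23, 150, 100001, 100042};
        int[] bat = {100001, 100042, 200000};
        System.out.println(intersect(dog, bat));  // prints [100001, 100042]
    }
}
```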
Since the typical document intersection is at most a few million entries, it can be sorted for ranking very fast. I searched for 'dog cat' [a very common pair of words], which returned only 54 million hits.
Sorting 10 million random integers took only 2.3 seconds on my Mac with single-threaded code [1 million took 206 ms!], and since we typically need to pick only the top 10, not even a full sort is required.
Here is a minimal sketch of that kind of timing code, if someone wants to try the speed of the sort and is too lazy to write it (plain single-threaded Java using java.util.Arrays.sort on random ints; exact timings will of course vary by machine):
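```java
import java.util.Arrays;
import java.util.Random;

public class SortBenchmark {
    public static void main(String[] args) {
        int n = 10_000_000;               // 10 million "docIds"
        int[] docIds = new int[n];
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) {
            docIds[i] = rnd.nextInt();
        }

        long start = System.nanoTime();
        Arrays.sort(docIds);              // single-threaded dual-pivot quicksort
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("Sorted " + n + " random ints in " + elapsedMs + " ms");
    }
}
```

For the top-10 case, a bounded priority queue (or a single selection pass) gets the answer without even finishing a full sort.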
Actually, search engines do merge these document lists. They gain good performance by using other techniques, the most important of which is pruning: for example, for every word the documents are stored in order of decreasing pagerank, and to get results that have a chance of making it into the first 10 (which will be shown to the user) you may traverse just a fairly small portion of the dog and bat lists, say, the first thousand. (And, of course, there's caching, but that's not related to the query-execution algorithm itself.)
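As a toy illustration of that pruning idea (not how a production engine actually stores its postings: here each posting just carries a made-up static rank, and both lists are assumed to be kept in decreasing order of that rank):

```java
import java.util.*;

public class PrunedQuery {
    // A posting: docId plus the document's static rank (e.g. pagerank).
    record Posting(int docId, double rank) {}

    // Intersect only a prefix of each rank-ordered postings list.
    // Because both lists are sorted by decreasing static rank, the best
    // candidates for the first results page sit near the front, so looking
    // at, say, the first thousand entries of each list is usually enough.
    static List<Posting> topResults(List<Posting> dog, List<Posting> bat,
                                    int prefix, int k) {
        Set<Integer> batDocs = new HashSet<>();
        for (Posting p : bat.subList(0, Math.min(prefix, bat.size()))) {
            batDocs.add(p.docId());
        }
        List<Posting> hits = new ArrayList<>();
        for (Posting p : dog.subList(0, Math.min(prefix, dog.size()))) {
            if (batDocs.contains(p.docId())) {
                hits.add(p);
            }
        }
        // The dog prefix is already in decreasing rank order,
        // so the first k hits are the top-k by static rank.
        return hits.subList(0, Math.min(k, hits.size()));
    }

    public static void main(String[] args) {
        List<Posting> dog = List.of(new Posting(7, 0.9), new Posting(42, 0.5), new Posting(13, 0.1));
        List<Posting> bat = List.of(new Posting(42, 0.8), new Posting(99, 0.3), new Posting(13, 0.2));
        System.out.println(topResults(dog, bat, 1000, 10));  // docs 42 and 13
    }
}
```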
Besides, there are not that many documents about dogs and about bats in the first place: even if it's millions, a good implementation gets through them in a split second.
P.S. I worked at our country's leading search engine, though not on the core engine of our flagship search product; but I talked to its developers and was surprised to learn that query-execution algorithms are actually fairly dumb: it turns out that you can squeeze a huge amount of computation into acceptable time bounds. It is all heavily optimized, of course, but there's no magic and no miracles.