The related questions that appear after entering the title, and those that are in the right side bar when viewing a question seem to suggest very apt questions.
Stack Overflow only does a SQL search for it and uses no special algorithms, said Spolsky in a talk.
What algorithms exist to give good answers in such a case.
How do U do database search in such a case? Make the title searchable and search on the keywords or search on tags and those questions with many votes on top?
The related questions sidebar will be building on the tags for each question (probably by ranking them based on tag overlap, so 5 tags in common > 4 tags in common etc).
The rest will be building on heuristics and algorithms suitable for natural language processing. These aren't normally very good in general purpose language, but most of them are VERY good once the vocabulary is reduced down to a single technical area such as programming.
If you listen to the Stack Overflow podcast 32 (unfortunately the transcript doesn't have much in) you can hear Jeff Atwood say a little about how he does it.
It seems like the algorithm is something like:
- Take the question
- Remove the most common words in English (from a list he got from google)
- submit a full text search to the SQL server 2008 full text search engine
More details about the full text search can be found here: http://msdn.microsoft.com/en-us/library/ms142571.aspx
This may be out of date by now - they were talking about moving to a better/faster full text search such as Lucene, and I vaguely remember Jeff saying in the podcast that this had been done.
Have a look at Porter stemming for a stemming algorithm if you are looking to get into "related" algorithms.
A stemmer for English, for example,
should identify the string "cats" (and
possibly "catlike", "catty" etc.) as
based on the root "cat", and
"stemmer", "stemming", "stemmed" as
based on "stem". A stemming algorithm
reduces the words "fishing", "fished",
"fish", and "fisher" to the root word,
"fish".
Once you have processed a document and done stemming on it, you can index the stemmed words by count and then compare against other documents. This is the most basic approach to tackling this problem.
Also take care to ignore stop words like "the", "an", "of" etc.
This post will help you Is there an algorithm that tells the semantic similarity of two phrases
I do not know how SO implements it, but my hunch is that they use a variant of approximate string matching.
Such problems are solved by making a "bag of words" of the stemmed words. That is basically a word count vector. Those words are preprocessed (stemmed) and weighted with their probability to occur in a sentence ("the" has a higher probability than "probability" and should thus be weighted less). You then can perceive this bag of words either as a vector in euclidean space or as a sample of a probability density.
You can apply algorithms as nearest neighbour search or semantic hashing. The latter seems to be SOTA (see http://www.cs.toronto.edu/~rsalakhu/papers/semantic_final.pdf).
Use the fulltext search feature of SQL Server.