Use Case
When a user goes to my website, they will be confronted with a search box much like SO. They can search for results using plan text. ".net questions", "closed questions", ".net and java", etc.. The search will function a bit different that SO, in that it will try to as much as possible of the schema of the database rather than a straight fulltext search. So ".net questions" will only search for .net questions as opposed to .net answers (probably not applicable to SO case, just an example here), "closed questions" will return questions that are closed, ".net and java" questions will return questions that relate to .net and java and nothing else.
Problem
I'm not too familiar with the words but I basically want to do a keyword to SQL driven search. I know the schema of the database and I also can datamine the database. I want to know any current approaches there that existing out already before I try to implement this. I guess this question is for what is a good design for the stated problem.
Proposed
My proposed solution so far looks something like this
- Clean the input. Just remove any special characters
- Parse the input into chunks of data. Break an input of "c# java" into c# and java Also handle the special cases like "'c# java' questions" into 'c# java' and "questions".
- Build a tree out of the input
- Bind the data into metadata. So convert stuff like closed questions and relate it to the isclosed column of a table.
- Convert the tree into a sql query.
Thoughts/suggestions/links?
You could use a hybrid approach, take the full text search results and further filter them based on the meta data from your #4. For something more intelligent you could create a simple supervised learning solution by tracking what links the user clicks on after the search and store that choice with the key search words in a decision tree. Searches would then be mined from this decision tree
If you're using SQL Server, you can simply use its Full-Text Search feature, which is specifically designed to solve your problem.
What you are looking for is Natural Language Processing. Strangely enough this used to be included free as English Query in SQL Server 2000 and prior. But it's gone now
Some other sources are :
The concept is a meta data dictionary mapping words to table, columns, relationships etc and an English sentence parser combined together to convert a English sentence ( or just some keywords) into a real query
Some people even user English Query with speech recognition for some really cool demos, never saw it used in anger though!
I run a digital music store with a "single search" that weights keywords based on their occurrences and the schema in which Products appear, eg. with different columns like "Artist", "Title" or "Publisher".
Products are also related to albums and playlists, but for simpler explanation, I will only elaborate on the indexing and querying of Products' Keywords.
Database Schema
Keywords
table - a weighted table for every word that could possibly be searched for (hence, it is referenced somewhere) with the following data for each record:ProductKeywords
table - a weighted table for every keyword referenced by any of a product's fields (or columns) with the following data for each record:Keyword Weighting
The weighting value is an indication of how often the words occurs. Matching keywords with a lower weight are "more unique" and are more likely to be what is being searched for. In this way, words occurring often are automatically "down-weighted", eg. "the", "a" or "I". However, it is best to strip out atomic occurrences of those common words before indexing.
I used integers for weighting, but using a decimal value will offer more versatility, possibly with slightly slower sorting.
Indexing
Whenever any product field is updated, eg. Artist or Title (which does not happen that often), a database trigger re-indexes the product's keywords like so inside a transaction:
ProductKeywords
table for a direct match.ProductKeyword
weight adjustment.Querying