I want to convert Natural Language questions to SQL queries using opennlp Java library, i.e.
Who won women figure skating in Sochi?
should be converted to
select name from winners where event='skating_woman'
Anybody know which classes will be useful and how to achieve this?
Also pasting code which I tried.
I have converted question into statements and later into tokens.
/////////1st part
String paragraph = "Did Matt win the men slalom?";
InputStream is1 = new FileInputStream("bins/en-sent.bin");
SentenceModel model1 = new SentenceModel(is1);
SentenceDetectorME sdetector = new SentenceDetectorME(model1);
String sentences[] = sdetector.sentDetect(paragraph);
System.out.println(sentences[0]);
is1.close();
////////2nd part
InputStream is2 = new FileInputStream("bins/en-token.bin");
TokenizerModel model2 = new TokenizerModel(is2);
Tokenizer tokenizer = new TokenizerME(model2);
String tokens[] = tokenizer.tokenize(sentences[0]);
for (String a : tokens)
System.out.println(a);
is2.close();
I don't have time to post code right now, but please take a look at the opennlp sentence chunker and document categorizer. I think you could -very creatively- use the doccat to establish your "key" and the sentence chunker to establish noun and verb phrases (not tokens, but actualy multi word phrases) and combine the results. So at query time, you would categorize the sentence to establish a key, then chunk the sentence, then do a query that joins to the keys table and then fuzzily (full text index maybe) the phrases table. Just a thought, if it's interesting I'll post code as an edit. you would have to build the doccat model using samples.
EDIT
Here is how to get a probability dist over a set of categories using the opennlp document categorizer, youll need to supply a properties file that has the path to the doccat model:
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
/**
*
* @author Owner
*/
public class SentimentFinder {
private DoccatModel doccatModel;
private DocumentCategorizerME documentCategorizerME;
Properties props =null;
public void init() {
try {
if (doccatModel == null) {
doccatModel = new DoccatModel(new File(props.getProperty("opennlp.sentiment.model.generic")));
documentCategorizerME = new DocumentCategorizerME(doccatModel);
}
} catch (IOException ex) {
ex.printStackTrace();
}
}
/**
* Classifies text via a maxent model. Try to keep chunks of text small, or
* typically there will be all low scores with little difference.
*
* @param text the string to be classified
* @return
*/
public Map<String, Double> probDist(String text) {
Map<String, Double> probDist = new HashMap<String, Double>();
if (doccatModel == null) {
init();
}
double[] categorize = documentCategorizerME.categorize(text);
int catSize = documentCategorizerME.getNumberOfCategories();
for (int i = 0; i < catSize; i++) {
String category = documentCategorizerME.getCategory(i);
probDist.put(category, categorize[documentCategorizerME.getIndex(category)]);
}
return probDist;
}
}
And here's how to chunk the sentence with a sentence chunker and get noun phrases
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;
/**
*
* Extracts noun phrases from a sentence. To create sentences using OpenNLP use
* the SentenceDetector classes.
*/
public class OpenNLPNounPhraseExtractor {
static final int N = 2;
public static void main(String[] args) {
try {
HashMap<String, Integer> termFrequencies = new HashMap<>();
String modelPath = "c:\\temp\\opennlpmodels\\";
TokenizerModel tm = new TokenizerModel(new FileInputStream(new File(modelPath + "en-token.zip")));
TokenizerME wordBreaker = new TokenizerME(tm);
POSModel pm = new POSModel(new FileInputStream(new File(modelPath + "en-pos-maxent.zip")));
POSTaggerME posme = new POSTaggerME(pm);
InputStream modelIn = new FileInputStream(modelPath + "en-chunker.zip");
ChunkerModel chunkerModel = new ChunkerModel(modelIn);
ChunkerME chunkerME = new ChunkerME(chunkerModel);
//this is your sentence
String sentence = "Barack Hussein Obama II is the 44th awesome President of the United States, and the first African American to hold the office.";
//words is the tokenized sentence
String[] words = wordBreaker.tokenize(sentence);
//posTags are the parts of speech of every word in the sentence (The chunker needs this info of course)
String[] posTags = posme.tag(words);
//chunks are the start end "spans" indices to the chunks in the words array
Span[] chunks = chunkerME.chunkAsSpans(words, posTags);
//chunkStrings are the actual chunks
String[] chunkStrings = Span.spansToStrings(chunks, words);
for (int i = 0; i < chunks.length; i++) {
String np = chunkStrings[i];
if (chunks[i].getType().equals("NP")) {
if (termFrequencies.containsKey(np)) {
termFrequencies.put(np, termFrequencies.get(np) + 1);
} else {
termFrequencies.put(np, 1);
}
}
}
System.out.println(termFrequencies);
} catch (IOException e) {
}
}
}
so what I was thinking is to classify the input text, and extract and store noun phrases, you could then, at query time, classify the input, get a category, then do something like this in SQL
select * from categories a inner join nounphrases b on a.id = b.catid where catname = @thecatIjustgotfromtheclassifier and contains(text,'search term')
or something like that
Simple answer to this is that you "Could" do it - but its impractical and close to impossible.. Simple reason is that you can have many similar Questions that can mean exactly the same. With SQL query you have specific syntax you have to keep in order to get data... with standard language you have potentially 100's of different syntax to get same thing. Mapping this into SQL language is very impractical and close to impossible.
Yes you could force the user to use specific words or specific question format so your program can interpret them as SQL queries - but than again this is breaking the paradigm of what you want to achieve isn't it
Edit:
Since you are soooo desperate ;) you could "potentially" do something like this
Lets imagine a very simple Query
SELECT * FROM USERS WHERE AGE = '20';
What are the possible Human language questions?
- Can you show me all people with age of 20?
- Display humans with the age of 20
- How many people in this group have age of 20?
- How many people in this table have 20 years ?
- Do we have any people aged 20 ?
- Search for all people with age of twenty
What you could then do is create some sort of Map<key,value>
where ie.
Key = USERS;
Value = people, humans,
And another Map
Key = SELECT;
Value = Can you show me, Display, Do we have, Search, How many;
And so on so forth - This will create a complex map with all possible phrases that can mean the same thing and correspond to a given SQL query syntax element. This may not be the best solution but it's what I would probably do - at least this is what I would start with
Check out Kueri.me
This product provides technology which powers a search-box which converts English to SQL, It's built to construct an Answer Engine to relational databases.
It can recognizes various filters, groupings, and other analytics elements, so it can answer questions like
You do have to configure the system to recognize your tables, fields and jargon.
Note : I work at kueri.me