I am trying to write a Java implementation of a maxent classifier. I need to classify sentences into n different classes.
I had a look at ColumnDataClassifier in the Stanford maxent classifier, but I could not work out how to create the training data. I need training data that includes the POS tags for the words of each sentence, so that the classifier can use features such as the previous word, the next word, etc.
I am looking for training data in which each sentence is POS-tagged and labelled with its class, for example:
My/(POS) name/(POS) is/(POS) XYZ/(POS) CLASS
Any help will be appreciated.
If I understand it correctly, you are trying to treat sentences as a set of POS tags.
In your example, the sentence "My name is XYZ" would be represented as the set (PRP$, NN, VBZ, NNP). That means every sentence is effectively a binary vector of length 37: one entry for each of the 36 possible POS tags (according to this page), plus the CLASS outcome feature for the whole sentence.
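To make the binary-vector idea concrete, here is a minimal sketch in plain Java (no OpenNLP calls). It assumes the 36 tags are the Penn Treebank set and that the input uses the word/TAG format from the question; the class name PosTagVector and the method encode are made up for illustration. It only builds the 36 tag slots and leaves the CLASS outcome to be stored separately.

```java
import java.util.Arrays;
import java.util.List;

class PosTagVector {
    // Assumption: the 36 Penn Treebank POS tags are the tag set in use.
    public static final List<String> TAGS = Arrays.asList(
        "CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD", "NN",
        "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR", "RBS", "RP", "SYM",
        "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "WDT", "WP", "WP$", "WRB");

    /** Returns a 0/1 vector with one slot per tag; a slot is 1 if that tag occurs in the sentence. */
    public static int[] encode(String taggedSentence) {
        int[] vector = new int[TAGS.size()];
        for (String token : taggedSentence.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash < 0) {
                continue; // token carries no tag, so it contributes nothing
            }
            int index = TAGS.indexOf(token.substring(slash + 1));
            if (index >= 0) {
                vector[index] = 1; // set semantics: repeated tags still give 1
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode("My/PRP$ name/NN is/VBZ XYZ/NNP")));
    }
}
```

Because the sentence is treated as a set, word order and tag counts are lost, which is one reason this representation is weak on its own.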
This can be encoded for OpenNLP Maxent as follows:
or simply:
(For a working code snippet, see my answer here: Training models using openNLP maxent)
Some more sample data would be:
This would yield samples:
However, I don't expect such a classification to yield good results. It would be better to use other structural features of a sentence, such as the parse tree or dependency tree, which can be obtained using e.g. the Stanford parser.
Edited on 28.3.2016: You can also use the whole sentence as a training sample. However, be aware that:
- two sentences might contain the same words but have different meanings
- there is a pretty high chance of overfitting
- you should use short sentences
- you need a huge training set
According to your example, I would encode the training samples as follows:
Notice that the outcome variable comes as the first element on each line.
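As a rough illustration of the outcome-first layout, the following plain-Java sketch rewrites a line in the question's format (tagged tokens followed by the class) so that the class comes first. It assumes tokens are separated by whitespace and that the last token is the class label; the label GREETING and the class name OutcomeFirstFormatter are hypothetical and used only for this example.

```java
class OutcomeFirstFormatter {
    /** Moves the trailing class label of a labelled sentence to the front of the line. */
    public static String toTrainingLine(String labelledSentence) {
        String[] tokens = labelledSentence.trim().split("\\s+");
        if (tokens.length < 2) {
            throw new IllegalArgumentException("expected at least one token followed by a class label");
        }
        // Assumption: the last token is the class, everything before it is a feature.
        StringBuilder line = new StringBuilder(tokens[tokens.length - 1]);
        for (int i = 0; i < tokens.length - 1; i++) {
            line.append(' ').append(tokens[i]);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // GREETING is a made-up class label for illustration.
        System.out.println(toTrainingLine("My/PRP$ name/NN is/VBZ XYZ/NNP GREETING"));
    }
}
```

Keeping each word/TAG pair as a single feature is just one possible choice; the words and tags could equally be emitted as separate features.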
Here is a fully working minimal example using opennlp-maxent-3.0.3.jar.
And some dummy training data (stored as training-file.txt):
This yields the following output: