I am analysing a few million emails. My aim is to classify them into groups. The groups could be, for example:
- Delivery problems (slow delivery, slow handling before dispatch, incorrect availability information, etc.)
- Customer service problems (slow email response time, impolite response, etc.)
- Return issues (slow handling of return request, lack of helpfulness from the customer service, etc.)
- Pricing complaints (hidden fees discovered, etc.)
In order to perform this classification, I need an NLP library that can recognize combinations of word groups like:
- "[they|the company|the firm|the website|the merchant]"
- "[did not|didn't|no]"
- "[response|respond|answer|reply]"
- "[before the next day|fast enough|at all]"
- etc.
A combination of a few of these example groups should then match sentences like:
- "They didn't respond"
- "They didn't respond at all"
- "There was no response at all"
- "I received no response from the website"
And then classify the sentence as Customer service problems.
Which NLP library would be able to handle such a task? From what I have read, these are the most relevant:
- Stanford CoreNLP
- OpenNLP
See also these suggested NLP libraries.
Not entirely sure, but I can think of two ways of trying to solve your problem:
Standard Machine Learning
As stated in the comment, extract only keywords from each mail and train a classifier using them. Define your relevant keyword set beforehand and extract only those keywords from the email if they are present.
This is a simple but powerful technique and not to be underestimated as it yields very good results in many cases. You might want to try this one out first as more complex algorithms might be overkill.
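A minimal sketch of the keyword approach, assuming a hand-picked keyword set and a small Naive Bayes classifier written with the standard library only. The keyword list, the category names, and the training sentences are all made up for illustration; a real setup would use far more data and probably an existing library.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical keyword set, chosen only for illustration.
KEYWORDS = {"response", "respond", "reply", "answer", "delivery",
            "late", "refund", "return", "fee", "price"}

def extract_keywords(text):
    """Return the predefined keywords that occur in the text, in order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t in KEYWORDS]

def train(samples):
    """Count documents and keyword occurrences per category."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    for category, text in samples:
        doc_counts[category] += 1
        word_counts[category].update(extract_keywords(text))
    return doc_counts, word_counts

def classify(model, text):
    """Return the category with the highest Naive Bayes log score."""
    doc_counts, word_counts = model
    total_docs = sum(doc_counts.values())
    features = extract_keywords(text)
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        # Log prior: share of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in features:
            # Laplace smoothing so unseen keywords do not zero out the score.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(KEYWORDS)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Made-up training data; category labels are hypothetical.
TRAINING = [
    ("customer_service", "They didn't respond to my email"),
    ("customer_service", "I received no reply and no answer"),
    ("delivery", "The delivery was late"),
    ("delivery", "Late delivery again"),
    ("pricing", "A hidden fee was added to the price"),
    ("returns", "My return and refund took weeks"),
]
model = train(TRAINING)
```

Calling `classify(model, some_email_text)` then returns one of the category labels seen in training. Because everything outside `KEYWORDS` is discarded, the quality of the result depends mostly on how well the keyword set is chosen.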
Grammars
If you really want to delve into NLP then, based on your question description, you might try defining some sort of grammar and parsing each email against it. I don't have much experience with Ruby, but I'm sure some lex/yacc equivalent exists. A quick web search gives this SO question and this. By identifying these phrases, you could judge which category an email falls under by calculating the proportion of phrases found for each category.
For example, intuitively, some productions within the grammar could be defined as:
where
organization = [they|the company|the firm|the website|the merchant]
and so on.
These approaches might be a way to start.
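As a rough illustration of the phrase idea, here is a sketch that uses regular expressions in place of a real lex/yacc grammar. The word groups come from the question; the phrase patterns, the category names, and the `phrase_proportions` helper are assumptions made up for this example, not a tested rule set.

```python
import re

def group(*alternatives):
    """Build a non-capturing alternation such as (?:they|the company)."""
    return "(?:" + "|".join(re.escape(a) for a in alternatives) + ")"

# Word groups taken from the question.
ORGANIZATION = group("they", "the company", "the firm",
                     "the website", "the merchant")
NEGATION = group("did not", "didn't", "no")
RESPONSE = group("response", "respond", "answer", "reply")

# Hypothetical phrase patterns per category.
PHRASES = {
    "customer_service": [
        # <organization> <negation> <response>, e.g. "They didn't respond"
        rf"\b{ORGANIZATION}\s+{NEGATION}\s+{RESPONSE}\b",
        # <negation> <response>, e.g. "no response"
        rf"\b{NEGATION}\s+{RESPONSE}\b",
    ],
    "delivery": [
        r"\bslow\s+delivery\b",
        r"\b(?:late|delayed)\s+(?:delivery|dispatch)\b",
    ],
}

def phrase_proportions(email):
    """Map each category to the fraction of its patterns found in the email."""
    text = email.lower()
    return {
        category: sum(bool(re.search(p, text)) for p in patterns) / len(patterns)
        for category, patterns in PHRASES.items()
    }
```

The category with the highest proportion would be the guess for that email. Note that a sentence like "I received no response from the website" only matches the second customer service pattern here, since the organization comes after the negation; covering such word orders is where a proper grammar or parser would start to pay off.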
Using the OpenNLP doccat API, you can create training data and then build a model from it. The advantage of this over something like a Naive Bayes classifier is that it returns a probability distribution over your set of categories.
So if you create a file with this format:
etc. Provide as many samples as possible, and make sure each line ends with a newline (\n).
Using this approach, you can add anything you want that means "customer service problems", and you can add any other categories as well, so you don't have to be too deterministic about what data falls into which category.
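The training file can be produced in any language. Here is a small sketch in Python, assuming OpenNLP's default doccat layout of one sample per line with the category label first, followed by whitespace and the text. The category labels, the sample sentences, and both helper functions are hypothetical.

```python
def to_doccat_line(category, text):
    """Format one sample as '<category> <text>' terminated by a newline."""
    if not category or any(ch.isspace() for ch in category):
        raise ValueError("category label must be a single token without whitespace")
    # Collapse line breaks and repeated spaces so the sample stays on one line.
    flat_text = " ".join(text.split())
    return f"{category} {flat_text}\n"

# Made-up samples; labels are assumptions for illustration.
SAMPLES = [
    ("customerserviceproblems", "They didn't respond"),
    ("customerserviceproblems", "There was no response at all"),
    ("deliveryproblems", "The package arrived two weeks late"),
]

def write_training_file(path, samples):
    """Write all samples to path, one per line, as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for category, text in samples:
            handle.write(to_doccat_line(category, text))
```

Collapsing whitespace matters for emails in particular, since a stray line break inside a message would otherwise split one sample into two malformed lines.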
Here is what the Java looks like to build the model:
Once you have the model, you can use it something like this:
The returned hash map then holds each category you modeled along with a score; you can use the scores to decide which category the input text belongs to.
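The decision step itself is independent of the library. A minimal sketch in Python, with a plain dict standing in for the returned map; the `pick_category` name and the 0.5 cut-off are assumptions, and a sensible threshold would have to be tuned on real data.

```python
def pick_category(scores, min_score=0.5):
    """Return the best-scoring category, or None if no score reaches min_score."""
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else None
```

Returning `None` for low-confidence results lets you route those emails to manual review, or add them to the training data under the right category later.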