If you have a corpus of text, how can you identify all the categories it mentions (from a list of pre-defined categories) and the sentiment (positive/negative) associated with each of them?
I will be doing this in Python but at this stage I am not necessarily looking for a language specific solution.
Let's look at this question with an example to try and clarify what I am asking.
If I have a whole corpus of reviews for products e.g.:
Microsoft's Xbox One offers impressive graphics and a solid list of exclusive 2015 titles. The Microsoft console currently edges ahead of the PS4 with a better selection of media apps. The console's fall-2015 dashboard update is a noticeable improvement. The console has backward compatibility with around 100 Xbox 360 titles, and that list is poised to grow. The Xbox One's new interface is still more convoluted than the PS4's. In general, the PS4 delivers slightly better installation times, graphics and performance on cross-platform games. The Xbox One also lags behind the PS4 in its selection of indie games. The Kinect's legacy is still a blemish. While the PS4 remains our overall preferred choice in the game console race, the Xbox One's significant course corrections and solid exclusives make it a compelling alternative.
And I have a list of pre-defined categories e.g. :
- Graphics
- Game Play
- Game Selection
- Apps
- Performance
- Irrelevant/Other
I could take my big corpus of reviews and break it down by sentence. I can then hand-tag each sentence in my training data with the appropriate categories. The problem is that a single sentence can cover several categories.
If it was 1 category per sentence then any classification algorithm from scikit-learn would do the trick. Since a sentence can carry several categories, I could use something like multi-label classification instead.
Adding in the sentiment is the trickier part. Identifying the sentiment of a whole sentence is a fairly simple task, but if the sentiment differs between the labels in that sentence it becomes difficult.
Take the example sentence "The Xbox One has a good selection of games but the performance is worse than the PS4". We can identify two of our pre-defined categories (game selection, performance), but we have a positive sentiment towards game selection and a negative sentiment towards performance.
What would be a way to identify all categories in text (from our pre-defined list) with their associated sentiment?
One simple method is to break your training set into minimal sentences using a parser and use that as the input for labelling and sentiment classification.
Your example sentence:
The Xbox One has a good selection of games but the performance is worse than the PS4
Using the Stanford Parser, take S tags that don't have child S tags (and thus are minimal sentences) and put the tokens back together. For the above sentence that would give you these:
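As a rough sketch of that extraction step, here is how it could look in plain Python. The bracketed parse below is hand-written in Penn Treebank style and is only my assumption of what a constituency parser might return for the example sentence; the helper names (`parse_brackets`, `minimal_sentences`, and so on) are made up for illustration. The check looks for any S descendant rather than only direct children, which is slightly stricter.

```python
import re

# Hand-written parse, assumed to resemble a constituency parser's output.
PARSE = """
(ROOT (S
  (S (NP (DT The) (NNP Xbox) (NNP One))
     (VP (VBZ has) (NP (NP (DT a) (JJ good) (NN selection))
                       (PP (IN of) (NP (NNS games))))))
  (CC but)
  (S (NP (DT the) (NN performance))
     (VP (VBZ is) (ADJP (JJR worse)
                        (PP (IN than) (NP (DT the) (NNP PS4))))))))
"""

def parse_brackets(text):
    """Turn a bracketed parse into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read_node()

def leaves(node):
    """Collect the words under a node, left to right."""
    words = []
    for child in node[1]:
        words.extend([child] if isinstance(child, str) else leaves(child))
    return words

def has_s_descendant(node):
    """True if any node strictly below this one is labelled S."""
    return any(
        not isinstance(c, str) and (c[0] == "S" or has_s_descendant(c))
        for c in node[1]
    )

def minimal_sentences(node):
    """Return the text of every S node that contains no other S node."""
    if node[0] == "S" and not has_s_descendant(node):
        return [" ".join(leaves(node))]
    found = []
    for child in node[1]:
        if not isinstance(child, str):
            found.extend(minimal_sentences(child))
    return found

clauses = minimal_sentences(parse_brackets(PARSE))
print(clauses)
```

With this particular tree the result is the two clauses on either side of "but", each of which can then be labelled and scored separately.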
Sentiment within an S tag should be consistent most of the time. If sentences like
The XBox has good games and terrible graphics
are common in your dataset you may need to break it down to NP tags, but that seems unlikely.
Regarding labelling, as you mentioned, any multi-label classification method should work.
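For the labelling part, a minimal multi-label sketch with scikit-learn might look like the following. The six training sentences and their tags are invented purely for illustration, and TF-IDF features with one logistic regression per label (via `OneVsRestClassifier`) is just one reasonable choice, not the only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy hand-tagged sentences (made up for this example).
sentences = [
    "The graphics look stunning in every game",
    "There is a great selection of exclusive titles",
    "Load times are slow and performance drops often",
    "Good selection of games but the graphics are dated",
    "Performance is smooth and the graphics are sharp",
    "The indie game selection is thin and performance lags",
]
tags = [
    ["Graphics"],
    ["Game Selection"],
    ["Performance"],
    ["Game Selection", "Graphics"],
    ["Performance", "Graphics"],
    ["Game Selection", "Performance"],
]

# Turn the tag lists into a binary indicator matrix, one column per label.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

# One binary classifier per label on top of TF-IDF features.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression()),
)
model.fit(sentences, Y)

new_sentence = ["The performance is worse than the PS4"]
pred = model.predict(new_sentence)            # indicator row per input
predicted_tags = mlb.inverse_transform(pred)  # back to label names
print(mlb.classes_, pred, predicted_tags)
```

With so little training data the predictions themselves mean nothing; the point is only the shape of the pipeline: sentences in, a set of labels per sentence out.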
For more sophisticated methods, there's a lot of research on joint topic-sentiment models - a search for "topic sentiment model" turns up a lot of papers and code. Here's sample training data from a paper introducing a Hidden Topic Sentiment Model that looks right up your alley. Note how in the first sentence with labels there are two topics.
Hope that helps!
The only approach I could think of would consist of a set of steps.
1) Use some library to extract entities from text and their relationships. For example, check this article:
http://www.nltk.org/book/ch07.html
By parsing each text you may figure out which entities you have in each text and which chunks of text are related to the entity.
2) Use NLTK's sentiment extraction to analyze the chunks specifically related to each entity and obtain their sentiment. That gives you the sentiment for each entity.
3) After that you need to come up with a way to map the entities you may encounter in the text to what you call 'topics'. Unfortunately, I don't see a way to automate this, since you clearly do not define topics in the conventional way, through word frequency (as in topic modelling algorithms such as LDA, NMF etc).
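To make the three steps a bit more concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the chunks are taken as already extracted (step 1), the tiny word lists stand in for a real sentiment analyzer (step 2), and `CATEGORY_KEYWORDS` is a hand-built mapping from terms to your categories (step 3), which is exactly the manual part mentioned above.

```python
import re

# Step 3 stand-in: hand-built mapping from terms to pre-defined categories.
CATEGORY_KEYWORDS = {
    "Graphics": {"graphics"},
    "Game Selection": {"games", "titles", "selection"},
    "Apps": {"apps"},
    "Performance": {"performance"},
}

# Step 2 stand-in: a toy lexicon instead of a real sentiment model.
POSITIVE = {"good", "solid", "impressive"}
NEGATIVE = {"worse", "terrible", "convoluted"}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def categories_for(chunk):
    """Return every category whose keywords appear in the chunk."""
    words = set(tokenize(chunk))
    found = [cat for cat, keys in CATEGORY_KEYWORDS.items() if words & keys]
    return found or ["Irrelevant/Other"]

def sentiment_for(chunk):
    """Count positive minus negative words and map the sign to a label."""
    words = tokenize(chunk)
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def analyse(chunks):
    """Pair each category found in a chunk with that chunk's sentiment."""
    return [
        (category, sentiment_for(chunk))
        for chunk in chunks
        for category in categories_for(chunk)
    ]

# Step 1 stand-in: chunks assumed to be already split per entity.
chunks = [
    "The Xbox One has a good selection of games",
    "the performance is worse than the PS4",
]
result = analyse(chunks)
print(result)
```

On the two chunks above this pairs Game Selection with positive and Performance with negative. A real version would swap the lexicon for a proper sentiment model and would need a much richer keyword mapping.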