If you have a corpus of text, how can you identify all the categories (from a list of pre-defined categories) and the associated sentiment (positive/negative writing) with it?
I will be doing this in Python but at this stage I am not necessarily looking for a language specific solution.
Let's look at this question with an example to try and clarify what I am asking.
If I have a whole corpus of reviews for products e.g.:
Microsoft's Xbox One offers impressive graphics and a solid list of exclusive 2015 titles. The Microsoft console currently edges ahead of the PS4 with a better selection of media apps. The console's fall-2015 dashboard update is a noticeable improvement. The console has backward compatibility with around 100 Xbox 360 titles, and that list is poised to grow. The Xbox One's new interface is still more convoluted than the PS4's. In general, the PS4 delivers slightly better installation times, graphics and performance on cross-platform games. The Xbox One also lags behind the PS4 in its selection of indie games. The Kinect's legacy is still a blemish. While the PS4 remains our overall preferred choice in the game console race, the Xbox One's significant course corrections and solid exclusives make it a compelling alternative.
And I have a list of pre-defined categories e.g. :
- Graphics
- Game Play
- Game Selection
- Apps
- Performance
- Irrelevant/Other
I could take my big corpus of reviews and break them down by sentence. For each sentence in my training data I can hand tag them with the appropriate categories. The problem is that there could be various categories in 1 sentence.
If it was 1 category per sentence then any classification algorithm from scikit-learn would do the trick. When working with multi-classes I could use something like multi-label classification.
Adding in the sentiment is the trickier part. Identifying sentiment in a sentence is a fairly simple task but if there is a mix of sentiment on different labels that becomes different.
The example sentence "The Xbox One has a good selection of games but the performance is worse than the PS4". We can identify two of our pre-defined categories (game selection, performance) but we have positive sentiment towards game selection and a negative sentiment towards performance.
What would be a way to identify all categories in text (from our pre-defined list) with their associated sentiment?