Using, amongst other sources, various posts here on Stack Overflow, I'm trying to implement my own PHP classifier to classify tweets into a positive, neutral and negative class. Before coding, I need to get the process straight. My train of thought and an example are as follows:
Bayes theorem: p(class|words) = (p(class) * p(words|class)) / p(words) with
assumption that p(words) is the same for every class leads to calculating
arg max p(class) * p(words|class) with
p(words|class) = p(word1|class) * p(word2|class) * ... and
p(class) = #words in class / #words in total and
p(word|class) = p(word, class) / p(class) = p(word, class) * (1 / p(class))
= (#times word occurs in class / #words in total) * (#words in total / #words in class)
= #times word occurs in class / #words in class
Example:
------+----------------+-----------------+
class | words          | #words in class |
------+----------------+-----------------+
pos   | happy win nice | 3               |
neu   | neutral middle | 2               |
neg   | sad loose bad  | 3               |
------+----------------+-----------------+
p(pos) = 3/8
p(neu) = 2/8
p(neg) = 3/8
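To double-check the counts and priors above, here is a minimal sketch. It is in Python rather than PHP purely to keep it short; the names `training`, `word_counts`, `class_sizes`, `priors` and `likelihood` are made up for illustration, and the prior follows my own definition from above (#words in class / #words in total), which may or may not be the right choice.

```python
from collections import Counter
from fractions import Fraction

# Toy training data from the table above: one tweet per class.
training = {
    "pos": "happy win nice",
    "neu": "neutral middle",
    "neg": "sad loose bad",
}

# word_counts[c][w] = #times word w occurs in class c
word_counts = {c: Counter(text.split()) for c, text in training.items()}

# #words in class and #words in total
class_sizes = {c: sum(counts.values()) for c, counts in word_counts.items()}
total_words = sum(class_sizes.values())

# p(class) = #words in class / #words in total (my assumption, see above)
priors = {c: Fraction(n, total_words) for c, n in class_sizes.items()}


def likelihood(word, c):
    """Unsmoothed p(word|class) = #times word occurs in class / #words in class."""
    return Fraction(word_counts[c][word], class_sizes[c])


for c in training:
    print(c, priors[c], likelihood("sad", c))
```

Without smoothing, p(sad|pos) and p(sad|neu) come out as zero here, which is exactly the case the smoothing is meant to handle.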
Calculate: argmax(sad loose)
p(sad loose|pos) = p(sad|pos) * p(loose|pos) = (0+1)/3 * (0+1)/3 = 1/9
p(sad loose|neu) = p(sad|neu) * p(loose|neu) = (0+1)/3 * (0+1)/3 = 1/9
p(sad loose|neg) = p(sad|neg) * p(loose|neg) = 1/3 * 1/3 = 1/9
p(pos) * p(sad loose|pos) = 3/8 * 1/9 = 0.0416666667
p(neu) * p(sad loose|neu) = 2/8 * 1/9 = 0.0277777778
p(neg) * p(sad loose|neg) = 3/8 * 1/9 = 0.0416666667 <-- should be 100% neg!
As you can see, I have "trained" the classifier with a positive ("happy win nice"), a neutral ("neutral middle") and a negative ("sad loose bad") tweet. To prevent probabilities of zero caused by a word that is missing from a class, I'm using Laplace (or "add one") smoothing, see "(0+1)".
I basically have two questions:
- Is this a correct blueprint for implementation? Is there room for improvement?
- When classifying a tweet ("sad loose"), it is expected to be 100% in class "neg" because it only contains negative words. The Laplace smoothing, however, makes things more complicated: classes pos and neg end up with equal probability. Is there a workaround for this?
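For the second question, this is how I understand the textbook add-one estimate: (#times word occurs in class + 1) / (#words in class + vocabulary size), applied to every class, including the one the word actually occurs in. That differs from my calculation above, where only the zero counts got the "+1". A rough sketch, again in Python rather than PHP for brevity; `log_score` and `classify` are hypothetical names, and summing log-probabilities instead of multiplying is my own addition to avoid underflow on longer tweets:

```python
import math
from collections import Counter

# Same toy training data as in the example table.
training = {
    "pos": "happy win nice",
    "neu": "neutral middle",
    "neg": "sad loose bad",
}

word_counts = {c: Counter(text.split()) for c, text in training.items()}
class_sizes = {c: sum(counts.values()) for c, counts in word_counts.items()}
total_words = sum(class_sizes.values())

# Vocabulary = all distinct words seen in any class (8 words here).
vocabulary = {w for counts in word_counts.values() for w in counts}


def log_score(words, c):
    """log( p(class) * product of add-one smoothed p(word|class) )."""
    score = math.log(class_sizes[c] / total_words)
    for w in words:
        smoothed = (word_counts[c][w] + 1) / (class_sizes[c] + len(vocabulary))
        score += math.log(smoothed)
    return score


def classify(text):
    """Return the class with the highest score (the arg max)."""
    words = text.split()
    return max(training, key=lambda c: log_score(words, c))


print(classify("sad loose"))
```

If I have this right, the toy example then gives 3/8 * 2/11 * 2/11 for neg versus 3/8 * 1/11 * 1/11 for pos, so neg would win, although never with 100% because smoothing leaves every class a non-zero probability.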