I was wondering how one would calculate pointwise mutual information (PMI) for text classification. To be more exact, I want to classify tweets into categories. I have a dataset of annotated tweets, and for each category I have a dictionary of words that belong to that category. Given this information, how can I calculate the PMI for each category per tweet, in order to classify a tweet into one of these categories?
PMI is a measure of association between a feature (in your case a word) and a class (category), not between a document (tweet) and a category. The formula is available on Wikipedia:

pmi(x; y) = log [ p(x, y) / (p(x) p(y)) ]
In that formula, `X` is the random variable that models the occurrence of a word, and `Y` models the occurrence of a class. For a given word `x` and a given class `y`, you can use PMI to decide whether a feature is informative or not, and you can do feature selection on that basis. Having fewer features often improves the performance of your classification algorithm and speeds it up considerably. The classification step, however, is separate: PMI only helps you select better features to feed into your learning algorithm.

Edit: One thing I didn't mention in the original post is that PMI is sensitive to word frequencies. Let's rewrite the formula as

pmi(x; y) = log [ p(x|y) / p(x) ] = log [ p(y|x) / p(y) ]
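As a concrete sketch of this kind of feature scoring, here is a minimal example that estimates PMI for each (word, category) pair from a handful of made-up annotated tweets. The tweet data and category names are hypothetical; probabilities are estimated as simple document-level frequencies (word presence per tweet).

```python
from collections import Counter
from math import log

# Hypothetical annotated tweets: (text, category) pairs.
tweets = [
    ("great goal by the striker", "sports"),
    ("the match ended in a draw", "sports"),
    ("parliament passed the new bill", "politics"),
    ("the senator gave a speech", "politics"),
]

n = len(tweets)
word_count = Counter()   # number of tweets containing word x
class_count = Counter()  # number of tweets in class y
joint_count = Counter()  # number of tweets in class y containing word x

for text, label in tweets:
    class_count[label] += 1
    for w in set(text.split()):  # presence/absence per tweet
        word_count[w] += 1
        joint_count[(w, label)] += 1

def pmi(word, label):
    """pmi(x;y) = log[ p(x,y) / (p(x) p(y)) ], estimated from counts."""
    p_xy = joint_count[(word, label)] / n
    p_x = word_count[word] / n
    p_y = class_count[label] / n
    return log(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

print(pmi("match", "sports"))    # positive: "match" co-occurs with sports
print(pmi("match", "politics"))  # -inf: never co-occurs with politics
```

To select features, you would rank each category's candidate words by this score and keep the top-scoring ones before training a classifier.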
When `x` and `y` are perfectly correlated, `P(x|y) = P(y|x) = 1`, so `pmi(x;y) = log(1/P(x))`. Less frequent `x`-es (words) will have a higher PMI score than frequent `x`-es, even if both are perfectly correlated with `y`.
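The frequency effect can be seen numerically. Assuming two hypothetical words that are both perfectly correlated with a class (so `P(x|y) = 1` and the score reduces to `log(1/P(x))`), the rarer word gets a much higher PMI:

```python
from math import log

def pmi_perfect(p_x):
    # For a word x perfectly correlated with class y, P(x|y) = 1,
    # so pmi(x;y) = log(P(x|y) / P(x)) = log(1 / P(x)).
    return log(1.0 / p_x)

rare_word_prob, frequent_word_prob = 0.01, 0.4  # made-up marginals P(x)

print(pmi_perfect(rare_word_prob))      # log(100) ~ 4.61
print(pmi_perfect(frequent_word_prob))  # log(2.5) ~ 0.92
```

This is why PMI-based feature selection tends to favor rare words; in practice people often counter this with a frequency cutoff or a weighted variant.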