I was wondering how one would calculate pointwise mutual information (PMI) for text classification. To be more exact, I want to classify tweets into categories. I have a dataset of annotated tweets, and for each category I have a dictionary of words that belong to that category. Given this information, how can I calculate the PMI for each category per tweet, in order to classify a tweet into one of these categories?
PMI is a measure of association between a feature (in your case a word) and a class (category), not between a document (tweet) and a category. The formula is available on Wikipedia:

pmi(x; y) = log [ p(x, y) / (p(x) p(y)) ]
In that formula, `X` is the random variable that models the occurrence of a word, and `Y` models the occurrence of a class. For a given word `x` and a given class `y`, you can use PMI to decide whether a feature is informative or not, and you can do feature selection on that basis. Having fewer features often improves the performance of your classification algorithm and speeds it up considerably. The classification step, however, is separate: PMI only helps you select better features to feed into your learning algorithm.

Edit: One thing I didn't mention in the original post is that PMI is sensitive to word frequencies. Let's rewrite the formula as

pmi(x; y) = log [ p(x|y) / p(x) ] = log [ p(y|x) / p(y) ]
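As a concrete sketch of this kind of feature scoring, here is a minimal example that estimates PMI for each (word, category) pair from a handful of made-up annotated tweets. The tweet data and category names are hypothetical; probabilities are estimated as simple document-level frequencies (word presence per tweet).

```python
from collections import Counter
from math import log

# Hypothetical annotated tweets: (text, category) pairs.
tweets = [
    ("great goal by the striker", "sports"),
    ("the match ended in a draw", "sports"),
    ("parliament passed the new bill", "politics"),
    ("the senator gave a speech", "politics"),
]

n = len(tweets)
word_count = Counter()   # number of tweets containing word x
class_count = Counter()  # number of tweets in class y
joint_count = Counter()  # number of tweets in class y containing word x

for text, label in tweets:
    class_count[label] += 1
    for w in set(text.split()):  # presence/absence per tweet
        word_count[w] += 1
        joint_count[(w, label)] += 1

def pmi(word, label):
    """pmi(x;y) = log[ p(x,y) / (p(x) p(y)) ], estimated from counts."""
    p_xy = joint_count[(word, label)] / n
    p_x = word_count[word] / n
    p_y = class_count[label] / n
    return log(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

print(pmi("match", "sports"))    # positive: "match" co-occurs with sports
print(pmi("match", "politics"))  # -inf: never co-occurs with politics
```

To select features, you would rank each category's candidate words by this score and keep the top-scoring ones before training a classifier.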
When `x` and `y` are perfectly correlated, `P(x|y) = P(y|x) = 1`, so `pmi(x;y) = log(1/P(x))`. Less frequent `x`-es (words) will have a higher PMI score than frequent `x`-es, even if both are perfectly correlated with `y`.
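The frequency effect can be seen numerically. Assuming two hypothetical words that are both perfectly correlated with a class (so `P(x|y) = 1` and the score reduces to `log(1/P(x))`), the rarer word gets a much higher PMI:

```python
from math import log

def pmi_perfect(p_x):
    # For a word x perfectly correlated with class y, P(x|y) = 1,
    # so pmi(x;y) = log(P(x|y) / P(x)) = log(1 / P(x)).
    return log(1.0 / p_x)

rare_word_prob, frequent_word_prob = 0.01, 0.4  # made-up marginals P(x)

print(pmi_perfect(rare_word_prob))      # log(100) ~ 4.61
print(pmi_perfect(frequent_word_prob))  # log(2.5) ~ 0.92
```

This is why PMI-based feature selection tends to favor rare words; in practice people often counter this with a frequency cutoff or a weighted variant.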