Why is the value of TF-IDF different from IDF_?

Posted 2020-04-17 03:16

Question:

Why are the values in the vectorized corpus different from the values obtained through the idf_ attribute? Shouldn't the idf_ attribute return the inverse document frequency (IDF) exactly as it appears in the vectorized corpus?

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer()
corpus = vectorizer.fit_transform(corpus)

print(corpus)

Vectorized corpus:

  (0, 2)    0.6300993445179441
  (0, 4)    0.44832087319911734
  (0, 0)    0.44832087319911734
  (0, 3)    0.44832087319911734
  (1, 1)    0.6300993445179441
  (1, 4)    0.44832087319911734
  (1, 0)    0.44832087319911734
  (1, 3)    0.44832087319911734

Vocabulary and idf_ values:

print(dict(zip(vectorizer.vocabulary_, vectorizer.idf_)))

Output:

{'this': 1.0, 
 'is': 1.4054651081081644, 
 'very': 1.4054651081081644, 
 'strange': 1.0, 
 'nice': 1.0}

Vocabulary index:

print(vectorizer.vocabulary_)

Output:

{'this': 3, 
 'is': 0, 
 'very': 4, 
 'strange': 2, 
 'nice': 1}

Why is the value for the word "this" 0.44 in the vectorized corpus but 1.0 when obtained from idf_?

Answer 1:

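This is because of L2 normalization, which TfidfVectorizer() applies by default (norm='l2'): each document's row of tf-idf weights is scaled to unit Euclidean length. If you set the norm parameter to None, you get the unnormalized tf-idf weights. In this corpus they equal the idf_ values, because every term occurs exactly once per document (tf = 1).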


>>> vectorizer = TfidfVectorizer(norm=None)

#output

  (0, 2)    1.4054651081081644
  (0, 4)    1.0
  (0, 0)    1.0
  (0, 3)    1.0
  (1, 1)    1.4054651081081644
  (1, 4)    1.0
  (1, 0)    1.0
  (1, 3)    1.0
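
To make the role of the normalization concrete, here is a minimal sketch that rescales the unnormalized rows by hand and compares them with the default output. The variable names (docs, raw, default, manual) are only illustrative and not part of the original post.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["This is very strange", "This is very nice"]

# Raw tf-idf weights, with no row normalization.
raw = TfidfVectorizer(norm=None).fit_transform(docs).toarray()

# Default settings (norm="l2"): each row is scaled to unit Euclidean length.
default = TfidfVectorizer().fit_transform(docs).toarray()

# Divide every row by its own L2 norm to reproduce the default output.
row_norms = np.linalg.norm(raw, axis=1, keepdims=True)
manual = raw / row_norms

print(np.allclose(manual, default))  # True
print(manual[0])                     # first document, columns in index order
```

For the first document the row norm is sqrt(1.0**2 * 3 + 1.4054651**2), roughly 2.2306, so 1.0 becomes about 0.448 and 1.4054651 becomes about 0.630, which are the values in the question.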

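Also, the way you pair terms with their idf values is wrong. vocabulary_ is a dict that maps each term to its column index, and iterating over it does not yield the terms in index order, whereas idf_ is ordered by column index. Zipping the two therefore assigns idf values to the wrong terms.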

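Use the feature names instead, which are returned in index order (in scikit-learn 1.0 and later the method is get_feature_names_out(); get_feature_names() was removed in 1.2):

 >>> print(dict(zip(vectorizer.get_feature_names(), vectorizer.idf_)))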

     {'is': 1.0,
      'nice': 1.4054651081081644, 
      'strange': 1.4054651081081644, 
      'this': 1.0, 
      'very': 1.0}
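
As an alternative that does not depend on which feature-name method your scikit-learn version provides, the sketch below looks up each idf value through the index stored in vocabulary_. It also checks the result against the smoothed IDF formula that applies under the default smooth_idf=True, ln((1 + n) / (1 + df)) + 1. The names idf_by_term, idf_shared and idf_unique are illustrative only.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["This is very strange", "This is very nice"]
vectorizer = TfidfVectorizer().fit(docs)

# vocabulary_ maps term -> column index, so use that index to look up idf_.
idf_by_term = {term: float(vectorizer.idf_[idx])
               for term, idx in vectorizer.vocabulary_.items()}

# Smoothed IDF (default smooth_idf=True): ln((1 + n) / (1 + df)) + 1
n_docs = len(docs)
idf_shared = math.log((1 + n_docs) / (1 + 2)) + 1  # term found in both documents
idf_unique = math.log((1 + n_docs) / (1 + 1)) + 1  # term found in one document

print(idf_by_term["this"], idf_shared)
print(idf_by_term["strange"], idf_unique)
```

Terms that occur in both documents ("this", "is", "very") get the lower idf, and the terms unique to one document ("strange", "nice") get the higher one, which matches the corrected mapping above.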