How do I properly combine numerical features with

I am writing a classifier for web pages, so I have a mixture of numerical features, and I also want to classify the text. I am using the bag-of-words approach to transform the text into a (large) numerical vector. The code ends up being like this:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np

numerical_features = [
  [1, 0],
  [1, 1],
  [0, 0],
  [0, 1]
]
corpus = [
  'This is the first document.',
  'This is the second second document.',
  'And the third one',
  'Is this the first document?',
]
bag_of_words_vectorizer = CountVectorizer(min_df=1)
X = bag_of_words_vectorizer.fit_transform(corpus)
words_counts = X.toarray()
tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(words_counts)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
bag_of_words_vectorizer.get_feature_names_out()
combinedFeatures = np.hstack([numerical_features, tfidf.toarray()])

This works, but I'm concerned about the accuracy. Notice that there are 4 objects and only two numerical features. Even the simplest text results in a vector with nine features (because there are nine distinct words in the corpus). Obviously, with real text there will be hundreds or thousands of distinct words, so the final feature vector would have fewer than 10 numerical features but more than 1000 word-based ones.

Because of this, won't the classifier (SVM) be heavily weighting the words over the numerical features by a factor of 100 to 1? If so, how can I compensate to make sure the bag of words is weighted equally against the numerical features?

Tags: python scikit-learn classification text-classification

1 Answer

时光不老，我们不散

#2 · 2019-03-20 16:23

You can weight the counts using tf–idf:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

np.set_printoptions(linewidth=200)

corpus = [
  'This is the first document.',
  'This is the second second document.',
  'And the third one',
  'Is this the first document?',
]

vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = vectorizer.get_feature_names_out().tolist()
print(words)
words_counts = X.toarray()
print(words_counts)

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(words_counts)
print(tfidf.toarray())

The output is this:

# words
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

# words_counts
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]

# tfidf transformation
[[ 0.          0.43877674  0.54197657  0.43877674  0.          0.          0.35872874  0.          0.43877674]
 [ 0.          0.27230147  0.          0.27230147  0.          0.85322574  0.22262429  0.          0.27230147]
 [ 0.55280532  0.          0.          0.          0.55280532  0.          0.28847675  0.55280532  0.        ]
 [ 0.          0.43877674  0.54197657  0.43877674  0.          0.          0.35872874  0.          0.43877674]]
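
One property of the matrix above is relevant to the weighting concern in the question: TfidfTransformer applies norm='l2' by default, so every row has Euclidean length 1 no matter how many distinct words the corpus contains. A quick check, as a minimal sketch that uses TfidfVectorizer (which combines the counting and tf–idf steps) instead of the two separate objects above:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]

# Default settings include norm='l2', i.e. each row is scaled to unit length.
tfidf = TfidfVectorizer().fit_transform(corpus).toarray()

# Euclidean length of each document's tf-idf vector.
row_norms = np.linalg.norm(tfidf, axis=1)
print(row_norms)
```

Because of this, the text block as a whole has a fixed magnitude per document, which is on the same order as a handful of 0/1 features, even when it spans thousands of columns.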

With this representation you should be able to merge further binary features to train an SVC.
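
The merge itself might look like the following. This is only a sketch under stated assumptions: the labels `y` and the `text_weight` factor are invented for illustration and do not come from the question. It keeps the tf–idf matrix sparse and places it next to the numerical columns with scipy.sparse.hstack; multiplying one block by a constant is a simple knob for how much each group of features contributes.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

numerical_features = np.array([[1, 0], [1, 1], [0, 0], [0, 1]], dtype=float)
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]
y = np.array([0, 1, 0, 1])  # hypothetical labels, for illustration only

# Sparse tf-idf matrix with one row per document.
tfidf = TfidfVectorizer().fit_transform(corpus)

# Hypothetical knob: values below 1 shrink the text block relative to the
# numerical columns, values above 1 enlarge it.
text_weight = 1.0

# Stack the two blocks column-wise without densifying the tf-idf matrix.
combined = hstack([csr_matrix(numerical_features), text_weight * tfidf]).tocsr()

# SVC accepts sparse input directly.
clf = SVC(kernel='linear').fit(combined, y)
print(combined.shape)
```

In practice a value for `text_weight` would be chosen by cross-validation rather than fixed by hand, and numerical features that are not already in a 0 to 1 range would usually be scaled first.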
