I am trying to create a classifier to categorize websites. This is my very first time doing this, so it's all quite new to me. Currently I am applying Bag of Words to a few parts of the web page (e.g. title, text, headings). It looks like this:
from sklearn.feature_extraction.text import CountVectorizer
countvect_text = CountVectorizer(encoding="cp1252", stop_words="english")
countvect_title = CountVectorizer(encoding="cp1252", stop_words="english")
countvect_headings = CountVectorizer(encoding="cp1252", stop_words="english")
X_tr_text_counts = countvect_text.fit_transform(tr_data['text'])
X_tr_title_counts = countvect_title.fit_transform(tr_data['title'])
X_tr_headings_counts = countvect_headings.fit_transform(tr_data['headings'])
from sklearn.feature_extraction.text import TfidfTransformer
transformer_text = TfidfTransformer(use_idf=True)
transformer_title = TfidfTransformer(use_idf=True)
transformer_headings = TfidfTransformer(use_idf=True)
X_tr_text_tfidf = transformer_text.fit_transform(X_tr_text_counts)
X_tr_title_tfidf = transformer_title.fit_transform(X_tr_title_counts)
X_tr_headings_tfidf = transformer_headings.fit_transform(X_tr_headings_counts)
from sklearn.naive_bayes import MultinomialNB
text_nb = MultinomialNB().fit(X_tr_text_tfidf, tr_data['class'])
title_nb = MultinomialNB().fit(X_tr_title_tfidf, tr_data['class'])
headings_nb = MultinomialNB().fit(X_tr_headings_tfidf, tr_data['class'])
X_te_text_counts = countvect_text.transform(te_data['text'])
X_te_title_counts = countvect_title.transform(te_data['title'])
X_te_headings_counts = countvect_headings.transform(te_data['headings'])
X_te_text_tfidf = transformer_text.transform(X_te_text_counts)
X_te_title_tfidf = transformer_title.transform(X_te_title_counts)
X_te_headings_tfidf = transformer_headings.transform(X_te_headings_counts)
accuracy_text = text_nb.score(X_te_text_tfidf, te_data['class'])
accuracy_title = title_nb.score(X_te_title_tfidf, te_data['class'])
accuracy_headings = headings_nb.score(X_te_headings_tfidf, te_data['class'])
This works fine, and I get the accuracies as expected. However, as you might have guessed, this looks cluttered and is filled with duplication. My question then is, is there a way to write this more concisely?
Additionally, I am not sure how I can combine these three features into a single multinomial classifier. I tried passing a list of tfidf values to MultinomialNB().fit(), but apparently that's not allowed.
Optionally, it would also be nice to add weights to the features, so that in the final classifier some vectors have a higher importance than others.
I'm guessing I need a Pipeline, but I'm not at all sure how I should use it in this case.
First, CountVectorizer and TfidfTransformer can be removed by using TfidfVectorizer (which is essentially a combination of both).
Second, the TfidfVectorizer and MultinomialNB can be combined in a Pipeline.
A Pipeline sequentially applies a list of transforms followed by a final estimator. When fit() is called on a Pipeline, it fits all the transforms one after the other, transforming the data at each step, and then fits the final estimator on the transformed data. When score() or predict() is called, it only calls transform() on all the transformers and score() or predict() on the last step.
So the code will look like:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('vectorizer', TfidfVectorizer(encoding="cp1252",
                                                    stop_words="english",
                                                    use_idf=True)),
                     ('nb', MultinomialNB())])

accuracy = {}
for item in ['text', 'title', 'headings']:
    # No need to save the return of fit(), it returns self
    pipeline.fit(tr_data[item], tr_data['class'])

    # Apply the transforms, then score with the final estimator
    accuracy[item] = pipeline.score(te_data[item], te_data['class'])
EDIT:
Edited to include combining all of the features to get a single accuracy.
To combine the features, we can follow multiple approaches. One that is easy to understand (though it drifts back toward the cluttered side) is the following:
# Using scipy to concatenate, because TfidfVectorizer returns sparse matrices
from scipy.sparse import hstack

def get_tfidf(tr_data, te_data, columns):
    train = None
    test = None

    tfidfVectorizer = TfidfVectorizer(encoding="cp1252",
                                      stop_words="english",
                                      use_idf=True)
    for item in columns:
        # Fit on the training column, then transform the matching test column
        # before the vectorizer is re-fitted on the next column
        temp_train = tfidfVectorizer.fit_transform(tr_data[item])
        train = hstack((train, temp_train)) if train is not None else temp_train

        temp_test = tfidfVectorizer.transform(te_data[item])
        test = hstack((test, temp_test)) if test is not None else temp_test

    return train, test

train_tfidf, test_tfidf = get_tfidf(tr_data, te_data, ['text', 'title', 'headings'])

nb = MultinomialNB()
nb.fit(train_tfidf, tr_data['class'])
nb.score(test_tfidf, te_data['class'])
A second (and preferable) approach is to put all of this inside a pipeline. But because we need to select the different columns ('text', 'title', 'headings') and concatenate the results, it's not that straightforward. We need to use FeatureUnion for that, and specifically the following example:
- http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py
Third, if you are open to using other libraries, then DataFrameMapper from sklearn-pandas can simplify the use of the FeatureUnion shown in the previous example.
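For illustration, here is a rough, untested sketch of how the sklearn-pandas route might look, assuming tr_data and te_data are pandas DataFrames with the same columns as above (TfidfVectorizer and MultinomialNB are imported earlier):
from sklearn_pandas import DataFrameMapper

# Passing each column name as a plain string hands the raw text of that
# column to its own TfidfVectorizer
mapper = DataFrameMapper([
    ('text', TfidfVectorizer(encoding="cp1252", stop_words="english", use_idf=True)),
    ('title', TfidfVectorizer(encoding="cp1252", stop_words="english", use_idf=True)),
    ('headings', TfidfVectorizer(encoding="cp1252", stop_words="english", use_idf=True)),
], sparse=True)  # sparse=True (in recent sklearn-pandas versions) keeps the tf-idf matrices sparse

mapper_pipe = Pipeline([('mapper', mapper), ('nb', MultinomialNB())])
mapper_pipe.fit(tr_data, tr_data['class'])
mapper_pipe.score(te_data, te_data['class'])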
If you do want to go the second or third way, please feel free to reach out if you run into any difficulties.
NOTE: I have not checked the code, but it should work (barring some syntax errors, if any). I will check it as soon as I am at my PC.
The snippet below is a possible way to simplify your code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

cv = CountVectorizer(encoding="cp1252", stop_words="english")
tt = TfidfTransformer(use_idf=True)
mnb = MultinomialNB()

accuracy = {}
for item in ['text', 'title', 'headings']:
    # Fit and score each field separately; the vectorizer, transformer and
    # classifier are simply re-fitted on every iteration
    X_tr_counts = cv.fit_transform(tr_data[item])
    X_tr_tfidf = tt.fit_transform(X_tr_counts)
    mnb.fit(X_tr_tfidf, tr_data['class'])

    X_te_counts = cv.transform(te_data[item])
    X_te_tfidf = tt.transform(X_te_counts)
    accuracy[item] = mnb.score(X_te_tfidf, te_data['class'])
The classification success rates are stored in the dictionary accuracy with keys 'text', 'title', and 'headings'.
EDIT
A more elegant solution - not necessarily a simpler one, though - would consist of using Pipeline and FeatureUnion, as pointed out by @Vivek Kumar. This approach would also allow you to combine all the features into a single model and apply weighting factors to the features extracted from the different items of your dataset.
First we import the necessary modules.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
Then we define a transformer class (as suggested in this example) to select the different items of your dataset:
class ItemSelector(BaseEstimator, TransformerMixin):
    """Selects a single column (e.g. 'text') from the dataset."""
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        # Nothing to fit; the selector is stateless
        return self

    def transform(self, data_dict):
        return data_dict[self.key]
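As a quick sanity check (assuming tr_data is a pandas DataFrame or a dict of sequences, as in the question), the selector simply hands back the requested column, which the downstream vectorizer then consumes; the variable name below is just for illustration:
# Hypothetical quick check: returns the 'title' column untouched
titles = ItemSelector(key='title').fit_transform(tr_data)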
We are now ready to define the pipeline:
pipeline = Pipeline([
    ('features', FeatureUnion(
        transformer_list=[
            ('text_feats', Pipeline([
                ('text_selector', ItemSelector(key='text')),
                ('text_vectorizer', TfidfVectorizer(encoding="cp1252",
                                                    stop_words="english",
                                                    use_idf=True))
            ])),
            ('title_feats', Pipeline([
                ('title_selector', ItemSelector(key='title')),
                ('title_vectorizer', TfidfVectorizer(encoding="cp1252",
                                                     stop_words="english",
                                                     use_idf=True))
            ])),
            ('headings_feats', Pipeline([
                ('headings_selector', ItemSelector(key='headings')),
                ('headings_vectorizer', TfidfVectorizer(encoding="cp1252",
                                                        stop_words="english",
                                                        use_idf=True))
            ])),
        ],
        # The keys must match the names in transformer_list above;
        # change the weights as appropriate
        transformer_weights={'text_feats': 0.5,
                             'title_feats': 0.3,
                             'headings_feats': 0.2}
    )),
    ('classifier', MultinomialNB())
])
And finally, we can classify data in a straightforward manner:
pipeline.fit(tr_data, tr_data['class'])
pipeline.score(te_data, te_data['class'])
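If you would rather tune those weights than pick them by hand, one option (a sketch of my own, not part of the answer above; the candidate weightings are arbitrary) is to treat transformer_weights as a hyperparameter and search over a few candidate dictionaries with GridSearchCV:
from sklearn.model_selection import GridSearchCV

# Candidate weightings; each dict is passed straight through to the FeatureUnion
param_grid = {
    'features__transformer_weights': [
        {'text_feats': 0.5, 'title_feats': 0.3, 'headings_feats': 0.2},
        {'text_feats': 0.6, 'title_feats': 0.2, 'headings_feats': 0.2},
        {'text_feats': 1.0, 'title_feats': 1.0, 'headings_feats': 1.0},
    ]
}

grid = GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(tr_data, tr_data['class'])
print(grid.best_params_)
print(grid.score(te_data, te_data['class']))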