I have a bunch of files containing articles. Each article also has some features, like text_length and text_spam (all ints or floats, and in most cases they should be loaded from CSV). What I want to do is combine these features with the output of CountVectorizer and then classify those texts.
I have watched some tutorials, but I still have no idea how to implement this. I found something here, but couldn't actually adapt it to my needs.
Any ideas how this could be done with scikit-learn?
Thank you.
What I have come up with so far is:
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion
measurements = [
{'text_length': 1000, 'text_spam': 4.3},
{'text_length': 2000, 'text_spam': 4.1},
]
corpus = [
'some text',
'some text 2 hooray',
]
vectorizer = DictVectorizer()
count_vectorizer = CountVectorizer(min_df=1)
first_x = vectorizer.fit_transform(measurements)
second_x = count_vectorizer.fit_transform(corpus)
combined_features = FeatureUnion([('first', first_x), ('second', second_x)])
With this code, I don't understand how to load "real" data, since the training data here is hard-coded. And second: how do I load the categories (the y parameter for the fit function)?
You're misunderstanding FeatureUnion. It's supposed to take two transformers, not two batches of samples. You can force it into dealing with the vectorizers you have, but it's much easier to just throw all your features into one big bag per sample and use a single DictVectorizer to make vectors out of those bags.
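Here is a minimal sketch of that bag-per-sample idea, using the toy data from your question. Folding the text in as raw token counts via Counter is just one possible choice, not the only way to do it:

from collections import Counter
from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'text_length': 1000, 'text_spam': 4.3},
    {'text_length': 2000, 'text_spam': 4.1},
]
corpus = [
    'some text',
    'some text 2 hooray',
]

bags = []
for feats, doc in zip(measurements, corpus):
    bag = dict(feats)                 # start from the numeric features
    bag.update(Counter(doc.split()))  # fold in raw token counts
    bags.append(bag)

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(bags)    # one sparse matrix holding all features

If a token could collide with a numeric feature name, prefix the tokens (e.g. 'w=' + token) before updating the bag.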
Don't forget to normalize this with sklearn.preprocessing.Normalizer, and be aware that even after normalization, those text_length features are bound to dominate the other features in terms of scale. It might be wiser to use 1. / text_length or np.log(text_length) instead.
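Continuing the sketch above, that advice might look like this (np.log is just one of the two suggested squashings; 1. / text_length works the same way):

import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import Normalizer

for bag in bags:
    bag['text_length'] = np.log(bag['text_length'])  # tame the raw length first

X = DictVectorizer().fit_transform(bags)
X = Normalizer().fit_transform(X)  # then scale each sample row to unit norm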
As for loading the "real" data and the categories: that depends on how your data is organized. scikit-learn has a lot of helper functions and classes, but it does expect you to write code if your setup is non-standard.
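For example, if each article's text, numeric features and category sat in one CSV row, the loading could look roughly like this. The file name articles.csv and the column names (text, text_length, text_spam, category) are placeholders for whatever your files actually contain:

import csv
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

bags, y = [], []
with open('articles.csv', newline='') as f:
    for row in csv.DictReader(f):
        y.append(row.pop('category'))                # the y for fit(X, y)
        text = row.pop('text')
        bag = {k: float(v) for k, v in row.items()}  # remaining numeric columns
        bag.update(Counter(text.split()))            # plus token counts
        bags.append(bag)

X = DictVectorizer().fit_transform(bags)
clf = LogisticRegression().fit(X, y)                 # any classifier works here

If your features and texts live in separate files instead, read them separately and zip them together by article, as in the first sketch.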