I have a bunch of files containing articles. Each article comes with some features, like text length and text_spam (all ints or floats, and in most cases they should be loaded from CSV). What I want to do is combine these features with the output of CountVectorizer and then classify those texts.
I have watched some tutorials, but I still have no idea how to implement this. I found something here, but can't actually adapt it to my needs.
Any ideas how that could be done with scikit?
Thank you.
What I have come up with so far is:
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion
measurements = [
    {'text_length': 1000, 'text_spam': 4.3},
    {'text_length': 2000, 'text_spam': 4.1},
]
corpus = [
    'some text',
    'some text 2 hooray',
]
vectorizer = DictVectorizer()
count_vectorizer = CountVectorizer(min_df=1)
first_x = vectorizer.fit_transform(measurements)
second_x = count_vectorizer.fit_transform(corpus)
# FeatureUnion combines transformers, not already-transformed matrices;
# to join the two matrices column-wise, stack them instead:
from scipy.sparse import hstack
combined_features = hstack([first_x, second_x])
Given this code, I do not understand how to load "real" data, since the training sets here are hard-coded. And second: how do I load the categories (the y parameter for the fit function)?
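For what it's worth, here is a minimal sketch of how the loading step might look. The column names (text, text_length, text_spam, label), the use of LogisticRegression, and the in-memory CSV are all assumptions for illustration, not part of the question; the label column is what becomes y.

```python
import csv
import io

from scipy.sparse import hstack
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# io.StringIO stands in for a real file, e.g. open('features.csv', newline='')
sample = io.StringIO(
    "text,text_length,text_spam,label\n"
    "some text,1000,4.3,ham\n"
    "some text 2 hooray,2000,4.1,spam\n"
)

texts, measurements, y = [], [], []
for row in csv.DictReader(sample):
    texts.append(row['text'])
    measurements.append({'text_length': float(row['text_length']),
                         'text_spam': float(row['text_spam'])})
    y.append(row['label'])  # the category column goes into y

# vectorize each kind of feature separately, then stack the matrices side by side
X = hstack([DictVectorizer().fit_transform(measurements),
            CountVectorizer().fit_transform(texts)])

clf = LogisticRegression()
clf.fit(X, y)
```

With real data you would replace the StringIO object with an open file handle; everything after csv.DictReader stays the same.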