Concatenate custom features with CountVectorizer

Posted 2020-05-26 10:26

Question:

I have a bunch of files with articles. Each article comes with some extra features, such as text length and a text_spam score (all ints or floats; in most cases they should be loaded from CSV). What I want to do is combine these features with the CountVectorizer output and then classify those texts.

I have watched some tutorials, but I still have no idea how to implement this. I found something here, but can't actually adapt it to my needs.

Any ideas how this could be done with scikit-learn?

Thank you.

What I have come up with so far is:

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

measurements = [
    {'text_length': 1000, 'text_spam': 4.3},
    {'text_length': 2000, 'text_spam': 4.1},
]

corpus = [
    'some text',
    'some text 2 hooray',
]

vectorizer = DictVectorizer()
count_vectorizer = CountVectorizer(min_df=1)

first_x = vectorizer.fit_transform(measurements)
second_x = count_vectorizer.fit_transform(corpus)

combined_features = FeatureUnion([('first', first_x), ('second', second_x)])

With this bunch of code I don't understand how to load the "real" data, since the training sets above are hard-coded. And second: how do I load the categories (the y parameter for the fit function)?

Answer 1:

You're misunderstanding FeatureUnion. It's supposed to take two transformers, not two batches of samples.

You can force it into dealing with the vectorizers you have, but it's much easier to just throw all your features into one big bag per sample and use a single DictVectorizer to make vectors out of those bags.
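For reference, forcing it would mean handing FeatureUnion (name, transformer) pairs and letting each transformer pick its own field out of the samples. A rough sketch; the 'text'/'meta' sample layout and the selector lambdas are illustrative assumptions, not anything scikit-learn prescribes:

from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# each sample bundles its raw text with its numeric measurements
samples = [
    {'text': 'some text', 'meta': {'text_length': 1000, 'text_spam': 4.3}},
    {'text': 'some text 2 hooray', 'meta': {'text_length': 2000, 'text_spam': 4.1}},
]

combined = FeatureUnion([
    ('counts', Pipeline([  # CountVectorizer on the text field
        ('pick', FunctionTransformer(lambda X: [s['text'] for s in X], validate=False)),
        ('vect', CountVectorizer(min_df=1)),
    ])),
    ('extras', Pipeline([  # DictVectorizer on the measurement dicts
        ('pick', FunctionTransformer(lambda X: [s['meta'] for s in X], validate=False)),
        ('vect', DictVectorizer()),
    ])),
])

X = combined.fit_transform(samples)  # term counts and measurements stacked side by side

But the one-bag-per-sample route below is shorter and easier to reason about: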

# make a CountVectorizer-style tokenizer
tokenize = CountVectorizer().build_tokenizer()

def features(document):
    terms = tokenize(document)
    # one bag per sample: the hand-crafted features and the term counts together
    # (whatever_this_means stands in for however you compute the spam score)
    d = {'text_length': len(terms), 'text_spam': whatever_this_means}
    for t in terms:
        d[t] = d.get(t, 0) + 1
    return d

vect = DictVectorizer()
# documents is your iterable of raw article texts
X_train = vect.fit_transform(features(d) for d in documents)

Don't forget to normalize this with sklearn.preprocessing.Normalizer, and be aware that even after normalization, those text_length features are bound to dominate the other features in terms of scale. It might be wiser to use 1. / text_length or np.log(text_length) instead.
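A sketch of that normalization step, continuing from the X_train above:

from sklearn.preprocessing import Normalizer

# scale each sample (row) to unit L2 norm; this works on sparse matrices
X_train = Normalizer().fit_transform(X_train)

# and/or dampen the length feature inside features(), as suggested above:
#     d['text_length'] = np.log(len(terms))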

And second: how do I load the categories (the y parameter for the fit function)?

Depends on how your data is organized. scikit-learn has a lot of helper functions and classes, but it does expect you to write code if your setup is non-standard.
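For example, if the labels live in a CSV with one row per article, something like this would do. The articles.csv name and the 'label' column are hypothetical, and LogisticRegression is just a stand-in for whichever classifier you pick:

import csv
from sklearn.linear_model import LogisticRegression

# hypothetical articles.csv with one row per article and a 'label' column
with open('articles.csv', newline='') as f:
    y_train = [row['label'] for row in csv.DictReader(f)]

# y_train must line up row-for-row with X_train
clf = LogisticRegression().fit(X_train, y_train)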