Sklearn: Text and Numeric features with ColumnTran

Posted 2019-07-30 13:55

Question:

I'm trying to build a pipeline with scikit-learn 0.20.2 using the new ColumnTransformer feature. My problem is that when I fit my classifier with clf.fit(x_train, y_train), I keep getting this error:

ValueError: all the input array dimensions except for the concatenation axis must match exactly

I have one column, called text, that holds blocks of text. All of my other columns are numeric. I'm trying to use CountVectorizer in my pipeline, and I think that's where the trouble is. I would much appreciate a hand with this.

After I run the pipeline and check x_train/y_train, the data looks like this, in case it helps (the row numbers that normally show in the left column are omitted, and the text column runs taller than is shown in the image).


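from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline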

# mapped to column names from dataframe
numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

# mapped to column names from dataframe
text_features = ['text']
text_transformer = Pipeline(steps=[
    ('vect', CountVectorizer())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('text', text_transformer, text_features),
    ]
)

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', MultinomialNB())
                     ])

x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.33)
clf.fit(x_train, y_train)

Answer 1:

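I'd suggest not wrapping everything in a Pipeline while you are trying to understand or debug the code, because running each transformer on its own lets you inspect what it returns. The issue is with your text_transformer. The output of numeric_transformer is as expected: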

import pandas as pd
from sklearn.impute import SimpleImputer

# example
df = pd.DataFrame([['(0,17569)\t1\n(0,8779)\t0\n', 1, 13, 1, 0],
                   ['(0,16118)\t1\n(0,9480)\t1\n', 1, None, 0, 1],
                   ['(0,123)\t1\n(0,456)\t1\n', 1, 15, 0, 0]],
                  columns=('text', 'hasDate', 'iterationCount', 'hasItemNumber', 'isEpic'))

numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']
numeric_transformer = SimpleImputer(strategy='median')

num = numeric_transformer.fit_transform(df[numeric_features])

print(num)

#[[ 1. 13.  1.  0.]
# [ 1. 14.  0.  1.]
# [ 1. 15.  0.  0.]]

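But text_transformer gives you an array of shape (1, 1). CountVectorizer expects a 1-D iterable of strings. When it receives the 2-D selection df[text_features], it iterates over the DataFrame, which yields the column names, so it sees a single document: the string 'text'. That one row cannot be stacked next to the three rows of numeric output. So you need to decide how you want to transform your text column: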

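from sklearn.feature_extraction.text import CountVectorizer

text_features = ['text']
text_transformer = CountVectorizer()

# df[text_features] is a 2-D DataFrame; iterating over it yields column names
text = text_transformer.fit_transform(df[text_features])

# get_feature_names() matches scikit-learn 0.20; releases from 1.0 on provide get_feature_names_out()
print(text_transformer.get_feature_names())
print(text.toarray())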

#['text']
#[[1]]
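One possible way out, sketched here rather than taken from the original answer, is to hand CountVectorizer a 1-D column. Selecting df['text'] gives a Series with one document per row, and inside ColumnTransformer the same effect comes from specifying the column as the scalar string 'text' instead of the list ['text']. The toy rows and labels below are made up for illustration, and the Pipeline wrappers around the single-step transformers are dropped for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data; the real frame comes from the question.
df = pd.DataFrame({
    'text': ['fix login bug', 'add epic for login page', 'update docs'],
    'hasDate': [1, 1, 0],
    'iterationCount': [13, None, 15],
    'hasItemNumber': [1, 0, 0],
    'isEpic': [0, 1, 0],
})
labels = [0, 1, 0]  # hypothetical labels

numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']

# A 1-D selection (a Series) gives CountVectorizer one document per row.
text_counts = CountVectorizer().fit_transform(df['text'])
print(text_counts.shape)  # rows now match the number of rows in df

preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='median'), numeric_features),
    # The scalar name 'text' passes a 1-D column to the vectorizer;
    # the list ['text'] would pass a 2-D DataFrame instead.
    ('text', CountVectorizer(), 'text'),
])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', MultinomialNB())])
clf.fit(df, labels)

# Numeric and text features are stacked side by side, one row per sample.
features = clf.named_steps['preprocessor'].transform(df)
print(features.shape)
```

With the row counts of both transformer outputs equal, the concatenation inside ColumnTransformer has matching dimensions and the fit can proceed. Whether raw token counts are the right representation for your text is a separate modelling decision.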