Sklearn: Text and Numeric features with ColumnTran

I'm trying to build a pipeline with scikit-learn 0.20.2 using the new ColumnTransformer feature. My problem is that when I fit my classifier with clf.fit(x_train, y_train), I keep getting this error:

ValueError: all the input array dimensions except for the concatenation axis must match exactly

I have a column of blocks of text called text. All of my other columns are numeric. I'm trying to use CountVectorizer in my pipeline, and I think that's where the trouble is. I would much appreciate a hand with this.

After I run the pipeline and check my x_train/y_train, it looks like this, if that helps (omitting the row numbers that normally show in the left column; the text column runs taller than is shown in the image).

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
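from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split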

# mapped to column names from dataframe
numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

# mapped to column names from dataframe
text_features = ['text']
text_transformer = Pipeline(steps=[
    ('vect', CountVectorizer())
])

preprocessor = ColumnTransformer(
    transformers=[('num', numeric_transformer, numeric_features),('text', text_transformer, text_features)]
)

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', MultinomialNB())
                     ])

x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.33)
clf.fit(x_train,y_train)

Tags: python machine-learning scikit-learn

1 Answer

乱世女痞

#2 · 2019-07-30 13:41

I'd suggest not using Pipeline while you are still trying to understand or debug the code, since running each transformer on its own makes the problem easier to see. The issue is with your text_transformer. The output of numeric_transformer is as expected:

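import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import CountVectorizer

# example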
df = pd.DataFrame([['(0,17569)\t1\n(0,8779)\t0\n', 1, 13, 1, 0],
                   ['(0,16118)\t1\n(0,9480)\t1\n', 1, None, 0, 1],
                   ['(0,123)\t1\n(0,456)\t1\n', 1, 15, 0, 0]],
                  columns=('text', 'hasDate', 'iterationCount', 'hasItemNumber', 'isEpic'))

numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']
numeric_transformer = SimpleImputer(strategy='median')

num = numeric_transformer.fit_transform(df[numeric_features])

print(num)

#[[ 1. 13.  1.  0.]
# [ 1. 14.  0.  1.]
# [ 1. 15.  0.  0.]]

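But text_transformer gives you an array of shape (1, 1). CountVectorizer expects an iterable of strings, and iterating over the DataFrame df[text_features] yields its column names rather than its rows, so the vectorizer sees a single document, 'text'. So you need to decide how you want to transform your text column: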

text_features = ['text']
text_transformer = CountVectorizer()

text = text_transformer.fit_transform(df[text_features])

# get_feature_names() matches scikit-learn 0.20.2; newer releases use get_feature_names_out()
print(text_transformer.get_feature_names())
print(text.toarray())

#['text']
#[[1]]
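
One possible way out, offered as a sketch rather than a definitive fix: ColumnTransformer passes a 1-D column to the transformer when the column is given as a plain string, and a 2-D selection when it is given as a list. Passing 'text' instead of ['text'] therefore lets CountVectorizer receive one string per row. The toy rows and labels below are made up for illustration; only the column names come from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data; only the column names are taken from the question.
df = pd.DataFrame({
    'text': ['fix login bug', 'add date picker', 'fix date bug', 'add epic item'],
    'hasDate': [0, 1, 1, 0],
    'iterationCount': [13, None, 15, 14],
    'hasItemNumber': [1, 0, 0, 1],
    'isEpic': [0, 0, 0, 1],
})
labels = [0, 1, 0, 1]  # made-up class labels

numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']

# Iterating over a DataFrame yields its column names, not its rows,
# which is why CountVectorizer saw a single "document" in the example above.
docs_seen = list(df[['text']])
print(docs_seen)  # ['text']

preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='median'), numeric_features),
    # A plain string (not a list) selects the column as a 1-D Series of strings.
    ('text', CountVectorizer(), 'text'),
])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', MultinomialNB())])
clf.fit(df, labels)

# 4 numeric columns plus 8 distinct tokens from the toy text column.
features = preprocessor.fit_transform(df)
print(features.shape)  # (4, 12)
```

With the string selector, both transformers return one row per sample, so the horizontal concatenation inside ColumnTransformer no longer fails on mismatched dimensions.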
