I'm trying to use SKLearn 0.20.2 to make a pipeline while using the new ColumnTransformer feature. My problem is that when I run my classifier: clf.fit(x_train, y_train)
I keep getting the error:
ValueError: all the input array dimensions except for the concatenation axis must match exactly
I have a column of blocks of text called text. All of my other columns are numerical in nature. I'm trying to use CountVectorizer in my pipeline, and I think that's where the trouble is. I would much appreciate a hand with this.
After I run the pipeline and I check my x_train/y_train it looks like this if helpful (omitting the row numbers that normally show in the left column, and the text column runs taller than is shown in the image).
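from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline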
# mapped to column names from dataframe
numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])
# mapped to column names from dataframe
text_features = ['text']
text_transformer = Pipeline(steps=[
    ('vect', CountVectorizer())
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('text', text_transformer, text_features),
    ]
)
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', MultinomialNB())
])
x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.33)
clf.fit(x_train,y_train)
I suppose you shouldn't use Pipeline if you need to understand or debug the code. The issue is with your text_transformer. The output of numeric_transformer is as expected, but text_transformer gives you an array of shape (1, 1). So you need to figure out how you want to transform your text column:
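One likely cause of the (1, 1) shape, offered here as a sketch and not necessarily the fix originally intended: when the column selector is a list such as ['text'], ColumnTransformer hands the transformer a 2D DataFrame, and CountVectorizer iterates over it, which yields the column names instead of the rows. Passing the column name as a plain string gives it a 1D Series with one document per row. The toy DataFrame and labels below are hypothetical stand-ins for the real data; only the column names come from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data; the real dataframe is not shown in the question.
features = pd.DataFrame({
    'hasDate': [1, 0, 1, 0],
    'iterationCount': [3, 1, 2, 5],
    'hasItemNumber': [0, 1, 1, 0],
    'isEpic': [0, 0, 1, 1],
    'text': ['fix login bug', 'add epic for billing',
             'update login page', 'billing epic review'],
})
labels = [0, 1, 0, 1]

# List selector -> 2D DataFrame. Iterating a DataFrame yields its column
# names, so the vectorizer sees a single "document": the string 'text'.
bad_shape = CountVectorizer().fit_transform(features[['text']]).shape

# String selector -> 1D Series. The vectorizer sees one document per row.
good_shape = CountVectorizer().fit_transform(features['text']).shape

numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']
preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='median'), numeric_features),
    ('text', CountVectorizer(), 'text'),  # a string, not ['text']
])
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', MultinomialNB()),
])
clf.fit(features, labels)

print(bad_shape, good_shape)
```

With the string selector, the text block has as many rows as the numeric block, so the two can be concatenated and the fit goes through.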