I have constructed a pipeline that takes a pandas dataframe split into categorical and numerical columns. I am trying to run GridSearchCV on this pipeline and ultimately look at the ranked feature importances for the best-performing model GridSearchCV selects. The problem I am encountering is that sklearn pipelines output numpy arrays and lose all column information along the way, so when I go to examine the most important coefficients of the model I am left with an unlabeled numpy array.
I have read that building a custom transformer might be a possible solution, but I have no experience doing so myself. I have also looked into leveraging the sklearn-pandas package, but I am hesitant to adopt something that might not be kept up to date with sklearn. Can anyone suggest the best way to get around this issue? I am also open to any literature with hands-on applications of pandas and sklearn pipelines.
My Pipeline:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# impute and standardize numeric data
numeric_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="mean")),
    ('scale', StandardScaler())
])

# impute and encode dummy variables for categorical data
categorical_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ('one_hot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

# numeric_features / categorical_features are the lists of column names
# for each type
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

clf = Pipeline([
    ('transform', preprocessor),
    ('ridge', Ridge())
])
Cross Validation:
kf = KFold(n_splits=4, shuffle=True, random_state=44)
cross_val_score(clf, X_train, y_train, cv=kf).mean()
Grid Search:
param_grid = {
    'ridge__alpha': [.001, .1, 1.0, 5, 10, 100]
}
gs = GridSearchCV(clf, param_grid, cv=kf)
gs.fit(X_train, y_train)
Examining Coefficients:
model = gs.best_estimator_
# note: GridSearchCV already refits the best estimator on the full training
# set (refit=True by default), so the extra fit here is redundant but harmless
predictions = model.fit(X_train, y_train).predict(X_test)
model.named_steps['ridge'].coef_
Here is the output of the model coefficients as it currently stands, run on the seaborn "mpg" dataset:
array([-4.64782052e-01, 1.47805207e+00, -3.28948689e-01, -5.37033173e+00,
2.80000700e-01, 2.71523808e+00, 6.29170887e-01, 9.51627968e-01,
...
-1.50574860e+00, 1.88477450e+00, 4.57285471e+00, -6.90459868e-01,
5.49416409e+00])
Ideally I would like to preserve the pandas dataframe information and retrieve the derived column names after OneHotEncoder and the other methods are called.
I would actually go for creating the column names from the input. If your input is already divided into numerical and categorical columns, you can use pd.get_dummies to get the distinct categories of each categorical feature, and from those build the proper names for the derived columns. The sketch below applies this to the pipeline from the question.
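A minimal sketch of the idea. It assumes gs has been fit as in the question and that numeric_features and categorical_features are the same column lists passed to the ColumnTransformer; the ranking by absolute coefficient size is an added convenience, not part of the original approach. It relies on pd.get_dummies sorting the categories of each column the same way OneHotEncoder does by default, so the names line up with the coefficient vector (numeric block first, then the one-hot block, matching the ColumnTransformer order).
import pandas as pd

# numeric columns keep their names; each categorical column expands into one
# dummy column per category, named "<column>_<category>". pd.get_dummies
# skips NaN by default and the pipeline imputes with most_frequent, so both
# see the same category sets
dummy_names = pd.get_dummies(X_train[categorical_features]).columns
feature_names = list(numeric_features) + list(dummy_names)

ridge = gs.best_estimator_.named_steps['ridge']

# pair each derived feature name with its fitted coefficient and rank by
# absolute size
coef_map = pd.DataFrame({
    'feature': feature_names,
    'coefficient': ridge.coef_,
})
coef_map = coef_map.reindex(
    coef_map['coefficient'].abs().sort_values(ascending=False).index
)
print(coef_map)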
The last lines of the sketch print a dataframe that maps each derived feature name to its coefficient, which gives the ranked feature importances the question asks for.