scikit-learn LogisticRegressionCV: best coefficien

Posted 2019-07-31 18:19

Question:

I am trying to understand how the best coefficients are calculated in a logistic regression cross-validation, where the "refit" parameter is True. If I understand the docs correctly, the best coefficients are the result of first determining the best regularization parameter "C", i.e., the value of C that has the highest average score over all folds. Then, the best coefficients are simply the coefficients that were calculated on the fold that has the highest score for the best C. I assume that if the maximum score is achieved by several folds, the coefficients of these folds would be averaged to give the best coefficients (I didn't see anything on how this case is handled in the docs).
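To make the procedure I have in mind concrete, here is a minimal sketch on made-up numbers. The arrays `scores` and `coef_paths` are toy data invented for illustration, not output from scikit-learn, and the three steps encode my reading of the docs, which may well be wrong.

```python
import numpy as np

# Toy data (invented for illustration): 3 folds, 2 candidate C values, 2 features.
scores = np.array([[0.80, 0.90],
                   [0.70, 0.95],
                   [0.75, 0.95]])                   # shape (n_folds, n_Cs)
coef_paths = np.array([[[0.1, 0.2], [1.0, 2.0]],
                       [[0.3, 0.4], [3.0, 4.0]],
                       [[0.5, 0.6], [5.0, 6.0]]])   # shape (n_folds, n_Cs, n_features)

# Step 1: the best C is the one with the highest mean score across folds.
mean_scores = scores.mean(axis=0)
best_C_idx = int(np.argmax(mean_scores))

# Step 2: find every fold that attains the maximum score for that C.
best_C_scores = scores[:, best_C_idx]
best_folds = np.flatnonzero(best_C_scores == best_C_scores.max())

# Step 3 (my assumption): average the coefficients of those folds.
# Indexing with an array keeps shape (n_best_folds, n_features),
# so the mean over axis 0 also works when only one fold is selected.
hypothesized_coefs = coef_paths[best_folds, best_C_idx, :].mean(axis=0)

print(best_C_idx, best_folds, hypothesized_coefs)
```

Here the second C wins on mean score, folds 1 and 2 tie for the top score at that C, and the hypothesized coefficients are the mean of their two coefficient vectors.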

To test my understanding, I determined the best coefficients in two different ways:

  1. directly from the coef_ attribute of the fitted model, and
  2. from the coefs_paths_ attribute, which contains the path of the coefficients obtained during cross-validation, for each fold and each value of C.

The results I get from 1. and 2. are similar but not identical, so I was hoping someone could point out what I am doing wrong here. Thanks!

An example to demonstrate the issue:

from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Set parameters
n_folds = 10
C_values = [0.001, 0.01, 0.05, 0.1, 1., 100.]

# Load and preprocess data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_train)

# Fit model
clf = LogisticRegressionCV(Cs=C_values, cv=n_folds, penalty='l1', 
                           refit=True, scoring='roc_auc', 
                           solver='liblinear', random_state=0,
                           fit_intercept=False)
clf.fit(X_train_scaled, y_train)

########################
# Get and plot coefficients using method 1
########################
coefs1 = clf.coef_
coefs1_series = pd.Series(coefs1.ravel(), index=cancer['feature_names'])
coefs1_series.sort_values().plot(kind="barh")

########################
# Get and plot coefficients using method 2
########################
# mean of scores of class "1"
scores = clf.scores_[1]
mean_scores = np.mean(scores, axis=0)
# Get index of the C that has the highest average score across all folds
best_C_idx = np.where(mean_scores==np.max(mean_scores))[0][0]
# Get index (here: indices) of the folds with highest scores for the 
# best C
best_folds_idx = np.where(scores[:, best_C_idx]==np.max(scores[:, best_C_idx]))[0]

paths = clf.coefs_paths_[1]  # has shape (n_folds, len(C_values), n_features)
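# Index with the array of fold indices so the result keeps shape
# (n_best_folds, n_features), even when only one fold attains the maximum
coefs2 = paths[best_folds_idx, best_C_idx, :]
coefs2 = np.mean(coefs2, axis=0)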
coefs2_series = pd.Series(coefs2.ravel(), index=cancer['feature_names'])
coefs2_series.sort_values().plot(kind="barh")
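
One further check that might narrow this down: the documentation for the refit parameter describes a final refit with the selected parameters when refit=True, and an average across folds otherwise. If that means the final coefficients come from a fit on the whole training set at the best C, then coef_ should be close to a plain LogisticRegression fitted with C taken from the C_ attribute. The sketch below tests that hypothesis; it is an assumption to verify, not a confirmed explanation. The names cv_clf, plain_clf and max_abs_diff are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same data preparation and CV settings as in the example above
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_train)

cv_clf = LogisticRegressionCV(Cs=[0.001, 0.01, 0.05, 0.1, 1., 100.], cv=10,
                              penalty='l1', refit=True, scoring='roc_auc',
                              solver='liblinear', random_state=0,
                              fit_intercept=False)
cv_clf.fit(X_train_scaled, y_train)

# C selected by cross-validation (np.ravel handles scalar or array C_)
best_C = float(np.ravel(cv_clf.C_)[0])

# Hypothesis: refit=True corresponds to one fit on all training data at best_C
plain_clf = LogisticRegression(C=best_C, penalty='l1', solver='liblinear',
                               random_state=0, fit_intercept=False)
plain_clf.fit(X_train_scaled, y_train)

# Largest absolute difference between the two coefficient vectors
max_abs_diff = np.max(np.abs(cv_clf.coef_ - plain_clf.coef_))
print("best C:", best_C)
print("max |coef_ - refit coef|:", max_abs_diff)
```

A difference near zero would suggest that coef_ is not taken from any single fold in coefs_paths_, which would explain why methods 1 and 2 disagree.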