Making a custom scorer with GridSearchCV

Posted 2019-08-01 02:36

Question:

I'm trying to use a custom scorer I defined, "custom_loss_five", with GridSearchCV to tune hyperparameters. The example code and some sample data are below. I'm getting the error 'numpy.dtype' object has no attribute 'base_dtype'. I think this is because I'm mixing Keras code with sklearn. I also use the same "custom_loss_five" function to train a neural network, which is why it is written with Keras. If anyone could point out the issue and show me how to adapt the function for use with GridSearchCV, I would appreciate it.

sample data:

print(x_train_scld[:5])

[[ 0.37773519  2.0109691   0.49644224  0.21679945  0.538941    1.99144889
   2.15011467  1.20312084  0.86114816  0.79507318 -0.45602028  0.07146743
  -0.19524294 -0.33405545 -0.60264522  1.26724727  1.44991588  0.74630967
   0.16529837  0.89613455  0.3253014   2.19166429  0.64865429  0.12894674
   0.46995314  3.41479052  4.44308499  1.83182458  1.54348561  2.50155582]
 [ 0.32029317  0.1214269   0.28824456  0.13510828 -0.0851059  -0.0057386
  -0.31671716  0.0303454   0.32754165 -0.15354084 -0.36310852 -0.34419771
  -0.28347519 -0.28927174 -0.39507256 -0.2039463  -0.49919802  0.12281647
  -0.56756272 -0.30637335  0.10701249  0.21461633  0.17531634 -0.04414507
   0.19574444  0.36354262 -1.23318869  0.59029124  0.28936372  0.19248437]
 [ 0.25843254  0.29037034  0.21339798  0.12738073  0.28185716 -0.47995085
  -0.13321816  0.14228058 -3.69915162 -0.10246162  0.26193423  0.12807553
   0.18956053  0.12487671 -0.28174435 -0.71770499 -0.34455425  0.00729992
  -0.70102685 -0.57022389  0.59171701  0.77319193  0.52065985 -1.37655715
   0.59387438 -1.52826854  0.18054306  0.76212977  0.3639211   0.08726502]
 [-0.70482588 -0.32963569 -0.74849491 -0.86505667  0.10026287 -0.87877366
  -1.06584707 -1.19559926  0.34039964  0.10112554 -0.62427503 -0.3134676
  -0.65996358 -0.52932857  0.11989554 -0.95345177 -0.67459484 -0.82130922
  -0.52228025 -0.38191412 -0.75239269 -0.31180246 -0.7418967  -0.7432583
   0.12191902 -0.97620932 -1.02049823 -1.20098216 -0.02333216 -0.24853266]
 [-0.36680171 -0.14757043 -0.41413663 -0.56754624 -0.34512544 -0.76162172
  -0.72684687 -0.61557149  0.31896966 -0.25351016 -0.6357623   0.12484078
  -0.71632135 -0.51097128  0.26933611 -0.53549047 -0.54070413 -0.36472263
  -0.24581883 -0.67901706 -0.44128802  0.16221265 -0.42239358 -0.52459003
   0.34339528 -0.43064345 -1.23318869 -0.23310168  0.44404246 -0.40964978]]


print(x_test_scld[:5])

[[ 2.60641850e-01 -7.18369636e-01  3.27138629e-01 -1.76172773e+00
   4.67645320e-01  1.53766591e+00  7.62837058e-01  4.07109050e-01
   7.71142242e-01  9.80417766e-01  5.10262027e-01  5.66383900e-01
   9.28678845e-01  2.06576727e-01  9.68389151e-01  1.48288576e+00
   7.53349504e-01  7.04842193e-01  7.80186706e-01  6.43850055e-01
   1.43107505e-01 -7.20312971e-01  2.96065817e-01 -4.51322867e-02
   1.93107816e-01  7.41280492e-01  3.28514299e-01  4.47039330e-02
   1.39136160e-01  4.94989991e-01]
 [-7.51730115e-02  4.92568820e-02 -7.29146850e-02 -2.86318841e-01
   1.00026599e+00  4.43886212e-01  4.80336890e-01  6.71683119e-01
   8.61148159e-01  5.21434522e-01 -3.65135682e-01 -4.32021118e-01
  -4.10049198e-01 -3.01778906e-01 -4.27568719e-02 -1.34413479e+00
  -4.09570872e-02  1.64283954e-01 -3.04209384e-01 -7.10176931e-03
   7.32148655e-03 -2.90459367e+00  2.31719950e-02 -1.37655715e+00
   1.44286672e+00  1.07281572e+00  1.19548020e+00  1.44805187e+00
   1.33316704e+00  1.55622575e+00]
 [-1.23777794e-01 -3.83763205e-01 -1.65737513e-01 -3.43999436e-01
   3.58604868e-01 -3.45623859e-01 -2.89602186e-01 -3.38277511e-01
   8.23494778e-03  2.97415674e-01 -6.27653637e-01 -6.42441486e-01
  -7.17707195e-01 -4.34516210e-01  6.01100047e-01 -2.64325075e-01
  -2.31751338e-01  4.13624916e-02  7.46820672e-01  3.84336779e-01
  -3.24408912e-01 -5.30945125e-01 -3.14685046e-01 -4.13363730e-01
   6.43970206e-01 -2.37091815e-01 -1.45963962e-01 -2.97594271e-02
   7.54512744e-01  6.49530907e-01]
 [ 1.06041146e+00  3.61350612e-02  9.93240469e-01  1.11126264e+00
  -2.54537983e-01 -2.50709092e-01 -3.56042668e-02 -1.19559926e+00
  -2.25351836e-01 -4.65124054e-01 -4.64466800e-01 -1.10808348e+00
  -4.47005113e-01 -2.07571731e-01 -1.11908130e+00 -8.49190558e-01
  -5.40704133e-01 -6.40037086e-01 -1.10737748e+00 -9.30940117e-01
   9.76730527e-01  2.34863210e-01  9.02228200e-01  9.43399666e-01
  -1.25487123e-02 -1.70804996e-03  4.83277659e-01  7.07714236e-01
   5.60886115e-01 -4.38009686e-01]
 [ 3.57851416e-01  1.87811066e+00  2.77785646e-01  2.23975029e-01
  -3.66933526e-01 -9.49100986e-01 -4.74866806e-01 -4.98802740e-01
   2.69680706e-01 -5.60715159e-01  2.46392629e-01  7.53999293e-01
   1.19344293e-01  1.24473258e-01  4.50284535e-02 -5.74844494e-01
  -1.80203418e-01 -2.89340672e-01  1.37362545e+00 -6.91305992e-01
   2.80612333e-01  1.49136056e+00  1.99466234e-01  1.55930637e-01
  -2.39298218e-01 -9.12274848e-01 -4.82659170e-01 -6.00406523e-01
   5.90931626e-01 -7.55722792e-01]]


print(y_train[:5])

562    1
291    0
16     1
546    0
293    0
Name: diagnosis, dtype: int64


print(y_test[:5])

421    0
47     1
292    0
186    1
414    1
Name: diagnosis, dtype: int64

Code:

# custom loss function

# importing libraries

import io
import os
import time
import pandas as pd
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense
import keras.backend as K
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_curve, roc_auc_score, precision_recall_fscore_support, accuracy_score
import matplotlib.pyplot as plt
from IPython.core.display import display, HTML

# from sklearn.grid_search import GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier


# custom loss function

def custom_loss_wrapper(fn_cost=1, fp_cost=1):

    def custom_loss(y_true, y_pred, fn_cost=fn_cost, fp_cost=fp_cost):

        h = K.ones_like(y_pred)
        fn_value = fn_cost * h
        fp_value = fp_cost * h

        weighted_values = y_true * K.abs(1-y_pred)*fn_value + (1-y_true) * K.abs(y_pred)*fp_value

        loss = K.mean(weighted_values)
        return loss

    return custom_loss


custom_loss_five = custom_loss_wrapper(fn_cost=5, fp_cost=1)


# TODO: Initialize the classifier
clf = AdaBoostClassifier(random_state=0)

# TODO: Create the parameters list you wish to tune
parameters = {'n_estimators':[100,200,300],'learning_rate':[1.0,2.0,4.0]}

# TODO: Make an fbeta_score scoring object
# scorer = make_scorer(fbeta_score, beta=0.5)
scorer2 = make_scorer(custom_loss_five)


# TODO: Perform grid search on the classifier using 'scorer' as the scoring method
grid_obj2 = GridSearchCV(clf,parameters,scoring=scorer2)

# TODO: Fit the grid search object to the training data and find the optimal parameters
grid_fit2 = grid_obj2.fit(x_train_scld,y_train)

# Get the estimator
best_clf2 = grid_fit2.best_estimator_

# Make predictions using the unoptimized and optimized models
predictions = (clf.fit(x_train_scld, y_train)).predict(x_test_scld)
best_predictions = best_clf2.predict(x_test_scld)

# Report the before-and-after scores
print("Unoptimized model\n------")
print("Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions)))
# print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta = 0.5)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta = 0.5)))
print("\nOptimized Model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
# print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5)))
print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5)))

error:

/Users/sshields/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
  warnings.warn(CV_WARNING, FutureWarning)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-34-b87eab01e7ec> in <module>()
     24 
     25 # TODO: Fit the grid search object to the training data and find the optimal parameters
---> 26 grid_fit2 = grid_obj2.fit(x_train_scld,y_train)
     27 
     28 # Get the estimator

~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    720                 return results_container[0]
    721 
--> 722             self._run_search(evaluate_candidates)
    723 
    724         results = results_container[0]

~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_search.py in _run_search(self, evaluate_candidates)
   1189     def _run_search(self, evaluate_candidates):
   1190         """Search all candidates in param_grid"""
-> 1191         evaluate_candidates(ParameterGrid(self.param_grid))
   1192 
   1193 

~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_search.py in evaluate_candidates(candidate_params)
    709                                for parameters, (train, test)
    710                                in product(candidate_params,
--> 711                                           cv.split(X, y, groups)))
    712 
    713                 all_candidate_params.extend(candidate_params)

~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    915             # remaining jobs.
    916             self._iterating = False
--> 917             if self.dispatch_one_batch(iterator):
    918                 self._iterating = self._original_iterator is not None
    919 

~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    757                 return False
    758             else:
--> 759                 self._dispatch(tasks)
    760                 return True
    761 

~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    714         with self._lock:
    715             job_idx = len(self._jobs)
--> 716             job = self._backend.apply_async(batch, callback=cb)
    717             # A job can complete so quickly than its callback is
    718             # called before we get here, causing self._jobs to

~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
    180     def apply_async(self, func, callback=None):
    181         """Schedule a func to be run"""
--> 182         result = ImmediateResult(func)
    183         if callback:
    184             callback(result)

~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
    547         # Don't delay the application, to avoid keeping the input
    548         # arguments in memory
--> 549         self.results = batch()
    550 
    551     def get(self):

~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
    566         fit_time = time.time() - start_time
    567         # _score will return dict if is_multimetric is True
--> 568         test_scores = _score(estimator, X_test, y_test, scorer, is_multimetric)
    569         score_time = time.time() - start_time - fit_time
    570         if return_train_score:

~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _score(estimator, X_test, y_test, scorer, is_multimetric)
    603     """
    604     if is_multimetric:
--> 605         return _multimetric_score(estimator, X_test, y_test, scorer)
    606     else:
    607         if y_test is None:

~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _multimetric_score(estimator, X_test, y_test, scorers)
    633             score = scorer(estimator, X_test)
    634         else:
--> 635             score = scorer(estimator, X_test, y_test)
    636 
    637         if hasattr(score, 'item'):

~/anaconda2/envs/py36/lib/python3.6/site-packages/sklearn/metrics/scorer.py in __call__(self, estimator, X, y_true, sample_weight)
     96         else:
     97             return self._sign * self._score_func(y_true, y_pred,
---> 98                                                  **self._kwargs)
     99 
    100 

<ipython-input-4-afa574df52f0> in custom_loss(y_true, y_pred, fn_cost, fp_cost)
     11         weighted_values = y_true * K.abs(1-y_pred)*fn_value + (1-y_true) * K.abs(y_pred)*fp_value
     12 
---> 13         loss = K.mean(weighted_values)
     14         return loss
     15 

~/anaconda2/envs/py36/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py in mean(x, axis, keepdims)
   1377         A tensor with the mean of elements of `x`.
   1378     """
-> 1379     if x.dtype.base_dtype == tf.bool:
   1380         x = tf.cast(x, floatx())
   1381     return tf.reduce_mean(x, axis, keepdims)

AttributeError: 'numpy.dtype' object has no attribute 'base_dtype'

Answer 1:

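The custom scoring function does not have to be a Keras function. The scorer built by make_scorer calls your function with y_true and the output of estimator.predict(X), which are plain NumPy arrays (or a pandas Series), while K.mean expects a backend tensor whose dtype has a base_dtype attribute. That mismatch is what raises the AttributeError, so write the scoring function with NumPy instead.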

Here is a working example.

from sklearn import svm, datasets
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}

def custom_loss(y_true, y_pred):

    fn_cost, fp_cost = 5, 1
    h = np.ones(len(y_pred))
    fn_value = fn_cost * h
    fp_value = fp_cost * h

    weighted_values = y_true * np.abs(1-y_pred)*fn_value + (1-y_true) * np.abs(y_pred)*fp_value

    loss = np.mean(weighted_values)
    return loss


svc = svm.SVC()
# greater_is_better=False because custom_loss is a loss: lower values are better
clf = GridSearchCV(svc, parameters, cv=5, scoring=make_scorer(custom_loss, greater_is_better=False))
clf.fit(iris.data, iris.target)
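
To get closer to the setup in the question, one option is to keep the wrapper shape of custom_loss_wrapper but write the body with NumPy. The following is only a minimal sketch under some assumptions: np_loss_wrapper and np_loss_five are hypothetical names, load_breast_cancer stands in for the question's binary data (the original arrays are not available here), and the parameter grid is smaller than the one in the question to keep the run short. The scorer is built with greater_is_better=False so that GridSearchCV prefers the parameters with the lowest loss.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler


def np_loss_wrapper(fn_cost=1, fp_cost=1):
    """Build a NumPy version of the asymmetric loss (hypothetical helper)."""

    def np_loss(y_true, y_pred):
        # Accept NumPy arrays or pandas Series and work in floats.
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        # False negatives are weighted by fn_cost, false positives by fp_cost.
        weighted = (y_true * np.abs(1 - y_pred) * fn_cost
                    + (1 - y_true) * np.abs(y_pred) * fp_cost)
        return float(np.mean(weighted))

    return np_loss


np_loss_five = np_loss_wrapper(fn_cost=5, fp_cost=1)

# Stand-in binary data (assumption: not the question's own dataset).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
x_train_scld = scaler.transform(X_train)
x_test_scld = scaler.transform(X_test)

# The scorer returns the negated loss, so a higher score means a lower loss.
scorer = make_scorer(np_loss_five, greater_is_better=False)

# Illustrative grid, smaller than the one in the question.
param_grid = {"n_estimators": [25, 50], "learning_rate": [0.5, 1.0]}
grid = GridSearchCV(AdaBoostClassifier(random_state=0), param_grid,
                    scoring=scorer, cv=3)
grid.fit(x_train_scld, y_train)

best_clf = grid.best_estimator_
test_loss = np_loss_five(y_test, best_clf.predict(x_test_scld))
print(grid.best_params_, test_loss)
```

Because the scorer negates the loss, grid.best_score_ is zero or negative; take -grid.best_score_ to read it as the mean cross-validated loss. The Keras version of the function can still be used separately for training the neural network.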