SKLearn cross-validation: How to pass info on fold

Posted 2019-07-12 12:10

Question:

I am trying to craft a custom scorer function for cross-validating my (binary classification) model in scikit-learn (Python).

Some examples of my raw test data:

Source   Feature1   Feature2   Feature3
 123        0.1        0.2        0.3
 123        0.4        0.5        0.6
 456        0.7        0.8        0.9

Assuming that any fold might contain multiple test examples that come from the same source...

Then, for the set of examples with the same source, I want my custom scorer to "decide" that the "winner" is the example to which the model assigned the highest probability. In other words, there can be only one correct prediction for each source, but if my model claims that more than one evaluation example is "correct" (label=1), I want my scorer to match only the example with the highest probability against the truth.

My problem is that the scorer function requires the signature:

score_func(y_true, y_pred, **kwargs)

where y_true and y_pred contain the probability/label only.

However, what I really need is:

score_func(y_true_with_source, y_pred_with_source, **kwargs)

so I can group the y_pred_with_source examples by their source and choose the winner to match against that of the y_true_with_source truth. Then I can carry on to calculate my precision, for example.

Is there a way I can pass in this information in some way? Maybe the examples' indices?

Answer 1:

It sounds like you have a learning-to-rank problem here. You are trying to find the highest-ranked instance out of each group of instances. Learning-to-rank isn't directly supported in scikit-learn right now - scikit-learn pretty much assumes i.i.d. instances - so you'll have to do some extra work.

My first suggestion is to drop down a level in the API and use the cross-validation iterators directly. These just generate indices for the training and validation folds. You would subset your data with those indices, call fit and predict (or predict_proba, since you need probabilities) on the subsets with the Source column removed, and then score the predictions using the Source column.
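A rough sketch of that manual loop, in case it helps. The names winner_precision and cross_validate_by_source are made up for illustration, the data is synthetic, and the metric is only my reading of the question (one "winner" per source, counted as a hit when its true label is 1). It also assumes the classifier has predict_proba and that both classes show up in every training fold.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold


def winner_precision(y_true, proba, sources):
    """Fraction of sources whose highest-probability example is truly positive."""
    y_true, proba, sources = map(np.asarray, (y_true, proba, sources))
    hits = []
    for src in np.unique(sources):
        mask = sources == src
        winner = np.argmax(proba[mask])  # position within this source's group
        hits.append(y_true[mask][winner] == 1)
    return float(np.mean(hits))


def cross_validate_by_source(estimator, X, y, sources, n_splits=3, seed=0):
    """Manual CV loop: the model never sees Source, the metric does."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in cv.split(X):
        model = clone(estimator).fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[test_idx])[:, 1]  # P(label == 1)
        scores.append(winner_precision(y[test_idx], proba, sources[test_idx]))
    return np.array(scores)


# Synthetic data (an assumption, not the asker's data):
# 20 sources, 3 examples each, exactly one positive per source.
rng = np.random.default_rng(0)
sources = np.repeat(np.arange(20), 3)
y = np.tile([1, 0, 0], 20)
X = rng.normal(size=(60, 3)) + y[:, None]  # shift positives so they are learnable

scores = cross_validate_by_source(LogisticRegression(), X, y, sources)
print(scores)
```

Note that plain KFold ignores Source, so a test fold can contain a source without its positive example, and that source then always counts as a miss. If that is not what you want, GroupKFold (passing the sources as groups) keeps every source inside a single fold.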

You can probably hack it into the cross_val_score approach, but it's trickier. In scikit-learn there is a distinction between the score function, which is what you showed above, and the scoring object (which can be a function) taken by cross_val_score. The scoring object is a callable object or function with the signature scorer(estimator, X, y). It looks to me like you can define a scoring object that works for your metric. You just have to remove the Source column before sending data to the estimator, and then use that column when computing your metric. If you go this route, I think you will have to wrap the classifier, too, so that its fit method skips the Source column.
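Something along these lines might work for the cross_val_score route. Treat it as a sketch under assumptions: Source is numeric and stored as column 0 of X, DropSourceClassifier and source_winner_scorer are hypothetical names, and the metric is again my guess at what the question wants (the highest-probability example per source is the winner, and it counts as a hit if its true label is 1).

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


class DropSourceClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: column 0 of X is Source and is hidden from the model."""

    def __init__(self, base=None):
        self.base = base

    def fit(self, X, y):
        self.base_ = clone(self.base).fit(X[:, 1:], y)
        self.classes_ = self.base_.classes_
        return self

    def predict(self, X):
        return self.base_.predict(X[:, 1:])

    def predict_proba(self, X):
        return self.base_.predict_proba(X[:, 1:])


def source_winner_scorer(estimator, X, y):
    """Scoring object with the scorer(estimator, X, y) signature."""
    sources = X[:, 0]                         # read Source back out of X
    proba = estimator.predict_proba(X)[:, 1]  # wrapper drops Source itself
    y = np.asarray(y)
    hits = []
    for src in np.unique(sources):
        mask = sources == src
        hits.append(y[mask][np.argmax(proba[mask])] == 1)
    return float(np.mean(hits))


# Synthetic data (an assumption): 20 sources, one positive among 3 examples each.
rng = np.random.default_rng(0)
sources = np.repeat(np.arange(20), 3)
y = np.tile([1, 0, 0], 20)
features = rng.normal(size=(60, 3)) + y[:, None]
X = np.column_stack([sources, features])      # Source travels along as column 0

clf = DropSourceClassifier(LogisticRegression())
scores = cross_val_score(clf, X, y, scoring=source_winner_scorer, cv=3)
print(scores)
```

The wrapper is what lets Source ride along inside X through the fold splitting: cross_val_score hands the same X rows to both fit and the scorer, so the scorer can recover the source of each test example without any extra arguments.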

Hope that helps... Good luck!