Getting a score of zero using cross_val_score

Published 2019-08-31 07:07

Question:

I am trying to use cross_val_score on my dataset, but I keep getting zeros as the score:

This is my code:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

df = pd.read_csv("Flaveria.csv")
df = pd.get_dummies(df, columns=["N level", "species"], drop_first=True)

# Extracting the target value from the dataset
X = df.iloc[:, df.columns != "Plant Weight(g)"]
y = np.array(df.iloc[:, 0], dtype="S6")  # casts the float target to byte strings

logreg = LogisticRegression()
loo = LeaveOneOut()

scores = cross_val_score(logreg, X, y, cv=loo)
print(scores)

The features are categorical values, while the target value is a float value. I am not exactly sure why I am ONLY getting zeros.

The data looks like this before creating dummy variables

N level,species,Plant Weight(g)
L,brownii,0.3008
L,brownii,0.3288
M,brownii,0.3304
M,brownii,0.388
M,brownii,0.406
H,brownii,0.3955
H,brownii,0.3797
H,brownii,0.2962
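For reference, here is a minimal sketch of what get_dummies does to these rows (using only the few rows shown above, so the full CSV's other species are absent). Note that the non-encoded "Plant Weight(g)" column stays first, which is why df.iloc[:, 0] selects the target, and with a single species value drop_first=True removes the species dummy entirely:

```python
from io import StringIO

import pandas as pd

# A few rows of the sample shown above, inlined so the sketch is self-contained
csv = """N level,species,Plant Weight(g)
L,brownii,0.3008
M,brownii,0.3304
H,brownii,0.3955
"""

df = pd.read_csv(StringIO(csv))
df = pd.get_dummies(df, columns=["N level", "species"], drop_first=True)

# Dummy columns are appended after the untouched columns; the first
# category ("H" for N level, "brownii" for species) is dropped.
print(df.columns.tolist())  # ['Plant Weight(g)', 'N level_L', 'N level_M']
```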

Updated code where I am still getting zeros:

from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

import numpy as np
import pandas as pd

# Creating dummies for the non numerical features in the dataset

df = pd.read_csv("Flaveria.csv")
df = pd.get_dummies(df, columns=["N level", "species"], drop_first=True)

# Extracting the target value from the dataset
X = df.iloc[:, df.columns != "Plant Weight(g)"]
y = df.iloc[:, 0]

forest = RandomForestRegressor()
loo = LeaveOneOut()

scores = cross_val_score(forest, X, y, cv=loo)
print(scores)

Answer 1:

In general, cross_val_score splits the data into train and test folds with the given CV iterator, fits the model on the train fold, and scores it on the test fold. For regressors, r2_score is the default scorer in scikit-learn.

You have specified LeaveOneOut() as your cv iterator, so each test fold contains a single sample. In that case, R2 will always come out as 0.

Looking at the formula for R2 on Wikipedia:

R2 = 1 - (SS_res / SS_tot)

where

SS_tot = sum((y - y_mean)^2)

For a single test case, y_mean is equal to that one y value, so SS_tot, the denominator, is 0. The whole R2 is therefore undefined (NaN). Instead of returning NaN, scikit-learn sets the value to 0.
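You can verify the zero denominator directly with NumPy (the single-value fold below is a hypothetical example of what LeaveOneOut hands to the scorer):

```python
import numpy as np

# Hypothetical single-observation test fold, as produced by LeaveOneOut
y_true = np.array([0.3008])
y_pred = np.array([0.35])

ss_res = np.sum((y_true - y_pred) ** 2)
# With one sample, y_mean equals the sample itself, so every term is zero
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

print(ss_tot)  # 0.0 -> R2 = 1 - ss_res/ss_tot is undefined
```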

Changing LeaveOneOut() to any other CV iterator, such as KFold, will give you non-zero results, as you have already observed.
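As a sanity check, here is a minimal sketch of that fix on synthetic data (make_regression stands in for the Flaveria CSV, which is not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the Flaveria data
X, y = make_regression(n_samples=80, n_features=4, noise=0.1, random_state=0)

forest = RandomForestRegressor(random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Each fold now has 16 test samples, so the default R2 scorer is well-defined
scores = cross_val_score(forest, X, y, cv=kf)
print(scores)  # one R2 value per fold
```

If you do want to keep leave-one-out, an error-based metric such as scoring="neg_mean_squared_error" stays well-defined on single-sample folds.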