Cross-validation for Sklearn 0.20+?

2019-01-29 01:48发布

问题:

I am trying to do cross validation and I am running into an error that says: 'Found input variables with inconsistent numbers of samples: [18, 1]'

I am using different columns in a pandas data frame (df) as the features, with the last column as the label. This is derived from the machine learning repository for UC Irvine. When importing the cross-validation package that I have used in the past, I am getting an error that it may have depreciated. I am going to be running a decision tree, SVM, and K-NN.

My code is as such:

feature = [df['age'], df['job'], df['marital'], df['education'], df['default'], df['housing'], df['loan'], df['contact'],
       df['month'], df['day_of_week'], df['campaign'], df['pdays'], df['previous'], df['emp.var.rate'], df['cons.price.idx'],
       df['cons.conf.idx'], df['euribor3m'], df['nr.employed']]
label = [df['y']]

from sklearn.cross_validation import train_test_split
from sklearn.model_selection import cross_val_score
# Model Training 
x = feature[:]
y = label
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

Any help would be great!

回答1:

cross_validation module is deprecated. The new module model_selection has taken its place. So everything you did with cross_validation. is now available in model_selection. Then your above code becomes:

feature = [df['age'], df['job'], df['marital'], df['education'], df['default'], df['housing'], df['loan'], df['contact'],
       df['month'], df['day_of_week'], df['campaign'], df['pdays'], df['previous'], df['emp.var.rate'], df['cons.price.idx'],
       df['cons.conf.idx'], df['euribor3m'], df['nr.employed']]
label = [df['y']]

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

Now as far as declaring the X and y is concerned, why are you wrapping them in a list. Just use them like this:

feature = df[['age', 'job', 'marital', 'education', 'default', 'housing', 
              'loan', 'contact', 'month', 'day_of_week', 'campaign', 
              'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 
              'cons.conf.idx', 'euribor3m', 'nr.employed']]
label = df['y']

And then you can simply use your code, without changing anything.

# Model Training 
x = feature[:]
y = label
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

And for your last question about folds in cross-validation, there are multiple classes in sklearn which does this (depending upon task). Please have a look at:

  • http://scikit-learn.org/stable/modules/classes.html#splitter-classes

Which contains fold iterators. And remember, all this is present in model_selection package.



回答2:

The items in your feature list are pandas Series. You don't need to list out each feature in a list like you have done; you just need to pass them all as a single "table".

For example, this looks like the bank dataset so:

df = pd.read_csv('bank.csv', sep=';')
#df.shape
#(4521, 17)
#df.columns
#Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
#       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
#       'previous', 'poutcome', 'y'],
#      dtype='object')

x = df.iloc[:, :-1]
y = df.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

Should work. The only thing to notice here is that xis a DataFrame with 16 columns but its underlying data is a numpy ndarray - not a list of Series but a single "matrix".