I'm trying to run cross validation on my data set. The data appears to be clean, but then when I try to run it, some of my data gets replaced by NaNs. I'm not sure why. Has anybody seen this before?
y, X = np.ravel(df_test['labels']), df_test[['variation', 'length', 'tempo']]
X_train, X_test, y_train, y_test = cv.train_test_split(X,y,test_size=.30, random_state=4444)
This is what my X data looked like before KFolds:
   variation       length       tempo
0   0.005144  1183.148118  135.999178
1   0.002595   720.165442  117.453835
2   0.008146   397.500952  112.347147
3   0.005367  1109.819501  172.265625
4   0.001631   509.931973  135.999178
5   0.001620   560.365714  151.999081
6   0.002513   763.377778  107.666016
7   0.009262   502.083628   99.384014
8   0.000610   500.017052  143.554688
9   0.000733   269.001723  117.453835
My Y data looks like this:
array([ True, False, False, True, True, True, True, False, True, False], dtype=bool)
Now when I try to do the cross val:
kf = KFold(X_train.shape[0], n_folds=4, shuffle=True)
for train_index, val_index in kf:
    cv_train_x = X_train.ix[train_index]
    cv_val_x = X_train.ix[val_index]
    cv_train_y = y_train[train_index]
    cv_val_y = y_train[val_index]
    print cv_train_x
    logreg = LogisticRegression(C = .01)
    logreg.fit(cv_train_x, cv_train_y)
    pred = logreg.predict(cv_val_x)
    print accuracy_score(cv_val_y, pred)
When I run this, I get the error below, so I added the print statement.
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
This is what the print statement showed. Some of the data had become NaNs.
   variation       length       tempo
0        NaN          NaN         NaN
1        NaN          NaN         NaN
2   0.008146   397.500952  112.347147
3   0.005367  1109.819501  172.265625
4   0.001631   509.931973  135.999178
I'm sure I'm doing something wrong. Any ideas? As always, thank you so much!
To solve this, use .iloc instead of .ix to index your pandas DataFrame.

Indexing with .ix is usually equivalent to using .loc, which is label-based indexing, not position-based. .loc works on X, which has a nice integer index whose labels match the row positions, but after the train_test_split call that is no longer the case: X_train keeps the original labels of only the rows that were selected. It no longer has label 0 or 1, so if you ask for label 1 on its own you get an exception.

However, the pandas version in use here fails silently if you request multiple labels of which at least one exists. If you ask for labels 1 and 4 together, label 1 comes back as a row of NaNs (since it was not found) and label 4 comes back as the actual row, since it is inside X_train. To fix it, just switch to .iloc, or manually rebuild the index of X_train.
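As a rough illustration of the difference between label-based and position-based lookup, here is a minimal sketch using the first five rows from the question. The contents of X_train are a hypothetical split outcome, not the asker's actual split. Also note that .ix was removed in pandas 1.0, and from that version on .loc raises a KeyError when a list contains missing labels, so the sketch uses reindex as a stand-in to reproduce the NaN rows.

```python
import pandas as pd

# First five rows of the X data from the question (default labels 0..4).
X = pd.DataFrame(
    {
        "variation": [0.005144, 0.002595, 0.008146, 0.005367, 0.001631],
        "length": [1183.148118, 720.165442, 397.500952, 1109.819501, 509.931973],
        "tempo": [135.999178, 117.453835, 112.347147, 172.265625, 135.999178],
    }
)

# Hypothetical result of a shuffled train/test split: rows 0 and 1 went to
# the test set, so X_train keeps only labels 4, 2 and 3, in shuffled order.
X_train = X.loc[[4, 2, 3]]

# KFold yields positions in 0..len(X_train)-1, not index labels.
train_index = [0, 1, 2]

# Label-based lookup: labels 0 and 1 are absent from X_train, label 2 exists.
# reindex fills the missing labels with NaN, mimicking the old .ix result.
by_label = X_train.reindex(train_index)

# Position-based lookup: the first three rows of X_train, whatever their labels.
by_position = X_train.iloc[train_index]

# Alternative fix: rebuild the index so labels and positions coincide again.
X_train_reset = X_train.reset_index(drop=True)
by_label_after_reset = X_train_reset.loc[train_index]

print(by_label)
print(by_position)
```

The first print shows NaN rows for labels 0 and 1, as in the question, while the .iloc version returns three complete rows. Resetting the index is the "rebuild the index" route: after it, label-based and position-based lookups select the same rows.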