I'm trying to run cross validation on my data set. The data appears to be clean, but then when I try to run it, some of my data gets replaced by NaNs. I'm not sure why. Has anybody seen this before?
y, X = np.ravel(df_test['labels']), df_test[['variation', 'length', 'tempo']]
X_train, X_test, y_train, y_test = cv.train_test_split(X,y,test_size=.30, random_state=4444)
This is what my X data looked like before KFolds:
variation length tempo
0 0.005144 1183.148118 135.999178
1 0.002595 720.165442 117.453835
2 0.008146 397.500952 112.347147
3 0.005367 1109.819501 172.265625
4 0.001631 509.931973 135.999178
5 0.001620 560.365714 151.999081
6 0.002513 763.377778 107.666016
7 0.009262 502.083628 99.384014
8 0.000610 500.017052 143.554688
9 0.000733 269.001723 117.453835
My Y data looks like this:
array([ True, False, False, True, True, True, True, False, True, False], dtype=bool)
Now when I try to do the cross val:
kf = KFold(X_train.shape[0], n_folds=4, shuffle=True)
for train_index, val_index in kf:
cv_train_x = X_train.ix[train_index]
cv_val_x = X_train.ix[val_index]
cv_train_y = y_train[train_index]
cv_val_y = y_train[val_index]
print cv_train_x
logreg = LogisticRegression(C = .01)
logreg.fit(cv_train_x, cv_train_y)
pred = logreg.predict(cv_val_x)
print accuracy_score(cv_val_y, pred)
When I try to run this, I error out with the below error, so I add the print statement.
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
In my print statement, this is what it printed, some data became NaNs.
variation length tempo
0 NaN NaN NaN
1 NaN NaN NaN
2 0.008146 397.500952 112.347147
3 0.005367 1109.819501 172.265625
4 0.001631 509.931973 135.999178
I'm sure I'm doing something wrong, any ideas? As always, thank you so much!