I'm trying to apply a Gaussian Naive Bayes model to a dataset to predict disease. It runs correctly when I predict on the training data, but when I predict on the testing data it raises a ValueError.
runfile('D:/ROFI/ML/Heart Disease/prediction.py', wdir='D:/ROFI/ML/Heart Disease')
Traceback (most recent call last):
  File "", line 1, in <module>
    runfile('D:/ROFI/ML/Heart Disease/prediction.py', wdir='D:/ROFI/ML/Heart Disease')
  File "C:\Users\User\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)
  File "C:\Users\User\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "D:/ROFI/ML/Heart Disease/prediction.py", line 85, in <module>
    predict(x_train, y_train, x_test, y_test)
  File "D:/ROFI/ML/Heart Disease/prediction.py", line 73, in predict
    predicted_data = model.predict(x_test)
  File "C:\Users\User\Anaconda3\lib\site-packages\sklearn\naive_bayes.py", line 65, in predict
    jll = self._joint_log_likelihood(X)
  File "C:\Users\User\Anaconda3\lib\site-packages\sklearn\naive_bayes.py", line 429, in _joint_log_likelihood
    n_ij -= 0.5 * np.sum(((X - self.theta_[i, :]) ** 2) /
ValueError: operands could not be broadcast together with shapes (294,14) (15,)
What's wrong here? This is my code:
import pandas
from sklearn import metrics
from sklearn.preprocessing import Imputer
from sklearn.naive_bayes import GaussianNB


def load_data(feature_columns, predicted_column):
    train_data_frame = pandas.read_excel("training_data.xlsx")
    test_data_frame = pandas.read_excel("testing_data.xlsx")
    data_frame = pandas.read_excel("data_set.xlsx")
    x_train = train_data_frame[feature_columns].values
    y_train = train_data_frame[predicted_column].values
    x_test = test_data_frame[feature_columns].values
    y_test = test_data_frame[predicted_column].values
    x_train, x_test = impute(x_train, x_test)
    return x_train, y_train, x_test, y_test


def impute(x_train, x_test):
    fill_missing = Imputer(missing_values=-9, strategy="mean", axis=0)
    x_train = fill_missing.fit_transform(x_train)
    x_test = fill_missing.fit_transform(x_test)
    return x_train, x_test


def predict(x_train, y_train, x_test, y_test):
    model = GaussianNB()
    model.fit(x_train, y_train.ravel())
    predicted_data = model.predict(x_test)
    accuracy = metrics.accuracy_score(y_test, predicted_data)
    print("Accuracy of our naive bayes model is : %.2f" % (accuracy * 100))
    return predicted_data


feature_columns = ["age", "sex", "chol", "cigs", "years", "fbs", "trestbps", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
predicted_column = ["cp"]
x_train, y_train, x_test, y_test = load_data(feature_columns, predicted_column)
predict(x_train, y_train, x_test, y_test)
N.B.: Both files have the same number of columns.
I found the bug. The error is caused by Imputer. Imputer replaces the missing values in the data set, but if a column consists entirely of missing values, it drops that column instead. My testing data set had one column that was entirely missing, so Imputer removed it. That left the test data with 14 columns while the model had been trained on 15, which is exactly the shape mismatch in the error message. I removed the name of the all-missing column from the feature_columns list and it worked.
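The behaviour described above can be reproduced with a minimal sketch. It assumes a newer scikit-learn, where Imputer has been replaced by sklearn.impute.SimpleImputer, and it uses small made-up arrays rather than the real data set. It also shows one possible alternative to removing the column: fit the imputer on the training data only and reuse it on the test data, so both keep the same number of columns.

```python
import numpy as np
from sklearn.impute import SimpleImputer  # assumed replacement for the older Imputer

# Made-up illustration data; -9 marks a missing value, as in the question.
x_train = np.array([[63.0, 233.0, 1.0],
                    [41.0, -9.0, 0.0],
                    [56.0, 250.0, 1.0]])
x_test = np.array([[52.0, 199.0, -9.0],
                   [48.0, -9.0, -9.0]])  # last column is entirely missing

# Find columns that are entirely missing in the test data.
all_missing = np.all(x_test == -9, axis=0)
print(all_missing)

# Fitting a fresh imputer on the test data drops the all-missing column,
# because no mean can be computed for it.
dropped = SimpleImputer(missing_values=-9, strategy="mean").fit_transform(x_test)
print(dropped.shape)

# Fitting on the training data and only transforming the test data keeps
# every column; the missing test values are filled with the training means.
imputer = SimpleImputer(missing_values=-9, strategy="mean")
x_train_filled = imputer.fit_transform(x_train)
x_test_filled = imputer.transform(x_test)
print(x_train_filled.shape, x_test_filled.shape)
```

Whether filling an entirely missing test column with training means is sensible depends on the data; dropping the column from feature_columns, as I did, is the simpler choice.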