I am trying to build an ML model (regression) using various techniques like SMR, Logistic Regression, and others. With all of these techniques, I can't get an accuracy above 35%. Here's what I'm doing:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# X_data_distance and X_data_orders are pre-loaded 1-D arrays (distance -> orders)
X_data = [X_data_distance]
X_data = np.vstack(X_data).astype(np.float64)
X_data = X_data.T  # shape (10000, 1)
y_data = X_data_orders  # shape (10000,)

X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.33, random_state=42)

svr_rbf = SVC(kernel='rbf', C=1.0)
svr_rbf.fit(X_train, y_train)

plt.plot(X_data_distance, svr_rbf.predict(X_data), color='red', label='RBF model')
For the plot, I'm getting the following:
I have tried various parameter tuning: changing C and gamma, and even trying different kernels, but nothing changes the accuracy. I also tried SVR and Logistic Regression instead of SVC, but nothing helps. I tried different scalings for the training input data, like StandardScaler() and scale().
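The scaling step looked roughly like this (a sketch; the scaler is fitted on the training split only and then reused for the test split):

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply the same
# transformation to the test data before training the model.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

svr_rbf = SVC(kernel='rbf', C=1.0)
svr_rbf.fit(X_train_scaled, y_train)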
I used this as a reference
What should I do?
As a rule of thumb, we usually follow this convention:

Logistic Regression
SVM
Neural Network

Because your dataset is 10K cases, it'd be better to use Logistic Regression, because SVM will take forever to finish! Nevertheless, because your dataset contains a lot of classes, there is a chance of class imbalance in your implementation. Thus I tried to work around this problem by using StratifiedKFold instead of train_test_split, which doesn't guarantee balanced classes in the splits. Moreover, I used GridSearchCV with StratifiedKFold to perform cross-validation in order to tune the parameters and try all the different optimizers!
So the full implementation is as follows:
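(A minimal sketch: the stratifiedSplit(X, Y) helper uses StratifiedKFold to keep the class proportions in the split, and the LogisticRegression parameter grid and solver list are illustrative assumptions rather than the exact original values.)

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def stratifiedSplit(X, Y):
    # One stratified train/test split: every class keeps its proportion,
    # unlike a plain train_test_split.
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    train_idx, test_idx = next(skf.split(X, Y))
    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]

X_train, X_test, y_train, y_test = stratifiedSplit(X_data, y_data)

# Tune C and try all the different solvers ("optimizers") via cross-validation.
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=cv, scoring='accuracy')
grid.fit(X_train, y_train)

print(grid.best_params_)
print('CV accuracy:', grid.best_score_)
print('Test accuracy:', grid.best_estimator_.score(X_test, y_test))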
Despite all the attempts with all the different algorithms, the accuracy didn't exceed 36%!
Why is that?
If you want to make a person recognize/classify another person by their T-shirt color, you cannot say: hey, if it's red that means he's John, and if it's red it's Peter, but if it's red it's Aisling! He would say, "Really, what the heck is the difference?!"
And that's exactly what is in your dataset!
Simply, run:

print(len(np.unique(X_data)))
print(len(np.unique(Y_data)))

and you'll find that the numbers are very odd. In a nutshell: all the classes share a hell of a lot of information, which makes it impressive to get even up to 36% accuracy!
In other words, you have no informative features, which leads to a lack of uniqueness in each class's model!
What to do? I believe you are not allowed to remove some classes, so the only two solutions you have are:
Either live with this very valid result.
Or add more informative feature(s).
Update
Since you have provided the same dataset but with more features (i.e. the complete set of features), the situation is now different.
I recommend you do the following:
Pre-process your dataset (i.e. prepare it by imputing missing values or deleting rows that contain missing values, converting dates to some unique values (example), etc.).

Check which features are most important to the Orders classes; you can achieve that by using Forests of Trees to evaluate the importance of features (see the sketch after this list). Here is a complete and simple example of how to do that in Scikit-Learn.

Create a new version of the dataset, but this time hold Orders as the Y response and the above-found features as the X variables.

Follow the same GridSearchCV and StratifiedKFold procedure that I showed you in the implementation above.
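A minimal sketch of that feature-importance step (assuming the pre-processed data sits in a pandas DataFrame named df with an Orders target column; ExtraTreesClassifier stands in here for the forest-of-trees estimator):

import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Assumption: df is the pre-processed dataset and 'Orders' is the target column.
X = df.drop(columns=['Orders'])
y = df['Orders']

# Fit a forest of trees and rank the features by their importance scores.
forest = ExtraTreesClassifier(n_estimators=250, random_state=42)
forest.fit(X, y)

importances = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)

# Keep the most informative features as the new X variables.
top_features = importances.head(5).index.tolist()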
Hint

As mentioned by Vivek Kumar in the comment below, the stratify parameter has been added to the train_test_split function in a Scikit-learn update. It works by passing the array-like ground truth, so you don't need my workaround in the function stratifiedSplit(X, Y) above.
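For example (a minimal sketch using the arrays from the question above):

from sklearn.model_selection import train_test_split

# Passing the ground-truth labels to `stratify` keeps the class
# proportions identical in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, test_size=0.33, random_state=42, stratify=y_data)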