High ROC-AUC and F1 scores but poor-looking ROC curves


Question:

I created a new ensemble method that does voting manually between my three classifiers (courtesy of Daniel, who helped me write the function here: Improving the prediction score by use of confidence level of classifiers on instances).

The purpose of this manual voting is to accept, for each instance, the answer from the most confident classifier. Below is the code and the resulting accuracy scores (for reference, a rough sketch of ensemble_test follows the scores):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# parameters for random forest
rfclf_params = {
    'n_estimators': 500, 
    'bootstrap': True, 
    'class_weight':None, 
    'criterion':'gini',
    'max_depth':None, 
    'max_features':'auto',
    'warm_start': True,
    'random_state': 41
    # ... fill in the rest you want here
}

# Fill in svm params here
svm_params = {
    'C': 100,
    'probability':True,
    'random_state':42
}

# KNeighbors params go here
kneighbors_params= {
    'n_neighbors': 5,
    'weights':'distance'
}

y_test_classes = (y_test_sl, y_test_lim, y_test_shale, y_test_sandlim, y_test_ss, y_test_dol, y_test_sand)
classifiers = [RandomForestClassifier, SVC, KNeighborsClassifier]
params = [rfclf_params, svm_params, kneighbors_params]
y_trains_classes= (y_train_sl, y_train_lim, y_train_shale, y_train_sandlim, 
                   y_train_ss, y_train_dol, y_train_sand)
y_classes_names = ("shaly limestone", "limestone", "shale", "sandy lime", 
                   "shaly sandstone", "dolomite", "sandstone")

# Just get predictions
for y_trains, y_test, y_strings in zip(y_trains_classes, y_test_classes, y_classes_names):
    y_preds_test = ensemble_test(classifiers, params, X_train, y_trains, X_test_prepared)
    print("\n","Accuracy score for", y_strings, "=", accuracy_score(y_test, y_preds_test))
    print("f1_score for", y_strings, "=", f1_score(y_test, y_preds_test,
                                                        average = 'weighted', labels=np.unique(y_preds_test)))
    print("roc auc score for", y_strings, "=", roc_auc_score(y_test, y_preds_test,
                                                                  average = 'weighted'))

Accuracy score for shaly limestone = 0.949514563107
f1_score for shaly limestone = 0.949653574035
roc auc score for shaly limestone = 0.933362369338

 Accuracy score for limestone = 0.957281553398
f1_score for limestone = 0.957272532095
roc auc score for limestone = 0.957311555515

 Accuracy score for shale = 0.95145631068
f1_score for shale = 0.948556595316
roc auc score for shale = 0.845505617978

 Accuracy score for sandy lime = 0.998058252427
f1_score for sandy lime = 0.998008114117
roc auc score for sandy lime = 0.95

 Accuracy score for shaly sandstone = 0.996116504854
f1_score for shaly sandstone = 0.998054474708
roc auc score for shaly sandstone = 0.5

 Accuracy score for dolomite = 1.0
f1_score for dolomite = 1.0
roc auc score for dolomite = 1.0

 Accuracy score for sandstone = 0.996116504854
f1_score for sandstone = 0.996226826208
roc auc score for sandstone = 0.997995991984
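(For reference, ensemble_test comes from the linked answer; I am not reproducing it exactly here, but it is essentially the same as the ensemble_proba function shown further below, except that it returns the voted class labels instead of the probabilities. Roughly:)

def ensemble_test(classifiers, params, X_train, y_train, X_test):
    # running elementwise maximum of the predicted probabilities across the classifiers
    best_preds_test = np.zeros((len(X_test), 2))
    classes_test = np.unique(y_train)

    for i in range(len(classifiers)):
        # construct each classifier from its parameter dict and fit it
        clf_test = classifiers[i](**params[i])
        clf_test.fit(X_train, y_train)
        # keep the highest probability seen so far for every instance/class
        best_preds_test = np.maximum(best_preds_test, clf_test.predict_proba(X_test))

    # return, for each instance, the class with the overall highest probability,
    # i.e. the answer of the most confident classifier
    return np.array([classes_test[np.argmax(pred)] for pred in best_preds_test])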

To plot ROC curves I know I need the predict_proba outputs from this function, so, again following the suggestion from the link above, I made the function return probabilities instead:

def ensemble_proba(classifiers, params, X_train, y_train, X_test):
    # one row per test instance, one column per class (these targets are binary)
    best_preds_test = np.zeros((len(X_test), 2))
    classes_test = np.unique(y_train)

    for i in range(len(classifiers)):
        # Construct the classifier by unpacking params 
        # store classifier instance
        clf_test = classifiers[i](**params[i])
        # Fit the classifier as usual and call predict_proba
        clf_test.fit(X_train, y_train)
        y_preds_test = clf_test.predict_proba(X_test)
        # Keep, for each instance and each class, the highest probability
        # seen so far across the classifiers (elementwise maximum),
        # see the docs of np.maximum here:
        # https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.maximum.html
        best_preds_test = np.maximum(best_preds_test, y_preds_test)

    # preds_test maps the maximum probability for each instance back to its class label,
    # but only the winning probability itself is returned below
    preds_test = np.array([classes_test[np.argmax(pred)] for pred in best_preds_test])
    return np.array([np.amax(pred_probs) for pred_probs in best_preds_test])
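(To be explicit about what this returns: a one-dimensional array holding only the winning probability for each test instance, whichever class that probability belongs to. A quick check, using the shaly-limestone labels from above as an example:)

# inspect the scores that will later be fed to roc_curve (shaly limestone as an example)
y_scores_sl = ensemble_proba(classifiers, params, X_train, y_train_sl, X_test_prepared)
print(y_scores_sl.shape)                     # (len(X_test_prepared),)
print(y_scores_sl.min(), y_scores_sl.max())  # winning probabilities, so never below 0.5 for a binary target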

Now, since I wanted to plot ROC curves for all the classes in the test set, I did the following. The resulting ROC curves look very different from what I expected, given that my ROC-AUC scores are pretty good for every class except "shaly sandstone".

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

for y_trains, y_test, y_strings in zip(y_trains_classes, y_test_classes, y_classes_names):
    y_scores_ensemble_all = ensemble_proba(classifiers, params, X_train, y_trains, X_test_prepared)
    fpr_ensemble_all, tpr_ensemble_all, thresholds_ensemble_all = roc_curve(y_test, y_scores_ensemble_all)

    plt.figure(figsize=(8, 6))
    # plot_roc_curve is my helper that plots TPR vs FPR with the given label
    plot_roc_curve(fpr_ensemble_all, tpr_ensemble_all, "Ensemble manual voting")
    plt.legend(loc="lower right", fontsize=16)
    plt.title('ROC curve of Ensemble manual voting of %s' % (y_strings))
    plt.axis([-0.01, 1.01, -0.01, 1.01])
    plt.show()

[ROC-curve plots omitted: first two classes, next two classes, next two classes, last class]

Why do the curves look like this when the F1 and ROC-AUC scores are pretty good for almost all classes, yet the ROC curves themselves look so poor? Did I do something wrong when returning the probabilities from my function, or are the curves supposed to look like this for some reason?
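For comparison, this is how I would plot the ROC curve for a single classifier, feeding roc_curve the probability of the positive class, i.e. one column of predict_proba (a sketch for the SVC alone on the shaly-limestone labels, reusing the plot_roc_curve helper from above and assuming the second column corresponds to the positive class). Is the score my ensemble_proba returns playing the same role as this column?

# ROC curve for one classifier on one class, using the positive-class probability column
svm_clf = SVC(**svm_params)
svm_clf.fit(X_train, y_train_sl)
y_scores_svm = svm_clf.predict_proba(X_test_prepared)[:, 1]  # probability of the positive class
fpr_svm, tpr_svm, thresholds_svm = roc_curve(y_test_sl, y_scores_svm)

plt.figure(figsize=(8, 6))
plot_roc_curve(fpr_svm, tpr_svm, "SVC alone")
plt.legend(loc="lower right", fontsize=16)
plt.show()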