I have a small corpus and I want to calculate the accuracy of a naive Bayes classifier using 10-fold cross-validation. How can I do it?
Inspired by Jared's answer, here is a version using a generator. I am assuming that your data set X has N data points (N = 4 in the example) and D features (D = 2 in the example), and that the associated N labels are stored in y.
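A generator-based k-fold split along these lines might look like the following sketch (the helper kfold_indices and the toy X and y are my own illustration, not the original answer's code):

```python
import numpy as np

def kfold_indices(n, k):
    """Yield (train_idx, test_idx) index pairs for k-fold CV over n samples."""
    folds = np.array_split(np.arange(n), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Toy data set with N = 4 data points and D = 2 features:
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([0, 0, 1, 1])

for train_idx, test_idx in kfold_indices(len(X), 2):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # ... fit and evaluate a classifier on each split here ...
```

Because it is a generator, each split is produced lazily, which keeps memory use low for large data sets.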
I've used both libraries: NLTK for the naive Bayes classifier and scikit-learn for the cross-validation, as follows, and at the end I calculated the average accuracy across the folds.
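A sketch of that combination (the toy labeled_data corpus here is my own stand-in; with a real corpus you would build the (feature_dict, label) pairs from your documents):

```python
from nltk.classify import NaiveBayesClassifier, accuracy
from sklearn.model_selection import KFold

# Toy labeled corpus (illustrative): a list of (feature_dict, label) pairs.
labeled_data = [({'token': 'w%d' % i, 'is_even': i % 2 == 0},
                 'pos' if i % 2 == 0 else 'neg')
                for i in range(20)]

kf = KFold(n_splits=10, shuffle=True, random_state=0)
accuracies = []
for train_idx, test_idx in kf.split(labeled_data):
    train_set = [labeled_data[i] for i in train_idx]
    test_set = [labeled_data[i] for i in test_idx]
    classifier = NaiveBayesClassifier.train(train_set)   # NLTK naive Bayes
    accuracies.append(accuracy(classifier, test_set))    # NLTK accuracy helper

avg_accuracy = sum(accuracies) / len(accuracies)
print(avg_accuracy)
```

scikit-learn's KFold only produces the index splits; the training and scoring are done entirely with NLTK.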
NLTK doesn't directly support cross-validation for machine learning algorithms, so your options are either to set this up yourself or to use something like NLTK-Trainer. I'd probably recommend just using another module to do this for you, but if you really want to write your own code, you could do something like the following.
Supposing you want 10-fold cross-validation, you would have to partition your training set into 10 subsets, train on 9/10, test on the remaining 1/10, and do this for each choice of held-out subset (10 combinations in total). Assuming your training set is in a list named training, a simple way to accomplish this would be the following.

Actually, there is no need for the long loop iterations provided in the most upvoted answer. Also, the choice of classifier is irrelevant (it can be any classifier).
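To illustrate the manual approach described above, here is a sketch of my own (it assumes training is a list of (feature_dict, label) pairs and uses NLTK's naive Bayes, but any classifier could be substituted):

```python
from nltk.classify import NaiveBayesClassifier, accuracy

def manual_kfold_accuracy(training, k=10):
    """Split `training` into k folds by slicing: train on k-1 folds, test on 1."""
    fold_size = len(training) // k   # any remainder items are simply dropped
    scores = []
    for i in range(k):
        test_set = training[i * fold_size:(i + 1) * fold_size]
        train_set = training[:i * fold_size] + training[(i + 1) * fold_size:]
        classifier = NaiveBayesClassifier.train(train_set)
        scores.append(accuracy(classifier, test_set))
    return sum(scores) / len(scores)

# Toy usage with an illustrative corpus of (feature_dict, label) pairs:
training = [({'is_even': i % 2 == 0}, 'pos' if i % 2 == 0 else 'neg')
            for i in range(20)]
print(manual_kfold_accuracy(training, k=10))
```

In practice you would shuffle training first so that the folds aren't biased by the order of the corpus.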
Scikit-learn provides cross_val_score, which does all the looping under the hood.
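For example (using the iris data set purely as stand-in data; with a text corpus you would first vectorize the documents into a feature matrix X and label array y):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Stand-in data; substitute your own feature matrix X and labels y.
X, y = load_iris(return_X_y=True)

# 10-fold cross-validation; scoring defaults to accuracy for classifiers.
scores = cross_val_score(GaussianNB(), X, y, cv=10)
print(scores)          # one accuracy value per fold
print(scores.mean())   # average accuracy across the 10 folds
```

Any classifier with fit/predict can be passed in place of GaussianNB, which is the point made above: the cross-validation loop is independent of the model.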