I have a small corpus and I want to compute the accuracy of a naive Bayes classifier using 10-fold cross-validation. How can I do that?
Answer 1:
Your options are either to set this up yourself or to use something like NLTK-Trainer, since NLTK does not directly support cross-validation for machine learning algorithms.
I'd recommend just using another module to do this for you, but if you really want to write your own code, you could do something like the following.
Supposing you want 10 folds, you would have to partition your training set into 10 subsets, train on 9/10 of the data, test on the remaining 1/10, and do this for each combination of subsets (10 times).
Assuming your training set is stored in a list named training, a simple way to accomplish this would be:
num_folds = 10
subset_size = len(training) // num_folds  # integer division
for i in range(num_folds):
    testing_this_round = training[i*subset_size:][:subset_size]
    training_this_round = training[:i*subset_size] + training[(i+1)*subset_size:]
    # train using training_this_round
    # evaluate against testing_this_round
    # save accuracy
# find mean accuracy over all rounds
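Since the question asks about naive Bayes, here is one way those placeholder comments might be filled in with NLTK's NaiveBayesClassifier; this is an illustrative sketch, assuming training is a list of (featureset, label) pairs as NLTK expects:

import nltk

num_folds = 10
subset_size = len(training) // num_folds
accuracies = []
for i in range(num_folds):
    testing_this_round = training[i*subset_size:][:subset_size]
    training_this_round = training[:i*subset_size] + training[(i+1)*subset_size:]
    # Train on 9/10 of the data, evaluate on the held-out 1/10
    classifier = nltk.NaiveBayesClassifier.train(training_this_round)
    accuracies.append(nltk.classify.util.accuracy(classifier, testing_this_round))
# Mean accuracy over all ten rounds
print(sum(accuracies) / num_folds)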
Answer 2:
There is actually no need for the long loop iterations provided in the most upvoted answer. Also, the choice of classifier is irrelevant (it can be any classifier).
Scikit-learn provides cross_val_score, which does all the looping under the hood.
from sklearn.model_selection import KFold, cross_val_score  # formerly sklearn.cross_validation

k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
clf = <any classifier>
print(cross_val_score(clf, X, y, cv=k_fold, n_jobs=1))
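For a runnable illustration, here is a minimal sketch with the placeholder filled in; MultinomialNB and the iris data are assumptions for the example, not part of the original answer:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

X, y = load_iris(return_X_y=True)  # toy data purely for demonstration
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
clf = MultinomialNB()
# cross_val_score handles the fold loop internally and returns one score per fold
print(cross_val_score(clf, X, y, cv=k_fold, n_jobs=1))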
Answer 3:
I've used both libraries, NLTK for naive Bayes and sklearn for cross-validation, as follows:
import nltk
from sklearn.model_selection import KFold  # formerly sklearn.cross_validation

training_set = nltk.classify.apply_features(extract_features, documents)
kf = KFold(n_splits=10)
for train_idx, test_idx in kf.split(training_set):
    # Build each fold by index; slicing from the first to the last index
    # would drop the final element and break with shuffled splits
    train_data = [training_set[i] for i in train_idx]
    test_data = [training_set[i] for i in test_idx]
    classifier = nltk.NaiveBayesClassifier.train(train_data)
    print('accuracy:', nltk.classify.util.accuracy(classifier, test_data))
And at the end I compute the mean of the per-fold accuracies.
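To make that last step concrete, a small sketch of the same loop (reusing the kf and training_set defined above) that collects the fold scores and averages them:

scores = []
for train_idx, test_idx in kf.split(training_set):
    classifier = nltk.NaiveBayesClassifier.train([training_set[i] for i in train_idx])
    scores.append(nltk.classify.util.accuracy(classifier, [training_set[i] for i in test_idx]))
print('mean accuracy:', sum(scores) / len(scores))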
Answer 4:
A modification of the KFold call in the previous answer, shuffling the data so the folds don't simply follow the original ordering:
kf = KFold(n_splits=10, shuffle=True, random_state=None)
Answer 5:
Inspired by Jared's answer, here is a version that uses a generator:
def k_fold_generator(X, y, k_fold):
    subset_size = len(X) // k_fold  # integer division, works on Python 2 and 3
    for k in range(k_fold):
        X_train = X[:k * subset_size] + X[(k + 1) * subset_size:]
        X_valid = X[k * subset_size:][:subset_size]
        y_train = y[:k * subset_size] + y[(k + 1) * subset_size:]
        y_valid = y[k * subset_size:][:subset_size]
        yield X_train, y_train, X_valid, y_valid
I assume your data set X has N data points (N = 4 in the example) and D features (D = 2 in the example). The N associated labels are stored in y.
X = [[1, 2], [3, 4], [5, 6], [7, 8]]
y = [0, 0, 1, 1]
k_fold = 2
for X_train, y_train, X_valid, y_valid in k_fold_generator(X, y, k_fold):
    # Train using X_train and y_train
    # Evaluate using X_valid and y_valid
    pass
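To fill in those comments, here is a hedged sketch using scikit-learn's GaussianNB; the classifier is an arbitrary choice for illustration, and any estimator with fit and score would work:

from sklearn.naive_bayes import GaussianNB

scores = []
for X_train, y_train, X_valid, y_valid in k_fold_generator(X, y, k_fold):
    model = GaussianNB()
    model.fit(X_train, y_train)                   # train on k-1 folds
    scores.append(model.score(X_valid, y_valid))  # accuracy on the held-out fold
print('mean accuracy:', sum(scores) / len(scores))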
Source: How to use the a k-fold cross validation in scikit with naive bayes classifier and NLTK