1 Estimating model performance with k-fold cross-validation
k-fold cross-validation is a common technique for estimating the performance of a classifier, because it evaluates the model on data it was not trained on and so guards against an over-optimistic, over-fitted estimate. Rather than using the entire dataset to build the model, the method splits the data into k folds; each fold serves once as the testing dataset while the remaining k-1 folds form the training dataset. The model built on the training dataset is then assessed on the testing dataset. By repeating the k-fold validation n times, we can use the average of the n accuracies as a more reliable estimate of the model's performance. In this recipe, we will illustrate how to perform a k-fold cross-validation.
Getting ready
In this recipe, we will continue to use the telecom churn dataset as the input data source to train the support vector machine. For those who have not prepared the dataset, please refer to Chapter 5, Classification (I) – Tree, Lazy, and Probabilistic, for detailed information.
How to do it...
Perform the following steps to cross-validate the telecom churn dataset:
1. Split the row indices into 10 folds using the cut function:
> library(e1071)
> ind = cut(1:nrow(churnTrain), breaks=10, labels=F)
2. Next, use a for loop to perform the 10-fold cross-validation, one iteration per fold:
> accuracies = c()
> for (i in 1:10) {
+   fit = svm(churn ~., churnTrain[ind != i,])
+   predictions = predict(fit, churnTrain[ind == i, ! names(churnTrain) %in% c("churn")])
+   correct_count = sum(predictions == churnTrain[ind == i,c("churn")])
+   accuracies = append(correct_count / nrow(churnTrain[ind == i,]), accuracies)
+ }
3. You can then print the accuracies:
> accuracies
[1] 0.9341317 0.8948949 0.8978979 0.9459459 0.9219219 0.9281437
0.9219219 0.9249249 0.9189189 0.9251497
4. Lastly, you can generate the average accuracy with the mean function:
> mean(accuracies)
[1] 0.9213852
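The cut call in step 1 assigns fold numbers in row order, so each fold is a contiguous block of rows. If the rows happen to be sorted (by class or by date, for instance), shuffling the fold labels first is a common precaution. The following is a minimal sketch on 25 toy row indices; the names n_rows and ind_shuffled and the seed are illustrative assumptions, not part of the recipe.

```r
# Toy illustration of how cut() assigns fold labels (25 rows, 5 folds).
n_rows <- 25
ind <- cut(1:n_rows, breaks = 5, labels = FALSE)

# cut() labels rows in order: fold 1 is the first block of rows, and so on.
print(ind)
print(table(ind))

# Shuffling the labels keeps the fold sizes but breaks the row-order link.
set.seed(42)  # arbitrary seed, only for reproducibility
ind_shuffled <- sample(ind)
print(table(ind_shuffled))
```

In the recipe, the same idea would amount to wrapping the cut call in sample before the loop.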
How it works...
In this recipe, we implement a simple script that performs a 10-fold cross-validation. We first generate an index with 10 folds using the cut function. Then, we implement a for loop with 10 iterations, one per fold. Within each iteration, we first apply svm on the other 9 folds of data as the training set. We then use the fitted model to predict the labels of the held-out fold (the testing dataset). Next, we divide the number of correctly predicted labels by the size of the held-out fold to obtain the accuracy. As a result, the loop stores 10 accuracies, one per fold. Finally, we use the mean function to compute the average of the accuracies.
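For readers who do not have the churn data or e1071 at hand, the same loop can be tried with base R only. This is a sketch under stated assumptions: it uses the built-in iris data instead of churnTrain, shuffles the fold labels, and replaces svm with a hand-written nearest-centroid classifier (nearest_centroid_fit and nearest_centroid_predict are hypothetical helpers), so the numbers it prints are unrelated to the recipe's results.

```r
# Self-contained 10-fold cross-validation on the built-in iris data.
# nearest_centroid_* are hypothetical helpers standing in for svm().
nearest_centroid_fit <- function(train) {
  # One row of feature means per class
  aggregate(. ~ Species, data = train, FUN = mean)
}
nearest_centroid_predict <- function(model, newdata) {
  features <- setdiff(names(model), "Species")
  centers <- as.matrix(model[, features])
  x <- as.matrix(newdata[, features])
  # For each test row, pick the class whose centroid is closest
  nearest <- apply(x, 1, function(row) {
    which.min(colSums((t(centers) - row)^2))
  })
  as.character(model$Species[nearest])
}

set.seed(1)  # arbitrary seed, only for reproducibility
ind <- sample(cut(1:nrow(iris), breaks = 10, labels = FALSE))

accuracies <- numeric(10)
for (i in 1:10) {
  fit <- nearest_centroid_fit(iris[ind != i, ])                    # train on 9 folds
  predictions <- nearest_centroid_predict(fit, iris[ind == i, ])   # predict held-out fold
  accuracies[i] <- mean(predictions == iris$Species[ind == i])     # share correct
}
print(accuracies)
print(mean(accuracies))
```

Using mean on the logical comparison gives the same value as dividing the count of correct predictions by the fold size.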
There's more...
If you wish to perform the k-fold validation with other models, simply replace the line that assigns the variable fit with whatever classifier you prefer. For example, if you would like to assess the Naïve Bayes model with a 10-fold cross-validation, you just need to replace the function call svm with naiveBayes:
> accuracies = c()
> for (i in 1:10) {
+   fit = naiveBayes(churn ~., churnTrain[ind != i,])
+   predictions = predict(fit, churnTrain[ind == i, ! names(churnTrain) %in% c("churn")])
+   correct_count = sum(predictions == churnTrain[ind == i,c("churn")])
+   accuracies = append(correct_count / nrow(churnTrain[ind == i,]), accuracies)
+ }
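Since only the fitting call changes between classifiers, the loop can be wrapped in a function that takes the fitting function as an argument. The sketch below is one possible shape, not something e1071 provides: kfold_accuracy and majority_class are hypothetical names, the toy data frame is made up, and the majority-class baseline exists only so the example runs without extra packages.

```r
# kfold_accuracy() is a hypothetical helper, not part of e1071.
kfold_accuracy <- function(formula, data, fit_fun, k = 10, ...) {
  response <- all.vars(formula)[1]
  ind <- cut(1:nrow(data), breaks = k, labels = FALSE)
  sapply(1:k, function(i) {
    fit <- fit_fun(formula, data[ind != i, ], ...)
    predictions <- predict(fit, data[ind == i, names(data) != response, drop = FALSE])
    mean(as.character(predictions) == as.character(data[ind == i, response]))
  })
}

# Stand-in classifier so the sketch runs without extra packages:
# it always predicts the most frequent training class.
majority_class <- function(formula, data) {
  response <- all.vars(formula)[1]
  structure(list(label = names(which.max(table(data[[response]])))),
            class = "majority_class")
}
predict.majority_class <- function(object, newdata, ...) {
  rep(object$label, nrow(newdata))
}

set.seed(7)  # arbitrary seed; the toy data below is invented
toy <- data.frame(x = rnorm(100),
                  churn = factor(sample(c("yes", "no"), 100, replace = TRUE,
                                        prob = c(0.2, 0.8))))
baseline <- kfold_accuracy(churn ~ ., toy, majority_class)
print(mean(baseline))

# With e1071 loaded, the same call shape would presumably work for the recipe:
#   kfold_accuracy(churn ~ ., churnTrain, svm)
#   kfold_accuracy(churn ~ ., churnTrain, naiveBayes)
```

A majority-class baseline like this is also a useful reference point: a classifier is only informative if its cross-validated accuracy clearly exceeds it.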
2 Performing cross-validation with the e1071 package
Besides implementing a loop to perform the k-fold cross-validation, you can use the tuning functions within the e1071 package (for example, tune.nnet, tune.randomForest, tune.rpart, tune.svm, and tune.knn) to obtain the minimum error value. In this recipe, we will illustrate how to use tune.svm to perform the 10-fold cross-validation and obtain the optimum classification model.
Getting ready
In this recipe, we continue to use the telecom churn dataset as the input data source to
perform 10-fold cross-validation.
How to do it...
Perform the following steps to retrieve the minimum estimation error using cross-validation:
1. Apply tune.svm on the training dataset, trainset, with the 10-fold cross-validation as the tuning control. (If you encounter an error message such as could not find function predict.func, clear the workspace, restart the R session, and reload the e1071 library):
> tuned = tune.svm(churn~., data = trainset, gamma = 10^-2, cost = 10^2, tunecontrol=tune.control(cross=10))
2. Next, you can obtain the summary information of the model, tuned:
> summary(tuned)
Error estimation of 'svm' using 10-fold cross validation: 0.08164651
3. Then, you can access the performance details of the tuned model:
> tuned$performances
  gamma cost      error dispersion
1  0.01  100 0.08164651 0.02437228
4. Lastly, you can use the optimum model to generate a classification table:
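The code for step 4 is not shown here. A classification table is typically a cross-tabulation of actual against predicted labels, which base R's table function produces. The sketch below uses made-up label vectors purely for illustration; in the recipe, the predicted labels would presumably come from applying predict to tuned$best.model, and the actual labels from the churn column of trainset.

```r
# Toy actual/predicted labels (made up for illustration, not churn results).
actual    <- factor(c("yes", "no", "no", "yes", "no", "no", "yes", "no"),
                    levels = c("yes", "no"))
predicted <- factor(c("yes", "no", "no", "no",  "no", "yes", "yes", "no"),
                    levels = c("yes", "no"))

# Rows are actual classes, columns are predicted classes.
classification_table <- table(actual, predicted)
print(classification_table)

# Accuracy is the share of cases on the diagonal.
accuracy <- sum(diag(classification_table)) / sum(classification_table)
print(accuracy)
```

The off-diagonal cells show which kind of mistake dominates, something a single accuracy or error figure hides.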
How it works...
The e1071 package provides miscellaneous functions to build and assess models, so you do not need to reinvent the wheel to evaluate a fitted model. In this recipe, we use the tune.svm function to tune the svm model with the given formula, dataset, gamma, cost, and control settings. Within the tune.control options, we set cross=10, which performs a 10-fold cross-validation during the tuning process. Because only a single gamma and a single cost value are supplied here, the call evaluates that one combination rather than searching over several. The tuning process returns the minimum estimation error, the performance details, and the best model found. We can therefore obtain the performance measures of the tuning and use the optimum model to generate a classification table.
See also
In the e1071 package, the tune function uses a grid search to tune parameters. For those interested in other tuning functions, use the help function to view the tune documentation, for example with ?tune once e1071 is loaded.
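To make the idea of a grid search concrete, the sketch below evaluates every combination of two candidate hyperparameters with 10-fold cross-validation and keeps the combination with the lowest error. It is not e1071's implementation: it uses the built-in iris data and a hand-written k-nearest-neighbour classifier (knn_predict and cv_error are hypothetical helpers), with the number of neighbours k and the distance order p as stand-ins for gamma and cost.

```r
# Grid search over two hyperparameters with k-fold cross-validation.
# knn_predict() is a hypothetical hand-written k-nearest-neighbour classifier.
knn_predict <- function(train_x, train_y, test_x, k, p) {
  apply(test_x, 1, function(row) {
    # Minkowski distance of order p from this test row to every training row
    d <- rowSums(abs(sweep(train_x, 2, row))^p)^(1 / p)
    votes <- train_y[order(d)[1:k]]
    names(which.max(table(votes)))
  })
}

# Mean misclassification rate over the folds defined by ind
cv_error <- function(k, p, x, y, ind) {
  fold_errors <- sapply(sort(unique(ind)), function(i) {
    pred <- knn_predict(x[ind != i, , drop = FALSE], y[ind != i],
                        x[ind == i, , drop = FALSE], k, p)
    mean(pred != y[ind == i])
  })
  mean(fold_errors)
}

set.seed(3)  # arbitrary seed, only for reproducibility
x <- as.matrix(iris[, 1:4])
y <- as.character(iris$Species)
ind <- sample(cut(1:nrow(x), breaks = 10, labels = FALSE))

# Every combination of candidate values is evaluated once.
grid <- expand.grid(k = c(1, 5, 9), p = c(1, 2))
grid$error <- mapply(cv_error, grid$k, grid$p,
                     MoreArgs = list(x = x, y = y, ind = ind))
print(grid)

best <- grid[which.min(grid$error), ]
print(best)
```

The grid data frame plays the same role as tuned$performances: one row per parameter combination with its cross-validated error.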
3 Performing cross-validation with the caret package
The caret (classification and regression training) package contains many functions for the training process of regression and classification problems. Similar to the e1071 package, it also contains a function to perform the k-fold cross-validation. In this recipe, we will demonstrate how to perform the k-fold cross-validation using the caret package.
Getting ready
In this recipe, we will continue to use the telecom churn dataset as the input data source to perform the k-fold cross-validation.
How to do it...
Follow these steps to perform the k-fold cross-validation with the caret package:
1. First, set up the control parameter to train with the 10-fold cross-validation in 3 repetitions:
> library(caret)
> control = trainControl(method="repeatedcv", number=10, repeats=3)
2. Then, you can train the classification model on telecom churn data with rpart:
> model = train(churn~., data=trainset, method="rpart", preProcess="scale", trControl=control)
3. Finally, you can examine the output of the generated model:
> model
CART

2315 samples
  16 predictor
   2 classes: 'yes', 'no'

Pre-processing: scaled
Resampling: Cross-Validated (10 fold, repeated 3 times)

Summary of sample sizes: 2084, 2083, 2082, 2084, 2083, 2084, ...

Resampling results across tuning parameters:

  cp      Accuracy  Kappa  Accuracy SD  Kappa SD
  0.0556  0.904     0.531  0.0236       0.155
  0.0746  0.867     0.269  0.0153       0.153
  0.0760  0.860     0.212  0.0107       0.141

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.05555556.
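The selection rule reported in the output ("largest value" of Accuracy) can be reproduced by hand. The sketch below copies the three resampling rows from the output into a data frame (the names results and best are illustrative) and picks the row with the highest accuracy.

```r
# Resampling results copied from the model output (rounded as printed).
results <- data.frame(
  cp       = c(0.0556, 0.0746, 0.0760),
  Accuracy = c(0.904, 0.867, 0.860),
  Kappa    = c(0.531, 0.269, 0.212)
)

# "Largest value" selection: keep the row with the highest accuracy.
best <- results[which.max(results$Accuracy), ]
print(best)
```

The selected row has cp = 0.0556, which matches the final value reported (0.05555556 before rounding).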
How it works...
In this recipe, we demonstrate how convenient it is to conduct the k-fold cross-validation using the caret package. In the first step, we set up the training control and select the option to perform the 10-fold cross-validation in three repetitions. Repeating the k-fold validation in this way is called repeated k-fold validation, and it is used to test the stability of the model: if the model is stable, the repetitions should give similar results. Then, we apply rpart on the training dataset with the option to scale the data, and train the model with the options configured in the previous step. After the training process is complete, the model outputs three resampling results, one per candidate cp value. Of these, the model with cp=0.05555556 has the largest accuracy value (0.904), and is therefore selected as the optimal model for classification.
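The idea behind repeated k-fold validation can be sketched in base R: rerun the k-fold procedure with a fresh random partition each time and compare the per-repeat averages. This is an illustration of the concept, not caret's implementation; it uses the built-in iris data and a deliberately simple one-feature classifier (fold_accuracy is a hypothetical helper).

```r
# Repeated k-fold: rerun k-fold with a fresh random partition each time.
# Toy classifier: predict the class whose mean Petal.Length is closest.
fold_accuracy <- function(train, test) {
  centers <- tapply(train$Petal.Length, train$Species, mean)
  pred <- names(centers)[sapply(test$Petal.Length,
                                function(v) which.min(abs(centers - v)))]
  mean(pred == test$Species)
}

set.seed(10)  # arbitrary seed, only for reproducibility
k <- 10
repeats <- 3
resamples <- expand.grid(fold = 1:k, rep = 1:repeats)
resamples$accuracy <- NA_real_

for (r in 1:repeats) {
  ind <- sample(rep(1:k, length.out = nrow(iris)))  # new partition per repeat
  for (i in 1:k) {
    row <- resamples$fold == i & resamples$rep == r
    resamples$accuracy[row] <- fold_accuracy(iris[ind != i, ], iris[ind == i, ])
  }
}

# One mean per repeat; similar values suggest a stable estimate.
per_repeat <- tapply(resamples$accuracy, resamples$rep, mean)
print(per_repeat)
print(c(mean = mean(resamples$accuracy), sd = sd(resamples$accuracy)))
```

With 10 folds and 3 repeats there are 30 resamples in total, which is why the caret output can report both a mean accuracy and its standard deviation.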
You can configure the resampling method in trainControl, where you can specify boot, boot632, cv, repeatedcv, LOOCV, LGOCV, none, oob, adaptive_cv, adaptive_boot, or adaptive_LGOCV. For more detailed information on how to choose the resampling method, view the trainControl documentation, for example with ?trainControl once caret is loaded.
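The resampling methods differ mainly in how training and held-out rows are drawn. As a rough index-level sketch, assuming 20 toy rows, the following shows three of the schemes named above; it reflects their general definitions and is not taken from caret's code, and the 75% split is an illustrative choice.

```r
# Index-level sketch of three resampling schemes on 20 toy rows.
set.seed(5)  # arbitrary seed, only for reproducibility
n <- 20
rows <- 1:n

# Bootstrap ("boot"): sample n rows with replacement; the rows never
# drawn (out-of-bag) serve as the held-out set.
boot_train <- sample(rows, n, replace = TRUE)
boot_test  <- setdiff(rows, boot_train)

# Leave-one-out ("LOOCV"): k-fold with k equal to the number of rows,
# so each held-out set is a single row.
loocv_test <- as.list(rows)

# Leave-group-out ("LGOCV"): a random train/test split, here 75% / 25%.
lgocv_train <- sample(rows, size = 0.75 * n)
lgocv_test  <- setdiff(rows, lgocv_train)

print(length(unique(boot_train)))
print(boot_test)
print(length(loocv_test))
print(lgocv_test)
```

Leave-one-out uses the most training data per fit but needs one fit per row, whereas the bootstrap and leave-group-out schemes let you choose the number of resamples freely.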
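--------- Excerpted from Baidu Translate
Summary: k-fold cross-validation randomly splits the original data into k parts. The cross-validation process effectively repeats the experiment k times: each time, a different one of the k parts is chosen as the test data and the remaining k-1 parts are used as the training data, and the k results are averaged at the end.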
武雪