Splitting a Dataset into Training, Validation, and Test Sets

Testing Our Model

Supervised machine learning algorithms are amazing tools capable of making predictions and classifications. However, it is important to ask yourself how accurate those predictions are. After all, it's possible that every prediction your classifier makes is actually wrong! Luckily, we can leverage the fact that supervised machine learning algorithms, by definition, have a dataset of pre-labeled data points. In order to test the effectiveness of your algorithm, we'll split this data into:

training set

validation set

test set

Training Set and Validation Set

The training set is the data that the algorithm will learn from. Learning looks different depending on which algorithm you are using. For example, when using Linear Regression, the points in the training set are used to draw the line of best fit. In K-Nearest Neighbors, the points in the training set are the points that could be the neighbors.
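
As a rough illustration of what "learning" means for linear regression, the sketch below fits a least-squares line to a tiny training set. The function name `fit_line` and the data points are made up for this example; they are not from the original article.

```python
def fit_line(points):
    """Return (slope, intercept) of the least-squares line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope is covariance(x, y) divided by variance(x).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training set: points lying exactly on y = 2x + 1.
training_set = [(0, 1), (1, 3), (2, 5), (3, 7)]
slope, intercept = fit_line(training_set)
print(slope, intercept)  # 2.0 1.0
```

Only the training points influence the fitted line; any validation points would be kept aside and used solely to check it.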

After training using the training set, the points in the validation set are used to compute the accuracy or error of the classifier. The key insight here is that we know the true labels of every point in the validation set, but we're temporarily going to pretend like we don't. We can use every point in the validation set as input to our classifier. We'll then receive a classification for that point. We can now peek at the true label of the validation point and see whether we got it right or not. If we do this for every point in the validation set, we can compute the validation error!
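
The procedure just described can be sketched as follows, assuming a simple K-Nearest Neighbors classifier written from scratch. The helper names (`knn_classify`, `validation_accuracy`) and the data are hypothetical and chosen only for illustration.

```python
from collections import Counter

def knn_classify(training_set, point, k=3):
    """Predict a label by majority vote among the k closest training points."""
    by_distance = sorted(
        training_set,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], point)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

def validation_accuracy(training_set, validation_set, k=3):
    """Fraction of validation points whose predicted label matches the true one."""
    correct = 0
    for features, true_label in validation_set:
        # Pretend we don't know true_label: only the features reach the classifier.
        guess = knn_classify(training_set, features, k)
        # Now peek at the true label to score the guess.
        if guess == true_label:
            correct += 1
    return correct / len(validation_set)

# Hypothetical labeled data in the form (features, label).
training_set = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
                ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b")]
validation_set = [((2, 2), "a"), ((9, 9), "b"), ((5, 6), "a"), ((0, 0), "a")]

accuracy = validation_accuracy(training_set, validation_set)
validation_error = 1 - accuracy
print(accuracy, validation_error)  # 0.75 0.25
```

Here the point `(5, 6)` is labeled "a" but sits closer to the "b" cluster, so the classifier gets it wrong and the validation error is one in four.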

Validation error might not be the only metric we're interested in. A better way of judging the effectiveness of a machine learning algorithm is to compute its precision, recall, and F1 score.
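
The article does not define these metrics, so the sketch below uses their standard definitions for a binary problem: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is the harmonic mean of the two. The function name, the labels, and the choice to return 0.0 when a denominator is zero are assumptions made for this example.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Compute precision, recall and F1 for the class named by `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    # Returning 0.0 on an empty denominator is a convention, not a rule.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical validation results: 1 is the positive class.
true_labels      = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted_labels = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(true_labels, predicted_labels, positive=1)
print(precision, recall, f1)
```

In this made-up case the accuracy is 70%, yet the classifier finds only half of the positive points, which is the kind of detail accuracy alone can hide.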

How to Split

Figuring out how much of your data should be split into your validation set is a tricky question. If your training set is too small, then your algorithm might not have enough data to effectively learn. On the other hand, if your validation set is too small, then your accuracy, precision, recall, and F1 score could have a large variance. You might happen to get a really lucky or a really unlucky split! In general, putting 80% of your data in the training set, and 20% of your data in the validation set is a good place to start.
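
A minimal sketch of the 80/20 starting point, using only the standard library. The function name `train_validation_split`, the `seed` parameter, and the stand-in data are assumptions for illustration.

```python
import random

def train_validation_split(data, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the data, then cut off a validation slice."""
    shuffled = list(data)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    validation_set = shuffled[:n_validation]
    training_set = shuffled[n_validation:]
    return training_set, validation_set

# Stand-in dataset of 100 points.
data = list(range(100))
training_set, validation_set = train_validation_split(data)
print(len(training_set), len(validation_set))  # 80 20
```

Changing `seed` produces a different split, which is one way to see how much a lucky or unlucky split moves the metrics.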

N-Fold Cross-Validation

Sometimes your dataset is so small, that splitting it 80/20 will still result in a large amount of variance. One solution to this is to perform N-Fold Cross-Validation. The central idea here is that we’re going to do this entire process N times and average the accuracy. For example, in 10-fold cross-validation, we’ll make the validation set the first 10% of the data and calculate accuracy, precision, recall and F1 score. We’ll then make the validation set the second 10% of the data and calculate these statistics again. We can do this process 10 times, and every time the validation set will be a different chunk of the data. If we then average all of the accuracies, we will have a better sense of how our model does on average.
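
The loop described above might look like the sketch below. To stay self-contained it uses a toy "model" that always predicts the most common training label; that model, the helper names, and the data are all assumptions for this example, and any real classifier could be passed in as `evaluate` instead.

```python
from collections import Counter

def n_fold_indices(n_points, n_folds):
    """Yield (start, end) bounds of each contiguous validation chunk."""
    for fold in range(n_folds):
        yield fold * n_points // n_folds, (fold + 1) * n_points // n_folds

def cross_validate(data, n_folds, evaluate):
    """Average evaluate(training_set, validation_set) over all folds."""
    scores = []
    for start, end in n_fold_indices(len(data), n_folds):
        validation_set = data[start:end]
        # Everything outside the current chunk is used for training.
        training_set = data[:start] + data[end:]
        scores.append(evaluate(training_set, validation_set))
    return sum(scores) / len(scores)

def majority_accuracy(training_set, validation_set):
    """Toy model: always predict the most common training label."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    hits = sum(1 for _, label in validation_set if label == majority)
    return hits / len(validation_set)

# Hypothetical dataset of 20 points: 14 labeled "a" and 6 labeled "b".
data = [(i, "b" if i % 10 in (3, 6, 9) else "a") for i in range(20)]
mean_accuracy = cross_validate(data, n_folds=10, evaluate=majority_accuracy)
print(mean_accuracy)
```

Individual folds here score either 50% or 100%, which shows why a single small validation set can be misleading and why the average is more informative. If the data is stored in a meaningful order, shuffling it once before calling `cross_validate` is usually advisable.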

Changing The Model / Test Set

Understanding the accuracy of your model is invaluable because you can begin to tune the parameters of your model to increase its performance. For example, in the K-Nearest Neighbors algorithm, you can watch what happens to accuracy as you increase or decrease K. (You can try out all of this in our K-Nearest Neighbors lesson!)
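
As a small sketch of this kind of tuning, the code below measures validation accuracy for several values of K on made-up one-dimensional data. The helper names and the data (which deliberately contain two noisy training labels) are hypothetical.

```python
from collections import Counter

def knn_classify(training_set, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(training_set, key=lambda item: abs(item[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(training_set, validation_set, k):
    """Fraction of validation points classified correctly for a given k."""
    hits = sum(1 for x, label in validation_set
               if knn_classify(training_set, x, k) == label)
    return hits / len(validation_set)

# Hypothetical (feature, label) data; the points at 4 and 6 are noisy.
training_set = [(1, "a"), (2, "a"), (3, "a"), (4, "b"),
                (6, "a"), (7, "b"), (8, "b"), (9, "b")]
validation_set = [(2.5, "a"), (4.2, "a"), (5.8, "b"), (8.5, "b")]

# Try several values of K and keep the one with the best validation accuracy.
results = {k: accuracy(training_set, validation_set, k) for k in (1, 3, 5)}
best_k = max(results, key=results.get)
print(results, best_k)
```

With K = 1 the two noisy training points mislead the classifier, while larger values of K outvote them on this particular data.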

Once you’re happy with your model’s performance, it is time to introduce the test set. This is part of your data that you partitioned away at the very start of your experiment. It’s meant to be a substitute for the data in the real world that you’re actually interested in classifying. It functions very similarly to the validation set, except you never touched this data while building or tuning your model. By finding the accuracy, precision, recall, and F1 score on the test set, you get a good understanding of how well your algorithm will do in the real world.
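
One way to set the test set aside at the very start is sketched below. The article gives no test-set proportion, so the 60/20/20 split, the function name `three_way_split`, and the stand-in data are assumptions for illustration only.

```python
import random

def three_way_split(data, validation_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, carve off the test set first, then the validation set."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    # The test slice is taken before anything else and never revisited.
    test_set = shuffled[:n_test]
    validation_set = shuffled[n_test:n_test + n_validation]
    training_set = shuffled[n_test + n_validation:]
    return training_set, validation_set, test_set

# Stand-in dataset of 100 points.
data = list(range(100))
training_set, validation_set, test_set = three_way_split(data)
print(len(training_set), len(validation_set), len(test_set))  # 60 20 20
```

All model building and tuning would then use only `training_set` and `validation_set`, with `test_set` scored once at the end.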

Translated from: https://medium.com/@vinaykumarpaspula/splitting-a-data-set-into-training-validation-and-test-sets-f1654b7574c
