Purpose of test data in supervised learning?

2019-09-02 17:50发布

问题:

So this question may seem a little stupid but I couldn't wrap my head around it. What is the purpose of test data? Is it only to calculate accuracy of the classifier? I'm using Naive Bayes for sentiment analysis of tweets. Once I train my classifier using training data, I use test data just to calculate accuracy of the classifier. How can I use the test data to improve classifier's performance?

回答1:

In doing general supervised machine learning, the test data set plays a critical role in determining how well your model is performing. You typically will build a model with say 90% of your input data, leaving 10% aside for testing. You then check the accuracy of that model by seeing how well it does against the 10% training set. The performance of the model against the test data is meaningful because the model has never "seen" this data. If the model be statistically valid, then it should perform well on both the training and test data sets. This general procedure is called cross validation and you can read more about it here.



回答2:

You don't -- like you surmise, the test data is used for testing, and mustn't be used for anything else, lest you skew your accuracy measurements. This is an important cornerstone of any machine learning -- you only fool yourself if you use your test data for training.

If you are considering desperate measures like that, the proper way forward is usually to re-examine your problem space and the solution you have. Does it adequately model the problem you are trying to solve? If not, can you devise a better model which captures the essence of the problem?

Machine learning is not a silver bullet. It will not solve your problem for you. Too many failed experiments prove over and over again, "garbage in -- garbage out".