可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I want to implement a AdaBoost model using scikit-learn (sklearn). My question is similar to another question but it is not totally the same. As far as I understand, the random_state variable described in the documentation is for randomly splitting the training and testing sets, according to the previous link. So if I understand correctly, my classification results should not be dependent on the seeds, is it correct? Should I be worried if my classification results turn out to be dependent on the random_state variable?

回答1:

Your classification scores will depend on random_state. As @Ujjwal rightly said, it is used for splitting the data into training and test test. Not just that, a lot of algorithms in scikit-learn use the random_state to select the subset of features, subsets of samples, and determine the initial weights etc.

For eg.

Tree based estimators will use the random_state for random selections of features and samples (like DecisionTreeClassifier, RandomForestClassifier).
In clustering estimators like Kmeans, random_state is used to initialize centers of clusters.
SVMs use it for initial probability estimation
Some feature selection algorithms also use it for initial selection
And many more...

Its mentioned in the documentation that:

If your code relies on a random number generator, it should never use functions like numpy.random.random or numpy.random.normal. This approach can lead to repeatability issues in tests. Instead, a numpy.random.RandomState object should be used, which is built from a random_state argument passed to the class or function.

Do read the following questions and answers for better understanding:

Choosing random_state for sklearn algorithms
confused about random_state in decision tree of scikit learn

回答2:

It does matter. When your training set differs then your trained state also changes. For a different subset of data you can end up with a classifier which is little different from the one trained with some other subset.

Hence, you should use a constant seed like 0 or another integer, so that your results are reproducible.