I want to implement an AdaBoost model using scikit-learn (sklearn). My question is similar to another question, but it is not exactly the same. As far as I understand, the random_state variable described in the documentation is used for randomly splitting the data into training and test sets, according to the previous link. If that is correct, my classification results should not depend on the seed, right? Should I be worried if my classification results turn out to depend on the random_state variable?
Your classification scores will depend on `random_state`. As @Ujjwal rightly said, it is used for splitting the data into training and test sets. Not just that: many algorithms in scikit-learn use `random_state` to select subsets of features, subsets of samples, initial weights, etc. For example:

- Tree-based estimators use `random_state` for random selection of features and samples (e.g. `DecisionTreeClassifier`, `RandomForestClassifier`).
- Clustering estimators like `KMeans` use `random_state` to initialize cluster centers.
- SVMs use it for initial probability estimation.
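To see this in practice, here is a minimal sketch that holds the train/test split fixed and varies only the estimator's own `random_state`. The synthetic dataset and the parameter values (`n_samples`, `n_estimators`, the seeds tried) are illustrative choices, not from the original question:

```python
# Sketch: check whether AdaBoost scores vary with random_state
# when the train/test split is held fixed.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Fix the split so only the estimator's own randomness can differ.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = []
for seed in (0, 1, 2):
    clf = AdaBoostClassifier(n_estimators=50, random_state=seed)
    clf.fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))

# Scores may differ (often only slightly) across seeds, because
# random_state affects the estimator itself, not just data splitting.
print(scores)
```

Depending on the base estimator and dataset, the variation may be small or even zero, but in general you should not assume the scores are seed-independent.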
It's mentioned in the documentation that:
Do read the following questions and answers for better understanding:
It does matter. When your training set differs, your trained model also changes. With a different subset of the data you can end up with a classifier that is a little different from one trained on some other subset.

Hence, you should use a constant seed like `0` or another fixed integer, so that your results are reproducible.
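As a quick sanity check of that advice, a sketch like the following (with an illustrative synthetic dataset and assumed parameter values) shows that two fits with the same `random_state` on the same data produce identical predictions:

```python
# Sketch: a fixed random_state makes results reproducible.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Illustrative synthetic dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Same seed, same data -> identical trained models.
a = AdaBoostClassifier(n_estimators=25, random_state=42).fit(X, y)
b = AdaBoostClassifier(n_estimators=25, random_state=42).fit(X, y)

same = (a.predict(X) == b.predict(X)).all()
print(same)  # True
```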