I understand that random_state
is used in various sklearn algorithms to break tie between different predictors (trees) with same metric value (say for example in GradientBoosting
). But the documentation does not clarify or detail on this. Like
1 ) where else are these seeds used for random number generation ? Say for RandomForestClassifier
, random number can be used to find a set of random features to build a predictor. Algorithms which use sub sampling, can use random numbers to get different sub samples. Can/Is the same seed (random_state
) playing a role in multiple random number generations ?
What I am mainly concerned about is
2) how far reaching is the effect of this random_state variable. ? Can the value make a big difference in prediction (classification or regression). If yes, what kind of data sets should I care for more ? Or is it more about stability than quality of results?
3) If it can make a big difference, how best to choose that random_state?. Its a difficult one to do GridSearch on, without an intuition. Specially if the data set is such that one CV can take an hour.
4) If the motive is to only have steady result/evaluation of my models and cross validation scores across repeated runs, does it have the same effect if I set random.seed(X)
before I use any of the algorithms (and use random_state
as None).
5) Say I am using a random_state
value on a GradientBoosted Classifier, and I am cross validating to find the goodness of my model (scoring on the validation set every time). Once satisfied, I will train my model on the whole training set before I apply it on the test set. Now, the full training set has more instances than the smaller training sets in the cross validation. So the random_state
value can now result in a completely different behavior (choice of features and individual predictors) when compared to what was happening within the cv loop. Similarly things like min samples leaf etc can also result in a inferior model now that the settings are w.r.t the number of instances in CV while the actual number of instances is more. Is this a correct understanding ? What is the approach to safeguard against this ?
Yes, the choice of the random seeds will impact your prediction results and as you pointed out in your fourth question, the impact is not really predictable.
The common way to guard against predictions that happen to be good or bad just by chance is to train several models (based on different random states) and to average their predictions in a meaningful way. Similarly, you can see cross validation as a way to estimate the "true" performance of a model by averaging the performance over multiple training/test data splits.
1 ) where else are these seeds used for random number generation ? Say for RandomForestClassifier , random number can be used to find a set of random features to build a predictor. Algorithms which use sub sampling, can use random numbers to get different sub samples. Can/Is the same seed (random_state) playing a role in multiple random number generations ?
random_state
is used wherever randomness is needed:
If your code relies on a random number generator, it should never use functions like numpy.random.random
or numpy.random.normal
. This approach can lead to repeatability issues in unit tests. Instead, a numpy.random.RandomState
object should be used, which is built from a random_state
argument passed to the class or function.
2) how far reaching is the effect of this random_state variable. ? Can the value make a big difference in prediction (classification or regression). If yes, what kind of data sets should I care for more ? Or is it more about stability than quality of results?
Good problems should not depend too much on the random_state
.
3) If it can make a big difference, how best to choose that random_state?. Its a difficult one to do GridSearch on, without an intuition. Specially if the data set is such that one CV can take an hour.
Do not choose it. Instead try to optimize the other aspects of classification to achieve good results, regardless of random_state
.
4) If the motive is to only have steady result/evaluation of my models and cross validation scores across repeated runs, does it have the same effect if I set random.seed(X) before I use any of the algorithms (and use random_state as None).
As of Should I use `random.seed` or `numpy.random.seed` to control random number generation in `scikit-learn`?, random.seed(X)
is not used by sklearn. If you need to control this, you could set np.random.seed()
instead.
5) Say I am using a random_state value on a GradientBoosted Classifier, and I am cross validating to find the goodness of my model (scoring on the validation set every time). Once satisfied, I will train my model on the whole training set before I apply it on the test set. Now, the full training set has more instances than the smaller training sets in the cross validation. So the random_state value can now result in a completely different behavior (choice of features and individual predictors) when compared to what was happening within the cv loop. Similarly things like min samples leaf etc can also result in a inferior model now that the settings are w.r.t the number of instances in CV while the actual number of instances is more. Is this a correct understanding ? What is the approach to safeguard against this ?
How can I know training data is enough for machine learning's answers mostly state that the more data the better.
If you do a lot of model-selection, maybe Sacred can help, too. Among other things, it sets and can log the random seed for each evaluation, f.ex.:
>>./experiment.py with seed=123