I made a random forest model using Python's sklearn package, where I set the seed to, for example, 1234. To productionise models, we use pyspark. If I were to pass the same hyperparameters and the same seed value, i.e. 1234, would I get the same results?
Basically, do random seeds work the same way across different systems?
Well, this is exactly the kind of question that could really do with some experiments & code snippets provided...
Anyway, it seems that the general answer is a firm no: not only between Python and Spark MLlib, but even between Spark sub-modules, or between Python and NumPy...
Here is some reproducible code, run in the Databricks community cloud (where pyspark is already imported and the relevant contexts initialized). And here are the results:
Native Python 3:
NumPy:
Spark SQL:
Spark MLlib:
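A minimal sketch of the native-Python-vs-NumPy part of such an experiment, using the question's seed (1234); the Spark SQL and Spark MLlib parts need a live cluster and are omitted here. The exact draws are whatever your interpreter produces, but the pattern below holds:

```python
import random

import numpy as np

SEED = 1234

# Native Python 3: a Mersenne twister sits behind the `random` module
random.seed(SEED)
py_draw = random.random()

# NumPy: also a Mersenne twister (legacy RandomState), but seeded differently
np.random.seed(SEED)
np_draw = np.random.random()

# Re-seeding reproduces each generator's own stream exactly...
random.seed(SEED)
assert random.random() == py_draw

# ...but the two libraries do not agree with each other,
# even though both are MT19937 underneath
print(py_draw, np_draw, py_draw == np_draw)
```

So each generator is perfectly reproducible with itself, yet the two streams differ, because the libraries turn the same integer seed into different internal states.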
Of course, even if the above results were identical, this would be no guarantee that the results of, say, Random Forest in scikit-learn would exactly match those of pyspark's Random Forest...
Despite the negative answer, I really cannot see how this affects the deployment of any ML system: if the results depend crucially on the RNG, then something is definitely not right...
Yes, (pseudo)random number generators are completely deterministic and always return the same output given the same input. That is, of course, provided the environment that generated the random numbers is the same across systems (there may be differences between versions).
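A quick sketch of that determinism, using two independent Python generator instances with the same seed:

```python
import random

# Two independent generator instances with the same seed
# produce identical streams, deterministically, on every run.
g1 = random.Random(1234)
g2 = random.Random(1234)
seq1 = [g1.random() for _ in range(5)]
seq2 = [g2.random() for _ in range(5)]
print(seq1 == seq2)  # True
```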
In the old days portability of PRNGs was not a given. Differences in machine architecture, overflow handling, and implementation differences for both the algorithm being used and the language it was being implemented in meant that results could and did vary, even if they were nominally based on the same mathematical formulation. In 1979 Schrage (see page 1194 here) created a portable prime-modulus multiplicative linear congruential generator and showed that it could be implemented in a machine and language independent way "...as long as the machine can represent all integers in the interval -2^31 to 2^31 - 1." He gave a specific check that implementers could use to test their implementation, specifying what the 1000th outcome should be given a particular seed value. Since Schrage's work, designing algorithms to be platform and language independent has become the norm.
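Schrage's own 1000th-outcome check value is not reproduced here, but a closely related and widely cited check comes from Park & Miller's 1988 CACM paper on the same prime-modulus multiplicative generator (a = 16807, m = 2^31 - 1): starting from seed 1, the 10,000th value must be 1043618065 on every conforming system. A sketch, using Schrage's decomposition to keep every intermediate product within the representable range:

```python
# Minimal-standard multiplicative LCG: z <- (16807 * z) mod (2**31 - 1),
# computed with Schrage's decomposition so no intermediate overflows
# a signed 32-bit integer.
M = 2**31 - 1        # prime modulus
A = 16807            # multiplier
Q, R = divmod(M, A)  # Q = 127773, R = 2836

def step(z: int) -> int:
    # Schrage: A*(z mod Q) - R*(z div Q) is congruent to A*z (mod M)
    z = A * (z % Q) - R * (z // Q)
    return z + M if z <= 0 else z

# Park & Miller's portability check: 10,000 steps from z = 1
z = 1
for _ in range(10000):
    z = step(z)
print(z)  # 1043618065
```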
Python's default generator is a Mersenne twister, and a variety of platform and language independent MT implementations are available on the Mersenne Twister home page. If Python switches its default generator in the future, then compatibility is not guaranteed unless you use one of the independent Python implementations available from the link above.
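In the same spirit, if long-term reproducibility matters you can pin a specific generator implementation rather than relying on a library default. A NumPy sketch: its modern API exposes MT19937 explicitly, while np.random.default_rng currently uses PCG64, so the same seed yields different streams from the two algorithms:

```python
import numpy as np

# Pin the bit generator explicitly (MT19937 is NumPy's Mersenne twister)
mt_draw = np.random.Generator(np.random.MT19937(1234)).random()

# The library default (currently PCG64) with the same seed
pcg_draw = np.random.default_rng(1234).random()

# Same seed, different algorithms, different streams
print(mt_draw, pcg_draw)
```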