I made a random forest model using Python's sklearn package, where I set the seed to, for example, 1234. To productionise models, we use pyspark. If I were to pass the same hyperparameters and the same seed value, i.e. 1234, would I get the same results?
Basically, do random seeds work the same way across different systems?
Well, this is exactly the kind of question that could really do with some experiments & code snippets provided...
Anyway, it seems that the general answer is a firm no: not only between Python and Spark MLlib, but even between Spark sub-modules, or between Python and NumPy...
Here is some reproducible code, run in the Databricks community cloud (where pyspark is already imported and the relevant contexts initialized). And here are the results:
Native Python 3:
NumPy:
Spark SQL:
Spark MLlib:
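A minimal sketch of the native-Python-vs-NumPy part of such an experiment, using the question's seed (1234); the Spark SQL and Spark MLlib parts need a live cluster and are omitted here. The exact draws are whatever your interpreter produces, but the pattern below holds:

```python
import random

import numpy as np

SEED = 1234

# Native Python 3: a Mersenne twister sits behind the `random` module
random.seed(SEED)
py_draw = random.random()

# NumPy: also a Mersenne twister (legacy RandomState), but seeded differently
np.random.seed(SEED)
np_draw = np.random.random()

# Re-seeding reproduces each generator's own stream exactly...
random.seed(SEED)
assert random.random() == py_draw

# ...but the two libraries do not agree with each other,
# even though both are MT19937 underneath
print(py_draw, np_draw, py_draw == np_draw)
```

So each generator is perfectly reproducible with itself, yet the two streams differ, because the libraries turn the same integer seed into different internal states.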
Of course, even if the above results were identical, this would be no guarantee that the results of, say, Random Forest in scikit-learn would exactly match those of pyspark's Random Forest...
Despite the negative answer, I really cannot see how this affects the deployment of any ML system: if the results depend crucially on the RNG, then something is definitely not right...
Yes, (pseudo)random number generators are completely deterministic and always return the same output given the same input. That is, of course, provided the environment that generated the random numbers is the same across systems (there may be differences between versions).
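A quick sketch of that determinism, using two independent Python generator instances with the same seed:

```python
import random

# Two independent generator instances with the same seed
# produce identical streams, deterministically, on every run.
g1 = random.Random(1234)
g2 = random.Random(1234)
seq1 = [g1.random() for _ in range(5)]
seq2 = [g2.random() for _ in range(5)]
print(seq1 == seq2)  # True
```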
In the old days portability of PRNGs was not a given. Differences in machine architecture, overflow handling, and implementation differences for both the algorithm being used and the language it was being implemented in meant that results could and did vary, even if they were nominally based on the same mathematical formulation. In 1979 Schrage (see page 1194 here) created a portable prime-modulus multiplicative linear congruential generator and showed that it could be implemented in a machine and language independent way "...as long as the machine can represent all integers in the interval -2^31 to 2^31 - 1." He gave a specific check that implementers could use to test their implementation, specifying what the 1000th outcome should be given a particular seed value. Since Schrage's work, designing algorithms to be platform and language independent has become the norm.
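Schrage's own 1000th-outcome check value is not reproduced here, but a closely related and widely cited check comes from Park & Miller's 1988 CACM paper on the same prime-modulus multiplicative generator (a = 16807, m = 2^31 - 1): starting from seed 1, the 10,000th value must be 1043618065 on every conforming system. A sketch, using Schrage's decomposition to keep every intermediate product within the representable range:

```python
# Minimal-standard multiplicative LCG: z <- (16807 * z) mod (2**31 - 1),
# computed with Schrage's decomposition so no intermediate overflows
# a signed 32-bit integer.
M = 2**31 - 1        # prime modulus
A = 16807            # multiplier
Q, R = divmod(M, A)  # Q = 127773, R = 2836

def step(z: int) -> int:
    # Schrage: A*(z mod Q) - R*(z div Q) is congruent to A*z (mod M)
    z = A * (z % Q) - R * (z // Q)
    return z + M if z <= 0 else z

# Park & Miller's portability check: 10,000 steps from z = 1
z = 1
for _ in range(10000):
    z = step(z)
print(z)  # 1043618065
```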
Python's default generator is a Mersenne twister, and a variety of platform and language independent MT implementations are available on the Mersenne Twister home page. If Python switches its default generator in the future, then compatibility is not guaranteed unless you use one of the independent Python implementations available from the link above.
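In the same spirit, if long-term reproducibility matters you can pin a specific generator implementation rather than relying on a library default. A NumPy sketch: its modern API exposes MT19937 explicitly, while np.random.default_rng currently uses PCG64, so the same seed yields different streams from the two algorithms:

```python
import numpy as np

# Pin the bit generator explicitly (MT19937 is NumPy's Mersenne twister)
mt_draw = np.random.Generator(np.random.MT19937(1234)).random()

# The library default (currently PCG64) with the same seed
pcg_draw = np.random.default_rng(1234).random()

# Same seed, different algorithms, different streams
print(mt_draw, pcg_draw)
```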