Is there any way to set an environment variable on all nodes of an EMR cluster?
I am getting an error regarding the hash seed when trying to use reduceByKey() in PySpark on Python 3. I can see this is a known error, and that the environment variable PYTHONHASHSEED needs to be set to the same value on all nodes of the cluster, but I haven't had any luck with it.
I have tried adding a variable to spark-env through the cluster configuration:
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3",
          "PYTHONHASHSEED": "123"
        }
      }
    ]
  },
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
but this doesn't work. I have also tried adding a bootstrap script:
#!/bin/bash
export PYTHONHASHSEED=123
but this also doesn't seem to do the trick.
You could probably do it via the bootstrap script, but you'll need to do something like the sketch below, writing the export into the login shell's startup file (or possibly .profile), so that it's picked up by the Spark processes when they are launched.
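A sketch of that kind of bootstrap script (the target file /home/hadoop/.bashrc is an assumption):

#!/bin/bash
# Persist the variable in a startup file that later shells will source,
# rather than exporting it only inside the bootstrap shell itself.
echo "export PYTHONHASHSEED=123" >> /home/hadoop/.bashrc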
Your configuration looks reasonable though; it might be worth setting it in the hadoop-env section instead? From the Spark docs, where the available properties are listed, spark.executorEnv.[EnvironmentVariableName] adds the named environment variable to the executor process, so I think that's the property you want. The EMR docs for configuring spark-defaults.conf are here.
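Written as a spark-defaults.conf entry, that property would be a single line (a sketch; the value 123 just mirrors the question):

spark.executorEnv.PYTHONHASHSEED  123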
I believe that /usr/bin/python3 isn't picking up the environment variable PYTHONHASHSEED that you are defining in the cluster configuration under the spark-env scope. You ought to use python34 instead of /usr/bin/python3 and set the configuration as follows:
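A sketch, reusing the spark-env classification from the question with python34 swapped in (my reconstruction, not necessarily the exact original snippet):

[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "python34",
          "PYTHONHASHSEED": "123"
        }
      }
    ]
  }
]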
Now, let's test it. I define a bash script that calls both pythons and prints the hash seed each one sees (sketched below); the verdict is that python34 picks up PYTHONHASHSEED while /usr/bin/python3 does not.
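A sketch of such a test script (the exact probe is my reconstruction):

#!/bin/bash
# Show which PYTHONHASHSEED value each interpreter actually sees.
for interpreter in python34 /usr/bin/python3; do
    echo "using $interpreter"
    "$interpreter" -c 'import os; print(os.environ.get("PYTHONHASHSEED", "not set"))'
done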
PS1: I am using AMI release emr-4.8.2. PS2: Snippet inspired by this answer.
EDIT: I have tested the following using pyspark.
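A sketch of the kind of interactive check this refers to (the strings are arbitrary; sc is the SparkContext the shell provides):

# Hash the same strings on the executors and on the driver; with
# PYTHONHASHSEED pinned on every node, the values agree and stay stable
# across runs.
rdd = sc.parallelize(["foo", "bar", "baz"])
print(rdd.map(lambda s: (s, hash(s))).collect())
print([(s, hash(s)) for s in ["foo", "bar", "baz"]])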
Also created a simple application (simple_app.py), which also seems to work perfectly.
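A minimal sketch of what such an application could look like (the strings and app name are arbitrary placeholders):

from pyspark import SparkContext

# Hash a few strings on the cluster; with PYTHONHASHSEED fixed on every
# node, repeated submissions print the same values each time.
sc = SparkContext(appName="simple-app")
words = sc.parallelize(["foo", "bar", "baz"])
print(words.map(lambda w: (w, hash(w))).collect())
sc.stop()

Submitted in the usual way, e.g. spark-submit simple_app.py.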
The output (truncated) shows that it also works, returning the same hash each time.
EDIT 2: From the comments, it seems like you are trying to compute hashes on the executors and not on the driver, thus you'll need to set up spark.executorEnv.PYTHONHASHSEED inside your Spark application configuration so it can be propagated to the executors (it's one way to do it). Hence the following minimalist example with simple_app.py, sketched below. Now let's test it again; the truncated output shows the same consistent behaviour.
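A sketch of that minimalist variant (the seed value 123 mirrors the question; the rest is my reconstruction):

from pyspark import SparkConf, SparkContext

# Propagate the hash seed to every executor through the application
# configuration, so hash() behaves identically on all executors.
conf = SparkConf().set("spark.executorEnv.PYTHONHASHSEED", "123")
sc = SparkContext(appName="simple-app", conf=conf)

words = sc.parallelize(["foo", "bar", "baz"])
print(words.map(lambda w: (w, hash(w))).collect())  # hash() runs on the executors
sc.stop()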
I think that covers it all.
Just encountered the same problem; adding the following configuration solved it:
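A sketch of a configuration with that effect, reusing the spark.executorEnv property from the answer above (the exact classification used here is an assumption):

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executorEnv.PYTHONHASHSEED": "123"
    }
  }
]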
Be careful: we do not use yarn as a cluster manager; for the moment the cluster is only running Hadoop and Spark.
EDIT: Following Tim B's comment, this seems to also work with yarn installed as a cluster manager.