Dataproc: Jupyter PySpark notebook unable to import graphframes

Posted 2019-08-05 06:24

In a Dataproc Spark cluster, the graphframes package is available in spark-shell but not in the Jupyter PySpark notebook.

PySpark kernel config:

PACKAGES_ARG='--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11'
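One way to see what the kernel actually handed to Spark is to inspect the submit args and the resolved configuration from inside the notebook. This is a hedged diagnostic sketch (not from the original post); it assumes the kernel spec injects --packages through PYSPARK_SUBMIT_ARGS and uses the sc SparkContext that the PySpark kernel provides:

import os

# Assumption: the kernel spec typically passes --packages via PYSPARK_SUBMIT_ARGS.
print(os.environ.get('PYSPARK_SUBMIT_ARGS'))

# The resolved Spark property, if the package request made it through.
# `sc` is the SparkContext preconfigured by the PySpark kernel.
print(sc.getConf().get('spark.jars.packages', 'not set'))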

Following is the command used to create the cluster:

gcloud dataproc clusters create my-dataproc-cluster \
    --properties spark.jars.packages=com.databricks:graphframes:graphframes:0.2.0-spark2.0-s_2.11 \
    --metadata "JUPYTER_PORT=8124,INIT_ACTIONS_REPO=https://github.com/{xyz}/dataproc-initialization-actions.git" \
    --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh \
    --num-workers 2 \
    --properties spark:spark.executorEnv.PYTHONHASHSEED=0,spark:spark.yarn.am.memory=1024m \
    --worker-machine-type=n1-standard-4 \
    --master-machine-type=n1-standard-4

2 Answers
Lonely孤独者° · answered 2019-08-05 06:38

This is an old bug with Spark shells and YARN that I thought was fixed in SPARK-15782, but apparently this case was missed.

The suggested workaround is adding

import os

# Ship the locally cached graphframes jar to the Python workers so the module can be imported.
sc.addPyFile(os.path.expanduser('~/.ivy2/jars/graphframes_graphframes-0.2.0-spark2.0-s_2.11.jar'))

before your import.
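After the addPyFile call, the import should resolve in the notebook. A minimal hedged check (GraphFrame is the main entry point exposed by the package):

# After sc.addPyFile above, the Python module should now be importable.
from graphframes import GraphFrame
print(GraphFrame)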

Explosion°爆炸 · answered 2019-08-05 06:48

I found another way to add packages which works in the Jupyter notebook:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Python Spark SQL") \
    .config("spark.jars.packages", "graphframes:graphframes:0.5.0-spark2.1-s_2.11") \
    .getOrCreate()
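If the package resolves when the session is created, a quick smoke test would look like the sketch below. The two-node graph is hypothetical; GraphFrames expects an id column for vertices and src/dst columns for edges.

from graphframes import GraphFrame

# Hypothetical smoke test: a two-node, one-edge graph built on the
# `spark` session created above.
vertices = spark.createDataFrame([("a",), ("b",)], ["id"])
edges = spark.createDataFrame([("a", "b")], ["src", "dst"])

g = GraphFrame(vertices, edges)
print(g.inDegrees.collect())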