Importing PySpark packages

Published 2019-02-26 09:22

Question:

I have downloaded the graphframes package (from here) and saved it on my local disk. Now, I would like to use it. So, I use the following command:

IPYTHON_OPTS="notebook --no-browser" pyspark --num-executors=4  --name gorelikboris_notebook_1  --py-files ~/temp/graphframes-0.1.0-spark1.5.jar --jars ~/temp/graphframes-0.1.0-spark1.5.jar --packages graphframes:graphframes:0.1.0-spark1.5

All the pyspark functionality works as expected, except for the new graphframes package: whenever I try to import graphframes, I get an ImportError. When I examine sys.path, I can see the following two paths:

/tmp/spark-1eXXX/userFiles-9XXX/graphframes_graphframes-0.1.0-spark1.5.jar and /tmp/spark-1eXXX/userFiles-9XXX/graphframes-0.1.0-spark1.5.jar. However, these files don't exist; moreover, the /tmp/spark-1eXXX/userFiles-9XXX/ directory is empty.

What am I missing?

Answer 1:

This might be an issue with Spark packages and Python in general. Someone else asked about it earlier on the Spark user discussion list as well.

My workaround is to unpack the jar, find the Python code embedded in it, and move that code into a subdirectory called graphframes.

For instance, I run pyspark from my home directory:

~$ ls -lart
drwxr-xr-x 2 user user   4096 Feb 24 19:55 graphframes

~$ ls graphframes/
__init__.pyc  examples.pyc  graphframe.pyc  tests.pyc

You would not need the --py-files or --jars parameters then; something like

IPYTHON_OPTS="notebook --no-browser" pyspark --num-executors=4 --name gorelikboris_notebook_1 --packages graphframes:graphframes:0.1.0-spark1.5

and having the Python code in the graphframes directory should work.
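Since a jar is just a zip archive, the extraction step of this workaround can be scripted with the standard library. The sketch below is an illustration of the approach, not code from the answer; the internal layout it assumes (a `graphframes/` package of `.py`/`.pyc` files inside the jar) matches the listing shown above.

```python
import os
import zipfile

def extract_python_from_jar(jar_path, dest="."):
    """Extract the Python package embedded in a Spark-packages jar.

    A jar is just a zip archive, so zipfile can read it directly.
    The "graphframes/" prefix is an assumption based on the directory
    listing shown in the answer above.
    """
    extracted = []
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            # keep only the Python sources shipped inside the jar,
            # skipping class files, manifests, etc.
            if name.startswith("graphframes/") and name.endswith((".py", ".pyc")):
                jar.extract(name, dest)
                extracted.append(name)
    return extracted
```

Running it against the downloaded jar (path adjusted to wherever yours lives) would recreate the `graphframes/` subdirectory next to wherever you launch pyspark from.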



Answer 2:

In my case:

1. cd /home/zh/.ivy2/jars

2. jar xf graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar

3. Add /home/zh/.ivy2/jars to PYTHONPATH in spark-env.sh, as in the line below:

export PYTHONPATH=$PYTHONPATH:/home/zh/.ivy2/jars:.


Answer 3:

Add these lines to your $SPARK_HOME/conf/spark-defaults.conf:

spark.executor.extraClassPath file_path/jar1:file_path/jar2

spark.driver.extraClassPath file_path/jar1:file_path/jar2
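As a concrete illustration (the paths below are hypothetical; substitute wherever your jars actually live), the two entries might look like:

```
spark.executor.extraClassPath /home/user/jars/graphframes-0.1.0-spark1.5.jar
spark.driver.extraClassPath   /home/user/jars/graphframes-0.1.0-spark1.5.jar
```

Multiple jars are separated with `:` on the classpath, as in the template above. Note that these settings place the jar on the JVM classpath only; the Python side still needs the package importable, as the other answers describe.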



Answer 4:

In the more general case of importing an 'orphan' Python file (one outside the current folder and not part of a properly installed package), use addPyFile, e.g.:

sc.addPyFile('somefolder/graphframe.zip')

addPyFile(path): Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
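Building the zip that addPyFile ships to the executors is a one-liner with the standard library. The helper below is a hedged sketch (the file and archive names are illustrative, not from the answer); the key detail is storing the module at the archive root so that `import graphframe` resolves on the workers.

```python
import os
import zipfile

def build_py_zip(py_file, zip_path):
    """Package a single 'orphan' Python file into a zip suitable
    for SparkContext.addPyFile().

    File names here are illustrative; adjust to your module.
    """
    with zipfile.ZipFile(zip_path, "w") as z:
        # store the file at the archive root so a plain
        # `import <module>` works on the executors
        z.write(py_file, arcname=os.path.basename(py_file))
    return zip_path
```

With the archive built, the call from the answer applies unchanged on a live SparkContext: `sc.addPyFile('somefolder/graphframe.zip')`, after which the executors can import the module by name.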