This is a copy of someone else's question on another forum that was never answered, so I thought I'd re-ask it here, as I have the same issue. (See http://geekple.com/blogs/feeds/Xgzu7/posts/351703064084736)
I have Spark installed properly on my machine and am able to run python programs with the pyspark modules without error when using ./bin/pyspark as my python interpreter.
However, when I start the regular Python shell and try to import pyspark modules:

    from pyspark import SparkContext

it says:

    No module named pyspark
How can I fix this? Is there an environment variable I need to set to point Python to the pyspark headers/libraries/etc.? If my spark installation is /spark/, which pyspark paths do I need to include? Or can pyspark programs only be run from the pyspark interpreter?
Turns out that the pyspark bin is LOADING python and automatically loading the correct library paths. Check out $SPARK_HOME/bin/pyspark:
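The relevant part is an export along these lines (a sketch rather than the exact script text, which varies between Spark versions; it assumes SPARK_HOME already points at your Spark install, e.g. /spark from the question):

    # Add the PySpark classes to the Python path
    export PYTHONPATH="$SPARK_HOME/python/:$PYTHONPATH"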
I added this line to my .bashrc file and the modules are now correctly found!
This is what I did to use my Anaconda distribution with Spark. It is Spark-version independent. You can change the first line to your user's Python bin. Also, as of Spark 2.2.0, PySpark is available as a stand-alone package on PyPI, but I have yet to test it out.
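A sketch of what that looks like in .bashrc (the Anaconda path and the py4j zip name below are assumptions; adjust them to your installation):

    # First line: point PATH at your (Anaconda) Python's bin directory
    export PATH="$HOME/anaconda3/bin:$PATH"
    # Make the PySpark sources and the bundled py4j importable
    export PYTHONPATH="$SPARK_HOME/python:$PYTHONPATH"
    export PYTHONPATH="$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH"

The PyPI route mentioned above would instead just be pip install pyspark.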
If importing pyspark still prints an error, please also add $SPARK_HOME/python/build to PYTHONPATH:
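That is, something along these lines (the SPARK_HOME value is illustrative):

    export SPARK_HOME=/spark
    export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH"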
I am running a Spark cluster on a CentOS VM, installed from Cloudera yum packages. I had to set the following variables to run pyspark:
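A sketch of those variables, assuming the Cloudera packages' usual /usr/lib/spark location and an illustrative py4j version:

    export SPARK_HOME=/usr/lib/spark
    export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH"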
On Mac, I use Homebrew to install Spark (formula "apache-spark"). Then, I set the PYTHONPATH this way so the Python import works:
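Something like the following (a sketch; the Cellar prefix and layout can differ between Homebrew setups):

    export SPARK_HOME=/usr/local/Cellar/apache-spark/1.2.0
    export PYTHONPATH="$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH"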
Replace the "1.2.0" with the actual apache-spark version on your mac.
For a Spark execution in pyspark, two components are required to work together:

- the pyspark Python package
- a Spark instance in a JVM

When launching things with spark-submit or pyspark, these scripts take care of both, i.e. they set up your PYTHONPATH, PATH, etc., so that your script can find pyspark, and they also start the Spark instance, configuring it according to your params, e.g. --master X.
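For example, launching through the scripts looks like this (the script name and master URL are placeholders):

    # spark-submit sets up PYTHONPATH and starts the Spark instance for you
    spark-submit --master "local[4]" myscript.py
    # the interactive equivalent
    pyspark --master "local[4]"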
Alternatively, it is possible to bypass these scripts and run your Spark application directly in the Python interpreter, like

    python myscript.py

This is especially interesting when Spark scripts start to become more complex and eventually receive their own args. In that case the script itself has to make pyspark importable (as discussed above) and instantiate the Spark session with getOrCreate() from the builder object. Your script can therefore have something like this:
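A minimal sketch, assuming pyspark is already importable as described above (the master URL, app name, and final sanity check are illustrative, not from the original answer):

    import os
    from pyspark.sql import SparkSession

    # Options that would normally go on the pyspark/spark-submit command line
    # can be passed through PYSPARK_SUBMIT_ARGS before the JVM is started.
    os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[4] pyspark-shell"

    # Build (or reuse) the Spark session from inside the script itself.
    spark = (SparkSession.builder
             .appName("myscript")
             .getOrCreate())

    print(spark.range(10).count())  # quick check that the session is up
    spark.stop()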