This is a copy of someone else's question on another forum that was never answered, so I thought I'd re-ask it here, as I have the same issue. (See http://geekple.com/blogs/feeds/Xgzu7/posts/351703064084736)
I have Spark installed properly on my machine and can run Python programs that use the pyspark modules without error when using ./bin/pyspark as my Python interpreter.
However, when I start the regular Python shell and try to import pyspark modules, the import fails. I type:
from pyspark import SparkContext
and it says
"No module named pyspark".
How can I fix this? Is there an environment variable I need to set to point Python to the pyspark headers/libraries/etc.? If my spark installation is /spark/, which pyspark paths do I need to include? Or can pyspark programs only be run from the pyspark interpreter?
For Linux users, the correct (and non-hard-coded) way of including the pyspark library in PYTHONPATH needs two path entries, and both are necessary: the pyspark Python package itself and the zipped library it relies on when imported.
The zipped library's version should be determined dynamically, so that it is not hard-coded.
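As a rough sketch of that idea in Python rather than shell, the snippet below collects both kinds of entries and resolves the zip names with a glob instead of a fixed version. The helper names spark_python_paths and add_spark_to_sys_path are made up for illustration, and the sketch assumes the zips live under python/lib inside your Spark directory and that SPARK_HOME is already set.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the path entries pyspark needs under a Spark install (hypothetical helper)."""
    python_dir = os.path.join(spark_home, "python")
    # Pick up whatever zipped libraries ship in python/lib (for example the
    # Py4J source zip) without hard-coding a version number.
    zips = sorted(glob.glob(os.path.join(python_dir, "lib", "*.zip")))
    return [python_dir] + zips


def add_spark_to_sys_path(spark_home):
    """Prepend the Spark Python paths to sys.path, skipping entries already present."""
    added = [p for p in spark_python_paths(spark_home) if p not in sys.path]
    sys.path[:0] = added
    return added


if __name__ == "__main__":
    # Assumption: SPARK_HOME points at your Spark install directory.
    spark_home = os.environ.get("SPARK_HOME")
    if spark_home:
        # The joined string is in the form PYTHONPATH expects.
        print(os.pathsep.join(spark_python_paths(spark_home)))
```

Calling add_spark_to_sys_path(os.environ["SPARK_HOME"]) at the top of a script before importing pyspark should have the same effect for that one process as setting PYTHONPATH beforehand.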
I got this error because the Python script I was trying to submit was called pyspark.py (facepalm), so it shadowed the real pyspark package. The fix was to set my PYTHONPATH as recommended above, rename the script to pyspark_test.py, and delete the pyspark.pyc that had been created from the script's original name. That cleared the error up.
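If you suspect the same naming clash, a small check along these lines may help. It is only a sketch: find_shadowing_files is a hypothetical helper, and it looks only for a pyspark.py or pyspark.pyc file in one directory, not for a local folder named pyspark.

```python
import os


def find_shadowing_files(directory, module_name="pyspark"):
    """List files in `directory` whose names would shadow `module_name` on import."""
    suspects = {module_name + ".py", module_name + ".pyc"}
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name in suspects
    )


if __name__ == "__main__":
    # Check the directory you launch your script from.
    for path in find_shadowing_files(os.getcwd()):
        print("Rename or delete:", path)
```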
On Windows 10 the following worked for me. I added the following environment variables using Settings > Edit environment variables for your account:
(Change "C:\Programming\..." to the folder in which you have installed Spark.)
Here is a simple method (if you don't care how it works!).
Install findspark from a terminal:
pip install findspark
Then go to your Python shell and initialize it:
import findspark
findspark.init()
Import the necessary modules:
from pyspark import SparkContext
from pyspark import SparkConf
Done!
It started to work once I exported the Spark path and the Py4J path in my shell. If you don't want to type these exports every time you fire up the Python shell, you might want to add them to your .bashrc file.
To get rid of ImportError: No module named py4j.java_gateway, the Py4J source zip from your Spark installation's python/lib directory needs to be on PYTHONPATH as well.
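To see which of the two packages Python still cannot find after changing your paths, a quick check like the one below may be useful. It is a minimal sketch: missing_modules is a hypothetical helper, and it only tests whether the top-level packages can be located, not whether Spark itself starts.

```python
import importlib.util


def missing_modules(names=("pyspark", "py4j")):
    """Return the top-level modules from `names` that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Not importable yet:", ", ".join(missing))
    else:
        print("pyspark and py4j can both be found")
```

If py4j is reported but pyspark is not, that would suggest the python directory is on the path while the zip under python/lib is not.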