I am trying to write a very simple program using Spark in PyCharm, and my OS is Windows 8. I have been dealing with several problems, which I somehow managed to fix, except for one. When I run the code using pyspark.cmd everything works smoothly, but I have had no luck with the same code in PyCharm. There was a problem with the SPARK_HOME variable, which I fixed using the following code:
import sys
import os

# Point Spark at the prebuilt distribution and make its Python bindings importable
os.environ['SPARK_HOME'] = "C:/Spark/spark-1.4.1-bin-hadoop2.6"
sys.path.append("C:/Spark/spark-1.4.1-bin-hadoop2.6/python")
sys.path.append("C:/Spark/spark-1.4.1-bin-hadoop2.6/python/pyspark")
So now I can import pyspark and everything is fine:
from pyspark import SparkContext
The problem arises when I want to run the rest of my code:
logFile = "C:/Spark/spark-1.4.1-bin-hadoop2.6/README.md"
sc = SparkContext()
logData = sc.textFile(logFile).cache()
logData.count()
Then I receive the following error:
15/08/27 12:04:15 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Cannot run program "python": CreateProcess error=2, The system cannot find the file specified
I have added the Python path as an environment variable, and everything works properly from the command line, but I could not figure out what my problem is with this code. Any help or comment is much appreciated.
Thanks
I had the same problem as you, and I fixed it with the following change: set PYSPARK_PYTHON as an environment variable pointing to python.exe in PyCharm's Edit Configurations. Here is my example:
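A sketch of the equivalent setting done directly in the script (the C:/Python27/python.exe path is just a placeholder for wherever your interpreter actually lives):

import os

# Placeholder path: point this at the python.exe that the Spark worker processes should run
os.environ["PYSPARK_PYTHON"] = "C:/Python27/python.exe"

This has to run before the SparkContext is created.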
I had to set SPARK_PYTHONPATH as an environment variable pointing to the python.exe file, in addition to the PYTHONPATH and SPARK_HOME variables.

After struggling with this for two days, I figured out what the problem was. I added the following entries to the "PATH" variable as a Windows environment variable:
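The exact entries depend on where things are installed. As a hypothetical sketch, assuming Python 2.7 lives in C:/Python27 and Spark is at the path from the question, the same effect can also be achieved in code before the SparkContext is created:

import os

# Hypothetical install locations; substitute your own Python and Spark directories
os.environ["PATH"] = ("C:/Python27" + os.pathsep +
                      "C:/Spark/spark-1.4.1-bin-hadoop2.6/bin" + os.pathsep +
                      os.environ["PATH"])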
Remember, you need to change the directory to wherever your Spark is installed, and the same goes for Python. I should also mention that I am using a prebuilt version of Spark which has Hadoop included.
Best of luck to you all.
I have faced this problem; it is caused by Python version conflicts between the different nodes of the cluster, so it can be solved by pointing the Spark Python environment variables to interpreters that are the same version on every node, and then starting Spark again.
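As a minimal sketch of the idea, assuming an interpreter installed at the placeholder path /usr/bin/python on every node (PYSPARK_PYTHON is normally set in conf/spark-env.sh on each node, but for illustration it can also be set from the driver script):

import os

# Placeholder path: an interpreter that exists, at the same version, on every node
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python"

from pyspark import SparkContext
sc = SparkContext()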