I configured Eclipse to develop with Spark and Python. Specifically, I configured:

1. PyDev with the Python interpreter
2. PyDev with the Spark Python sources
3. PyDev with the Spark environment variables
This is my Libraries configuration:

And this is my Environment configuration:
I created a project named CompensationStudy, and I want to run a small example to be sure that everything goes smoothly.
This is my code :
from pyspark import SparkConf, SparkContext
import os

# run Spark locally under the app name "WordCounts"
sparkConf = SparkConf().setAppName("WordCounts").setMaster("local")
sc = SparkContext(conf=sparkConf)

# classic word count over the README shipped with Spark
textFile = sc.textFile(os.environ["SPARK_HOME"] + "/README.md")
wordCounts = (textFile.flatMap(lambda line: line.split())
                      .map(lambda word: (word, 1))
                      .reduceByKey(lambda a, b: a + b))

for wc in wordCounts.collect():
    print(wc)
But I got this error: `ImportError: No module named py4j.protocol`
Logically, all of PySpark's library dependencies, including Py4J, should be picked up automatically when I configure PyDev with the Spark Python sources. So, what's wrong here? Is there just a problem with my log4j.properties file? Help please!
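One quick way to check whether the interpreter that PyDev launches can actually see py4j is a small probe like this (the helper name `py4j_visible` is just for illustration):

```python
import importlib.util

def py4j_visible():
    # returns the location of py4j if this interpreter can import it, else None
    spec = importlib.util.find_spec("py4j")
    return spec.origin if spec else None

location = py4j_visible()
print(location if location else "py4j is NOT importable from this interpreter")
```

Running it inside PyDev tells you whether the problem is the interpreter's path configuration rather than the Spark code itself.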
I had a similar error. After installing py4j, I was able to execute without the error.
Are you able to run it from the command line? I think the first step would be taking the IDE out of the question: try to get everything running with the proper environment variables in the command line (maybe asking the pyspark community for help). Once that's running, compare the environment variables in that run to the ones in the IDE (create a program which prints the environment variables, run it in the console and then in the IDE, and check the difference).
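The "print your environment" program suggested above can be as small as this (the variable list is just a reasonable starting set; add any others your setup relies on):

```python
import os

def env_report(names=("SPARK_HOME", "PYTHONPATH", "PATH")):
    # one line per variable, "<not set>" when the variable is missing
    return ["%s=%s" % (name, os.environ.get(name, "<not set>")) for name in names]

# run this once from the console and once via Run in PyDev, then diff the output
for line in env_report():
    print(line)
```

Any variable that is set in the console run but `<not set>` (or different) in the PyDev run is a likely culprit.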
One note (which is probably not the issue, but still...): from your screenshot, it seems that your project configuration has `/CompensationStudy` added to the PYTHONPATH, yet you seem to be putting your code in `/CompensationStudy/src` (so, you should edit your project configuration to only put `/CompensationStudy/src` in the PYTHONPATH).