importing pyspark in python shell

Posted 2019-01-03 12:36

This is a copy of someone else's question on another forum that was never answered, so I thought I'd re-ask it here, as I have the same issue. (See http://geekple.com/blogs/feeds/Xgzu7/posts/351703064084736)

I have Spark installed properly on my machine and am able to run python programs with the pyspark modules without error when using ./bin/pyspark as my python interpreter.

However, when I run the regular Python shell and try to import pyspark modules:

from pyspark import SparkContext

it fails with:

ImportError: No module named pyspark

How can I fix this? Is there an environment variable I need to set to point Python to the pyspark headers/libraries/etc.? If my spark installation is /spark/, which pyspark paths do I need to include? Or can pyspark programs only be run from the pyspark interpreter?

17 answers
迷人小祖宗 · 2019-01-03 12:53

For Linux users, the following is the correct (and non-hard-coded) way of including the pyspark library in PYTHONPATH. Both parts of the path are necessary:

  1. The path to the pyspark Python module itself, and
  2. The path to the zipped py4j library that the pyspark module relies on when imported

Notice below that the zipped library version is dynamically determined, so we do not hard-code it.

export PYTHONPATH=${SPARK_HOME}/python/:$(echo ${SPARK_HOME}/python/lib/py4j-*-src.zip):${PYTHONPATH}
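
For a single interpreter session, the same two paths can be added from inside Python instead of via the shell. A minimal sketch, assuming SPARK_HOME is already set in your environment:

    import glob
    import os
    import sys

    spark_home = os.environ["SPARK_HOME"]

    # The pyspark Python module itself
    sys.path.insert(0, os.path.join(spark_home, "python"))

    # The bundled py4j zip; glob for it so the version is not hard-coded
    py4j = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0]
    sys.path.insert(0, py4j)

    from pyspark import SparkContext  # should now resolve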
爷的心禁止访问 · 2019-01-03 12:53

I got this error because the Python script I was trying to submit was called pyspark.py (facepalm). The fix was to set my PYTHONPATH as recommended above, rename the script to pyspark_test.py, and delete the pyspark.pyc that had been created from the script's original name; that cleared the error up.
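
One quick way to spot this kind of collision is to check where Python is actually loading the module from; if __file__ points at your own script (or its .pyc) rather than Spark's python/ tree, a local file is shadowing the real package:

    import pyspark
    print(pyspark.__file__)  # should point into $SPARK_HOME/python, not your project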

走好不送 · 2019-01-03 12:55

On Windows 10 the following worked for me. I added the following environment variables using Settings > Edit environment variables for your account:

SPARK_HOME=C:\Programming\spark-2.0.1-bin-hadoop2.7
PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%

(change "C:\Programming\..." to the folder where you installed Spark)
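
After opening a new console (environment-variable changes don't affect consoles that were already open), a quick check like the following confirms the variables took effect. If the import still fails on py4j, add %SPARK_HOME%\python\lib\py4j-<version>-src.zip to PYTHONPATH as well:

    import os
    print(os.environ.get("SPARK_HOME"))  # should print the Spark folder

    from pyspark import SparkContext     # should import without error
    print("pyspark import OK")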

贼婆χ · 2019-01-03 12:57

Here is a simple method (if you don't care how it works):

Use findspark.

  1. Install the package from a terminal, then initialize it in your Python shell:

    pip install findspark

    import findspark
    findspark.init()

  2. Import the necessary modules:

    from pyspark import SparkContext
    from pyspark import SparkConf

  3. Done!
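
For context, here is a minimal end-to-end sketch of the findspark route (the app name and master are arbitrary choices, not required values): findspark.init() locates Spark via SPARK_HOME and patches sys.path, after which pyspark imports work in any Python shell.

    import findspark
    findspark.init()  # or findspark.init("/spark/") to point at a specific install

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[*]").setAppName("findspark-check")
    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(10)).sum())  # 45 if everything is wired up
    sc.stop()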

chillily · 2019-01-03 13:01

Exporting the Spark path and the Py4j path made it work (this is a Homebrew layout, where the Python files live under libexec):

export SPARK_HOME=/usr/local/Cellar/apache-spark/1.5.1
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/libexec/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH

So, if you don't want to type these every time you fire up the Python shell, add them to your ~/.bashrc file.

不美不萌又怎样 · 2019-01-03 13:04

To get rid of ImportError: No module named py4j.java_gateway, add the following lines at the top of your script:

import os
import sys

# Point SPARK_HOME at the Spark installation. Use raw strings so the
# backslashes in the Windows paths are not treated as escape sequences.
os.environ['SPARK_HOME'] = r"D:\python\spark-1.4.1-bin-hadoop2.4"

# Add the pyspark package and the bundled py4j zip to the module search path.
sys.path.append(r"D:\python\spark-1.4.1-bin-hadoop2.4\python")
sys.path.append(r"D:\python\spark-1.4.1-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf

    print("success")

except ImportError as e:
    print("error importing spark modules", e)
    sys.exit(1)
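
Once the imports succeed, a short continuation like this (a sketch; the app name is arbitrary) confirms that Spark actually runs, not just imports:

conf = SparkConf().setAppName("import-check").setMaster("local[*]")
sc = SparkContext(conf=conf)
print(sc.version)
sc.stop()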