I'm following this installation guide, but I have the following problem when using graphframes:
from pyspark import SparkContext
sc = SparkContext()
!pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
from graphframes import *
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
in ()
----> 1 from graphframes import *

ImportError: No module named graphframes
I'm not sure whether it is possible to install the package this way, but I'd appreciate your advice and help.
Good question!
Open up your .bashrc file and add export SPARK_OPTS="--packages graphframes:graphframes:0.5.0-spark2.1-s_2.11". Once you have saved the file, close it and run source .bashrc.
Finally, open up your notebook and type:
from pyspark import SparkContext
sc = SparkContext()
# make the graphframes Python module inside the jar importable by the driver
sc.addPyFile('/home/username/spark-2.3.0-bin-hadoop2.7/jars/graphframes-0.5.0-spark2.1-s_2.11.jar')
After that, you should be able to run it.
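To confirm the setup works end to end, here is a minimal smoke test; the vertex/edge data and column values are made-up assumptions for illustration, not part of the original guide:

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()
# toy graph: two people and one "friend" edge
v = spark.createDataFrame([("a", "Alice"), ("b", "Bob")], ["id", "name"])
e = spark.createDataFrame([("a", "b", "friend")], ["src", "dst", "relationship"])
g = GraphFrame(v, e)
g.inDegrees.show()

If the jar is on the classpath and the import works, this prints a small in-degree table.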
I'm using a Jupyter notebook in Docker and trying to get graphframes working. First, I used the method from https://stackoverflow.com/a/35762809/2202107:
import findspark
findspark.init()
import pyspark
import os

# PYSPARK_SUBMIT_ARGS must be set before the SparkContext is created,
# so that --packages can pull graphframes down at startup
SUBMIT_ARGS = "--packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

conf = pyspark.SparkConf()
sc = pyspark.SparkContext(conf=conf)
print(sc._conf.getAll())
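As an optional sanity check (a sketch, assuming the context above), the resolved graphframes artifacts should now appear in the Spark configuration:

conf_pairs = dict(sc._conf.getAll())
print(conf_pairs.get("spark.jars"))
print(conf_pairs.get("spark.submit.pyFiles"))

Both values should mention graphframes if --packages took effect.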
Then, by following this GitHub issue, we are finally able to import graphframes: https://github.com/graphframes/graphframes/issues/172
import sys

# --packages ships the graphframes Python bindings via spark.submit.pyFiles,
# but the notebook driver does not add them to sys.path, so do it by hand
pyfiles = str(sc.getConf().get(u'spark.submit.pyFiles')).split(',')
sys.path.extend(pyfiles)

from graphframes import *
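If the import above succeeds, a toy graph (the data here is made up purely for illustration) confirms that the JVM side is reachable as well:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
v = sqlContext.createDataFrame([("1", "one"), ("2", "two")], ["id", "name"])
e = sqlContext.createDataFrame([("1", "2")], ["src", "dst"])
g = GraphFrame(v, e)
# motif finding exercises the Scala backend, not just the Python wrapper
g.find("(a)-[e]->(b)").show()

A successful show() here means both the jar and the Python bindings are wired up.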