load external libraries inside pyspark code

Posted 2020-07-24 04:02

Question:

I have a Spark cluster that I use in local mode. I want to read a CSV with the Databricks external library spark-csv. I start my app as follows:

import os
import sys

os.environ["SPARK_HOME"] = "/home/mebuddy/Programs/spark-1.6.0-bin-hadoop2.6"

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

try:
    sc
except NameError:
    print('initializing SparkContext...')
    sc=SparkContext()
sq = SQLContext(sc)
df = sq.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("/my/path/to/my/file.csv")

When I run it, I get the following error:

java.lang.ClassNotFoundException: Failed to load class for data source: com.databricks.spark.csv.

My question: how can I load the Databricks spark-csv library INSIDE my Python code? I don't want to load it externally, using --packages for instance.

I tried to add the following line, but it did not work:

os.environ["SPARK_CLASSPATH"] = '/home/mebuddy/Programs/spark_lib/spark-csv_2.11-1.3.0.jar'

Answer 1:

If you create the SparkContext from scratch, you can, for example, set PYSPARK_SUBMIT_ARGS before the SparkContext is initialized:

os.environ["PYSPARK_SUBMIT_ARGS"] = (
  "--packages com.databricks:spark-csv_2.11:1.3.0 pyspark-shell"
)

sc = SparkContext()
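
Putting this together with the code from the question, the whole flow might look like the sketch below (untested). One caveat: the artifact's Scala version has to match your Spark build; the pre-built spark-1.6.0-bin-hadoop2.6 download is, as far as I know, compiled against Scala 2.10, so com.databricks:spark-csv_2.10:1.3.0 may be the version you actually need.

import os

# Must be set before the JVM is launched, i.e. before SparkContext().
# spark-csv_2.11 is kept from the answer above; swap in spark-csv_2.10
# if your Spark build uses Scala 2.10.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.11:1.3.0 pyspark-shell"
)

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sq = SQLContext(sc)
df = (sq.read
        .format('com.databricks.spark.csv')
        .options(header='true', inferSchema='true')
        .load("/my/path/to/my/file.csv"))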

If for some reason you expect that the SparkContext has already been initialized, as your code suggests, this won't work. In local mode you could try to use the Py4J gateway and a URLClassLoader, but it doesn't look like a good idea and won't work in cluster mode.
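
As a side note on the SPARK_CLASSPATH attempt in the question: the jar that is already on disk can also be passed through the same environment variable with --jars instead of --packages. A rough, untested sketch; keep in mind that spark-csv has transitive dependencies (commons-csv and univocity-parsers) that --packages resolves for you, which is why it is usually the more convenient option.

import os
from pyspark import SparkContext

# Variant: reuse the locally downloaded jar instead of resolving it
# from Maven. Transitive dependencies would have to be added to the
# --jars list as well, and the setting must precede SparkContext().
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars /home/mebuddy/Programs/spark_lib/spark-csv_2.11-1.3.0.jar "
    "pyspark-shell"
)

sc = SparkContext()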