Adding custom jars to pyspark in jupyter notebook

2019-03-11 11:59发布

问题:

I am using the Jupyter notebook with Pyspark with the following docker image: Jupyter all-spark-notebook

Now I would like to write a pyspark streaming application which consumes messages from Kafka. In the Spark-Kafka Integration guide they describe how to deploy such an application using spark-submit (it requires linking an external jar - explanation is in 3. Deploying). But since I am using Jupyter notebook I never actually run the spark-submit command, I assume it gets run in the back if I press execute.

In the spark-submit command you can specify some parameters, one of them is -jars, but it is not clear to me how I can set this parameter from the notebook (or externally via environment variables?). I am assuming I can link this external jar dynamically via the SparkConf or the SparkContext object. Has anyone experience on how to perform the linking properly from the notebook?

回答1:

I've managed to get it working from within the jupyter notebook which is running form the all-spark container.

I start a python3 notebook in jupyterhub and overwrite the PYSPARK_SUBMIT_ARGS flag as shown below. The Kafka consumer library was downloaded from the maven repository and put in my home directory /home/jovyan:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = 
  '--jars /home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar pyspark-shell'

import pyspark
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext

sc = pyspark.SparkContext()
ssc = StreamingContext(sc,1)

broker = "<my_broker_ip>"
directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test1"],
                        {"metadata.broker.list": broker})
directKafkaStream.pprint()
ssc.start()

Note: Don't forget the pyspark-shell in the environment variables!

Extension: If you want to include code from spark-packages you can use the --packages flag instead. An example on how to do this in the all-spark-notebook can be found here



回答2:

You can run your jupyter notebook with the pyspark command by setting the relevant environment variables:

export PYSPARK_DRIVER_PYTHON=jupyter
export IPYTHON=1
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --port=XXX --ip=YYY"

with XXX being the port you want to use to access the notebook and YYY being the ip address.

Now simply run pyspark and add --jars as a switch the same as you would spark submit