When using PySpark, I'd like a SparkContext to be initialised (in yarn-client mode) upon creation of a new notebook.
The following tutorials describe how to do this with past versions of IPython/Jupyter (< 4):
https://www.dataquest.io/blog/pyspark-installation-guide/
https://npatta01.github.io/2015/07/22/setting_up_pyspark/
I'm not quite sure how to achieve the same with notebook > 4, now that profiles are gone, as noted in http://jupyter.readthedocs.io/en/latest/migrating.html#since-jupyter-does-not-have-profiles-how-do-i-customize-it
I can manually create and configure a SparkContext, but I don't want our analysts to have to worry about this.
Does anyone have any ideas?
Well, the missing profiles functionality in Jupyter puzzled me in the past too, albeit for a different reason: I wanted to be able to switch between different deep learning frameworks (Theano & TensorFlow) on demand; eventually I found a solution (described in a blog post of mine here).
The fact is that, although there are no profiles in Jupyter, the startup files functionality for the IPython kernel is still there, and, since PySpark employs this particular kernel, it can be used in your case.
So, provided that you already have a working PySpark kernel for Jupyter, all you have to do is write a short initialization script, init_spark.py, along the following lines:
from pyspark import SparkConf, SparkContext

# run against YARN in client mode; the context will be available as `sc` in every notebook
conf = SparkConf().setMaster("yarn-client")
sc = SparkContext(conf=conf)
and place it in the ~/.ipython/profile_default/startup/ directory of your users.
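If you are worried about the script throwing an error when a SparkContext already exists (for example if a user re-runs the same logic by hand), a slightly more defensive variant is possible; this is just a sketch, assuming your Spark version provides SparkContext.getOrCreate, and the application name is purely an illustrative choice:

from pyspark import SparkConf, SparkContext

# reuse an existing context if one is already running, otherwise create a new one
conf = SparkConf().setMaster("yarn-client").setAppName("jupyter-notebook")
sc = SparkContext.getOrCreate(conf=conf)

An explicit application name also makes the analysts' notebooks easier to spot in the YARN ResourceManager UI.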
You can confirm that sc is already set after starting a Jupyter notebook:
In [1]: sc
Out[1]: <pyspark.context.SparkContext at 0x7fcceb7c5fd0>
In [2]: sc.version
Out[2]: u'2.0.0'
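If an analyst ever does need different settings in a particular notebook, the auto-created context can simply be stopped and replaced from within that notebook; the local[*] master below is just an illustration:

In [3]: sc.stop()
In [4]: from pyspark import SparkConf, SparkContext
In [5]: sc = SparkContext(conf=SparkConf().setMaster("local[*]"))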
A more disciplined way of integrating PySpark with Jupyter notebooks is described in my answers here and here.
A third option is to try Apache Toree (formerly Spark Kernel), as described here (I haven't tested it, though).