I understand that you can send individual files as dependencies with Python Spark programs. But what about full-fledged libraries (e.g. numpy)?
Does Spark have a way to use a provided package manager (e.g. pip) to install library dependencies? Or does this have to be done manually before Spark programs are executed?
If the answer is manual, then what are the "best practice" approaches for synchronizing libraries (installation path, version, etc.) over a large number of distributed nodes?
Actually having actually tried it, I think the link I posted as a comment doesn't do exactly what you want with dependencies. What you are quite reasonably asking for is a way to have Spark play nicely with setuptools and pip regarding installing dependencies. It blows my mind that this isn't supported better in Spark. The third-party dependency problem is largely solved in general purpose Python, but under Spark, it seems the assumption is you'll go back to manual dependency management or something.
I have been using an imperfect but functional pipeline based on virtualenv. The basic idea is
pip install
of all your own in-house Python libraries. If you have set these up withsetuptools
, this will install their dependencies.zip
file, containing your libraries and their dependencies as an argument to--py-files
Of course you would want to code up some helper scripts to manage this process. Here is a helper script adapted from one I have been using, which could doubtless be improved a lot:
I have a collection of other simple wrapper scripts I run to submit my spark jobs. I simply call this script first as part of that process and make sure that the second argument (name of a zip file) is then passed as the --py-files argument when I run
spark-submit
(as documented in the comments). I always run these scripts, so I never end up accidentally running old code. Compared to the Spark overhead, the packaging overhead is minimal for my small scale project.There are loads of improvements that could be made – eg being smart about when to create a new zip file, splitting it up two zip files, one containing often-changing private packages, and one containing rarely changing dependencies, which don't need to be rebuilt so often. You could be smarter about checking for file changes before rebuilding the zip. Also checking validity of arguments would be a good idea. However for now this suffices for my purposes.
The solution I have come up with is not designed for large-scale dependencies like NumPy specifically (although it may work for them). Also, it won't work if you are building C-based extensions, and your driver node has a different architecture to your cluster nodes.
I have seen recommendations elsewhere to just run a Python distribution like Anaconda on all your nodes since it already includes NumPy (and many other packages), and that might be the better way to get NumPy as well as other C-based extensions going. Regardless, we can't always expect Anaconda to have the PyPI package we want in the right version, and in addition you might not be able to control your Spark environment to be able to put Anaconda on it, so I think this virtualenv-based approach is still helpful.