Is it always the case that the driver (the program that coordinates the job) must run on the master node?
For example, if I set up EC2 with one master and two workers, does my code containing the `main` method have to be executed from the master EC2 instance?
If the answer is NO, what would be the best way to set up the system so that the driver runs outside EC2's master node (let's say the driver runs on my computer, while the master and workers are on EC2)? Do I always have to use spark-submit, or can I do it from an IDE such as Eclipse or IntelliJ IDEA?
If the answer is YES, what would be the best reference to learn more about it (since I need to provide some sort of proof)?
Thank you kindly for your answer; references would be highly appreciated!
No, it doesn't have to be on the master.

Using `spark-submit` you can use `--deploy-mode` to control how your driver is run: either `client`, on the machine you run submit on (which could be the master or another machine), or `cluster`, on one of the workers. There is network communication between the workers and the driver, so you want the driver "close" to the workers, never across the WAN.
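To make the two modes concrete, here is a sketch of the corresponding `spark-submit` invocations. The master URL, main class, and jar name are placeholders for your own cluster and application:

```
# client mode: the driver runs on the machine where you invoke spark-submit
# (e.g. a machine in the same EC2 region as the workers)
spark-submit \
  --master spark://ec2-master-host:7077 \
  --deploy-mode client \
  --class com.example.MyApp \
  my-app.jar

# cluster mode: the driver is launched on one of the workers
spark-submit \
  --master spark://ec2-master-host:7077 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar
```

In cluster mode the jar must be reachable from the workers (for example, on a shared filesystem or uploaded to them), since the driver itself starts inside the cluster.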
You can run from inside a REPL (`spark-shell`), which could be launched from your IDE. If you're using a dynamic language like Clojure, you can also just create a `SparkContext` referencing (through `master`) a local cluster, or the cluster you want to submit jobs to, and then code through the REPL. In practice it isn't this easy.