All Spark Java examples that I can find online use a single static class which contains the entire program functionality. How would I structure an ordinary Java program containing several non-static classes so that I can make calls to Spark, preferably from several Java objects?
There are several reasons why this is not entirely straightforward:
- JavaSparkContext needs to be available wherever new RDDs are created, and it is not serializable. At the time of writing, only a single Spark context works reliably in a single JVM. For now I am using one static class in my program just for the JavaSparkContext, HiveContext and SparkConf so that they are available everywhere.
- Anonymous classes are not practical: almost all examples online pass anonymous classes to Spark operations. But an anonymous class created inside an instance method keeps a reference to its enclosing object, so the enclosing class has to be serializable and the entire enclosing object is sent to the worker nodes. That's not necessarily what people want. To prevent this you have to define a separate class outside the enclosing class which implements the interface containing `call`. Now only the contents of the new class are serialized. (By implementing the `call`-containing interface a class also implements `Serializable`.) Alternatively, if you want to have the code inside the enclosing class, you can use a static nested class.
There are probably even more things that demand a special structure when you use Spark. I wonder if the structuring that I used for solving the two issues above can be improved?