Reducing the size of an application JAR by providing the Spark classpath for Maven dependencies
My cluster has 3 EC2 instances running Hadoop and Spark. If I build the JAR with the Maven dependencies bundled in, it becomes too large (around 100 MB), which I want to avoid because the JAR is replicated to all nodes each time I run a job.
To avoid that, I built the application JAR with "mvn package". For dependency resolution, I downloaded all the Maven dependencies onto each node in advance and then provided only the JAR paths at run time.
I added the classpath on each node in "spark-defaults.conf" as:
spark.driver.extraClassPath /home/spark/.m2/repository/com/google/code/gson/gson/2.3.1/gson-2.3.1.jar:/home/spark/.m2/repository/com/datastax/cassandra/cassandra-driver-core/2.1.5/cassandra-driver-core-2.1.5.jar:/home/spark/.m2/repository/com/google/guava/guava/16.0.1/guava-16.0.1.jar:/home/spark/.m2/repository/com/google/collections/google-collections/1.0/google-collections-1.0.jar:/home/spark/.m2/repository/com/datastax/spark/spark-cassandra-connector-java_2.10/1.2.0-rc1/spark-cassandra-connector-java_2.10-1.2.0-rc1.jar:/home/spark/.m2/repository/com/datastax/spark/spark-cassandra-connector_2.10/1.2.0-rc1/spark-cassandra-connector_2.10-1.2.0-rc1.jar:/home/spark/.m2/repository/org/apache/cassandra/cassandra-thrift/2.1.3/cassandra-thrift-2.1.3.jar:/home/spark/.m2/repository/org/joda/joda-convert/1.2/joda-convert-1.2.jar
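As a side note (not from the original post): the same classpath can also be supplied per job instead of globally, since spark-submit accepts a --driver-class-path flag and executor entries can be set with --conf. A minimal sketch, where the class name and JAR path are placeholders and <dependency list> stands for the colon-separated list above:

spark-submit --driver-class-path <dependency list> --conf spark.executor.extraClassPath=<dependency list> --class com.example.MyApp my-app.jar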
The classpath setup worked locally on a single node, but I am still getting this error on the cluster. Any help will be appreciated.
You don't need to put all the JAR files there. Just put your application JAR file. If you get the error again, then add the JAR files that are needed.
You have to pass the JAR files with the setJars() method.
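For reference, setJars() is a method on SparkConf; a minimal sketch in Java (the app name and JAR path are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SubmitWithJars {
    public static void main(String[] args) {
        // Ship only the application JAR to the worker nodes; add further
        // paths to this array only if class-not-found errors appear.
        SparkConf conf = new SparkConf()
                .setAppName("MyApp") // placeholder app name
                .setJars(new String[] { "/home/spark/my-app.jar" }); // placeholder path
        // The master URL is supplied by spark-submit at launch time.
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job logic ...
        sc.stop();
    }
}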
Finally, I was able to solve the problem. I created the application JAR using "mvn package" instead of "mvn clean compile assembly:single", so that the Maven dependencies are not bundled into the JAR (but they then need to be provided at run time). This results in a small JAR, since it contains only the application classes and references to its dependencies.
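To make the difference concrete, a sketch of the two builds ("assembly:single" here assumes the maven-assembly-plugin is configured in the POM, typically with the jar-with-dependencies descriptor):

# fat JAR (~100 MB in this case): bundles every dependency into one file
mvn clean compile assembly:single

# thin JAR: contains only the application classes; dependencies
# must be supplied on the classpath at run time
mvn package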
Then, I added the below two parameters in spark-defaults.conf on each node:
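The two parameters are not quoted in the post; given the snippet in the question, presumably they are the driver and executor classpath settings. A sketch, where <dependency list> stands for the same colon-separated list of local-repository JARs shown above:

spark.driver.extraClassPath   <dependency list>
spark.executor.extraClassPath <dependency list>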
So the question arises: how will the application JAR get the Maven dependencies (the required JARs) at run time?
For that, I downloaded all the required dependencies onto each node in advance by running "mvn clean compile assembly:single" there.
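A note on why this works: running a full build once on each node makes Maven resolve and cache every dependency under the local repository (~/.m2/repository by default), which is exactly the directory the extraClassPath entries reference.

# run once on every node to populate ~/.m2/repository
mvn clean compile assembly:single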