I am running a Spark job on EMR and using the DataStax connector to connect to a Cassandra cluster. I am hitting a conflict with the Guava jar; the details are below.
I am using the following Cassandra version:
cqlsh 5.0.1 | Cassandra 3.0.1 | CQL spec 3.3.1
The Spark job runs on EMR 4.4 with the following Maven dependencies:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.5.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.5.0</version>
</dependency>
<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.10</artifactId>
    <version>1.5.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kinesis-asl_2.10</artifactId>
    <version>1.5.0</version>
</dependency>
When I submit the Spark job, it fails with the following error:
java.lang.ExceptionInInitializerError
at com.datastax.spark.connector.cql.DefaultConnectionFactory$.clusterBuilder(CassandraConnectionFactory.scala:35)
at com.datastax.spark.connector.cql.DefaultConnectionFactory$.createCluster(CassandraConnectionFactory.scala:87)
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:153)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:148)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:148)
at com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:31)
at com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:56)
at com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:81)
at ampush.event.process.core.CassandraServiceManagerImpl.getAdMetaInfo(CassandraServiceManagerImpl.java:158)
at ampush.event.config.metric.processor.ScheduledEventAggregator$4.call(ScheduledEventAggregator.java:308)
at ampush.event.config.metric.processor.ScheduledEventAggregator$4.call(ScheduledEventAggregator.java:290)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:222)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:222)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:902)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:902)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: Detected Guava issue #1635 which indicates that a version of Guava less than 16.01 is in use. This introduces codec resolution issues and potentially other incompatibility issues in the driver. Please upgrade to Guava 16.01 or later.
at com.datastax.driver.core.SanityChecks.checkGuava(SanityChecks.java:62)
at com.datastax.driver.core.SanityChecks.check(SanityChecks.java:36)
at com.datastax.driver.core.Cluster.<clinit>(Cluster.java:67)
... 23 more
Please let me know how to manage the Guava dependencies here.
Thanks
Another solution: go to the spark/jars directory, rename guava-14.0.1.jar, then copy guava-19.0.jar into that directory.
Just add something like this to your POM's <dependencies> block:
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>19.0</version>
</dependency>
(or any version >= 16.0.1 that you prefer)
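The ">= 16.0.1" threshold from the error message can be checked mechanically. Below is a minimal sketch, assuming Guava version strings of the form "19.0", "16.0.1" or "31.1-jre"; the class and method names (`GuavaVersionCheck`, `atLeast`) are made up for illustration and are not part of Guava or the DataStax driver.

```java
final class GuavaVersionCheck {
    /** Parses "19.0", "16.0.1" or "31.1-jre" into its numeric components. */
    static int[] parse(String version) {
        String numeric = version.split("-", 2)[0];   // drop qualifiers such as "-jre"
        String[] parts = numeric.split("\\.");
        int[] out = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = Integer.parseInt(parts[i]);
        }
        return out;
    }

    /** Returns true when version >= minimum, treating missing components as 0. */
    static boolean atLeast(String version, String minimum) {
        int[] v = parse(version);
        int[] m = parse(minimum);
        int n = Math.max(v.length, m.length);
        for (int i = 0; i < n; i++) {
            int a = i < v.length ? v[i] : 0;
            int b = i < m.length ? m[i] : 0;
            if (a != b) {
                return a > b;
            }
        }
        return true;   // all components equal
    }

    public static void main(String[] args) {
        // Versions mentioned in this thread, compared against the driver's minimum.
        for (String v : new String[] {"14.0.1", "16.0.1", "19.0", "23.0"}) {
            System.out.println(v + " ok? " + atLeast(v, "16.0.1"));
        }
    }
}
```

Note that this only compares version strings; it does not tell you which Guava jar the JVM actually loads at runtime.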
I had the same problem and resolved it by using the Maven Shade plugin to shade the Guava version that the Cassandra connector brings in.
I needed to exclude the Optional, Present and Absent classes explicitly because I was running into issues with Spark trying to cast from the non-shaded Guava Present type to the shaded Optional type. I'm not sure if this will cause any problems later on, but it seems to be working for me for now.
You can add this to the <plugins> section in your pom.xml:
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.4.3</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
    <configuration>
        <minimizeJar>true</minimizeJar>
        <shadedArtifactAttached>true</shadedArtifactAttached>
        <shadedClassifierName>fat</shadedClassifierName>
        <relocations>
            <relocation>
                <pattern>com.google</pattern>
                <shadedPattern>shaded.guava</shadedPattern>
                <includes>
                    <include>com.google.**</include>
                </includes>
                <excludes>
                    <exclude>com.google.common.base.Optional</exclude>
                    <exclude>com.google.common.base.Absent</exclude>
                    <exclude>com.google.common.base.Present</exclude>
                </excludes>
            </relocation>
        </relocations>
        <filters>
            <filter>
                <artifact>*:*</artifact>
                <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                </excludes>
            </filter>
        </filters>
    </configuration>
</plugin>
I was able to get around this by adding the Guava 16.0.1 jar externally and then putting it on the classpath at spark-submit time with the following configuration values:
--conf "spark.driver.extraClassPath=/guava-16.0.1.jar"
--conf "spark.executor.extraClassPath=/guava-16.0.1.jar"
I hope this helps someone with a similar error!
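Spark prepends the extraClassPath entries to the JVM classpath, so the externally supplied jar is found before the Guava that ships with Spark. A minimal sketch for checking that ordering from inside the JVM follows. The class name `GuavaOnClasspath` and the reliance on the `guava-*.jar` file-name convention are assumptions for illustration. Jars added through other class loaders (for example via --jars) may not appear in `java.class.path`, so treat the output as a hint rather than proof.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class GuavaOnClasspath {
    /**
     * Returns the classpath entries whose file name looks like a Guava jar,
     * in classpath order (earlier entries are searched first).
     */
    static List<String> guavaEntries(String classPath, String separator) {
        List<String> hits = new ArrayList<>();
        for (String entry : classPath.split(Pattern.quote(separator))) {
            String name = new File(entry).getName();
            if (name.startsWith("guava-") && name.endsWith(".jar")) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String cp = System.getProperty("java.class.path");
        // The first entry printed is the one the application class loader sees first.
        for (String entry : guavaEntries(cp, File.pathSeparator)) {
            System.out.println(entry);
        }
    }
}
```

If the first entry listed is still guava-14.0.1.jar, the extraClassPath setting did not take effect for that JVM.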
Thanks Adrian for your response.
My architecture is a little different from everybody else's on this thread, but the Guava problem is the same. I am using Spark 2.2 with Mesosphere. In our development environment we use sbt-native-packager to produce the Docker images that we pass to Mesos.
It turns out we needed a different Guava for the Spark executors than for the code that runs on the driver. This worked for me:
build.sbt
....
libraryDependencies ++= Seq(
  "com.google.guava" % "guava" % "19.0" force(),
  "org.apache.hadoop" % "hadoop-aws" % "2.7.3" excludeAll (
    ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-common"), // this is for s3a
    ExclusionRule(organization = "com.google.guava", name = "guava")),
  "org.apache.spark" %% "spark-core" % "2.1.0" excludeAll (
    ExclusionRule("org.glassfish.jersey.bundles.repackaged", name = "jersey-guava"),
    ExclusionRule(organization = "com.google.guava", name = "guava")),
  "com.github.scopt" %% "scopt" % "3.7.0" excludeAll (
    ExclusionRule("org.glassfish.jersey.bundles.repackaged", name = "jersey-guava"),
    ExclusionRule(organization = "com.google.guava", name = "guava")),
  "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.6",
...
dockerCommands ++= Seq(
  ...
  Cmd("RUN rm /opt/spark/dist/jars/guava-14.0.1.jar"),
  Cmd("RUN wget -q http://central.maven.org/maven2/com/google/guava/guava/23.0/guava-23.0.jar -O /opt/spark/dist/jars/guava-23.0.jar")
...
When I tried to replace Guava 14 on the executors with Guava 16.0.1 or 19, it still didn't work; spark-submit just died. In my fat jar, which supplies the Guava that my application uses on the driver, I forced the version to 19, but on the executors I had to replace Guava with version 23.
Sorry for the diversion, but this thread came up in every one of my Google searches. I hope this helps other sbt/Mesos folks too.
I was facing the same issue while retrieving records from a Cassandra table using Spark (Java) via spark-submit.
Check which Guava jar versions Hadoop and Spark use on your cluster with the find command, and change them accordingly:
find / -name "guav*.jar"
Otherwise, add the Guava jar externally at spark-submit time for the driver and the executors, using spark.driver.extraClassPath and spark.executor.extraClassPath respectively:
spark-submit --class com.my.spark.MySparkJob --master local \
  --conf 'spark.yarn.executor.memoryOverhead=2048' \
  --conf 'spark.cassandra.input.consistency.level=ONE' \
  --conf 'spark.cassandra.output.consistency.level=ONE' \
  --conf 'spark.dynamicAllocation.enabled=false' \
  --conf "spark.driver.extraClassPath=lib/guava-19.0.jar" \
  --conf "spark.executor.extraClassPath=lib/guava-19.0.jar" \
  --total-executor-cores 15 --executor-memory 15g \
  --jars $(echo lib/*.jar | tr ' ' ',') \
  target/my-sparkapp.jar
It's working for me. Hope you can try it.
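The find command shows which Guava jars exist on disk, but not which one a running JVM actually picked. A minimal sketch for checking that at runtime is below. The class name `ClassLocator` is hypothetical, and the sketch assumes you can call it from your own driver or executor code (for example inside foreachPartition) and read the resulting log line.

```java
import java.security.CodeSource;

final class ClassLocator {
    /**
     * Returns the location (usually a jar URL) that the named class was loaded from,
     * or a short explanatory string when it cannot be determined.
     */
    static String locate(String className) {
        try {
            // initialize=false: only look the class up, do not run its static initializers
            Class<?> cls = Class.forName(className, false, ClassLocator.class.getClassLoader());
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return "(bootstrap or unknown source)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // com.google.common.base.Optional is a Guava class; any Guava class would do.
        System.out.println("Guava loaded from: " + locate("com.google.common.base.Optional"));
    }
}
```

If the printed location still points at guava-14.0.1.jar, the classpath changes have not reached that JVM.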