I am running a Spark application that does a simple diff between two DataFrames. I execute it as a jar file in my cluster environment, which is a 94-node cluster. There are two data sets, 2 GB and 4 GB, which are mapped to DataFrames.
My job works fine for very small input files.
I personally think saveAsTextFile is what takes most of the time in my application.
Below are my cluster config details:
Total Vmem allocated for Containers 394.80 GB
Total VCores allocated for Containers 36
This is how I run my Spark job:
spark-submit --queue root.queue --deploy-mode client --master yarn SparkApplication-SQL-jar-with-dependencies.jar
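For reference, the same resource settings can also be passed on the command line, and in yarn-client mode the driver memory in particular only takes effect this way; the figures below are placeholders rather than a recommendation for this cluster:
spark-submit --queue root.queue --deploy-mode client --master yarn --driver-memory 32g --executor-memory 32g --num-executors 8 --executor-cores 4 SparkApplication-SQL-jar-with-dependencies.jar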
Here is my code:
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{ StructType, StructField, StringType, DoubleType, IntegerType }

object TestDiff {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("WordCount")
    conf.set("spark.executor.memory", "32g")
    // In yarn-client mode spark.driver.memory only takes effect when passed to
    // spark-submit (or set in spark-defaults.conf); by the time this line runs
    // the driver JVM has already started with its default heap.
    conf.set("spark.driver.memory", "32g")
    conf.set("spark.driver.maxResultSize", "4g")

    val sc = new SparkContext(conf) // creating the Spark context
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    // Both inputs are pipe-delimited text files with the same layout
    val schema = StructType(Array(
      StructField("filler1", StringType),
      StructField("dunsnumber", StringType),
      StructField("transactionalindicator", StringType)))

    val textRdd1 = sc.textFile("/home/cloudera/TRF/PCFP/INCR")
    val rowRdd1 = textRdd1.map(line => Row.fromSeq(line.split("\\|", -1)))
    val df1 = sqlContext.createDataFrame(rowRdd1, schema)

    val textRdd2 = sc.textFile("/home/cloudera/TRF/PCFP/MAIN")
    val rowRdd2 = textRdd2.map(line => Row.fromSeq(line.split("\\|", -1)))
    val df2 = sqlContext.createDataFrame(rowRdd2, schema)

    // Rows of df1 that do not appear in df2, i.e. rows where any column changed
    val diffAnyColumnDF = df1.except(df2)

    // coalesce(1) funnels the entire result through a single task before writing
    diffAnyColumnDF.rdd.coalesce(1).saveAsTextFile("Diffoutput")
  }
}
It runs for more than 30 minutes and then fails with the exception below. Here are the logs:
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
17/09/15 11:55:01 WARN netty.NettyRpcEnv: Ignored message: HeartbeatResponse(false)
17/09/15 11:56:19 WARN netty.NettyRpcEndpointRef: Error sending message [message = Heartbeat(1,[Lscala.Tuple2;@7fe57079,BlockManagerId(1, c755kds.int.int.com, 33507))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:491)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:520)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:520)
at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:520)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1818)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:520)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
17/09/15 11:56:19 WARN netty.NettyRpcEnv: Ignored message: HeartbeatResponse(false)
Please suggest how I can tune my Spark job.
I just changed the executor memory and the job succeeded, but it is very slow:
conf.set("spark.executor.memory", "64g")
The job is still very slow; it takes around 15 minutes to complete.
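One quick check that might explain the slowness is how many partitions each stage actually gets; a minimal sketch (nothing here is specific to this job beyond the existing df1, df2 and diffAnyColumnDF names):

println(s"df1 partitions: ${df1.rdd.partitions.length}")
println(s"df2 partitions: ${df2.rdd.partitions.length}")
println(s"diff partitions before coalesce: ${diffAnyColumnDF.rdd.partitions.length}")

If the final write runs as a single task because of coalesce(1), the Spark UI will show one long-running task in the last stage no matter how much memory the executors have.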
Attaching the DAG visualization.
After increasing the timeout conf, I get the error below:
executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 175200 ms
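These messages are governed by the executor heartbeat and network timeout settings named in the log; a minimal sketch of that kind of change (the values are placeholders, not the exact ones used here):

conf.set("spark.executor.heartbeatInterval", "60s") // how often executors report to the driver
conf.set("spark.network.timeout", "600s")           // must stay well above the heartbeat interval

Raising the timeouts does not address whatever is keeping the executors busy for that long, though.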
I think your single-partition file size is too big. It takes a long time to stream the data over the TCP channel, the connection cannot be kept alive for that long, and it gets reset.
Can you coalesce to a higher number of partitions?
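For example, a minimal sketch (the partition count of 100 is only an illustration; choose it based on the data size):

// Spread the output write over many tasks instead of forcing it through one
diffAnyColumnDF.rdd.coalesce(100).saveAsTextFile("Diffoutput")

If a single output file is really needed, it is usually cheaper to write many part files and merge them afterwards (for example with hdfs dfs -getmerge).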