We are running into frequent errors with a Spark standalone cluster on our newly installed CDH 5.5.2 cluster. We have 7 worker nodes, each with 16 GB of memory, but almost all joins are failing.
I have made sure to allocate the full memory with --executor-memory and confirmed in the Spark UI that the executors actually received that much memory.
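For context, this is roughly how we submit the jobs (the master host, class name, jar path and memory value below are placeholders, not our real values):

    # Submitted against the standalone master, not YARN.
    # The 14g is illustrative; it sits just under the 16 GB of physical memory per worker.
    spark-submit \
      --master spark://our-master-host:7077 \
      --class com.example.JoinJob \
      --executor-memory 14g \
      /path/to/our-join-job.jar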
Most of our errors look like the one below. We have checked everything we could think of on our side, but none of our fixes worked.
Caused by: java.io.FileNotFoundException: /tmp/spark-b9e69c8d-153b-4c4b-88b1-ac779c060de5/executor-44e88b75-5e79-4d96-b507-ddabcab30e1b/blockmgr-cd27625c-8716-49ac-903d-9d5c36cf2622/29/shuffle_1_66_0.index (Permission denied)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:275)
... 27 more
at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:162)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
/tmp has 777 permissions, but the error still says permission is denied under /tmp.
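For what it's worth, this is how we checked the permissions on the workers (the expected mode bits are in the comment; we did not copy the exact listing):

    # run on each worker node; we see world-writable permissions with the sticky bit
    ls -ld /tmp
    # drwxrwxrwt ... /tmp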
We have configured SPARK_LOCAL_DIRS to point at another directory with more free disk space, but the cluster is still using /tmp. Why? We changed it through Cloudera Manager, and when we print spark.local.dir from the Spark configuration it shows the directory we set. But at execution time it is the other way around: the shuffle files are still being looked up under /tmp. Are we missing anything here?
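Concretely, this is the kind of setting we pushed out through Cloudera Manager (the directory path is a placeholder for the data disk we actually use):

    # spark-env.sh on every worker
    export SPARK_LOCAL_DIRS=/data/spark-local

    # spark-defaults.conf on the application side
    spark.local.dir    /data/spark-local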
We have turned off Spark on YARN; can any YARN configuration still affect the standalone deployment?
Has anyone faced this issue, and why does it keep recurring for us? We had a similar cluster on Hortonworks, where we installed a bare-bones Spark (not part of the distribution), and it worked very well. But on this new cluster we keep hitting these issues. Maybe we missed something? We are curious to know what.