Running a job on Spark 0.9.0 throws an error

Posted on 2019-04-19 08:47

Question:

I have an Apache Spark 0.9.0 cluster installed, on which I am trying to deploy code that reads a file from HDFS. The code throws a warning and eventually the job fails. Here is the code:

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Running this code fails with the warning:
 * "Initial job has not accepted any resources; check your cluster UI to ensure that
 *  workers are registered and have sufficient memory"
 */
object Main extends App {
  val sconf = new SparkConf()
    .setMaster("spark://labscs1:7077")
    .setAppName("spark scala")
  val sctx = new SparkContext(sconf)
  sctx.parallelize(1 to 100).count()
}

Below is the WARNING message:

Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

How do I get rid of this, or am I missing some configuration?

Answer 1:

You get this when either the number of cores or the amount of RAM (per node) you request via spark.cores.max and spark.executor.memory, respectively, exceeds what is available. Therefore, even if no one else is using the cluster, if you specify that you want, say, 100GB of RAM per node but your nodes can only supply 90GB, you will get this error message.

To be fair, the message is vague in this situation; it would be more helpful if it said you were exceeding the maximum.
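
As a rough sketch, you can cap what the application asks for so it fits within what a single worker actually offers. The 2-core / 2g figures below are placeholders, not values from the question; check your master UI for the real per-worker limits:

import org.apache.spark.{SparkConf, SparkContext}

object ResourceCappedMain extends App {
  // Ask for no more than the cluster can actually provide.
  // The figures below (2 cores, 2g per executor) are placeholders.
  val sconf = new SparkConf()
    .setMaster("spark://labscs1:7077")
    .setAppName("spark scala")
    .set("spark.cores.max", "2")          // total cores the app may claim
    .set("spark.executor.memory", "2g")   // must not exceed a worker's memory
  val sctx = new SparkContext(sconf)
  sctx.parallelize(1 to 100).count()
}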



Answer 2:

It looks like the Spark master can't assign any workers to this task. Either the workers aren't started or they are all busy.

Check the Spark UI on the master node (the port is set by SPARK_MASTER_WEBUI_PORT in spark-env.sh, 8080 by default); you can also query the same information programmatically, as in the sketch after the list below.

For the cluster to function properly:

  • There must be some workers with state "Alive"
  • There must be some cores available (for example, if all cores are busy with a frozen task, the cluster won't accept new tasks)
  • There must be sufficient memory available
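
If you prefer checking this from code, the master web UI also exposes its status as JSON under /json in many Spark versions (I have not verified this endpoint for 0.9.0 specifically, so treat it as an assumption). A minimal sketch, with host and port as placeholders:

import scala.io.Source

object MasterStatusCheck extends App {
  // Fetch the master's status page as JSON (assumes the /json endpoint is
  // available on your Spark version) and print it, so you can inspect
  // worker state, free cores and free memory.
  val masterWebUi = "http://labscs1:8080/json"   // placeholder host/port
  val status = Source.fromURL(masterWebUi).mkString
  println(status)
}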


Answer 3:

Also make sure your Spark workers can communicate both ways with the driver. Check for firewalls, etc.
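
A quick way to test reachability from a worker machine back to the driver is a plain TCP connect. The host and port below are placeholders; the driver's actual port is whatever spark.driver.port resolves to, which is random by default:

import java.net.{InetSocketAddress, Socket}

object ConnectivityCheck extends App {
  // Attempt a TCP connection to the given host/port and report the result.
  // Run this on a worker node against the driver's host and port
  // (both are placeholders here).
  val host = "driver-host"
  val port = 51000
  val socket = new Socket()
  try {
    socket.connect(new InetSocketAddress(host, port), 5000) // 5s timeout
    println(s"Reachable: $host:$port")
  } catch {
    case e: Exception => println(s"NOT reachable: $host:$port (${e.getMessage})")
  } finally {
    socket.close()
  }
}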



Answer 4:

I had this exact issue. I had a simple 1-node Spark cluster and was getting this error when trying to run my Spark app.

I ran through some of the suggestions above, and it was when I tried to run the Spark shell against the cluster and couldn't see it in the UI that I became suspicious my cluster was not working correctly.

In my hosts file I had an entry, let's say SparkNode, that referenced the correct IP address.

I had inadvertently put the wrong IP address in the conf/spark-env.sh file against the SPARK_MASTER_IP variable. I changed this to SparkNode, and I also changed SPARK_LOCAL_IP to point to SparkNode.

To test this I opened up the UI using SparkNode:7077 in the browser and I could see an instance of Spark running.

I then used Wildfire's suggestion of running the Spark shell, as follows:

MASTER=spark://SparkNode:7077 bin/spark-shell

Going back to the UI I could now see the Spark shell application running, which I couldn't before.

So I exited the Spark shell and ran my app using Spark Submit and it now works correctly.

It is definitely worth checking all of your IP and host entries; this was the root cause of my problem.



Answer 5:

You need to specify the right SPARK_HOME and your driver program's IP address, otherwise Spark may not be able to locate your Netty jar server. Be aware that your Spark master should listen on the correct IP address, the one you intend to use. This can be done by setting SPARK_MASTER_IP=yourIP in the spark-env.sh file.

  val conf = new SparkConf()
    .setAppName("test")
    .setMaster("spark://yourSparkMaster:7077")
    .setSparkHome("YourSparkHomeDir")
    .set("spark.driver.host", "YourIPAddr")


Answer 6:

Check for errors regarding hostname, IP address and loopback. Make sure to set SPARK_LOCAL_IP and SPARK_MASTER_IP, for example as sketched below.
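
A minimal spark-env.sh sketch, reusing the master host labscs1 from the question; the local IP is a placeholder and should be each node's own non-loopback address:

# conf/spark-env.sh
export SPARK_MASTER_IP=labscs1        # hostname or IP the master binds to
export SPARK_LOCAL_IP=192.168.1.10    # placeholder: this node's non-loopback IP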



Answer 7:

I had a similar issue ("Initial job has not accepted any resources") and fixed it by specifying the correct Spark download URL in spark-env.sh, or alternatively by installing Spark on all slaves.

export SPARK_EXECUTOR_URI=http://mirror.fibergrid.in/apache/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
