What does setMaster `local[*]` mean in Spark?

Posted 2020-01-24 02:48

Question:

I found some code to start Spark locally with:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("test").setMaster("local[*]")
val ctx = new SparkContext(conf)

What does the [*] mean?

Answer 1:

From the doc:

./bin/spark-shell --master local[2]

The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing.

And from here:

local[*] Run Spark locally with as many worker threads as logical cores on your machine.



Answer 2:

Master URL : Meaning


local : Run Spark locally with one worker thread (i.e. no parallelism at all).


local[K] : Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).


local[K,F] : Run Spark locally with K worker threads and F maxFailures (see spark.task.maxFailures for an explanation of this variable)


local[*] : Run Spark locally with as many worker threads as logical cores on your machine.


local[*,F] : Run Spark locally with as many worker threads as logical cores on your machine and F maxFailures.


spark://HOST:PORT : Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.


spark://HOST1:PORT1,HOST2:PORT2 : Connect to the given Spark standalone cluster with standby masters with ZooKeeper. The list must have all the master hosts in the high availability cluster set up with ZooKeeper. The port must be whichever one each master is configured to use, which is 7077 by default.


mesos://HOST:PORT : Connect to the given Mesos cluster. The port must be whichever one your master is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.


yarn : Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.

https://spark.apache.org/docs/latest/submitting-applications.html
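To make the local forms in the table concrete, here is a minimal sketch in plain Scala that maps a master URL string to a thread count and an optional maxFailures value. It is only an illustration of the rules listed above, not Spark's own parsing code; `LocalMaster`, `LocalMasterSketch` and `parseLocalMaster` are hypothetical names that do not exist in Spark.

```scala
// Illustrative only: these names are hypothetical and not part of Spark.
final case class LocalMaster(threads: Int, maxFailures: Option[Int])

object LocalMasterSketch {
  // local[K] or local[*]
  private val LocalN = """local\[(\d+|\*)\]""".r
  // local[K,F] or local[*,F]
  private val LocalNFailures = """local\[(\d+|\*)\s*,\s*(\d+)\]""".r

  // "*" means one thread per processor visible to the JVM.
  private def threadCount(spec: String): Int =
    if (spec == "*") Runtime.getRuntime.availableProcessors() else spec.toInt

  // Returns None for anything that is not a local master URL (spark://, yarn, ...).
  def parseLocalMaster(master: String): Option[LocalMaster] = master match {
    case "local"              => Some(LocalMaster(1, None))
    case LocalN(k)            => Some(LocalMaster(threadCount(k), None))
    case LocalNFailures(k, f) => Some(LocalMaster(threadCount(k), Some(f.toInt)))
    case _                    => None
  }

  def main(args: Array[String]): Unit =
    Seq("local", "local[4]", "local[4,2]", "local[*]", "local[*,3]", "yarn")
      .foreach(m => println(s"$m -> ${parseLocalMaster(m)}"))
}
```

A maxFailures of None here just means the URL did not specify one; the sketch makes no claim about which default Spark then applies.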



Answer 3:

Some additional info:

Do not run Spark Streaming programs locally with master configured as "local" or "local[1]". This allocates only one CPU for tasks and if a receiver is running on it, there is no resource left to process the received data. Use at least "local[2]" to have more cores.

From Learning Spark: Lightning-Fast Big Data Analysis
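As a rough sketch of that advice, a local Spark Streaming program with a single receiver could be set up as follows. It assumes spark-core and spark-streaming are on the classpath and that a text source is listening on localhost:9999 (for example one started with `nc -lk 9999`). The app name, host, port, batch interval and the `StreamingLocalSketch`/`buildConf` names are all placeholders chosen for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingLocalSketch {
  // local[2]: one thread can run the receiver, leaving one to process the data.
  def buildConf(): SparkConf =
    new SparkConf().setAppName("streaming-test").setMaster("local[2]")

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(buildConf(), Seconds(1))
    // Assumed source: a plain-text socket server on localhost:9999.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.print()          // print the first elements of each batch
    ssc.start()            // start receiving and processing
    ssc.awaitTermination() // block until the job is stopped
  }
}
```

With "local" or "local[1]" instead, the receiver would occupy the only thread and no batches would be processed.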



Answer 4:

Master URL

You can run Spark in local mode using local, local[n] or the most general local[*] for the master URL.

The URL says how many threads can be used in total:

local uses 1 thread only.

local[n] uses n threads.

local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to know the number).

local[N, maxFailures] (called local-with-retries), where N is * or the number of threads to use (as explained above) and maxFailures is the value of spark.task.maxFailures.
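To see what local[*] would resolve to on a given machine, the same JVM call can be run directly without Spark. This is a small sketch; `LocalStarThreads` and `equivalentMaster` are hypothetical names used only for illustration. The count is whatever the JVM reports, which may differ from the number of physical cores.

```scala
object LocalStarThreads {
  // The local[N] master URL that local[*] should be equivalent to on this JVM.
  def equivalentMaster(): String =
    s"local[${Runtime.getRuntime.availableProcessors()}]"

  def main(args: Array[String]): Unit =
    println(s"local[*] should behave like ${equivalentMaster()} here")
}
```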