I found some code to start Spark locally:
val conf = new SparkConf().setAppName("test").setMaster("local[*]")
val ctx = new SparkContext(conf)
What does the [*] mean?
From the doc:
./bin/spark-shell --master local[2]
The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing.
And from here:
local[*] Run Spark locally with as many worker threads as logical cores on your machine.
Master URL : Meaning
local : Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K] : Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[K,F] : Run Spark locally with K worker threads and F maxFailures (see spark.task.maxFailures for an explanation of this variable)
local[*] : Run Spark locally with as many worker threads as logical cores on your machine.
local[*,F] : Run Spark locally with as many worker threads as logical cores on your machine and F maxFailures.
spark://HOST:PORT : Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.
spark://HOST1:PORT1,HOST2:PORT2 : Connect to the given Spark standalone cluster with standby masters with Zookeeper. The list must have all the master hosts in the high availability cluster set up with Zookeeper. The port must be whichever each master is configured to use, which is 7077 by default.
mesos://HOST:PORT : Connect to the given Mesos cluster. The port must be whichever one your master is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.
yarn : Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
https://spark.apache.org/docs/latest/submitting-applications.html
Some additional info
Do not run Spark Streaming programs locally with master configured as "local" or "local[1]". This allocates only one CPU for tasks, and if a receiver is running on it, there is no resource left to process the received data. Use at least "local[2]" to have more cores.
From Learning Spark: Lightning-Fast Big Data Analysis
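The starvation described in that quote can be reproduced with an ordinary JVM thread pool. The sketch below is only an analogy and does not use Spark or its scheduler: `ReceiverStarvation` and `processingRuns` are hypothetical names, the "receiver" is a task that simply holds a thread until released, and the 500 ms timeout is an arbitrary choice.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit, TimeoutException}

object ReceiverStarvation {
  /** Returns true if a processing task gets to run while a receiver holds a thread. */
  def processingRuns(threads: Int): Boolean = {
    val pool = Executors.newFixedThreadPool(threads)
    val stop = new CountDownLatch(1)
    try {
      // Stand-in for a streaming receiver: occupies one thread until told to stop.
      pool.submit(new Runnable { def run(): Unit = stop.await() })
      // Stand-in for the job that processes the received data.
      val work = pool.submit(new Runnable { def run(): Unit = () })
      try { work.get(500, TimeUnit.MILLISECONDS); true }
      catch { case _: TimeoutException => false }
    } finally {
      stop.countDown() // release the receiver so the pool can shut down
      pool.shutdown()
    }
  }
}
```

With one thread, `processingRuns(1)` returns false because the receiver never gives the thread back; with two, `processingRuns(2)` returns true. This mirrors the difference between "local[1]" and "local[2]" in the advice above.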
Master URL
You can run Spark in local mode using local, local[n] or the most general local[*] for the master URL.
The URL says how many threads can be used in total:
local uses 1 thread only.
local[n] uses n threads.
local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to know the number).
local[N, maxFailures] (called local-with-retries), with N being * or the number of threads to use (as explained above) and maxFailures being the value of spark.task.maxFailures.
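To make the rules above concrete, here is a minimal sketch of how a local master URL could be turned into a thread count. It is written from the descriptions in this thread, not copied from Spark's source: `LocalMaster`, `parse` and the two regular expressions are hypothetical names and patterns chosen for illustration.

```scala
object LocalMaster {
  // Patterns modelled on the forms listed above (illustrative, not Spark's own).
  private val LocalN = """local\[([0-9]+|\*)\]""".r
  private val LocalNFailures = """local\[([0-9]+|\*)\s*,\s*([0-9]+)\]""".r

  // "*" means: ask the JVM how many processors it can see.
  private def threads(spec: String): Int =
    if (spec == "*") Runtime.getRuntime.availableProcessors() else spec.toInt

  /** Returns (worker threads, optional maxFailures), or None if not a local URL. */
  def parse(master: String): Option[(Int, Option[Int])] = master match {
    case "local"              => Some((1, None))
    case LocalN(n)            => Some((threads(n), None))
    case LocalNFailures(n, f) => Some((threads(n), Some(f.toInt)))
    case _                    => None
  }
}
```

For example, `LocalMaster.parse("local[4]")` gives `Some((4, None))`, while `LocalMaster.parse("local[*]")` gives whatever `availableProcessors()` reports on the machine running it, so the result differs between a laptop and a container with a CPU limit.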