What is the relationship between workers, worker i

In Spark Standalone mode, there are master and worker nodes.

Here are a few questions:

Do 2 worker instances mean one worker node with 2 worker processes?
Does every worker instance hold an executor for a specific application (which manages storage and tasks), or does one worker node hold one executor?
Is there a flow chart explaining how the Spark runtime works, for example for word count?

Tags: apache-spark apache-spark-standalone

4 answers

男人必须洒脱

#2 · 2019-01-16 06:21

Extending the other great answers, I would like to describe this with a few images.

In Spark Standalone mode, there are a master node and worker nodes.

For standalone mode, the master and the workers can be represented together in one place.

If you are curious about how Spark works with YARN, check this post: Spark on YARN.

1. Do 2 worker instances mean one worker node with 2 worker processes?

In general, we call a worker instance a slave, as it is a process that executes Spark tasks/jobs. The suggested mapping between a node (a physical or virtual machine) and a worker is:

1 Node = 1 Worker process

2. Does every worker instance hold an executor for a specific application (which manages storage and tasks), or does one worker node hold one executor?

Yes, a worker node can hold multiple executors (processes) if it has sufficient CPU, memory, and storage.

Check the worker node in the given image. [Image: a worker node in a cluster]

BTW, the number of executors on a worker node at a given point in time depends entirely on the workload on the cluster and on how many executors the node is capable of running.

3. Is there a flow chart explaining how the Spark runtime works?

Consider the execution from Spark's perspective, over any resource manager, of a program that joins two RDDs, performs a reduce operation, and then filters the result.
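To make the join, reduce, and filter sequence concrete, here is a minimal plain-Python sketch that mimics the steps on small in-memory lists. It is not Spark code: the helper names `join` and `reduce_by_key` and the sample data are assumptions made up for illustration. The comments mark which steps would typically require a shuffle across executors in Spark and which can run record by record within a partition.

```python
from collections import defaultdict

def join(left, right):
    """Inner join of two lists of (key, value) pairs (hypothetical helper)."""
    by_key = defaultdict(list)
    for k, v in right:
        by_key[k].append(v)
    return [(k, (lv, rv)) for k, lv in left for rv in by_key.get(k, [])]

def reduce_by_key(pairs, fn):
    """Combine all values that share a key using fn (hypothetical helper)."""
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return acc

# Made-up sample data: (user, amount) pairs in two datasets.
orders = [("a", 10), ("b", 5), ("a", 7)]
refunds = [("a", 2), ("b", 5)]

# Join: records with the same key must meet, so Spark typically shuffles here.
joined = join(orders, refunds)
# Map: a per-record step that needs no data movement.
net = [(k, o - r) for k, (o, r) in joined]
# Reduce: aggregate per key, which again groups data by key.
totals = reduce_by_key(net, lambda x, y: x + y)
# Filter: another per-record step.
result = {k: v for k, v in totals.items() if v > 0}
print(result)
```

In Spark terms, the per-record steps are what executors run as tasks inside a stage, while the key-grouping steps are where stage boundaries usually fall.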

HIH

迷人小祖宗

#3 · 2019-01-16 06:27

I know this is an old question and Sean's answer was excellent. My writeup is about the SPARK_WORKER_INSTANCES in MrQuestion's comment. If you use Mesos or YARN as your cluster manager, you can run multiple executors on the same machine with one worker, so there is really no need to run multiple workers per machine. However, if you use the standalone cluster manager, it currently still allows only one executor per worker process on each physical machine. So if you have a very large machine and would like to run multiple executors on it, you have to start more than one worker process. That is what SPARK_WORKER_INSTANCES in spark-env.sh is for. The default value is 1. If you do use this setting, make sure you set SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each worker will try to use all the cores.

This standalone cluster manager limitation should go away soon. According to SPARK-1706, this issue will be fixed and released in Spark 1.4.
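As a rough illustration of the advice about setting SPARK_WORKER_CORES alongside SPARK_WORKER_INSTANCES, the sketch below splits a machine's resources evenly across worker processes and prints lines that could go into spark-env.sh. The function name `worker_env` and the 32-core, 128 GB machine are assumptions for the example. The even split is also a simplification, since in practice you would leave some headroom for the operating system.

```python
def worker_env(total_cores, total_mem_gb, instances):
    """Split a machine's cores and memory evenly across worker instances.

    Hypothetical helper: returns spark-env.sh variable names mapped to values.
    """
    if instances < 1:
        raise ValueError("instances must be >= 1")
    cores = total_cores // instances
    mem_gb = total_mem_gb // instances
    if cores < 1 or mem_gb < 1:
        raise ValueError("not enough cores or memory for that many workers")
    return {
        "SPARK_WORKER_INSTANCES": str(instances),
        "SPARK_WORKER_CORES": str(cores),        # cores per worker, not per machine
        "SPARK_WORKER_MEMORY": f"{mem_gb}g",     # memory per worker
    }

# Assumed machine size, for illustration only.
env = worker_env(total_cores=32, total_mem_gb=128, instances=4)
for name, value in env.items():
    print(f"export {name}={value}")
```

Without the explicit per-worker core limit, each of the four workers in this example would try to claim all 32 cores.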

劫难

#4 · 2019-01-16 06:39

As Lan was saying, the use of multiple worker instances is only relevant in standalone mode. There are two reasons why you might want multiple instances: (1) garbage collector pauses can hurt throughput for large JVMs; (2) a heap size of more than 32 GB can't use CompressedOops.

Read more about how to set up multiple worker instances.
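The heap-size reasoning above can be turned into a quick back-of-the-envelope calculation. The sketch below estimates how many equally sized worker JVMs keep each heap at or below a chosen limit. The function name `workers_for_heap_limit` and the 31 GB default are assumptions for illustration; the exact threshold at which compressed object pointers stop applying depends on the JVM and its settings.

```python
import math

def workers_for_heap_limit(node_mem_gb, max_heap_gb=31):
    """Smallest number of equal-sized worker JVMs whose heaps each fit in max_heap_gb.

    Hypothetical helper; max_heap_gb=31 is an assumed safety margin under 32 GB.
    """
    if node_mem_gb <= 0 or max_heap_gb <= 0:
        raise ValueError("memory sizes must be positive")
    return math.ceil(node_mem_gb / max_heap_gb)

# Assumed node size, for illustration only.
node_mem_gb = 256
n = workers_for_heap_limit(node_mem_gb)
print(n, "workers, about", round(node_mem_gb / n, 1), "GB each")
```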

疯言疯语

#5 · 2019-01-16 06:42

I suggest reading the Spark cluster docs first, but even more so this Cloudera blog post explaining these modes.

Your first question depends on what you mean by 'instances'. A node is a machine, and there's not a good reason to run more than one worker per machine. So two worker nodes typically means two machines, each a Spark worker.

Workers hold many executors, for many applications. One application has executors on many workers.

Your third question is not clear.
