I have a question to better understand a big data concept in Apache Spark running on Hadoop. Not sure if it's off-topic in this forum, but let me know.
Imagine an Apache Hadoop cluster with 8 servers managed by the YARN resource manager. I uploaded a file into HDFS (the distributed file system), which is configured with a 64 MB block size and a replication factor of 3. The file is then split into blocks of 64 MB. Now let's imagine that HDFS distributed these blocks onto nodes 1, 2 and 3.
But now I'm writing some Python code in a Jupyter notebook. The notebook is started with this command:
```
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
pyspark --master yarn-client --num-executors 3 --executor-cores 4 \
        --executor-memory 16G
```
Within the notebook I'm loading the file from HDFS to do some analytics. When I execute my code, I can see in the YARN web UI that I get 3 executors and how the jobs are submitted (distributed) to those executors.
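The loading code itself isn't important; it boils down to something like this (the path is just a placeholder):

```python
# 'sc' is the SparkContext that pyspark creates for the notebook session
rdd = sc.textFile("hdfs:///data/myfile.txt")   # placeholder path; roughly one partition per 64 MB block
result = rdd.count()                           # triggers the actual job on the executors
```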
The interesting part is that my executors are pinned to specific compute nodes right after the start command above, for instance nodes 6, 7 and 8.
My questions are:
- Is my assumption correct that the executors are fixed to those compute nodes, and that the HDFS blocks will be transferred to the executors once I access (load) the file from HDFS?
- Or are the executors dynamically assigned and started on the nodes where the data is (nodes 1, 2 and 3)? In that case my observation in the YARN web UI must be wrong.
I'm really interested in understanding this better.
Are Jupyter notebook executors distributed dynamically in Apache Spark?
For the sake of clarity, let's distinguish:
- Jupyter notebooks and their associated kernels: a kernel is the Python process behind a notebook's UI. It executes whatever code you type and submit in your notebook. Kernels are managed by Jupyter, not by Spark (see the sketch after this list).
- Spark executors: the compute resources allocated on the YARN cluster to execute Spark jobs.
- HDFS data nodes: the machines where your data resides. They may or may not be the same machines as the executor nodes.
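A quick way to see the kernel/executor distinction from the notebook itself is a sketch like the following. It assumes the `sc` SparkContext that pyspark provides and uses a placeholder HDFS path:

```python
import socket

# Placeholder path; textFile() only references the HDFS blocks lazily.
rdd = sc.textFile("hdfs:///data/myfile.txt")

# This line runs in the Jupyter kernel, which is also the Spark driver in yarn-client mode.
driver_host = socket.gethostname()

# The lambda passed to map() runs on the YARN executors, not in the kernel.
executor_hosts = (rdd
                  .map(lambda _: socket.gethostname())
                  .distinct()
                  .collect())

print("kernel/driver:", driver_host)
print("executors that processed partitions:", executor_hosts)
```

The hostname printed for the driver is the machine running your Jupyter kernel, while the collected hostnames are the executor nodes that YARN handed out.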
Is my assumption correct that the executors are fixed to those compute nodes, and that the HDFS blocks will be transferred to the executors once I access (load) the file from HDFS?
Yes and no. Yes, Spark takes data locality into account when scheduling tasks. No, there is no guarantee. As per the Spark documentation on data locality:
(...) there are two options: a) wait until a busy CPU frees up to start a task on data on the same server, or b) immediately start a new task in a farther away place that requires moving data there.
What Spark typically does is wait a bit in the hopes that a busy CPU frees up. Once that timeout expires, it starts moving the data from far away to the free CPU. (...)
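The timeout mentioned above is controlled by spark.locality.wait (with per-level variants). A minimal sketch with illustrative values follows; in a yarn-client setup like the one above these would normally be passed as --conf options when launching pyspark rather than set in code:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("locality-wait-demo")
        .set("spark.locality.wait", "10s")        # how long to wait for a more data-local slot
        .set("spark.locality.wait.node", "10s")   # wait for a NODE_LOCAL slot before falling back
        .set("spark.locality.wait.rack", "5s"))   # wait for RACK_LOCAL before running the task anywhere

sc = SparkContext(conf=conf)
```

The Locality Level column on a stage's page in the Spark UI (PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY) shows which of the two options was taken for each task.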
Or are the executors dynamically assigned and started on the nodes where the data is (nodes 1, 2 and 3)?
This depends on the configuration. With dynamic allocation enabled, executors are allocated to a Spark application (i.e. to its SparkContext) on demand and released when no longer used. However, idle executors are kept alive for a while, as per the Job Scheduling documentation:
(...) A Spark application removes an executor when it has been idle for more than spark.dynamicAllocation.executorIdleTimeout seconds. (...)
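Note that dynamic allocation is not on by default: with --num-executors 3 in the launch command above, the three executors are requested up front and kept for the lifetime of the application, which matches what the YARN web UI shows. Here is a minimal sketch of the settings involved if you did want dynamic allocation (illustrative values; on YARN it typically also requires the external shuffle service):

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("dynamic-allocation-demo")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")            # typically required on YARN
        .set("spark.dynamicAllocation.minExecutors", "1")        # illustrative values
        .set("spark.dynamicAllocation.maxExecutors", "8")
        .set("spark.dynamicAllocation.executorIdleTimeout", "60s"))

sc = SparkContext(conf=conf)
```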
To get more control over how resources are divided between the jobs running inside your application, you may use Scheduler Pools.
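A minimal sketch of using a pool from the notebook. It assumes FAIR scheduling is enabled (spark.scheduler.mode=FAIR) and that a pool named "analytics" is defined in the allocation file referenced by spark.scheduler.allocation.file:

```python
# Jobs submitted from this thread now go to the "analytics" pool (hypothetical pool name).
sc.setLocalProperty("spark.scheduler.pool", "analytics")

result = sc.textFile("hdfs:///data/myfile.txt").count()   # placeholder path

# Revert to the default pool for subsequent jobs.
sc.setLocalProperty("spark.scheduler.pool", None)
```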