What is 'Active Jobs' in Spark History Server?

Posted 2019-08-23 09:53

Question:

I'm trying to understand the Spark History Server components. I know that the History Server shows completed Spark applications.

Nonetheless, I see 'Active Jobs' set to 1 for a completed Spark application. I'm trying to understand what 'Active Jobs' means in the Jobs section. Also, the application completed within 30 minutes, but when I opened the History Server after 8 hours, 'Duration' shows 8.0h. Please see the screenshot.

Could you please help me understand the 'Active Jobs', 'Duration' and 'Stages: Succeeded/Total' items in the image above?

Answer 1:

Invoking an action (count, in your case) inside a Spark application triggers the launch of a job to fulfill it. Spark examines the dataset on which that action depends and formulates an execution plan. The execution plan assembles the dataset transformations into stages.
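For illustration, here is a minimal, self-contained sketch (the app name and the toy dataset are made up) of that behaviour: the transformations are lazy, and only the count() action launches a job, which Spark then breaks into stages of one task per partition.

```scala
import org.apache.spark.sql.SparkSession

object CountJobExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CountJobExample")   // hypothetical app name
      .getOrCreate()

    // A toy dataset with 8 partitions, just for illustration.
    val rdd = spark.sparkContext.parallelize(1 to 1000000, 8)

    // Transformations are lazy: no job is launched here.
    val evens = rdd.filter(_ % 2 == 0)

    // The action triggers a job; Spark builds an execution plan
    // and splits it into stages (one task per partition).
    val n = evens.count()
    println(s"count = $n")

    spark.stop()
  }
}
```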

A stage is a physical unit of the execution plan. In short, a stage is a set of parallel tasks, one task per partition. Each job is divided into smaller sets of tasks called stages, and the stages depend on each other. It is somewhat similar to the map and reduce stages in MapReduce.

Each type of Spark stage in detail:

a. ShuffleMapStage: a ShuffleMapStage is an intermediate Spark stage in the physical execution of the DAG. It produces data for another stage (or stages); think of a ShuffleMapStage as input for the following stages in the DAG. Any number of pipelined operations, such as map and filter, can run inside a ShuffleMapStage before the shuffle operation. Furthermore, a single ShuffleMapStage can be shared among different jobs.

b. ResultStage: a ResultStage is the stage that executes a Spark action in a user program by running a function on an RDD. It is the final stage in a job: it applies a function to one or many partitions of the target RDD to compute the result of the action. The sketch below shows both stage types appearing in a single job.
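As a rough sketch (names and data are hypothetical): the word-count style job below has a shuffle boundary at reduceByKey, so Spark plans a ShuffleMapStage containing the pipelined map, followed by a ResultStage that computes the result of the collect() action.

```scala
import org.apache.spark.sql.SparkSession

object StageTypesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StageTypesExample").getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("spark", "yarn", "spark", "history", "spark"), 2)

    val counts = words
      .map(w => (w, 1))    // pipelined into the ShuffleMapStage
      .reduceByKey(_ + _)  // shuffle boundary: ends the ShuffleMapStage

    // collect() is the action; the stage computing it is the ResultStage.
    counts.collect().foreach(println)

    spark.stop()
  }
}
```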

Coming back to the question of active jobs on the History Server: there are some notes about this in the official History Server documentation, and there is also a JIRA issue, [SPARK-7889], about the same behaviour. For more details follow the link (source-1).



Answer 2:

Finally, after some research, I found the answer to my question.

A Spark application consists of a driver and one or more executors. The driver program instantiates a SparkContext, which coordinates the executors to run the Spark application. This is the information displayed in the 'Active Jobs' section of the Spark History Server Web UI.

The executors run tasks assigned by the driver.
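As a side note, the driver can inspect the same notion of "active" jobs programmatically through SparkStatusTracker. This is only a sketch (the app name and toy workload are invented), using countAsync so the job can be observed while it is still running:

```scala
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import org.apache.spark.sql.SparkSession

object ActiveJobsProbe {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ActiveJobsProbe").getOrCreate()
    val sc = spark.sparkContext

    // Kick off a job asynchronously so it can be observed while running.
    val pending = sc.parallelize(1 to 10000000, 16).map(_ * 2).countAsync()

    // Poll the status tracker: these ids are what the UI lists as active jobs.
    while (!pending.isCompleted) {
      val active = sc.statusTracker.getActiveJobIds()
      println(s"active job ids: ${active.mkString(", ")}")
      Thread.sleep(500)
    }

    println(s"count = ${Await.result(pending, Duration.Inf)}")
    spark.stop()
  }
}
```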

When a Spark application runs on YARN, it has its own implementation of a YARN client and a YARN application master. A YARN application has a YARN client, a YARN application master and a list of containers running on the node managers.

In my case YARN is running in standalone mode, so the driver program runs as a thread of the YARN application master. The YARN client pulls the status from the application master, and the application master coordinates the containers to run the tasks.

This running job can be monitored in the YARN applications page of the Cloudera Manager Admin Console while it is running.

If the application succeeds, the History Server shows the list of 'Completed Jobs' and the 'Active Jobs' section is removed.

If the application fails at the container level and YARN communicates this information to the driver, the History Server shows the list of 'Failed Jobs' and the 'Active Jobs' section is likewise removed.

Nonetheless, if the application fails at the container level and YARN cannot communicate that to the driver, the driver-instantiated job falls into limbo: the driver thinks the job is still running and keeps waiting to hear about its status from the YARN application master. Hence, the History Server still shows it as running under 'Active Jobs'.

So my takeaway from this is: to check the status of a running job, go to the YARN applications page in the Cloudera Manager Admin Console or use the YARN CLI (for example, `yarn application -list`). After the job completes or fails, open the Spark History Server to get more details on resource usage, the DAG and the execution timeline.