How to submit Apache Spark job to Hadoop YARN on A

I am very excited that HDInsight switched to Hadoop version 2, which supports Apache Spark through YARN. Apache Spark is a much better fitting parallel programming paradigm than MapReduce for the task that I want to perform.

I was unable to find any documentation however on how to do remote job submission of a Apache Spark job to my HDInsight cluster. For remote job submission of standard MapReduce jobs I know that there are several REST endpoints like Templeton and Oozie. But as for as I was able to find, running Spark jobs is not possible through Templeton. I did find it to be possible to incorporate Spark jobs into Oozie, but I've read that this is a very tedious thing to do and also I've read some reports of job failure detection not working in this case.

Probably there must be a more appropriate way to submit Spark jobs. Does anyone know how to do remote job submissions of Apache Spark jobs to HDInsight?

Many thanks in advance!

标签： azure apache-spark hdinsight

3条回答

姐就是有狂的资本

2楼-- · 2019-06-20 07:04

I also asked the same question with Azure guys. Following is the solution from them:

"Two questions to the topic: 1. How can we submit a job outside of the cluster without "Remote to…" — Tao Li

Currently, this functionality is not supported. One workaround is to build job submission web service yourself:

Create Scala web service that will use Spark APIs to start jobs on the cluster.
Host this web service in the VM inside the same VNet as the cluster.
Expose web service end-point externally through some authentication scheme. You can also employ intermediate map reduce job, it would take longer though.

0人赞添加讨论(0) 举报

贼婆χ

3楼-- · 2019-06-20 07:11

You can install spark on a hdinsight cluster. You have to do it at by creating a custom cluster and adding an action script that will install Spark on the cluster at the time it creates the VMs for the Cluster.

To install with an action script on cluster install is pretty easy, you can do it in C# or powershell by adding a few lines of code to a standard custom create cluster script/program.

powershell:

# ADD SCRIPT ACTION TO CLUSTER CONFIGURATION
$config = Add-AzureHDInsightScriptAction -Config $config -Name "Install Spark" -ClusterRoleCollection HeadNode -Urin https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv02/spark-installer-v02.ps1

C#:

// ADD THE SCRIPT ACTION TO INSTALL SPARK
clusterInfo.ConfigActions.Add(new ScriptAction(
  "Install Spark", // Name of the config action
  new ClusterNodeType[] { ClusterNodeType.HeadNode }, // List of nodes to install Spark on
  new Uri("https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv02/spark-installer-v02.ps1"), // Location of the script to install Spark
  null //because the script used does not require any parameters.
));

you can then RDP into the headnode and run use the spark-shell or use spark-submit to run jobs. I am not sure how would run spark job and not rdp into the the headnode but that is an other question.

0人赞添加讨论(0) 举报

神经病院院长

4楼-- · 2019-06-20 07:24

You might consider using Brisk (https://brisk.elastatools.com) which offers Spark on Azure as a provisioned service (with support available). There's a free tier and it lets you access blob storage with a wasb://path/to/files just like HDInsight.

It doesn't sit on YARN; instead it is a lightweight and Azure oriented distribution of Spark.

Disclaimer: I work on the project!

Best wishes,

Andy

0人赞添加讨论(0) 举报

How to submit Apache Spark job to Hadoop YARN on A

"Two questions to the topic: 1. How can we submit a job outside of the cluster without "Remote to…" — Tao Li

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间