How to get the status of all jobs through the Spark REST API?

Posted 2019-06-26 11:07

I am using Spark 1.5.1 and I'd like to retrieve the status of all jobs through the REST API.

I get the correct result using /api/v1/applications/{appId}, but when accessing /api/v1/applications/{appId}/jobs I get a "no such app: {appId}" response.

How should I pass the app ID here to retrieve the job statuses of an application through the Spark REST API?

5 Answers
你好瞎i · 2019-06-26 11:21

Spark provides four hidden RESTful APIs:

1) Submit a job - curl -X POST http://SPARK_MASTER_IP:6066/v1/submissions/create (a fuller example with the JSON request body is sketched after this list)

2) Kill a job - curl -X POST http://SPARK_MASTER_IP:6066/v1/submissions/kill/driver-id

3) Check the status of a job - curl http://SPARK_MASTER_IP:6066/v1/submissions/status/driver-id

4) Status of the Spark cluster - http://SPARK_MASTER_IP:8080/json/
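A minimal sketch of a create call with its JSON request body, assuming a standalone master at SPARK_MASTER_IP:6066; the jar path, main class, and app name are hypothetical placeholders:

curl -X POST http://SPARK_MASTER_IP:6066/v1/submissions/create \
  --header "Content-Type: application/json" \
  --data '{
    "action": "CreateSubmissionRequest",
    "appResource": "file:/path/to/my-app.jar",
    "mainClass": "com.example.MyApp",
    "appArgs": [],
    "clientSparkVersion": "1.5.1",
    "environmentVariables": { "SPARK_ENV_LOADED": "1" },
    "sparkProperties": {
      "spark.app.name": "MyApp",
      "spark.master": "spark://SPARK_MASTER_IP:7077",
      "spark.jars": "file:/path/to/my-app.jar"
    }
  }'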

If you want to use other APIs, you can try Livy; see also the Lucidworks guide: https://doc.lucidworks.com/fusion/3.0/Spark_ML/Spark-Getting-Started.html

成全新的幸福 · 2019-06-26 11:31

This is supposed to work when accessing a live driver's API endpoints, but since you're using Spark 1.5.x, I think you're running into SPARK-10531, a bug where the Spark Driver UI mixes up application names and application ids. As a result, you have to use the application name in the REST API URL, e.g.

http://localhost:4040/api/v1/applications/Spark%20shell/jobs

According to the JIRA ticket, this only affects the Spark Driver UI; application IDs should work as expected with the Spark History Server's API endpoints.

This is fixed in Spark 1.6.0, which should be released soon. If you want a workaround that works across Spark versions, though, try the following approach:

The api/v1/applications endpoint misreports application names as application ids, so you should be able to hit that endpoint, extract the id field (which is actually an application name), and then use it to construct the URL for the current application's job list. Note that the /applications endpoint on the Spark Driver UI will only ever return a single application, which is why this approach is safe: we don't have to worry about application names being non-unique. For example, in Spark 1.5.2 the /applications endpoint can return a response containing a record like

{
  "id" : "Spark shell",
  "name" : "Spark shell",
  "attempts" : [ {
    "startTime" : "2015-09-10T06:38:21.528GMT",
    "endTime" : "1969-12-31T23:59:59.999GMT",
    "sparkUser" : "",
    "completed" : false
  } ]
}

If you use the contents of this id field to construct the applications/<id>/jobs URL then your code should be future-proofed against upgrades to Spark 1.6.0, since the id field will begin reporting the proper IDs in Spark 1.6.0+.
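For example, a minimal shell sketch of this workaround, assuming the driver UI is on localhost:4040 and jq is available for JSON parsing and percent-encoding:

# grab the (single) application record and percent-encode its id field
APP_ID=$(curl -s http://localhost:4040/api/v1/applications | jq -r '.[0].id | @uri')

# works on 1.5.x (where id is really the app name) and on 1.6.0+ (where id is the real application id)
curl -s "http://localhost:4040/api/v1/applications/${APP_ID}/jobs"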

仙女界的扛把子 · 2019-06-26 11:37

Spark has some hidden RESTful APIs that you can try. Note that I have not tried them myself yet.

For example, to get the status of a submitted application you can run: curl http://spark-cluster-ip:6066/v1/submissions/status/driver-20151008145126-0000

Note: "driver-20151008145126-0000" is submitsionId.

You can take a deeper look at this link: http://arturmkrtchyan.com/apache-spark-hidden-rest-api

趁早两清 · 2019-06-26 11:38

If you want to use the REST API to control Spark, you're probably best off adding the Spark Jobserver to your installation, which gives you a much more comprehensive REST API than the private REST APIs you're currently querying; a rough sketch of its endpoints follows.
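A hedged sketch, assuming a default Jobserver on localhost:8090 and going by its documented endpoints (the jar name my-app and class path com.example.MyJob are placeholders):

# upload your application jar under a name of your choosing
curl --data-binary @my-app.jar localhost:8090/jars/my-app

# start a job; the response contains a jobId
curl -d "" "localhost:8090/jobs?appName=my-app&classPath=com.example.MyJob"

# list all jobs with their statuses
curl localhost:8090/jobs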

Poking around, I've managed to get the job status for a single application by running

curl http://127.0.0.1:4040/api/v1/applications/Spark%20shell/jobs/

which returned

[ {
  "jobId" : 0,
  "name" : "parquet at <console>:19",
  "submissionTime" : "2015-12-21T10:46:02.682GMT",
  "stageIds" : [ 0 ],
  "status" : "RUNNING",
  "numTasks" : 2,
  "numActiveTasks" : 2,
  "numCompletedTasks" : 0,
  "numSkippedTasks" : 0,
  "numFailedTasks" : 0,
  "numActiveStages" : 1,
  "numCompletedStages" : 0,
  "numSkippedStages" : 0,
  "numFailedStages" : 0 }]
可以哭但决不认输i · 2019-06-26 11:43

For those who have this problem and are running on YARN:

According to the docs,

when running in YARN cluster mode, [app-id] will actually be [base-app-id]/[attempt-id], where [base-app-id] is the YARN application ID

So if your call to https://HOST:PORT/api/v1/applications/application_12345678_0123 returns something like

{
  "id" : "application_12345678_0123",
  "name" : "some_name",
  "attempts" : [ {
    "attemptId" : "1",
    <...snip...>
  } ]
}

you can get, for example, the jobs by calling

https://HOST:PORT/api/v1/applications/application_12345678_0123/1/jobs

(note the "1" before "/jobs").
