Run a MapReduce job via rest api

Published 2019-04-12 19:26

Question:

I use Hadoop 2.7.1's REST API to run a MapReduce job from outside the cluster. This example "http://hadoop-forum.org/forum/general-hadoop-discussion/miscellaneous/2136-how-can-i-run-mapreduce-job-by-rest-api" really helped me. But when I submit a POST request, some strange things happen:

  1. When I look at "http://master:8088/cluster/apps", a single POST request produces two jobs (the attached screenshot showed both applications listed in the cluster UI).

  2. After waiting a long time, the job I defined in the HTTP request body fails with a FileAlreadyExistsException. The reason is that the other job has already created the output directory, so: Output directory hdfs://master:9000/output/output16 already exists.

This is my request body:

{
    "application-id": "application_1445825741228_0011",
    "application-name": "wordcount-demo",
    "am-container-spec": {
        "commands": {
            "command": "{{HADOOP_HOME}}/bin/hadoop jar /home/hadoop/hadoop-2.7.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /data/ /output/output16"
        },
        "environment": {
            "entry": [{
                "key": "CLASSPATH",
                "value": "{{CLASSPATH}}<CPS>./*<CPS>{{HADOOP_CONF_DIR}}<CPS>{{HADOOP_COMMON_HOME}}/share/hadoop/common/*<CPS>{{HADOOP_COMMON_HOME}}/share/hadoop/common/lib/*<CPS>{{HADOOP_HDFS_HOME}}/share/hadoop/hdfs/*<CPS>{{HADOOP_HDFS_HOME}}/share/hadoop/hdfs/lib/*<CPS>{{HADOOP_YARN_HOME}}/share/hadoop/yarn/*<CPS>{{HADOOP_YARN_HOME}}/share/hadoop/yarn/lib/*<CPS>./log4j.properties"
            }]
        }
    },
    "unmanaged-AM": false,
    "max-app-attempts": 2,
    "resource": {
        "memory": 1024,
        "vCores": 1
    },
    "application-type": "MAPREDUCE",
    "keep-containers-across-application-attempts": false
}

and this is my command:

curl -i -X POST -H 'Accept: application/json' -H 'Content-Type: application/json' 'http://master:8088/ws/v1/cluster/apps?user.name=hadoop' -d @post-json.txt

Can anybody help me? Thanks a lot.

Answer 1:

When you run a MapReduce job, make sure the output folder does not already exist, because the job will not run if it is present. You can write your program so that it deletes the folder if it exists, or delete it manually before calling the REST API. This check is there to prevent data loss, i.e. to stop one job from overwriting the output of another.
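As a minimal local sketch of the "delete the output directory if it exists" pattern (using the local filesystem for illustration; the function name and paths here are hypothetical, not part of any Hadoop API):

```python
import shutil
from pathlib import Path

def ensure_output_absent(path: str) -> None:
    """Remove the job's output directory if it already exists,
    so the job does not fail with FileAlreadyExistsException."""
    p = Path(path)
    if p.exists():
        shutil.rmtree(p)

# Hypothetical example: clear a stale output directory before submitting.
ensure_output_absent("/tmp/output/output16")
```

On a real cluster the equivalent would be `hdfs dfs -rm -r /output/output16` before submitting, or calling `FileSystem.delete(path, true)` from a Java driver.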