
WebHCat (Templeton) behavior for multiple child MR jobs by SynergyVT


Status: Closed as By Design
Type: Bug
ID: 786674
Opened: 5/9/2013 3:35:13 PM
Access Restriction: Public

Description


I am using the WebHCat REST interface to launch remote MapReduce jobs (POST mapreduce/jar). My application (the Main class specified as a parameter when creating the job) launches two separate MapReduce jobs, one after the other. So there are two child jobs under the Templeton controller job.

The job launch via WebHCat returns a job ID (the job ID of the Templeton controller job), which can be used to query the status of the running job (GET queue/:jobid). The controller job completes only after the Main class returns (in our case, after both child jobs complete).

Now, when we query for the job status, we observe that the status changes based on the state of the currently running child job, not the state of the controller job. As a result, the status can go from RUNNING (child job 1) -> SUCCESS (child job 1) -> RUNNING (child job 2) -> SUCCESS (child job 2).

Because of the above behavior, we cannot rely on the status returned by GET. Currently, we work around this by retrying the status ping a few more times (even after receiving SUCCESS) to confirm that there is no further state change (or child job spawned). This workaround is not fool-proof: if the time gap between the launches of the two child jobs is greater than the retry window used to confirm the SUCCESS status, we still report success prematurely.
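The retry workaround described above can be sketched as follows. This is a minimal illustration, not the actual application code: the helper name is hypothetical, and the HTTP call to GET queue/:jobid is abstracted behind a `fetch_status` callable so the confirmation logic (and its race window) is visible.

```python
import time

def wait_with_confirmation(fetch_status, confirm_pings=3, interval_s=5.0):
    """Poll until the reported job looks SUCCEEDED for `confirm_pings`
    consecutive checks. `fetch_status` returns the parsed JSON dict from
    GET queue/:jobid. Mirrors the imperfect workaround described above:
    if the next child job starts later than confirm_pings * interval_s
    after the previous one finishes, this still returns too early."""
    confirmed = 0
    while True:
        status = fetch_status()
        if status["status"]["jobComplete"] and status["status"]["runState"] == 2:
            confirmed += 1
            if confirmed >= confirm_pings:
                return status  # looked done long enough -- still not fool-proof
        else:
            confirmed = 0  # a new child job appeared; start confirming over
        time.sleep(interval_s)
```

The gap between `jobComplete` flipping true for child 1 and the GET reporting child 2 as RUNNING is exactly the race this sketch cannot close, which is why the report asks for a controller-level status instead.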

I have described the GET queue/:jobid results for my application below:

Job ID of the Templeton controller job: job_201305061951_0054

Query 1: Returns the status of the first child (job_201305061951_0055) as RUNNING

c:\apps\dist\hadoop-1.1.0-SNAPSHOT>curl http://localhost:50111/templeton/v1/queue/job_201305061951_0054?user.name=nabeel
{"status":"startTime":1368131303098,"jobPriority":"NORMAL","jobId":"job_201305061951_0055","jobID":"jtIdentifier":"201305061951","id":55},"runState":1,"jobComplete":false},"profile":{"url":"http://jobtrackerhost:50030/jobdetails.jsp?jobid=job_201305061951_0055","queueName":"default","jobId":"job_201305061951_0055","jobID":"jtIdentifier":"201305061951","id":55}},"id":"job_201305061951_0055","parentId":"job_201305061951_0054"}

Query 2: Returns the status of the first child (job_201305061951_0055) as SUCCEEDED. We cannot rely on this status as there is one more child job that needs to run.

c:\apps\dist\hadoop-1.1.0-SNAPSHOT>curl http://localhost:50111/templeton/v1/queue/job_201305061951_0054?user.name=nabeel
{"status":"startTime":1368131303098,"jobPriority":"NORMAL","jobId":"job_201305061951_0055","jobID":"jtIdentifier":"201305061951","id":55},"runState":2,"jobComplete":true},"profile":"url":"http://jobtrackerhost:50030/jobdetails.jsp?jobid=job_201305061951_0055","queueName":"default","jobId":"job_201305061951_0055","jobID":"jtIdentifier":"201305061951","id":55}},"id":"job_201305061951_0055","parentId":"job_201305061951_0054"}

Query 3: Returns the status of the second child (job_201305061951_0056) as RUNNING

c:\apps\dist\hadoop-1.1.0-SNAPSHOT>curl http://localhost:50111/templeton/v1/queue/job_201305061951_0054?user.name=nabeel
{"status":"startTime":1368131388272,"jobPriority":"NORMAL","jobId":"job_201305061951_0056","jobID":"jtIdentifier":"201305061951","id":56},"runState":1,"jobComplete":false},"profile":{"url":"http://jobtrackerhost:50030/jobdetails.jsp?jobid=job_201305061951_0056","queueName":"default","jobId":"job_201305061951_0056","jobID":"jtIdentifier":"201305061951","id":56}},"id":"job_201305061951_0056","parentId":"job_201305061951_0054"}

Query 4: Returns the status of the second child (job_201305061951_0056) as SUCCEEDED.

c:\apps\dist\hadoop-1.1.0-SNAPSHOT>curl http://localhost:50111/templeton/v1/queue/job_201305061951_0054?user.name=nabeel
{"status":"startTime":1368131388272,"jobPriority":"NORMAL","jobId":"job_201305061951_0056","jobID":"jtIdentifier":"201305061951","id":56},"runState":2,"jobComplete":true},"profile":{"url":"http://jobtrackerhost:50030/jobdetails.jsp?jobid=job_201305061951_0056","queueName":"default","jobId":"job_201305061951_0056","jobID":"jtIdentifier":"201305061951","id":56}},"id":"job_201305061951_0056","parentId":"job_201305061951_0054",}
Posted by Microsoft on 7/29/2013 at 6:00 PM
Hi,

This is by design. Please query the value of the "completed" field to get the status of the controller job. This field changes from null to "done" once the controller job finishes.

webResponse[status][runState] -> Gives the status of the currently running job.
webResponse[completed] -> Gives the status of the controller job.

Please let us know if this does not work as expected.

Here's a sample response:
{
    "status": {
        "startTime": 1375141954470,
        "jobPriority": "NORMAL",
        "jobID": {
            "jtIdentifier": "201307171347",
            "id": 21
        },
        "failureInfo": "NA",
        "runState": 2,
        "username": "hadoop",
        "schedulingInfo": "0 running map tasks using 0 map slots. 0 additional slots reserved. 1 running reduce tasks using 1 reduce slots. 0 additional slots reserved.",
        "jobId": "job_201307171347_0021",
        "jobACLs": {},
        "jobComplete": true
    },
    "profile": {
        "url": "http://localhost:50030/jobdetails.jsp?jobid=job_201307171347_0021",
        "user": "hadoop",
        "jobName": "PiEstimator",
        "queueName": "default",
        "jobID": {
            "jtIdentifier": "201307171347",
            "id": 21
        },
        "jobFile": "hdfs://localhost:8020/hadoop/hdfs/tmp/mapred/staging/hadoop/.staging/job_201307171347_0021/job.xml",
        "jobId": "job_201307171347_0021"
    },
    "id": "job_201307171347_0021",
    "parentId": "job_201307171347_0020",
    "percentComplete": "map 100% reduce 100%",
    "exitValue": 0,
    "user": "hadoop",
    "callback": null,
    "completed": "done",
    "userargs": {
        "statusdir": null,
        "define": [],
        "arg": [
            "10",
            "10"
        ],
        "files": null,
        "libjars": null,
        "user.name": "hadoop",
        "jar": "hadoop-examples-1.1.0-SNAPSHOT.jar",
        "callback": null,
        "class": "pi"
    }
}
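Per the guidance above, the reliable check reduces to reading the top-level "completed" field of the parsed response rather than the child-level runState. A minimal sketch (the function name is hypothetical; `response` is the parsed JSON returned by GET queue/:jobid):

```python
def controller_finished(response):
    """True once the Templeton controller job itself has finished.
    "completed" is null (None after JSON parsing) while the controller
    is still running and becomes the string "done" when it finishes,
    regardless of which child job is reported under "status"."""
    return response.get("completed") == "done"
```

Unlike polling `status.runState`, this check cannot flip back to "still running" between child jobs, so no confirmation retries are needed.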