Wednesday, July 29, 2015

What is a Job Tracker in Hadoop? How many instances of Job Tracker run on a Hadoop Cluster?

The Job Tracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. Only one Job Tracker process runs on any Hadoop cluster, in its own JVM; in a typical production cluster it runs on a dedicated machine. Each slave node is configured with the Job Tracker's location. The Job Tracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. The Job Tracker performs the following actions:
  • The Job Tracker locates Task Tracker nodes with available slots at or near the data
  • The Job Tracker submits the work to the chosen Task Tracker nodes.
  • The Task Tracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different Task Tracker.
  • A Task Tracker will notify the Job Tracker when a task fails. The Job Tracker then decides how to respond: it may resubmit the task elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the Task Tracker as unreliable.
  • When the work is completed, the Job Tracker updates its status.
  • Client applications can poll the Job Tracker for information (see the sketch after this list).
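
To make the last bullet concrete, here is a minimal sketch using the classic Hadoop 1.x "mapred" API: a client submits a job to the Job Tracker through JobClient and then polls the returned RunningJob for progress. The job name and HDFS paths are placeholders, and the job relies on the default identity mapper and reducer.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class JobTrackerPollExample {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(JobTrackerPollExample.class);
            conf.setJobName("poll-example");                                   // placeholder name
            FileInputFormat.setInputPaths(conf, new Path("/user/demo/in"));    // placeholder path
            FileOutputFormat.setOutputPath(conf, new Path("/user/demo/out")); // placeholder path

            JobClient client = new JobClient(conf);
            RunningJob job = client.submitJob(conf);  // hands the job to the Job Tracker

            // Poll the Job Tracker until the job finishes.
            while (!job.isComplete()) {
                System.out.printf("map %.0f%%  reduce %.0f%%%n",
                        job.mapProgress() * 100, job.reduceProgress() * 100);
                Thread.sleep(5000);
            }
            System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");
        }
    }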

How does the Job Tracker schedule a task?

The Task Trackers send heartbeat messages to the Job Tracker, typically every few seconds, to reassure the Job Tracker that they are still alive. These messages also report the number of available slots, so the Job Tracker can stay up to date with where in the cluster work can be delegated. When the Job Tracker needs to schedule a map task, it first looks for an empty slot on the same server that hosts the DataNode containing that task's input data; failing that, it looks for an empty slot on a machine in the same rack.
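
The following is an illustrative sketch of that locality order, not the actual Job Tracker scheduler code; the TrackerStatus class and the pick method are hypothetical names standing in for the information that heartbeats report. It prefers a node-local slot, then a rack-local one, then any free slot.

    import java.util.List;
    import java.util.Optional;

    public class LocalityAwareScheduler {
        /** A hypothetical view of a Task Tracker as reported in its last heartbeat. */
        static class TrackerStatus {
            final String host;       // machine the Task Tracker runs on
            final String rack;       // rack that machine sits in
            final int freeMapSlots;  // open map slots reported in the heartbeat
            TrackerStatus(String host, String rack, int freeMapSlots) {
                this.host = host; this.rack = rack; this.freeMapSlots = freeMapSlots;
            }
        }

        /**
         * Pick a tracker for a map task whose input block lives on dataHost/dataRack:
         * node-local first, then rack-local, then any tracker with a free slot.
         */
        static Optional<TrackerStatus> pick(List<TrackerStatus> trackers,
                                            String dataHost, String dataRack) {
            Optional<TrackerStatus> nodeLocal = trackers.stream()
                    .filter(t -> t.freeMapSlots > 0 && t.host.equals(dataHost))
                    .findFirst();
            if (nodeLocal.isPresent()) return nodeLocal;

            Optional<TrackerStatus> rackLocal = trackers.stream()
                    .filter(t -> t.freeMapSlots > 0 && t.rack.equals(dataRack))
                    .findFirst();
            if (rackLocal.isPresent()) return rackLocal;

            return trackers.stream().filter(t -> t.freeMapSlots > 0).findFirst();
        }
    }

This preference order matters because reading a block from the local disk is cheaper than pulling it from another machine in the same rack, and far cheaper than crossing racks.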
