In a previous article, I described how a file gets distributed across a Hadoop cluster.
In this article, I’ll describe how jobs are scheduled in Hadoop to make use of that data.
One of the key benefits Hadoop offers is that rather than fetching data from a distributed file cluster to a central server and executing a job against it there, Hadoop sends the job to where the data is. The diagram below explains how this works:-
In order that a particular application does not absorb too many resources (CPU, memory, I/O), the concept of a resource container has been created: an application can use only the resources that its container allocates to it.
A node can refer to a single server, or to a server that has been logically partitioned so that each partition appears to a client to be a separate server.
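The container idea above can be sketched in a few lines of Java. This is an illustrative toy only: the class and field names are invented for this article, not the real YARN API, but the resource dimensions (memory in MB, virtual cores) mirror what YARN containers actually cap.

```java
// Toy sketch of a resource container: a task may only run if its
// requested resources fit within the limits the container was granted.
public class ContainerSketch {
    static class Container {
        final int memoryMb;  // memory ceiling for the task
        final int vcores;    // CPU ceiling for the task

        Container(int memoryMb, int vcores) {
            this.memoryMb = memoryMb;
            this.vcores = vcores;
        }

        // A request is admitted only if it fits within both limits.
        boolean canRun(int requestedMemoryMb, int requestedVcores) {
            return requestedMemoryMb <= memoryMb && requestedVcores <= vcores;
        }
    }

    public static void main(String[] args) {
        Container c = new Container(1024, 2);   // 1 GB of memory, 2 virtual cores
        System.out.println(c.canRun(512, 1));   // prints true: fits within the cap
        System.out.println(c.canRun(2048, 1));  // prints false: exceeds the memory cap
    }
}
```

The point is simply that the cap is decided when the container is allocated, not by the task itself, which is what stops one greedy application from starving its neighbours.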
1. The client application submits a job to the resource manager.
2. The resource manager takes the job from the job queue and allocates it to an application master. It also manages and monitors resource allocations to each application master and container on the data nodes.
3. The application master divides the job into tasks and allocates them to the data nodes.
4. On each data node, a node manager manages the containers in which the tasks run.
5. The application master will ask the resource manager to allocate more resources to particular containers, if necessary.
6. The application master will keep the resource manager informed as to the status of the jobs allocated to it, and the resource manager will keep the client application informed.
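The flow above can be simulated in miniature. The sketch below is purely illustrative — the class names (`Job`, `SchedulingFlowSketch`) are invented for this article and are not the real YARN client API — but it mirrors the division of labour: the job is split into one task per input block, each task is placed on a data node, and a status summary flows back for the client.

```java
import java.util.*;

// Toy simulation of the scheduling flow: client -> resource manager ->
// application master -> node managers. All names are invented for
// illustration; this is not the real YARN API.
public class SchedulingFlowSketch {
    static class Job {
        final String name;
        final List<String> inputBlocks;
        Job(String name, List<String> inputBlocks) {
            this.name = name;
            this.inputBlocks = inputBlocks;
        }
    }

    // Step 3: the application master divides the job into tasks,
    // one per input block, so each task can run where its block lives.
    static Map<String, String> splitIntoTasks(Job job) {
        Map<String, String> tasks = new LinkedHashMap<>();
        int i = 0;
        for (String block : job.inputBlocks) {
            tasks.put(job.name + "-task-" + i++, block);
        }
        return tasks;
    }

    // Steps 1-2 and 4-6 collapsed into one driver: the job is accepted,
    // tasks are fanned out to containers on the data nodes, and a status
    // line is returned for the client.
    static String runJob(Job job) {
        Map<String, String> tasks = splitIntoTasks(job);
        for (Map.Entry<String, String> t : tasks.entrySet()) {
            // In the toy model, "running" a task is just printing its placement.
            System.out.println(t.getKey() + " runs in a container on the node holding " + t.getValue());
        }
        return job.name + ": " + tasks.size() + " tasks completed";
    }

    public static void main(String[] args) {
        Job job = new Job("wordcount", Arrays.asList("block-1", "block-2", "block-3"));
        System.out.println(runJob(job));  // prints "wordcount: 3 tasks completed"
    }
}
```

Notice that the client only ever sees the final status line — everything between submission and completion is hidden behind the resource manager, which is exactly the first advantage listed below.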
The advantages of this architecture include:-
1. A client application does not have to worry about how Hadoop works internally; it communicates only with the resource manager.
2. Resources are allocated based on the needs of each task and managed appropriately, so that one task does not grab all of the resources of a particular node.
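Finally, the "send the job to where the data is" idea from the start of the article can be sketched as a simple placement rule. This is an assumption-laden toy, not the real YARN scheduler: given a map of which nodes hold replicas of each block, it places a task on the least-loaded node that already holds the block, so the task reads locally instead of pulling the block over the network.

```java
import java.util.*;

// Toy sketch of data-local task placement. Names are invented for
// illustration; a real scheduler also considers rack locality, fairness,
// and available container capacity.
public class LocalitySketch {
    // Returns the node chosen for the task, or null if no node holds a replica
    // (a real scheduler would then fall back to a rack-local or remote node).
    static String placeTask(String block,
                            Map<String, List<String>> blockReplicas,
                            Map<String, Integer> nodeLoad) {
        String best = null;
        // Prefer the least-loaded node among those holding a replica of the block.
        for (String node : blockReplicas.getOrDefault(block, Collections.emptyList())) {
            if (best == null || nodeLoad.getOrDefault(node, 0) < nodeLoad.getOrDefault(best, 0)) {
                best = node;
            }
        }
        if (best != null) {
            nodeLoad.merge(best, 1, Integer::sum);  // account for the newly placed task
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, List<String>> replicas = new HashMap<>();
        replicas.put("block-1", Arrays.asList("nodeA", "nodeB"));
        replicas.put("block-2", Arrays.asList("nodeB", "nodeC"));
        Map<String, Integer> load = new HashMap<>();

        System.out.println(placeTask("block-1", replicas, load)); // prints nodeA (tie broken by order)
        System.out.println(placeTask("block-2", replicas, load)); // prints nodeB
    }
}
```

Moving the computation rather than the data is what makes this architecture scale: for large input files, shipping a small job to each node is far cheaper than shipping terabytes of blocks to a central server.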