How does a file get distributed in a Hadoop cluster?

The key benefits of the Hadoop Distributed File System (HDFS) over previous storage solutions are:

1. Highly scalable

2. Highly available

So how is this achieved? The diagram below shows the steps which take place when a client application requests that a file be written to HDFS.

[Diagram: HDFS Write]
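The write path the diagram depicts can be sketched as a toy simulation (this is illustrative Python, not the real Hadoop client API): the client streams each block through a pipeline of data nodes so every replica is written in one pass, and acknowledgements flow back once each node has stored the block.

```python
def write_block(block_id, payload, pipeline, storage):
    # The client sends the block to the first data node in the pipeline,
    # which forwards it to the next, and so on; here we model that by
    # storing the payload on each node in pipeline order.
    for datanode in pipeline:
        storage.setdefault(datanode, {})[block_id] = payload
    # Return the data nodes that acknowledged the write.
    return list(pipeline)

storage = {}
acks = write_block("blk_1", b"some bytes", ["dn1", "dn2", "dn3"], storage)
print(acks)             # ['dn1', 'dn2', 'dn3']
print(sorted(storage))  # ['dn1', 'dn2', 'dn3']
```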

In the diagram, the key HDFS components are:

1. The Name node, which keeps a directory recording which data blocks make up each file, which data node holds each block, and which rack that data node sits on.

2. The Data nodes, which report to the Name node which data blocks they hold, and which send a regular heartbeat to let the Name node know they're still alive.
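The two message types above can be sketched as a toy Name node (illustrative Python only, not Hadoop's actual implementation): block reports populate the block directory, and heartbeats record liveness.

```python
import time

class NameNode:
    def __init__(self):
        self.block_map = {}       # block_id -> set of data node ids holding it
        self.last_heartbeat = {}  # data node id -> timestamp of last heartbeat

    def receive_block_report(self, datanode_id, block_ids):
        # A data node tells the Name node which blocks it currently holds.
        for block_id in block_ids:
            self.block_map.setdefault(block_id, set()).add(datanode_id)

    def receive_heartbeat(self, datanode_id):
        # Heartbeats let the Name node know the data node is still alive.
        self.last_heartbeat[datanode_id] = time.time()

    def locate(self, block_id):
        # The directory lookup: which data nodes hold a given block?
        return sorted(self.block_map.get(block_id, set()))

nn = NameNode()
nn.receive_block_report("dn1", ["blk_1", "blk_2"])
nn.receive_block_report("dn2", ["blk_2"])
nn.receive_heartbeat("dn1")
print(nn.locate("blk_2"))  # ['dn1', 'dn2']
```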

The high scalability is achieved by:

1. Spreading the data blocks over multiple data nodes
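This block spreading can be sketched as follows (a toy model: the real Name node also considers rack topology and free space when placing blocks). A file is split into fixed-size blocks, 128 MB being the HDFS default block size, and the blocks are distributed over the data nodes.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default block size

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    # Sizes of the blocks the file occupies; the last block may be partial.
    return [min(block_size, file_size - offset)
            for offset in range(0, file_size, block_size)]

def assign_blocks(num_blocks, datanodes):
    # Spread blocks round-robin over the data nodes (a simplification of
    # Hadoop's placement policy).
    return {f"blk_{i}": datanodes[i % len(datanodes)]
            for i in range(num_blocks)}

blocks = split_into_blocks(300 * 1024 * 1024)    # a 300 MB file
print(len(blocks))                               # 3 blocks: 128 + 128 + 44 MB
print(assign_blocks(len(blocks), ["dn1", "dn2"]))
# {'blk_0': 'dn1', 'blk_1': 'dn2', 'blk_2': 'dn1'}
```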

The high availability is achieved by:

1. Maintaining a standby Name node which keeps a copy of the data block directory (the namespace) held on the active Name node.

2. Writing each data block to multiple data nodes: a block written to one data node is also replicated to a number of others, as set by the replication factor (3 by default).

3. Should a data node fail, it stops sending heartbeats to the active Name node. If the data blocks on the failed data node are then held by too few other data nodes, the Name node tells one of the data nodes holding a copy to replicate it to another data node.
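Steps 2 and 3 can be sketched as a toy re-replication check (illustrative Python, not Hadoop's actual scheduler): given the block directory and the set of data nodes still heart-beating, find blocks below the replication factor and pick a surviving holder to copy each one to a new node.

```python
REPLICATION_FACTOR = 3  # the HDFS default

def under_replicated(block_map, live_nodes, target=REPLICATION_FACTOR):
    # block_map: block_id -> set of data node ids holding a replica.
    # Returns (block_id, source node, destination node) replication tasks.
    work = []
    for block_id, holders in block_map.items():
        live_holders = holders & live_nodes
        if 0 < len(live_holders) < target:
            source = sorted(live_holders)[0]
            # Any live node that doesn't already hold the block will do.
            candidates = sorted(live_nodes - live_holders)
            if candidates:
                work.append((block_id, source, candidates[0]))
    return work

block_map = {"blk_1": {"dn1", "dn2", "dn3"}}
live = {"dn1", "dn2", "dn4"}  # dn3 has stopped heart-beating
print(under_replicated(block_map, live))
# [('blk_1', 'dn1', 'dn4')]
```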

If you want further information, it is available on the Hadoop HDFS Architecture page.

