The Hadoop Distributed File System (HDFS)

Data is the new fuel. It is collected at large scale for analysis: deriving trending topics, security analysis (behaviour analysis), pattern recognition, machine learning, and more. This creates a need to store these huge data sets reliably.

HDFS was developed at Yahoo! and later open sourced. Let's start with an HDFS overview:
HDFS is the file system component of Hadoop, and its interface is patterned after the UNIX file system. It is designed for two main purposes:

  • to store large data sets reliably, even as they grow with demand;
  • to stream that data to user applications at high I/O bandwidth.

An important characteristic of HDFS is that it partitions both data and computation across many machines and executes computation in parallel, close to the data. HDFS stores file system metadata and application data separately.

For Reliability:
HDFS replicates file content on different nodes. By default each block is replicated on 3 nodes, but this can be changed per file, since HDFS exposes the replication factor through its API.
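For example, with the Java FileSystem API the replication factor can be set cluster-wide or changed for an existing file (a minimal sketch; the path here is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Cluster-wide default replication factor (normally set in hdfs-site.xml).
            conf.set("dfs.replication", "3");

            FileSystem fs = FileSystem.get(conf);

            // Change the replication factor of one existing file to 5.
            // /user/demo/data.txt is a placeholder path.
            fs.setReplication(new Path("/user/demo/data.txt"), (short) 5);
            fs.close();
        }
    }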

For Namespace:
The HDFS namespace is a hierarchy of files and directories. This entire namespace is kept in RAM. I will cover more of it in the sections below.

Architecture:

I) NameNode:
The NameNode stores the file system metadata, including the namespace and inode information. As in UNIX-like systems, an inode records file attributes such as permissions, modification and access times, and disk space quotas. Application data is split into large blocks (128 MB by default), and each block is replicated on multiple DataNodes.

a. Data Read: An HDFS client contacts the NameNode to get the locations of all the blocks of a file, then reads each block's content from the nearest DataNode holding a replica. If the read from the first replica fails, the client falls back to the second replica, and so on.
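A client can inspect where the blocks of a file live through the same public API (a small sketch; the file path is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/demo/data.txt"));

            // One BlockLocation per block; each lists the DataNodes hosting a replica.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println(block); // offset, length, and replica hosts
            }
            fs.close();
        }
    }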

b. Data Write: An HDFS client asks the NameNode to nominate 3 DataNodes (one per replica) to host the first block, then writes to them in pipeline fashion. When the block is full and more data remains, the client asks the NameNode for a new block and a fresh pipeline of DataNodes, and the write continues.
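From the client's point of view the pipeline is transparent: you write to an output stream and HDFS handles block allocation and replication behind the scenes (a minimal sketch with a placeholder path):

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // create() asks the NameNode for DataNodes and sets up the write pipeline.
            try (FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"))) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            } // close() flushes the last packet and completes the file at the NameNode.
            fs.close();
        }
    }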

c. Namespace: HDFS keeps the entire namespace in RAM so that metadata reads and writes are fast.

d. Image: The inode data and the list of blocks belonging to each file comprise the metadata, which is called the image.

e. Checkpoint: For backup and reliability, a persistent record of the image, called a checkpoint, is stored in the local host's native file system.

f. Journal: A write-ahead log that records the changes being made to the image. It is also stored in the local host's native file system.

Furthermore, for durability, redundant copies of the checkpoint and journal can be kept in multiple storage directories or on remote servers.
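In practice this redundancy is configured by listing several storage directories. This is normally done in hdfs-site.xml; the sketch below sets it through the Java Configuration API instead, and the local and NFS paths are hypothetical examples:

    import org.apache.hadoop.conf.Configuration;

    public class NameNodeStorageConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Comma-separated list: the NameNode keeps a copy of its image
            // (and, with default settings, the journal) in every directory
            // listed, so losing one copy is not fatal.
            // The paths below are hypothetical examples.
            conf.set("dfs.namenode.name.dir",
                     "/data/1/dfs/nn,/data/2/dfs/nn,/mnt/nfs/dfs/nn");
            System.out.println(conf.get("dfs.namenode.name.dir"));
        }
    }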

II) DataNode:
The DataNode stores the application data. Each block replica is represented by two files in the local file system: the first contains the data itself, the second holds the block's metadata (checksums for the data and the block's generation stamp). When a new DataNode is added, or on startup, a handshake between the NameNode and the DataNode is performed.
The handshake basically checks two things:

  • the namespace ID of the DataNode, and
  • the software version of the DataNode.

If either of these does not match, the DataNode shuts down and is disconnected from the cluster.
If the handshake succeeds, the DataNode registers with the NameNode and receives a storage ID, which remains constant from then on.
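Conceptually the check looks like the following (a hypothetical sketch for illustration, not Hadoop's actual internal code; the types and field names are invented):

    // Hypothetical model of the startup handshake (not Hadoop's real classes).
    final class NamespaceInfo {
        final String namespaceId;     // identity of this HDFS instance
        final String softwareVersion;
        NamespaceInfo(String id, String version) {
            this.namespaceId = id;
            this.softwareVersion = version;
        }
    }

    final class DataNodeStartup {
        /** Returns true if this DataNode may join the cluster. */
        static boolean handshake(NamespaceInfo nameNode, NamespaceInfo local) {
            // Both the namespace ID and the software version must match;
            // otherwise the DataNode shuts itself down.
            return nameNode.namespaceId.equals(local.namespaceId)
                && nameNode.softwareVersion.equals(local.softwareVersion);
        }
    }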

Each DataNode sends a heartbeat to the NameNode every 3 seconds. The NameNode never contacts DataNodes directly; it piggybacks its instructions on the replies to these heartbeats, for example: replicate a block to other nodes, remove local block replicas, re-register, or send an immediate block report.

Each hour the DataNode also sends a block report to the NameNode, listing the block ID, generation stamp, and length of every block replica it hosts.
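A block report is essentially a list of per-replica records like the following (a hypothetical sketch for illustration; the real wire format is more compact):

    import java.util.List;

    // Hypothetical shape of a block report entry (not Hadoop's real classes).
    final class BlockReportEntry {
        final long blockId;
        final long generationStamp; // version of the replica
        final long length;          // bytes actually stored for this replica
        BlockReportEntry(long blockId, long generationStamp, long length) {
            this.blockId = blockId;
            this.generationStamp = generationStamp;
            this.length = length;
        }
    }

    interface BlockReport {
        // Sent hourly: the full list of replicas this DataNode hosts.
        List<BlockReportEntry> entries();
    }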

<< Image of HDFS Client Interactions >>

III) HDFS Client:
It is a code library that exposes the HDFS file system interface to user applications. Through the client, HDFS supports operations to read, write, and delete files.
As explained in the NameNode section (points a and b), HDFS reads and writes are performed through this client.
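Putting points a and b together, a client program that reads a file back and then deletes it could look like this (a minimal sketch; the path is a placeholder continuing the earlier examples):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadDeleteExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path path = new Path("/user/demo/output.txt");

            // open() fetches block locations from the NameNode and streams
            // each block from the nearest DataNode holding a replica.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }

            // delete(path, recursive): recursive=false since this is a single file.
            fs.delete(path, false);
            fs.close();
        }
    }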

IV) BackUp Node:
It maintains an in-memory, up-to-date image of the namespace. The NameNode streams its journal to the BackupNode, which applies each change to its own image; the BackupNode can therefore create checkpoints without downloading the image from the NameNode.

So this covers the basic HDFS architecture and should give you a good overview of the terms used when discussing HDFS. To read about HDFS in more detail, refer to the links below:

  • K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop Distributed File System," IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010.
  • The Hadoop Wiki, which contains the latest information on Hadoop.

Understanding Big Data Systems

There is a lot of talk about Big Data Systems, and many new system architectures with different functionalities have come up. I am sharing my knowledge of the layered architecture of some Big Data Systems. I plan to cover these topics in depth:
1. HDFS (Hadoop Distributed File System),
2. YARN (Yet Another Resource Negotiator),
3. Tachyon,
4. Apache Spark,
5. Spark Streaming and Spark SQL,
6. Storm/Heron,
7. GraphX,
8. BlinkDB.

I will cover each topic with a system overview, its history and architecture, where and how it is used in practice, and some resourceful links for further information.