The Main Components of Hadoop
Hadoop mainly has 4 components:
Hadoop Common
Hadoop Distributed File System (HDFS)
Hadoop YARN
Hadoop MapReduce
Hadoop Common refers to the common utilities and packages that support the other Hadoop modules.
It is considered the base/core of the framework, as it provides essential services and basic processes such as abstraction of the underlying operating system and its file system.
It also contains the necessary Java Archive (JAR) files and scripts required to start Hadoop.
It provides source code and documentation, as well as a contribution section that includes different projects from the Hadoop Community.
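As a hedged illustration of that abstraction, the sketch below uses Hadoop Common's Configuration and FileSystem classes. The fs.defaultFS address and the file path are placeholders; the same code would work against a local file system with a file:/// URI.

```java
// Sketch: Hadoop Common's FileSystem abstraction hides the underlying storage.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsAbstractionDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // loads core-site.xml etc.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");    // placeholder address
        FileSystem fs = FileSystem.get(conf);                // HDFS or local FS, per config

        Path p = new Path("/tmp/hello.txt");                 // placeholder path
        try (FSDataOutputStream out = fs.create(p, true)) {  // overwrite if present
            out.writeUTF("Hello, Hadoop Common!");
        }
        System.out.println("Wrote " + p + " via " + fs.getUri());
    }
}
```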
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity, low-cost hardware while remaining highly fault-tolerant. Everything in Hadoop happens via HDFS, which also means that everything has to be written to disk first. Note that every object in Hadoop is serializable, i.e. it knows how to read itself from and write itself to disk. The figure above shows the architecture of HDFS, which follows a master-slave design.
The master node is called the NameNode and the slave nodes are called DataNodes. Every DataNode has its own memory. The DataNodes are placed on racks. Nodes within a rack, or sometimes even entire racks, are connected via fiber-optic cables to enable fast inter-node and inter-rack communication.
The NameNode periodically receives a heartbeat and a blockreport from each of the DataNodes in the cluster. Receipt of a heartbeat implies that the DataNode is functioning properly. A blockreport contains a list of all blocks (chunks of data) present on a DataNode.
The NameNode tracks data at the cluster level (which blocks make up each file and where they are stored), while each DataNode only tracks the blocks stored on its own machine (the server level).
Hadoop is rack-aware: it knows the locations of all the nodes and the costs involved in executing jobs across multiple nodes. If a node or an entire rack fails, Hadoop reschedules the affected tasks on nodes in other racks that hold replicas of the data, ensuring fault tolerance.
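A small, hedged sketch of how a client can see this block and rack information through the NameNode. The file path is a placeholder, and the topology strings (such as /rack1/node3) depend on how rack awareness is configured on the cluster.

```java
// Sketch: asking the NameNode where a file's blocks live (hosts and rack topology).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/big-file.csv")); // placeholder

        // One BlockLocation per block; each lists the DataNodes holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d len=%d hosts=%s racks=%s%n",
                    b.getOffset(), b.getLength(),
                    String.join(",", b.getHosts()),
                    String.join(",", b.getTopologyPaths())); // e.g. /rack1/node3
        }
    }
}
```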
Large files are partitioned into chunks/blocks of 64 MB each by default (128 MB in Hadoop 2.x and later). This block size is referred to as the granularity of HDFS. These blocks are stored on the DataNodes.
Each block is stored in multiple copies on different DataNodes; the number of copies is the replication factor, which is 3 by default. Replicas are spread across racks, preferably with at least one copy on a node in a different rack, so that the data survives both node and rack failures.
Both granularity and replication factor are configurable.
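As a hedged illustration, the sketch below sets both values through the standard HDFS configuration keys dfs.blocksize and dfs.replication; the specific values and the file path are just examples.

```java
// Sketch: block size and replication are cluster defaults that a client can override.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockAndReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks for new files
        conf.setInt("dfs.replication", 2);                 // 2 replicas instead of 3

        FileSystem fs = FileSystem.get(conf);
        // Replication can also be changed for an existing file:
        fs.setReplication(new Path("/data/big-file.csv"), (short) 2); // placeholder path
    }
}
```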
It is important to note that Hadoop will ship the code to where the relevant data is. This means we don't have to move large chunks of data from one node to another, thereby saving network costs and improving latency.
The divide-and-conquer strategy can reduce the runtime to roughly 1/N of the single-machine time, and with multi-threading on each node it can approach 1/C (where N is the number of nodes and C is the number of chunks processed in parallel).
Note that, in practice, the number of nodes in an HDFS cluster is limited mainly by network bandwidth.
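A rough back-of-the-envelope illustration of the 1/N idea; all numbers below are illustrative assumptions, not measurements.

```java
// Sketch: how block count and node count bound the available parallelism.
public class ParallelismEstimate {
    public static void main(String[] args) {
        long inputBytes = 1L * 1024 * 1024 * 1024 * 1024; // 1 TB of input (assumed)
        long blockBytes = 64L * 1024 * 1024;              // 64 MB HDFS blocks
        int nodes = 100;                                  // cluster size N (assumed)

        long blocks = inputBytes / blockBytes;            // ~16,384 chunks (C)
        long rounds = (blocks + nodes - 1) / nodes;       // rounds if each node processes one chunk at a time

        System.out.printf("chunks=%d, nodes=%d -> ~%d rounds; ideal speedup ~%dx%n",
                blocks, nodes, rounds, nodes);
    }
}
```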
YARN (Yet Another Resource Negotiator) is the resource scheduler for Hadoop. It is a framework for job scheduling and cluster resource management.
The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons: a global resource-management component and per-application components.
Note that YARN uses a master-slave architecture.
It consists of the following components:
A global ResourceManager: the ultimate authority that arbitrates resources among all the applications in the system. It contains a scheduler, which allocates resources to the running applications according to constraints such as queue capacities and user limits, based purely on the applications' resource requirements.
A per-node NodeManager: the per-machine slave that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager.
A per-application ApplicationMaster: tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
Per-application containers, launched and supervised by NodeManagers, in which the application's tasks actually run (a minimal client-side sketch follows this list).
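To make the master-slave split concrete, here is a minimal, hedged client-side sketch that asks the ResourceManager for reports on the NodeManagers it tracks. It assumes a reachable YARN cluster and a reasonably recent Hadoop release (getMemorySize is available from Hadoop 2.8 onward).

```java
// Sketch: a client querying the ResourceManager for the NodeManagers it manages.
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodesDemo {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();
        try {
            // Each NodeReport describes one NodeManager and the resources it offers.
            List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
            for (NodeReport n : nodes) {
                System.out.printf("%s: %d MB, %d vcores%n",
                        n.getNodeId(),
                        n.getCapability().getMemorySize(),
                        n.getCapability().getVirtualCores());
            }
        } finally {
            yarn.stop();
        }
    }
}
```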
The terms Hadoop and MapReduce are sometimes used interchangeably, depending on the context.
MapReduce is a YARN-based system for parallel processing of large data sets. It is a method for distributing a task across multiple nodes. Google's paper on MapReduce describes it as a programming model for processing and generating large data sets. According to the paper, users must specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key, thereby reducing the number of key/value pairs.
In layman's terms, the map job takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as its input and combines those data tuples into a smaller set of tuples.
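The canonical illustration of this model is word counting: the map function emits a (word, 1) pair for every word it sees, and the reduce function sums the counts for each word. Below is a minimal sketch following the standard Hadoop MapReduce API; class and field names are our own.

```java
// WordCount sketch: map emits (word, 1), reduce sums the counts per word.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: one input line -> a (word, 1) pair for every token on the line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: all intermediate values for one word arrive together; sum them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```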
MapReduce performs the following (a job-submission sketch follows this list):
Handles scheduling: assigns workers to map and reduce tasks
Handles data-distribution: moves processes to data
Handles synchronization: gathers, sorts and shuffles intermediate data
Handles errors and faults: detects worker failures and restarts them
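The job-submission side referred to above: a hedged driver sketch that only describes the job. The input and output paths are placeholders, and the mapper/reducer classes are the WordCount sketch from earlier. Everything in the list, from scheduling to the shuffle and sort, happens inside the framework once the job is submitted.

```java
// Sketch: the driver describes the job; scheduling, data movement and the
// shuffle/sort between map and reduce are handled by the framework after submit.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/books"));      // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/out/wordcount")); // placeholder

        // waitForCompletion submits the job and blocks until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```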
Note: The Hadoop Streaming API allows us to write Hadoop programs (for the map and reduce functionalities) in any language, as long as the language can read from standard input and write to standard output.
The primary advantages of abstracting our jobs as MapReduce running over distributed infrastructure (CPU, storage and network) are:
Automatic parallelization and distribution of data in blocks across a distributed, scale-out infrastructure
Fault-tolerance against failure of storage, compute and network infrastructure
Deployment, monitoring and security capabilities
A clean abstraction for programmers (MapReduce abstracts all low-level functionalities and allows programmers to focus on writing code just for the Map and Reduce functionalities)
MapReduce jobs in classic MapReduce (MRv1) are controlled by a software daemon known as the JobTracker. The JobTracker resides on a 'master node', and clients submit MapReduce jobs to it. (Under YARN, the ResourceManager and a per-job ApplicationMaster take over these responsibilities.)
The JobTracker assigns Map and Reduce tasks to other nodes on the cluster. These nodes each run a software daemon known as the TaskTracker. The TaskTracker is responsible for actually instantiating the Map or Reduce task and for reporting progress back to the JobTracker.
(A job is a complete program: the full execution of Mappers and Reducers over a dataset. A task is the execution of a single Mapper or Reducer over a slice of the data.)
There will be at least as many task attempts as there are tasks. If a task attempt fails, another will be started by the JobTracker. Speculative execution can also result in more task attempts than completed tasks.
The next section lists some commonly used Hadoop-related projects.