# The Main Components of Hadoop

Hadoop has four main components:

1. Hadoop Common
2. Hadoop Distributed File System (HDFS)
3. Hadoop YARN
4. Hadoop MapReduce

## 1. Hadoop Common

* Hadoop Common refers to the common utilities and packages that support the other Hadoop modules.
* It is considered the base/core of the framework, as it provides essential services and basic processes such as abstracting the underlying operating system and its file system.
* It also contains the necessary Java Archive (JAR) files and scripts required to start Hadoop.
* It provides source code and documentation, as well as a contribution section that includes different projects from the Hadoop Community.

## 2. Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity, low-cost hardware while remaining highly fault-tolerant. **Everything in Hadoop happens via HDFS. This also means that everything has to be written to disk first.** Note that every object in Hadoop is **serializable**, meaning it knows how to read itself from and write itself to disk.

![](/files/-M5-0TtPH_08xmU1NVCk)

The figure above represents the architecture of HDFS. It follows a **master-slave architecture**.

The master node is called the **NameNode** and the slave nodes are called **DataNodes**. *Every DataNode has its own memory*. The DataNodes are placed on **racks**. Nodes within a rack, or sometimes even entire racks, are connected via fiber-optic cables to enable fast inter-node and inter-rack communication.

The NameNode periodically receives a **heartbeat** and a **blockreport** from each of the DataNodes in the cluster. Receipt of a heartbeat implies that the DataNode is functioning properly. A blockreport contains a list of all blocks (chunks of data) present on a DataNode.
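As a rough illustration of this bookkeeping, here is a toy sketch in Python. The class name, timeout value, and data structures are all invented for illustration; the real NameNode's internals are far more involved:

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds; an illustrative value, not HDFS's real default

class ToyNameNode:
    """Toy model of NameNode heartbeat/blockreport tracking."""

    def __init__(self):
        self.last_heartbeat = {}  # DataNode id -> timestamp of last heartbeat
        self.block_map = {}       # DataNode id -> list of block ids it reported

    def receive_heartbeat(self, datanode_id, now=None):
        """A heartbeat implies the DataNode is functioning properly."""
        self.last_heartbeat[datanode_id] = now if now is not None else time.time()

    def receive_blockreport(self, datanode_id, block_ids):
        """A blockreport lists all blocks currently stored on the DataNode."""
        self.block_map[datanode_id] = list(block_ids)

    def dead_datanodes(self, now=None):
        """DataNodes whose heartbeats have stopped are considered dead."""
        now = now if now is not None else time.time()
        return [dn for dn, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT]
```

When a DataNode is declared dead, the real NameNode re-replicates the blocks listed in that node's last blockreport onto other nodes.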

*The NameNode tracks data in HDFS (i.e. on the cluster-level), but DataNodes track data on a one-machine system (i.e. on the server-level).*

**Hadoop is rack-aware**. This means that it knows the locations of all the nodes and it knows the costs involved in executing jobs across multiple nodes. If a rack of nodes crashes, Hadoop will reschedule the affected jobs on nodes in another rack (where replicas of the data live) to ensure fault tolerance.

Large files are partitioned into chunks/fragments/blocks of 64 MB each (the default in early Hadoop versions; Hadoop 2+ defaults to 128 MB). This block size is called the **granularity** of HDFS. These chunks/blocks are stored on the DataNodes.
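The partitioning itself is simple in principle; below is a minimal sketch of fixed-size chunking. HDFS does all of this internally, and the last block of a file is simply smaller rather than padded:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the classic HDFS default block size

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a byte string into fixed-size blocks.

    The final block may be smaller than block_size; like HDFS,
    we do not pad it out to the full block size.
    """
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]
```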

Every block of data is replicated across different DataNodes. With the default **replication factor** of **3**, one replica is typically placed on a node in the writer's rack and the other two on nodes of a different rack. This ensures fault tolerance.
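A toy sketch of this default placement policy follows; the function name and the rack-map structure are invented for illustration, and the real HDFS placement logic also accounts for node load, available space, and topology:

```python
import random

def place_replicas(racks, writer_rack, rng=random):
    """Sketch of HDFS's default 3-replica placement policy.

    racks: dict mapping rack id -> list of node ids.
    Replica 1 goes on the writer's rack; replicas 2 and 3 go on
    two different nodes of one other rack.
    """
    first = rng.choice(racks[writer_rack])
    other_rack = rng.choice([r for r in racks if r != writer_rack])
    second, third = rng.sample(racks[other_rack], 2)
    return [(writer_rack, first), (other_rack, second), (other_rack, third)]
```

Placing two replicas on one remote rack keeps cross-rack write traffic down while still surviving the loss of an entire rack.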

Both *granularity* and *replication factor* are configurable.
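For example, both settings can be configured in `hdfs-site.xml` (the property names below are the Hadoop 2.x ones; the values shown are illustrative):

```xml
<!-- hdfs-site.xml: example values, adjust for your cluster -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```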

It is important to note that Hadoop will **ship the code to where the relevant data is**. This means we don't have to move large chunks of data from one node to another, thereby saving network costs and improving latency.

The Divide and Conquer strategy can reduce the runtime to roughly 1/N of the sequential time, and, if multi-threading is used on each node, to roughly 1/C (where N is the number of nodes and C is the number of chunks being processed in parallel).

Note that the number of nodes in HDFS is only limited by the **network bandwidth**.

## 3. Hadoop YARN

YARN (Yet Another Resource Negotiator) is Hadoop's **resource scheduler**. It is a framework for **job scheduling** and cluster **resource management** in Hadoop.

The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling into separate daemons: a single global resource-management component and per-application ones.

Note that YARN, like HDFS, uses a **master-slave** architecture: the ResourceManager is the master and the NodeManagers are the slaves.

It consists of the following components:

![](/files/-M5-0TtSnkPCoj5SSjz6)

* A global **ResourceManager**: It is the ultimate authority that arbitrates resources among all the applications in the system. It has a **scheduler**, which is responsible for allocating resources to the various running applications according to constraints such as queue capacities and user limits. The scheduler performs its scheduling function based on the resource requirements of the applications
* A per-node slave **NodeManager**: It is the per-machine slave that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the ResourceManager
* A per-application **ApplicationMaster**: It is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks
* Per-application **containers**, each running on a NodeManager: a container is the bundle of resources (e.g. CPU and memory) in which an individual task executes
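The negotiation between these components can be modeled with a toy sketch. Everything here (class name, memory-only capacity model, first-fit allocation) is invented for illustration; real YARN schedulers such as the Capacity and Fair schedulers are far more sophisticated:

```python
class ToyResourceManager:
    """Toy model of a ResourceManager granting containers on NodeManagers."""

    def __init__(self, node_capacity):
        # NodeManager id -> free memory in MB (reported via node heartbeats)
        self.free = dict(node_capacity)

    def allocate(self, mem_mb):
        """Handle an ApplicationMaster's request for one container.

        First-fit: grant a container on the first NodeManager with
        enough free memory, or return None if no node can satisfy it.
        """
        for node, free in self.free.items():
            if free >= mem_mb:
                self.free[node] -= mem_mb
                return node
        return None  # in real YARN the request would stay queued
```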

## 4. Hadoop MapReduce

The terms *Hadoop* and *MapReduce* are sometimes used interchangeably, depending on the context.

MapReduce is a YARN-based system for parallel processing of large data sets. It is a method for distributing a task across multiple nodes. Google's [paper](https://research.google.com/archive/mapreduce.html) on MapReduce describes it as a **programming model** for processing and generating large data sets. According to the paper, users must specify a **map** function that processes a key/value pair to generate a set of intermediate key/value pairs, and a **reduce** function that merges all intermediate values associated with the same intermediate key, thereby reducing the number of key/value pairs.

In layman's terms, the **map** job takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The **reduce** job takes the output from a map as its input and combines those data tuples into a smaller set of tuples.
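The classic word-count example can be sketched in a few lines of single-process Python. Real MapReduce distributes the same three phases (map, shuffle/sort, reduce) across the cluster; this sketch only illustrates the programming model:

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    """Reduce: merge all intermediate values for one key by summing them."""
    return (key, sum(values))

def mapreduce(lines):
    """Tiny in-memory word count following the MapReduce model."""
    intermediate = defaultdict(list)
    for line in lines:                      # map phase
        for key, value in map_fn(line):
            intermediate[key].append(value)
    # the shuffle/sort phase is implicit in the grouping above
    return dict(reduce_fn(k, v) for k, v in intermediate.items())
```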

MapReduce performs the following:

* Handles **scheduling**: assigns workers to map and reduce tasks
* Handles **data-distribution**: moves processes to data
* Handles **synchronization**: gathers, sorts and shuffles intermediate data
* Handles **errors and faults**: detects worker failures and restarts them

***Note**: the Hadoop **Streaming** API allows us to write Hadoop programs (for the map and reduce functionalities) in any language, as long as that language can read from standard input and write to standard output*
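For example, a word-count mapper and reducer for Hadoop Streaming might look like this in Python. The file name and the invocation in the comments are illustrative; what Streaming guarantees is that the mapper's output is sorted by key before it reaches the reducer, which is what makes the `groupby` approach work:

```python
import sys
from itertools import groupby

def mapper(lines):
    """Streaming mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Streaming reducer: input arrives sorted by key, so lines with
    equal keys are adjacent and can be summed with groupby."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Illustrative usage: pass 'map' or 'reduce' to select the stage, e.g.
    # as the -mapper / -reducer commands of a hadoop-streaming job.
    stage = sys.argv[1] if len(sys.argv) > 1 else "map"
    step = mapper if stage == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```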

The primary advantages of abstracting our jobs as MapReduce running over distributed compute and storage infrastructure are:

* **Automatic parallelization** and distribution of data in blocks across a distributed, scale-out infrastructure
* **Fault-tolerance** against failure of storage, compute and network infrastructure
* Deployment, monitoring and security capabilities
* A clean **abstraction** for programmers (MapReduce abstracts all low-level functionalities and allows programmers to focus on writing code just for the *Map* and *Reduce* functionalities)

### HDFS/MapReduce High Level Architecture

![](/files/-M5-0TtW1ePksebUPYsp)

* In classic MapReduce (MRv1), jobs are controlled by a software daemon known as the **JobTracker**. The JobTracker resides on a 'master node', and clients submit MapReduce jobs to it. (Under YARN, this role is split between the ResourceManager and the per-application ApplicationMaster)
* The JobTracker assigns Map and Reduce tasks to other nodes on the cluster. These nodes each run a software daemon known as the **TaskTracker**. The TaskTracker is responsible for actually instantiating the Map or Reduce task, and for reporting progress back to the JobTracker

(A **job** is a complete program: the full execution of Mappers and Reducers over a dataset. A **task** is the execution of a single Mapper or Reducer over a slice of data)

There will be at least as many task attempts as there are tasks. If a task attempt fails, another will be started by the JobTracker. Speculative execution can also result in more task attempts than completed tasks.

**The next section lists some commonly used Hadoop-related projects.**

