# The Google File System vs HDFS

Google's [paper](https://research.google.com/archive/gfs.html) on their file system describes GFS as a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

GFS formed the basis for the development of HDFS, which is more or less an open-source implementation of GFS.

## GFS Assumptions

* **Scale OUT, not UP**: prefer using commodity hardware to exotic hardware
* **High component failure rate**: it is assumed that components of commodity hardware fail all the time
* **Modest number of huge files**: it is assumed that files will be multiple-gigabytes in size
* **Files are write-once, mostly appended to**
* **Large streaming reads over random access**: sequential reading is preferred to random access since the high sustained throughput provided by sequential reading is preferred to the low latency provided by random access

## GFS Design Decisions

* **Files stored as chunks:** Fixed size (64MB)
* **Reliability through replication:** Each chunk is replicated across 3+ "chunkservers"
* **Single master to coordinate access, keep metadata**: Simple centralized management
* **No data caching**: Little benefit due to large datasets, streaming reads
* **Simplify the API:** Push some of the issues onto the client (e.g., data layout)

## From GFS to HDFS

As mentioned earlier, HDFS is based on GFS. However, there are a few key differences between the two:

### Terminology differences

* GFS Master = Hadoop NameNode
* GFS chunkservers = Hadoop DataNodes

### Functional differences

* No file appends in HDFS
* HDFS performance is (likely) slower

**The next section introduces the Data Types in Hadoop.**


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://vikram-bajaj.gitbook.io/cs-gy-9223-d-programming-for-big-data/hadoop/the-google-file-system.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
