The Google File System vs HDFS
Google's paper on their file system describes GFS as a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
GFS formed the basis for HDFS, which is essentially an open-source implementation of GFS's design.
GFS Assumptions
Scale OUT, not UP: prefer using commodity hardware to exotic hardware
High component failure rate: it is assumed that components of commodity hardware fail all the time
Modest number of huge files: it is assumed that files will be multi-gigabyte in size
Files are write-once, mostly appended to
Large streaming reads over random access: the high sustained throughput of sequential reads matters more than the low latency of random access, so workloads are expected to read files front to back
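To make the streaming-read assumption concrete, the sketch below reads a file sequentially in large buffers through Hadoop's FileSystem API (the HDFS counterpart of a GFS client). The path and buffer size are illustrative assumptions, not values from the paper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamingRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical multi-gigabyte input file.
        Path path = new Path("/data/logs/events.log");

        // Read front-to-back in large buffers: high sustained throughput, no seeks.
        // Random access (in.seek(offset)) is possible, but it is exactly the
        // pattern GFS/HDFS are not optimized for.
        byte[] buffer = new byte[8 * 1024 * 1024];
        long total = 0;
        try (FSDataInputStream in = fs.open(path)) {
            int n;
            while ((n = in.read(buffer)) > 0) {
                total += n;   // process the bytes here
            }
        }
        System.out.println("Read " + total + " bytes sequentially");
    }
}
```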
GFS Design Decisions
Files stored as chunks: Fixed size (64 MB)
Reliability through replication: Each chunk is replicated across 3+ "chunkservers" (see the configuration sketch after this list)
Single master to coordinate access, keep metadata: Simple centralized management
No data caching: Little benefit due to large datasets, streaming reads
Simplify the API: Push some of the issues onto the client (e.g., data layout)
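As a rough HDFS counterpart of these decisions, the sketch below creates a file with an explicit block size and replication factor through the Hadoop FileSystem API. The 64 MB block size (mirroring GFS chunks) and the output path are illustrative assumptions; modern HDFS defaults to larger blocks.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateChunkedFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/data/output/part-00000");  // hypothetical path

        int bufferSize = 4096;               // client-side I/O buffer
        short replication = 3;               // each block stored on 3 DataNodes
        long blockSize = 64L * 1024 * 1024;  // 64 MB, mirroring a GFS chunk

        // The master (NameNode) records the file's metadata; the data itself
        // is streamed to DataNodes in fixed-size blocks.
        try (FSDataOutputStream out =
                 fs.create(path, true, bufferSize, replication, blockSize)) {
            out.writeUTF("hello, distributed file system");
        }
    }
}
```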
From GFS to HDFS
As mentioned earlier, HDFS is based on GFS. However, there are a few key differences between the two:
Terminology differences
GFS Master = Hadoop NameNode
GFS chunkservers = Hadoop DataNodes
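One way to see this split in practice is to ask the NameNode for a file's block locations, which come back as the DataNode hosts holding each replica. A minimal sketch, assuming an existing file at an illustrative path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/data/logs/events.log");  // hypothetical file

        // Metadata query, answered by the NameNode (GFS master).
        FileStatus status = fs.getFileStatus(path);
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());

        // Each block lists the DataNodes (GFS chunkservers) holding a replica.
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                + " length=" + block.getLength()
                + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```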
Functional differences
No file appends in HDFS (true of early HDFS releases; later Hadoop versions added append support)
HDFS performance is (likely) slower
The next section introduces the Data Types in Hadoop.