The Google File System vs HDFS
Google's paper on their file system describes GFS as a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
GFS formed the basis for the development of HDFS, which is more or less an open-source implementation of GFS.
Scale OUT, not UP: prefer using commodity hardware to exotic hardware
High component failure rate: it is assumed that components of commodity hardware fail all the time
Modest number of huge files: it is assumed that files will be multiple gigabytes in size
Files are write-once, mostly appended to
Large streaming reads over random access: sequential reads are favored because these workloads value high sustained throughput more than the low latency that random access provides
Files stored as chunks: fixed size (64 MB); a sketch of the chunk-to-replica mapping follows this list
Reliability through replication: Each chunk is replicated across 3+ "chunkservers"
Single master to coordinate access, keep metadata: Simple centralized management
No data caching: Little benefit due to large datasets, streaming reads
Simplify the API: Push some of the issues onto the client (e.g., data layout)
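To make the chunking and replication model concrete, here is a minimal sketch in Java of how a client might translate a byte offset in a file into a chunk index, and how the master's in-memory metadata maps chunks to chunkserver replicas. All class, file, and server names here are hypothetical; GFS itself is written in C++ and its actual interfaces were never published.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names are hypothetical, not real GFS interfaces.
public class ChunkIndexSketch {
    // GFS uses a fixed chunk size of 64 MB.
    static final long CHUNK_SIZE = 64L * 1024 * 1024;

    // The master keeps all metadata in memory: for each file, an ordered
    // list of chunk handles, and for each handle the chunkservers that
    // currently hold a replica (3 by default).
    static final Map<String, List<Long>> fileToChunkHandles = Map.of(
            "/logs/web-00001", List.of(101L, 102L, 103L));
    static final Map<Long, List<String>> chunkReplicas = Map.of(
            101L, List.of("cs-a:7000", "cs-b:7000", "cs-c:7000"),
            102L, List.of("cs-b:7000", "cs-d:7000", "cs-e:7000"),
            103L, List.of("cs-a:7000", "cs-c:7000", "cs-e:7000"));

    // A client turns (file, byte offset) into a chunk index, asks the
    // master for that chunk's handle and replica locations, then reads
    // the bytes directly from a chunkserver (not through the master).
    static void locate(String file, long offset) {
        int chunkIndex = (int) (offset / CHUNK_SIZE);   // which chunk
        long chunkOffset = offset % CHUNK_SIZE;         // offset within it
        long handle = fileToChunkHandles.get(file).get(chunkIndex);
        List<String> replicas = chunkReplicas.get(handle);
        System.out.printf("offset %d -> chunk %d (handle %d) at %s, chunk-local offset %d%n",
                offset, chunkIndex, handle, replicas, chunkOffset);
    }

    public static void main(String[] args) {
        locate("/logs/web-00001", 0);                 // first chunk
        locate("/logs/web-00001", 70L * 1024 * 1024); // 70 MB falls in the second chunk
    }
}
```

Note how the master is only consulted for the small, fixed-size metadata lookup; the large data transfer bypasses it entirely, which is what keeps a single centralized master from becoming a bottleneck.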
As mentioned earlier, HDFS is based on GFS. However, there are a few key differences between the two (a short HDFS client sketch follows the list):
GFS Master = Hadoop NameNode
GFS chunkservers = Hadoop DataNodes
No file appends in HDFS (true of early HDFS; append support was added in later releases)
HDFS performance is (likely) slower
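Because the terminology maps so directly, HDFS code talks to the NameNode for metadata and streams data from DataNodes, much as a GFS client talks to the master and chunkservers. As a minimal sketch using the standard Hadoop Java client, a large sequential read looks like this (the file path and NameNode address are hypothetical):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally read from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        // FileSystem.get() contacts the NameNode (the GFS-master analogue)
        // for metadata; the actual bytes stream from DataNodes.
        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/logs/web-00001"));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // large streaming scan: HDFS's sweet spot
            }
        }
    }
}
```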
The next section introduces the Data Types in Hadoop.