CS-GY 9223-D: Programming for Big Data

The Google File System vs HDFS


Last updated 4 years ago


Google's paper on their file system describes GFS as a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

GFS formed the basis for the development of HDFS, which is essentially an open-source implementation of the GFS design.

GFS Assumptions

  • Scale OUT, not UP: prefer using commodity hardware to exotic hardware

  • High component failure rate: it is assumed that components of commodity hardware fail all the time

  • Modest number of huge files: it is assumed that files will be multiple gigabytes in size

  • Files are write-once, mostly appended to: existing data is rarely overwritten, and new data is added at the end of a file

  • Large streaming reads over random access: workloads read large amounts of data sequentially, so high sustained throughput matters more than the low latency that random access favors
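The preference for streaming reads can be illustrated with some rough arithmetic. The sketch below is not taken from the GFS paper: the function name, the 10 ms seek time, and the 100 MB/s transfer rate are assumed, illustrative values for a single spinning disk, and the model simply charges one seek per read request.

```python
def read_time_seconds(total_bytes, request_bytes,
                      seek_s=0.010, transfer_bytes_per_s=100e6):
    """Estimate the time to read total_bytes when every request pays one seek.

    seek_s and transfer_bytes_per_s are assumed values for illustration,
    not measurements of any real disk.
    """
    requests = -(-total_bytes // request_bytes)  # ceiling division
    return requests * seek_s + total_bytes / transfer_bytes_per_s


if __name__ == "__main__":
    one_gib = 2**30
    # Large sequential requests (64 MiB each) vs. small random ones (4 KiB each).
    streaming = read_time_seconds(one_gib, 64 * 2**20)
    random_io = read_time_seconds(one_gib, 4 * 2**10)
    print(f"streaming: {streaming:.1f} s, random: {random_io:.1f} s")
```

Under these assumed numbers, reading 1 GiB in 64 MiB requests takes roughly 11 seconds, while reading the same data in 4 KiB random requests takes over 40 minutes, because seek time dominates.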

GFS Design Decisions

  • Files stored as chunks: Fixed size (64MB)

  • Reliability through replication: Each chunk is replicated across 3+ "chunkservers"

  • Single master to coordinate access, keep metadata: Simple centralized management

  • No data caching: Little benefit due to large datasets, streaming reads

  • Simplify the API: Push some of the issues onto the client (e.g., data layout)
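The first two design decisions can be sketched in a few lines. This is a simplified illustration, not GFS's actual code: the function names and the round-robin placement policy are assumptions made for the example, and a real master also weighs factors such as disk utilization and rack placement when choosing chunkservers.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size


def split_into_chunks(file_size):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    return [(offset, min(CHUNK_SIZE, file_size - offset))
            for offset in range(0, file_size, CHUNK_SIZE)]


def place_replicas(num_chunks, servers, replication=3):
    """Assign each chunk to `replication` distinct chunkservers.

    Round-robin placement is a hypothetical policy chosen for simplicity.
    """
    if replication > len(servers):
        raise ValueError("not enough chunkservers for the replication factor")
    return {
        chunk: [servers[(chunk + r) % len(servers)] for r in range(replication)]
        for chunk in range(num_chunks)
    }


if __name__ == "__main__":
    chunks = split_into_chunks(200 * 1024 * 1024)  # a 200 MB file
    servers = ["cs1", "cs2", "cs3", "cs4", "cs5"]  # made-up server names
    for chunk_id, replicas in place_replicas(len(chunks), servers).items():
        print(chunk_id, chunks[chunk_id], replicas)
```

A 200 MB file yields three full 64 MB chunks plus one 8 MB chunk, and each chunk lands on three different chunkservers, so losing any single server leaves two copies of every chunk.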

From GFS to HDFS

As mentioned earlier, HDFS is based on GFS. However, there are a few key differences between the two:

Terminology differences

  • GFS Master = Hadoop NameNode

  • GFS chunkservers = Hadoop DataNodes

Functional differences

  • No file appends in early versions of HDFS (append support was added in later Hadoop releases)

  • HDFS performance is (likely) slower

The next section introduces the Data Types in Hadoop.
