CS-GY 9223-D: Programming for Big Data
main
main
  • Introduction
  • Big Data
  • Hadoop
    • An Introduction to Hadoop
    • The Main Components of Hadoop
    • Some Hadoop Related Projects
    • A Typical Large Data Problem
    • The Google File System vs HDFS
    • Data Types in Hadoop
    • Programming in Hadoop
    • Common Examples of MapReduce Jobs
    • Advantages and Disadvantages of MapReduce
  • Pig
    • An Introduction to Pig
    • Components of Pig
    • An Example Data Analysis Task Using Pig
Powered by GitBook
On this page

Was this helpful?

  1. Pig

An Introduction to Pig

PreviousPigNextComponents of Pig

Last updated 4 years ago

Was this helpful?

Apache Pig is a high-level platform for creating programs that runs on Apache Hadoop. The language for this platform is called Pig Latin.

Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high-level. It is similar to SQL.

Pig runs as an execution engine on top of Hadoop. It enables us to write complex data transformations without knowing Java. It simply renders Pig Latin into Java MapReduce programs and runs them on the Hadoop cluster.

Pig removes the need for users to tune Hadoop to their needs, and it insulates users from changes in Hadoop's interfaces.

Some Noteworthy Characteristics of Pig

  • Schema on Read: Languages like SQL use schema on write. This means that the schema has to be defined before writing to the database. Due to this, only structured data is allowed. On the other hand, schema on read means that the schema is defined on the fly while reading the data. This allows Pig to work with both structured and unstructured data.

  • Lazy Evaluation: This means that no value or operation is evaluated until the the value or the transformed data is required. This reduces repeated calculations and allows for optimized performance.

The next section gives an overview of the components of Pig.