Components of Pig
There are three main components of Pig:
Pig Latin
Execution
Compiler
Pig Latin is Pig's SQL-like high-level data flow language.
A Pig Latin program can be viewed as a directed acyclic graph where each node represents an operation.
The following are the data types used in Pig Latin:
Simple types: int, long, float, double, chararray, bytearray, boolean, datetime, biginteger, bigdecimal
Complex types: tuple (an ordered set of fields), bag (a collection of tuples), map (a set of key-value pairs)
Pig Latin statements work with relations. A relation can be defined as follows:
A relation is a bag (more specifically, an outer bag)
A bag is a collection of tuples
A tuple is an ordered set of fields
A field is a piece of data
Quick Notes on Pig Latin Data Types:
A single tuple can hold multiple types of data
We can nest bags inside tuples, tuples inside bags, tuples inside tuples and bags inside bags
In a map, keys and values can be of any data type
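As a sketch of these rules, here is a schema declaration for a hypothetical relation (the file name and all field names are assumptions) that nests a tuple, a bag, and a map inside a single tuple:

```pig
-- A tuple in this relation might look like:
-- (John, 18, 4.0, (2023, 9), {(math),(physics)}, [city#Austin])
students = LOAD 'students.txt'
    AS (name:chararray, age:int, gpa:float,
        enrolled:(year:int, month:int),      -- tuple nested inside the tuple
        courses:{t:(course:chararray)},      -- bag of tuples
        info:map[]);                         -- map; values can be any type
```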
Pig Latin provides several categories of operators and functions:
Arithmetic and comparison: +, -, *, /, %, ==, <, etc., plus the FLATTEN operator
Relational: LOAD, GROUP, FOREACH, JOIN, …
Diagnostic: DESCRIBE, DUMP, EXPLAIN, ILLUSTRATE
Eval: AVG, TOP, CONCAT, COUNT, …
Load/Store: TextLoader, PigStorage, …
System: cat, cd, ls, exec, …
UDF: User Defined Functions
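To make these categories concrete, here is a short hypothetical script (the file name and schema are assumptions) that touches several of them:

```pig
-- Relational: LOAD, GROUP, FOREACH; Eval: COUNT, SUM; Diagnostic: DESCRIBE, DUMP
logs    = LOAD 'access_log.txt' USING PigStorage('\t')
              AS (user:chararray, bytes:long);
grouped = GROUP logs BY user;              -- Relational operator
counts  = FOREACH grouped GENERATE group,  -- Relational operator
              COUNT(logs),                 -- Eval function
              SUM(logs.bytes);             -- Eval function
DESCRIBE counts;                           -- Diagnostic operator
DUMP counts;                               -- Diagnostic operator
```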
Note: The largest use case of Pig is data pipelines. A common example is web companies bringing in logs from their web servers, cleansing the data, and precomputing common aggregates before loading it into their data warehouse.
The following are some commonly used operations in Pig Latin:
LOAD: Loads data from the file system.
Syntax: LOAD 'data' [USING function] [AS schema];
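For example, loading a hypothetical tab-delimited file with an explicit schema (file name and fields are assumptions):

```pig
-- PigStorage is the default load function; '\t' is its field delimiter
students = LOAD 'students.txt' USING PigStorage('\t')
               AS (name:chararray, age:int, gpa:float);
```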
FOREACH: Generates data transformations based on columns of data.
Syntax: alias = FOREACH { block | nested_block };
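A minimal sketch, assuming a hypothetical students relation:

```pig
students  = LOAD 'students.txt' AS (name:chararray, age:int, gpa:float);
-- Project two columns and derive a new one from gpa
projected = FOREACH students GENERATE name, gpa * 25.0 AS score;
```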
FILTER: Selects tuples from a relation based on some condition.
Syntax: alias = FILTER alias BY expression;
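For instance, keeping only the tuples that satisfy a compound condition (relation and fields are hypothetical):

```pig
students = LOAD 'students.txt' AS (name:chararray, age:int, gpa:float);
-- Keep only adult students with a high GPA
adults   = FILTER students BY age >= 18 AND gpa > 3.0;
```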
GROUP: Groups the data in one or more relations.
Syntax: alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [USING 'collected' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
Note: The GROUP and COGROUP operators are identical. Both operators work with one or more relations. For readability, GROUP is used in statements involving one relation and COGROUP is used in statements involving two or more relations. We can COGROUP up to but no more than 127 relations at a time.
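A sketch of both forms, using hypothetical relations:

```pig
-- GROUP one relation: tuples become (group, {bag of matching tuples})
students = LOAD 'students.txt' AS (name:chararray, age:int);
by_age   = GROUP students BY age;

-- COGROUP two relations on a common key
grades   = LOAD 'grades.txt' AS (name:chararray, grade:chararray);
both     = COGROUP students BY name, grades BY name;
```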
LIMIT: Limits the number of output tuples.
Syntax: alias = LIMIT alias n;
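For example (relation is hypothetical); note that without a preceding ORDER there is no guarantee which tuples are returned:

```pig
students = LOAD 'students.txt' AS (name:chararray, gpa:float);
-- Keep (any) three tuples from the relation
sample   = LIMIT students 3;
```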
ORDER BY: Sorts a relation based on one or more fields.
Syntax: alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] } [PARALLEL n];
where * denotes a tuple and field_alias is a field in the relation "alias".
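Combining ORDER with LIMIT gives a deterministic top-N (relation and fields are hypothetical):

```pig
students = LOAD 'students.txt' AS (name:chararray, gpa:float);
ranked   = ORDER students BY gpa DESC, name ASC;
top3     = LIMIT ranked 3;   -- the three highest-GPA students
```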
DUMP: Dumps or displays results to screen.
Syntax: DUMP alias;
STORE: Stores or saves results to the file system.
Syntax: STORE alias INTO 'directory' [USING function];
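For example, writing a filtered relation out as comma-delimited text (file names and fields are assumptions; the output directory must not already exist):

```pig
students = LOAD 'students.txt' AS (name:chararray, gpa:float);
good     = FILTER students BY gpa >= 3.5;
STORE good INTO 'output/good_students' USING PigStorage(',');
```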
The FLATTEN operator looks like a UDF syntactically, but it is actually an operator that changes the structure of tuples and bags in a way that a UDF cannot. FLATTEN un-nests tuples as well as bags. The idea is the same, but the operation and result are different for each type of structure:
For tuples, FLATTEN substitutes the fields of a tuple in place of the tuple. For example, consider a relation that has a tuple of the form (a, (b, c)). The expression GENERATE $0, flatten($1) will cause that tuple to become (a, b, c).
For bags, the situation becomes more complicated. When we un-nest a bag, we create new tuples. If we have a relation that is made up of tuples of the form ({(b,c),(d,e)}) and we apply GENERATE flatten($0), we end up with two tuples (b,c) and (d,e). When we remove a level of nesting in a bag, sometimes we cause a cross product to happen. For example, consider a relation that has a tuple of the form (a, {(b,c), (d,e)}), commonly produced by the GROUP operator. If we apply the expression GENERATE $0, flatten($1) to this tuple, we will create new tuples: (a, b, c) and (a, d, e).
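Both cases sketched in Pig Latin (file names and schemas are hypothetical):

```pig
-- Tuple case: (a, (b, c)) becomes (a, b, c)
A     = LOAD 'a.txt' AS (x:chararray, t:(p:chararray, q:chararray));
flatA = FOREACH A GENERATE x, FLATTEN(t);

-- Bag case: (a, {(b,c),(d,e)}) becomes two tuples, (a,b,c) and (a,d,e)
B     = LOAD 'b.txt' AS (x:chararray, bg:{tp:(p:chararray, q:chararray)});
flatB = FOREACH B GENERATE x, FLATTEN(bg);
```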
We can run Pig in various modes:
Interactive Mode
Batch Mode
Programmatically
But there are basically two execution modes:
Local Mode: To run Pig in Local mode, we need access to a single machine; all files are installed and run using our local host and file system. We must specify local mode using the -x flag (i.e. pig -x local)
MapReduce Mode: To run Pig in MapReduce mode, we need access to a Hadoop cluster and HDFS installation. MapReduce mode is the default mode; we can, but don't need to, specify it using the -x flag (i.e. pig OR pig -x mapreduce)
All three of the modes listed above (Interactive, Batch, and Programmatic) support both the Local and MapReduce execution modes:
We can run Pig in Interactive mode using the Grunt shell.
We can invoke the Grunt shell using the "pig" command (as shown below) and then enter our Pig Latin statements and Pig commands interactively at the command line.
We can also use "pig -x local" OR "pig -x mapreduce" to invoke the Grunt shell in the respective modes.
We can run Pig in Batch mode using Pig scripts (.pig files) and the "pig" command (in Local or MapReduce mode).
Pig scripts are basically Pig Latin statements and Pig commands in a single .pig file.
Note: Comments in Pig scripts are written using -- (for single line comments) and /* ... */ (for multi-line comments)
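A minimal hypothetical .pig script illustrating both comment styles (all names are assumptions):

```pig
/*
 * myscript.pig -- a sample Pig script run in Batch mode
 */
data = LOAD 'input.txt' AS (line:chararray);  -- single-line comment
DUMP data;
```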
We can run Pig scripts using any of the following commands (where myscript.pig is a placeholder for the script file):
pig myscript.pig
OR
pig -x local myscript.pig
OR
pig -x mapreduce myscript.pig
The third command is equivalent to the first, since MapReduce mode is the default.
We can also run Pig programmatically from Java using the PigRunner and PigServer classes.
Note: Pig allows us to define our own functions (User Defined Functions or UDFs) and use them in our Pig scripts. This is referred to as the Embedded Mode.
As discussed earlier, Pig's compiler translates Pig Latin into Java MapReduce jobs for execution on the cluster.