CS-GY 9223-D: Programming for Big Data
main
main
  • Introduction
  • Big Data
  • Hadoop
    • An Introduction to Hadoop
    • The Main Components of Hadoop
    • Some Hadoop Related Projects
    • A Typical Large Data Problem
    • The Google File System vs HDFS
    • Data Types in Hadoop
    • Programming in Hadoop
    • Common Examples of MapReduce Jobs
    • Advantages and Disadvantages of MapReduce
  • Pig
    • An Introduction to Pig
    • Components of Pig
    • An Example Data Analysis Task Using Pig
Powered by GitBook
On this page

Was this helpful?

  1. Pig

An Example Data Analysis Task Using Pig

PreviousComponents of Pig

Last updated 4 years ago

Was this helpful?

The goal is to find the top 10 most-visited pages in each category.

Given data:

The Data Flow Graph for the task is shown below:

Pig Latin code for the task:

visits = load ‘/data/visits’ as (user, url, time);
gVisits = group visits by url;
visitCounts  = foreach gVisits generate url, count(visits);

urlInfo = load ‘/data/urlInfo’ as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into ‘/data/topUrls’;