An Example Data Analysis Task Using Pig

The goal is to find the top 10 most-visited pages in each category.

Given data:

Pig Latin code for the task:

visits = load ‘/data/visits’ as (user, url, time);
gVisits = group visits by url;
visitCounts  = foreach gVisits generate url, count(visits);

urlInfo = load ‘/data/urlInfo’ as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);

store topUrls into ‘/data/topUrls’;

Last updated