An Example Data Analysis Task Using Pig
The goal is to find the top 10 most-visited pages in each category.
Given data:
The Data Flow Graph for the task is shown below:
Pig Latin code for the task:
visits = load ‘/data/visits’ as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load ‘/data/urlInfo’ as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into ‘/data/topUrls’;
Last updated
Was this helpful?