Common Examples of MapReduce Jobs
WordCount
WordCount is the most common example of MapReduce.
Goal: Count the number of occurrences of each word in a file
Input to Mappers: Chunks of the input file
Mapper Output: (word, 1) pairs
Input to Reducers: (word, <1, 1, 1, ..., 1>) pairs
Operation at Reducer: Sum up the values for each word to get the count and output (word, count) pairs
Output of MapReduce Job: File containing (word, count) pairs
Filtering
Goal: Filter out records that are of no interest
Input to Mappers: Chunks of the input file
Mapper Output: Filtered records
Input to Reducer: Filtered records grouped by key
Output of MapReduce Job: Filtered records
We perform filtering at the mappers itself because the sort/shuffle phase of MapReduce is I/O heavy, and we want to reduce the dataset as much as possible in the map phase itself.
Basically, instead of emitting all (key, value) pairs, the mappers will emit only those pairs that meet a certain condition:
Top k
Goal: Select the top k records given any number of input records
Input to Mappers: Chunks of the input file (several records)
Mapper Output: Local top k records (each mapper will output top k records depending on its input records)
Input to Reducer: Top k records based on input records from mappers
Output of MapReduce Job: Top k records from the entire dataset
So, instead of analysing all n records at once, the reducer only has to process k*m records (where m is the number of mappers)
Binning
Goal: Move records into bins/categories, irrespective of the order of records
But partitioners do the exact same task! Why use binning? We use binning because the task of partitioners is I/O intensive.
In binning, the mappers receive chunks of the input file. The mappers itself categorize records into bins. There is no need for a reducer. We simply concatenate the intermediate results of the mappers to get the records in each bin.
Other Applications
MapReduce can also be used for Breadth First Search (BFS) and PageRank (Google's algorithm to rank websites in their search engine results). I have not covered their implementations in detail.
The next section lists the advantages and disadvantages of MapReduce programming.
Last updated