Question
In the MapReduce framework, for performing a parallel computation, a crucial step involves an input that consists of a set of n key-value pairs, (k, v), for which we need to collect each subset of key-value pairs that have the same key, k, into a single file. Describe an efficient external-memory algorithm for constructing all such files. How many disk transfers does your algorithm perform? DESCRIBE THE ALGORITHM, I DO NOT NEED ANY PSEUDO-CODE OR NORMAL CODE. STATE HOW MANY DISK TRANSFERS THE ALGORITHM PERFORMS IN TERMS OF BIG-O NOTATION.
This relates to one of the following: B-trees, external-memory sorting, or online caching algorithms.
Explanation / Answer
The computation runs on the parallel-computing architecture sometimes called cluster computing. MapReduce is a style of computing that has been implemented in several systems, including Google's internal implementation (simply called MapReduce) and the popular open-source implementation Hadoop, which can be obtained, along with the HDFS file system, from the Apache Foundation.
The input to a MapReduce computation is divided into chunks; a chunk is a collection of elements, and no element is stored across two chunks.

Technically, all inputs to Map tasks and outputs from Reduce tasks are of the key-value-pair form, but normally the keys of input elements are not relevant, and we shall tend to ignore them. Insisting on this form for inputs and outputs is motivated by the desire to allow composition of several MapReduce processes. The Map function takes an input element as its argument and produces zero or more key-value pairs; the types of keys and values are each arbitrary.
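As the hint above suggests, the standard external-memory route is to sort the n key-value pairs by key with an external-memory merge sort and then make one sequential scan over the sorted output, starting a new file every time the key changes, so that all pairs with the same key end up in the same file. The Python sketch below illustrates this idea; it is only an illustration, and the names (`group_by_key`, `memory_limit`), the pickle-based record format, and the one-file-per-key naming scheme are assumptions made for this example rather than anything fixed by the question.

```python
import heapq
import itertools
import os
import pickle
import tempfile


def _write_run(pairs, run_dir):
    """Sort one in-memory batch of (key, value) pairs and spill it to a sorted run file."""
    pairs.sort(key=lambda kv: kv[0])
    fd, path = tempfile.mkstemp(dir=run_dir, suffix=".run")
    with os.fdopen(fd, "wb") as f:
        for kv in pairs:
            pickle.dump(kv, f)
    return path


def _read_run(path):
    """Stream (key, value) pairs back from a sorted run file."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return


def group_by_key(pairs, out_dir, memory_limit=100_000):
    """Collect all pairs sharing a key into one file per key under out_dir.

    `memory_limit` (the number of pairs held in memory at once) stands in for
    the internal memory size M; it is an assumed parameter for this sketch.
    """
    os.makedirs(out_dir, exist_ok=True)
    with tempfile.TemporaryDirectory() as run_dir:
        # Pass 1: read the input in memory-sized batches, sort each batch,
        # and write it out as a sorted run.
        runs, batch = [], []
        for kv in pairs:
            batch.append(kv)
            if len(batch) >= memory_limit:
                runs.append(_write_run(batch, run_dir))
                batch = []
        if batch:
            runs.append(_write_run(batch, run_dir))

        # Pass 2: merge the sorted runs (a multiway merge; a full external sort
        # repeats this step when there are more runs than the merge fan-in),
        # then scan the merged stream once, cutting it into one file per key.
        merged = heapq.merge(*(_read_run(r) for r in runs), key=lambda kv: kv[0])
        for key, group in itertools.groupby(merged, key=lambda kv: kv[0]):
            # Assumes keys are safe to use as file names; hash or escape them otherwise.
            with open(os.path.join(out_dir, f"{key}.txt"), "w") as f:
                for _, value in group:
                    f.write(f"{value}\n")
```

With block size B and internal memory M, the sort dominates the cost, so the algorithm performs O((n/B) log_{M/B}(n/B)) disk transfers, the standard external-memory sorting bound; the final grouping scan adds only O(n/B) more.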