Distributed Computing

Chapter II discussed the research at CERN. Because of the huge amount of data, CERN created the Worldwide LHC Computing Grid (WLCG) to perform the analysis of that large data set. Google's MapReduce framework can be used for generating and processing large data sets. It allows the development of scalable parallel applications that process Big Data on a cluster or a grid [21]. For this, the data can be stored in a structured way (database) or an unstructured way (file system). A typical MapReduce computation processes many terabytes of data on thousands of machines [34].

Programming model

The programmer expresses the computation as two functions, map and reduce, both written by the user. The MapReduce framework contains a third step called shuffle. This step contains no user code; it is a generic step which is executed after the user's map function. The map function takes an input pair and produces a set of intermediate key/value pairs. The MapReduce framework then groups these intermediate values by their intermediate key; this is done by the shuffle step. The aggregated values are passed to the reduce function. Since the reduce function is arbitrary user code, its output is specified by the user; usually it merges the given values to form a possibly smaller set of values. Typically just zero or one output value is produced per reduce invocation. [21]
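
To make the division of labour between user code and framework concrete, the following is a minimal single-process Python sketch of this programming model (the name run_mapreduce and the in-memory data structures are illustrative, not part of any real framework): the user supplies map_fn and reduce_fn, while the grouping in the middle corresponds to the generic shuffle step.

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: apply the user's map function to every input key/value pair.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle phase: generic, framework-provided grouping of all values
    # that share the same intermediate key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: the user's reduce function sees each intermediate key
    # together with the complete list of values for that key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}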

Execution overview

Figure 14 shows the complete flow when user code calls the MapReduce framework.
[Figure 14: MapReduce execution overview]

When the user calls the MapReduce function in the user code,
the following sequence of actions occurs [34], [35]:

  1. The MapReduce framework splits the specified set of input files I into M pieces. It then starts up copies of the user's program on a cluster of machines, plus one additional copy called the Master.
  2. All copies except the Master are workers that are assigned work by the Master. The number of Map tasks and Reduce tasks does not have to be equal, so assume there are M Map tasks and R Reduce tasks. The Master picks idle workers and assigns a Map or Reduce task to each of them.
  3. A worker with a Map task reads the content of the corresponding input piece and passes it to the user's map function. The intermediate key/value pairs produced by the map function are buffered in memory.
  4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by a partitioning function (a sketch of such a function follows this list). The storage locations of these partitions are passed back to the Master, which is responsible for forwarding them to the Reduce workers.
  5. A Reduce worker uses remote procedure calls to read the buffered data from the local disks of the Map workers. After the Reduce worker has read all intermediate data for its partition, it sorts the data by the intermediate key so that all occurrences of the same key are grouped together.
  6. The Reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user's reduce function.
    The result of each Reduce worker is an output file which is written to an arbitrary destination.
  7. After all Map and Reduce tasks have been completed, the Master notifies the user program. At this point, the MapReduce call in the user code returns.
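
The partitioning mentioned in step 4 is typically done with a simple hash of the intermediate key, i.e. hash(key) mod R [34]. As a minimal sketch (the function name and the choice of hash are illustrative, not the actual implementation), such a partitioning function could look like this:

import hashlib

def partition(key, R):
    # Map an intermediate key to one of R regions. A stable hash is used here
    # so that every occurrence of a key ends up in the same region and is
    # therefore processed by the same Reduce worker.
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % R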

Example

Consider the problem of counting words in a single text or on a website. For this problem, the map function is very simple.

procedure MAP(key, value)
    for each word in value.split() do
        output.collect(word, 1)
    end for
end procedure

The map function processes one line at a time. It splits the line and emits a key/value pair <word, 1> for each word. Using the line "Hello World Bye World", the following output is produced:
<Hello, 1>
<World, 1>
<Bye, 1>
<World, 1>
The Reducer implementation just sums up the values:

procedure REDUCE(key, values)
    sum ← 0
    while values.hasNext() do
        sum ← sum + values.next()    ▷ Sum up the values
    end while
    return (key, sum)
end procedure

The output of the job is:
<Hello, 1>
<World, 2>
<Bye, 1>
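
The same word count can also be written as a small, self-contained Python sketch that simulates the map, shuffle and reduce phases in a single process (the function and variable names are illustrative; a real job would use a MapReduce framework such as Hadoop instead of plain lists and dictionaries):

from collections import defaultdict

def word_count_map(key, value):
    # Map: emit <word, 1> for every word in the input line; the key
    # (e.g. the line number) is not needed here.
    for word in value.split():
        yield word, 1

def word_count_reduce(key, values):
    # Reduce: sum up all counts emitted for one word.
    return sum(values)

lines = ["Hello World Bye World"]

# Map phase
intermediate = []
for number, line in enumerate(lines):
    intermediate.extend(word_count_map(number, line))

# Shuffle phase: group the emitted values by their intermediate key (the word)
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce phase
result = {word: word_count_reduce(word, counts) for word, counts in groups.items()}
print(result)   # {'Hello': 1, 'World': 2, 'Bye': 1}
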
MapReduce is designed as a batch processing framework and is therefore not suitable for ad-hoc data analysis. The time needed to process this kind of analysis is too long, and it does not allow programmers to perform iterative or one-shot analysis on Big Data [22]. The Apache Hadoop project provides an open source implementation of the MapReduce algorithm [36]. Today this implementation is used by many companies, such as Fujitsu Limited [13] or Amazon.com Inc. [37].

2 comments on Distributed Computing

  1. Paul says:

    Hi,

    I also read some articles about the BigQuery suite that is provided by Google. Do you also have some information about this?

    Thank you

  2. Hi Paul,

    thank you for your question. Actually, Google's BigQuery is more like "OLAP for Big Data". Here the user cannot execute complex mining operations; it is "just" a query language for very large data sets. There is a small section about this in the chapter "Related Work".

    Regards
