Related Work

Big Data has become a very important topic in recent years. It is still an active area of research, but it has also reached business. Various companies are working in the field of Big Data, be it in research or in consulting: McKinsey Company [1], IBM Corporation [3], Fujitsu Limited [13], Intel Corporation [14] and many more. This chapter discusses some further developments that improve the algorithms and methods usually used to process Big Data. In addition, alternative solutions are presented that can be used for analyzing or processing Big Data.

A. Basic principles

There are different techniques that allow the analysis of large data sets. Not all of these techniques strictly require Big Data, but they can be applied in this area. Most algorithms for finding large item sets are related to the Apriori algorithm, which will be discussed in Chapter IV-A2. All of these algorithms and methods are usually based on the same basic principles: statistics, probability theory and machine learning, in the sense of a system that can learn from data. The largest of these areas is Data Mining. Data Mining can be described as a set of techniques that can be used to extract patterns from small as well as large data sets [15]. In the field of Data Mining the most important techniques are association rule learning, cluster analysis and classification.
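As a brief illustration of association rule learning, the following sketch computes the support and confidence of a single rule over a toy transaction database; the transactions, the item names and the rule are purely hypothetical and only serve to illustrate the two measures.

    # Minimal sketch: support and confidence of an association rule
    # over a toy transaction database (hypothetical data).

    transactions = [
        {"bread", "milk"},
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"milk"},
    ]

    def support(itemset, transactions):
        """Fraction of transactions that contain every item of the item set."""
        hits = sum(1 for t in transactions if itemset <= t)
        return hits / len(transactions)

    def confidence(antecedent, consequent, transactions):
        """Support of the full rule divided by the support of its antecedent."""
        return support(antecedent | consequent, transactions) / support(antecedent, transactions)

    # Rule {bread} -> {milk}
    print(support({"bread", "milk"}, transactions))      # 0.5
    print(confidence({"bread"}, {"milk"}, transactions))  # about 0.67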

B. Improvements

Today the performance of such algorithms is becoming more and more important, because analyzing large data sets results in high computational costs. In recent years much progress has been made in the following directions [16]:

  • Selection of the start parameters of the algorithms
  • Reducing the number of passes over the database
  • Sampling the database
  • Adding extra constraints on the structure of patterns
  • Parallelization

Chapter IV-B1 discusses the K-means method, which is used for clustering. The classic implementation has an exponential worst-case runtime of O(2^n) [17]. In [18], Arthur et al. published an alternative implementation that achieves a lower runtime by improving the start parameters of the algorithm. Other implementations pursue an optimization approach to reduce the runtime.
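The details of the improved seeding are given in [18]; a minimal sketch of the underlying idea, commonly known as k-means++ seeding, could look as follows. The data set X, the parameter k and the use of NumPy are assumptions made only for this illustration.

    import numpy as np

    def kmeanspp_seeds(X, k, rng=None):
        """Pick k initial centers: the first uniformly at random, every further
        center with probability proportional to its squared distance to the
        nearest center chosen so far (the seeding idea of k-means++)."""
        rng = rng or np.random.default_rng(0)
        centers = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centers)

    # hypothetical usage with random two-dimensional points
    X = np.random.default_rng(1).random((100, 2))
    print(kmeanspp_seeds(X, k=3))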
Many variations of the Apriori algorithm follow a similar approach; they are discussed in [19]. Most of these variations are concerned with the generation of the temporary candidate data set, because this part causes the greatest computational cost.
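To illustrate why this step dominates the cost, the following sketch shows the classic join-and-prune step of Apriori that builds candidate (k+1)-item sets from the frequent k-item sets; the frozenset representation and the example item sets are assumptions made for illustration only.

    from itertools import combinations

    def generate_candidates(frequent_k_itemsets):
        """Join step of Apriori: combine frequent k-item sets that differ in one
        item to form candidate (k+1)-item sets, then prune every candidate that
        contains an infrequent k-subset."""
        frequent = set(frequent_k_itemsets)
        candidates = set()
        for a in frequent:
            for b in frequent:
                union = a | b
                if len(union) == len(a) + 1:
                    # prune: every k-subset of the candidate must be frequent
                    if all(frozenset(s) in frequent for s in combinations(union, len(a))):
                        candidates.add(union)
        return candidates

    # hypothetical frequent 2-item sets
    f2 = [frozenset({"a", "b"}), frozenset({"a", "c"}), frozenset({"b", "c"})]
    print(generate_candidates(f2))  # {frozenset({'a', 'b', 'c'})}

The nested loop over the frequent k-item sets hints at the cost: before pruning, the number of candidates can grow quadratically with the number of frequent item sets.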

In [20], Yang Kai et al. discuss a new algorithm called FA-DMFI that is used for discovering maximum frequent item sets. To this end, the algorithm stores association information by scanning the database only once; the temporary data is kept in an association matrix. In summary, the efficiency gain is achieved in two steps:

  1. The huge database is compressed into a smaller data structure, which avoids repeated and costly database scans.
  2. The maximum frequent item sets are then generated by means of the cover relation within this association matrix, so the costly generation of a large number of candidates is avoided.

With these steps the I/O time and the CPU time are reduced, and the generation of the candidate item sets and the frequent item sets can be done in less time. The Apriori algorithm, which is discussed in Chapter IV-A2, spends a lot of time generating the frequent item sets and needs to scan the data source multiple times. Adopting some implementation details of the FA-DMFI algorithm could therefore also improve the runtime of the Apriori algorithm.
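FA-DMFI's actual data structure is described in [20]; the simplified sketch below only illustrates the general idea behind the first step, namely that a single scan can fill an item-by-transaction bit matrix from which supports are later computed without touching the database again. The matrix layout, the function names and the example data are assumptions for illustration, not the original algorithm.

    import numpy as np

    def build_bit_matrix(transactions, items):
        """Single pass over the database: one row per item, one column per
        transaction, cell = 1 if the transaction contains the item."""
        index = {item: i for i, item in enumerate(items)}
        matrix = np.zeros((len(items), len(transactions)), dtype=np.uint8)
        for col, t in enumerate(transactions):   # the only database scan
            for item in t:
                matrix[index[item], col] = 1
        return matrix, index

    def support_from_matrix(itemset, matrix, index, n_transactions):
        """Support of an item set computed from the matrix, without rescanning."""
        rows = matrix[[index[i] for i in itemset]]
        return np.all(rows, axis=0).sum() / n_transactions

    # hypothetical transactions
    transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
    m, idx = build_bit_matrix(transactions, ["a", "b", "c"])
    print(support_from_matrix({"a", "b"}, m, idx, len(transactions)))  # about 0.67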

C. Distributed computing

Often the computational tasks are very complex and the data needed for an individual task amounts to several terabytes or even more. It is then not possible to compute these tasks on a single machine because the hardware resources are insufficient. For these reasons the parallelization of computational tasks becomes more and more important.

MapReduce is a distributed computing technology that was developed by Google. The programming model allows developers to implement custom mapper and reducer functions programmatically and to run batch processes on hundreds or thousands of computers. Chapter IV-C discusses this method in more detail. Based on this technology Google developed a web service called Google BigQuery, which can be used for analyzing large data sets. BigQuery is based on Dremel, a query service that allows running SQL-like queries against very large data sets and returns accurate results in seconds [21]. The data set can thus be accessed and queried, but BigQuery is not meant to execute complex mining tasks. Usually the queries are issued from Microsoft Office or the Google Chrome browser. Figure 2 gives an example of the BigQuery workflow.

Figure 2. Example of the BigQuery workflow.

With this approach Google tries to offer a suitable system for OLAP (Online Analytical Processing) or BI (Business Intelligence) on Big Data. Here, most queries are quite simple and consist of simple aggregations or filtering on columns. Because of this, BigQuery is not a programming model for deep analysis of Big Data. It is a query service with functionality to perform quick analyses such as aggregation of data, and there is no possibility to run user-defined code [22].
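As a small illustration of the mapper/reducer programming model mentioned above (not of BigQuery itself), the following hypothetical word-count job simulates the map, shuffle and reduce phases on a single machine; real MapReduce systems distribute these phases over many computers.

    from collections import defaultdict

    def mapper(document):
        """Map phase: emit a (key, value) pair for every word in the document."""
        for word in document.split():
            yield word, 1

    def reducer(key, values):
        """Reduce phase: combine all values emitted for one key."""
        return key, sum(values)

    def run_job(documents):
        """Sequential simulation of the shuffle between the map and reduce phases."""
        groups = defaultdict(list)
        for doc in documents:
            for key, value in mapper(doc):
                groups[key].append(value)
        return dict(reducer(k, v) for k, v in groups.items())

    print(run_job(["big data", "big compute"]))  # {'big': 2, 'data': 1, 'compute': 1}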

D. Big Data and Cloud Computing

In [23], Wang et al. present a development platform based on virtual clusters. Virtual clusters consist of virtual machines provided by cloud computing. The platform presented in [23] provides techniques for parallel data mining on top of these virtual clusters. To measure the performance of the system, Wang et al. used a parallelized PSO (Particle Swarm Optimization) algorithm. The platform reduces the computing time of the swarm and improves the cluster quality. To exploit the advantages of a parallel architecture, however, a correspondingly parallelized algorithm is required.
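The platform of [23] is not available as code, but the general idea of parallelizing PSO, namely evaluating the fitness of all particles concurrently, can be sketched as follows. The sphere objective function, the swarm parameters and the use of a process pool are assumptions made only for this illustration.

    import numpy as np
    from multiprocessing import Pool

    def fitness(particle):
        """Placeholder objective: sphere function (to be minimized)."""
        return float(np.sum(particle ** 2))

    def parallel_pso(n_particles=20, dim=2, iterations=50, workers=4, seed=0):
        rng = np.random.default_rng(seed)
        pos = rng.uniform(-5, 5, (n_particles, dim))
        vel = np.zeros((n_particles, dim))
        best_pos, best_val = pos.copy(), np.full(n_particles, np.inf)
        g_pos, g_val = None, np.inf
        with Pool(workers) as pool:
            for _ in range(iterations):
                # the costly fitness evaluations are distributed over the workers
                vals = np.array(pool.map(fitness, list(pos)))
                improved = vals < best_val
                best_val[improved], best_pos[improved] = vals[improved], pos[improved]
                if vals.min() < g_val:
                    g_val, g_pos = vals.min(), pos[vals.argmin()].copy()
                r1, r2 = rng.random((2, n_particles, dim))
                vel = 0.7 * vel + 1.5 * r1 * (best_pos - pos) + 1.5 * r2 * (g_pos - pos)
                pos = pos + vel
        return g_pos, g_val

    if __name__ == "__main__":
        print(parallel_pso())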
