This paper introduced the basic principles of Big Data analysis techniques and technologies. As the examples have shown, Big Data is an important topic in daily business: it helps companies to understand their customers and to improve business decisions. Further research in medicine and many other areas would not be possible without Big Data analysis.
Because of these varied requirements, many approaches are needed. This paper focused on Association Rule Mining, Cluster Analysis and Distributed Computing. In the field of Association Rule Mining, the Apriori and FP-Growth algorithms were presented. The Apriori algorithm is the most common one in this area, but it has to scan the data source repeatedly. In addition, its complexity is comparatively high because of the generation of the candidate item sets. In order to reduce the number of scans of the data source, the FP-Growth algorithm was published. Instead of n+1 scans it takes only two. Once a special data structure, the FP-Tree, has been built, the further work can easily be parallelized and executed on multiple machines.
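The level-wise scanning and candidate generation of Apriori summarized above can be illustrated with a minimal sketch. It is not an optimized implementation; the function name `apriori`, the parameter `min_support` (an absolute count) and the example transactions are assumptions made for illustration only.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {item set: support count} for all frequent item sets.

    Illustrative sketch: min_support is an absolute count (assumption).
    """
    transactions = [frozenset(t) for t in transactions]

    # First scan of the data source: count the single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-item sets and prune
        # every candidate that has an infrequent (k-1)-subset.
        prev = set(current)
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(s) in prev for s in combinations(union, k - 1)
                ):
                    candidates.add(union)

        # One further scan of the data source per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


if __name__ == "__main__":
    baskets = [["bread", "milk"], ["bread", "butter"],
               ["bread", "milk", "butter"], ["milk"]]
    for itemset, support in apriori(baskets, min_support=2).items():
        print(sorted(itemset), support)
```

The nested loop over candidates and transactions at every level is what makes the repeated scans expensive, which is the cost FP-Growth avoids by building the FP-Tree in two scans.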
The K-Means algorithm is the most common one for cluster analysis. Given a data source, it creates k clusters based on the Euclidean distance. The result depends on the initial cluster positions and the number of clusters. Because the initial clusters are chosen at random, there is no unique solution for a specific problem. The K-Means++ algorithm, however, improves the runtime by performing a preprocessing step that selects the initial cluster centers before it proceeds as the K-Means algorithm. With this preprocessing, the K-Means++ algorithm is guaranteed to find a solution that is, in expectation, O(log k)-competitive with the optimal K-Means solution.
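The preprocessing step of K-Means++ can be sketched as follows: the first center is drawn uniformly at random, and each further center is drawn with probability proportional to its squared Euclidean distance to the nearest center chosen so far. The names `kmeans_pp_seed` and `squared_distance` are hypothetical, and the sketch assumes that the input contains at least k distinct points.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seed(points, k, rng=None):
    """Choose k initial centers with the K-Means++ seeding rule.

    Assumption: points contains at least k distinct points; otherwise all
    weights become zero and random.choices raises a ValueError.
    """
    rng = rng or random.Random()
    centers = [rng.choice(points)]  # first center: uniformly at random
    while len(centers) < k:
        # Weight of each point: squared distance to its nearest center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(kmeans_pp_seed(data, k=2, rng=random.Random(0)))
```

The returned centers would then be passed to the ordinary K-Means iteration. Points that are already centers have weight zero and are therefore never drawn twice.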
Finally, the MapReduce framework allows parallel and distributed processing of Big Data. For this, hundreds or even thousands of computers are used to process large amounts of data. The framework is easy to use, even without knowledge of parallel and distributed systems. All techniques in this paper are used for processing large data sets. Therefore it is usually not possible to get a response within seconds or a few minutes. It usually takes many minutes, hours or even days until the result is computed.
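The programming model behind MapReduce can be illustrated with a single-process sketch. Word counting is used here only as an assumed example task, and the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical. A real framework would distribute the map and reduce calls over many machines, which this sketch does not do.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce sequentially in one process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

Because each call to `map_phase` depends only on its own document and each call to `reduce_phase` only on its own key, both phases could in principle run in parallel without the user writing any synchronization code.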
For most companies, however, a very fast response is necessary, as in the OLAP approach. For this, Google’s BigQuery is one possible solution. Tools for ad-hoc analysis, however, offer no possibility of a deep analysis of the data set. Providing ad-hoc analysis together with a deep analysis of the data set requires algorithms that are more efficient and more specialized for a certain domain. Because of this, technology for analyzing Big Data is also an important area in the academic environment, where the focus is on the running time and the efficiency of these algorithms. Furthermore, the data volume is still rising, so another likely approach is further parallelization of the computational tasks. On the other hand, the running time can also be improved by specialized algorithms such as the Apriori algorithm for the analysis of related transactions.