In July 2012 the Large Hadron Collider (LHC) experiments ATLAS and CMS announced that they had each observed a new particle in the mass region around 126 GeV, also known as the Higgs boson.
The LHC is the world’s largest and most powerful particle accelerator. In the search for the Higgs boson, particles collide within the LHC approximately 600 million times per second. Each collision generates particles that often decay in complex ways into even more particles. Sensors record the passage of each particle through a detector as a series of electronic signals and send the data to the CERN Data Centre for digital reconstruction. Physicists must sift through the approximately 15 petabytes of data produced annually to determine whether the collisions have thrown up any interesting physics.
The Worldwide LHC Computing Grid (WLCG) was created in 2002 to address the lack of sufficient computational resources. The grid links thousands of computers in 140 centers across 45 countries. To date, the WLCG is the largest computing grid in the world, and it runs more than one million jobs per day. At peak rates, 10 gigabytes of data may be transferred from its servers every second.
Similar scenarios exist in medicine, in government, and in the private sector (e.g. Amazon.com, Facebook). Dealing with large data sets is becoming more and more important, because an accurate data basis is needed to tackle the problems in these areas. The challenge is to run complex analysis algorithms on Big Data in order to find new knowledge contained in the data. This knowledge can be used to address important, time-critical, and everyday problems, such as the research at CERN, the analysis of the stock market, or the analysis of company data gathered over a long period of time.
A. Just in time analysis
Often it is necessary to analyze data just in time because it has temporal significance (e.g. stock market or sensor network data). This may require analyzing the data for a certain time period, e.g. the last 30 seconds. Such requirements can be addressed through the use of Data Stream Management Systems (DSMS). The corresponding software tools are called Complex Event Processing (CEP) engines, and queries are written in declarative Event Processing Languages (EPL) such as the one provided by Esper. The syntax is similar to that of SQL in databases.
select avg(price) from
StockTickEvent.win:time(30 sec)
As the example above shows, an EPL supports windows over data streams, which can be used to buffer events for a defined time. This technique is used for data stream analysis. A window can either move or slide in time; Figure 1 shows the difference between the two variants. If a window moves by as much as the window size, it is a tumbling window. The other type is the sliding window, which slides in time and buffers the last x elements.
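The difference between the two window types can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Esper or any other CEP engine implements windows: the function names `tumbling_averages` and `sliding_averages`, the `(timestamp, price)` event layout, and the sample ticks are all hypothetical, and both windows are time-based (in seconds) to mirror the 30-second query above.

```python
from collections import deque


def tumbling_averages(events, size):
    """Average price per non-overlapping window [k*size, (k+1)*size)."""
    buckets = {}
    for timestamp, price in events:
        # Each event belongs to exactly one window, chosen by its timestamp.
        buckets.setdefault(timestamp // size, []).append(price)
    return [sum(prices) / len(prices) for _, prices in sorted(buckets.items())]


def sliding_averages(events, size):
    """Average price over the last `size` seconds, updated on every event.

    Assumes the events arrive ordered by timestamp.
    """
    window = deque()
    averages = []
    for timestamp, price in events:
        window.append((timestamp, price))
        # Evict events that have fallen out of the time window.
        while window[0][0] <= timestamp - size:
            window.popleft()
        averages.append(sum(p for _, p in window) / len(window))
    return averages


# Hypothetical stock ticks: (timestamp in seconds, price).
ticks = [(0, 10.0), (10, 20.0), (25, 30.0), (40, 40.0), (55, 50.0), (70, 60.0)]
print(tumbling_averages(ticks, 30))
print(sliding_averages(ticks, 30))
```

The tumbling variant emits one result per window, whereas the sliding variant emits a fresh result for every incoming event, which is why a single event can contribute to several sliding results but to only one tumbling result.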
The size of the window can be set by the user in the select statement. Note that smaller windows contain less data to compare, so they often result in high false positive rates. This effect can be compensated for by using a larger window. When using a CEP engine, the computation happens just in time.
Because of this, the complexity of the analysis cannot be very high. Detecting the Higgs boson using data stream analysis alone might therefore not be possible, because the analysis is too complex, and further mining algorithms on the full data set are required. Turning to the Twitter service, it might be possible to use CEP to analyze the stream of tweets to detect emotional states or something similar. This example reveals some additional aspects of Big Data analysis with respect to fault tolerance. Consider the following tweet:
After a whole 5 hours away from work, I get to go back again, I’m so lucky!
This tweet contains sarcasm, and detecting it is a complex task. One way to solve this problem is to collect training data and apply it to an appropriate learning algorithm.
B. Rule mining and clustering
A more advanced technique is association rule mining. It searches for relationships between items in data sets, but it can also be implemented for analyzing streams. Algorithms use association rules to find associations, correlations, or causal structures. Chapters IV-A2 and IV-A3 discuss this technique in more detail.
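As a rough illustration of what an association rule expresses, the following Python sketch computes the two measures commonly used to rank rules, support and confidence, and lists all single-item rules that pass given thresholds. It is a brute-force toy under stated assumptions and not one of the algorithms discussed later: the function names, the thresholds, and the shopping-basket transactions are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated probability of `consequent` given `antecedent`."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def pair_rules(transactions, min_support, min_confidence):
    """Brute-force search for rules of the form {lhs} -> {rhs}."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            if (support({lhs, rhs}, transactions) >= min_support
                    and confidence({lhs}, {rhs}, transactions) >= min_confidence):
                rules.append((lhs, rhs))
    return rules


# Hypothetical transactions, e.g. shopping baskets.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
print(pair_rules(baskets, min_support=0.5, min_confidence=0.9))
```

Enumerating all item pairs does not scale to large item sets; the sketch is only meant to make the notion of a rule and its two measures concrete.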
Cluster analysis is the task of grouping similar objects into groups or classes. The classes are defined during the clustering; objects are not mapped to predefined classes. Today, cluster analysis techniques are used in different areas, such as data compression, process monitoring, DNA analysis, finance, radar scanning, and further research areas. In all of these areas, huge amounts of data are stored. A clustering algorithm can be hierarchical or partitional.
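To make the partitional case concrete, here is a minimal k-means sketch in Python. k-means is chosen here only as a well-known example of a partitional algorithm; the text above does not prescribe it. The function names, the naive initialization with the first k points, and the sample points are assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans(points, k, iterations=10):
    """Partition `points` into k clusters; returns (centroids, clusters)."""
    # Naive deterministic initialization: use the first k points (assumption).
    centroids = [points[i] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [centroid(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # assignments have stabilized
        centroids = updated
    return centroids, clusters


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Note that the groups are not given in advance: they emerge from the alternating assignment and update steps, which matches the point that classes are defined during the clustering itself.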