# Clustering

Cluster analysis is the task of grouping similar objects into classes, where the classes themselves are defined during the clustering. One area of use is marketing: clustering can discover groups among all customers, and these groups can then be used for targeted marketing.
Cluster analysis is a very important area in data mining. This chapter discusses the K-Means algorithm and a variation called the K-Means++ algorithm. Although the first version of the K-Means algorithm was published in 1964, it is still one of the most important methods of cluster analysis. It is the basis for further algorithms that often improve the accuracy or the run time.

## K-Means algorithm

Clustering is the task of dividing data into groups of similar objects. This task can be solved in various ways; there is no single algorithm that solves it.
There are two main methods of clustering:

1. Hierarchical:
One cluster is built at the beginning. Iteratively, points are added to existing clusters or a new cluster is created.
2. Centroid based:
A cluster is represented by a central point, which need not be part of the data set.

The K-means algorithm is one of the most common techniques used for clustering. It is an unsupervised learning algorithm and it is centroid based. The goal of the K-Means algorithm is to find the best division of an arbitrary data set into k classes or clusters, such that the distance between the members of a cluster and its corresponding centroid is minimized. Various metrics can be used to calculate this distance. In the standard implementation of the K-means method, a partition of a data set with n entities into k sets $S_1, \dots, S_k$ with means $\mu_1, \dots, \mu_k$ is sought that minimizes the within-cluster sum of squares (WCSS):

$$\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

The expression $\lVert x - \mu_i \rVert^2$ describes the squared Euclidean distance between an entity and its cluster's centroid. This implementation is also known as Lloyd's algorithm.
It provides a solution that can be trapped in a local minimum, and there is no guarantee that this corresponds to the best possible solution. The accuracy and running time of K-means clustering depend heavily on the position of the initial cluster centers. [29]
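The WCSS objective can be computed directly from a given clustering. The following sketch in plain Python (the function name and the toy data are illustrative, not from the text) sums the squared Euclidean distances between each entity and its cluster's centroid:

```python
def wcss(clusters, centroids):
    """Within-cluster sum of squares: sum of squared Euclidean
    distances between each entity and its cluster's centroid."""
    total = 0.0
    for points, c in zip(clusters, centroids):
        for x in points:
            total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return total

# Two toy clusters in 2-D with their centroids.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0), (7.0, 5.0)]]
centroids = [(1.0, 0.0), (6.0, 5.0)]
print(wcss(clusters, centroids))  # each point is at squared distance 1 -> 4.0
```

K-means can be seen as a local search that decreases exactly this quantity in every iteration.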
The initial K-means implementation requires three parameters: the number of clusters k, a distance metric, and the cluster initialization.
There are different implementations of the K-means method. The classic K-means clustering algorithm takes just the parameter k; it uses the Euclidean distance, and the clusters are initialized with the first k items of the input. The algorithm creates k clusters by assigning each entity to its closest cluster mean according to the distance metric. Using all the data of a cluster, a new mean is calculated. Based on the new means, the entities have to be reassigned. This loop continues until the maximum number of iterations is reached or the newly calculated means do not move any more. In fact, it is a heuristic algorithm. The basic K-means algorithm is as follows [30]:

1. Select k entities as the initial centroids.
2. (Re)Assign all entities to their closest centroids.
3. Recompute the centroid of each newly assembled cluster.
4. Repeat step 2 and 3 until the centroids do not change or until the maximum value for the iterations is reached.
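The four steps above can be sketched in plain Python. This follows the classic variant described earlier, where the first k entities serve as initial centroids; the function names and the toy data are illustrative assumptions:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def k_means(data, k, max_iter=100):
    # Step 1: classic variant -- take the first k entities as initial centroids.
    centroids = [list(p) for p in data[:k]]
    for _ in range(max_iter):
        # Step 2: (re)assign every entity to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in data:
            i = min(range(k), key=lambda j: euclidean(p, centroids[j]))
            clusters[i].append(p)
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for i, cl in enumerate(clusters):
            if cl:
                new_centroids.append([sum(c) / len(cl) for c in zip(*cl)])
            else:
                new_centroids.append(centroids[i])  # keep centroid of an empty cluster
        # Step 4: stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated toy groups in 2-D.
data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0)]
centroids, clusters = k_means(data, k=2)
print(centroids)  # [[1.25, 1.5], [8.5, 8.5]]
```

Note that the result still depends on the initialization: a different choice of initial centroids can converge to a different local optimum.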

Figure 13 shows an exemplary result of a K-means execution.

The purple x represents a centroid of a cluster.

## K-Means++ algorithm

The K-means algorithm converges to an optimum, but without guarantee that it is the global one. There may often be a large number of local optima, depending on the total number of entities n, the number of clusters k and the original layout of the entities.
Because of this, the choice of the starting values for the algorithm is critical, since different parameters can lead to different optima [29][31]. According to Aloise et al. [17], the K-means clustering problem is NP-hard in Euclidean space even for k = 2, and it has a worst-case runtime of $O(n^{dk+1})$, where d is the dimension of the data.

The K-means algorithm is the starting point for many further algorithms. Because the basic decision version of the K-means problem is NP-hard, there are additional, slightly modified implementations [18,32]. A common way to deal with NP-hard problems is to compute an approximate solution.
Based on K-means, Arthur and Vassilvitskii described the serial K-means++ method [18] as an approximation algorithm for K-means that improves the theoretical runtime. The K-means++ method selects the first centroid at random. Each subsequent centroid is selected with a probability proportional to its squared distance from the already chosen centroids. Let $D(x)$ denote the shortest distance from an entity x of the data source X to the closest center that has already been chosen. Then the K-means++ initialization is as follows:

1. Take one center $c_1$, chosen uniformly at random from X.
2. Take a new center $c_i$, choosing $x \in X$ with probability $\frac{D(x)^2}{\sum_{x' \in X} D(x')^2}$.
3. Repeat step 2 until k centers have been taken.
4. Proceed as with the standard K-means algorithm.
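Steps 1-3, the initialization, can be sketched in plain Python. The helper name, the toy data and the fixed random seed are illustrative assumptions; it is assumed that k does not exceed the number of distinct entities:

```python
import random

def k_means_pp_init(data, k, rng=None):
    """K-means++ initialization: pick the first center uniformly at
    random, then each further center x with probability proportional
    to D(x)^2, the squared distance to the closest chosen center."""
    if rng is None:
        rng = random.Random()
    centers = [rng.choice(data)]                      # step 1
    while len(centers) < k:                           # step 3
        # D(x)^2 for every entity x in the data set.
        d2 = [min(sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                  for c in centers)
              for x in data]
        # Step 2: sample x with probability D(x)^2 / sum of all D(x')^2.
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for x, w in zip(data, d2):
            acc += w
            if acc >= r:
                centers.append(x)
                break
    return centers

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
centers = k_means_pp_init(points, 3, random.Random(0))
```

Because an already chosen center has $D(x) = 0$, it is (almost surely) never picked again, so the sampled centers tend to spread out over the data.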

Step 2 is also called "$D^2$ weighting".
Although the K-means++ method needs more time for steps 1-3, step 4 converges much faster than the K-means method without this initial selection. The main idea is to choose the centers in a controlled fashion, where the current set of chosen centers stochastically influences the choice of the next center [33].
According to Arthur et al. [18], K-means++ is $O(\log k)$-competitive with the optimal solution, where k is the number of clusters.