Clustering


Identifying Clusters


Clustering is one of the most important problems solved using unsupervised learning. It is used to discover the inherent groupings in the input data, for example, grouping companies by revenue or grouping countries by GDP. Since it is a form of unsupervised learning, there is no feedback or verification during learning; the grouping is based purely on the values associated with the input samples.
In theory, data points that are in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features. Such clustering is a common technique for statistical data analysis used in many fields.

Clustering Algorithms

Clustering algorithms are generally grouped into the following three categories:

Hierarchical methods:

In agglomerative hierarchical algorithms, we start by treating each data point as its own cluster. The two closest clusters are then combined into a new cluster, and at each subsequent step two existing clusters are merged into one. In divisive hierarchical algorithms, we start by putting all data points into a single cluster, divide that cluster into two, and at each subsequent step divide an existing cluster into two. Agglomerative methods are used much more often than divisive methods. Hierarchical methods can also be adapted to cluster variables rather than observations, which is a common use for them.
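
To make the agglomerative procedure concrete, here is a minimal sketch using SciPy's scipy.cluster.hierarchy module; the two-blob toy data, the choice of average linkage, and the cut into two clusters are illustrative assumptions, not part of the method itself.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    # Toy data: two well-separated blobs of 2-D points.
    X = np.vstack([rng.normal(0, 0.5, size=(10, 2)),
                   rng.normal(5, 0.5, size=(10, 2))])

    # Agglomerative clustering: start with 20 singleton clusters and
    # repeatedly merge the two closest clusters (average linkage here).
    Z = linkage(X, method="average")

    # Cut the resulting merge tree to recover a flat partition into 2 clusters.
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(labels)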

Non-hierarchical methods:

In a non-hierarchical method, the data are initially partitioned into a set of K clusters. This may be a random partition or a partition based on a first "good" guess at seed points, which form the initial centers of the clusters. Data points are then iteratively moved between clusters until no sensible reassignment remains. The number of clusters K may be specified by the user or chosen by the algorithm. The most commonly used non-hierarchical method is MacQueen's K-means method.
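
As a sketch of this iterate-until-stable idea, here is a small NumPy implementation of the batch (Lloyd-style) variant of K-means; note that MacQueen's original algorithm updates the centers after each individual reassignment, whereas this version updates them once per pass. The toy data and parameter choices are assumptions for illustration.

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Initial "seed points": k distinct observations chosen at random.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assign every point to its nearest current center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each center as the mean of its assigned points
            # (keeping the old center if a cluster happens to be empty).
            new_centers = np.array([X[labels == j].mean(axis=0)
                                    if np.any(labels == j) else centers[j]
                                    for j in range(k)])
            if np.allclose(new_centers, centers):
                break  # no reassignment changes anything: stop
            centers = new_centers
        return labels, centers

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
                   rng.normal(4, 0.5, size=(20, 2))])
    labels, centers = kmeans(X, k=2)
    print(labels)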

Model-based methods:

A model-based method uses a mixture model to specify the density function of the x-variables. In a mixture model, a population is modeled as a mixture of different subpopulations, each with the same general form for its probability density function but possibly different values for parameters, such as the mean vector. For instance, the model may be a mixture of multivariate normal distributions. In cluster analysis, the algorithm finds the partition of the dataset that maximizes the likelihood function defined by the mixture model. The standard example is the Gaussian mixture model, typically fit with the EM algorithm.
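
Here is a minimal sketch of model-based clustering with a Gaussian mixture, using scikit-learn's GaussianMixture (which fits the mixture by maximum likelihood via EM); the two-component toy data below is an illustrative assumption.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Toy data drawn from two multivariate normal subpopulations.
    X = np.vstack([rng.normal(0, 1, size=(100, 2)),
                   rng.normal(6, 1, size=(100, 2))])

    # Fit a 2-component mixture of normals by maximum likelihood (EM).
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

    labels = gmm.predict(X)        # hard partition: most likely component
    probs = gmm.predict_proba(X)   # soft assignment: posterior probabilities
    print(gmm.means_)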

Agglomerative Hierarchical Clustering

1. Single Linkage: In single linkage, we define the distance between two clusters as the minimum distance between any single data point in the first cluster and any single data point in the second cluster. On the basis of this definition of distance between clusters, at each stage of the process we combine the two clusters with the smallest single linkage distance.
2. Complete Linkage: In complete linkage, we define the distance between two clusters to be the maximum distance between any single data point in the first cluster and any single data point in the second cluster. On the basis of this definition of distance between clusters, at each stage of the process we combine the two clusters that have the smallest complete linkage distance.
3. Average Linkage: In average linkage, we define the distance between two clusters to be the average distance between data points in the first cluster and data points in the second cluster. On the basis of this definition of distance between clusters, at each stage of the process we combine the two clusters that have the smallest average linkage distance.
4. Centroid Method: In the centroid method, the distance between two clusters is the distance between the two mean vectors of the clusters. At each stage of the process, we combine the two clusters that have the smallest centroid distance.
5. Ward’s Method: This method does not directly define a measure of distance between two points or clusters; it is an ANOVA-based approach. One-way univariate ANOVAs are done for each variable, with groups defined by the clusters at that stage of the process. At each stage, the two clusters whose merger produces the smallest increase in the combined error sum of squares are combined. (All five linkage choices are compared in the sketch after this list.)
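
The sketch below runs the same agglomerative procedure under each of the five between-cluster distances, using scipy.cluster.hierarchy.linkage; the random toy data is an illustrative assumption. (In SciPy, the centroid and ward methods require Euclidean distances on the raw observations, which is the default.)

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 2))

    # The same merge procedure under different between-cluster distances.
    for method in ["single", "complete", "average", "centroid", "ward"]:
        Z = linkage(X, method=method)
        # Each row of Z records one merge: (cluster i, cluster j, distance, new size).
        print(method, "first merge at distance", Z[0, 2])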

Measure of Association

Any clustering is meaningful only in the context of a given measure of association or a measure of distance. There are several definitions of the association or distance between two points (each is sketched in code after this list).
1. Euclidean Distance is the most common measure of distance.
d(X1, X2) = (sum((X1(i) - X2(i))^2))^(1/2)
2. Minkowski Distance is a generalization of the Euclidean distance to an arbitrary power n.
d(X1, X2) = (sum(abs(X1(i) - X2(i))^n))^(1/n)
3. Canberra Metric is a weighted version of the absolute (Manhattan) distance, in which each coordinate's absolute difference is scaled by the magnitudes of the coordinates.
d(X1, X2) = sum(abs(X1(i) - X2(i)) / (abs(X1(i)) + abs(X2(i))))
4. Czekanowski Coefficient also compares the two points coordinate by coordinate, but sums over the values before taking a ratio.
d(X1, X2) = 1 - 2*sum(min(X1(i), X2(i))) / sum(X1(i) + X2(i))
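
Here is a sketch of the four distances as NumPy functions, mirroring the formulas above; the Canberra metric assumes no coordinate pair has zero total magnitude, and the Czekanowski coefficient is written for nonnegative coordinates, as it is usually defined.

    import numpy as np

    def euclidean(x1, x2):
        return np.sqrt(np.sum((x1 - x2) ** 2))

    def minkowski(x1, x2, n):
        return np.sum(np.abs(x1 - x2) ** n) ** (1 / n)

    def canberra(x1, x2):
        # Assumes abs(x1[i]) + abs(x2[i]) is never zero.
        return np.sum(np.abs(x1 - x2) / (np.abs(x1) + np.abs(x2)))

    def czekanowski(x1, x2):
        # Assumes nonnegative coordinates.
        return 1 - 2 * np.sum(np.minimum(x1, x2)) / np.sum(x1 + x2)

    x1 = np.array([1.0, 2.0, 3.0])
    x2 = np.array([2.0, 4.0, 6.0])
    print(euclidean(x1, x2), minkowski(x1, x2, 3),
          canberra(x1, x2), czekanowski(x1, x2))
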
Or we can define our own. Any distance measure should satisfy the following properties (spot-checked in the sketch after this list):
1. Symmetry: The distance between points X1 and X2 should be the same as the distance between X2 and X1.
d(X1, X2) = d(X2, X1)
2. Positivity: The distance between two distinct points X1 and X2 should be greater than 0.
d(X1, X2) > 0 if X1 != X2
3. Identity: The distance of a point from itself should be 0.
d(X, X) = 0
4. Triangle Inequality: The usual triangle inequality should also hold for the new measure of distance.
d(X1, X2) + d(X2, X3) >= d(X1, X3)
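
These properties can be spot-checked numerically (a check on sample points, not a proof); the sketch below does so for the Euclidean distance on random vectors.

    import numpy as np

    def d(x1, x2):
        return np.sqrt(np.sum((x1 - x2) ** 2))

    rng = np.random.default_rng(0)
    x1, x2, x3 = rng.normal(size=(3, 4))

    assert np.isclose(d(x1, x2), d(x2, x1))       # symmetry
    assert d(x1, x2) > 0                          # positivity (x1 != x2 here)
    assert d(x1, x1) == 0                         # identity
    assert d(x1, x2) + d(x2, x3) >= d(x1, x3)     # triangle inequality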