Introduction to Unsupervised Learning & Clustering

Introduction

Unsupervised learning, as the name suggests, is about learning without supervision. What does it mean to learn without supervision? What can you learn without supervision? As we saw with regression, the supervision comes from a part of the input data set - the part required to ascertain the correctness of your hypothesis. The input data has the questions as well as the correct answers - the machine just needs to learn the mapping between them.
Unsupervised learning is quite different. Here, we do not have any answers. No questions either! Then what do we have? We have raw data, and we have no idea about its structure. There is nothing we can learn "from" it - "from" implies supervision. Unsupervised learning is just making sense of the data at hand. This is not just a theoretical fantasy; unsupervised learning has great practical value for analyzing data that we know nothing about.
When faced with such a situation - a huge chunk of unknown data - the natural tendency is to categorize it into different parts based on the parameters we already know. Then, we can identify the tendencies of each of these clusters to get some meaningful picture of the data. If the parameters you choose cover the data well, the conclusions you draw from the clusters will be reasonably accurate too.
Consider, for example, that you land in another country and you don't know anyone there. How would you go about it? Assuming you are lucky and they all know English, you would first look at the people around you. Looking at their eyes and the expressions on their faces, you would try to guess who is the right person to talk to. Then you would talk to people and, based on that, divide them into groups of similar people. Then, based on some more interaction with a few of them in each group, you would try to make more specific guesses about their details. All of this has to be done without any supervision, because there really is no one to supervise you.
Clustering is one of the most important problems solved using unsupervised learning. It is used to discover the inherent groupings in the input data - for example, grouping companies based on their revenues, or grouping countries based on their GDP. Since it is a form of unsupervised learning, there is no feedback or verification while learning. The grouping is purely based on the values associated with the input samples.
In theory, data points that are in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features. Such clustering is a common technique for statistical data analysis used in many fields. Below is an introduction to some clustering algorithms that are commonly used in data analysis and machine learning.

Popular clustering algorithms

The concept of clustering is quite old. The algorithms have evolved over time, to match the complexity of the underlying data.

K-Means

This is the most basic clustering algorithm. You start by choosing the number of clusters (K) that you would like to create. Based on this count, randomly initialise centre points for the K different clusters. Next, assign each point in the input data set to a group based on its distance from the K centre points: a point joins the cluster whose centre is nearest to it. Now that you have identified the clusters, recompute the centre points of these clusters. Normally these differ from the ones you started with. If you do this iteratively, the K centres move, over time, towards the real centres of the K clusters, and there is very little change between consecutive iterations.
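As a rough illustration of these steps, here is a minimal sketch using scikit-learn's KMeans; the toy data, the random seed, and the choice of K = 3 are assumptions made for the example, not values from the text.

```python
# Minimal K-Means sketch with scikit-learn (toy data, K = 3 assumed).
import numpy as np
from sklearn.cluster import KMeans

# Three loose blobs of 2-D points stand in for an unlabelled data set.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(100, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(100, 2)),
])

# Choose K up front; the library repeats the assign/recompute loop until
# the centres barely move between iterations.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)      # cluster index for every point
centres = kmeans.cluster_centers_   # final centre points
```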
K-Means is pretty fast: essentially we just have to compute the distances between points and group centres, which is very little work compared to some other algorithms. It thus has a complexity linear in the number of points, O(n). One major disadvantage of K-Means clustering is that we decide on an ad hoc number of groups/classes; ideally we would like the algorithm to identify this. Also, a random choice of the initial group centres may not always work well - it may yield different clustering results on different runs of the algorithm. Some other algorithms are more consistent.
K-Medians clustering is quite similar to K-Means. Here, instead of updating the group centres using the mean, we use the median vector of the group. This has the advantage that the outcome is far less disrupted by outliers, but it is much slower for larger datasets, as sorting is required on each iteration when computing the median vector.

Affinity Propagation

This is one of the most important of the different clustering algorithms. It is also more elaborate than the others. Affinity Propagation takes the similarity between pairs of data points as input and considers all data points as potential exemplars. It works by exchanging real-valued messages between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. This is not as complicated as it sounds. Let's look at it in detail. The algorithm requires us to provide two sets of data:
  • Similarities between data points, representing how well-suited a point is to be another one’s exemplar. If there’s no similarity between two points, as in they cannot belong to the same cluster, this similarity can be omitted or set to -Infinity depending on implementation.
  • Preferences, representing each data point’s suitability to be an exemplar. We may have some initial information which points could be favored for this role, and so we can represent it through preferences.
All this information is normally represented through a single similarity matrix, whose main diagonal holds the preferences. Another approach is to keep only a list of similarities between pairs of points in memory (the latter works better when the data is sparse). The 'exchanging of messages between points' is nothing more than matrix manipulation. The algorithm then runs through a number of iterations until it converges.
Each iteration has two steps:
  • Calculating responsibilities
  • Calculating availabilities.
Responsibility and Availability are two fundamental concepts of Affinity Propagation. To understand them, consider the scenario of a group of employees working with an employer. For the team to work, you have to ensure that the employees find the employer better than the other employers around. Additionally, you have to ensure that the employees get along well. The first is measured in terms of Responsibility, while the second is measured in terms of Availability.
In mathematical terms, Responsibility r(i, k) reflects the accumulated evidence of k's suitability to be an exemplar for i, taking into account the other possible exemplars for i. Availability a(i, k), on the other hand, is the accumulated evidence for how appropriate it would be for i to choose k as its exemplar, taking into account the other points that consider k as an exemplar. Essentially, Responsibility measures k's responsibility to take i under its hood, while Availability measures k's availability for the new point i. As the algorithm runs, the Responsibility message is passed from i to k and the Availability message is passed from k to i. That is, i tells k whether it thinks k is responsible for i, and k replies whether it feels it is available for i.
In order to calculate responsibilities, the algorithm uses the original similarities and the availabilities calculated in the previous iteration (initially, all availabilities are set to zero). Responsibilities are set to the input similarity between point i and point k as its exemplar, minus the largest of the similarity-plus-availability sums between point i and the other candidate exemplars. The logic behind calculating how suitable a point is to be an exemplar is that it is favored more if the initial a priori preference was higher, but the responsibility gets lower when there is a similar point that considers itself a good candidate, so there is a 'competition' between the two until one is decided in some iteration.
Calculating availabilities, then, uses calculated responsibilities as evidence whether each candidate would make a good exemplar. Availability a(i, k) is set to the self-responsibility r(k, k) plus the sum of the positive responsibilities that candidate exemplar k receives from other points.
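For reference, the two update steps described above are usually written as follows. This is the standard formulation due to Frey and Dueck, with s(i, k) denoting the input similarity; note that in this formulation the off-diagonal availabilities are additionally capped at zero.

```latex
% Affinity Propagation message updates (Frey & Dueck formulation).
% s: similarity, r: responsibility, a: availability.
\begin{align*}
r(i,k) &\leftarrow s(i,k) - \max_{k' \neq k} \bigl\{ a(i,k') + s(i,k') \bigr\} \\
a(i,k) &\leftarrow \min\Bigl( 0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max\bigl(0, r(i',k)\bigr) \Bigr), \qquad i \neq k \\
a(k,k) &\leftarrow \sum_{i' \neq k} \max\bigl(0, r(i',k)\bigr)
\end{align*}
```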
Finally, we can have different stopping criteria to terminate the procedure, such as when changes in values fall below some threshold, or when the maximum number of iterations is reached. At any point during the Affinity Propagation procedure, summing the Responsibility (r) and Availability (a) matrices gives us the clustering information we need: for point i, the k with maximum r(i, k) + a(i, k) represents point i's exemplar. Or, if we just need the set of exemplars, we can scan the main diagonal: if r(i, i) + a(i, i) > 0, point i is an exemplar.
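In practice one rarely implements the message passing by hand. Here is a minimal sketch using scikit-learn's AffinityPropagation; the toy data and the damping and preference values are arbitrary assumptions for illustration.

```python
# Affinity Propagation sketch with scikit-learn (toy data, arbitrary settings).
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.4, size=(50, 2)),
    rng.normal(loc=(4, 4), scale=0.4, size=(50, 2)),
])

# No cluster count is supplied. The preference (diagonal of the similarity
# matrix) steers how many exemplars emerge; damping stabilises the updates.
ap = AffinityPropagation(damping=0.9, preference=-50, random_state=0)
labels = ap.fit_predict(X)
exemplars = ap.cluster_centers_indices_   # indices of the chosen exemplars
```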

Mean Shift

Mean shift clustering uses a sliding window to find dense areas among the data points. It is a centroid-based algorithm. The goal is to locate the centre points of each group/class, which it does by updating candidates for centre points to be the mean of the points within the sliding window. These candidate windows are then filtered in a post-processing stage to eliminate near-duplicates, forming the final set of centre points and their corresponding groups. The assumption is that the population is dense at the centre of each cluster.
The mean-shift algorithm starts by selecting a random point in the data space. It then checks the data within a fixed radius around the point to identify the direction in which the density increases most, and shifts the point in that direction. This continues until it reaches a point where the density decreases on every side - this is the centre of the group. The process is repeated with many sliding windows until all points lie within a window. When multiple windows overlap, they are merged and the window containing the most points is preserved. The data points are then clustered according to the sliding window in which they reside.
This is different from the K-Means algorithm because we do not select the number of clusters; the mean shift algorithm identifies this for us, which adds a lot of value. Of course, it still leaves room for configuration (and hence room for doubt), because we have to select the radius of the sliding window, which in turn impacts the number of clusters identified.
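The sketch below shows this with scikit-learn's MeanShift; estimate_bandwidth stands in for hand-picking the window radius, and the toy data and quantile value are assumptions.

```python
# Mean-shift sketch with scikit-learn (toy data; bandwidth estimated, not tuned).
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.6, size=(150, 2)),
    rng.normal(loc=(5, 1), scale=0.6, size=(150, 2)),
])

# The bandwidth plays the role of the sliding-window radius and indirectly
# determines how many clusters are found.
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
labels = ms.fit_predict(X)
print("clusters found:", len(ms.cluster_centers_))
```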

Ward's Method

In statistics, Ward's method is a criterion applied in hierarchical cluster analysis. Ward's minimum variance method is a special case of the objective function approach originally presented by Joe H. Ward, Jr. Ward suggested a general agglomerative hierarchical clustering procedure, where the criterion for choosing the pair of clusters to merge at each step is based on the optimal value of an objective function. This objective function could be "any function that reflects the investigator's purpose." Many of the standard clustering procedures are contained in this very general class. To illustrate the procedure, Ward used the example where the objective function is the error sum of squares, and this example is known as Ward's method or more precisely Ward's minimum variance method.
The nearest-neighbor chain algorithm can be used to find the same clustering defined by Ward's method, in time proportional to the size of the input distance matrix and space linear in the number of points being clustered.
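A minimal way to try Ward's criterion is through scikit-learn's AgglomerativeClustering with linkage="ward"; the toy data and the choice of three clusters here are assumptions for illustration.

```python
# Ward's minimum variance criterion via agglomerative clustering (toy data).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(60, 2)),
    rng.normal(loc=(3, 3), scale=0.5, size=(60, 2)),
    rng.normal(loc=(0, 4), scale=0.5, size=(60, 2)),
])

# linkage="ward" merges, at each step, the pair of clusters whose merge
# causes the smallest increase in total within-cluster variance.
ward = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = ward.fit_predict(X)
```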

Agglomerative Hierarchical Clustering

There are two types of hierarchical clustering algorithms: top-down and bottom-up. Bottom-up algorithms start with each data point as an independent single cluster and then progressively merge (or agglomerate) pairs of clusters until everything has been merged into a single cluster that contains all data points. Hence this approach is also called agglomerative clustering.
If there are X data points in our dataset, then we start with X clusters. We then select a distance metric that measures the distance between two clusters; various measures can be used. To keep things simple, we can assume average linkage - the average distance between data points in the first cluster and data points in the second cluster. On each iteration we combine the two clusters with the minimum distance into one. This is repeated until we have only one cluster, which contains all data points. The result is a cluster tree whose levels correspond to the distances at which clusters were merged.
We can select the clusters simply by choosing when to stop. Again, we do not require a predefined number of clusters, and we can even decide which number of clusters looks best, since we are building a tree. The algorithm might seem to be dictated by the choice of distance metric, but in practice any sensible distance measure usually gives similar results. This is a big advantage over other algorithms where the choice of distance metric is very important. Hierarchical clustering also helps us identify sub-categories. But it comes at a very high processing cost: it has a time complexity of O(n³), unlike the linear complexity of K-Means and GMM.
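Here is a small sketch of this bottom-up procedure using SciPy, with average linkage as in the description above; the toy data and the cut distance of 2.0 are assumptions.

```python
# Bottom-up hierarchical clustering with SciPy (average linkage, toy data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.4, size=(40, 2)),
    rng.normal(loc=(4, 4), scale=0.4, size=(40, 2)),
])

# Build the full merge tree: every point starts as its own cluster and the
# two clusters with the smallest average inter-point distance are merged
# repeatedly until a single cluster remains.
Z = linkage(X, method="average")

# "Choosing when to stop": cut the tree at a distance threshold to obtain
# flat cluster labels.
labels = fcluster(Z, t=2.0, criterion="distance")
```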

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) adds a couple of important advantages over the mean-shift algorithm. Similar to mean-shift, it starts from an arbitrary new point in the data space and checks the points in the neighborhood defined by epsilon ε (similar to the window defined by r in mean-shift). It counts the number of points in this neighborhood and proceeds only if that number exceeds a threshold identified by minPoints. Otherwise the point is labeled as noise (this noisy point might still become part of a cluster later). In both cases the point is marked as "visited".
If it is not noise, this first point defines a new cluster, and the points within its ε-distance neighborhood also become part of the same cluster. This procedure of expanding the cluster to all points in the ε neighborhood is then repeated until all points in the cluster are determined, i.e. all points within the ε neighborhood of the cluster have been visited and labelled. After that, a new unvisited point is retrieved and processed, discovering another cluster or noise. This process repeats until all points are marked as visited.
Like mean-shift, DBSCAN does not start with a defined number of clusters. It also identifies outliers as noise, unlike mean-shift, which goes out of its way to pull them into a cluster, and it is able to find clusters of arbitrary shapes and sizes. DBSCAN drops the need for a configured cluster count, but it requires two new configurations - ε and minPoints. This can get tricky when the density varies from cluster to cluster, since a single choice of ε and minPoints may not suit them all. The problem is amplified with very high-dimensional data, where it is much more difficult to estimate ε.
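The following sketch uses scikit-learn's DBSCAN, where eps and min_samples correspond to the ε and minPoints settings discussed above; the toy data and the particular values are assumptions.

```python
# DBSCAN sketch with scikit-learn (toy data; eps and min_samples assumed).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(100, 2)),
    rng.normal(loc=(3, 3), scale=0.3, size=(100, 2)),
    rng.uniform(low=-2, high=5, size=(20, 2)),   # scattered points, likely noise
])

db = DBSCAN(eps=0.4, min_samples=5)
labels = db.fit_predict(X)                        # label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```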

Birch

BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data-sets. An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce the best quality clustering for a given set of resources (memory and time constraints). In most cases, BIRCH only requires a single scan of the database.
Its inventors claim BIRCH to be the "first clustering algorithm proposed in the database area to handle 'noise' (data points that are not part of the underlying pattern) effectively", beating DBSCAN by two months.
It is local in that each clustering decision is made without scanning all data points and currently existing clusters. It exploits the observation that data space is not usually uniformly occupied and not every data point is equally important. It makes full use of available memory to derive the finest possible sub-clusters while minimising I/O costs. It is also an incremental method that does not require the whole data set in advance.
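To illustrate the incremental, single-scan nature of the algorithm, here is a minimal sketch with scikit-learn's Birch, feeding the data in chunks via partial_fit; the chunked toy data, the threshold, and the final cluster count are assumptions.

```python
# BIRCH sketch with scikit-learn: data arrives in chunks (toy data, settings assumed).
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
chunks = [rng.normal(loc=(4 * i, 0), scale=0.5, size=(100, 2)) for i in range(3)]

birch = Birch(threshold=0.5, n_clusters=3)
for chunk in chunks:
    birch.partial_fit(chunk)       # absorb each chunk incrementally into the CF tree

labels = birch.predict(np.vstack(chunks))
```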

EM Clustering with GMM

Expectation-Maximization (EM) clustering uses Gaussian Mixture Models. One of the major drawbacks of K-Means is that using the mean value as the centre of a cluster is not rich enough to describe the cluster. For example, if the data consists of two elongated, parallel line-shaped groups, K-Means will get confused in identifying the clusters simply because it limits itself to the distance from the centre. Essentially, K-Means fails in scenarios where the clusters do not fit into distinct circles.
Gaussian Mixture Models (GMMs) are much more flexible than K-Means. GMMs assume that the data points are Gaussian distributed, which is less restrictive than assuming they form circular blobs around a mean. A GMM describes the shape of each cluster with two parameters: the mean and the standard deviation. Taking an example in two dimensions, this means that the clusters can take any kind of elliptical shape (since we have a standard deviation in both the x and y directions). Each Gaussian distribution is assigned to a single cluster.
This is how the algorithm works:
Start by choosing the number of clusters (similar to K-Means) and randomly initializing the Gaussian distribution parameters for each cluster. One can try to provide a good guesstimate for the initial parameters by taking a quick look at the data, but that is not really necessary, as the Gaussians get optimized rather fast.
With these clusters, identify the probability of each data point belonging to a particular cluster. The closer a point is to the Gaussian's centre, the more likely it belongs to that cluster. This should make intuitive sense, since with a Gaussian distribution we are assuming that most of the data lies closer to the centre of the cluster. Use these probabilities to compute a new set of parameters for the Gaussian distributions such that we maximize the probabilities of data points within the clusters. We compute these new parameters using a weighted sum of the data point positions, where the weights are the probabilities of the data point belonging to that particular cluster. Repeat this iteratively until it converges - that is, until the distributions don't change much from iteration to iteration.
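A compact sketch of this EM loop, using scikit-learn's GaussianMixture, is shown below; the elongated toy data and the choice of two components are assumptions.

```python
# EM clustering with a Gaussian Mixture Model (toy data, 2 components assumed).
import numpy as np
from sklearn.mixture import GaussianMixture

# Two elongated, parallel blobs: a shape K-Means handles poorly.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 0.0], [0.0, 0.2]], size=200),
    rng.multivariate_normal(mean=[0, 3], cov=[[2.0, 0.0], [0.0, 0.2]], size=200),
])

# covariance_type="full" lets each cluster take its own elliptical shape.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)                       # EM alternates the probability (E) and update (M) steps
labels = gmm.predict(X)          # hard assignment to the most likely cluster
probs = gmm.predict_proba(X)     # soft memberships: overlapping clusters allowed
```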
GMMs are a lot more flexible in terms of cluster covariance than K-Means - due to the standard deviation parameter. The clusters can take on any ellipse shape, rather than being restricted to circles. K-Means is actually a special case of GMM in which each cluster’s covariance along all dimensions approaches 0.
Since GMMs use probabilities, we can have overlapping clusters. If a data point is in the middle of two overlapping clusters, we can define its membership in percentage.

Reference

For a detailed understanding of the subject, you can check out this video on YouTube.
