Affinity Propagation with Scikit-Learn


What is Affinity Propagation?


Affinity Propagation is one of the more elaborate clustering algorithms; unlike k-means, for example, it does not require the number of clusters to be specified up front. It takes the similarity between pairs of data points as input and considers all data points as potential exemplars. It works by exchanging real-valued messages between data points until a high-quality set of exemplars and the corresponding clusters gradually emerge. This is not as complicated as it sounds; let's look at it in detail. The algorithm requires us to provide two sets of data:
  • Similarities between data points, representing how well-suited a point is to be another point's exemplar. If two points cannot belong to the same cluster, their similarity can be omitted or set to -Infinity, depending on the implementation.
  • Preferences, representing each data point's suitability to be an exemplar. We may have a priori information about which points should be favored for this role, and we can represent it through the preferences.
All this information is normally represented through a single matrix whose main diagonal holds the preferences. An alternative is to keep only the list of similarities between pairs of points in memory; the latter works better when the data is sparse. The 'exchanging of messages between points' is nothing more than matrix manipulation. The algorithm then runs through a number of iterations until it converges.
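To make this input concrete, here is a minimal NumPy sketch of how such a matrix could be built. Following the original formulation by Frey and Dueck, it uses the negative squared Euclidean distance as the similarity; putting the median similarity on the diagonal as the preference is one common heuristic, not a requirement (the helper name build_similarity_matrix is ours, purely for illustration):

import numpy as np

def build_similarity_matrix(X):
    # Negative squared Euclidean distance between every pair of points
    diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
    S = -np.sum(diff ** 2, axis=-1)
    # Preference of every point: the median of the off-diagonal
    # similarities tends to give a moderate number of clusters;
    # larger preferences produce more exemplars
    off_diag = S[~np.eye(S.shape[0], dtype=bool)]
    np.fill_diagonal(S, np.median(off_diag))
    return S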
Each iteration has two steps:
  • Calculating responsibilities
  • Calculating availabilities
Responsibility and Availability are the two basic components of Affinity Propagation. To understand them, consider the scenario of a group of employees working with an employer. For the team to work, you have to ensure that the employees find the employer better than the other employers around. Additionally, you have to ensure that the employees get along well. The first is measured in terms of Responsibility, while the second is measured in terms of Availability.
In mathematical terms, the Responsibility r(i, k) reflects the accumulated evidence for how well-suited point k is to serve as the exemplar for point i, taking into account the other candidate exemplars for i. The Availability a(i, k), on the other hand, reflects the accumulated evidence for how appropriate it would be for point i to choose k as its exemplar, taking into account the support from the other points that consider k an exemplar. Essentially, Responsibility measures how strongly k should take i under its wing, while Availability measures how available k is for the new point i. As the clustering proceeds, the Responsibility message is passed from i to k and the Availability message is passed from k to i: i tells k how responsible it considers k to be for it, and k replies with how available it feels for i.
To calculate responsibilities, the algorithm uses the original similarities and the availabilities calculated in the previous iteration (initially, all availabilities are set to zero). The responsibility r(i, k) is set to the input similarity between point i and its candidate exemplar k, minus the largest sum of availability and similarity between point i and the other candidate exemplars. The logic behind calculating how suitable a point is as an exemplar is that it is favored more if its initial a priori preference was higher, but its responsibility drops when there is a similar point that considers itself a good candidate, so there is a 'competition' between the two until one wins in some iteration.
Calculating availabilities then uses the freshly calculated responsibilities as evidence of whether each candidate would make a good exemplar. The availability a(i, k) is set to the self-responsibility r(k, k) plus the sum of the positive responsibilities that candidate exemplar k receives from the other points.
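To make the two update steps concrete, here is a minimal NumPy sketch of a single message-passing iteration. It follows the standard update rules; the damping argument is explained later in this article, and scikit-learn's internal implementation additionally takes care of ties and numerical noise:

import numpy as np

def update_messages(S, R, A, damping=0.5):
    n = S.shape[0]

    # Responsibilities:
    # r(i, k) = s(i, k) - max over k' != k of (a(i, k') + s(i, k'))
    AS = A + S
    idx = np.argmax(AS, axis=1)              # best candidate for each row i
    first = AS[np.arange(n), idx]
    AS[np.arange(n), idx] = -np.inf
    second = np.max(AS, axis=1)              # second-best candidate
    R_new = S - first[:, np.newaxis]
    R_new[np.arange(n), idx] = S[np.arange(n), idx] - second

    # Availabilities:
    # a(i, k) = min(0, r(k, k) + sum over i' not in {i, k} of max(0, r(i', k)))
    Rp = np.maximum(R_new, 0)
    np.fill_diagonal(Rp, np.diag(R_new))     # keep self-responsibility unclipped
    col = Rp.sum(axis=0)
    A_new = np.minimum(0, col[np.newaxis, :] - Rp)
    # a(k, k) = sum over i' != k of max(0, r(i', k))
    np.fill_diagonal(A_new, col - np.diag(Rp))

    # Damping blends old and new messages to avoid oscillations
    R = damping * R + (1 - damping) * R_new
    A = damping * A + (1 - damping) * A_new
    return R, A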
Finally, we can use different stopping criteria to terminate the procedure, such as when the changes in values fall below some threshold or when the maximum number of iterations is reached. At any point during the Affinity Propagation procedure, summing the Responsibility (r) and Availability (a) matrices gives us the clustering information we need: for point i, the k with the maximum r(i, k) + a(i, k) identifies point i's exemplar. Or, if we just need the set of exemplars, we can scan the main diagonal: if r(i, i) + a(i, i) > 0, point i is an exemplar.
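Continuing the sketch above, once the messages have settled, both the exemplars and the cluster assignments can be read directly off R + A:

E = R + A
# Points with a positive value on the main diagonal are exemplars
exemplars = np.flatnonzero(np.diag(E) > 0)
# Each point i is assigned to the exemplar k maximizing r(i, k) + a(i, k)
labels = np.argmax(E[:, exemplars], axis=1)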

Python Implementation

Let us now check out how this can be implemented in Python using scikit-learn. To generate the training data, scikit-learn provides us the method make_blobs.
We can start by importing the required modules:
In [1]: from sklearn.datasets import make_blobs
In [2]: from sklearn.cluster import AffinityPropagation
Now, we can generate the required data. Note that since this is a clustering application, we do not really need the target output; we generate it only to verify later that our model produces good clusters.
In [3]: X, y = make_blobs(n_samples=500, centers=5, n_features=5, random_state=0)
We can have a look at the generated data:
In [4]: X.shape
Out[4]: (500, 5)
In [5]: y.shape
Out[5]: (500,)

In [6]: X[:2,]
Out[6]: array([[ 0.7927672 ,  4.69609538,  2.29690233,  1.00159957, -2.3564752 ],
       [ 5.60928139, -0.0376682 , -0.68774591,  9.60509046, -8.52337837]])

In [7]: y[:2,]
Out[7]: array([0, 2])
Now, we create an instance of AffinityPropagation and fit the data:
In [8]: ap = AffinityPropagation()
In [9]: ap.fit(X)
Out[9]: AffinityPropagation(affinity='euclidean', convergence_iter=15, copy=True,
          damping=0.5, max_iter=200, preference=None, verbose=False)
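Before comparing against the original labels, we can peek at what the model found. The fitted estimator exposes cluster_centers_indices_ (the indices of the chosen exemplars) and labels_ (the cluster assigned to each training point):

# Indices of the exemplar points chosen by Affinity Propagation
print(ap.cluster_centers_indices_)
# Number of clusters found; with five well-separated blobs we expect 5
print(len(ap.cluster_centers_indices_))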
Now we need to check how good this clustering is. The values in y are the cluster labels assigned by make_blobs. They may not match the labels generated by Affinity Propagation exactly; for example, cluster 1 in the original data could be cluster 2 here. That is not important. What matters is that the groupings are similar: if two elements were in the same cluster in the first labeling, they should be in the same cluster in the second labeling as well. To verify this, we can write a small script:
y1 = ap.predict(X)
mismatch = 0
n = y1.shape[0]
for i in range(n):
    for j in range(i + 1, n):
        # A pair counts as a mismatch if the two labelings disagree on
        # whether the points belong to the same cluster
        if (y[i] == y[j]) != (y1[i] == y1[j]):
            mismatch = mismatch + 1

print(mismatch)
The output is 0! There is not a single mismatch: Affinity Propagation is very effective. But it is also very expensive; everything comes at a cost!
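Instead of a hand-written pairwise check, scikit-learn also ships a standard metric for exactly this purpose: the adjusted Rand index compares two labelings while ignoring the arbitrary label IDs.

from sklearn.metrics import adjusted_rand_score

# 1.0 means identical groupings; a random labeling scores around 0.0
print(adjusted_rand_score(y, y1))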
There are several hyperparameters that need to be tuned to make Affinity Propagation effective.

Damping

Each new iteration of the Responsibility and Availability messages leads to a recalculation of the clusters. Because of this, it is possible for the system to fall into an oscillation and never converge. To avoid this problem, the algorithm introduces damping: each message is a weighted average of its previous value and the newly computed value, which forces some inertia on the changes.
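In scikit-learn, damping is exposed as a constructor argument and must lie in the range [0.5, 1.0); higher values mean more inertia, so updates are steadier but convergence takes longer. For example:

# A heavily damped run; raising max_iter gives the slower updates
# room to converge
ap = AffinityPropagation(damping=0.9, max_iter=500)
ap.fit(X)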

Affinity

The measure of affinity is perhaps the most important component in this clustering exercise. By default, scikit-learn uses the Euclidean affinity (the negative squared Euclidean distance between points). But it is possible to plug in any other measure of affinity, so long as it is consistent.
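scikit-learn's AffinityPropagation accepts affinity='euclidean' (the default) or affinity='precomputed'. With 'precomputed', we build our own similarity matrix and pass it to fit directly; here, as an illustrative alternative, cosine similarity:

from sklearn.metrics.pairwise import cosine_similarity

# Build a cosine similarity matrix and feed it in as a precomputed affinity
S = cosine_similarity(X)
ap = AffinityPropagation(affinity='precomputed')
labels = ap.fit_predict(S)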