In the preceding blog post, we introduced a notion of **distance** between two arbitrary shapes. Closely related to this notion, was the notion of **similarity** between to shapes, that allowed us to naturally represent the shapes on a undirected weighted graph (weights being proportional to the distance between two nodes/shapes). If clusters were easily identifiable for a small number of words, the task becomes more complex and requires more thought as the number of words grows. In this blog post, we investigate an automatic clustering analysis using the nearest neighbors approach.

##### A small reminder

Let us first recall the definition of the **distance** and **similarity** functions. The distance we introduced is a linear combination of the three distances :

where the coefficients were chose by performing a principal components analysis (we will see that *this is not optimal*). From this, we derived a **similarity function** between two shapes and :

Notice the use of the *exponential function* in the similarity function. The idea behind this was to enforce *locality * in the graph in order to obtain *well-separated* clusters of words: when two words are very close to each other (in terms of the distance ), then the edge linking the two in the graph will have a weight close to one, while when the two shapes are different, this weight will quickly tend to zero. Unfortunately, we will see that this choice of similarity function isn’t good enough with a large number of nodes in the graph.

##### Optimization of the distance and similarity functions

We launched the simulation designed in the second blog post on a bigger database of *42 handwritten words, with 11 semantically different words*. Using the previous definitions, we computed similarities between each pair of words, and formed the following fully connected graph (called **similarity graph**):

Despite the use of the exponential function in the similarity function, we observe a very dense cloud of words in the center of the graph, that will certainly lead to a tremendous cluster in the center of a variety of different words and smaller clusters gravitating around it. Moreover, we observe an unexpected drawback of the use of the exponential function: the purpose of enforcing locality was to separate *groups *of* *words, but never to create *singleton* words in periphery of the graph ! These singleton will be very difficult to group in a cluster with other words, which will lead to clusters with a single element inside it ! Our expectations are confirmed by Figure 2, which present the result of the clustering procedure (described later) realized on the fully connected graph. Therefore, we need to rethink totally our approach: instead of enforcing locality with the exponential in the construction of the similarity measure, we will find the coefficients in that *separates the better the data*. For this, we consider a *training set of 154 words*. In this training set, we know which words are *semantically* the **same** and which words are *semantically* **different**. Then, we compute the three distances between each pair of words. For each pair, we obtain a point of which coordinates correspond to the respective values of the three distances (actually the log of the distances so they are all express “in the same unit”). Additionally, we color this point in **red** if the two words compared are semantically the same and in **blue** if the two words compared are semantically different. We obtain the following plot:

The idea is now to find the **plane that separates the best the two groups**. Then, by projecting onto the vector normal to this plane, we obtain a unique distance that separates the best the two groups. This can be done using **Fischer discrimination**, and we obtain the following distance:

With this new distance in hand, enforcing locality through the exponential function in the similarity function isn’t necessary anymore, because the data is* already optimally separated*. This is good news as it allows us to avoid the drawbacks discussed earlier. Therefore we propose the following similarity measure:

Note that this is well defined because we never compare strictly identical shapes so the distance will never be zero.

##### Clustering analysis with k-nearest neighbors

We investigate here a popular clustering approach based on **similarity graphs** such the one we constructed: **the nearest neighbors**. The idea of this approach is to first reduce the number of edges in the similarity graph: two vertices and are connected if is among the nearest neighbors of , with a parameter to choose.Then, *the idea is to separate the graph in components (clusters), so that the the sum of the weights of the edges we have to cut to disconnect these components is minimized*. For an unweighted graph and only two clusters, the goal is then to find two components so that the number of edges to cut to separate them is minimum (this problem is known as *max-flow, min-cut*). Additionally, we also add some constraints to avoid extremely large or small clusters.

It can be shown that this problem can be expressed as an eigenvalue problem (see **[1]**): the solution is provided by computing the eigenvectors of a certain matrix, called the **graph laplacian.**

If we apply this procedure to our similarity graph, with and we obtain the clusters shown on Figure 4. On this figure, each color corresponds to a cluster, and nodes have been colored according to their respective cluster. We additionally added some **uncertainty measure** on this figure: we’d like to know how certainly a node actually belong to a cluster or not. For this, we separated the nodes in three categories:

**Green:**The nodes which only neighbors are members of their cluster: these nodes being only connected with their pairs, it is really unlikely that they have been misclassified.**Pink:**The nodes which have more neighbors in their cluster than outside of it: these nodes are strongly connected with their cluster, but still they could relate with some other nodes outside of it, so we might wonder if they have been correctly classified.**Red:**The nodes which have more neighbors outside their cluster than inside: these nodes are better connected with the rest of the graph than to the cluster they belong to. Therefore we have to be very careful in the trust we put in these nodes: it is very likely that they have been misclassified and that they should belong in reality to another cluster.

The results of the clustering analysis are quiet good: **86%** (on average) of the words have been correctly grouped together. More precisely, **98%** of the green nodes have been correctly classified, against **87%** for the pink nodes and only **35%** for the red (so the uncertainty measure seems to be a good indicator).

##### Conclusion and remaining work

The results of the clustering analysis are quiet good, and the uncertainty measure seems to be a good indicator. Therefore, we could use this approach to develop a semi-automatic transcription of ancient text. Let say we have digitized an old handwritten book. Then, to be able to extract the information out of it, we need to transcript it, so that a computer could recognize the words. Then, we could imagine the following scenario: a human is paid to transcript a digitized book. However, his work is helped by the clustering analysis developped earlier: as soon as he types the transcription of a word, this transcription automatically propagates to all the members of the cluster to which the word in question belongs. Of course there will be errors (because the procedure isn’t totally accurate), but we can add a color code based on the uncertainty measure introduced previously. Therefore, an uncertain transcription would be colored in red, to catch the attention of the human operator, who would then check the correctness of the transcription.

**SIMEONI Matthieu**

**References**

[1] Daniel Kressner, *Computational Linear Algebra*, Lecture notes, EPFL, Spring 2014.

[2] S. Belongie,J. malik, J. Puzicha, *Shape Matching an Object Recognition using Shape Contexts*, IEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, April 2002