# [Blog Post 3] Clustering with Nearest Neighbors applied to Semi-Automatic Transcription of handwritten texts.

In the preceding blog post, we introduced a notion of distance between two arbitrary shapes. Closely related to this notion, was the notion of similarity between to shapes, that allowed us to naturally represent the shapes on a undirected weighted  graph (weights being proportional to the distance between two nodes/shapes). If clusters were easily identifiable for a small number of words, the task becomes more complex and requires more thought as the number of words grows. In this blog post, we investigate an automatic clustering analysis using  the $k-$nearest neighbors approach.

##### A small reminder

Let us first recall the definition of the distance and similarity functions. The distance we introduced is a linear combination of the three distances $D_{mc},D_{bend},D_{ac}$:

$D=0.2\times \mbox{log}(D_{mc})+0.9\times \mbox{log}(D_{bend})+0.3\times \mbox{log}(D_{ac}),$

where the coefficients were chose by performing a principal components analysis (we will see that this is not optimal). From this, we derived a similarity function between two shapes $\mathcal{P}$ and $\mathcal{Q}$:

$S(\mathcal{P},\mathcal{Q})=\mbox{exp}(-\mathcal{D}(\mathcal{P},\mathcal{Q})^2).$

Notice the use of the exponential function in the similarity function. The idea behind this was to enforce locality  in the graph in order to obtain well-separated clusters of words: when two words are very close to each other (in terms of the distance $D$), then the edge linking the two in the graph will have a weight close to one, while when the two shapes are different, this weight will quickly tend to zero. Unfortunately, we will see that this choice of similarity function isn’t good enough with a large number of nodes in the graph.

##### Optimization of the distance and similarity functions

We launched the simulation designed in the second blog post on a bigger database of 42 handwritten words, with 11 semantically different words. Using the previous definitions, we computed similarities between each pair of words, and formed the following fully connected graph (called similarity graph):

Despite the use of the exponential function in the similarity function, we observe a very dense cloud of words in the center of the graph, that will certainly lead to a tremendous cluster in the center of a variety of different words and smaller clusters gravitating around it. Moreover, we observe an unexpected drawback of the use of the exponential function: the purpose of enforcing locality was to separate groups of words, but never to create singleton words in periphery of the graph ! These singleton will be very difficult to group in a cluster with other words, which will lead to  clusters with a single element inside it ! Our expectations are confirmed by Figure 2, which present the result of the clustering procedure (described later) realized on the fully connected graph. Therefore, we need to rethink totally our approach: instead of enforcing locality with the exponential in the construction of the similarity measure, we will find the coefficients in $D$ that separates the better the data. For this, we consider a training set of 154 words. In this training set, we know which words are semantically the same and which words are semantically different. Then, we compute the three distances  $D_{mc},D_{bend},D_{ac}$ between each pair of words. For each pair, we obtain a point of $\mathbb{R}^3$ which coordinates correspond to  the respective values of the three distances (actually the log of the distances so they are all express “in the same unit”). Additionally, we color this point in red if the two words compared are semantically the same and in blue if the two words compared are semantically different. We obtain the following plot:

The idea is now to find the plane that separates the best the two groups. Then, by projecting onto the vector normal to this plane, we obtain a unique distance that separates the best the two groups. This can be done using Fischer discrimination, and we obtain the following distance:

$D=2.68\times \mbox{log}(D_{mc})+0.68\times \mbox{log}(D_{bend})+0.89\times \mbox{log}(D_{ac}).$

With this new distance in hand, enforcing locality through the exponential function in the similarity function isn’t necessary anymore, because the data is already optimally separated. This is good news as it allows us to avoid the drawbacks discussed earlier. Therefore we propose the following similarity measure:

$S(\mathcal{P},\mathcal{Q})=\frac{1}{\mathcal{D}(\mathcal{P},\mathcal{Q})}.$

Note that this is well defined because we never compare strictly identical shapes so the distance will never be zero.

##### Clustering analysis with k-nearest neighbors

We investigate here a popular clustering approach based on similarity graphs  such the one we constructed: the $k-$nearest neighbors. The idea of this approach is to first reduce the number of edges in the similarity graph: two vertices $x_i$ and $x_j$ are connected if $x_j$ is among the $k$ nearest neighbors of $x_i$, with $k$ a parameter to choose.Then, the idea is to separate the graph in $n$ components (clusters), so that the the sum of the weights of the edges we have to cut to disconnect these components is minimized. For an unweighted graph and only two clusters, the goal is then to find two components so that the number of edges to cut to separate them is minimum (this problem is known as max-flow, min-cut). Additionally, we also add some constraints to avoid extremely large or small clusters.

It can be shown that this problem can be expressed as an eigenvalue problem (see [1]): the solution is provided by computing the eigenvectors of a certain matrix, called the graph laplacian.

If we apply this procedure to our similarity graph, with $k=4$ and $n=9$ we obtain the clusters shown on Figure 4. On this figure, each color corresponds to a cluster, and nodes have been colored according to their respective cluster. We additionally added some uncertainty measure on this figure: we’d like to know how certainly a node actually belong to a cluster or not. For this, we separated the nodes in three categories:

• Green: The nodes which only neighbors are members of their cluster: these nodes being only connected with their pairs, it is really unlikely that they have been misclassified.
• Pink: The nodes which have more neighbors in their cluster than outside of it: these nodes are strongly connected with their cluster, but still they could relate with some other nodes outside of it, so we might wonder if they have been correctly classified.
• Red: The nodes which have more neighbors outside their cluster than inside: these nodes are better connected with the rest of the graph than to the cluster they belong to. Therefore we have to be very careful in the trust we put in these nodes: it is very likely that they have been misclassified and that they should belong in reality to another cluster.

The results of the clustering analysis are quiet good: 86% (on average) of the words have been correctly grouped together. More precisely, 98% of the green nodes have been correctly classified,  against 87% for the pink nodes and only 35% for the red (so the uncertainty measure seems to be a good indicator).

##### Conclusion and remaining work

The results of the clustering analysis are quiet good, and the uncertainty measure seems to be a good  indicator. Therefore, we could use this approach to develop a semi-automatic transcription of ancient text. Let say we have digitized an old handwritten book. Then, to be able to extract the information out of it, we need to transcript it, so that a computer could recognize the words. Then, we could imagine the following scenario: a human is paid to transcript a digitized book. However, his work is helped by the clustering analysis developped earlier: as soon as he types the transcription of a word, this transcription automatically propagates to all the members of the cluster to which the word in question belongs. Of course there will be errors (because the procedure isn’t totally accurate), but we can add a color code based on the uncertainty measure introduced previously. Therefore, an uncertain transcription would be colored in red, to catch the attention of the human operator, who would then check the correctness of the transcription.

SIMEONI Matthieu

#### References

[1] Daniel Kressner, Computational Linear Algebra, Lecture notes, EPFL, Spring 2014.

[2] S. Belongie,J. malik, J. Puzicha, Shape Matching an Object Recognition using Shape Contexts, IEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, April 2002