As part of the Imprimatur project, we are interested in processing one million images of ornaments provided by the BCU in Lausanne. These ornaments were extracted from ancient books.
We aim to classify them into groups of elements sharing similarities, in order to distinguish the originals from the counterfeits.
Along the way, we also hope to use deep learning to avoid manually listing all the features we would want to study.
The project is divided into two parts.
The first part consists of building clusters of similar images without prior knowledge of the total number of clusters.
The second part aims to differentiate the originals from the counterfeits within the clusters created in the first part.
This blog post will focus on the first part; some guidelines for the second part will also be given.
Let’s focus on the first learning process.
We want original ornaments and their counterfeits to belong to the same cluster.
In this example, we want to distinguish two clusters. Our main issue is that we have no prior knowledge of the total number of clusters, which is why we rely on unsupervised machine learning.
First Clustering Process
Throughout this project, we assume that similar ornaments have similar sizes (an image is characterized by its size of x pixels by y pixels). More importantly, we need to resize the images to a single common size to be able to cluster them. Since we want to avoid resizing very elongated images (high length-to-width ratio) together with nearly square ones (ratio close to 1), we first group the images into classes of similar sizes. To accomplish this, we use the freely available machine learning software MLDemos (http://mldemos.epfl.ch).
Prior to launching MLDemos, we have to translate the set of images into a .data file containing the width and the height of each image.
This step is done with Matlab: we loop over all the images, extract their sizes, and write this information to a .data file.
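The original script is in Matlab, but the same step can be sketched in Python. The sizes would come from opening each image (for instance Pillow's `Image.open(path).size`); the function and file names below are illustrative, not from the project code:

```python
def write_size_data(sizes, out_path):
    """Write one 'width height' line per image to a .data file for MLDemos.

    `sizes` is a list of (width, height) tuples, e.g. gathered by looping
    over the images with Pillow: Image.open(path).size (hypothetical step).
    """
    with open(out_path, "w") as f:
        for w, h in sizes:
            f.write(f"{w} {h}\n")

# Example with three images of very different shapes
sizes = [(20, 600), (20, 20), (600, 600)]
write_size_data(sizes, "sizes.data")
```

The resulting plain-text file can then be loaded directly into MLDemos as a two-dimensional dataset.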
We can now start clustering with MLDemos.
In this example of clustering, we notice that images with a constant height but with widths varying from 0 to 600 pixels belong to the same group. The whole point of the clustering was to avoid comparing very elongated images (20×600) with square images (20×20 or 600×600). This is why we consider this first clustering unsatisfactory.
By increasing the number of clusters or changing the method, we can obtain results closer to our expectations, but the problem remains the same. It seems intuitive that more than 8,000 images of heterogeneous sizes cannot be divided into only 6 groups.
After increasing the number of clusters to 50, we come closer to what we expect to obtain. We still need to find the best clustering method for the sizes, one that yields a finer mesh in areas of high density and more tolerant classes in areas of low density.
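One way to keep elongated and square images apart is to cluster on the logarithm of the aspect ratio instead of on raw widths and heights. The sketch below uses a hand-rolled 1-D k-means on synthetic sizes; it only illustrates the idea and is not the MLDemos configuration used in the project:

```python
import numpy as np

def kmeans_1d(values, k, iters=20):
    """Minimal 1-D k-means; centers are initialised spread across the range."""
    centers = np.linspace(values.min(), values.max(), k)
    labels = np.zeros(len(values), dtype=int)
    for _ in range(iters):
        # Assign each value to its nearest center
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        # Recompute centers; keep the old center if a cluster empties
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return labels, centers

# Synthetic sizes: 50 elongated strips (20 px wide) and 50 square images
widths  = np.concatenate([np.full(50, 20.0), np.linspace(200, 600, 50)])
heights = np.concatenate([np.linspace(200, 600, 50), np.linspace(200, 600, 50)])

# log(aspect ratio) is 0 for squares and strongly negative for strips
log_aspect = np.log(widths / heights)

labels, centers = kmeans_1d(log_aspect, k=2)
# Strips and squares end up in different clusters
```

A density-aware method (e.g. more clusters where sizes are dense) would refine this further, as discussed above.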
Second Clustering Process
After having obtained a satisfactory classification by sizes, we can focus on individual clusters.
This part is a sketch of what we are planning to do for the coming weeks.
During this second clusterization process, we assume we can resize images without much distortion.
After resizing the images, we will use principal component analysis (PCA).
The main idea is to find the largest eigenvalues of the correlation matrix and project the data onto the subspace spanned by the corresponding eigenvectors.
This lets us focus on the main features (the eigenvectors).
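This step can be sketched as follows, with a toy data matrix standing in for the resized images (the helper name and the synthetic data are illustrative):

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the top principal components.

    Follows the idea in the text: eigendecompose the covariance matrix of
    the (centered) data and keep the eigenvectors with the largest
    eigenvalues.
    """
    Xc = X - X.mean(axis=0)                    # center the data
    C = np.cov(Xc, rowvar=False)               # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)       # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]   # largest-eigenvalue vectors
    return Xc @ top

# Toy "images": 100 samples of 5 pixel values, varying mostly along one axis
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1)) * np.array([[3.0, 2.0, 1.0, 0.5, 0.1]])
X = base + rng.normal(scale=0.05, size=(100, 5))

Z = pca_project(X, n_components=2)   # shape (100, 2)
```

Because the dominant variation lies along one direction, the first projected component carries far more variance than the second, which is exactly the dimensionality reduction we are after.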
Finally, we can cluster again using MLDemos. We might use a measure of distance based on the CK1 distance. This robust measure would allow us to handle rotated or distorted images.
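CK1 is a compression-based dissimilarity (it relies on video compression of image pairs). As a rough, simplified stand-in for that idea, here is a normalized compression distance using zlib; the byte strings play the role of image data, and this is not the actual CK1 measure:

```python
import zlib

def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance: a generic compression-based
    dissimilarity in the same spirit as CK1. Two inputs that share
    structure compress well together, giving a smaller distance."""
    ca = len(zlib.compress(a))
    cb = len(zlib.compress(b))
    cab = len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)

x = b"ornament-pattern-" * 50   # a repetitive "image"
y = b"ornament-pattern-" * 50   # a near-identical "image"
z = bytes(range(256)) * 4       # unrelated byte content

# Similar inputs compress well together, so ncd(x, y) < ncd(x, z)
```

The appeal of such measures for our use case is that they need no hand-picked features, which fits the goal stated at the start of the post.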
Final Clustering Process
In this part, we will look into the clusters created by the previous process. We expect to obtain images that are hard to distinguish from one another (original ornaments and their counterfeits).