The purpose of this project is to implement a classifier of ornaments in Matlab aiming at distinguishing if two ornaments are the same, if one is a counterfeit of the other, or if the two are completely different. In collaboration with the archivist of the precious documents department at Bibliothèque cantonale et universitaire Silvio Corsini, we have access to an important dataset containing approximately 1 million images of ornaments. They are extracted from around 30,000 books dating from the 17th and 18th century.
We actually have been working with a dataset containing 8839 images and our aim has been to classify images by similarity.
Throughout the semester we have tried several features and different softwares to classify our ornaments.
We will list what we have implemented during this semester.
Our first intuition was to firstly cluterize ornaments by similar sizes. The purpose was to drastically reduce the computation time, since the assumption that similar ornaments have similar sizes seems reasonable.
To do so, and after extractin the relevant information from the files, we used the freely available machine learning software MLDemos (http://mldemos.epfl.ch).
After several trials, we built 50 clusters and the 2D representation using the width and height as coordinates to represent each images is the following.
We then wanted to resize images per cluster by taking for example the mean size inside of each cluster, and then use MLDemos to re-clusterize into each of the 50 clusters by using the CK1 distance as a feature. (link to paper)
In this part we proceeded to hierarchical clustering.
During the first step, each image is in a cluster on its own.
At each iteration, the two closest clusters are merged. We influence the process by choosing a distance function as well as a criterion to determine the distance between two clusters.
In the above matrix illustration, one can see the CK1 distances used to compute the hierarchical clustering graph displayed below. Indeed, one of the 50 clusters has 8 different images, labelled 1 to 8 in the below graph.
As for the determining the distance between two clusters, we chose the maximum distance between each element of two given clusters. This is known as a complete-linkage clustering.
However, the results we obtained after clusterization of the 50 clusters using the CK1 distance were disappointing relative to the time spent understanding and implementing this.
Afterwards, we decided to let the MLDemos clusters aside and rethink our project.
We tried another measure of the statistical similarity between two images: structural similarity (SSIM) defined below (from here).
We implemented this SSIM measure and used it on the data set, but once again, the results obtained were not satisfying.
Building feature vectors for each image
Each image is represented as a vector v corresponding to its pixel distributions along x- and y-axis.
In order to have all our images represented as vectors of same dimensions, we set the vectors’ size n to:
n= xmax + ymax
where xmax is the biggest width among the set of images and ymax the biggest height among the set of images.
By doing so, we ensure our images to be represented into the same space.
We built our feature vectors by doing the following:
Two parts: v1 and v2; v1 of size xmax (biggest width among images) and v2 of size ymax (biggest height).
We fill v1 with the histogram distribution values: histx on the x-axis and complete the resulting parts of v1 with zeros; we do the same for v2: the first components of v2 are the histogram distribution values: histy on the y-axis and the rest are zeros.
Our vectors form a matrix to which we will apply the linkage Matlab function.
Linkage works as follow:
- At the beginning, each image belongs to its own cluster.
- Linkage groups two images that have the shortest Euclidian distance among all the possible combinations of Euclidian distances of two images.
- Tuned parameter: complete: linkage takes the biggest Euclidian distance to compare a cluster to an image.
If we now on focus on one threshold from which images are considered similar, the above image is an illustration of what we obtain with a threshold of Euclidian distance=500. Each color corresponds to one sub-cluster. We suspect the ideal threshold would be lower, but this would require more experimentation by actually opening the images one by one.
This classification is successful for some examples.
Here is a part we zoomed on to get a subdendrogram that we could study more precisely.
In this tree we can see the file numbers on the x-axis, and the distance between the clusters on the y-axis.
You can see the result on the following illustrated tree (click for full size image)
This approach seems to work better on mid-sized to big-sized images. On small ornaments, the differences between the histograms are much smaller and each coordinate has less freedom, so all the distances are smaller when considering small images.
To see if this feature associated with a euclidian distance is relevant, we looked into its discriminating power.
Distance Feature Check Curves
On the first graph below, we compare an image (e.g. image number 1000) to all the other 8838 using the Euclidian distance feature. We want to check if the feature is discriminating enough; this might be characterized by a fast increase in the curve starting where the images are considered as very different by the Euclidian distance feature.
On the second graph, we see a zoomed version of the Euclidian distance feature curve (leftmost part of the first graph). The fast increase in the curve is distinguished. Note that the image numbers have be reordered to have a monotically increasing distance function.
On average, among the smallest distances (leftmost part of the graph), the greatest increase in distance from one image to the next relative to the same reference image is 100. This corresponds to a brutal increase of 15 to 20% after a few images only.
This was tested by picking a few hundreds of images and computing their relative distance to all other images, sorting the distances in increasing order, then picking the 500 first values for each reference image.
Conclusion & Further developments
We could push the project further by distinguishing originals from counterfeits; this seems like a challenging task since to do so we need an expert eye. We feel we also need to get more into the discriminating power of the feature we used, even though it seems pretty reliable on the basis of our experiment.
This approach could already be used on the whole database of circa 1 million images to partition the data into many smaller sets that contain relatively similar ornaments, and that are very distinct from each other. This would cut further processing time by a heavy factor.