The purpose of this project is to implement a ternary classifier of ornaments in C++ aiming at distinguishing if two ornaments are the same, counterfeit or different. In collaboration with the archivist of the precious documents department at Bibliothèque cantonale et universitaire Silvio Corsini, we have access to an important dataset containing approximately 1 million images of ornaments. They are extracted from around 30,000 books dating from the 17th and 18th century.
Important thing to know
The Hand-Press period started around 1454 (with Gutenberg) and ended in the first half of the 19th (when mechanized presses started to appear).
Hand-pressing used carved blocks of wood to print the ornaments. Each publisher had its own set of carved ornaments, such that they could be used to distinguish hand-pressed documents by publisher.
Usually, ornaments are found at the beginning of a paragraph, such as the one below.
To be able to classify different ornaments, we need to establish a set of features that allow efficient comparison between ornaments. This is the most critical step in the process of classification and the choice of features is not an obvious one. The challenge is to design them with a high discriminating power. There are already a few available, of several kinds : morphological, colorimetric, positional etc. However, we would like to expand the set of features to achieve a better result.
Given the images of ornaments, the important work is to extract a mathematical representation of the comparisons, which means measuring the similarity or the differences between one image and another. This requires skills in computer science and image processing. We will thus have to write a program in C++ that transforms each pair of ornaments into a set of features. It is under this form that the machine learning algorithms will be able to perform a classification of the ornaments.
To improve the already existing set of features, we will experiment with a few new ones.
For instance, the MPEG-7 format normalizes a set of description tools allowing us to define and identify multimedia data. We could use the edge histogram descriptor. The edge histogram represents the pixel distribution over two axes. This allows us to evaluate the distance between two images with respect to their edge histograms by calculating a distance between the two histograms.
The CK1 distance measure is another possible feature calculated with the size of mpeg files. The function mpegSize returns the size of an MPEG-1 file made of two frames (each of the two images x and y we are testing). The size of the resulting file depends on the differences between two consecutive frames. Two similar frames will have a smaller compression rate than radically different frames.
Once we have completed the feature extraction, we will experiment different classification algorithms, using supervised along with unsupervised learning techniques. At the end, we want the comparison of two images to fall into one of the three following categories: the two images are the same (copies of each other), they are different, or one is a counterfeit of the other.
One possible approach could be to use complete-linkage clustering. By choosing k relevant features, we can place each ornament on a k-dimensionnal space as a cluster of its own at the beginning. Then the clusters are combined sequentially with the closest neighbor until all elements end up as part of the same cluster. Two clusters are closest to each other if their furthest elements are still the closest as in the next figure.
The sequence of clustering, along with the distances between clusters allow us to discriminate between ornaments.
We need to consider the case in which a group of original ornaments and a group of counterfeit copies are very close.
One possible way to determine a frontier within a cloud of points could be the Fischer discrimination mathematical tool. It is a linear classifier (such as the simplest machine learning classifier: the Perceptron) illustrated below.
These tests will allow us to see how effective the set of features we chose is in differentiating the ornaments. Then, by the means of the feedback from this experimental evaluation, we might want to try new features to see if we can improve our results and get a better classification.