Work over the last few weeks progressed more slowly than expected, for two reasons. The first and most significant is a lack of time due to projects and exams in other courses. The second is that we encountered several obstacles in the segmentation of the Sommarioni. We focus on the latter here and explain the issues we ran into. Even though we are behind schedule, we have made progress and expect to be able to catch up.
At the moment, we are able to extract the required column “Numeri della Mappa” on a particular page by manually tweaking the parameters of the algorithm.
However, as expected, these parameters do not work well on the other pages: the algorithm extracts the wrong columns, or no columns at all. Each scanned page is slightly different: sometimes the page is rotated by some angle, other times it is not centered correctly, etc. Another frequent problem is that the columns in the image are not straight but slightly curved. This probably comes from the way the book was scanned, or perhaps from the columns having been drawn by hand. It is probably also one of the reasons why some lines are not well recognized.
For all the reasons listed above, we have not yet found a generic method that reliably extracts the appropriate column from every page of the Sommarioni. To overcome this problem, we plan to meet with our teaching assistant as soon as possible.
In spite of the problems detailed in the previous part, we managed to make some progress. We are currently able to reliably extract numbers from a column, provided this column is correctly segmented (i.e. there are no remaining vertical lines near the border). The next step is to recognize each of these numbers. To this end, we first need to segment each number into its digits, which our segmentation algorithm is already able to do. Then we need to recognize each digit in order to recognize the whole number. This can be done using different machine learning algorithms. In any case, since we will be doing supervised training, we need both a training set and a testing set. The training set, as the name suggests, is used to construct the model. The testing set is used to compare and evaluate the different models we train. We plan to experiment with two different training sets.
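To make the digit segmentation step more concrete, here is a minimal sketch of one common approach, a vertical projection profile. It is an illustration under our own assumptions and not necessarily the algorithm we use: the function name `segment_digits`, the `min_width` parameter and the toy image are hypothetical, and the image is assumed to be already binarized (1 for ink, 0 for background).

```python
def segment_digits(image, min_width=1):
    """Split a binary image of a number into per-digit column ranges.

    `image` is a list of rows; each pixel is 1 (ink) or 0 (background).
    Returns a list of (start, end) column indices, with `end` exclusive.
    """
    if not image:
        return []
    width = len(image[0])
    # Vertical projection: count the ink pixels in every column.
    profile = [sum(row[x] for row in image) for x in range(width)]

    segments = []
    start = None
    for x, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = x  # a digit begins at the first inked column
        elif ink == 0 and start is not None:
            if x - start >= min_width:
                segments.append((start, x))  # the digit ends at an empty column
            start = None
    if start is not None and width - start >= min_width:
        segments.append((start, width))  # digit touching the right border
    return segments


# Toy 3x7 image: two "digits" separated by two empty columns.
toy = [
    [1, 1, 0, 0, 1, 1, 1],
    [1, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 1, 1],
]
print(segment_digits(toy))  # [(0, 2), (4, 7)]
```

Such a profile only separates digits that have at least one empty column between them, so touching or overlapping digits would need a different technique.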
The first option is to create a training set from the register itself. This should not be too cumbersome, since the numbers are ordered and have no gaps. Provided we can reliably extract the numbers, we can automatically associate the correct label (i.e. numerical value) with each of them. We can then keep a subset of these as a testing set. Nevertheless, we want to be able to recognize handwritten digits from other ancient documents, so our model needs to generalize. We therefore have to be careful with such a training set, since it would be easy to overfit to the handwriting of this particular register.
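The automatic labeling described above could look like the following sketch. The helper names `label_sequential` and `split_train_test`, the starting number 101 and the 20% test fraction are assumptions made for illustration; placeholder strings stand in for the cropped number images.

```python
import random


def label_sequential(crops, first_number):
    """Pair each extracted number image with its expected value.

    Assumes `crops` is in register order and that no entry was skipped
    or duplicated during extraction.
    """
    return [(crop, first_number + i) for i, crop in enumerate(crops)]


def split_train_test(samples, test_fraction=0.2, seed=0):
    """Shuffle the labeled samples and hold out a fraction for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder strings stand in for the cropped number images.
crops = [f"crop_{i}" for i in range(10)]
labeled = label_sequential(crops, first_number=101)
train, test = split_train_test(labeled, test_fraction=0.2)
print(len(train), len(test))  # 8 2
```

Note that a single missed or duplicated number during extraction would shift every following label, so the extraction has to be verified before the labels can be trusted.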
The second option is to use the MNIST database (which stands for Modified National Institute of Standards and Technology). It is a large database of handwritten digits of varying quality, collected from high-school students and census bureau employees. It consists of a training set of 60’000 handwritten digits and a test set of 10’000 digits, which should be large enough for us. One possible problem is that handwritten digits from the 19th century might be written differently from those of the 20th and 21st centuries. From our observations, however, the digits in the Sommarioni are written clearly and do not seem to differ greatly from today’s handwritten digits, as shown in the following figure.
Fig. 3: (Left) Digits from the MNIST database. (Right) Digits from the Sommarioni.
Finally, we could also create a training set using both of these sources. In addition to the training and testing sets, we need to choose among different machine learning algorithms. We settled on two candidates:
- SVM (Support Vector Machine). It is a well-known algorithm that relies on solid mathematical foundations. These models are flexible, since we can use different kernels, for instance to perform non-linear classification. Moreover, they are not too sensitive to the curse of dimensionality: they stay efficient even in high-dimensional spaces (such as images). Nevertheless, they can be quite computationally expensive. Hopefully we will be able to take advantage of the sparsity of our dataset (a lot of pixels are “empty”). On the MNIST database, such approaches reached test errors between 1.4% and 0.56%. Finally, some of us have already used such models, so we have some experience with SVMs.
- CNN (Convolutional Neural Network). This approach is relatively new and is inspired by biological processes. It gives very good results on images, and it requires little preprocessing compared to traditional machine learning algorithms (e.g. there is no need to manually choose which features to extract). However, this kind of neural network is considerably more complex, and none of us has experimented with one before. On the MNIST database, CNNs reached test errors between 1.7% and 0.23%.
Both methods will probably give good results, as they should produce only about a hundred mislabeled numbers in the register, which is close to human performance (estimated at 0.2%, but not tested on the whole set). Moreover, such mislabels can be detected using inference: for instance, a given number can only appear once and has to belong to a certain range of numbers.
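The two rules mentioned above (each number appears only once and lies in a known range) can be checked with a few lines of code. This is a minimal sketch under our own assumptions: the function name `find_suspects` and the example values are hypothetical and do not come from the register.

```python
from collections import Counter


def find_suspects(numbers, low, high):
    """Return indices of recognized numbers that break the register's rules.

    A number is suspect if it lies outside [low, high] or if the same
    value was recognized more than once.
    """
    counts = Counter(numbers)
    suspects = []
    for i, n in enumerate(numbers):
        if not (low <= n <= high) or counts[n] > 1:
            suspects.append(i)
    return suspects


# Made-up recognition output for a register expected to hold 101..108.
recognized = [101, 102, 108, 104, 105, 105, 107, 1008]
print(find_suspects(recognized, low=101, high=108))  # [4, 5, 7]
```

Both occurrences of a duplicated value are flagged, since these rules alone cannot tell which one is wrong. They also miss some errors: the 108 at index 2 passes both checks even though it is out of order, so an additional check on the expected ordering of the numbers would be needed to catch it.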
Compared to the milestones we set, we are about one week behind. Once we overcome the segmentation problems, we will start experimenting with OCR techniques on the digits we extracted.
Baptiste Billardon, Samuel Gilliéron, Louis Séguy