Cadasters are fundamental to the representation of a town. Combined with a good index or register, they make it possible to represent buildings precisely and with a high degree of completeness. A cadaster is a simplified map in which each plot of land is annotated with a unique number. In an associated register, these numbers are linked to the owner of the parcel and to other information. For Venice, three well-defined cadasters exist: the 1808 Napoleonic cadaster, the 1838-1842 Austrian cadaster and the 1848-1911 Italian cadaster. Each follows the intense history of the town.
A complete representation of the town at a given point in time would require an excruciating amount of manual work: linking each plot of land to a number in the register, then linking that number to a person or a family. Hence the need for automation. We worked on the register associated with the Napoleonic cadaster, called the “Sommarioni”. It is composed of several columns, as seen in the following figure. The first one, entitled “numeri della Mappa”, is the index: each unique number corresponds to a plot of land in the cadaster. The numbers are written sequentially (9001, 9002, …) in the same handwriting.
The main goal of the project was to extract and identify the handwritten index of each plot of land in the register. The secondary goal was to provide a web service for handwritten digit recognition, allowing other people to reuse our work.
Before doing any optical character recognition (OCR), the images must be binarized. After trying several algorithms, we chose Otsu’s method. Its main advantage is that it computes the threshold automatically: it builds the histogram of the grayscale image and, assuming the histogram is bimodal, finds the threshold that minimizes the intra-class variance of the two resulting classes (equivalently, maximizes the between-class variance).
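In practice this is a single call to OpenCV (`cv2.threshold` with the `THRESH_OTSU` flag), but the idea can be sketched in a few lines of NumPy. This is a minimal illustration, not our production code; the synthetic "page" and pixel values are made up for the demo.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold that minimizes the intra-class variance
    of the two classes, assuming a bimodal grayscale histogram."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)
    best_t, best_between = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (levels[:t] * prob[:t]).sum() / w0
        mu1 = (levels[t:] * prob[t:]).sum() / w1
        # Minimizing intra-class variance is equivalent to maximizing
        # this between-class variance.
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t

# Synthetic bimodal "page": dark ink (~30) on light paper (~220).
img = np.full((100, 100), 220, dtype=np.uint8)
img[40:60, 40:60] = 30
t = otsu_threshold(img)
binary = (img >= t).astype(np.uint8) * 255  # paper -> 255, ink -> 0
```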
There is a lot of unwanted text on the page (title, description, …) that would clog our OCR system. To prevent this, the first column has to be extracted. We built an algorithm for this and applied it only to the first 20% of the page, which we know is the approximate position of the column. Vertical lines were detected using a probabilistic Hough transform. However, our register posed several problems: some pages were curved, some lines were not straight, and some numbers overlap the lines. We therefore could not reach a precision of 100%.
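In the project itself the line detection is done with OpenCV's probabilistic Hough transform (`cv2.HoughLinesP`). As a dependency-free illustration of the underlying idea (aligned ink pixels accumulate votes for a line), here is a reduced version that only votes for perfectly vertical lines, applied to the left fifth of a synthetic page:

```python
import numpy as np

def vertical_line_votes(binary, min_votes):
    """Toy vertical-line detector: each ink pixel (value 0) votes for
    its column, mimicking how a Hough accumulator collects votes for
    line parameters. Columns with enough votes are long vertical rules."""
    votes = (binary == 0).sum(axis=0)          # ink pixels per column
    return np.flatnonzero(votes >= min_votes)  # columns crossed by a line

# Synthetic page: white background with a vertical rule at x = 12.
page = np.full((100, 200), 255, dtype=np.uint8)
page[:, 12] = 0

# Only search the first 20% of the page width, as described above.
left = page[:, : page.shape[1] // 5]
cols = vertical_line_votes(left, min_votes=80)
```

A real Hough accumulator parameterizes lines by angle and offset, which is what lets it also catch the slightly tilted rules mentioned above; this sketch keeps only the vertical case for brevity.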
Once a column is extracted, numbers can be extracted using morphological transformations and a contour detection algorithm. Morphological transformations are a set of mathematical operators that modify an image: for instance, dilation increases the thickness of an object, while erosion decreases it. All these operations require a shape element called a kernel, which can be anything from a simple square to a very complex shape. In our case, a dilation is applied with a horizontal rectangular kernel so as to join the digits of each number together. Afterwards, we use the contour detection algorithm built into OpenCV to extract each number. However, this method requires that the numbers do not overlap any column line.
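In the project this is `cv2.dilate` with a rectangular kernel followed by `cv2.findContours`. The following dependency-free sketch shows the principle on a toy column: a horizontal dilation merges the digits of each number into one blob, and contiguous runs of ink columns then stand in for contour bounding boxes. The sizes and coordinates are invented for the demo.

```python
import numpy as np

def dilate_horizontal(ink, width):
    """Binary dilation with a 1 x width rectangular kernel: an output
    pixel is on if any input pixel within width//2 columns of it is on."""
    pad = width // 2
    padded = np.pad(ink, ((0, 0), (pad, pad)))
    return np.max([padded[:, i:i + ink.shape[1]] for i in range(width)],
                  axis=0)

def column_runs(mask):
    """Contiguous runs of columns containing ink -> one run per number."""
    cols = mask.any(axis=0).astype(int)
    edges = np.diff(np.concatenate(([0], cols, [0])))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return list(zip(starts, ends))

# Two toy "numbers", each made of two digits separated by a small gap.
ink = np.zeros((10, 30), dtype=bool)
ink[2:8, 2:4] = True    # number one, first digit
ink[2:8, 6:8] = True    # number one, second digit
ink[2:8, 15:17] = True  # number two, first digit
ink[2:8, 19:21] = True  # number two, second digit

joined = dilate_horizontal(ink, 5)  # digits of each number now connect
runs = column_runs(joined)          # two spans, one per number
```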
To extract each digit, we applied a closing operator to make sure each digit is fully connected. Several problems appeared: some digits are badly segmented, as seen in the previous figure. They are either not fully connected or overlapping, and are thus badly segmented, such as the nine which is cut in this example.
Optical Character Recognition
The main part of the project was the recognition of the numbers themselves. To do so, machine learning algorithms were investigated. Before applying them, a dataset is needed. The MNIST dataset was selected as it is a standard in handwritten digit recognition tasks. It is a large database of handwritten digits of varying quality, collected from high-school students and Census Bureau employees. It consists of a training set of 60,000 handwritten digits and a test set of 10,000 digits, which was large enough for us. One expected complication of using this database is that handwritten digits of the early 19th century differ from modern handwritten digits: they are written in italics, with decorations such as a very long tail on nines.
To build a dataset of numbers from the register with corresponding labels, we needed a semi-automatic algorithm. Since the numbers in the register are consecutive, labeling a whole dataset is extremely easy, provided all of the numbers are extracted in the correct order. As our extraction algorithm could not reach 100% precision, we could not rely on this sequential labeling trick alone: manual supervision of the extraction was needed. Using Localturk, a framework similar to the Amazon Mechanical Turk platform, we made sure all of the numbers were extracted in the right order, after which each number could be labeled automatically. Using this method we created a dataset of 3,110 digits. We also had to carefully normalize them in the same way as the MNIST digits.
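MNIST digits were built by scaling each glyph to fit a 20x20 box while preserving its aspect ratio, then centering it by its center of mass in a 28x28 image, so our crops must be normalized the same way. Here is a sketch of that procedure; it uses nearest-neighbour resampling to stay dependency-free (our pipeline would use a proper image-resize routine), and the input blob is a stand-in for a real cropped digit:

```python
import numpy as np

def normalize_like_mnist(digit):
    """Normalize a cropped binary digit (ink = 1) the way MNIST was
    built: scale the glyph to fit a 20x20 box, then place it in a 28x28
    image centered on its center of mass."""
    ys, xs = np.nonzero(digit)
    crop = digit[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = crop.shape
    scale = 20.0 / max(h, w)
    nh = max(1, int(round(h * scale)))
    nw = max(1, int(round(w * scale)))
    # Nearest-neighbour downsampling (a real pipeline would interpolate).
    rows = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    small = crop[np.ix_(rows, cols)]
    # Center the glyph's center of mass at pixel (14, 14).
    ys, xs = np.nonzero(small)
    top = max(0, min(28 - nh, int(round(14 - ys.mean()))))
    left = max(0, min(28 - nw, int(round(14 - xs.mean()))))
    out = np.zeros((28, 28), dtype=digit.dtype)
    out[top:top + nh, left:left + nw] = small
    return out

digit = np.zeros((50, 30), dtype=np.uint8)
digit[10:40, 8:22] = 1  # a 30x14 blob standing in for a cropped glyph
norm = normalize_like_mnist(digit)
```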
With these datasets, several models were investigated. The first was a very simple neural network using raw pixels as features for classification. The second was an SVM with different kernels and different features; the features we used were raw pixels and HOG (histogram of oriented gradients). However, none of these models could achieve an accuracy of more than 90%.
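HOG features summarize a digit by histograms of gradient orientations over small cells, which is more robust to stroke thickness than raw pixels. In practice one would use a library implementation (e.g. `skimage.feature.hog`) and feed the vectors to an SVM; the following minimal version (no block normalization) is only meant to show what the feature vector contains:

```python
import numpy as np

def hog_features(img, cell=7, bins=9):
    """Minimal HOG: magnitude-weighted orientation histograms over
    non-overlapping cells. For a 28x28 digit with 7x7 cells and 9
    orientation bins this yields a 4*4*9 = 144-dimensional vector."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    h, w = img.shape
    feats = []
    for y in range(0, h - cell + 1, cell):
        for x in range(0, w - cell + 1, cell):
            hist = np.zeros(bins)
            b = bin_idx[y:y + cell, x:x + cell].ravel()
            m = mag[y:y + cell, x:x + cell].ravel()
            np.add.at(hist, b, m)  # accumulate weighted votes per bin
            feats.append(hist)
    return np.concatenate(feats)

feat = hog_features(np.random.default_rng(0).random((28, 28)))
```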
Finally, we used a Convolutional Neural Network (CNN) with two convolutional layers, one fully-connected layer and a final softmax layer for multiclass classification. After training solely on the MNIST, we obtained 98% accuracy on the MNIST test set. Results on the numbers from the Sommarioni were disappointing, however: we obtained a mere 75% accuracy, due to the specificities of the handwriting and to the segmentation problems of our algorithm. To solve these two problems, we fine-tuned our CNN using a few hundred samples from our own dataset. We then achieved 95% accuracy on our test set while keeping 95% accuracy on the MNIST.
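One way to express such an architecture, sketched here with Keras. The filter counts, kernel sizes and dense layer width are illustrative choices, not the exact hyperparameters we used:

```python
import tensorflow as tf

# Two convolutional layers, one fully-connected layer, and a final
# 10-way softmax for digit classification, as described above.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, 5, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 5, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# For fine-tuning on register samples, one option is to freeze the
# convolutional layers (layer.trainable = False) and retrain only the
# dense layers on the few hundred labeled Sommarioni digits.
```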
To predict a number using the CNN, we segment each of its digits, normalize them in the same way as the MNIST and feed all of them together to the neural network. For each digit, we obtain the probabilities of it belonging to each class and assign it to the class with the highest score. The confidence score for the whole number is then the product of these per-digit scores.
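The decision rule above reduces to an argmax per digit and a product of the winning probabilities. A worked example with made-up softmax outputs for a three-digit number:

```python
import numpy as np

# Toy softmax outputs (rows: digits of one number, cols: classes 0-9).
probs = np.array([
    [0.01, 0.02, 0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01],
    [0.05, 0.80, 0.05, 0.01, 0.01, 0.02, 0.02, 0.02, 0.01, 0.01],
    [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.02, 0.90],
])

digits = probs.argmax(axis=1)            # most likely class per digit
confidence = probs.max(axis=1).prod()    # product of per-digit scores
number = int("".join(map(str, digits)))  # -> 219, confidence 0.648
```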
To make our work available for other projects, we decided to create a web service. We implemented it using Flask, a Python microframework. The user sends a REST request containing a cropped image of the number to be recognized. The request is JSON-encoded, with the image encoded as base64. Computation happens server-side and the result is sent back to the user as a JSON response containing the value and a confidence index.
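The service boils down to one Flask endpoint that decodes the base64 image and returns the prediction as JSON. In this sketch the recognition pipeline is replaced by a stub, and the route name, field names and example values are illustrative, not our actual API:

```python
import base64
from flask import Flask, jsonify, request

app = Flask(__name__)

def recognize(image_bytes):
    """Stub for the real pipeline (digit segmentation, MNIST-style
    normalization, CNN prediction). Returns (value, confidence)."""
    return 9001, 0.95

@app.route("/ocr", methods=["POST"])
def ocr():
    payload = request.get_json()
    # The image arrives base64-encoded inside the JSON body.
    image_bytes = base64.b64decode(payload["image"])
    value, confidence = recognize(image_bytes)
    return jsonify({"value": value, "confidence": confidence})
```

A client would POST e.g. `{"image": "<base64 data>"}` to `/ocr` and receive `{"value": 9001, "confidence": 0.95}` back.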
We developed an algorithm that automatically recognizes numbers from the register of the cadaster. Moreover, we created a web service to make the OCR part of the project available to other projects that need to recognize handwritten digits.
Nevertheless, the current state of the project still has issues. Column segmentation took a lot of time and no satisfactory solution was found; it was a large, unexpected time sink, and we had to stop development on that algorithm in order to continue with the most important part of the project. There are also several issues with the segmentation of numbers and digits, which is clearly the main bottleneck for both precision and generalization. Fine-tuning the MNIST-trained CNN on our own samples allows it to compensate for badly segmented digits, largely improving accuracy on the Sommarioni, but to generalize further, more work is needed on the segmentation algorithms, and none of us had enough knowledge in computer vision for this.
All of these results are still very encouraging. They show that it is possible to use a modern standard dataset (MNIST), with slight adjustments to the model, to recognize numbers in old documents. This is a common task in digital humanities, and being able to analyze these numbers correctly and automatically would save a lot of time.