The first part of our project focuses on the register associated with the Napoleonic cadaster, known as the “Sommarioni”. The goal is to be able to identify, extract and recognize the handwritten indices corresponding to each plot of land.
We dedicated the first week to setting up our project environment. As stated in the introduction post, we decided to work with the Python programming language, because many open-source Python libraries, along with good documentation, are available for computer vision and machine learning tasks. We spent that week experimenting with different libraries to find the ones that suit us best, and settled on OpenCV and NumPy, which are well known and reliable. We also decided to set up a Jupyter installation (a web application for creating and sharing live code, equations, visualisations, etc.) so that we can experiment and collaborate easily. The whole setup runs on a private server, which allows us to work together from anywhere.
The next step was to apply image processing methods to pages of the register. We first need to binarize the pages to simplify later work on them. To this end, we convert the color images to grayscale and then apply a thresholding algorithm to obtain binary images.
Fortunately, OpenCV offers some built-in thresholding functions, and we experimented with all of them. The first thresholding method is global, i.e. it applies a single threshold to the image as a whole. This is the simplest approach, though it falls short if there are large variations of intensity across the image. To overcome this problem, we can use adaptive thresholding methods, which work locally and thus give better results on non-uniform backgrounds. All of these methods, global or adaptive, are only semi-automated: we still have to tune the threshold parameters by hand, and the adaptive methods additionally require the size of the neighborhood they work on. OpenCV also offers a way to automate the selection of the threshold, using Otsu's method. In short, it computes the histogram of the image and, assuming this histogram is bimodal, picks the threshold that minimizes the intra-class variance (equivalently, the one that maximizes the variance between the two classes).
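To make the idea behind Otsu's method concrete, here is a minimal NumPy sketch of the threshold search. It is an illustration of the principle rather than OpenCV's implementation, and the function names `otsu_threshold` and `binarize` are our own, chosen for this example. It assumes an 8-bit grayscale image.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold t that minimizes the intra-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    `gray` is assumed to be a uint8 array.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)

    w0 = np.cumsum(prob)                 # weight of the "dark" class for each t
    w1 = 1.0 - w0                        # weight of the "bright" class
    cum_mean = np.cumsum(prob * levels)  # cumulative first moment
    total_mean = cum_mean[-1]

    # Between-class variance; maximizing it is equivalent to
    # minimizing the intra-class variance.
    denom = w0 * w1
    between = np.zeros(256)
    valid = denom > 0
    between[valid] = (total_mean * w0[valid] - cum_mean[valid]) ** 2 / denom[valid]
    return int(np.argmax(between))

def binarize(gray):
    """Binarize a grayscale image with the Otsu threshold."""
    t = otsu_threshold(gray)
    return (gray > t).astype(np.uint8) * 255, t
```

In practice the same result can be obtained by passing the `cv2.THRESH_OTSU` flag to `cv2.threshold`; the sketch only spells out what that flag computes.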
Additionally, we preprocess each image before the thresholding operation. We want to use Otsu's method to avoid having to specify the threshold manually, but unfortunately OpenCV does not offer any adaptive (i.e. local) version of it, so we need the image to be as uniform as possible. We first tried to blur the image slightly; this works, but sometimes leads to the loss of some details. Instead, we apply a morphological operation known as “closing” to the grayscale image, which removes the thin dark strokes and leaves an estimate of the local background intensity (a kind of “mean intensity”) at each point. Dividing the original image by this result then gives an image with uniform brightness. We found that this preprocessing dramatically improves the final result.
To reduce computation time and to make the segmentation easier, we need a way to extract the column we are interested in, i.e. the one containing the digits. We proceed by finding the lines in the document using a Hough transform. We then filter out the lines that are not vertical or not on the left side of the document. We also use a high enough threshold to make sure only the thicker lines are detected. Moreover, we only keep lines that are not too close to each other, since noise in the image sometimes causes the same line to be detected several times. Finally, we select the appropriate column using some additional criteria, for instance the distance between the two lines or the percentage of non-empty pixels between them. We still need to refine these conditions to find the most accurate ones. Furthermore, we will add a safety margin of a few pixels to avoid cutting through the numbers when the lines are not perfectly straight.
Once the column we want to extract the indices from is in a properly binarized form, we need to separate the numbers from one another. After some trial and error, we finally found an algorithm which gives us good results and which is versatile enough for our future use.
Again, we make use of morphological transformations to achieve this. We first use the gradient operator to get the outline of the digits. Then we apply the closing operator with a rectangular kernel, which allows us to “merge” the digits composing a number. Finally, we use OpenCV to find the contours in the resulting image. After filtering out the contours that do not correspond to any number, we obtain a bounding box for each number. Currently, we use basic criteria to select the contours (e.g. the ratio of non-empty pixels inside the region, the width of the region, etc.).
Even though this works on our input images, we still need to refine these criteria to get it right all the time. For instance, we want to sum the pixels of each box vertically, column by column, to obtain a kind of histogram, and then detect the peaks in this histogram, each of which should correspond to a digit. We think that such a “peak detection” algorithm would provide a useful condition to make sure the region really contains handwritten characters. Finally, we can also segment the individual digits of a number if necessary; to achieve this, we can simply use a smaller kernel width in the closing operation.
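A very simple version of this idea is sketched below. It is only one possible reading of “peak detection”: it counts runs of consecutive columns that contain ink, which works when digits are separated by at least one empty column. The function names and the `min_height` parameter are assumptions made for the example.

```python
import numpy as np

def column_profile(box):
    """Number of ink pixels in each column of a binary crop (ink is non-zero)."""
    return np.count_nonzero(box, axis=0)

def count_peaks(profile, min_height=1):
    """Count runs of consecutive columns holding at least min_height ink pixels."""
    active = profile >= min_height
    # A run starts wherever `active` switches from False to True.
    padded = np.concatenate(([False], active))
    starts = padded[1:] & ~padded[:-1]
    return int(np.count_nonzero(starts))
```

A region could then be accepted only if `count_peaks(column_profile(crop))` falls within a plausible number of digits. Touching or slanted digits would need a more robust peak detector than this one.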
The current state of the project puts us right on schedule with respect to the milestones set a few months ago (see our introduction post). We hope to finish the segmentation part soon, which will allow us to move on to the next stages of development.
In the near future, we will have to define quantitative metrics in order to compare different algorithms or parameters more objectively. We see several possible candidates. One is a cost function that rewards a large number of black pixels inside the segmentation box while penalizing the size of the box. Another is to compare the boxes produced automatically with boxes drawn by hand.
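Both candidates can be sketched in a few lines. These are illustrations under our own assumptions, not metrics we have settled on: the linear form of `ink_score` and its `area_penalty` weight are hypothetical, and intersection over union (IoU) is just one common way to compare an automatic box with a hand-drawn one. Boxes are `(x, y, w, h)` tuples and ink pixels are non-zero.

```python
import numpy as np

def ink_score(binary, box, area_penalty=0.5):
    """Reward ink pixels inside the box, penalize its area (hypothetical form)."""
    x, y, w, h = box
    ink = np.count_nonzero(binary[y:y + h, x:x + w])
    return ink - area_penalty * w * h

def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def mean_best_iou(auto_boxes, manual_boxes):
    """Average, over the hand-drawn boxes, of the best IoU with any automatic box."""
    if not manual_boxes:
        return 0.0
    best = [max((iou(m, a) for a in auto_boxes), default=0.0) for m in manual_boxes]
    return sum(best) / len(best)
```

With the first score, a tight box around a number beats a loose one containing the same ink; with the second, a score of 1.0 means every hand-drawn box is matched exactly.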
Finally, once we can reliably extract the numbers from the document, we will start the OCR part of the project to recognize handwritten digits automatically.
Samuel Gilliéron, Louis Séguy, Baptiste Billardon