Cadasters are fundamental to the representation of a town. Combined with a good index or register, a cadaster makes it possible to represent buildings precisely and with a high degree of completeness. A cadaster is a simplified map in which each plot of land is annotated with a unique number. In an associated register, these numbers are linked to the owner of the parcel and to some other information. For Venice, three well-defined cadasters exist: the 1808 Napoleonic cadaster, the 1838-1842 Austrian cadaster and the 1848-1911 Italian cadaster. Each one follows the intense history of the town.
A complete representation of the town at a given point in time requires linking each plot of land to a number in the register, and then linking that number to a person or a family. Doing this manually is an excruciating amount of work, hence the need for automation. The first part of our project deals with the register associated with the Napoleonic cadaster, which is called the “sommarioni”. The goal is to identify and extract the handwritten index of each plot of land in the register. The second part is to extend this number recognition to the cadaster itself. To these ends, we will create a web service that recognizes a handwritten number given a cropped image containing it. This web service will serve as the cornerstone of both parts of the project and can be reused for other projects involving the recognition of handwritten numbers in old documents.
The main challenge of this project will be to apply image processing and optical character recognition techniques to identify and extract the plot numbers from the documents provided to us. Here is a sample of a page from the sommarioni.
In the register, the indices are all written in the first column, labelled “Numeri della Mappa”. These are the numbers we need to extract. The extraction requires some image processing, which essentially consists in simplifying the images before applying further analysis. A typical approach is to first binarize the image, then segment the text, and finally extract the text from each segment using an OCR (Optical Character Recognition) algorithm. Since the characters to be recognized are neatly handwritten numbers, and since there are more than 10,000 of these indices, we should have enough data to train an accurate model for number recognition. We therefore expect to reach a reliable OCR without too much trouble. Moreover, several assumptions can be used to make our method more robust: all these numbers are unique, they all lie in a predefined range and they are written sequentially.
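The three assumptions above (unique, bounded, sequential) can be turned into a simple consistency check on the OCR output. The following is a minimal sketch, assuming strictly consecutive numbering; the function name `repair_sequence` and its parameters are hypothetical, and a real register may contain gaps or repeated entries that this rule would need to tolerate.

```python
def repair_sequence(readings, low, high):
    """Fix isolated misreadings in a list of recognized plot numbers.

    Assumption (ours, for illustration): numbers increase by exactly one
    from one row to the next. A reading is replaced only when both of its
    neighbours agree on what it should be and the value lies in [low, high].
    Returns the repaired list and the positions that were changed.
    """
    fixed = list(readings)
    corrected = []
    for i in range(1, len(fixed) - 1):
        prev, cur, nxt = fixed[i - 1], fixed[i], fixed[i + 1]
        expected = prev + 1
        # The neighbours are exactly two apart, so the middle value is implied.
        if nxt - prev == 2 and cur != expected and low <= expected <= high:
            fixed[i] = expected
            corrected.append(i)
    return fixed, corrected
```

The positions returned in `corrected` could also be used to flag rows for manual review instead of silently overwriting them.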
Now that we have a theoretical approach, we need to investigate the tools available to us for image processing and OCR. A first survey shows that many suitable open-source libraries are freely available, such as OCRopus, OpenCV and SciPy. We plan to use Python, which seems convenient given the number of resources and libraries available for machine learning and computer vision. Unlike Matlab, it is open-source and should be easy to deploy on a web server.
Based on this, we can tackle the first part of the project: automatically drawing boxes around the numbers in the register so as to segment the image. At this point, no explicit character recognition will be done. To achieve this, our algorithm will start from the image of a page of the register, reduce it to an 8-bit grayscale image and then binarize it into a black and white image. Because the image is not homogeneously illuminated, a single global threshold would not work and we must use a local thresholding method. One candidate is Phansalkar local thresholding, which was designed for low-contrast images. Other methods will be investigated. Once we have a binary image, we can apply orientation correction algorithms. Since we are only interested in the first column, one option is to automatically crop the image between the first two vertical lines, which will reduce the computation time. Finally, to extract blocks of text, a possible choice is the recursive X-Y cut algorithm [1]. It projects the image onto the y-axis and onto the x-axis, obtaining a vertical and a horizontal projection histogram, and then cuts at the valleys of these two histograms, repeating the process on each resulting block. With that, we can separate each of the numbers.
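The binarization and segmentation steps described above can be sketched as follows. This is a minimal illustration using only NumPy, not a definitive implementation: the function names are hypothetical, the Phansalkar threshold is written in its commonly quoted form T = mean * (1 + p * exp(-q * mean) + k * (std / r - 1)) with the usual default constants (which should be treated as assumptions to tune), and the X-Y cut is a simplified variant that cuts wherever a projection profile is empty for at least `min_gap` pixels.

```python
import numpy as np


def box_sum(a, radius):
    """Sum of `a` over the (2*radius+1)^2 window around each pixel."""
    k = 2 * radius + 1
    padded = np.pad(a, radius, mode="edge")  # replicate the borders
    # Integral image with a leading row and column of zeros.
    ii = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    return ii[k:, k:] - ii[:-k, k:] - ii[k:, :-k] + ii[:-k, :-k]


def binarize_phansalkar(gray, radius=7, k=0.25, r=0.5, p=2.0, q=10.0):
    """Boolean ink mask from a grayscale image scaled to [0, 1]."""
    gray = np.asarray(gray, dtype=float)
    n = (2 * radius + 1) ** 2
    mean = box_sum(gray, radius) / n
    var = box_sum(gray ** 2, radius) / n - mean ** 2
    std = np.sqrt(np.clip(var, 0.0, None))
    threshold = mean * (1 + p * np.exp(-q * mean) + k * (std / r - 1))
    return gray < threshold  # pixels darker than the local threshold are ink


def split_runs(profile, min_gap):
    """[start, stop) ranges of non-empty bins separated by >= min_gap empty bins."""
    filled = np.flatnonzero(profile > 0)
    if filled.size == 0:
        return []
    runs, start, prev = [], filled[0], filled[0]
    for i in filled[1:]:
        if i - prev > min_gap:  # the valley is wide enough: cut here
            runs.append((int(start), int(prev) + 1))
            start = i
        prev = i
    runs.append((int(start), int(prev) + 1))
    return runs


def xy_cut(ink, min_gap=2, top=0, left=0):
    """Recursively split an ink mask; returns (top, left, bottom, right) boxes."""
    rows = split_runs(ink.sum(axis=1), min_gap)  # projection onto the y-axis
    boxes = []
    for r0, r1 in rows:
        band = ink[r0:r1]
        cols = split_runs(band.sum(axis=0), min_gap)  # projection onto the x-axis
        for c0, c1 in cols:
            if len(rows) == 1 and len(cols) == 1:
                # No valley left in either profile: this block is a leaf.
                boxes.append((top + r0, left + c0, top + r1, left + c1))
            else:
                boxes.extend(xy_cut(band[:, c0:c1], min_gap, top + r0, left + c0))
    return boxes
```

Given a grayscale crop `gray` of the first column, scaled to [0, 1], the boxes would be obtained with `xy_cut(binarize_phansalkar(gray), min_gap=3)`. The value of `min_gap` controls whether the digits of one number stay together, so it would have to be chosen from the actual spacing in the register.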
The next step is to build a web service that applies optical character recognition to the images provided to it. One big advantage of this approach is its modularity. After a training phase on the register, we hope to be able to use this service to recognize handwritten numbers in other documents. The input image is expected to be cropped so that it contains a single number, and to be horizontally aligned. The service will then return a plain-text version of the identified number and an estimate of how reliable the guess is. We chose to implement this tool as a web service so that it can be used to enhance the DHCanvas tool created by the DHLab, which is already a web-based application. Moreover, it is relatively easy to interface a web API with other programs. For the actual character recognition, we will first experiment with OCRopus, whose early versions relied on the Tesseract engine (originally developed at Hewlett-Packard and later sponsored by Google). However, these tools are designed primarily for printed characters rather than handwritten text, so the results may not be as reliable as expected. Hence we will probably also explore classification with neural networks that we could train and tune using data from the “sommarioni”.
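The contract of such a service (a cropped image in, a number and a reliability estimate out) can be sketched with the Python standard library alone. This is only an illustration under our own assumptions: `make_app`, `serve`, the JSON field names `number` and `confidence`, and the choice of a raw image in the POST body are all hypothetical, and the recognizer is passed in as a function so that any OCR backend can be plugged in later.

```python
import json
from wsgiref.simple_server import make_server


def make_app(recognizer):
    """Build a WSGI app around `recognizer(image_bytes) -> (text, confidence)`."""

    def app(environ, start_response):
        if environ.get("REQUEST_METHOD") != "POST":
            start_response(
                "405 Method Not Allowed",
                [("Allow", "POST"), ("Content-Type", "text/plain")],
            )
            return [b"POST a cropped image of a single number\n"]

        length = int(environ.get("CONTENT_LENGTH") or 0)
        image_bytes = environ["wsgi.input"].read(length)
        if not image_bytes:
            start_response("400 Bad Request", [("Content-Type", "text/plain")])
            return [b"empty request body\n"]

        # The recognizer is the pluggable part (OCR engine or neural network).
        text, confidence = recognizer(image_bytes)
        body = json.dumps({"number": text, "confidence": confidence}).encode("utf-8")
        start_response(
            "200 OK",
            [("Content-Type", "application/json"), ("Content-Length", str(len(body)))],
        )
        return [body]

    return app


def serve(recognizer, host="localhost", port=8000):
    """Run the service with the reference WSGI server (development use only)."""
    with make_server(host, port, make_app(recognizer)) as httpd:
        httpd.serve_forever()
```

Keeping the recognizer behind a plain function boundary is what would let the same service be retrained or swapped for other collections without changing the clients.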
Finally, if time permits, we would like to extract numbers directly from the cadaster itself. This is, however, a much harder task than recognizing handwritten numbers in a cropped and aligned image. The main problem is to segment the cadaster reliably so as to extract and align each individual number. Indeed, the numbers are mixed with other elements of the map, they vary in size and orientation, and some overlap with lines or are not clearly legible. Nevertheless, if we manage to segment the cadaster properly, we should be able to identify these numbers using our web service.
- Week 1: Gather information and tools for image processing and OCR
- Weeks 2 to 4: Binarization and segmentation of the register
- Weeks 5 to 8: Experiment with different OCR techniques on the numbers
- Weeks 9 to 10: Implement the best OCR technique as a web service
- Weeks 11 to 13: Use this web service to identify the numbers from the register and, if time permits, segment the cadaster to use the OCR web service on it
- Week 14: Final presentation
Samuel Gilliéron, Louis Séguy, Baptiste Billardon
[1] Jaekyu Ha, R. M. Haralick, I. T. Phillips (1995). Recursive X-Y Cut using Bounding Boxes of Connected Components.