The goal of our project was to investigate whether the RWTH OCR framework could serve as the Optical Character Recognition system for the Venice Time Machine project.
The RWTH OCR framework is a set of programs that perform handwritten character recognition, which would be useful for digitizing the Venice archives. The framework uses so-called Hidden Markov Models to improve the quality of the recognition.
Hidden Markov Models have already been applied successfully in speech recognition systems, where they are considered state of the art, and also in character recognition systems.
This framework has been used in several research papers by members of the Human Language Technology and Pattern Recognition Group at RWTH Aachen University, which created it. It has, for example, been applied to the recognition of Arabic handwriting.
During our project, we investigated its possible application to some parts of the Venice archives.
We present the framework in the first part, our results in the second, and potential extensions in the third.
1. Presentation of the RWTH OCR framework
The RWTH OCR framework is based on the RWTH ASR framework, which has been used for several years to perform Automatic Speech Recognition. Coded in C++, it is an open-source framework, and thus can be modified by its users. The system is available at the following address: http://www-i6.informatik.rwth-aachen.de/rwth-ocr/ .
The framework uses Hidden Markov Models to improve recognition rates compared to other frameworks.
1.1. Framework’s functioning
The framework works as follows: it first has to be trained on a sufficiently large dataset that is close enough to the data to be processed to allow good recognition. This also means that transcriptions must be available for the training dataset. The system is then run on real images whose transcription is not known.
The processing pipeline is composed of the following stages. First, the images are processed and features are extracted from them. The features are computed on contiguous, overlapping sliding windows (blocks of pixels in the original images) so that each feature carries information about its context: what would be “the past” in speech processing corresponds to “the left” for left-to-right handwriting (and “the right” for right-to-left handwriting).
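To illustrate the sliding-window idea, here is a minimal Python sketch. It is not the RWTH OCR feature extractor: the function names, the window width and the step are hypothetical choices of ours, and the "feature" is simply the raw pixel values of each window.

```python
def sliding_windows(image, width, step):
    """Yield overlapping vertical slices of a text-line image, left to right.

    `image` is a list of rows (each a list of pixel values). `width` and
    `step` are in pixels; windows overlap whenever step < width.
    """
    n_cols = len(image[0])
    for start in range(0, n_cols - width + 1, step):
        yield [row[start:start + width] for row in image]


def window_feature(window):
    """Flatten a window into one feature vector (raw pixel values only)."""
    return [pixel for row in window for pixel in row]


# Toy "image" with two rows and six columns (hypothetical data).
image = [
    [0, 1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10, 11],
]
features = [window_feature(w) for w in sliding_windows(image, width=4, step=2)]
# features[0] covers columns 0..3, features[1] covers columns 2..5,
# so the two windows share columns 2 and 3.
```

Each feature vector then plays the role of one observation in the sequence that the Hidden Markov Models see, in the same way as one audio frame in speech recognition.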
During the training stage, a Hidden Markov Model is trained for every character seen in the whole set of images, for example with the Baum-Welch algorithm. Before that, several transformations are applied to the previously extracted features (for example, their dimensionality is reduced using PCA). The idea is to train the parameters of the Hidden Markov Models that will represent each type of character. Typically, a character is represented by a sequence of states, and each transition goes to either the same state, the next one, or the one after. The goal of training is then to estimate the probability of each transition and the probability of the observation associated with each state, the observation being the feature vector. The features are the only thing we can observe during the recognition stage; the inner states behind the feature stream are unknown.
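The state topology described above (stay in the same state, go to the next one, or skip one) can be sketched as a transition matrix. This is only an illustration: the function name and the starting probabilities 0.5 / 0.4 / 0.1 are our own assumptions, and in a real system these values would be re-estimated during training rather than fixed by hand.

```python
def left_to_right_transitions(n_states, p_loop=0.5, p_next=0.4, p_skip=0.1):
    """Build a left-to-right transition matrix with loop/next/skip moves.

    matrix[s][t] is the probability of going from state s to state t.
    The default probabilities are arbitrary starting values (assumption).
    """
    matrix = [[0.0] * n_states for _ in range(n_states)]
    for s in range(n_states):
        # Candidate targets: s (loop), s + 1 (next), s + 2 (skip).
        targets = [(s, p_loop), (s + 1, p_next), (s + 2, p_skip)]
        allowed = [(t, p) for t, p in targets if t < n_states]
        total = sum(p for _, p in allowed)
        for t, p in allowed:
            # Renormalise so each row sums to 1, including the last states,
            # where "next" or "skip" would fall outside the model.
            matrix[s][t] = p / total
    return matrix


# A character modelled with four states (hypothetical size).
transitions = left_to_right_transitions(4)
```

Every entry outside the three allowed moves stays at zero, which is what prevents the model from going backwards inside a character.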
The role of the recognition stage is thus to infer the hidden states behind this feature stream, given the probabilities computed during training. It finds the most probable sequence of characters (that is, of states, since each character is composed of several states) using the Viterbi algorithm. A language model can be applied to adjust the probabilities of the next characters, which improves the recognition rate.
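For readers unfamiliar with the Viterbi algorithm, here is a small self-contained Python sketch on a toy model with discrete observations. It is not the decoder of the RWTH OCR framework: the two states, the "dark"/"light" observations and all probabilities are invented for the example, and no language model is included.

```python
import math


def log(p):
    """Logarithm that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations, states, start, trans, emit):
    """Return the most probable state sequence for `observations`."""
    # best[t][s]: log-probability of the best path ending in state s at time t.
    best = [{s: log(start[s]) + log(emit[s][observations[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the best predecessor of state s.
            prev, score = max(
                ((p, best[t - 1][p] + log(trans[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log(emit[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy left-to-right model with two states (all numbers are assumptions).
states = ["s1", "s2"]
start = {"s1": 1.0, "s2": 0.0}
trans = {"s1": {"s1": 0.6, "s2": 0.4}, "s2": {"s1": 0.0, "s2": 1.0}}
emit = {
    "s1": {"dark": 0.9, "light": 0.1},
    "s2": {"dark": 0.2, "light": 0.8},
}
path = viterbi(["dark", "dark", "light", "light"], states, start, trans, emit)
print(path)  # ['s1', 's1', 's2', 's2']
```

Working with log-probabilities avoids numerical underflow on long observation sequences, which matters for full text lines.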
1.2. Framework’s composition
Once compiled, the framework consists of several programs: some generate the features, some perform training, and some perform recognition.
Several files have to be generated to handle a given corpus of images. For more detailed instructions on how to install the framework, see the previous blogpost that we wrote.
2. Experimental results
2.1. Preliminary experiments
Our first experiment was to run the framework on the sample corpus (called USPS) provided by the University of Aachen. This well-known dataset, which is already considered quite hard, consists of handwritten American zip code digits collected by the United States Postal Service. It contains 7291 training digits and 2007 test digits. We ran the system on it and obtained very good results, with about 95.68% of the digits correctly recognized. Given these good results, we then wanted to use the system on a synthetic dataset of Latin alphabet characters.
2.2. Synthetic Latin alphabet characters experiment
We then adapted the system to run our second experiment on a synthetic corpus of Latin characters. The corpus consisted of various Latin letters written in different fonts and sizes. The adaptation of the system to this dataset was described in this previous blogpost. The results were also very good, with a 90.5% word recognition rate on the test set. Most of the errors occurred on small characters in very curvy fonts, as in the image pictured below. The next experiment we wanted to run was on a real dataset from the Venice archives.
2.3. The Venice archives dataset experiments
The dataset at our disposal was a set of text lines from the Venice archives. The text is written in Old Italian, and thus doesn’t map exactly to current Italian. Because this dataset was freshly extracted, we had to run several scripts to build the corpus structure required by the RWTH OCR system. We had to extract the set of UTF-8 symbols used in the corpus to generate a language model. We took care that the same character was always encoded with the same byte sequence (which is not guaranteed in UTF-8, since an accented letter, for instance, can be stored either as a single code point or as a base letter followed by a combining accent).
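The encoding issue can be reproduced with Python's standard unicodedata module. The sketch below is an illustration, not the script we actually ran: the helper names are hypothetical, and the choice of the NFC normalization form is an assumption.

```python
import unicodedata

composed = "\u00e8"     # "è" stored as a single code point
decomposed = "e\u0300"  # "e" followed by a combining grave accent

# Both render the same way but are different byte sequences in UTF-8.
composed_bytes = composed.encode("utf-8")      # b'\xc3\xa8'
decomposed_bytes = decomposed.encode("utf-8")  # b'e\xcc\x80'


def normalize_line(text):
    """Bring a transcription line to one canonical form (NFC, an assumption)."""
    return unicodedata.normalize("NFC", text)


def symbol_set(lines):
    """Collect the distinct symbols used in a corpus after normalization."""
    return sorted({ch for line in lines for ch in normalize_line(line)})


# The same word written with both encodings of "à" (hypothetical sample).
symbols = symbol_set(["citt\u00e0", "citta\u0300"])
```

Without the normalization step, the symbol set would contain both "à" and the bare combining accent, so the same written character would be split across two entries in the language model.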
After generating this corpus structure, we tried to run the system on it. Sadly, although it was very time-consuming, the training of the Hidden Markov Models failed. We think this is because most of the images were slightly slanted, which hindered the feature extraction and the training of the models.
3. Potential extensions
The RWTH OCR framework proved promising in our experiments on the zip code and synthetic Latin character datasets. However, we were not able to finish its integration with the Venice archive dataset. It would be interesting to improve the preprocessing and feature extraction on this dataset so as to reduce the slant of the images, which could prevent the system from crashing during training.
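One possible way to reduce slant, which we did not try, is to shear each line image and keep the shear that makes the strokes most vertical. The Python sketch below assumes that "slanted" means sheared handwriting (not a rotated line) and that images are binary lists of rows. The function names, the candidate shear factors and the scoring heuristic (sum of squared column sums) are our own illustrative choices, not part of the RWTH OCR framework.

```python
def shear(image, k):
    """Shift each row horizontally by k pixels per row of height above the bottom."""
    height, width = len(image), len(image[0])
    pad = int(abs(k) * (height - 1)) + 1  # margin so shifted pixels stay inside
    out = [[0] * (width + 2 * pad) for _ in range(height)]
    for r, row in enumerate(image):
        shift = round(k * (height - 1 - r))  # the bottom row does not move
        for c, value in enumerate(row):
            out[r][c + pad + shift] = value
    return out


def column_score(image):
    """Higher when ink is concentrated in few columns (vertical strokes)."""
    return sum(sum(column) ** 2 for column in zip(*image))


def deslant(image, candidates=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    """Try each candidate shear factor and keep the best-scoring one."""
    best = max(candidates, key=lambda k: column_score(shear(image, k)))
    return best, shear(image, best)


# A single diagonal stroke leaning to the right (hypothetical 3x3 image).
slanted = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
]
best_k, upright = deslant(slanted)
# best_k is -1.0: shifting the upper rows to the left aligns the three
# pixels in one column.
```

Real scans would need a finer grid of shear factors and grayscale handling, so this should be read as a starting point for experiments only.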
Judging from the research papers published by the group at the University of Aachen, good results seem achievable with it.
However, adapting this framework is not an easy task given the lack of available documentation.
When working with the RWTH OCR package, our impression was that the framework is very capable, and interesting because it is open-source. However, we encountered many problems that we were only partially able to solve: for example, because the framework depends on old software libraries, we could only install it on an old Ubuntu Linux system (see the second blogpost). The system was also really hard to adapt, as the only documentation available is the zip code sample corpus, and the error messages we ran into were really cryptic. Moreover, the project is still described as in progress, but no additions have been made in recent years.
For these reasons, we would not recommend using this framework in the Venice Time Machine project. It would become a viable option only if the documentation were improved, or if the developers of the framework were directly involved in the integration.
However, with more documentation, this framework could be one of the best, as it uses cutting-edge methods that have proved really efficient in several research papers.
It would be interesting to compare this framework to the more established and better documented HTK framework, which provides tools to manipulate Hidden Markov Models. However, feature extraction would have to be coded by hand in that case, and this would require further investigation.