Purpose of the project
First, a brief reminder of what our project is about. The goal is to apply machine learning methods based on hidden Markov models to the handwriting recognition problem, which is very important to the Venice Atlas project. It is a complex problem that requires segmenting the text beforehand, at least into words, if not into characters, but that task belongs to another group of students…
We started the project by taking a quick look at hidden Markov model methods, as planned in our initial road-map. As suggested by our adviser, we also started looking at the RWTH-OCR package at the same time.
Our initial plan is to first test hidden Markov model methods using the software library “RWTH-OCR”, developed by RWTH Aachen University, which is at our disposal, and then to adapt it to our specific needs. RWTH-OCR has already been applied to a very well-known dataset of American ZIP codes that are read optically and interpreted by machines. It is called the USPS dataset and is composed of around 9,300 samples, about two thousand of which are reserved for testing. The goal there is to recognize the digits that make up each ZIP code. Applied to this problem, which is reputed to be quite difficult, the RWTH-OCR algorithm achieves a pretty impressive error rate of only 2.5%.
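To give a feeling for what a 2.5% error rate means on a test set of about two thousand digits, here is a minimal Python sketch of how such a rate can be computed. The function name `error_rate` and the sample labels are our own illustration, not part of RWTH-OCR, and the numbers are made up to match the rounded figures quoted above.

```python
def error_rate(predicted, expected):
    """Fraction of samples whose predicted label differs from the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute an error rate on an empty test set")
    errors = sum(1 for p, e in zip(predicted, expected) if p != e)
    return errors / len(expected)


# Illustrative numbers only (not real USPS results): 2,000 test digits,
# 50 of which are misclassified.
expected = [7] * 2000
predicted = [7] * 1950 + [1] * 50
print(f"{error_rate(predicted, expected):.1%}")
```

In other words, at this error rate roughly 50 digits out of 2,000 would be read incorrectly.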
The RWTH-OCR setup
The first two weeks were dedicated to the setup of the software library, and to some initial testing.
This step proved less simple than one might have expected. We were not able to build the package on recent Linux distributions, only on old (and unsupported) ones, because the newer Bison package is not compatible with what the library needs. Thankfully, we still had an old LTS version of Linux available, and we were finally able to install the package after an epic (or not) struggle with the “Bison parser”.
Once we were able to build RWTH-OCR, we ran some tests with the American ZIP code dataset. We had thought that the RWTH-OCR package was a set of libraries, but it turned out to be only a set of executables: training and testing for the ZIP code dataset are done only through some obscure scripts that pass the right arguments to the right executable.
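The "scripts driving executables" pattern described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the helper names `build_command` and `run_step` are hypothetical, the `--name value` option style is a generic convention and not the actual RWTH-OCR command line, and the demo runs the Python interpreter itself rather than any RWTH-OCR executable.

```python
import subprocess
import sys


def build_command(executable, options):
    """Turn an options dict into an argument list: {"k": "v"} -> ["--k", "v"].

    The "--name value" style is a hypothetical convention for illustration.
    """
    command = [executable]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


def run_step(command):
    """Run one executable and return its standard output as text.

    check=True raises CalledProcessError if the executable exits with an error.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


# Demo with the Python interpreter standing in for a real executable.
output = run_step([sys.executable, "-c", "print('ok')"])
print(output.strip())
```

Wrapping each executable call in a small function like this keeps the arguments in one visible place, which is one way to make such scripts less obscure.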
Because of these early observations, and after discussion with our adviser, we may take a look at other packages that do the same thing but may be more flexible or better supported on recent platforms.
We are now trying to list the changes needed to use RWTH-OCR on the Venice Time Machine datasets, and we are evaluating the potential of the RWTH-OCR package in terms of flexibility, maintainability and performance.
Situation on the road-map
We are currently fully on track with our initial road-map, and are actually moving faster than planned, since we have started some tasks in parallel. We hope to be able to say the same thing in the next blog post!