In the previous blog post, we presented our first steps on the project. It mainly covered our preliminary experiments with the RWTH OCR package on the USPS dataset, which we used to test the system's recognition capabilities on single digits.
In this blog post we will present in more detail how to set up RWTH OCR on a Linux-based machine (Ubuntu), as we encountered some trouble during this step and would like to share our solutions.
Since the previous blog post we have run new experiments on a synthetic dataset of Latin letters. We will present how to set up a new dataset (which consists in creating a training set, a test set, a lexicon and a language model), and include the scripts we wrote to generate the corpus automatically.
We will then present the results of the RWTH OCR system on both the USPS dataset and our synthetic Latin dataset.
Our next step is to test the system on sentences; we describe how we plan to do that in the last section of this post.
RWTH OCR installation, step by step
The RWTH OCR system is based on the RWTH ASR system, which is used for speech recognition. OCR is provided through a set of patches applied to the ASR system.
Due to some dependencies of the RWTH ASR system (the Bison version shipped with newer systems is too recent), we were only able to install RWTH OCR on an Ubuntu 12.04 machine.
Therefore the installation of the RWTH OCR system can be separated in the following steps:
- Installing Ubuntu 12.04
- Downloading the RWTH ASR system, the RWTH OCR patches and applying them
- Installing the dependencies of the ASR and OCR systems
- Building the system
- Testing it on the provided example
In this section, we will go through all those steps.
Installing Ubuntu 12.04
You can find an ISO image for Ubuntu 12.04 here
Downloading RWTH ASR and OCR
The latest version of RWTH ASR (0.6.1 at the time of writing) can be used with OCR, and can be downloaded once the information form is filled in.
The RWTH OCR patches (version 0.1 at the time of writing) can be downloaded from here once the information form is filled in.
After extracting the ASR source code from the archive, copy the compressed OCR patch into the ASR source directory. Then, in a shell, go to the ASR directory, decompress the patch and apply it, for example with:
gzip -dc rwth-ocr-0.1.patch-for-0.6.1.gz | patch -p1
Installing the dependencies of the ASR and OCR systems
Before building the system, it is necessary to install some dependencies, which we will list here.
The system failed to build on a recent Ubuntu distribution because of the Bison dependency, forcing us to use an older distribution (12.04) in order to use the old Bison 1.2.5 release.
In parentheses we give the Ubuntu package names that provide each tool, together with the versions with which we successfully built the OCR system.
The ASR system depends on the following libraries and tools:
- GCC 4.0 <= version <= 4.8 (build-essential)
- GNU Bison version 1.2.5 (bison version 1.2.5)
- GNU Make (build-essential)
- Z lib (zlib1g-dev)
- libxml2 (libxml2-dev)
- libsndfile (libsndfile1-dev version 1.0.25)
- libcppunit (libcppunit, libcppunit-dev version 1.12.1)
- LAPACK (liblapack-dev version 3.3.1)
- BLAS (libblas3gf, libblas-dev version 1.2.20110419)
The OCR system adds the following dependencies:
- PNG library (libpng12-dev version 1.2.46)
- JPEG library (libjpeg-dev version 8c-2ubuntu7 )
- Netpbm (libnetpbm10-dev version 2.10.0)
All those dependencies can be installed with the following command line on ubuntu:
sudo apt-get install build-essential liblapack-dev bison zlib1g-dev libxml2-dev libsndfile1-dev libcppunit-dev libpng12-dev libjpeg-dev libnetpbm10-dev
Building the system
You can just go to the package directory and build everything by typing the command:
make
Testing the OCR system
The University of Aachen provides an example dataset on their wiki, on this page.
The page gives clear explanations on how to run the example and how to inspect the results.
You can test the installation of the OCR system with this example.
Creating a new corpus
In order to create a new corpus, there are several files and folders to modify or create.
Creating the new database folders and basic files
First, include all the files contained in the OCR example. They will serve as a basis for the new corpus and will either be used as-is or updated to fit our needs.
You might want to keep the USPS example database. In that case, create your new database folder “data/name-of-the-database” and copy into it the files from “data/USPS”, so as to have, as before, the basic files that will be kept or modified.
Creating test and train folders and generating the corpus files
It is then necessary to generate the test and train folders containing the different samples. These folders should be located under “data/name-of-the-database/data” in your project path and should contain the test and train sample images, respectively.
Once this is done, several config files need to be updated. Under “data/name-of-the-database/corpus”, the files complete.recordings, train.recordings and test.recordings, and their counterparts with the .corpus.plain extension, need to be generated, respecting the format of the existing files. The former are an XML version of the latter. The speaker files do not need to be modified.
Note that in our case, these files were created at the same time as the test and train folders by preprocess.py, a Python script we wrote for the occasion.
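The core logic of preprocess.py can be sketched as follows. The function names and the one-pair-per-line layout of the .corpus.plain files are illustrative assumptions; the exact format should be copied from the USPS example files.

```python
import random

def split_corpus(samples, test_ratio=0.25, seed=0):
    """Split (image_path, transcription) pairs into train and test lists."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def write_plain_corpus(path, samples):
    """Write one 'image-path transcription' pair per line (layout assumed;
    mirror the exact format of the USPS example .corpus.plain files)."""
    with open(path, "w") as f:
        for image, orth in samples:
            f.write("%s %s\n" % (image, orth))
```

The .recordings files can then be produced by wrapping each pair in the XML elements found in the example recordings files.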
Updating the lexicon files
In the folder “data/name-of-the-database/lexicon”, the XML files train.lex and train.noemptysilence.lex need to be updated. Luckily, in the case of handwriting recognition, both files are identical. To complete them, you first need to fill in the “phoneme inventory”, respecting the format of the file; here one phoneme corresponds to one Latin letter (case-sensitive) or symbol. The second part of the file contains a list of lemma elements. Each of them holds the transcription of one sample image in the “orth” element, and its decomposition into “phonemes” in the “phon” element, i.e. its letters (the so-called “phonemes” described in the first part of the file).
The different files in the “data/name-of-the-database/lm” folder also need to be updated with the new phonemes. The list of indices might have to be extended.
Moreover, if you changed the database folder, do not forget to update every config file in the project’s subfolders; be wary, some of them are well hidden.
The grep command makes it easier to find all the occurrences to change.
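As a rough illustration, such a lexicon could be generated as follows. This is only a sketch: the element names reflect our reading of the USPS example train.lex, and the exact structure should be taken from that file rather than from this code.

```python
def make_lexicon(transcriptions):
    """Build a lexicon string with one 'phoneme' per distinct character
    and one lemma per distinct transcription (structure assumed from
    the USPS example train.lex)."""
    phonemes = sorted(set("".join(transcriptions)))
    lines = ["<?xml version='1.0' encoding='utf-8'?>",
             "<lexicon>",
             "  <phoneme-inventory>"]
    for p in phonemes:
        lines.append("    <phoneme><symbol>%s</symbol></phoneme>" % p)
    lines.append("  </phoneme-inventory>")
    for word in sorted(set(transcriptions)):
        # the 'phon' element spells the transcription out letter by letter
        lines.append("  <lemma><orth>%s</orth><phon>%s</phon></lemma>"
                     % (word, " ".join(word)))
    lines.append("</lexicon>")
    return "\n".join(lines)
```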
Preliminary results on the USPS and synthetic Latin databases
Presentation of the databases
For these first trials, we ran tests on two different databases: the USPS database and a synthetic Latin script database.
USPS database The first database is the well-known USPS dataset. It is composed of handwritten digits (from 0 to 9) that have been sampled by the U.S. Postal Service. There are 9298 of them, already segmented. The training and testing sets contain respectively 7291 and 2007 observations. This is the tutorial/example database proposed in the RWTH OCR documentation.
Latin database This dataset is composed of synthetic, already-segmented Latin upper- and lowercase letters (without any special symbols such as accents or punctuation) rendered in different fonts and sizes. Since uppercase and lowercase letters are differentiated, there are 52 classes, for a total of 3120 samples. The fonts are Aharoni, Arial, Courier New, Impact, Informal Roman and Lucida Handwriting, in the sizes 6, 7, 8, 9, 10, 12, 14, 16, 18 and 24. We split this database into training and test sets by assigning one out of every four fonts to testing, where a font is defined by its style and size: for example, Aharoni 6, 7 and 8 go to training and Aharoni 9 to testing. Note that this split can easily be changed via parameters in the preprocess.py script. The database was generated in APTI style.
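The one-font-out-of-four split described above can be sketched like this (split_by_font is a hypothetical helper for illustration, not part of our actual preprocess.py):

```python
def split_by_font(samples, test_every=4):
    """samples: list of (font_style, size, image_path) tuples.
    Assign every fourth (style, size) combination to the test set and
    the rest to training -- e.g. Aharoni 6/7/8 train, Aharoni 9 test."""
    fonts = sorted(set((style, size) for style, size, _ in samples))
    test_fonts = set(f for i, f in enumerate(fonts)
                     if i % test_every == test_every - 1)
    train = [s for s in samples if (s[0], s[1]) not in test_fonts]
    test = [s for s in samples if (s[0], s[1]) in test_fonts]
    return train, test
```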
We applied the RWTH OCR system to both the USPS and Latin databases.
For the USPS database, we get an error rate of 4.38%, which is very good considering that USPS is regarded as a rather hard dataset for handwriting recognition.
On the Latin database, we obtain an error rate of 9.5%. This is also very good, considering that we have many more classes but even fewer samples. Moreover, handwriting-like scripts such as Informal Roman can be difficult to decipher at very small sizes, even for a human.
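To put these percentages in absolute terms, the error rate can be converted into a number of misclassified samples with a simple back-of-the-envelope check (only the USPS test-set size is restated here):

```python
def error_count(error_rate, n_samples):
    """Approximate number of misclassified samples implied by an error rate."""
    return round(error_rate * n_samples)

# 4.38% on the 2007 USPS test samples is roughly 88 misclassified digits.
```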
The next step
Now that we know how to train on a database in the same format as the USPS one, the next stage of this project is to train the RWTH system on lines of text, exploiting the specificity of the Hidden Markov Models used for classification. For this, we are given 72 pages of a handwritten document, split into lines with their transcriptions.
The first main challenge of this step will be building the corpus. We will have to adapt to the new format of lines (rather than segmented letters), which means taking the time to write scripts that create the corpus files and determine which symbols are used (punctuation, but also characters like ‘ç’ or ‘é’ that are present in some languages). Among other tasks, we will also have to split the lines into characters, so that the system has the data required for training.
The other challenges we might encounter are less predictable: because of the different nature of this test, different problems are to be expected. Moreover, the difficulty of the database will be much greater, since special characters and punctuation are now included, the writing is genuinely handwritten, and some letters occur rarely (the letter z, for instance, is expected to appear only once or twice).
We will also have to understand exactly which features the system uses, so that we can add new ones or better tune the existing ones.
We will also have to identify where the system currently fails to recognize characters properly, so that we can tune it to improve the situation. On the synthetic Latin database, we assume that the errors occur on the smaller and more slanted fonts, but we will have to verify this carefully.
Since the last blog post, we have spent a lot of time building the scripts we will use in the next steps, and doing some preliminary testing. We are now going to move on to real datasets to see what the RWTH OCR system can achieve on real historical documents.
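The symbol-discovery step can be sketched as follows (a minimal illustration; the function name is ours, not part of the RWTH toolkit):

```python
def symbol_inventory(transcriptions):
    """Collect every distinct character occurring in the line
    transcriptions, so that accented letters such as 'ç' or 'é'
    are not missed when building the lexicon."""
    symbols = set()
    for line in transcriptions:
        symbols.update(line)
    symbols.discard(" ")  # whitespace separates words, not a symbol class
    return sorted(symbols)
```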