Handwritten text recognition using Hidden Markov Models: The awakening of the Python scripts – project report III

A little reminder

In the last blog post, we explained the different steps for installing the RWTH framework, from setting up a virtual machine with the right Linux distribution to handling the database.

Then, we presented how to build a corpus for character recognition from one's own database. We had already applied the system to a synthetic set of letters (more precisely, the set of the 52 upper- and lowercase letters), and we are now trying it on a different database, composed no longer of single letters but of full lines of an already transcribed Italian document.

Building this corpus is far more challenging than anything we have done so far, and we report here the progress made on that front since the last blog post.

Methodology

Objectives

Our current aim is to build the corpus. This involves different tasks:

  • Create a dictionary containing every distinct character that appears in the database and the associated file. (This must be done carefully, as some characters have several possible binary encodings.)
  • Create the associated lexicons, containing the different characters used in the documents
  • Create the associated language model (Hidden Markov Model states) given the characters used
  • Choose the training and testing sets
  • Create the corpus file (i.e. the file that contains the transcription of each line, formatted in a specific way)

We are using a Python script to generate all these files. Python is particularly well suited for this task, as it natively handles the Unicode standard, which covers all the characters we may encounter. With Python, it is easy to make sure that the same character is always encoded in the same way, which is an important point for us, especially for the quality of the training of the HMM model.
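To give an idea of what this looks like in practice, here is a minimal sketch (not our final script) using Python's standard unicodedata module; the example character "ç" is the one mentioned below, and NFC composition is just one possible normalization choice.

# -*- coding: utf-8 -*-
# Minimal sketch: normalize a line of text so that visually identical
# characters always end up with the same code points (NFC composition).
import unicodedata

def normalize_line(line):
    # "ç" can be stored as a single code point (U+00E7) or as "c" followed by
    # a combining cedilla (U+0063 U+0327); NFC collapses both into one form.
    return unicodedata.normalize("NFC", line)

# Both spellings of "ça" become identical after normalization:
assert normalize_line(u"c\u0327a") == normalize_line(u"\u00e7a")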

Shape of the database

The database we have (figuratively) in our hands is composed of images in PNG format and several .txt files which contain the transcriptions, page by page.

We will thus have to process these .txt files in order to build the corpus.

Preliminary steps

As a matter of fact, we noticed that the text files did not all have the same encoding: some were encoded in UTF-8, some were not. This prevented us from building the dictionary correctly, as we could not handle special characters such as "ç", whose binary encoding differs from one format to another.

Thus, our first step was to write a bash script to convert all the text files to UTF-8 and get rid of this problem.

The script uses the iconv program to save each *.txt file as a new *.txt file encoded in UTF-8.

To detect the original encoding of the document to re-encode, we parse the output of the file -bi command.

Here is the full code of this short but very useful script:

for file in *.txt; do
  # detect the original encoding from the output of "file -bi"
  encoding=$(file -bi "$file" | sed 's/text\/plain; charset=//')
  iconv -f "$encoding" -t utf-8 "$file" -o "${file%.txt}.utf8.txt"
done

Generation of the Lexicon and Language Model

Once our files are encoded in UTF-8, we can use them to determine the set of characters used in the corpus, in order to generate the lexicon and language model files.

For that, we have started working on a Python script. This script first normalizes the documents so that each character is always represented in the same way across the whole corpus.

We then generate the lexicon files, in which we store the list of phonemes (here, the phonemes are characters) used in the corpus, and also add the special phonemes that, for example, mark the beginning and end of sentences or represent spaces between words.
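As an illustration, the character-collection part of that script could look like the sketch below. The special token names are placeholders, and the output is a plain list rather than the exact lexicon syntax expected by the RWTH framework.

# -*- coding: utf-8 -*-
# Sketch of the character-collection step: gather every character used in the
# UTF-8 transcriptions and write them out together with placeholder special
# "phonemes". The real lexicon file for the RWTH framework has its own syntax,
# which is not reproduced here.
import codecs
import glob
import unicodedata

SPECIAL_TOKENS = ["[SENTENCE-BEGIN]", "[SENTENCE-END]", "[WHITESPACE]"]  # placeholder names

charset = set()
for path in glob.glob("*.utf8.txt"):
    with codecs.open(path, "r", encoding="utf-8") as f:
        for line in f:
            for ch in unicodedata.normalize("NFC", line.strip()):
                if not ch.isspace():  # spaces are handled by the special token
                    charset.add(ch)

with codecs.open("charset.txt", "w", encoding="utf-8") as out:
    for token in SPECIAL_TOKENS:
        out.write(token + "\n")
    for ch in sorted(charset):
        out.write(ch + "\n")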

We still need to generate the associated language model file. At first, we won’t really generate a proper language model, but only consider a pseudo-language model in which all the transitions are uniformly distributed.

We may then generate a language model based on 2-grams, or 1-grams if the results are not good. (But we don’t know yet if it is a good idea, as the training data may not be sufficient to generate a good language model.)
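Assuming the toolkit accepts a language model in the standard ARPA format (something we still need to verify), the uniform pseudo-model is straightforward to write out; the sketch below gives every token the same probability.

# -*- coding: utf-8 -*-
# Sketch: write a "pseudo" language model in which every token is equally
# probable, as a 1-gram model in the standard ARPA format. Whether this exact
# format is what our installation expects still has to be checked.
import codecs
import math

def write_uniform_arpa(tokens, path):
    logprob = math.log10(1.0 / len(tokens))  # same log10 probability for every token
    with codecs.open(path, "w", encoding="utf-8") as out:
        out.write("\\data\\\n")
        out.write("ngram 1=%d\n\n" % len(tokens))
        out.write("\\1-grams:\n")
        for token in tokens:
            out.write("%.6f\t%s\n" % (logprob, token))
        out.write("\n\\end\\\n")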

Generation of the training and testing set

After having generated the lexicon and language model, we will have to separate our data into training and testing datasets.

This will let us measure how the algorithm performs and guide us in deciding what to tune.
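A simple way to perform this split is sketched below; the 80/20 ratio, the file locations and the fixed random seed are arbitrary illustration choices, not final decisions.

# -*- coding: utf-8 -*-
# Sketch: split the line images into a training and a testing set.
import glob
import random

lines = sorted(glob.glob("lines/*.png"))  # hypothetical location of the line images
random.seed(42)                           # fixed seed so the split is reproducible
random.shuffle(lines)

cut = int(0.8 * len(lines))
train_set, test_set = lines[:cut], lines[cut:]

with open("train.list", "w") as f:
    f.write("\n".join(train_set) + "\n")
with open("test.list", "w") as f:
    f.write("\n".join(test_set) + "\n")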

Current advancement and future plans

The construction of the new database is currently a work in progress. For now, we have managed to extract the dictionary of letters (in a single, consistent encoding) and integrated it into the lexicon file. There, the characters are described with their original symbols as they appear in the data. We may want to represent some of these characters with other symbols if the RWTH-OCR package cannot handle them correctly; we will apply this correction, if necessary, after our initial tests.

At this point, we still need to generate the language model and the training and testing corpus. The chosen language model should not be very difficult to generate: since a simple model has been chosen, we already have all the information needed to produce the language model file. The simple equiprobable model (or perhaps unigrams) has been chosen because the language under consideration is not in its modern form, so we have very little statistical information about it.

As for the generation of the training and testing sets, we will need to split the database into two parts and, more importantly, break every line of the transcription down into what the system calls "phonemes", i.e. the basic digital representation of a character in the system.
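To make that last point concrete, here is a small sketch of how one transcription line could be broken down into such character tokens; the whitespace token name is a placeholder, and the real corpus file follows the framework's own formatting, which is not reproduced here.

# -*- coding: utf-8 -*-
# Sketch: break one transcription line down into a sequence of character
# "phonemes", with whitespace mapped to a placeholder special token.
import unicodedata

def line_to_phonemes(line, space_token="[WHITESPACE]"):
    phonemes = []
    for ch in unicodedata.normalize("NFC", line.strip()):
        phonemes.append(space_token if ch.isspace() else ch)
    return phonemes

# Example: line_to_phonemes(u"la prova")
# -> ['l', 'a', '[WHITESPACE]', 'p', 'r', 'o', 'v', 'a']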

In conclusion, we are actively continuing the project and hope to soon obtain results as promising as the ones obtained with the character database, even though this problem is considerably more complex.