In our last post we explained some of the difficulties we encountered with our dataset. The next step towards decrypting these documents is to digitise them so that we can analyse them.
Several features of our dataset complicate the digitisation process.
Our dataset probably contains enough data for a frequency analysis of the texts, but it is too small to train an OCR system. The ambassador had assistants, and his letters were written by at least three different people, which means at least three different handwritings. With so little material per writer, it is impractical to train an OCR system to recognise each handwriting style. Then there is the problem of the paper itself: it is old and has become discoloured and translucent, so text written on the reverse side shows through, and it could be difficult for a computer to tell which side a given stroke belongs to. Finally, there is the difficulty of Venetian calligraphy. While the handwriting is readable and the letters can be distinguished easily, there are multiple ways of writing the same letter in Venetian, depending on the letter that follows or on other reasons unknown to us.
For all these reasons, we decided to digitise the data by hand. Since the amount of data is huge, we are going to crowdsource the digitisation process.
To do this, we will binarize the documents, segment the ciphered data into smaller pieces, and then present each participant with a reasonably sized piece of ciphertext together with a table mapping ciphertext symbols to ASCII characters (most of the ciphertext symbols have no ASCII equivalent, and the ciphertext comes in different forms, e.g. text written as an “exponent / tag” of another text). Each participant transcribes a set of segments. For verification, we make the pools of segments given to different participants overlap, so that the same segment is transcribed more than once and the transcriptions can be compared. We will advertise our crowdsourcing event through posters at EPFL and through social media. The participants will receive a small reward to motivate and thank them for their participation, for example a piece of chocolate.
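The overlapping pools can be sketched as follows. This is only an illustration under our own assumptions: the function names `assign_segments` and `disputed_segments`, the round-robin scheme and the default of two copies per segment are hypothetical and do not describe the final setup.

```python
def assign_segments(segment_ids, n_participants, copies=2):
    """Give every segment to `copies` distinct participants (round-robin).

    Hypothetical helper: returns one list of segment ids per participant.
    """
    if not 1 <= copies <= n_participants:
        raise ValueError("copies must be between 1 and n_participants")
    pools = [[] for _ in range(n_participants)]
    for i, seg in enumerate(segment_ids):
        # Consecutive participants modulo n are distinct while copies <= n.
        for k in range(copies):
            pools[(i + k) % n_participants].append(seg)
    return pools


def disputed_segments(transcriptions):
    """Return the ids of segments whose transcriptions do not all agree.

    `transcriptions` maps a segment id to the list of strings typed by the
    participants who received that segment.
    """
    return sorted(
        seg for seg, texts in transcriptions.items() if len(set(texts)) > 1
    )


if __name__ == "__main__":
    pools = assign_segments(["s1", "s2", "s3", "s4"], n_participants=3)
    for participant, pool in enumerate(pools):
        print(participant, pool)
    print(disputed_segments({"s1": ["12 7", "12 7"], "s2": ["4a", "4o"]}))
```

Segments listed by `disputed_segments` would then need a further look, for instance by giving them to an additional participant.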
Currently we have a working solution for the user interaction. We use a Google Sheets document in combination with Google Apps Script to produce a personalised sheet for each participant. This sheet contains the image to transcribe, chosen from the database of extracts; the criterion for choosing an image is that nobody has already worked on it or is currently working on it. The sheet also contains the text fields for the transcription and some other fields for user data. Having an Excel-like sheet is useful because one of the ciphers contains both superscript letters and normal letters: a table is an easy way to fill in this information and to process it later.
Once the user has finished the transcription, the data will be saved to our database and the sheet will be deleted.
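The logic behind the personalised sheet can be sketched in a few lines. Our actual implementation lives in Google Apps Script; the Python below is only an illustration, and the names `pick_image` and `merge_rows`, the three status values and the two-row layout (superscripts in the upper row, normal letters in the lower row) are assumptions made for the example.

```python
def pick_image(status):
    """Return the first image nobody has worked on and mark it as taken.

    `status` maps an image name to "free", "in_progress" or "done"
    (hypothetical values). Returns None when no image is left.
    """
    for name, state in status.items():
        if state == "free":
            status[name] = "in_progress"
            return name
    return None


def merge_rows(superscripts, bases):
    """Pair each normal letter with the superscript typed in the cell above it.

    Both rows must have the same number of cells; an empty string means the
    cell was left blank. Columns where both cells are blank are skipped.
    """
    if len(superscripts) != len(bases):
        raise ValueError("rows must have the same length")
    return [(b, s) for s, b in zip(superscripts, bases) if b or s]


if __name__ == "__main__":
    status = {"page1_seg1.png": "done", "page1_seg2.png": "free"}
    print(pick_image(status))   # page1_seg2.png
    print(merge_rows(["", "a", ""], ["12", "7", "30"]))
```

Storing each column as a (letter, superscript) pair keeps the two kinds of symbols separate, which should make later processing simpler.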
While the interaction part is done, we still need to build and extract the database.
On the more technical side, binarization is going smoothly using Ocropy, but segmentation is proving harder to achieve. We are trying out new software and writing new code for segmentation. In parallel we’re building a code library for decryption, namely frequency analysis, Kasiski examination and Kerckhoffs’ method.
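To give an idea of what the first two tools do, here is a minimal sketch of frequency analysis and Kasiski examination. It is not our library: the function names, the trigram length and the `max_len` bound are assumptions chosen for the example, and the input can be any sequence of transcribed symbols (a string or a list).

```python
from collections import Counter, defaultdict


def symbol_frequencies(symbols):
    """Relative frequency of each symbol, most frequent first."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.most_common()}


def kasiski_distances(symbols, n=3):
    """Distances between consecutive occurrences of every repeated n-gram."""
    positions = defaultdict(list)
    for i in range(len(symbols) - n + 1):
        positions[tuple(symbols[i:i + n])].append(i)
    distances = []
    for pos in positions.values():
        distances.extend(b - a for a, b in zip(pos, pos[1:]))
    return distances


def key_length_votes(distances, max_len=20):
    """Count, for each candidate key length, how many distances it divides.

    With a periodic polyalphabetic cipher, the true key length tends to
    divide many of the distances between repeated n-grams.
    """
    votes = Counter()
    for d in distances:
        for k in range(2, max_len + 1):
            if d % k == 0:
                votes[k] += 1
    return votes.most_common()


if __name__ == "__main__":
    text = "ABCDEFABCGHIABC"
    print(symbol_frequencies(text))
    print(key_length_votes(kasiski_distances(text)))
```

Kasiski examination only makes sense if the cipher turns out to be periodic, which we do not know yet; the frequency counts are useful in either case.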