In our previous post we briefly sketched our plan to digitize the encrypted documents. This time we will report on our progress, a few minor adjustments to the plan, and the difficulties we encountered along the way.
As described in the previous post, we need to binarize the documents in order to obtain a good segmentation result. We chose the OCRopus software suite for the job. OCRopus is a collection of Python-based tools for document analysis, and it implements Otsu's method for binarization. Otsu's method automatically performs clustering-based image thresholding, i.e. the reduction of a graylevel image to a binary image.
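To illustrate the idea behind Otsu's method (this is a minimal NumPy sketch of the classic algorithm, not the OCRopus implementation itself): the threshold is chosen so that the variance between the "ink" and "paper" pixel classes is maximized.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold maximizing between-class variance (Otsu's method).

    `gray` is a 2-D uint8 array; pixels <= threshold count as background."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    # Cumulative pixel counts and intensity sums for every candidate threshold.
    weight_bg = np.cumsum(hist)
    weight_fg = total - weight_bg
    cum_sum = np.cumsum(hist * np.arange(256))
    mean_bg = cum_sum / np.maximum(weight_bg, 1)
    mean_fg = (cum_sum[-1] - cum_sum) / np.maximum(weight_fg, 1)
    # Between-class variance for each candidate threshold t.
    between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
    return int(np.argmax(between))

def binarize(gray):
    """Reduce a graylevel image to a black-and-white one."""
    return (gray > otsu_threshold(gray)).astype(np.uint8) * 255
```

On a scanned page with dark ink on light paper, the histogram is roughly bimodal, and the chosen threshold falls in the valley between the two modes.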
The results we obtained are satisfactory. The text remains readable, and the transparency issue (text from the next/back page showing through, caused by paper aging and ink infiltrating the page) did not greatly affect the quality of the outcome.
We used the OCRopus suite for the line segmentation of the documents as well. It did a great job of identifying clusters of text and separating them from empty (non-text) areas. Both stages (binarization and segmentation) run reasonably fast and are well suited to large datasets, but a bash-style script is required to integrate and automate the whole pipeline and to manage the outputs efficiently. Below are extracts from the two segmented ciphers.
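The two-stage pipeline can be automated along these lines. This is only a sketch: it assumes the standard OCRopus command-line tools (`ocropus-nlbin` for binarization, `ocropus-gpageseg` for line segmentation), and the `scans/` and `out/` directory names are hypothetical.

```shell
#!/bin/bash
# Sketch: binarize each scanned page, then segment it into text lines.
set -e

mkdir -p out
for page in scans/*.png; do
    name=$(basename "$page" .png)
    # 1. Binarization/normalization of the grayscale scan.
    ocropus-nlbin "$page" -o "out/$name"
    # 2. Line segmentation of the binarized page.
    ocropus-gpageseg "out/$name"/????.bin.png
done
```

Each page ends up with its own output directory containing one image per text line, which keeps large batches manageable.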
For good-quality pages, the binarization and the segmentation are really clean: we got very few artifacts and little noise from page transparency.
Image p4 contains more nonsense data due to the quality of the binarized image. This is not a real issue, however: we can hand-pick and discard the bad images, and we believe this will not affect our study of the digitized text. Several artifacts from page transparency:
3. OCR (An attempt using OCRopus)
Since there is no training set or dictionary available for ancient Venetian, and because our datasets are rather small, all our attempts at character recognition have been unsuccessful.
We have also performed a frequency analysis on a very large dataset: more than 600 pages from a diary written between 1496 and 1533.
Such an analysis produces the following results:
As the results show, the frequencies vary somewhat with the type of text and the time period.
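A minimal sketch of this kind of character-frequency analysis (plain Python; it assumes the transcription is available as raw text):

```python
from collections import Counter

def char_frequencies(text):
    """Relative frequency of each letter in `text`, case-insensitive,
    ignoring punctuation, digits and whitespace."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {c: n / total for c, n in counts.most_common()}
```

Comparing such frequency tables across texts and periods is what reveals the variations mentioned above, and the same profile can later be matched against cipher-symbol frequencies.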
5. Crowdsourcing (Amazon)
In our previous post we planned to crowdsource the digitization work to EPFL students, and we did create a Google spreadsheet that embeds basic functions to get the job done. However, considering that there are more undigitized encrypted documents out there in the archives, we think it would be better to provide a working method that can also serve similar projects in the future.
After doing some research, we found that the Amazon Mechanical Turk service suits our needs. MTurk is a crowdsourcing platform/marketplace for companies that want a large number of HITs done in a short period of time. HIT stands for Human Intelligence Task: a task that is currently difficult for computers to do. Amazon claims that enough workforce is available 24/7, that jobs get done in minutes, and that customers (us) only pay when satisfied with the results.
We will try to refine our text segmentation results and then build a good lookup table (cipher character to ASCII). Next, we will find out exactly what services Amazon MTurk provides, including all the details and regulations. If everything turns out fine, we will try to get approval from the professor and run the crowdsourcing task on Amazon.
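As a sketch of what such a lookup table could look like (all glyph labels and their ASCII mappings below are hypothetical, for illustration only):

```python
# Hypothetical lookup table: labels assigned to cipher glyphs during
# transcription, mapped to ASCII placeholders for later analysis.
CIPHER_TO_ASCII = {
    "glyph_01": "A",
    "glyph_02": "B",
    "glyph_03": "C",
}

def decode(labels, table=CIPHER_TO_ASCII):
    """Translate a sequence of glyph labels into an ASCII string.

    Unknown glyphs stay visible as '?' so they can be reviewed by hand."""
    return "".join(table.get(label, "?") for label in labels)
```

Keeping the cipher text in ASCII this way makes it easy to feed the transcriptions into standard frequency-analysis tools.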
We also want to obtain a useful document concerning the encryption methods and techniques used in this period. Sadly, it is only available on paper at the French national library; we are planning to retrieve it and are currently getting in contact with the institution.
Chr. Villain-Gandolfi, “Les dépêches chiffrées de Vettore Bragadin, baile de Constantinople”, in Turcica, IX/2-X (1978), pp. 56-106.
Team 1: