Venetian Cryptography 1: diplomatic documents deciphering — Final Report


Our project is based on a corpus of manuscript letters written between 1565 and 1566 by Vettore Bragadin, the Venetian ambassador to Constantinople. The letters were given to us as raw photographs. Part of the data is clear text; the rest is encrypted with two distinct ciphers. In this poster, we present our methodology for unveiling the secrets of Venetian diplomacy.

During the 16th century, the Republic of Venice was a wealthy city deeply connected to the Mediterranean world through its strong naval fleet. At that time, the growing strength of the Ottoman Empire, illustrated by the fall of Constantinople in 1453, unsettled the geopolitical context around the Mediterranean Sea. Venice nevertheless maintained its commercial relationships with both the Turks and the Western world. In light of these facts, secure and private communication between embassies became increasingly important for the prosperity of Venice.

Contrary to what one might expect, deciphering is neither the most difficult nor the most time-consuming part of the project, since all the cipher methods of that period have been successfully broken. Transcribing the documents turned out to be the most complex part because of the diversity of handwriting and the limited amount of data available.

Cipher techniques

Polyalphabetic cipher

Alphabetic encryption methods rely on two main components: the cipher alphabet and/or the key. Polyalphabetic encryption increases complexity by changing the table during encryption. In this case, the table mapping plaintext to ciphertext characters is changed at a fixed frequency f (every x characters), and the key is repeated until the whole text is enciphered. This method was thought to be safe because the ciphertext hides the true letter frequencies and changing the key makes earlier attacks irrelevant.
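The table-switching idea can be sketched in a few lines. This is only an illustration under stated assumptions: the 26-letter Latin alphabet, the use of simple shifted alphabets as tables, and the names `shifted_table` and `encrypt_periodic` are all hypothetical and are not taken from the Bragadin letters.

```python
import string

# Assumption: a 26-letter Latin alphabet; the historical alphabet was smaller.
ALPHABET = string.ascii_lowercase


def shifted_table(shift):
    """Return one substitution table (plaintext letter -> cipher letter)."""
    rotated = ALPHABET[shift:] + ALPHABET[:shift]
    return dict(zip(ALPHABET, rotated))


def encrypt_periodic(plaintext, shifts, period):
    """Encrypt with a table that changes every `period` letters.

    `shifts` plays the role of the key: it lists the tables in order and is
    reused cyclically until the whole text is enciphered.
    """
    out = []
    letters_seen = 0
    for ch in plaintext.lower():
        if ch not in ALPHABET:
            out.append(ch)  # spaces and punctuation are left untouched
            continue
        table_index = (letters_seen // period) % len(shifts)
        out.append(shifted_table(shifts[table_index])[ch])
        letters_seen += 1
    return "".join(out)


print(encrypt_periodic("aaaaaa", [1, 2], period=2))
```

With `period=1` the table changes after every letter, which corresponds to the Vigenère-style behaviour; larger periods keep the same table for several consecutive characters.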

During a preparatory analysis based purely on historical data, we inferred that instances of the Alberti cipher, composed of a disk and a key, would be encountered. When we examined the data, it emerged that the encryption method used is more likely an improved version of the Trithemius cipher (very close to the Alberti cipher) proposed by Giovan Battista Bellaso and known as the Vigenère cipher, or some other closely related form of running key cipher. In this case, the key changes after each encrypted letter, but it seems that the key, or partial information about it, is also encoded in the text. Today running key ciphers are no longer secure, and numerous methods can be applied to break this encryption.
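One of the classical methods applicable to a cipher whose key repeats is the index of coincidence, which helps estimate the key length. The sketch below is a generic textbook technique, not the procedure used in this project, and it only helps if the key actually repeats with some period; the function names are hypothetical.

```python
from collections import Counter


def index_of_coincidence(text):
    """Probability that two letters drawn at random from `text` are equal."""
    letters = [c for c in text.lower() if c.isalpha()]
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))


def average_ic_for_period(ciphertext, period):
    """Split the ciphertext into `period` columns and average their IC.

    For the correct key length each column is enciphered with a single
    alphabet, so its IC should be close to that of the plaintext language.
    """
    letters = [c for c in ciphertext.lower() if c.isalpha()]
    columns = ["".join(letters[i::period]) for i in range(period)]
    return sum(index_of_coincidence(col) for col in columns) / period


# Toy ciphertext with an obvious period of 2.
for period in (1, 2):
    print(period, average_ic_for_period("abababab", period))
```

In practice one would compute this value for a range of candidate periods and look for the periods where it rises towards the value measured on plaintext of the target language.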

Homophonic cipher

A less complex form of alphabetic cipher that we encountered is the homophonic cipher, which maps each plaintext letter to several cipher characters or symbols. Doing so evens out the letter frequencies in the ciphertext, making frequency analysis hard to exploit. Homophonic ciphers were used extensively in diplomatic communications with the help of nomenclators, that is, books gathering character encoding tables. Homophonic ciphers have been systematically broken since the mid-sixteenth century and are now a trivial cryptanalysis task thanks to computational power. A set of letters in our dataset is very likely to have been encrypted with this method.
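A minimal sketch of a homophonic substitution follows. The table `HOMOPHONES` is an invented fragment for illustration only (it is not a historical nomenclator), and the two-digit symbols are an assumption; the point is that frequent letters receive more symbols, so each symbol appears less often.

```python
import random

# Hypothetical table: frequent letters get several interchangeable symbols.
HOMOPHONES = {
    "a": ["11", "27", "53"],
    "e": ["08", "19", "42"],
    "i": ["15", "33"],
    "n": ["21"],
    "v": ["64"],
    "z": ["70"],
}

# Inverse table used for deciphering: every symbol maps back to one letter.
SYMBOL_TO_LETTER = {
    symbol: letter
    for letter, symbols in HOMOPHONES.items()
    for symbol in symbols
}


def encrypt(plaintext, rng):
    """Replace each letter by one of its symbols, chosen at random."""
    return [rng.choice(HOMOPHONES[ch]) for ch in plaintext]


def decrypt(symbols):
    """Map each symbol back to its unique plaintext letter."""
    return "".join(SYMBOL_TO_LETTER[s] for s in symbols)


ciphertext = encrypt("venezia", random.Random(0))
print(ciphertext, decrypt(ciphertext))
```

Deciphering is unambiguous for the key holder because the inverse table is a plain one-to-one lookup, while an attacker sees the counts of "e" spread over three different symbols.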


Metadata gathering is conducted first to identify the cryptosystem at hand. Depending on which combination of the cipher's components is available, the problem is approached accordingly.
Possible metadata:
  • Historical context
  • Cryptographic systems at that time
  • Plaintext and ciphertext pair
  • Previous reports on subject

Frequency analysis

The first dimension of our project was to gather statistical data about the ancient Venetian language, a task that had not been performed before, mainly because digitized ancient Venetian texts are rare and difficult to access. Hence we ran frequency analysis over various sources: embassy documents from 1498, diary extracts from 1519, and a substantial 600-page diary written between 1496 and 1533.
This mix of document types and periods is meant to provide a solid basis of statistical data.
Frequency analysis is only useful for alphabetic substitution ciphers. The frequency of the letters in the ciphertext is compared to the frequency typical of Venetian, and letter correspondences are found.
Statistical data on bigrams and trigrams (combinations of 2 and 3 letters) is also very important for breaking more convoluted ciphers. We use the Google API datasets for n-grams in Italian as a basis.
Figure 1: Frequency analysis of the 1496-1533 diary
Figure 2: Frequency analysis of the 1519 diary
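The statistics described in this section can be sketched as follows. This is an assumption-laden illustration, not our actual pipeline: the function names are hypothetical, the rank-matching step is the simplest possible way to propose letter correspondences, and n-grams are counted across word boundaries for brevity.

```python
from collections import Counter


def letter_frequencies(text):
    """Relative frequency of each letter in `text` (case-insensitive)."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    return {letter: count / len(letters) for letter, count in counts.items()}


def ngram_counts(text, n):
    """Count n-grams of letters; spaces and punctuation are dropped first."""
    letters = "".join(c for c in text.lower() if c.isalpha())
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))


def guess_substitution(ciphertext, reference_text):
    """Pair cipher letters with reference letters of the same frequency rank."""
    def ranked(text):
        freqs = letter_frequencies(text)
        # Sort by decreasing frequency, ties broken alphabetically.
        return sorted(freqs, key=lambda letter: (-freqs[letter], letter))

    return dict(zip(ranked(ciphertext), ranked(reference_text)))


print(guess_substitution("xxxyyz", "aaabbc"))
print(ngram_counts("la la", 2).most_common(1))
```

On real data the rank-based guess is only a starting point: letters with similar frequencies are easily swapped, which is where bigram and trigram statistics help to correct the mapping.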




Automated character recognition was not fruitful because several handwritings appear in relatively small datasets. Transcription complexity is further increased by page quality and transparency and by the polymorphic nature of Venetian calligraphy. For example, c is written like [tʃ] before e or i, and [k] elsewhere. Character recognition is therefore done using a crowdsourcing approach, and the dataset needs to be conditioned accordingly.

Figure 3: Sample of clear text. The text on the backside is clearly visible


Binarization is used to filter out all the noise and defects of the picture to obtain black on white text.


Binarized scans are then split line by line to be used as crowdsourcing material. This task consists of classifying clustered text regions and separating them from empty (non-text) areas in order to obtain lines of text.
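A common way to separate text lines from empty areas is a horizontal projection profile: rows containing ink belong to a line, and empty rows separate lines. The sketch below illustrates that idea only; it is not the algorithm implemented in the software we used, and the function name, the 0/1 image encoding and the `min_ink` parameter are assumptions.

```python
def split_lines(binary_image, min_ink=1):
    """Find text lines in a binarized page.

    `binary_image` is a list of rows, each a list of 0 (paper) or 1 (ink).
    Returns a list of (first_row, end_row) pairs, with end_row exclusive.
    """
    lines = []
    start = None
    for y, row in enumerate(binary_image):
        has_text = sum(row) >= min_ink
        if has_text and start is None:
            start = y  # a new text line begins
        elif not has_text and start is not None:
            lines.append((start, y))  # the current line ends on an empty row
            start = None
    if start is not None:
        lines.append((start, len(binary_image)))
    return lines


page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
print(split_lines(page))
```

Raising `min_ink` makes the split more tolerant of isolated specks left over from binarization, at the risk of cutting off faint strokes.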


Since OCR is not applicable and the dataset is too large for single-handed transcription, crowdsourcing is a suitable approach to digitization. The task assigned to participants consists of transcribing a single line of the original document. This line, obtained through the binarization/segmentation workflow, is cut out of the whole text. In effect, the participants classify the manuscript characters by mapping each one to the letter it may represent. Where the alphabet changes, they use a visual keyboard that maps the symbols to their ASCII representation. Although crowdsourcing is essential, it raises the usual challenges of reliability and redundancy, and the problem of optimizing the cost/quality ratio. This subject has been studied extensively, both statistically and algorithmically. One could buy absolute reliability at great financial cost; in our case, however, the size of the dataset makes optimizing the reliability/cost tradeoff compulsory. To find appropriate solutions, we use the results presented in the paper Efficient Crowdsourcing for Multi-class Labeling by David R. Karger and Devavrat Shah (Massachusetts Institute of Technology) and Sewoong Oh (University of Illinois).
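To make the redundancy/reliability question concrete, here is a baseline sketch: each line is transcribed by several participants and the most frequent answer wins. This is plain majority voting, a deliberately simple stand-in and not the algorithm of the cited paper; the function names, the data layout and the `min_agreement` threshold are assumptions.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer and the share of workers who gave it."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def aggregate(transcriptions, min_agreement=0.5):
    """Accept a line when strictly more than `min_agreement` of workers agree.

    `transcriptions` maps a line identifier to the list of strings submitted
    by different workers. Lines without sufficient agreement are returned
    separately so that they can be sent out again or checked by hand.
    """
    accepted, needs_review = {}, []
    for line_id, answers in transcriptions.items():
        label, agreement = majority_vote(answers)
        if agreement > min_agreement:
            accepted[line_id] = label
        else:
            needs_review.append(line_id)
    return accepted, needs_review


submissions = {
    "line-1": ["ser", "ser", "sev"],
    "line-2": ["a", "b"],
}
print(aggregate(submissions))
```

The cost side of the tradeoff appears directly in the length of each answer list: every extra worker per line raises the price, so only the lines in `needs_review` would be worth assigning again.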

Digitization results


This is what a portion of the transcription should look like, compared to an extract of the original document:

Figure 4: Original of the polyalphabetic cipher
Figure 5: Example of possible transcription



The OCRopus software is used to automatically perform clustering-based image thresholding, that is, the reduction of a gray-level image to a binary image. Transparency (the text showing through from the back side) is significantly reduced.
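Otsu's method is a well-known example of clustering-based thresholding, and a pure-Python sketch of it is given here to show the principle. It is an illustration only: it is not presented as what OCRopus does internally, and the function names and the 0 to 255 gray-level encoding are assumptions.

```python
def otsu_threshold(pixels):
    """Pick the gray level that best separates dark and light pixels.

    `pixels` is an iterable of integers in 0..255. The returned threshold t
    maximizes the between-class variance of the groups <= t and > t.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_between = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for t in range(256):
        weight_dark += hist[t]
        sum_dark += t * hist[t]
        weight_light = total - weight_dark
        if weight_dark == 0:
            continue  # no dark pixels yet
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        between = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t


def binarize(image, threshold):
    """Map a gray-level image (list of rows) to 1 = ink, 0 = paper."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


gray = [[10, 200], [205, 12]]
t = otsu_threshold(p for row in gray for p in row)
print(t, binarize(gray, t))
```

Faint text showing through from the back of the page is lighter than the ink on the front, so a threshold placed between the two groups tends to send it to the paper class.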


The segmentation was again performed with the OCRopus software. This binarization/segmentation workflow yields good results, with very few artifacts, which can be visually inspected and excluded.

Figure 6: Polyalphabetic (line 1) and monoalphabetic cipher (line 2) binarized and segmented
Figure 7: Example of segmented word



We first tried automated character recognition, but the results were unusable. For instance, the automated transcription obtained for the first line in Fig. 6 is

– P’se-P’F 6a’?e N/ea’erMAir4sf t’s4ytl

For our project, we first built a working prototype of a crowdsourcing platform using APIs from Google Docs and Google Scripts. However, it was not very user-friendly and not suited to collecting input from hundreds of people working on the subject. We then looked at professional crowdsourcing platforms and found a suitable one: Amazon Mechanical Turk, which offers a large and constant workforce for Human Intelligence Tasks.