Our project is based on a corpus of manuscript letters from Vettore Bragadin, the Venetian ambassador to Constantinople, written between 1565 and 1566. The letters were provided to us as raw photographs. Part of the material is cleartext; the rest is written in two distinct ciphers. In this poster, we present our methodology for unveiling the secrets of Venetian diplomacy.
During the sixteenth century, the Republic of Venice was a wealthy city deeply connected to the Mediterranean world through its strong naval fleet. At that time, the growing strength of the Ottoman Empire, illustrated by the fall of Constantinople in 1453, unsettled the geopolitical balance around the Mediterranean Sea. Nevertheless, Venice maintained its commercial relationships with both the Turks and the Western world. In this setting, the ability to communicate securely and privately between embassies became increasingly important for the prosperity of Venice.
Contrary to what one might expect, the deciphering part of the project is neither the most difficult nor the most time-consuming, since all ciphering methods of that era have since been broken. The transcription of the documents turned out to be the most complex part of the project because of the diversity of handwritings and the limited amount of data available.
Alphabetic encryption methods rely on two main components: the cipher alphabet and/or the key. Polyalphabetic encryption increases complexity by changing the substitution table during encryption. In this case, the table mapping plaintext to ciphertext characters is changed at a fixed frequency f (every x characters), and the key is repeated until the whole text is enciphered. This method was thought to be safe because the ciphertext hides the true letter frequencies, and changing the key renders earlier attacks irrelevant.
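The repeated-key mechanism described above can be sketched as follows. This is a minimal, generic Vigenère-style implementation for illustration only; it is not the exact table or key schedule found in the Bragadin letters.

```python
# Minimal sketch of a repeated-key polyalphabetic (Vigenère-style) cipher:
# the key repeats along the plaintext, and each letter is shifted by the
# corresponding key letter, so the same plaintext letter can map to
# different ciphertext letters depending on its position.
import string

ALPHABET = string.ascii_uppercase

def vigenere_encrypt(plaintext: str, key: str) -> str:
    out = []
    for i, ch in enumerate(plaintext):
        p = ALPHABET.index(ch)
        k = ALPHABET.index(key[i % len(key)])  # key repeats until text ends
        out.append(ALPHABET[(p + k) % 26])
    return "".join(out)

def vigenere_decrypt(ciphertext: str, key: str) -> str:
    out = []
    for i, ch in enumerate(ciphertext):
        c = ALPHABET.index(ch)
        k = ALPHABET.index(key[i % len(key)])
        out.append(ALPHABET[(c - k) % 26])
    return "".join(out)
```

With key "LEMON", the classic example "ATTACKATDAWN" enciphers to "LXFOPVEFRNHR": the two A's of the plaintext become L and E because they fall under different key letters.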
A preparatory analysis based purely on historical data suggested that we would encounter instances of the Alberti cipher, composed of a disk and a key. Confronting this expectation with the data, it emerged that the encryption method used is more likely an improved version of the Trithemius cipher (very close to the Alberti cipher) proposed by Giovan Battista Bellaso and known as the Vigenère cipher, or some other closely related running-key cipher. In this case, the key changes after each letter is enciphered, but it seems that the key, or partial information about it, is also encoded in the text. Today, running-key ciphers are neither relevant nor secure, and numerous methods are available to break this kind of encryption.
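As an illustration of why such ciphers are no longer secure, one classical attack estimates the key length: slicing the ciphertext at the true period yields monoalphabetic streams whose index of coincidence is close to that of natural language. This is a simplified sketch of that idea, not the specific procedure we apply to the Bragadin corpus.

```python
# Hedged sketch of key-length estimation for a repeated-key cipher via
# the index of coincidence (IC). Natural-language text has IC around
# 0.065 (English); a wrong period mixes several Caesar alphabets and
# flattens the IC toward the random value of about 0.038.
from collections import Counter

def index_of_coincidence(text: str) -> float:
    n = len(text)
    if n < 2:
        return 0.0
    counts = Counter(text)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

def estimate_key_length(ciphertext: str, max_len: int = 12) -> int:
    # For each candidate period, average the IC over the slices taken at
    # that period; the period with the highest average wins. (Multiples
    # of the true period also score high; ties break toward the smallest.)
    best_len, best_ic = 1, 0.0
    for period in range(1, max_len + 1):
        slices = [ciphertext[i::period] for i in range(period)]
        avg = sum(index_of_coincidence(s) for s in slices) / period
        if avg > best_ic:
            best_ic, best_len = avg, period
    return best_len
```

Once the key length is known, each slice can be attacked independently as a simple Caesar shift using letter-frequency analysis.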
A less complex form of alphabetic cipher we encountered is the homophonic cipher, which maps each plaintext letter to several ciphertext characters or symbols. Doing so homogenizes the letter frequencies in the ciphertext, making frequency analysis hard to exploit. Homophonic ciphers were extensively used in diplomatic communications with the help of nomenclators, books gathering the character-encoding tables. Homophonic ciphers have been systematically broken since the mid-sixteenth century, and breaking them is now a trivial cryptanalysis task thanks to computational power. A subset of the letters in our dataset is very likely to have been encrypted with this method.
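The homophonic principle can be sketched as follows. The table below is invented for demonstration; it is not taken from the Bragadin nomenclators, where the actual symbol assignments differ.

```python
# Illustrative sketch of a homophonic substitution: frequent plaintext
# letters map to several ciphertext symbols chosen at random, so symbol
# frequencies in the ciphertext are flattened and simple frequency
# analysis no longer identifies the common letters.
import random

HOMOPHONES = {
    "E": ["12", "34", "56", "78"],  # common letter: many homophones
    "T": ["21", "43", "65"],
    "A": ["11", "22"],
    "N": ["99"],                    # rare letter: a single symbol
}

def homophonic_encrypt(plaintext, table, rng=random):
    return [rng.choice(table[ch]) for ch in plaintext]

def homophonic_decrypt(symbols, table):
    # Decryption only needs the reverse map: each symbol belongs to
    # exactly one plaintext letter.
    reverse = {sym: ch for ch, homs in table.items() for sym in homs}
    return "".join(reverse[s] for s in symbols)
```

Encryption is many-to-one in reverse: two encryptions of the same word generally produce different symbol sequences, but both decrypt identically.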
- Historical context
- Cryptographic systems at that time
- Plaintext and ciphertext pairs
- Previous reports on the subject
Automated character recognition was not fruitful, as several handwritings appear in a relatively small dataset. Transcription complexity is further increased by the quality and transparency of the pages and by the polymorphic nature of Venetian calligraphy: for example, c is realized as [tʃ] before e or i, and as [k] elsewhere. Character recognition is therefore done using a crowdsourcing approach, and we need to condition the dataset accordingly.
Binarization is used to filter out the noise and defects of the photographs and obtain black-on-white text.
Binarized scans are then split line by line to be used as crowdsourcing material. This step consists of classifying clustered text regions and separating them from empty (non-text) areas in order to obtain individual lines of text.
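The line-splitting step can be sketched with a horizontal projection profile: rows containing ink form bands, and sufficiently long blank bands mark line boundaries. This is a simplified illustration under the assumption of roughly horizontal lines; the OCRopus pipeline we actually use is more robust.

```python
# Hedged sketch of line segmentation by horizontal projection: count the
# ink pixels per row, then treat maximal runs of inked rows (separated by
# blank gaps of at least min_gap rows) as individual text lines.
import numpy as np

def segment_lines(binary: np.ndarray, min_gap: int = 2):
    """binary: 2-D array with 1 = ink, 0 = background.
    Returns a list of (start_row, end_row_exclusive) pairs, one per line."""
    profile = binary.sum(axis=1)          # ink pixels per row
    lines, start, last_ink = [], None, None
    for r, has_ink in enumerate(profile > 0):
        if has_ink:
            if start is None:
                start = r                 # a new line band begins
            last_ink = r
        elif start is not None and r - last_ink >= min_gap:
            lines.append((start, last_ink + 1))   # blank gap closes the line
            start = None
    if start is not None:                 # flush a band touching the bottom
        lines.append((start, last_ink + 1))
    return lines
```

Each returned row range can then be cropped from the binarized scan to produce one crowdsourcing task per line.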
As OCR is not applicable and the dataset is too large for single-handed transcription, crowdsourcing is a suitable approach to digitization. The task assigned to participants consists of transcribing a single line of the original document; this line, obtained through the binarization/segmentation workflow, is cut out of the whole text. In effect, participants classify the manuscript characters by mapping them to the letters they may represent. Where the alphabet changes, they use a visual keyboard mapping the cipher symbols to their ASCII representations. Although crowdsourcing is essential, it raises the usual challenges of reliability and redundancy and the problem of optimizing the cost/quality ratio, a subject that has been extensively studied both statistically and algorithmically. One could buy near-absolute reliability at great financial cost, but the size of our dataset makes optimizing the reliability/cost tradeoff compulsory. To find appropriate solutions, we rely on the results presented in the paper "Efficient Crowdsourcing for Multi-class Labeling" by David R. Karger and Devavrat Shah (Massachusetts Institute of Technology) and Sewoong Oh (University of Illinois).
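To make the redundancy/aggregation idea concrete, here is the plain majority-vote baseline that the Karger, Oh, and Shah estimator improves upon (their method uses iterative message passing to weight reliable workers more heavily). Task identifiers and labels below are invented for illustration.

```python
# Simplified sketch of aggregating redundant crowdsourced labels: each
# line-transcription task is assigned to several workers, and the final
# label is the most common answer. This is the baseline against which
# the Karger-Oh-Shah iterative estimator is compared.
from collections import Counter

def aggregate_majority(answers):
    """answers: {task_id: [label, label, ...]} from redundant workers.
    Returns {task_id: most common label}."""
    return {task: Counter(labels).most_common(1)[0][0]
            for task, labels in answers.items()}

# Hypothetical votes from three workers per glyph:
votes = {
    "line_17_glyph_3": ["a", "a", "o"],
    "line_18_glyph_1": ["s", "f", "s"],  # long s is often confused with f
}
```

Raising redundancy improves reliability but multiplies cost linearly, which is precisely the tradeoff the cited paper optimizes.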
This is how a portion of the transcription should look, compared with an extract of the original document.
The OCRopus software is used to automatically perform clustering-based image thresholding, i.e., the reduction of a gray-level image to a binary image. Transparency is significantly reduced.
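Clustering-based thresholding of this kind is commonly implemented as Otsu's method: pick the gray level that best separates the pixels into two clusters (ink and background). The sketch below illustrates the idea; OCRopus's actual binarization adds further normalization steps.

```python
# Sketch of Otsu's clustering-based thresholding: choose the threshold
# that maximizes the between-class variance of the two resulting pixel
# clusters, then map pixels above it to background and below it to ink.
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """gray: 2-D uint8 array of gray levels. Returns a threshold in 0..255."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    best_t, best_var = 0, 0.0
    w0 = 0.0    # pixel count of the dark class so far
    sum0 = 0.0  # gray-level sum of the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0 or w0 == total:
            continue
        mu0 = sum0 / w0                      # mean of the dark class
        mu1 = (sum_all - sum0) / (total - w0)  # mean of the bright class
        var_between = w0 * (total - w0) * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(gray: np.ndarray) -> np.ndarray:
    t = otsu_threshold(gray)
    return (gray > t).astype(np.uint8)  # 1 = background, 0 = ink
```

On manuscript scans, the dark cluster captures the ink while bleed-through from the reverse side mostly falls in the bright cluster, which is why transparency is reduced.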
Segmentation was also performed with OCRopus. The binarization/segmentation workflow yields good results with very few artifacts, which can be visually inspected and excluded.
We first tried automated character recognition, but the results were unusable. For instance, the automated transcription obtained for the first line in Fig. 6 is
– P’se-P’F 6a’?e N/ea’erMAir4sf t’s4ytl
For our project, we first built a working prototype of a crowdsourcing platform using the Google Docs and Google Apps Script APIs. However, it was not very user-friendly and not suited to collecting the inputs of hundreds of people working on the subject. We then looked at professional crowdsourcing platforms and found a suitable one: Amazon Mechanical Turk, which offers a large and constant workforce for human intelligence tasks.