Our project proposal can be found here. In a nutshell, our aim is to decipher encrypted Venetian documents. These handwritten documents date from the 1560s and are communications between Venice and its ambassadors in Constantinople. They were delivered to us as scanned JPEG images.
As scheduled in our proposal, we continued our literature review of Venetian cryptography and began transcribing the data from handwriting to digital text.
Through the literature review, we learned about the advances in cryptography made in Europe during that period; Europe roughly corresponds to Venice's communication range in those years. In the proposal, we primarily discussed the Alberti disk and the Vigenère cipher. Other noteworthy advances were as follows [1,2]:
- Johannes Trithemius, 1518: substitution of letters with words
- Giovanni Battista Porta, 1563:
  - intra-alphabetic substitution
  - inter-alphabetic substitution (symbol exchange with another uncommon or special alphabet)
  - synonym usage and misspellings for defense
Another idea, which is not tied to this history but is nevertheless in line with the cryptographic approach of the 16th century, is to substitute words of substantial significance with others according to a ‘codebook’ and to scramble the word order.
Knowledge of these cryptographic techniques is helping us construct our approach to the deciphering problem. For example, some documents mix an alphabet foreign to Latin with Venetian to encrypt the communication. It is hard to determine this alphabet without going through most, if not all, of the documents. Therefore, we are currently devising and implementing an algorithm that goes through the image files and gradually expands the known foreign alphabet.
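To make the codebook idea concrete, here is a minimal Python sketch. The codebook entries, the key, and the function names are hypothetical illustrations and are not taken from the Venetian documents; the keyed shuffle is a modern stand-in for whatever scrambling rule a clerk might have agreed on with the recipient.

```python
import random

# Hypothetical codebook: significant words mapped to innocuous stand-ins.
CODEBOOK = {"sultan": "merchant", "fleet": "cargo", "war": "trade"}


def _permutation(n, key):
    """Return a reproducible permutation of range(n) derived from the key."""
    order = list(range(n))
    random.Random(key).shuffle(order)
    return order


def encode(message, codebook, key):
    """Substitute codebook words, then scramble the word order."""
    words = [codebook.get(w, w) for w in message.split()]
    order = _permutation(len(words), key)
    return " ".join(words[i] for i in order)


def decode(ciphertext, codebook, key):
    """Undo the scrambling, then map the stand-ins back to the originals."""
    reverse = {v: k for k, v in codebook.items()}
    words = ciphertext.split()
    order = _permutation(len(words), key)
    restored = [None] * len(words)
    for position, source_index in enumerate(order):
        restored[source_index] = words[position]
    return " ".join(reverse.get(w, w) for w in restored)
```

Decoding only works if both sides share the same codebook and the same key, which is why a deciphering attempt without either has to fall back on statistics over many documents.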
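One possible shape for such an algorithm is sketched below in Python. It is only an illustration under strong assumptions: glyphs are assumed to be already segmented and converted into numeric feature vectors, and the function name `expand_alphabet` and the distance `threshold` are hypothetical. A glyph that is close enough to a known symbol is assigned to it; otherwise it is added to the alphabet as a new symbol.

```python
import math


def expand_alphabet(glyphs, alphabet=None, threshold=1.0):
    """Assign each glyph to the nearest known symbol or register a new one.

    glyphs:    iterable of equal-length feature vectors (assumed input format)
    alphabet:  list of prototype vectors found so far (may be empty or None)
    threshold: maximum distance at which a glyph still counts as a known symbol
    Returns the (possibly larger) alphabet and one symbol index per glyph.
    """
    alphabet = [] if alphabet is None else list(alphabet)
    labels = []
    for glyph in glyphs:
        best_index, best_distance = None, float("inf")
        for index, prototype in enumerate(alphabet):
            distance = math.dist(glyph, prototype)
            if distance < best_distance:
                best_index, best_distance = index, distance
        if best_index is None or best_distance > threshold:
            # Nothing similar is known yet: treat the glyph as a new symbol.
            alphabet.append(tuple(glyph))
            best_index = len(alphabet) - 1
        labels.append(best_index)
    return alphabet, labels
```

Calling the function once per document and passing the returned alphabet into the next call lets the symbol inventory grow gradually across the collection.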
Early Data Analysis
We have also started to investigate the documents. There seem to be four different handwriting styles, as illustrated in Figure 1.
Figure 1: Different Handwriting Styles
In these documents, we have identified at least two different encryption techniques. One of them uses an extended version of the Venetian alphabet and the other uses a completely different alphabet with symbols we did not recognize.
We looked at several OCR programs for the Venetian alphabet but did not find one that can properly digitize the handwriting in our documents. In the following weeks, our first step is to process the images to reduce the background noise, which appears as the shady lines in Figure 2.
Figure 2: Example of Background Noise
We will pass the images through a filter that suppresses locally faint (weakly colored) areas.
Other obstacles stand in the way of OCR:
- Consecutive lines are sometimes not cleanly separated (a soft rather than a hard margin), or the separation is not perfectly horizontal. Examples are in Figures 3 and 4.
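A filter of this kind could look roughly like the following Python sketch. The exact filter has not been chosen yet, so this is one assumed variant: the image is a grayscale grid of values from 0 (black) to 255 (white), the brightest pixel in a small window is taken as the local paper color, and a pixel is kept only if it is clearly darker than that. The function name, `radius`, and `margin` are hypothetical.

```python
def suppress_faint_background(image, radius=2, margin=60, white=255):
    """Whiten pixels that are only slightly darker than the surrounding paper.

    image:  list of rows of grayscale values, 0 = black, 255 = white
    radius: half-width of the square window used to estimate the paper color;
            it should be larger than the stroke width
    margin: minimum darkness relative to the paper for a pixel to count as ink
    """
    height, width = len(image), len(image[0])
    result = [[white] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            rows = range(max(0, y - radius), min(height, y + radius + 1))
            cols = range(max(0, x - radius), min(width, x + radius + 1))
            # The brightest pixel nearby approximates the bare paper.
            paper = max(image[j][i] for j in rows for i in cols)
            if paper - image[y][x] >= margin:
                result[y][x] = image[y][x]  # strong contrast: keep as ink
    return result
```

With a window this small the function is slow on full scans; it is meant to show the idea, not to replace an optimized image-processing library.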
Figure 3: Soft Margin Separations
Figure 4: Soft Margin Separation with Arbitrary Connection
- The letters are not always completely separated (in our case, most of the time they are not), as can already be seen in Figures 1 and 2.
To circumvent these obstacles, we currently plan to train a program to recognize the handwriting after we manually provide it with some initial structure. This may be done as follows:
- Separating the letters and feeding these mini-images to the program one at a time, each with its corresponding letter (the label, in supervised-learning terms)
At some point, we also started to transcribe the documents manually; however, that task proved more tedious than we expected. First, not knowing Italian turned out to be a real disadvantage: we could not easily read the texts and type them out. Second, the handwriting is far from geometric. Consequently, as the figures already illustrate, these styles are hard to read for people who are not used to them.
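The line-separation obstacle can be illustrated with a horizontal projection profile, a common first attempt at splitting a page into text lines. It is named here only as an illustration; we have not settled on a method. The sketch assumes a binarized page where 1 marks ink and 0 marks paper, and the function name and `min_ink` parameter are hypothetical.

```python
def split_lines(binary_page, min_ink=1):
    """Return (start, end) row ranges of text lines, end exclusive.

    binary_page: list of rows, each a list of 0 (paper) or 1 (ink)
    min_ink:     number of ink pixels a row needs to count as part of a line
    """
    lines = []
    start = None
    for y, row in enumerate(binary_page):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y  # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))  # the line ended on the previous row
            start = None
    if start is not None:
        lines.append((start, len(binary_page)))
    return lines
```

When a single stroke bridges the gap between two lines, no empty row remains and the two lines are reported as one. Raising `min_ink` helps only when the connection is thin, and it does not address separations that are not horizontal.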
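As a rough illustration of this labeling step, the following Python sketch uses a nearest-neighbor classifier on raw pixels. This is an assumed stand-in, since no learning method has been fixed yet. The mini-images are assumed to be equally sized grids of 0 and 1, and the names `train` and `classify` are hypothetical.

```python
def flatten(image):
    """Turn a grid of pixels into a single flat list."""
    return [pixel for row in image for pixel in row]


def train(examples):
    """Store each labeled mini-image as a (pixel vector, label) pair."""
    return [(flatten(image), label) for image, label in examples]


def classify(model, image):
    """Return the label of the stored example closest to the given image."""
    vector = flatten(image)

    def squared_distance(example):
        pixels, _label = example
        return sum((a - b) ** 2 for a, b in zip(pixels, vector))

    return min(model, key=squared_distance)[1]
```

A short usage example with two made-up 3x3 glyphs:

```python
letter_i = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
letter_o = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
model = train([(letter_i, "i"), (letter_o, "o")])
print(classify(model, [[0, 1, 0], [0, 1, 0], [0, 1, 1]]))
```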
For the moment, we are able to progress as scheduled and we hope to keep this pace until the end of the semester.
References
[1] Kahn, David, The Codebreakers (New York: Macmillan, 1967)
[2] “CME’s Cryptology Timeline”, Math.cornell.edu, 2016 <http://www.math.cornell.edu/~morris/135/timeline.html> [accessed 7 March 2016]
[3] “Open Codes”, Crypto Almanac 50th, 2016 <https://www.nsa.gov/public_info/_files/crypto_almanac_50th/Open%20Codes.pdf> [accessed 7 March 2016]