Cryptography is a word of Greek origin that translates as "the study of secret writings". It can be described as the study and practice of securing communications against adversaries. In Venice between the 15th and 19th centuries, ambassadors used cryptography to secure their communications with contacts in other cities such as London, Paris and Istanbul. In this project, we aim to understand the cryptographic methods used by these Venetian ambassadors. Furthermore, we will attempt to decipher documents from such communications that still remain encrypted. In the decryption process, conventional methods will be applied first. Afterwards, we will work on devising novel approaches suited to deciphering these encrypted Venetian documents. Overall, the output of the project will be a special bot tailored for our deliverables: decryption, encryption and recognition of such documents.
In this section, we will describe several stages essential to the project.
Our input data will be scans of certain encrypted documents from the State Archive of Venice. First of all, these graphical representations (the scanned images) need to be converted into a textual format for our cryptanalysis. There are two possible approaches: manual and automated transcription.
In manual transcription, the document scans are examined by hand, and the examiners reconstruct the documents in textual format according to their personal judgment. In automated transcription, image processing techniques are employed to recognize text in the scanned images. Automating the transcription step would give faster access to a larger collection of data in textual format.
For the cryptanalysis subsection that follows, we will assume access to the textual format of the documents. Afterwards, in the text extraction subsection, we will explore how to automate the transcription step.
At the moment, there is still uncertainty about the material we will be able to access. We assume we will have the following: a set of ciphertext-plaintext pairs, and another set of still-encrypted ciphertexts. Additionally, we expect to further divide these sets into two subsets according to one more characteristic: whether the cryptosystem used (device, algorithm, etc.) is known.
Our approach to the decryption problem will depend on this knowledge of the cryptosystem. Below, we explain the resulting differences in our approach, after first going through the necessary step of learning about the cryptosystems used in Venice between the 15th and 19th centuries.
Before starting our analysis next semester, we will study the different Venetian cryptosystems (cryptographic devices and algorithms). In this article, by Venetian cryptosystems we refer to the cryptosystems that were developed in the vicinity of Venice (i.e. Europe) around the 15th century and were potentially used in communications with Venice until the 19th century, after which the modern age of cryptology flourished.
Some examples of such cryptosystems can be found in another project on Venetian cryptography from previous years. A compilation of exemplary polyalphabetic ciphers is as follows: the Alberti cipher disk, the Trithemius cipher, the Vigenère cipher and the autokey cipher. Among them, the Vigenère cipher is the most general: used with one-time-pad keys, it achieves perfect secrecy. The others can be derived as special cases of the Vigenère cipher, except the Alberti disk, which combines an ordered alphabet with a mixed one. Additionally, the alphabets of the inner and outer circles of the Alberti disk do not exactly match, which matters for deciphering.
Figure 1: Alberti Cipher Disk
In the remainder of the methodology section, we will see how this literature review stage will be beneficial for our project planning and implementation approach.
To decipher the still-encrypted documents, we need to apply cryptographic attacks. The two attack types we are interested in are the ciphertext-only attack and the known-plaintext attack, respectively for our set of bare ciphertexts and our set of ciphertext-plaintext pairs. In addition, if we know which cryptosystem was used, we can tailor our decryption process accordingly for better results. Below, we discuss the different aspects affecting our decryption strategies.
Knowing the Cryptosystem in use:
This case can be summarized as follows: we have the devices (or algorithms) used, but not necessarily the exact specifications of the cryptosystems. A basic literature review has shown how one device (cipher) can be used in different operational modes that generate different cryptosystems (e.g. the Alberti cipher disk). Additionally, we will have ciphertexts, for some of which we may also have the corresponding plaintexts.
If we have no information about the device or algorithm, we will consequently not know the cryptosystem used. In that case, we need to compile a list of possible Venetian cryptosystems (devices, modes of operation, algorithms and configurations) and attempt decipherment under each of these settings.
In the end, before proceeding with our attack, we settle on a cryptosystem setting, whether we know it beforehand or estimate it under some assumption (high estimation accuracy is neither expected nor needed at this stage).
Just-Ciphertexts vs Ciphertext-Plaintext Pairs:
Firstly, for the set of just-ciphertexts, we can do the following ciphertext-only attacks:
Brute force exhaustive search
This algorithm outputs a key that converts the ciphertext into a meaningful text. The key used to encipher a text need be no longer than the text itself. If the text has N letters overall, then there are L^N possible keys of length N, where L is the size of the alphabet. If we include all possible key lengths from 1 to N and conduct our brute-force search over key lengths from 1 to N, we will find a key match in fewer than (L/(L-1))·L^N trials. When a key shorter than the text is used, the key is simply appended to itself repeatedly until the desired length is reached.
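The repeating-key behavior described above can be sketched in a few lines. This is an illustrative implementation over a 26-letter alphabet, not the historical Venetian variant; all names are ours.

```python
# Minimal sketch of a Vigenère cipher with a repeating key.
# The key is cycled (appended to itself) until it covers the text,
# and each letter is shifted modulo the alphabet size L.

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
L = len(ALPHABET)

def vigenere(text, key, decrypt=False):
    sign = -1 if decrypt else 1
    out = []
    for i, ch in enumerate(text):
        shift = ALPHABET.index(key[i % len(key)])  # repeat the key cyclically
        out.append(ALPHABET[(ALPHABET.index(ch) + sign * shift) % L])
    return "".join(out)

cipher = vigenere("ATTACKATDAWN", "LEMON")       # -> "LXFOPVEFRNHR"
plain = vigenere(cipher, "LEMON", decrypt=True)  # -> "ATTACKATDAWN"
```

A brute-force search would simply call `vigenere(..., decrypt=True)` for every candidate key and test whether the output is meaningful text, which is what makes the (L/(L-1))·L^N bound above so impractical for long keys.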
The Kasiski examination is very useful for determining the key length, which speeds up deciphering considerably compared with brute force. We attempt to detect unusually frequent occurrences of n-grams for n in [1, N]. We collect the letter-wise distances between these repeated n-grams in a list; the key length will likely equal the greatest common divisor of those distances, or one of its factors. After estimating a key length n, we separate the letters of the text into n groups according to the (mod n) equivalence class of their linear position in the text (the position of the first letter of line m is 1 plus the position of the last letter of line m-1). We treat each group as if it were encrypted with Caesar's cipher, which is a Vigenère key of length 1. The distribution of letters in a language is generally not uniform, so by comparing the expected letter distribution with the empirical distribution in each group we can decide on a Caesar key (letter) for each group. We would thus obtain the key and hence the plaintext. If the performance is unsatisfactory in some specific cases, a promising approach could be the following:
- Make the gram length variable for each list (not all of them 1) and find the corresponding most likely n-gram keys for each list separately.
- Decide on the key lengths so as to maximize the likelihood of the found key.
- Work on a smart algorithm to carry out the first two steps efficiently.
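The basic Kasiski procedure described above can be sketched as follows. This is a hedged sketch under simplifying assumptions (repeated trigrams only, a crude "most common letter is E" frequency match); the function names are ours.

```python
# Sketch of the Kasiski examination: find repeated trigrams, collect the
# distances between their occurrences, and take the divisors of the GCD of
# those distances as key-length candidates. Each position class (mod n) is
# then attacked as a Caesar shift by matching letter frequencies.
from math import gcd
from functools import reduce
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def kasiski_candidates(ct, gram=3):
    positions = {}
    for i in range(len(ct) - gram + 1):
        positions.setdefault(ct[i:i + gram], []).append(i)
    dists = [p[j + 1] - p[j] for p in positions.values() if len(p) > 1
             for j in range(len(p) - 1)]
    if not dists:
        return []
    g = reduce(gcd, dists)
    return [d for d in range(1, g + 1) if g % d == 0]  # divisors of the GCD

def caesar_shift(group):
    # crude frequency analysis: assume the most common ciphertext letter
    # in this group corresponds to plaintext 'E'
    most = Counter(group).most_common(1)[0][0]
    return (ALPHABET.index(most) - ALPHABET.index("E")) % 26
```

In practice one would replace the single-letter match in `caesar_shift` with a chi-squared comparison against full language frequencies, but the structure of the attack is the same.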
The above strategies can also be applied to the Alberti cipher with certain modifications. Additionally, even though the Alberti cipher can have a varying, one-time-pad-like key length, it also carries indicator letters for the key in the ciphertext itself.
Secondly, when we have plaintext-ciphertext pairs, we can use a known-plaintext attack to achieve the following:
If we know both the ciphertext and the plaintext, we can easily find the key when the cryptosystem is of Vigenère form (subtracting texts letter-wise corresponds to modulo-L operations). However, finding the key of an already deciphered document is useful only if the same key was used in other still-encrypted documents. Apart from that, a known-plaintext attack can give us more information about the cryptosystem in active use. This could lead to discovering new algorithms and modes of operation for existing devices, given that we may be working with freshly digitized, never-before computationally investigated documents.
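The key-recovery observation above amounts to one modular subtraction per letter. A minimal sketch, assuming a Vigenère-style system over a 26-letter alphabet:

```python
# Known-plaintext attack on a Vigenère-style cipher: subtracting the
# plaintext from the ciphertext letter-wise (mod L) exposes the key stream,
# whose repetition reveals the actual key.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def recover_keystream(plaintext, ciphertext):
    return "".join(
        ALPHABET[(ALPHABET.index(c) - ALPHABET.index(p)) % len(ALPHABET)]
        for p, c in zip(plaintext, ciphertext)
    )

# e.g. recover_keystream("ATTACKATDAWN", "LXFOPVEFRNHR")
# yields "LEMONLEMONLE", i.e. the key "LEMON" repeated.
```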
Tools to Use
After converting the documents into textual content, we will perform our analysis in MATLAB, SageMath and possibly R. After constructing the algorithms and deciphering schemes, we will port them to Python and, using a web framework such as Django, integrate the program into DHCanvas.
Text Extraction: Optical Character Recognition
In this section, we will investigate an approach to text recognition called optical character recognition (OCR).
Optical character recognition means converting images of letters, handwritten or machine-printed, into machine-encoded text. For example, in English, every lowercase letter except i and j appears to the computer as a single connected component of pixels. A similar claim can be made for almost any Latin-alphabet language (including Italian). Such an algorithm compares the input data (the original writing) with a training set of digitized alphabet letters.
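The connected-component idea above can be illustrated on a tiny binary image. This is an illustrative sketch, not the project's actual OCR code: a flood fill counts the components of ink pixels, and a dotted letter like "i" yields two of them.

```python
# Count connected components of ink pixels (value 1) in a binary grid
# using an iterative flood fill with 4-connectivity.
def count_components(grid):
    rows, cols = len(grid), len(grid[0])
    seen = set()
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] and (r, c) not in seen:
                count += 1
                stack = [(r, c)]  # flood-fill one component
                while stack:
                    y, x = stack.pop()
                    if (y, x) in seen:
                        continue
                    seen.add((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols and grid[ny][nx]:
                            stack.append((ny, nx))
    return count

# A crude "i": the dot is disconnected from the stem -> 2 components
i_glyph = [[0, 1, 0],
           [0, 0, 0],
           [0, 1, 0],
           [0, 1, 0]]
```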
OCR Algorithm Summarized:
- Segmentation of the scanned document page into distinct blocks each containing a letter
- Comparing each block with the training set, outputting the estimation
Archived documents tend to carry a certain amount of noise in addition to their content. It can take the form of arbitrary shaded lines and dots (human error in writing, speckle noise). To increase the recognition success rate, it would be wise to clean the image with filters. Below are three examples of such filters.
- Gaussian smoothing filter:
- best for natural noise; it reduces noise while preserving the information in neighboring pixels
- Threshold application:
- converts the image into a binary visualization (black and white)
- recognizing objects becomes easier in the binary visualization
- through “double thresholding”, object matching becomes more efficient
- Removal of “small regions”:
- deletes noise-dominated regions whose size is smaller than the smallest letter size
We need to choose the optimal filter in accordance with the characteristics of the visual.
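Two of the filters listed above can be sketched in pure Python. This is a hedged sketch under simplifying assumptions (a fixed global threshold, 4-connectivity, size measured in pixels); the names and parameters are ours.

```python
# Thresholding: binarize a grayscale image (dark ink -> 1, background -> 0).
def binarize(img, threshold=128):
    return [[1 if px < threshold else 0 for px in row] for row in img]

# Small-region removal: delete connected regions smaller than the smallest
# expected letter size, treating them as speckle noise.
def remove_small_regions(binary, min_size):
    rows, cols = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    seen = set()
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] and (r, c) not in seen:
                region, stack = [], [(r, c)]  # flood-fill one region
                while stack:
                    y, x = stack.pop()
                    if (y, x) in seen or not (0 <= y < rows and 0 <= x < cols) \
                            or not binary[y][x]:
                        continue
                    seen.add((y, x))
                    region.append((y, x))
                    stack.extend([(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)])
                if len(region) < min_size:  # likely a speckle, not a letter
                    for y, x in region:
                        out[y][x] = 0
    return out
```

In practice the threshold would be chosen adaptively from the image histogram rather than fixed at 128.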
Figure 2: Example showing different handwriting styles and arbitrary noise (drawings)
Region Segmentation and Letter Recognition
Figure 3: Depiction of Letter Recognition
As the blue lines scan the image, we find an “object” (letter candidate). The pixels at the intersections of the objects with the blue lines are identified. These intersection pixels are connected in a primarily upwards and leftwards fashion. If the connected set of pixels intersects both the uppermost and lowermost horizontal lines, a letter is detected. As seen above, all objects form tree structures, so different letters are disconnected.
Note that, without machine learning, the algorithm may require fine-tuning for different handwriting styles. A future discussion on teaching the program to learn from unidentified letters could benefit the end performance. This way, we could feed the program different distorted versions of the same letter to improve the recognition rate in later applications.
Project Plan – Conclusion
After we receive our data set, we will transcribe it either manually or automatically via text extraction with character recognition. Afterwards, we will separate our data into different groups depending on their characteristics (knowledge of cryptosystem, existence of plaintext).
Then, our cryptanalysis will follow which will include the following ciphertext-only attacks (ciphertexts are still-encrypted documents):
- Exhaustive, brute force search
- Kasiski examination, frequency analysis
- Attempt: Enhanced Kasiski examination (with variable n-gram lengths for key recovery)
Also, we will perform known-plaintext attacks to retrieve more information on the cryptosystems. Possibly, we may discover new algorithms or modes of operation for existing devices.
Finally, we will transfer all our findings into Python and, from there, to DHCanvas.
- Comprehensive literature review of Venetian cryptography (1 week):
- regularly continued upon during the semester when required
- Transcribing (manual and/or automated) and preparing the data set for cryptanalysis (5 weeks)
- Analysis of data and initiation of cryptographic attacks in easy-to-handle platforms such as MATLAB, R and SageMath (4 weeks)
- Collecting the decryption, resulting encryption and -if successful- textual image recognition algorithms in the special bot for DHCanvas written with Python using a web framework like Django (4 weeks)
- Finalizing the project deliverables (1 week)
- H. Liddell, R. Scott, H. Jones and R. McKenzie, A Greek-English lexicon. Oxford: Clarendon Press, 1996.
- The Venice Atlas, ‘Venetian cryptographies’, 2013. [Online]. Available: http://veniceatlas.epfl.ch/atlas/gis-and-databases/named-entities/venetian-cryptographies/. [Accessed: 09- Dec- 2015].
- Wikipedia, ‘Alberti cipher disk’, 2015. [Online]. Available: https://en.wikipedia.org/wiki/Alberti_cipher_disk. [Accessed: 09- Dec- 2015].
- H. Wolfe, ‘Learning to write the alphabet – The Collation’, The Collation, 2013. [Online]. Available: http://collation.folger.edu/2013/05/learning-to-write-the-alphabet/. [Accessed: 09- Dec- 2015].