Encryption is the process of hiding the original content (or semantics) of data from an initial observer. It is one of the two fundamental processes (the other being decryption) in the field of cryptography which can described as the study and the practice of securing communications against adversaries. In Venice, between 15.-19. centuries, the ambassadors encrypted their communications with contacts from other cities such as London, Paris and Istanbul. The aim of this project is to make progress towards the decryption of such documents.
In this paper, we show our findings relating to a system which can be used as the pre-processing sub-module of a textual recognition module for handwritten diplomatic exchanges. Our input consists of 72 scanned documents, all in handwriting and 16. century Venetian. In the following section, we detail our progresses in a multitude of aspects: image cropping, global noise removal, regional black-white quantization, artifact elimination, line separation. Parallel to these, we also made progresses with manual transcription.
1. Automated Image Cropping
After a quick examination of the scans, we realized that the images contained more than the texts. There were non-textual regions near the image borders. To eliminate these regions, we do the following:
We do a preliminary regional black-white quantization of the image with Otsu’s method such that the intra-class variance is minimized (Otsu, 1979). We separate the image into regions (specifically a total of 10×10 equi-sized regions) before quantization because as can be seen in the figure below, the best threshold varies across regions in the scans, most likely due to a number of factors such as heterogeneous illumination, deterioration of the paper and ink, arbitrary stains…
Figure 1: Otsu’s method’s threshold distribution across a scan
Figure 2: Result of the preliminary BW quantization (low-res figure)
Afterwards, we get the L1-norm of the rows in the BW image. Then, we smooth the L1-norm vector (1 value for each row) with Savitzky-Golay filter. This filter is a generalized version of the moving average filter. The coefficients are found via least squares regression (Savitzky, 1964). The reason for this low-pass filtering is to capture the trend across the rows of the image, mainly where the text starts where it ends etc.
Figure 3: L1-norms of the binary (BW) vectors in rows and the smoothed version
We find the two local maxima indices such that they are corresponding global maximums in the former and latter halves of the smoothed signal. These will correspond to the cropping indices in Y (vertical) since they will correspond to the portions with most whites (i.e. upper and lower page margins). We then crop the upper and lower non-text margins falling outside of these indices.
A similar procedure is applied for the horizontal margins except the following. We first identify a local sharp minimum which corresponds to the book binding. We check if the minimum is sharp enough via manually tuned parameters. We eliminate the section falling on the wrong side of the binding and search for our local maxima afterwards. Results:
Figure 4: Crop contour shown in green for an exemplary image
2. Global Noise Removal
When we look at the intensity histogram of the image we observe bimodality as expected. This justifies our attempt to use black-white quantization for the foreground (ink) and the background (paper). Before proceeding with the actual quantization we apply the following iterative noise removal algorithm.
Figure 5: Iterative Global Noise Removal
Each pixel has four direct neighbors (except border pixels): two vertical and two horizontal. We identify to which neighbor the pixel is closest to in absolute value and move (increment or decrement) in that direction. At steady state (absolute difference is 0 or 1 for all pixels), we observe that the mode peaks get larger. The amount of increase in the peaks depends on the image.
3. Black White Quantization
After cropping the image, we reapply the regional BW quantization.
Figure 6: Histogram change (grey-level intensity) after BW quantization
As can be seen in Figure 6, we successfully separate the foreground and background, and map the modes on the left to black and white.
4. Artifact Elimination
In the BW image, we can observe small clusters of non-textual parts identified as black. Consequently, we need to filter these out. We use the Flood Fill algorithm to determine these clusters (components) and if their size is under a certain threshold we turn them into white. The below threshold satisfies that with probability near 1, we eliminate non-text artifacts. We assume a priori probability distribution of character sizes as gaussian.
Figure 7: Smallness threshold for artifact elimination
In Figure 8, we can see the intermediary result after all the above steps.
Figure 8: BW quantization example
5. Line Separation
The next step is to identify the lines in the image. For that we separate the complement of the BW image into column-wise regions as shown in Figure 9. Column regions are equi-width on the horizontal axis. Taking the complement makes the writings white (=1) and the paper black (=0).
Figure 9: Illustration of column region separation
In each region, we get the L1-norms of the rows and identify the line separation when the norm is 0 which would indicate that row belong to the line separations. After identifying these separative column region rows (essentially row segments in the overall image), we create a new image where we label all the separative row segment elements logical 1 and the remaining parts logical 0. This 2D matrix, or image, has the same size with the cropped image. In this 2D image, we identify the connected components with Flood Fill algorithm and draw lines across these components by averaging the vertical indices of the contained pixels across the vertical dimension.
Figure 10: Results of the line separation
6. Manual Transcription
We noticed the dataset is not suitable for commercial OCR (optical character recognition) programs due to the following reasons. The fact that the letters are in handwriting hinders the effectiveness of OCR due to the calligraphic form the writer uses. Additionally, since cryptography is involved, we occasionally encounter foreign alphabet and symbols. Available OCR programs cannot be easily tuned for the recognition of these.
We tried manual transcription; however, it proved to be a very challenging and time consuming task. Since, in the end, it would not generalize to larger datasets, the required effort easily outweighs the prospects. Nevertheless, we made some progress on that end.
We encountered many letters we were not able to distinct between. For example, in a group of documents from the same writer, (‘f’, ‘s’, ‘l’) and (‘x’, ‘k’, ‘r’) showed similarities respectively. To fasten the amount of time in transcription, we decided to denote the letters in each of such groups with a single letter from the corresponding group. We concluded that the final letter assignment could be by utilizing the semantics after completing the transcription.
Furthermore, occasionally, we could not decide whether a character was a letter or a number such as with the letter-number pairs of (‘b’,’6′) and (‘o’,’0′). To circumvent that, we scanned a page to see if there were other numbers and, if not encountered, we identified a character as the letter from the corresponding pair.
There were also cases where the characters was placed such that one of them was the superscript of the other. We placed the superscripts in parenthesis.
Figure 11: Manual transcription result
So far, we implemented a system which can successfully crop a scanned document to eliminate non-textual regions. Afterwards, it can quantize the image to clearly separate the paper (background) and the letters (foreground). Furthermore, it can also eliminate artifacts arising from a number of factors such as deteriorated paper, stains, shadows of the ink on the backside of the paper etc. Then, we can do line separation and correct the orientation of the image (line by line). We also managed some manual transcription.
The continuation of this work would be to achieve letter separation and consequently, letter recognition. For that, we devised and started implementing a system. The system starts with Flood Fill algorithm to identify the letters, remaining larger components. After correcting the orientation of the image, these letters are identified in the correct ordered.
We derive a dictionary of letters. However, for that we would need to be at ease with manual transcription which has proved to be very hard and time-consuming. Afterwards, a subset of letters is selected and identified via SIFT (scale-invariant feature map). Using these letters as a training set, we decide on the remaining letters with a supervised learning method named bagged classification trees . It is basically an ensemble of decision trees.
Otsu, Nobuyuki, “A Threshold Selection Method From Gray-Level Histograms”, IEEE Transactions on Systems, Man, and Cybernetics, 9 (1979), 62-66 <http://dx.doi.org/10.1109/tsmc.1979.4310076>
Savitzky, Abraham. and M. J. E. Golay, “Smoothing And Differentiation Of Data By Simplified Least Squares Procedures.”, Analytical Chemistry, 36 (1964), 1627-1639 <http://dx.doi.org/10.1021/ac60214a047>
For our poster presentation, see here.
For the progress reports (1, 2, 3) and the initial project proposal, please see the older posts in http://veniceatlas.epfl.ch/author/hakangokcesu/.