We have been continuing our research on the transcription and decryption of a set of Venetian documents. During this stage, we focused our efforts mostly on the transcription problem. Our recent progress is as follows.
OCR Testing With An Existing Program
We ran the images through an existing OCR program. As can be seen below in Figure 1, it was able to identify the textual regions of the image (shown on the image as Region 11).
The corresponding transcription was not as promising, even though the program was able to segment the handwriting into separate lines, as is evident from the line-by-line transcription. The resulting transcription is shown below in Figure 2.
Relatively clear words in the text are transcribed correctly. This suggests that we should pre-process the images before feeding them to the OCR program.
Possible Discussion of Approach Deviations
Other approaches to continue with our transcription process could be the following:
1. Transcribe the images manually, which was our last resort, as mentioned in our earlier reports.
2. Rewrite the text in the images by hand (as it can be faster than typing) and feed that into the OCR. This may work because, when we tried it on a small set of the texts, the OCR output was much more accurate. Apparently, the old scripts are too “stylish” for plain digital recognition.
3. Do 1 and 2 with the help of many people.
Considering the above alternatives, we think the best way forward is still to improve our OCR. Alternative 3 would be a viable option if we had an easier means of crowd-sourcing. For the moment, we are considering whether we can use social media channels for reliable crowd-sourced transcription.
Algorithm Design and Implementation for Data Pre-processing Before OCR
We implemented an iterative pre-processing method in MATLAB that works better than the built-in image sharpening methods, albeit a little more slowly.
The input to the algorithm is an image. The algorithm does the following for each available color channel. In our case there are three channels, each allocated one byte of data.
1. Read the image. This is Img(0), saved in memory as Img_Prev. Denote the iteration step as ‘i’ and consider the image as laid out on the XY-plane. We denote the image constructed in the current iteration step as Img_Curr.
2. Do, for each pixel:
3. Take the following pixel values from the previous step: the pixel we are going to adjust (M) and the four pixels nearest to it in L2-norm (N(1:4)). For pixels M on the border, ignore the N(j) that do not exist:
M = Img_Prev(x,y)
N(1) = Img_Prev(x-1,y)
N(2) = Img_Prev(x+1,y)
N(3) = Img_Prev(x,y-1)
N(4) = Img_Prev(x,y+1)
4. Compare M against all N(j) and move M by the smallest possible step (which equals 1) towards the N(j) whose value is nearest to M:
N = N(j*) where j* = argmin(|M-N(j)|) and “argmin” is over j=1,2,3,4
Img_Curr(x,y) = M + sgn(N-M)
5. Return to step 2 and repeat the process until (A or B or C) is true for all pixels, where for each pixel:
A = true if j1 and j2 exist such that:
j1 ≠ j2
|M - N(j1)| = |M - N(j2)|
sgn(N(j1) - M) ≠ sgn(N(j2) - M)
B = true if (|M-N| = 1):
This is a stopping condition because in such a case, in each iteration M and N would just exchange values.
C = true if (sgn(N-M)=0)
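The steps above can be sketched in code. What follows is a minimal single-channel sketch in plain Python, not the MATLAB implementation itself. It reads A, B, and C as per-pixel stopping conditions, and it assumes that a pixel for which any of them holds is left unchanged and that iteration ends once no pixel changes. The function names and the `max_iters` safety cap are hypothetical and introduced only for illustration.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def neighbours(img, x, y):
    """Values of the (up to four) directly adjacent pixels that exist."""
    h, w = len(img), len(img[0])
    coords = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [img[a][b] for a, b in coords if 0 <= a < h and 0 <= b < w]


def is_settled(m, ns):
    """True if condition A, B or C holds for a pixel M with neighbours ns."""
    if not ns:  # assumption: a pixel without neighbours never changes
        return True
    d = min(abs(m - n) for n in ns)
    # A: two neighbours at equal distance from M but on opposite sides
    cond_a = any(abs(m - p) == abs(m - q) and sign(p - m) != sign(q - m)
                 for p in ns for q in ns)
    cond_b = d == 1  # B: M and its nearest neighbour would only swap values
    cond_c = d == 0  # C: M already equals its nearest neighbour
    return cond_a or cond_b or cond_c


def step(prev):
    """One iteration: build Img_Curr from Img_Prev (steps 2 to 4)."""
    curr = [row[:] for row in prev]
    changed = False
    for x in range(len(prev)):
        for y in range(len(prev[0])):
            m = prev[x][y]
            ns = neighbours(prev, x, y)
            if is_settled(m, ns):
                continue  # assumption: settled pixels are left as they are
            nearest = min(ns, key=lambda n: abs(m - n))
            curr[x][y] = m + sign(nearest - m)
            changed = True
    return curr, changed


def preprocess_channel(img, max_iters=255):
    """Iterate until every pixel is settled (max_iters is a safety cap)."""
    prev = [row[:] for row in img]
    for _ in range(max_iters):
        prev, changed = step(prev)
        if not changed:
            break
    return prev
```

For example, `preprocess_channel([[10, 14]])` returns `[[12, 12]]`: the two pixels move one step towards each other in each iteration until they are equal (condition C). A three-channel image would be handled by calling the function once per channel.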
This algorithm works very well for separating the foreground (text) from the background (the worn-out paper). Examples of the results we obtained can be found below in Figures 3A, 3B, and 3C:
Figure 3A: Initial Data Example
Figure 3B: Data After Sharpening
Figure 3C: Data After Being Fed Through The Algorithm
A Remark on Image Segmentation and Machine Learning Progress
We are mostly done with our image segmentation (automatically separating the regions with one letter or word into separate images) and learning (to convert handwriting to text as explained in our previous progress report) algorithms, and are currently working on their implementations.
We are a bit behind the schedule set in our proposal, since we have not yet finalized the transcription process. Our justification is that, at the time of the proposal, we did not have access to the data we are currently dealing with, and thus were not aware of the following issues:
- The scripts have very personalized (“stylish”) handwriting, which is hard to transcribe via OCR.
- A significant portion of the texts uses an unfamiliar alphabet (most likely specialized for this intercity communication) with an unknown number of letters and, possibly, unique word construction structures.
We plan to finish the remainder of the transcription process as soon as possible. If we cannot, we will try to change approaches and postpone the remaining transcription-related stages as much as possible so that we can work on the other aspects of the project.