This is our third tri-weekly progress report. Our recent progress is as follows.
After applying the pre-processing algorithm discussed in the previous blog post, we obtained better foreground (text) extraction. This proved useful when we applied feature point detection to the characters. The process is as follows:
We were able to binarize the images near-perfectly, so that the foreground (text) is black and the background (worn-out paper) is white. We used MATLAB's built-in ‘im2bw’ and ‘graythresh’ functions. ‘graythresh’ uses Otsu’s method to determine a threshold; the pixel values are then compared against this threshold and mapped to logical 0s and 1s.
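For readers who want to see what ‘graythresh’ computes under the hood, here is a minimal Python sketch of Otsu’s method and the subsequent binarization (our actual implementation is in MATLAB; this is only an illustration on a flat list of 8-bit grayscale values):

```python
def otsu_threshold(gray):
    # Build a 256-bin histogram of the 8-bit grayscale values.
    hist = [0] * 256
    for v in gray:
        hist[v] += 1
    total = len(gray)

    best_t, best_var = 0, -1.0
    sum_all = sum(i * hist[i] for i in range(256))
    w_bg, sum_bg = 0, 0.0
    for t in range(256):
        w_bg += hist[t]              # pixels at or below threshold t
        if w_bg == 0:
            continue
        w_fg = total - w_bg          # pixels above threshold t
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Otsu picks the t that maximizes the between-class variance.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(gray, t):
    # Map dark (text) pixels to logical 0 and light (paper) pixels to 1.
    return [0 if v <= t else 1 for v in gray]
```

The between-class variance criterion is what makes Otsu’s method work well on our pages: the worn paper and the ink form two reasonably separated intensity clusters.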
To achieve better results, we segmented the image into smaller blocks (about 100×100 pixels) and binarized each block separately. We chose this block resolution (100×100) for two reasons:
- The blocks are small enough for negligible threshold variance inside.
- The blocks are big enough to cover at least one character at a time.
We adjusted this block resolution as we applied the binarization to different pages, in accordance with the letter sizes on them.
On one example page, collecting the normalized threshold values into a vector showed that they ranged from 0.4706 to 0.8.
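The block-wise scheme can be sketched as follows (Python for illustration; `threshold_fn` stands for any per-block scalar thresholding routine, such as the Otsu computation ‘graythresh’ performs):

```python
def binarize_blocks(img, block=100, threshold_fn=None):
    # 'img' is a 2-D list of grayscale rows; each block x block tile gets
    # its own threshold, which absorbs local lighting/paper variation.
    h, w = len(img), len(img[0])
    out = [[1] * w for _ in range(h)]
    thresholds = []                      # per-block thresholds, for inspection
    for r0 in range(0, h, block):
        for c0 in range(0, w, block):
            # Gather this block's pixels into a flat list.
            pixels = [img[r][c]
                      for r in range(r0, min(r0 + block, h))
                      for c in range(c0, min(c0 + block, w))]
            t = threshold_fn(pixels)
            thresholds.append(t)
            for r in range(r0, min(r0 + block, h)):
                for c in range(c0, min(c0 + block, w)):
                    out[r][c] = 0 if img[r][c] <= t else 1
    return out, thresholds
```

Collecting the returned `thresholds` list is how one can observe the spread of per-block thresholds, like the 0.4706–0.8 range we saw (on a normalized scale).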
Line Separation Detection
After the binarization, we tried to detect the separations between lines, so that once the lines are separated we can further split them into individual letters for OCR. Doing OCR on a full-page text image (at least with the tools we have found) is infeasible. Therefore, we concluded that this step and the next one, letter segmentation, are pre-processing necessities.
We first applied a rectangular averaging filter to measure horizontal character densities, aiming to divide the page into lines where the density is lowest, ideally zero. However, this did not work out so well, because the writing is tilted (the lines do not align perfectly horizontally). We also realized that image rotation would not provide sufficient results, since the tilt is not uniform: the degree of slant varies between lines and even within some lines.
Consequently, we now apply the rectangular filter after separating the image into column blocks, which yields a stepped (per-block) line separation. So far, we have not come across a downside to doing so.
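The per-block separation step can be sketched like this (Python for illustration; the input is one binarized column block with text pixels as 0, and the density profile plays the role of our averaging filter's output):

```python
def find_line_bands(block):
    # Count black (text) pixels in each row: a horizontal density profile.
    density = [row.count(0) for row in block]
    bands, start = [], None
    for i, d in enumerate(density):
        if d > 0 and start is None:
            start = i                    # a text line begins
        elif d == 0 and start is not None:
            bands.append((start, i))     # the line ends at an empty row
            start = None
    if start is not None:
        bands.append((start, len(density)))
    return bands
```

Because each column block is narrow, the slant within it is small enough that the empty rows between lines actually reach zero density, which is exactly what failed on the full-width profile. The same function applied to a transposed block gives vertical cuts, which is the idea behind the letter segmentation step described next.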
Letter Segmentation
We follow a similar tactic to line separation detection, only this time we apply the filter vertically. This showed promising results, albeit not as good as we had hoped. The reason seems to be that we are now working on a smaller section of the image than in line detection, so the slants of individual letters have a bigger impact on the results. We are working to improve this at the moment. Nevertheless, we are able to separate a significant portion of the letters, which we can now attempt to classify.
Feature Detection on Letters
In this stage, we again make use of the very high resolution of the images. We segment each letter into blocks to create an image digest, which acts as the feature vector of the test letter. For now, we count black pixels to determine the value of each dimension of the feature vector, but since our program is modular, we can easily swap this for something else in the coming weeks if it promises better results.
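A minimal Python sketch of this block-counting digest (the grid size is an illustrative assumption; our MATLAB code parametrizes it):

```python
def letter_features(letter, grid=4):
    # 'letter' is a binarized 2-D list with text pixels as 0. Divide it
    # into roughly grid x grid cells; each cell's black-pixel count becomes
    # one dimension of the feature vector.
    h, w = len(letter), len(letter[0])
    bh, bw = max(1, h // grid), max(1, w // grid)
    feats = []
    for r0 in range(0, h, bh):
        for c0 in range(0, w, bw):
            count = sum(1
                        for r in range(r0, min(r0 + bh, h))
                        for c in range(c0, min(c0 + bw, w))
                        if letter[r][c] == 0)
            feats.append(count)
    return feats
```

Since the digest only depends on per-cell counts, replacing the counting rule with another per-cell statistic later would not change the surrounding pipeline, which is the modularity mentioned above.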
As no training data set is available for these letters, we create one ourselves. The procedure is as follows:
- Get the feature vector for a test letter
- Compare it against the ones in the training data set.
- If the metric used returns dissimilarity for all training letters, manually label the test letter and add it to the training set.
- Otherwise, assign the test letter the label of the corresponding letter in the training set.
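The steps above can be sketched as follows (Python for illustration; the cutoff `tau` and the manual-labelling callback `ask_label` are hypothetical names, not part of our implementation):

```python
import math

def classify_or_enroll(features, train, tau, ask_label):
    # 'train' is a list of (feature_vector, label) pairs built so far.
    best_d, best_label = float("inf"), None
    for vec, label in train:
        d = math.dist(features, vec)     # Euclidean distance metric
        if d < best_d:
            best_d, best_label = d, label
    if best_d > tau:                     # dissimilar to every training letter
        label = ask_label()              # the manual (human) labelling step
        train.append((features, label))
        return label
    return best_label                    # nearest training letter's label
```

Note that on an empty training set the best distance stays infinite, so the first letter is always enrolled manually, which is how the set bootstraps itself.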
For now, we are using the Euclidean distance metric; however, there may be more suitable metrics available. We have not yet finalized the theoretical approach necessary to decide on this matter.
We are still investigating the crowd-sourcing option. It is actually a very viable option, but its completion time is very unpredictable: it could take a huge amount of time, depending on how quickly participants convert the text. Even so, our progress so far paves the way for it, as we have nearly finished segmenting the images into letters that can be fed into crowd-sourcing systems for transcription.
Additionally, we are currently in the process of arranging meetings with Italian colleagues to discuss the context of the documents in depth.