The work described is a part in the project of recognizing and classifying the ornaments in digitized books of Venetian archives. First step in this process is breaking the page contents into “blobs” of connected components. Then the contours of these blobs are extracted. Here is an example of the result of this process: contours are represented by colored polygons:
After the extraction part each blob is classified as text or ornament based on different features such as size, average intensity, etc. The problem that can be easily seen with the example above is that ornaments can have complex structure and they end up broken into many small pieces, where each small piece can really look similar to the character or a word (at least in terms of features that are used). So, one solution may be to combine blobs that represent small parts of the ornament, so the resulting blob is big enough to represent a piece of the ornament which is really different from the text.
The idea behind the algorithm is to identify the blobs that intersect or overlap in some sense and combine them into one bigger blob. This step is repeated until convergence.
Step 1: first we remove all the polygons that are entirely inside some other polygon. After this step the amount of polygons decreases. Which is really important as further steps are more computationally expensive.
Step 2: in this step we are trying to combine the polygons. We loop over all polygons and check if the convex hull of a particular one intersects with some other polygon. If it is, we create a new polygon as a convex hull of all the points in the intersecting polygons, add to the polygon list and remove two intersecting ones. This second step is repeated until the list of polygons doesn’t change anymore.
We considered many variations of this algorithm, but this one seems to be the most appropriate. More aggressive merging (for example, combining polygons if their convex hulls intersect, as opposed to taking only one convex hull as in the algorithm) may lead to letters in different lines getting combined into one blob which is highly undesirable (usually this happens when there is a tall letter like “f” in the line below one with letter like “g”).
One interesting idea is to add blur, so the letters will be merged into words. This, however, may again lead to parts of adjacent lines being combined. So, we tried adding only horizontal blur, which seems promising.
To find the parameters for blurring and compare different versions of algorithms we are planning to perform proper training of the model. However, current version of the algorithm is too complex to use it on the big data sets, so we need to find the way to optimize the performance and possibly simplify the algorithm.
Project status update
I found the task of blob aggregation much more difficult as I thought initially because of the problems stated above and also me being inexperienced with C++ in general and computational geometry libraries in particular. That’s why the first part was delayed and further issues arose, which need to be solved. Here is the updated version of the milestones:
- 09.04.2014: finished “blob” aggregation algorithm
- 16.04.2014: optimize the performance of the algorithm
- 30.04.2014: learn the best parameters for the algorithm(s) on the training set and choose the best one