In the previous post we introduced the algorithm used to aggregate the blobs. The idea behind it is to merge polygons that overlap in some sense, and to repeat the process until convergence. Although the algorithm produced seemingly good results, two questions remained:
- The average running time for one page was about five minutes. This rendered the whole thing useless, as processing one book (300 pages) would have taken around one day.
- The better we aggregate the pieces of ornaments, the more we risk merging different lines of text, which can cause classification errors. We have to select a version of the algorithm that balances this trade-off.
Complexity of the algorithm
The initial number of contours identified on a page can reach 2000. Our algorithm tries to identify the contours that should be merged, so in the worst case we have to consider all possible pairs of contours. But then we have to repeat the process, as the first iteration created new polygons, and those can be merged further. In general, the described process requires a cubic (in the initial number of polygons) number of pairwise polygon “checks,” and each check is a non-trivial operation by itself (we are taking the convex hull of a polygon), so the naive implementation was too slow overall.
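To make the cost concrete, here is a minimal sketch of the naive merge-until-convergence scheme. It is not our actual implementation: polygons are simplified to axis-aligned bounding boxes, a box-intersection test stands in for the real convex-hull check, and all function names are made up for illustration.

```python
def overlaps(a, b):
    """Do two axis-aligned boxes (xmin, ymin, xmax, ymax) intersect?"""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def merge(a, b):
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))

def naive_aggregate(boxes):
    """Scan all pairs; after every merge, restart the scan.
    Up to O(n) merges, each preceded by O(n^2) pair checks: cubic overall."""
    boxes = list(boxes)
    merged_any = True
    while merged_any:  # repeat until a full scan finds nothing to merge
        merged_any = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if overlaps(boxes[i], boxes[j]):
                    new = merge(boxes[i], boxes[j])
                    boxes = [b for k, b in enumerate(boxes) if k not in (i, j)]
                    boxes.append(new)
                    merged_any = True
                    break
            if merged_any:
                break
    return boxes
```

Every successful merge throws away the progress of the current scan, which is exactly where the wasted work comes from.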
One could try to memoize the pairs of polygons that have already been checked, but this implies a significant overhead for searching and updating the indices, and it complicates the algorithm. Another possibility we considered was restricting attention to neighboring polygons, so that we don’t check whether to merge two polygons that are far away from each other. However, estimating this neighborhood is expensive, and the neighborhoods need to be updated at each step.
The intuition behind the idea that we found the most useful comes from the following picture.
As we start merging the pieces of an ornament, there is a good chance that our polygon will “grow” further as we add the small, closely located parts of the ornament to it. So we keep a list of all polygons. As soon as we merge two of them, we remove them from the list and try to merge everything else with the new, bigger polygon. This way we quickly decrease the number of polygons we are looping over, and we “check” the most promising polygons first.
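Under the same simplifications as before (axis-aligned boxes instead of convex-hull checks, and hypothetical names), the idea can be sketched like this: every freshly merged polygon immediately tries to absorb the rest of the list before we move on.

```python
def overlaps(a, b):
    """Do two axis-aligned boxes (xmin, ymin, xmax, ymax) intersect?"""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def merge(a, b):
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))

def greedy_aggregate(boxes):
    """Pop a polygon and let it absorb everything it touches, re-scanning
    until it stops growing; only then move it to the results."""
    pending = list(boxes)
    result = []
    while pending:
        current = pending.pop()
        grew = True
        while grew:  # the merged polygon may now touch new neighbours
            grew = False
            rest = []
            for b in pending:
                if overlaps(current, b):
                    current = merge(current, b)
                    grew = True
                else:
                    rest.append(b)
            pending = rest
        result.append(current)  # finished growing
    return result
```

Each polygon leaves the pending list exactly once, so the list shrinks quickly when merges happen, and the most recently merged (most promising) polygon is always the one checked next.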
The new algorithm proved to be ~50x faster, dropping the time needed for one page to seconds.
The aggregation quality
As we discussed, there is a trade-off between the quality of ornament aggregation and accidentally merging polygons that contain words or symbols from different lines.
We compute an upper bound on the number of such unwanted mergers by comparing the resulting polygons to the so-called word boxes, created by another algorithm for word extraction. We find that in some cases our algorithm performs better (see the picture below), and it almost never merges incorrectly.
We find only 380 cases of possibly incorrect mergers among 192,467 initial polygons, which is less than a 0.2% error rate. Moreover, most of these cases are actually noise that gets merged with something, so the number of genuinely merged text lines is far smaller still.
We also ran the algorithm on ornaments to see the effect of aggregation. Note that many ornaments actually consist of many separate pieces, so one should not expect to always get one big merged polygon. The results: 90 000 polygons before aggregation, 15 000 after. So, on average, we get a reduction by a factor of six.
First, we have to re-evaluate the aggregation results, as the current dataset of ornaments is noisy in some sense. Finally, we want to see how the algorithm will affect the classification accuracy.