Background

As part of an ongoing project on the Venetian archives, a large number of books and manuscripts have been digitized. Naturally, these need to be processed automatically. One particular task is the recognition and classification of the ornaments found in these documents.
The important thing to know about the ornaments is that they were printed with hand-carved wooden blocks. That, of course, makes each ornament unique and provides a way to find books with the same ornaments, i.e. to classify them by publisher. This can be used to track the geographic origin of a particular book, approximate the period of printing, and potentially identify counterfeits.

The problem

The first step for all the tasks described above is to identify the ornaments on the pages and separate them from the text. To do that, we can identify separated areas on the page and extract their features to classify each area as either text or ornament.
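As a sketch of the kind of per-area features involved, here is a minimal, dependency-free example computing the area of a bounding box and the average intensity inside it. The function name and the (x, y, w, h) box format are illustrative assumptions, not the original pipeline.

```python
def box_features(image, box):
    """Return (area, mean_intensity) for an axis-aligned box.

    image: 2D list of grayscale values (0 = black ink, 255 = paper)
    box:   (x, y, w, h) in pixel coordinates
    """
    x, y, w, h = box
    pixels = [image[r][c] for r in range(y, y + h) for c in range(x, x + w)]
    area = w * h
    mean_intensity = sum(pixels) / len(pixels)
    return area, mean_intensity

# A tiny synthetic "page": a dark 2x2 blob on a white background.
page = [[255] * 4 for _ in range(4)]
for r in (1, 2):
    for c in (1, 2):
        page[r][c] = 0

print(box_features(page, (1, 1, 2, 2)))  # (4, 0.0): small and fully dark
```

Dense, uniformly dark boxes of word-like size suggest text; larger or sparser boxes lean toward ornament.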
With the segmentation above, the area of each "box", the average intensity inside the box, and other features can be used to differentiate text from ornaments. However, obtaining such a segmentation is a nontrivial task in itself. More complicated ornaments can have small separate pieces that look very much like text in terms of the features we described. To get the initial segmentation, we identify the contours that stand out in contrast from the background. Here is an example of a typical result.
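A real pipeline would typically use a library routine such as OpenCV's findContours for this step; as a dependency-free stand-in, the following sketch thresholds the page against the background and groups dark pixels into connected components, each of which yields one initial region. All names here are illustrative.

```python
from collections import deque

def dark_components(image, threshold=128):
    """Group pixels darker than `threshold` into 4-connected components.

    Returns a list of components, each a list of (row, col) pixels.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] < threshold and not seen[r][c]:
                # Breadth-first flood fill from this seed pixel.
                comp, queue = [], deque([(r, c)])
                seen[r][c] = True
                while queue:
                    cr, cc = queue.popleft()
                    comp.append((cr, cc))
                    for nr, nc in ((cr-1, cc), (cr+1, cc), (cr, cc-1), (cr, cc+1)):
                        if 0 <= nr < rows and 0 <= nc < cols \
                                and not seen[nr][nc] and image[nr][nc] < threshold:
                            seen[nr][nc] = True
                            queue.append((nr, nc))
                components.append(comp)
    return components

# Two separate dark blobs -> two initial regions.
img = [[255, 0, 255, 255],
       [255, 0, 255, 0],
       [255, 255, 255, 0]]
print(len(dark_components(img)))  # 2
```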
Notice how many contours are completely separated from each other: they don't intersect and can even be enclosed in non-intersecting boxes. Taken out of its surroundings, a small contour may very well be classified as a word or a letter. So the problem becomes quite obvious: we need to aggregate the polygons that are parts of one ornament into something bigger, something that can't be confused with text.

The challenge

One can imagine many different ways to combine all the pieces of an ornament; it doesn't seem that hard. But as soon as we start merging contours, we risk merging text contours with each other, or even text with ornaments when they are located near each other. This can cause the opposite classification mistake: confusing text for an ornament. Hence there is an obvious trade-off between quality (we shouldn't aggregate text too much) and quantity (we want to merge everything that comes from one ornament).

The algorithm

The first thing we do is remove the contours that are completely inside another contour. In the examples you have seen above there are many very small, precise contours located in the middle of the big ornament, and we want to get rid of those. Now, to actually aggregate something, we should look at contours that are close to each other in some sense. Our first idea came from observing contours like the following one:
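The containment filter can be sketched as follows, here simplified to axis-aligned bounding boxes rather than the actual polygons (function names and box format are illustrative): keep a region only if no other region fully contains it.

```python
def contains(outer, inner):
    """True if box `outer` fully contains box `inner`. Boxes are (x, y, w, h)."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh

def drop_nested(boxes):
    """Remove every box that lies completely inside some other box."""
    return [b for i, b in enumerate(boxes)
            if not any(i != j and contains(other, b)
                       for j, other in enumerate(boxes))]

boxes = [(0, 0, 10, 10),   # big ornament outline
         (2, 2, 3, 3),     # tiny piece inside it -> dropped
         (20, 0, 5, 5)]    # separate region -> kept
print(drop_nested(boxes))  # [(0, 0, 10, 10), (20, 0, 5, 5)]
```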
As you can see, the contours can be really big and complex; sometimes they literally wrap around other parts of the ornament. Therefore we decided to take convex hulls of the contours and work with those. With the example above we get the following hull, which already covers a big part of the ornament. So we have these convex polygons, and we have removed everything that was inside some bigger polygon. The next step is to combine the polygons that are "close" to each other in some sense. The obvious choice is to combine intersecting polygons, but unfortunately that proves to be too "aggressive" and leads to text from different lines getting merged. This is really undesirable, as it produces contours that no longer have the shape specific to words or groups of letters. The final version of our algorithm loops over all polygons and checks whether the convex hull of a particular one intersects some other polygon (the polygon itself, not its convex hull). If it does, we create a new polygon as the convex hull of all the points in the two intersecting polygons, add it to the polygon list, and remove the two intersecting ones. This step is repeated until convergence.

Complexity

The initial number of contours identified on a page can reach 2000. Our algorithm tries to identify the contours that should be merged, so in the worst case we have to consider all possible pairs of contours. But then we have to repeat the process, as new polygons created on the first iteration can be merged further. In general, the described process requires a cubic (in the initial number of polygons) number of pairwise polygon "checks", and each check is a non-trivial operation by itself (we are taking the convex hull of a polygon), so the naive implementation was too slow. We improve the performance using the observation that polygons very often "grow" further and further, absorbing many small pieces of the ornament.
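The naive merge-until-convergence step can be sketched in pure Python as follows. This is a simplified version: every piece is reduced to its convex hull up front and intersection is tested hull-vs-hull with a separating-axis check, whereas the text describes testing a hull against the original, possibly non-convex polygon. All names are illustrative.

```python
def convex_hull(points):
    """Andrew's monotone chain convex hull; returns vertices in CCW order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def convex_intersect(a, b):
    """Separating-axis test for two convex polygons (lists of vertices)."""
    for poly in (a, b):
        n = len(poly)
        for i in range(n):
            # Normal of edge i is a candidate separating axis.
            x1, y1 = poly[i]
            x2, y2 = poly[(i + 1) % n]
            ax, ay = y1 - y2, x2 - x1
            proj_a = [ax*px + ay*py for px, py in a]
            proj_b = [ax*px + ay*py for px, py in b]
            if max(proj_a) < min(proj_b) or max(proj_b) < min(proj_a):
                return False  # found a separating axis
    return True

def merge_until_convergence(polys):
    """Repeatedly replace any two intersecting polygons by the convex hull
    of their combined vertices, until no pair intersects."""
    polys = [convex_hull(p) for p in polys]
    merged = True
    while merged:
        merged = False
        for i in range(len(polys)):
            for j in range(i + 1, len(polys)):
                if convex_intersect(polys[i], polys[j]):
                    combined = convex_hull(polys[i] + polys[j])
                    polys = [p for k, p in enumerate(polys) if k not in (i, j)]
                    polys.append(combined)
                    merged = True
                    break
            if merged:
                break
    return polys

pieces = [[(0, 0), (2, 0), (2, 2), (0, 2)],   # overlaps the next square
          [(1, 1), (3, 1), (3, 3), (1, 3)],
          [(10, 10), (11, 10), (11, 11)]]      # far away, stays separate
print(len(merge_until_convergence(pieces)))   # 2
```

Note how each merge restarts the scan over the (shrinking) polygon list, which is exactly the source of the cubic worst case discussed above.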
We keep a list of all polygons. As soon as we merge two of them, we remove them from the list and try to merge everything else with the new, bigger polygon. This way we quickly decrease the number of polygons we are looping over, and we "check" the most promising polygons first. This approach proved to be ~50x faster than the naive implementation, dropping the time needed for one page to seconds.

The aggregation quality
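The optimized loop can be sketched like this, again simplified to axis-aligned boxes, with merging two boxes into their common bounding box standing in for taking a convex hull: pick a seed, keep absorbing everything that touches it until it stops growing, then set it aside. Names are illustrative.

```python
def boxes_intersect(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def bounding_box(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x, y = min(ax, bx), min(ay, by)
    return (x, y, max(ax + aw, bx + bw) - x, max(ay + ah, by + bh) - y)

def aggregate(boxes):
    """Greedy aggregation: grow one box at a time, absorbing every box it
    touches, so later passes loop over an ever-shrinking list."""
    pending, done = list(boxes), []
    while pending:
        current = pending.pop()
        grew = True
        while grew:
            grew = False
            rest = []
            for b in pending:
                if boxes_intersect(current, b):
                    current = bounding_box(current, b)  # absorb b and keep growing
                    grew = True
                else:
                    rest.append(b)
            pending = rest
        done.append(current)
    return done

boxes = [(0, 0, 2, 2), (1, 1, 2, 2), (2, 2, 2, 2),  # chain of overlaps
         (10, 10, 1, 1)]                             # isolated
print(aggregate(boxes))  # [(10, 10, 1, 1), (0, 0, 4, 4)]
```

Because each finished polygon leaves the working list for good, the pairwise checks concentrate on the currently growing polygon instead of rescanning all pairs after every merge.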
We find only 380 cases of possibly incorrect mergers among 192467 initial polygons, which is an error rate below 0.2%. Most of these cases are actually accounted for by noise: text from different lines can be connected due to blots and scanning artifacts. We also ran the algorithm on ornaments only, to see the effect of aggregation. Note that many ornaments really do consist of many separated pieces, so one should not expect to always get one big merged polygon. The results: 90 000 polygons before aggregation, 15 000 after. So on average we get a reduction by a factor of 6.
We find the following results for classification accuracy: even though the accuracy was very high initially, we were able to achieve a visible improvement. One interesting detail for further study is that the aggregation changed the "average" contour in some ways, so the features we used for classification before might need to be reconsidered to get even better performance.