The overall objective of this project is to recognize and classify the ornaments in the digitized books of the Venetian archives. This can help trace the origin of a book, because ornaments were made with unique, handcrafted woodblocks.
As a first step towards this goal, one needs to locate the ornament on the page and separate it for further classification. The technique used for this is connected components aggregation: the page contents (text and possibly ornaments) are broken into “blobs”, each representing a piece that appears as a single unit. A “blob” can be a word, a sentence, a few lines of text if the line spacing is small, or an ornament. A polygon is then extracted as an approximation of the shape of the “blob”. Finally, several features of these polygons are used to differentiate pieces of text from ornaments, such as size, orientation, and some characteristics of the image inside the polygon (contrast, density, etc.).
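To make the pipeline above more concrete, here is a minimal sketch in plain Python. It is not the code used in the project: the function names `label_blobs` and `blob_features` are hypothetical, the page is a tiny hand-written binary grid, and an axis-aligned bounding box stands in for the extracted polygon. It only illustrates the general idea of labelling connected components and computing a few simple features per “blob”.

```python
from collections import deque


def label_blobs(grid):
    """Group ink pixels (value 1) into 8-connected blobs.

    Returns a list of blobs, each a list of (row, col) pixel coordinates.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and grid[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            blobs.append(pixels)
    return blobs


def blob_features(pixels):
    """Compute simple features; the bounding box is a stand-in for the polygon."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return {
        "bbox": (min(ys), min(xs), max(ys), max(xs)),  # top, left, bottom, right
        "area": width * height,                        # size of the box
        "aspect": width / height,                      # crude orientation cue
        "density": len(pixels) / (width * height),     # share of ink in the box
    }


# Toy page: a 2x2 square and a vertical stroke.
page = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
]
blobs = label_blobs(page)
features = [blob_features(b) for b in blobs]
```

On a real scan one would binarize the image first and most likely rely on an image-processing library rather than a hand-written flood fill. The features shown here are only the simplest ones from the list above.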
This method is already implemented and achieves quite high accuracy: about 98% of the “blobs” are correctly classified. However, there is a problem with the construction of the polygon shape for ornaments. Consider this one for example:
One can imagine that the algorithm could extract a ring-shaped polygon for the outer ring and another, circular one for the inner part of the ornament. Even if we use only convex polygons, there are still many examples where the ornament is broken into several pieces corresponding to different connected components, which is of course a big problem for further classification. Moreover, a small piece of an ornament may not be representative in terms of the features used for ornament-text classification and may therefore end up classified as text. Currently I am working on modifying the algorithm so that ornaments are recognized as a whole.
The idea I am trying to implement is to use geometric properties of the “blobs” and their relative positions to understand which ones should be merged and how the merging needs to be executed. Unfortunately, I don’t have any interesting results to show right now, but I’ll be updating the post as soon as I get some.
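As an illustration of what such a merging step might look like, here is a small sketch in Python. The actual merging criterion is still being worked out, so everything here is an assumption: the helper names `box_gap` and `merge_boxes`, the use of bounding boxes instead of polygons, and the rule itself (merge two “blobs” when the gap between their boxes is at most `max_gap` pixels, which also covers one box lying inside another).

```python
def box_gap(a, b):
    """Gap in pixels between two boxes given as (top, left, bottom, right).

    Returns 0 when the boxes overlap or one contains the other.
    """
    dy = max(a[0] - b[2], b[0] - a[2], 0)
    dx = max(a[1] - b[3], b[1] - a[3], 0)
    return max(dx, dy)


def merge_boxes(boxes, max_gap):
    """Merge boxes that are closer than max_gap (hypothetical rule), transitively."""
    parent = list(range(len(boxes)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if box_gap(boxes[i], boxes[j]) <= max_gap:
                parent[find(i)] = find(j)

    # Collect the boxes of each group and replace them by their common box.
    groups = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)
    return [
        (
            min(b[0] for b in g),
            min(b[1] for b in g),
            max(b[2] for b in g),
            max(b[3] for b in g),
        )
        for g in groups.values()
    ]


# Toy example: an outer ring, its inner part, and a distant line of text.
ring = (0, 0, 20, 20)
inner = (8, 8, 12, 12)
text_line = (40, 0, 44, 30)
merged = merge_boxes([ring, inner, text_line], max_gap=5)
```

In this toy case the ring and its inner part collapse into one box while the line of text stays separate. A distance threshold alone would also glue an ornament to nearby text, which is why other geometric properties of the “blobs” would have to enter the rule as well.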
Project status update
The problem described above wasn’t originally part of the project plan, so the plan has obviously changed. Also, I initially underestimated the effort needed to learn the libraries used for the task, which delayed progress. Here is the updated project calendar:
- 20.03.2014: finish the work on the “blob” aggregation algorithm
- 09.04.2014: design the first classifier to obtain clusters of “look-alike” ornaments, i.e. the ornaments belonging to one publisher
- 09.05.2014: design the second classifier to separate counterfeit ornaments from original ones
I should note that I am doing the first step individually, but for the classification the plan is to combine the efforts of all project members, as it seems to be much more complex. The project plan for the later tasks should therefore be considered a first approximation and will be updated as soon as we start on the classification.