The aim of our project was to build a web-based image recognition application for recognizing Venetian palaces. Venice has over 200 palaces. These palaces are distinctive and are known for their carefully chosen architectural features. An ideal recognition algorithm should therefore be designed to exploit these features.
Image recognition is a fairly recent technology with applications in many aspects of day-to-day life. The literature offers a plethora of image recognition algorithms, and choosing among them is a daunting task. There is a fundamental tradeoff between the precision of an algorithm and its complexity. In our case, we wish to maximize accuracy while still returning results quickly, since the end users of the application are impatient humans!
The basic objective of our application was to build a model, or classifier, that takes an image uploaded by a user and recognizes the palace in it as accurately as possible. An important requirement is that the application place no constraints on the uploaded image, i.e., on the lighting conditions or the angle from which it was taken. Constructing a classifier requires training on an existing annotated dataset. Thus, the project also involved collecting a large dataset of images and labeling them.
Venetian Palace Dataset Creation
Our first attempt at creating a dataset used images scanned from two books (Rössler, 2010; Fasolo, 2003). This helped in defining the categories for a basic classifier. The next task was to populate these categories, and our natural choice for this was Google Image Search. We wrote a Python script to download images in bulk for each category using the Google API. For certain categories, i.e., palaces, we succeeded in getting more than 60 good images (given that the Google API allows only 64 images). To enlarge the dataset further, we considered taking multiple patches from the downloaded images.
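The idea of taking multiple patches from each downloaded image can be sketched as follows. This is a minimal illustration, not the script used in the project: the function name `patch_boxes`, the patch size, and the stride are assumptions chosen for the example, and the resulting boxes would be applied to the image with whatever library loads it.

```python
def patch_boxes(width, height, patch=128, stride=64):
    """Return (left, top, right, bottom) boxes of overlapping square patches.

    The patch and stride defaults are illustrative, not the project's settings.
    """
    boxes = []
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            boxes.append((left, top, left + patch, top + patch))
    return boxes


# A 256x256 image with 128-pixel patches and a 64-pixel stride gives a 3x3 grid.
boxes = patch_boxes(256, 256)
print(len(boxes), boxes[0], boxes[-1])
```

Overlapping patches multiply the number of training samples per downloaded image, at the cost of samples that are strongly correlated with one another.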
Palace Recognition – Methodology
In our pipeline, one significant step is choosing a feature descriptor that closely matches architectural features. We implemented hierarchical patch extraction from the palace image, followed by SIFT feature extraction from the patches. SIFT features are widely used for images. Since our palace images contain concentrated, distinctive structures, we extract SIFT features from patches that are rich in architectural features. We consider a patch architecturally rich if more keypoints are detected in it. We clustered all the SIFT features into a number of clusters, which determines the size of our vocabulary (dictionary). We incorporated the spatial information of these features using Spatial Pyramid Matching (Lazebnik et al., 2006), resulting in the final feature vector for the learning model. We built the spatial pyramids around the high-density regions of keypoints, where we expect to find more architectural features. This differs from traditional SPM, where grids are laid over the entire image. The hierarchical pyramid construction is shown in figure 1.
We experimented with SVM, boosting classifiers, and Random Forests, among others. We decided to use K-Nearest Neighbours, in a supervised fashion, as our learning model. It is a standard technique, and it was fast and accurate in our predictions, as shown in table 1. Our pipeline for visual recognition of palaces is shown in figure 2.
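To make the density-driven pyramid more concrete, here is a rough sketch under several assumptions. Keypoints are given as `(x, y, word)` tuples that have already been assigned a visual-word index. The helper names `densest_window` and `pyramid_histogram`, the window size, and the stride are hypothetical. "Two levels" is taken to mean a 1×1 grid plus a 2×2 grid, and the per-level weighting used in the original SPM formulation is left out for brevity.

```python
def densest_window(keypoints, width, height, win=128, stride=32):
    """Return the (left, top, right, bottom) window holding the most keypoints."""
    best_box, best_count = None, -1
    for top in range(0, max(height - win, 0) + 1, stride):
        for left in range(0, max(width - win, 0) + 1, stride):
            count = sum(1 for x, y, _ in keypoints
                        if left <= x < left + win and top <= y < top + win)
            if count > best_count:
                best_box, best_count = (left, top, left + win, top + win), count
    return best_box


def pyramid_histogram(keypoints, box, vocab_size, levels=2):
    """Concatenate visual-word histograms over 1x1, 2x2, ... grids inside box."""
    left, top, right, bottom = box
    inside = [(x, y, w) for x, y, w in keypoints
              if left <= x < right and top <= y < bottom]
    feature = []
    for level in range(levels):
        cells = 2 ** level  # number of grid cells per side at this level
        hist = [[0] * vocab_size for _ in range(cells * cells)]
        for x, y, w in inside:
            col = min(int((x - left) * cells / (right - left)), cells - 1)
            row = min(int((y - top) * cells / (bottom - top)), cells - 1)
            hist[row * cells + col][w] += 1
        for cell in hist:
            feature.extend(cell)
    return feature


# Toy keypoints in a 256x256 image with a 3-word vocabulary.
points = [(10, 10, 0), (200, 200, 1), (210, 205, 1), (230, 240, 2), (150, 230, 0)]
box = densest_window(points, 256, 256)
print(box, pyramid_histogram(points, box, vocab_size=3))
```

The grid is laid over the densest window rather than the whole image, which is the difference from traditional SPM described above.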
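For readers unfamiliar with K-Nearest Neighbours, the classification rule can be sketched in a few lines of plain Python. This is an illustration of the general technique, not the project's implementation: the function name `knn_predict`, the Euclidean distance, the toy 2-D vectors, and the labels `palace_a` and `palace_b` are all placeholders.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label query by majority vote among its k nearest training vectors."""
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy feature vectors for two hypothetical categories.
train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["palace_a"] * 3 + ["palace_b"] * 3
print(knn_predict(train, labels, [0.5, 0.5]))
print(knn_predict(train, labels, [5.5, 5.5]))
```

There is no training phase beyond storing the labeled vectors, and prediction is a distance computation against each of them.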
We collected images for 60 categories of palaces and buildings in Venice, and named the collection the Venetian Palace dataset. We succeeded in collecting more than 60 images for 10 of these categories, which are the ones we used for our project. The project can be scaled to more categories as we obtain more images for them. For training, all images were resized to 256×256 for uniformity and then whitened. We first computed the SIFT descriptors for each image patch; the patches are taken hierarchically from the image, as determined by the density of keypoints. We created a dictionary of size 200 by clustering the SIFT features with K-means, run for 5000 iterations. This dictionary is later used for the hierarchical Spatial Pyramid Matching with 2 pyramid levels, which yields the final feature vector. We fed this feature vector to a K-Nearest Neighbours supervised learning model, which we trained on 80% of our dataset and tested on the remaining 20%. The model considers at most 10 nearest neighbours.
Visual recognition tasks favour fast and accurate classification methods. We tried different machine learning techniques, such as the Multi-Layer Perceptron (MLP), Random Forests, and kernel methods like SVM. The MLP was slow and had low prediction accuracy. We moved on to an ensemble of random trees, that is, a Random Forest, which was fast to compute but gave low accuracy. K-Nearest Neighbours predicted the image labels with good accuracy and in appreciably less time, as shown in table 1.
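The dictionary-building step can be illustrated with a small pure-Python version of Lloyd's K-means followed by bag-of-words quantization. This is only a sketch under stated assumptions: it uses toy 2-D points instead of 128-dimensional SIFT descriptors, a vocabulary of 2 words instead of 200, and hypothetical helper names (`kmeans`, `nearest`, `bow_histogram`). A real pipeline would more likely call an existing routine such as `cv2.kmeans` or `scipy.cluster.vq.kmeans2`.

```python
import math
import random


def nearest(centroids, p):
    """Index of the centroid closest to p, i.e. its visual word."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], p))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's K-means; returns k centroids (the visual vocabulary)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(centroids, p)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def bow_histogram(descriptors, centroids):
    """Count how many descriptors are assigned to each visual word."""
    hist = [0] * len(centroids)
    for d in descriptors:
        hist[nearest(centroids, d)] += 1
    return hist


# Two well-separated toy clusters stand in for SIFT descriptors.
descriptors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
vocabulary = kmeans(descriptors, k=2)
print(bow_histogram(descriptors, vocabulary))
```

Each image (or each pyramid cell) is then represented by such a histogram over the vocabulary, which is the input to the classifier.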
|Classifier|Accuracy (%)|Time for 1 iteration check (min)|
|Random Forests|40.00 ± 2.05|1|
|SVM|46.58 ± 2.73|0.5|
|K-Nearest Neighbours|75.52 ± 1.84|0.2|
We see that K-Nearest Neighbours outperforms the other classifiers in accuracy. Our entire pipeline for visual recognition of palaces is coded in Python using the OpenCV and SciPy libraries.
To develop the web application, we used Django, a Python-based web application framework. We built a basic file-manager application that takes a valid image from the user and passes it to the image processing pipeline implemented in Python. We tested the integration of OpenCV with the web framework and verified that the images supplied by the user were being processed by OpenCV. We then display the classification results on the results page of our web application. The results page will also include the location of the palace on Google Maps and some additional information about the palace. The website will be demonstrated on the day of the poster presentation. The following images give an idea of what our website looks like.
As the results show, K-Nearest Neighbours outperforms the other algorithms in both accuracy and latency for our application. The classifier could now be extended to recognize more palaces. This will require collecting more images, so a visit to the city of Venice will definitely pay off! Another way to enlarge the dataset would be to introduce online learning algorithms, so that every time a user uses the web application, the uploaded image is saved to our database. For this, of course, we need the user’s consent to save their pictures!
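The text does not state how the "±" figures in the table were obtained. One common way to produce such entries is to repeat the 80/20 split several times and report the mean and standard deviation of the test accuracy. The sketch below shows that procedure under this assumption; the function names and the sample accuracies are illustrative and are not the project's measurements.

```python
import random
import statistics


def split_80_20(items, seed):
    """Shuffle a copy of items and return (train, test) with an 80/20 split."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict, test_items):
    """Percentage of (features, label) pairs for which predict is correct."""
    correct = sum(1 for features, label in test_items if predict(features) == label)
    return 100.0 * correct / len(test_items)


def summarize(accuracies):
    """Mean and sample standard deviation, as in an 'a ± b' table entry."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Illustrative accuracies from three hypothetical repeated splits.
mean, std = summarize([74.0, 76.0, 78.0])
print(f"{mean:.2f} ± {std:.2f}")
```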
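The text does not say how an upload is judged to be a "valid image". One lightweight check, shown here purely as an assumption, is to inspect the leading bytes of the file for the well-known JPEG and PNG signatures before handing it to the pipeline. The helper name `sniff_image_type` is hypothetical. In a Django project, a form `ImageField` (which relies on Pillow) would normally perform a more thorough validation.

```python
JPEG_MAGIC = b"\xff\xd8\xff"       # start-of-image marker of a JPEG file
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"   # fixed 8-byte PNG signature


def sniff_image_type(data):
    """Return 'jpeg', 'png', or None based on the leading bytes of an upload."""
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    if data.startswith(PNG_MAGIC):
        return "png"
    return None


print(sniff_image_type(b"\xff\xd8\xff\xe0" + b"\x00" * 16))
print(sniff_image_type(b"not an image"))
```

A signature check only rejects obviously wrong files; decoding the image is still needed to catch truncated or corrupted uploads.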
 Rössler, Jan-Christoph. I Palazzi Veneziani. Venezia: Fondazione Giorgio Cini, 2010. Print.
 Fasolo, Andrea. Palazzi Di Venezia. Venezia: Arsenale, 2003. Print.
 Lazebnik, Svetlana, Cordelia Schmid, and Jean Ponce. “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories.” Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. Vol. 2. IEEE, 2006.
 Yang, Jianchao, et al. “Linear spatial pyramid matching using sparse coding for image classification.” Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.