Tourist Pictures


The Tourist Pictures project fuses several computer science techniques to achieve the initial research goal: reconstructing 3D models from Internet images. Using a Python script that performs semantic queries, we retrieve a number of images from Google's servers for further processing. Throughout the project, computer vision techniques are used to filter and classify the downloaded images. For a computationally expensive task such as classification, we devised a filtration method that discards irrelevant images and balances the computational cost of canonical image selection against the misclassification rate. With the support of 123D Catch, we obtained several successful 3D models using the canonical images as input.



The ways people get to know a city vary; the common practice is to read official booklets or visit official websites. We would instead like to present how tourists see a city, using the pictures they take. By analysing a large image dataset, we extract 2D tourist pictures of Venice and reconstruct them into 3D models.

During the research, several computer vision methods were developed to select pictures for 3D reconstruction. We first wrote a Python script using urllib2 to download images automatically. After obtaining an overabundance of landmark images, we applied a series of methods to shrink the image set and keep the "good" images for 3D reconstruction. We use 123D Catch for the 3D reconstruction itself because it is professional and reliable.



  1. Picture acquisition

A Python script based on urllib2 downloads images from the Google image servers systematically and automatically. The script takes a landmark's name as input and downloads the related images; its function is the same as typing the landmark's name into the Google Image search box and downloading all of the retrieved items. This automatic approach helps ensure the diversity and representativeness of the tourist pictures collected from the web. We use a standard Python HTTP API together with the Google API to download images from Google's AJAX servers, which unfortunately allows downloading at most 64 images per query; there is no supported way to download thousands of Google images programmatically. The following steps are therefore based on 64 images. Consequently, the classification algorithms below operate on a very limited number of images, but our method could equally be applied to a large dataset.
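The download step can be sketched as follows. The helper names are ours, and the AJAX image-search API shown here has long been deprecated by Google, but it illustrates the 8-results-per-page, 64-results-total limit the project ran into:

```python
import json
import urllib.parse
import urllib.request  # urllib2 in the original Python 2 script

API = "https://ajax.googleapis.com/ajax/services/search/images"

def build_query_urls(landmark, page_size=8, max_results=64):
    """Build the paged request URLs for one landmark query.

    The deprecated Google AJAX image-search API returned at most 8
    results per page and refused start offsets beyond 56, which is
    why the whole pipeline is capped at 64 images.
    """
    urls = []
    for start in range(0, max_results, page_size):
        params = urllib.parse.urlencode({
            "v": "1.0",
            "q": landmark,
            "rsz": page_size,
            "start": start,
        })
        urls.append(API + "?" + params)
    return urls

def download_page(url):
    """Fetch one result page and return its image URLs (needs network access)."""
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return [r["url"] for r in data["responseData"]["results"]]
```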

  2. Classification

The classification here is not the same as classification in machine learning, since technically we did not use any machine learning approach. Classification in this project has only two categories: images suitable for 3D reconstruction and images unsuitable for it. Filtration steps are applied to shrink the image set and reduce the computational cost of canonical image selection.

2.1 Filtration of downloaded images

Taking the query result for "San Giorgio Maggiore" as an example, there are typically four types of images in the query dataset: exteriors of landmarks, interiors of landmarks, paintings, and sketches. The 3D reconstruction only takes exterior images of landmarks, so we developed a classification strategy to detect and delete the other three types.

The general classification strategy, which here can also be regarded as filtering out the images inappropriate for 3D reconstruction, consists of two filtering steps. By inspecting the automatically downloaded results, we found that blue sky and sea appear in most of the tourist pictures. The first filtering step is based on exactly this characteristic: it eliminates images whose blue component is too small or too large. For example, the image in Figure 1 contains a reasonable blue component but the image in Figure 2 does not, so the latter is filtered out.


Figure 1  Image and mask.

A white pixel in the mask indicates that the pixel belongs to either the sky or the sea.


Figure 2  Image and mask.

The mask is almost entirely black because the image contains neither sky nor sea. We also do not want images taken indoors.
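A minimal sketch of the first filter, using plain NumPy; the blue-dominance margin and the keep-range thresholds below are illustrative values, not the ones used in the project:

```python
import numpy as np

def blue_ratio(image):
    """Fraction of pixels whose blue channel dominates red and green.

    `image` is an H x W x 3 uint8 RGB array; a pixel counts as "blue"
    (sky or sea) when B clearly exceeds both R and G.
    """
    img = image.astype(np.int16)          # avoid uint8 overflow in comparisons
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    mask = (b > r + 20) & (b > g + 20)    # illustrative dominance margin
    return mask.mean()

def passes_blue_filter(image, low=0.05, high=0.8):
    """Keep images with a reasonable amount of sky/sea, drop the rest."""
    return low <= blue_ratio(image) <= high
```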

After the first filter we obtain a subset of the original dataset; in our experiments it eliminates about 30% of the images. If the original dataset held 1000 images, the intermediate result would still be large, so we apply a second filter that assigns a score to each image. The score is based on histogram analysis: we first compute the pairwise histogram distances between all images, which for a dataset of n images yields an n×n distance matrix. Summing each column of this matrix gives the score, which therefore measures the overall distance from one image to the rest of the dataset. Images with small overall distance to the others are passed to the next step; this also filters out irrelevant images.
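The second filter can be sketched as follows. Grey-level histograms and the L1 distance are our simplifying assumptions; the idea of summing the columns of the pairwise distance matrix is as described above:

```python
import numpy as np

def histogram(image, bins=32):
    """Normalised grey-level histogram of a uint8 image."""
    h, _ = np.histogram(image, bins=bins, range=(0, 256))
    return h / h.sum()

def outlier_scores(images):
    """Sum of pairwise L1 histogram distances, one score per image.

    A small score means the image is close to the rest of the set;
    large scores flag irrelevant outliers to be filtered out.
    """
    hists = [histogram(im) for im in images]
    n = len(hists)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dist[i, j] = np.abs(hists[i] - hists[j]).sum()
    return dist.sum(axis=0)   # column sums of the n x n distance matrix
```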


Figure 3. Two good candidate images and their histograms.

2.2  SIFT-matching adjacency matrix generation

After filtering the input image set, we have a subset of relevant images with a certain confidence. But the subset is still too large for 3D reconstruction and contains some redundant images. One idea would be to first select several pivot images of the landmark by hand and use them as classification rules; however, since we designed the classification process to be fully automatic, we avoid any manual user action. We instead perform a pairwise SIFT-matching distance calculation between every two images in the subset, which is similar to the histogram distance computation above but far more computationally expensive. This is why the previous two steps had to filter out irrelevant images first.


Figure 4.

Here we use SIFT detection to extract keypoints in every image; this step is the basis for everything that follows. As shown in the figure above, matches are computed from the precalculated SIFT descriptors of two images using FLANN (a nearest-neighbour method for high-dimensional spaces). The quality of the matches between two images defines their distance, which is also called confidence. The confidence is calculated by a successive measure already implemented in the OpenCV source code. First, for two geo-adjacent images we can find the homography between them, which defines a projective relation; OpenCV estimates this homography with a statistical method (RANSAC, which is another topic). If a keypoint in the first image, projected by the homography, lands on its matched keypoint in the second image, the match is counted as good. The confidence value is then derived from the number of good matches and the number of all matches:

Confidence = Number(good matches) / (8 + 0.3 × Number(all matches))

Confidence = (Confidence > 3.0) ? 0 : Confidence

The weights in the first formula come from the OpenCV implementation and are known to perform well. The second formula says that if the confidence between two images is too large, we believe they are so similar that they would be treated as duplicates in the 3D reconstruction; we therefore set their confidence to 0 to mark them as duplicates.
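The two formulas above translate directly into code; this mirrors the confidence measure in OpenCV's stitching pipeline, with the duplicate threshold of 3.0 kept as a parameter:

```python
def pair_confidence(num_good, num_total, dup_threshold=3.0):
    """Match confidence between two images, as in the formulas above.

    num_good  -- matches consistent with the estimated homography (RANSAC inliers)
    num_total -- all FLANN matches between the two SIFT descriptor sets

    A very high confidence means near-duplicate views, which add nothing
    to the reconstruction, so it is zeroed out.
    """
    confidence = num_good / (8 + 0.3 * num_total)
    return 0.0 if confidence > dup_threshold else confidence
```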

We use this confidence to select canonical images. After the two filters, a dataset of n images again yields an n×n matrix; every entry (i, j) measures how confident we are that image i and image j are related. This value is calculated from SIFT detection and feature matching, and images taken under similar conditions usually have a higher confidence.

We then apply a leaveBiggestComponent step to keep only the biggest connected component of the match graph, where the confidence along each edge of the component must be above a threshold.
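A simple breadth-first search over the thresholded confidence matrix reproduces the effect of leaveBiggestComponent; the threshold value here is illustrative:

```python
import numpy as np

def leave_biggest_component(conf, threshold=1.0):
    """Indices of the largest connected component of the match graph.

    Two images are connected when their pairwise confidence exceeds
    `threshold`; a BFS collects each component and the largest wins.
    """
    n = len(conf)
    adj = np.asarray(conf) > threshold
    seen, best = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, queue = [], [start]
        seen.add(start)
        while queue:
            u = queue.pop()
            comp.append(u)
            for v in range(n):
                if adj[u, v] and v not in seen:
                    seen.add(v)
                    queue.append(v)
        if len(comp) > len(best):
            best = comp
    return sorted(best)
```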

2.3   Canonical image selection by analysing eigenvector centrality

After obtaining the adjacency matrix representing the relationships between the 40 images, we would like to see which image is the most important and which is the least important; for this we visualise their relationships. We implement the visualisation with d3.js, a JavaScript library for manipulating documents based on data.

In graph theory and network analysis, the centrality of a vertex measures its relative importance within a graph. Eigenvector centrality measures the influence of a node in a network: it assigns relative scores to all nodes based on the idea that connections to high-scoring nodes contribute more to a node's score than equal connections to low-scoring nodes. Let the adjacency matrix be A; from the eigenvalue equation we obtain the eigenvalues and the eigenvector corresponding to each eigenvalue:

Ax = λx

Only the greatest eigenvalue yields the desired centrality measure: the v-th component of the corresponding eigenvector gives the centrality score of vertex v in the network.

First, we calculate the eigenvector centrality of the adjacency matrix with networkx, a Python software package for network analysis. We then rank all the eigenvector centrality values from largest to smallest.
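For readers without networkx, the same ranking can be obtained with a few lines of NumPy power iteration on the adjacency matrix (the small (A + I) shift only guards against oscillation on bipartite graphs and leaves the eigenvectors of A unchanged):

```python
import numpy as np

def eigenvector_centrality(adj, iters=200):
    """Leading-eigenvector scores of the adjacency matrix via power iteration.

    Repeatedly applying (A + I) and renormalising converges to the
    eigenvector of the largest eigenvalue of Ax = lambda x, whose
    components are the centrality scores.
    """
    A = np.asarray(adj, dtype=float)
    x = np.ones(len(A))
    for _ in range(iters):
        x = A @ x + x               # power step on (A + I)
        x /= np.linalg.norm(x)
    return x

def rank_images(adj):
    """Image indices sorted from most to least central."""
    return list(np.argsort(-eigenvector_centrality(adj)))
```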


Figure 5  Ranking of eigenvector centrality.

We save the node attributes (names and eigenvector centrality values) and the edge attributes (weights taken from the matrix). Afterwards, we lay the data out with a d3.js force-directed layout.
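The export step can be sketched as follows; the function name and JSON field names are our choices, matching the nodes/links shape that d3's force layout expects:

```python
import json

def export_graph(names, centrality, conf, threshold=0.0):
    """Serialise the match graph for a d3.js force-directed layout.

    Produces the {"nodes": [...], "links": [...]} structure that d3's
    force layout consumes: each node carries its name and eigenvector
    centrality value, each link the pairwise confidence as its weight.
    """
    nodes = [{"name": n, "centrality": c} for n, c in zip(names, centrality)]
    links = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):   # upper triangle: undirected graph
            if conf[i][j] > threshold:
                links.append({"source": i, "target": j, "weight": conf[i][j]})
    return json.dumps({"nodes": nodes, "links": links})
```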

  3. 3D reconstruction

The input image set for 3D reconstruction should contain fewer than 30 images; otherwise the reconstruction becomes extremely slow. We chose 123D Catch for this task because it is professional and reliable. Each 3D reconstruction takes about half an hour and returns either a 3D model or a failure message. Occasional failures occur when the input image set is still geometrically insufficient for the reconstruction task: either the original image set does not contain enough geometrically related images, or the classification method fails to retrieve such a subset. A successful 3D model contains the mesh and texture calculated by 123D Catch, and with the support of its web app we can review the model easily.



  1. Results of the classification method

[d3-source canvas="wpd3-86-0"]

Figure 6  Visualisation by d3.js.

As shown in the figure, the size of a node represents its eigenvector centrality value, while the edges stand for connections between nodes, with red lines representing strong connections and black lines weak ones. Node 32 has the largest number of red lines, so it is the most important node in the network.

  2. 3D-reconstruction results

With the classification results in hand, we tested them in the 3D reconstruction software on several buildings in Venice. For each building, a well-selected tourist picture is shown on the left and the 3D reconstruction result on the right.

1)       San Giorgio Maggiore


2)       Ponte di Rialto


3)       Basilica di Santa Maria della Salute


4)       Chiesa dei Gesuati


As can be seen, due to the restrictions on where tourists can stand to take photos, we cannot obtain pictures of these buildings from all angles, so we can only reconstruct part of each building. Also, comparing 3) Basilica di Santa Maria della Salute with the others shows that the more angles the pictures cover, the better the reconstruction.


By applying the pairwise SIFT-matching algorithm, our pipeline ultimately outputs image subsets sufficient for successful 3D reconstruction. Using semantic queries we gathered seven landmark samples; 2 of the 7 failed for the reasons mentioned above. Although tourists' pictures lack consistent characteristics, we can conclude that a proper filtration and recognition method leads to an image subset good enough for 3D reconstruction.


