All posts by

Histogram features

What we have done so far

We used CK1 distance as a feature for classification as explained in the previous blog post. However, CK1 distance happened to be not suitable for our project. Indeed, we found ourselves with complete different images grouped into the same cluster.

We therefore decided to change the CK1 distance and looked for another possible “measure” of the similarity of two images.

We found a method called structural similarity (SSIM) defined below (from here)

Capture d’écran 2015-04-22 à 19.40.37

We implemented this SSIM measure and used it on the data set, but once again, the results obtained were not satisfying.

New approach

We take now one step backward: we skip the clusterization part based on the size of the images (what we did using MLDemos).

Each image is represented as a vector v corresponding to its pixel distributions along x and y-axis.

In order to have all our images represented as vectors of same dimensions, we set the vectors’ size n to:

n= xmax + ymax

where xmax is the biggest width among the set of images and ymax the biggest height among the set of images.

By doing so, we ensure our images to be represented into the same space.

Building vector v for each image

000177-2.jpg from data set
000177-2.jpg from data set
histogram of the pixel distribution along the x-axis
histogram of the pixel distribution along the x-axis of 000177-2.jpg
histogram of the pixel distribution along the y-axis of 000177-2.jpg

Two parts: v1 and v2; v1 of size xmax (biggest width among images) and v2 of size ymax ( biggest height).

Building the vector v
Building the vector v

We fill v1 with the histogram distribution values: histx on the x-axis and complete the resulting parts of v1 with zeros; we do the same for v2: the first components of v2 are the histogram distribution values: histy on the y-axis and the rest are zeros.

Hierarchical clustering

Our data is now the matrix ImMat of 8839 (number of images in data set) lines; each line is the vector v built as explained above for each images.

This matrix is the input of the Matlab function linkage.

Linkage works as follow:

  • At the beginning, each image (line of ImMat ) belongs to its own cluster.
  • Linkage groups two images that have the shortest Euclidian distance among all the possible combinations of Euclidian distances of two images.
  • Tuned parameter: complete: linkage takes the biggest Euclidian distance to compare a cluster to an image.
Dendrogram on 8839 images
Dendrogram on 8839 images

We obtained the above dendrogram.

On the x-axis one can(‘t) see the images’ number and on the y-axis the Euclidian distances.

Intuitively, we want to establish a threshold, for example a horizontal line corresponding to a certain Euclidian distance from which the images are considered similar and another lower threshold from which the images are considered counterfeits.


If we now on focus on one threshold from which images are considered similar, the above image is an illustration of what we obtain with a threshold of Euclidian distance=500.

Dendrogram with threshold of 500
Dendrogram with threshold of 500 – Click to display full size image

Possible problem using threshold on hierarchical clustering data

The distance between the histograms of two small images is bounded since the number of different coefficients is bounded by the size of the images.

When considering two images of size 1×3, even if the images are different, the maximum Euclidian distance between their histograms would be (imagine a completely white and a completely black image).

If you consider two images of size 300×500, the maximum Euclidian distance between their histograms is not of the same order of magnitude.


To be able to set a uniform threshold for all the images that would determine if they are similar or not, we need to get rid of this scale factor.

The maximum Euclidian distance between two images of size WxH is . This is found by using a completely white and a completely black image.

We can also take into account the fact that two images are probably dissimilar if they have very different sizes. This is why, when rescaling the distance between two images, we intend to choose W and H as the smallest values for the two images. This way, if we compare a small image with a big image, the distance between the two is greater.