Clustering is one of the workhorses of multivariate data analysis. It seeks to group multivariate observations that are “close” — I use quotes because in higher dimensions there can be distance Jim, but not as we know it.
(For the record, this is a symmetric 3000x3000 matrix of non-negative values which, itself, can be regarded as a distance matrix, i.e., element i,j can be thought of as the distance between observations i and j. But for the purposes of this post you can just see this as an arbitrary matrix.)
Microarray buffs will be familiar with this representation of a distance matrix as a clustered heatmap as generated by R’s heatmap.2() function.
I used a different approach because I wanted more control over the colour scheme—my matrix is dominated by small values so I used the following R to assign heatmap colours by quantiles:
Anyhoo… I wanted to rearrarange the rows (and columns) of this matrix so that similar rows (and columns) were adjacent.
The classic approach to this is to cluster the rows (columns) of the matrix using Euclidean distance, i.e., if we write rows 1 and 2 as
x1 = x11 x12 ... x1j ... x1 3000
x2 = x21 x22 ... x2j ... x2 3000
then dE2(x1, x2) = sum from j=1 to 3000 of (x1j - x2j)2.
In 2- and 3-D, this is the familiar measure of physical distance, but that's not the only distance metric around. As a result of discussions with Vera Pawlowsky-Glahn, Juan Jose Egozcue and Raimon Tolosano Delgado, I started experimenting with Minkowski distance as a basis for clustering the matrix.
Here's what I got for some different values of p, noting that p=2 is equivalent to Euclidean distance.
My impression was that the clusters were more distinct (i.e., the heatmaps were "blockier") with values of p other than 2.
(OK, so this is an impression - I don't have a formal measure of clustering goodness.)
This got me searching around until I came across Francois, D., Wertz, V. & Verleysen, M. The Concentration of Fractional Distances. IEEE Transactions on Knowledge and Data Engineering 19, 873 –886 (2007) DOI: 10.1109/TKDE.2007.1037 whose abstract begins:
Nearest neighbor search and many other numerical data analysis tools most often rely on the use of the euclidean distance. When data are high dimensional, however, the euclidean distances seem to concentrate; all distances between pairs of data elements seem to be very similar. Therefore, the relevance of the euclidean distance has been questioned in the past, and fractional norms (Minkowski-like norms with an exponent less than one) were introduced to fight the concentration phenomenon.
Alas, the magic "p" value is problem dependent. However, this is the first that I have read about the use of alternative metrics for clustering high-dimensional data and it suggests to me that if your clustered heatmap is looking more like a tartan rug than a crisp block-diagonal, it might be worth exploring some different Minkowski distance matrices before giving up and going down the pub.
Speaking of which, the title of this blog post comes from a colleague, mentor and exponent of real ale. Glenn Stone once quipped that "clustering is neither an art nor a science... it is a sport."