Bootstrap Evaluation of Clusters
Win-Vector Blog 2015-09-04
The goal of cluster analysis is to group the observations in the data into clusters such that every datum in a cluster is more similar to other datums in the same cluster than it is to datums in other clusters. This is an analysis method of choice when annotated training data is not readily available. In this article, based on chapter 8 of Practical Data Science with R, the authors discuss one approach to evaluating the clusters that are discovered by a chosen clustering method.
An important question when evaluating clusters is whether a given cluster is “real”-does the cluster represent actual structure in the data, or is it an artifact of the clustering algorithm? This is especially important with clustering algorithms like k-means, where the user has to specify the number of clusters a priori. It’s been our experience that clustering algorithms will often produce several clusters that represent actual structure or relationships in the data, and then one or two clusters that are buckets that represent “other” or “miscellaneous.” Clusters of “other” tend to be made up of data points that have no real relationship to each other; they just don’t fit anywhere else.
One way to assess whether a cluster represents true structure is to see if the cluster holds up under plausible variations in the dataset. The fpc
package has a function called clusterboot()
that uses bootstrap resampling to evaluate how stable a given cluster is. (For a full description of the algorithm, see Christian Henning, “Cluster-wise assessment of cluster stability,” Research Report 271, Dept. of Statistical Science, University College London, December 2006). clusterboot()
is an integrated function that both performs the clustering and evaluates the final produced clusters. It has interfaces to a number of R clustering algorithms, including both hclust
and kmeans
.
clusterboot
‘s algorithm uses the Jaccard coefficient, a similarity measure between sets. The Jaccard similarity between two sets A and B is the ratio of the number of elements in the intersection of A and B over the number of elements in the union of A and B. The basic general strategy is as follows:
- Cluster the data as usual.
- Draw a new dataset (of the same size as the original) by resampling the original dataset with replacement (meaning that some of the data points may show up more than once, and others not at all). Cluster the new dataset.
- For every cluster in the original clustering, find the most similar cluster in the new clustering (the one that gives the maximum Jaccard coefficient) and record that value. If this maximum Jaccard coefficient is less than 0.5, the original cluster is considered to be dissolved-it didn’t show up in the new clustering. A cluster that’s dissolved too often is probably not a “real” cluster.
- Repeat steps 2-3 several times.
The cluster stability of each cluster in the original clustering is the mean value of its Jaccard coefficient over all the bootstrap iterations. As a rule of thumb, clusters with a stability value less than 0.6 should be considered unstable. Values between 0.6 and 0.75 indicate that the cluster is measuring a pattern in the data, but there isn’t high certainty about which points should be clustered together. Clusters with stability values above about 0.85 can be considered highly stable (they’re likely to be real clusters).
Different clustering algorithms can give different stability values, even when the algorithms produce highly similar clusterings, so clusterboot()
is also measuring how stable the clustering algorithm is.
The protein dataset
To demonstrate clusterboot()
, we’ll use a small dataset from 1973 on protein consumption from nine different food groups in 25 countries in Europe. The original dataset can be found here. A tab-separated text file with the data can be found in this directory. The data file is called protein.txt
; additional information can be found in the file protein_README.txt.
The goal is to group the countries based on patterns in their protein consumption. The dataset is loaded into R as a data frame called protein
, as shown in the next listing.
protein <- read.table("protein.txt", sep="\t", header=TRUE)summary(protein)# Country RedMeat WhiteMeat Eggs# Albania : 1 Min. : 4.400 Min. : 1.400 Min. :0.500# Austria : 1 1st Qu.: 7.800 1st Qu.: 4.900 1st Qu.:2.700# Belgium : 1 Median : 9.500 Median : 7.800 Median :2.900# Bulgaria : 1 Mean : 9.828 Mean : 7.896 Mean :2.936# Czechoslovakia: 1 3rd Qu.:10.600 3rd Qu.:10.800 3rd Qu.:3.700# Denmark : 1 Max. :18.000 Max. :14.000 Max. :4.700# (Other) :19# Milk Fish Cereals Starch# Min. : 4.90 Min. : 0.200 Min. :18.60 Min. :0.600# 1st Qu.:11.10 1st Qu.: 2.100 1st Qu.:24.30 1st Qu.:3.100# Median :17.60 Median : 3.400 Median :28.00 Median :4.700# Mean :17.11 Mean : 4.284 Mean :32.25 Mean :4.276# 3rd Qu.:23.30 3rd Qu.: 5.800 3rd Qu.:40.10 3rd Qu.:5.700# Max. :33.70 Max. :14.200 Max. :56.70 Max. :6.500## Nuts Fr.Veg# Min. :0.700 Min. :1.400# 1st Qu.:1.500 1st Qu.:2.900# Median :2.400 Median :3.800# Mean :3.072 Mean :4.136# 3rd Qu.:4.700 3rd Qu.:4.900# Max. :7.800 Max. :7.900# Use all the columns except the first # (Country). vars.to.use <- colnames(protein)[-1] # Scale the data columns to be zero mean # and unit variance.# The output of scale() is a matricx.pmatrix <- scale(protein[,vars.to.use]) # optionally, store the centers and # standard deviations of the original data,# so you can "unscale" it later.pcenter <- attr(pmatrix, "scaled:center") pscale <- attr(pmatrix, "scaled:scale")
Before running clusterboot()
we’ll cluster the data using a hierarchical clustering algorithm (Ward’s method):
# Create the distance matrix.d <- dist(pmatrix, method="euclidean") # Do the clustering. pfit <- hclust(d, method="ward") # Plot the dendrogram.plot(pfit, labels=protein$Country)
The dendrogram suggests five clusters, as shown in the figure below. You can draw the rectangles on the dendrogram using the command rect.hclust(pfit, k=5)
.
Let’s extract and print the clusters:
# A convenience function for printing out the # countries in each cluster, along with the values # for red meat, fish, and fruit/vegetable # consumption. print_clusters <- function(labels, k) { for(i in 1:k) { print(paste("cluster", i)) print(protein[labels==i,c("Country","RedMeat","Fish","Fr.Veg")]) }}# get the cluster labelsgroups <- cutree(pfit, k=5)# --- results -- > print_clusters(groups, 5)[1] "cluster 1" Country RedMeat Fish Fr.Veg1 Albania 10.1 0.2 1.74 Bulgaria 7.8 1.2 4.218 Romania 6.2 1.0 2.825 Yugoslavia 4.4 0.6 3.2[1] "cluster 2" Country RedMeat Fish Fr.Veg2 Austria 8.9 2.1 4.33 Belgium 13.5 4.5 4.09 France 18.0 5.7 6.512 Ireland 13.9 2.2 2.914 Netherlands 9.5 2.5 3.721 Switzerland 13.1 2.3 4.922 UK 17.4 4.3 3.324 W Germany 11.4 3.4 3.8[1] "cluster 3" Country RedMeat Fish Fr.Veg5 Czechoslovakia 9.7 2.0 4.07 E Germany 8.4 5.4 3.611 Hungary 5.3 0.3 4.216 Poland 6.9 3.0 6.623 USSR 9.3 3.0 2.9[1] "cluster 4" Country RedMeat Fish Fr.Veg6 Denmark 10.6 9.9 2.48 Finland 9.5 5.8 1.415 Norway 9.4 9.7 2.720 Sweden 9.9 7.5 2.0[1] "cluster 5" Country RedMeat Fish Fr.Veg10 Greece 10.2 5.9 6.513 Italy 9.0 3.4 6.717 Portugal 6.2 14.2 7.919 Spain 7.1 7.0 7.2
There’s a certain logic to these clusters: the countries in each cluster tend to be in the same geographical region. It makes sense that countries in the same region would have similar dietary habits. You can also see that
- Cluster 2 is made of countries with higher-than-average red meat consumption.
- Cluster 4 contains countries with higher-than-average fish consumption but low produce consumption.
- Cluster 5 contains countries with high fish and produce consumption.
Let’s run clusterboot()
on the protein data, using hierarchical clustering with five clusters.
# load the fpc packagelibrary(fpc) # set the desired number of clusters kbest.p<-5 # Run clusterboot() with hclust # ('clustermethod=hclustCBI') using Ward's method # ('method="ward"') and kbest.p clusters # ('k=kbest.p'). Return the results in an object # called cboot.hclust.cboot.hclust <- clusterboot(pmatrix,clustermethod=hclustCBI, method="ward", k=kbest.p)# The results of the clustering are in # cboot.hclust$result. The output of the hclust() # function is in cboot.hclust$result$result. ## cboot.hclust$result$partition returns a # vector of clusterlabels. groups<-cboot.hclust$result$partition # -- results --> print_clusters(groups, kbest.p) [1] "cluster 1" Country RedMeat Fish Fr.Veg1 Albania 10.1 0.2 1.74 Bulgaria 7.8 1.2 4.218 Romania 6.2 1.0 2.825 Yugoslavia 4.4 0.6 3.2[1] "cluster 2" Country RedMeat Fish Fr.Veg2 Austria 8.9 2.1 4.33 Belgium 13.5 4.5 4.09 France 18.0 5.7 6.512 Ireland 13.9 2.2 2.914 Netherlands 9.5 2.5 3.721 Switzerland 13.1 2.3 4.922 UK 17.4 4.3 3.324 W Germany 11.4 3.4 3.8[1] "cluster 3" Country RedMeat Fish Fr.Veg5 Czechoslovakia 9.7 2.0 4.07 E Germany 8.4 5.4 3.611 Hungary 5.3 0.3 4.216 Poland 6.9 3.0 6.623 USSR 9.3 3.0 2.9[1] "cluster 4" Country RedMeat Fish Fr.Veg6 Denmark 10.6 9.9 2.48 Finland 9.5 5.8 1.415 Norway 9.4 9.7 2.720 Sweden 9.9 7.5 2.0[1] "cluster 5" Country RedMeat Fish Fr.Veg10 Greece 10.2 5.9 6.513 Italy 9.0 3.4 6.717 Portugal 6.2 14.2 7.919 Spain 7.1 7.0 7.2# The vector of cluster stabilities. # Values close to 1 indicate stable clusters> cboot.hclust$bootmean [1] 0.7905000 0.7990913 0.6173056 0.9312857 0.7560000# The count of how many times each cluster was # dissolved. By default clusterboot() runs 100 # bootstrap iterations. # Clusters that are dissolved often are unstable. > cboot.hclust$bootbrd [1] 25 11 47 8 35
The above results show that the cluster of countries with high fish consumption (cluster 4) is highly stable (cluster stability = 0.93). Clusters 1 and 2 are also quite stable; cluster 5 (cluster stability 0.76) less so. Cluster 3 (cluster stability 0.62) has the characteristics of what we’ve been calling the “other” cluster. Note that clusterboot()
has a random component, so the exact stability values and number of times each cluster is dissolved will vary from run to run.
Based on these results, we can say that the countries in cluster 4 have highly similar eating habits, distinct from those of the other countries (high fish and red meat consumption, with a relatively low amount of fruits and vegetables); we can also say that the countries in clusters 1 and 2 represent distinct eating patterns as well. The countries in cluster 3, on the other hand, show eating patterns that are different from those of the countries in other clusters, but aren’t as strongly similar to each other.
The clusterboot()
algorithm assumes that you have the number of clusters, k. Obviously, determining k will be harder for datasets that are larger than our protein example. There are ways of estimating k, but they are beyond the scope of this article. Once you have an idea of the number of clusters, however, clusterboot()
is a useful tool for evaluating the strength of the patterns that you have discovered.
For more about clustering, please refer to our free sample chapter 8 of Practical Data Science with R.