Demo on Clustering

Cluster prediction strength

First I will implement a simplified version of cluster prediction strength just to illustrate the logic behind it.
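A minimal sketch of what such an illustration could look like, assuming the iris data; the split, the number of clusters, and the plotting choices below are my own, not necessarily the demo's.

```r
## Sketch: split the data in two, cluster each half separately, then overlay
## the two clusterings in one plot (assumed choices, see lead-in above).
set.seed(1)
X   <- scale(iris[, 1:4])
idx <- sample(nrow(X), nrow(X) / 2)
A   <- X[idx, ]
B   <- X[-idx, ]

k   <- 2
kmA <- kmeans(A, centers = k, nstart = 10)   # cluster each half separately
kmB <- kmeans(B, centers = k, nstart = 10)

## Data set A in color (its own clusters), data set B overlaid as grey symbols
## (its own clusters). If the two clusterings agree, each color region should
## contain only one symbol type.
plot(A[, 1], A[, 2], col = kmA$cluster, pch = 16,
     xlab = "Sepal.Length (scaled)", ylab = "Sepal.Width (scaled)")
points(B[, 1], B[, 2], col = "grey40", pch = kmB$cluster)
```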

Compare the cluster overlap between data set A (color) and data set B (grey symbols). For which number of clusters do colors and symbols appear "pure" - i.e., not a mix of colors within a symbol or a mix of symbols within a color? For this data set the cluster overlap is maximized with 2 clusters.

Next I manipulate the iris data to make the Virginica data more separated from the rest. This is to illustrate that with stronger separation you might find 3 clusters.

Next I split the first cluster into two with strong separation to see if I can detect 4 clusters.

Instead of doing this from scratch, let's use existing packages.
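One option (an assumption on my part, not necessarily the demo's exact choice) is prediction.strength() from the fpc package:

```r
## Hedged sketch: prediction strength for 2 to 6 clusters on the iris data.
library(fpc)
set.seed(1)
ps <- prediction.strength(scale(iris[, 1:4]), Gmin = 2, Gmax = 6,
                          M = 50, clustermethod = kmeansCBI)
ps$mean.pred   # average prediction strength for each number of clusters
```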

Clearly, the overlap is best for 2 clusters.

Now, try this for the data set with more separation between 3 clusters and an added 4th from splitting Virginica.

Your results here may vary a bit from sample to sample - run it a couple of times to see.

Next, try this with a different clustering and classification method (here hierarchical clustering and knn).
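A manual sketch of the same idea with hierarchical clustering and knn, shown on the unmodified iris data for brevity; swap in the manipulated data to look for 4 clusters. The split and tuning choices are my own.

```r
## Sketch: cluster each half with hierarchical clustering, then carry the
## labels from one half to the other with knn and compare.
library(class)
set.seed(2)
X   <- scale(iris[, 1:4])
idx <- sample(nrow(X), nrow(X) / 2)
A   <- X[idx, ]
B   <- X[-idx, ]

hcA <- cutree(hclust(dist(A), method = "complete"), k = 3)
hcB <- cutree(hclust(dist(B), method = "complete"), k = 3)

## Carry the A labels over to B with knn and compare to B's own clustering
labB_fromA <- knn(train = A, test = B, cl = factor(hcA), k = 5)
table(carried_over = labB_fromA, own_clustering = hcB)
```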

With this set of methods, it seems clearer that 4 clusters are detectable.

Comparing clusters

A key component to cluster selection and validation is having metrics to compare clusters. One popular metric to compare 2 cluster results is the \emph{Rand Index} defined as

$$RI = \frac{a+b}{a+b+c+d}$$

where $a$ is the number of pairs that are together in both cluster results and $b$ is the number of pairs that are apart in both cluster results. $c$ and $d$ are the numbers of pairs that are together in the first cluster result but not in the other, and vice versa. A high Rand Index means that the clusterings overlap a lot. The Adjusted RI (or corrected RI) adjusts for the amount of overlap that could be expected just by chance.

A second commonly used metric is the \emph{Jaccard Index}, defined as the ratio between the intersection and the union of two sets. Here, one can use it at the cluster level to measure how well a cluster allocation agrees between two cluster results.

Another metric, used below, is the \emph{Variation of Information}. Two clustering methods result in two different partitions of the data, $A_1,\cdots,A_K$ and $B_1,\cdots, B_J$. For partition $A$ you can compute the cluster proportions $a_k, k=1,\cdots,K$, and similarly for $B$ we get $b_j, j=1,\cdots,J$. Let $o_{kj}$ be the proportion of observations that fall in both $A_k$ and $B_j$ (i.e., $|A_k \cap B_j|/n$) for all pairs of clusters between the two partitions. The Variation of Information is then defined as

$$ VI = - \sum_{k,j} o_{kj} \left(\log\frac{o_{kj}}{a_{k}} + \log\frac{o_{kj}}{b_{j}}\right) $$

which is a measure of distance between the clusterings. A small value indicates strong agreement between the clusterings.
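A hedged sketch of computing these indices for two clusterings of the iris data; cluster.stats() from the fpc package reports the corrected (adjusted) Rand index and the variation of information when you hand it an alternative clustering.

```r
## Sketch: compare a kmeans and a hierarchical clustering with the indices above.
library(fpc)
X  <- scale(iris[, 1:4])
km <- kmeans(X, centers = 3, nstart = 10)$cluster
hc <- cutree(hclust(dist(X), method = "complete"), k = 3)

cs <- cluster.stats(d = dist(X), clustering = km, alt.clustering = hc)
c(adjusted_rand = cs$corrected.rand, variation_of_information = cs$vi)
```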

The kmeans and hierarchical cluster results with 3 clusters do not overlap as you can see from the indices and the scatter plots. Try with more methods, different distances and see what happens.

What about 2 clusters?

Now we have perfect agreement between kmeans and hierarchical clustering.

Cluster stability

Let's try resampling to assess how much clusters may vary for a random subset of observations. Are clusters stable?
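A hedged sketch using clusterboot() from the fpc package (whether this is the demo's exact tool is my assumption): it resamples the observations, re-clusters, and records the clusterwise Jaccard agreement with the clustering of the full data.

```r
## Sketch: bootstrap stability of a 3-cluster kmeans solution on iris.
library(fpc)
cb <- clusterboot(scale(iris[, 1:4]), B = 50, bootmethod = "boot",
                  clustermethod = kmeansCBI, krange = 3, seed = 3)
cb$bootmean   # mean Jaccard similarity per cluster; values near 1 = stable
```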

For the iris data, bootstrap results are not stable beyond 2 clusters. Try this on the iris data where you created greater separation between the clusters.

Let's run this on more bootstrap samples.

Looks pretty stable for 2 to 3 clusters with kmeans, but not with 4.

What about hierarchical clustering?

The clusters are fairly stable up to 4 clusters with hierarchical clustering with complete linkage. Remember, though, how those clusters looked: stable, but with limited overlap with the kmeans clusters. Choice of method matters.

Consensus Clustering

Let's explore the cancer data I posted on canvas. There are 6 different cancer types, 2887 samples in total, each sample measured on 20000+ genes.
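A minimal sketch of a PCA view of the data; `cancerX` (samples in rows, genes in columns) and the class-label factor `cancerClass` are placeholder names, not the demo's actual objects.

```r
## Sketch: leading principal components, colored by the known cancer classes
## (placeholder object names, see lead-in above).
pc <- prcomp(cancerX, center = TRUE, scale. = FALSE)
plot(pc$x[, 1], pc$x[, 2], col = as.numeric(cancerClass), pch = 16,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(cancerClass),
       col = seq_along(levels(cancerClass)), pch = 16, cex = 0.7)
```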

You can see the cancer classes fairly well separated in the leading eigencomponents. However, there is quite some imbalance between the classes here, with many more breast cancer samples and very few uterine cases. Also, the uterine samples seem to be positioned between the other classes.
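A sketch of the kmeans step, reusing `pc` and `cancerClass` from the sketch above:

```r
## Sketch: kmeans on the 10 leading principal components, compared to the classes.
set.seed(4)
km6 <- kmeans(pc$x[, 1:10], centers = 6, nstart = 25)
table(cluster = km6$cluster, class = cancerClass)
```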

Kmeans on the 10 leading components splits the breast cancer data into 2 clusters while combining glioblastoma and kidney cancer. As expected, we don't retrieve all 6 classes when we cluster.

Does it work if we use partitioning around medoids (PAM)?
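A hedged sketch with pam() from the cluster package on the same 10 leading components (placeholder object names as before):

```r
## Sketch: PAM with 6 clusters, silhouette plot, and confusion with the classes.
library(cluster)
pam6 <- pam(pc$x[, 1:10], k = 6)
plot(silhouette(pam6))                           # silhouette widths per cluster
table(cluster = pam6$clustering, class = cancerClass)
```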

While the clusters look good in the silhouette width plot, the confusion matrix shows that they don't overlap well with the cancer classes - breast cancer is split into two groups. PAM was, however, better at separating glioblastoma and kidney cancer.

Let's try a hacky version of consensus clustering from scratch.
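A sketch of what such a from-scratch scheme could look like (my own reading of the steps described next; the consensus matrix is named DD to mirror the text, and all tuning choices are assumptions):

```r
## Sketch: resample observations, re-cluster, count co-clustering of pairs,
## and feed 1 - co-clustering frequency into hierarchical clustering.
set.seed(5)
n     <- nrow(pc$x)
B     <- 50
DD    <- matrix(0, n, n)    # counts of how often a pair co-clusters
count <- matrix(0, n, n)    # counts of how often a pair is sampled together

for (b in 1:B) {
  sub  <- sample(n, floor(0.8 * n))                      # resample observations
  cl   <- kmeans(pc$x[sub, 1:10], centers = 6, nstart = 5)$cluster
  same <- outer(cl, cl, "==")                            # TRUE if a pair co-clusters
  DD[sub, sub]    <- DD[sub, sub] + same
  count[sub, sub] <- count[sub, sub] + 1
}

CC   <- DD / pmax(count, 1)                              # co-clustering frequency in [0, 1]
hcCC <- hclust(as.dist(1 - CC), method = "average")      # 1 - frequency as distance
plot(hcCC, labels = FALSE)
```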

I resample 50 times. You can try fewer or more resamples and see what happens. I now have a consensus matrix, DD, that tabulates how often pairs of observations co-cluster across the different runs. I use this as a new distance metric for hierarchical clustering.

From the dendrogram you can see that some samples form tighter clusters than others (they stay together more frequently), while others are scattered between the clusters. You see clear, stable subclusters in the breast cancer (black) class. Try this with different clustering methods, different numbers of random features or observation subsets and see what happens.

We see, as before, that it's hard to retrieve all the classes, and the clustering method favors splitting breast cancer into multiple clusters. Increasing to 7 clusters finds all the classes except the small uterine cancer, while splitting breast cancer into two subgroups. The consensus matrix depicts how stable the clusters are. You see that there are 4 fairly stable clusters in the bottom left corner of the consensus matrix that are well separated. Of the 3 at the top right corner, one looks stable, but the larger ones are clearly not as stable. These are the breast cancer class mixed with a few other samples, and the uterine class.

Feature filtering

Let's sort the features by their variance and pick some of the top and bottom ones and see what their distribution looks like.
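A sketch of the variance sort (placeholder object names as before):

```r
## Sketch: order the genes by variance and compare the distribution of a
## high- and a low-variance feature.
v  <- apply(cancerX, 2, var)
Xs <- cancerX[, order(v, decreasing = TRUE)]    # columns sorted by decreasing variance

par(mfrow = c(1, 2))
hist(Xs[, 530],   breaks = 50, main = "Feature 530 (high variance)",  xlab = "expression")
hist(Xs[, 18500], breaks = 50, main = "Feature 18500 (low variance)", xlab = "expression")
par(mfrow = c(1, 1))
```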

What you see is that for most observations a feature may have 0 or near-0 values. This means that a gene may not be expressed for some of the cancer classes but highly expressed for others. If a feature looks multimodal, this might indicate that it contains cluster information. Feature 530 looks nicely bimodal whereas feature 18500 looks unimodal. Remember, the features are sorted by decreasing variance.

Let's use the leading 250 features and perform clustering. Here, I use a bi-clustering approach, so I also cluster the features. The distance metric is Euclidean.
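A sketch of the bi-clustering heatmap (placeholder names as before; the color scale and linkage choices are my assumptions):

```r
## Sketch: rows (samples) and columns (genes) both clustered hierarchically
## with Euclidean distance; row-side colors mark the true classes.
top250  <- Xs[, 1:250]
rowcols <- palette()[as.numeric(cancerClass)]
heatmap(as.matrix(top250),
        distfun   = dist,                                   # Euclidean distance
        hclustfun = function(d) hclust(d, method = "complete"),
        RowSideColors = rowcols,
        col = colorRampPalette(c("red", "white", "blue"))(50))
```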

You see a nice patchwork here. This is the sorted data matrix: rows (observations) ordered by hierarchical clustering with Euclidean distance, and columns ordered by clustering the features. The color labels on the left are the true classes. You see a clear overlap between some clusters and classes. But the black class, breast cancer, is clearly split into two groups. You see that there are clusters of genes as well, groups that are either high (blue) or low (red) for a subset of cancers in different patterns. This is a very dense and informative summary of the data. However, always remember that the results are driven by your choices of clustering metric and method.

What if we try a different distance metric? Below, I use correlation as a similarity metric instead.
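A sketch of the same heatmap with a correlation-based dissimilarity, reusing the objects from the sketch above:

```r
## Sketch: 1 - correlation as the dissimilarity for both rows and columns.
cordist <- function(x) as.dist(1 - cor(t(x)))
heatmap(as.matrix(top250),
        distfun   = cordist,
        hclustfun = function(d) hclust(d, method = "complete"),
        RowSideColors = rowcols,
        col = colorRampPalette(c("red", "white", "blue"))(50))
```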

Notice how the results changed. Now, breast cancer is actually forming a slightly tighter group.

Let's investigate how feature selection may impact clustering (also a topic for project 2).

First I use the top 2000 varying genes and then use the leading eigenvectors from a PCA on those 2000 features.
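A sketch of this feature selection plus PCA step (placeholder names as before):

```r
## Sketch: PCA on the 2000 top-varying genes, then the two leading components
## colored by class.
top2000 <- Xs[, 1:2000]
pc2000  <- prcomp(top2000, center = TRUE)
plot(pc2000$x[, 1], pc2000$x[, 2], col = as.numeric(cancerClass), pch = 16,
     xlab = "PC1", ylab = "PC2")
```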

This gives us a good separation for some of the larger classes but clearly also some overlap. If you plot this without the colors you see what I mean.

Now it's not so easy to see how many clusters there are or where they are separated.

Let's now only use the top-varying features and see what kind of cluster separation we get then.

Clearly, the top-varying genes mostly separate breast and lung cancer from the rest. It appears that lung cancer is well separated in the subspace spanned by the 1st and 3rd most varying genes, whereas breast cancer is separated along the 2nd most varying gene.

If you look at the next few features, it is clear that some cancers dominate if you use variance as the feature selection criterion.

Let's use the ConsensusClusterPlus package and investigate cluster stability and selection.
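A hedged sketch of a ConsensusClusterPlus call; the specific settings below are my assumptions, not necessarily those used in the demo.

```r
## Sketch: the package expects features in rows and samples in columns, so the
## (placeholder) matrix is transposed.
library(ConsensusClusterPlus)
d  <- t(as.matrix(top2000))
cc <- ConsensusClusterPlus(d, maxK = 8, reps = 50, pItem = 0.8, pFeature = 1,
                           clusterAlg = "km", distance = "euclidean",
                           seed = 6, plot = "png", title = "cancer_consensus")
## cc[[k]]$consensusMatrix and cc[[k]]$consensusClass hold the results for k clusters,
## e.g. table(cc[[6]]$consensusClass, cancerClass) compares them to the classes.
```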

ConsensusClusterPlus outputs the consensus matrices for your selected range of numbers of clusters. It also provides the empirical CDF of the consensus matrices, which helps you assess what a reasonable number of clusters may be. See more on this under Project 2.

Let's have a look at one of the outputs. Do the consensus clusters overlap with the classes?

You see that by increasing the number of clusters we don't retrieve stable clusters for all the classes.

Let's try a different clustering method: correlation-based similarity and hierarchical clustering.

With 5 clusters you almost retrieve the classes. Try this with different methods, different sampling of observations and different features.

Graphical Lasso - Networks and clusters

We have talked about the inverse covariance matrix in class, mainly in terms of how it's an essential component of DA classification since the inverse appears in our decision rules. The problem with this was that for large numbers of features, strong correlations between features, etc., the inverse may become numerically unstable. We "fixed" this by regularizing, for example by adding something to the diagonal of the covariance before taking the inverse. This constitutes a form of shrinkage, mainly along the directions of small variation.

In class I briefly discussed another variant of regularized variance estimation called graphical lasso. This results in a sparse estimate of the inverse covariance where only features that are strongly, directly correlated have a non-zero entry in the matrix (see lecture notes).

The output from graphical lasso is a so-called adjacency matrix with non-zero entries between objects/features that are directly dependent and zero otherwise. The more heavily we regularize the inverse, the fewer direct connections are estimated as non-zero. This controls the sparsity, i.e., the number of direct links between observations/features.

Let's try this on the digit data set.
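A hedged sketch; `digitsX` (one digit image per row) and `digitsY` (its labels) are placeholder names for the digit data, and the subset size and penalty grid are my own choices.

```r
## Sketch: to get a network between the observations, run glasso on their
## correlation matrix over a grid of penalty values rho (larger rho = sparser).
library(glasso)
set.seed(8)
sub  <- sample(nrow(digitsX), 60)          # a small subset keeps the graph readable
S    <- cor(t(digitsX[sub, ]))             # observation-by-observation correlations
rhos <- c(0.1, 0.25, 0.5, 0.75)

gl  <- lapply(rhos, function(r) glasso(S, rho = r))
## Adjacency matrices: non-zero off-diagonal entries of the estimated inverse
adj <- lapply(gl, function(g) (abs(g$wi) > 1e-8) * 1 - diag(1, nrow(S)))
sapply(adj, sum) / 2                       # number of edges at each sparsity level
```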

I prefer to use other packages to actually depict the resulting networks or graphs.

With the network package I can add labels, arrows, etc. to my graph for richer visualizations.
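A sketch with the network package, reusing the adjacency list and labels from the glasso sketch above:

```r
## Sketch: plot one of the estimated graphs, colored and labeled by digit class.
library(network)
net <- network(adj[[2]], directed = FALSE)
plot(net, vertex.col = as.numeric(factor(digitsY[sub])),
     displaylabels = TRUE, label = as.character(digitsY[sub]), label.cex = 0.6)
```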

Change the number in the double square brackets to look at networks of a different sparsity.

Notice how this method does a good job at separating the digits from one another while also illustrating that some individual numbers may be close to a digit from a different class.

We can apply this to the cancer data as well.

Again, you can see a clear separation of the cancer classes but also an indication that breast cancer may be split into two groups. Try looking at different levels of sparsity, with different numbers of PC components as input, etc.

You can apply this method to the features (genes) instead and construct a network of features. For the gene data application this is actually a very common technique. We are looking for gene sub-networks, or pathways, that form a functional component in the cell.
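A sketch of a small gene network, reusing the placeholder objects from before:

```r
## Sketch: glasso on the correlations of a small set of top-varying genes,
## then a network plot of the resulting adjacency matrix.
Sg   <- cor(as.matrix(top2000[, 1:100]))   # gene-by-gene correlations
glg  <- glasso(Sg, rho = 0.4)
adjg <- (abs(glg$wi) > 1e-8) * 1 - diag(1, nrow(Sg))
netg <- network(adjg, directed = FALSE)
plot(netg, vertex.cex = 0.7)
```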

You see some subnetworks of different sizes and complexity. To get clusters (or modules) from a graph you can use various graph cutting algorithms. We will talk more about this and sparse modeling after the Easter break.