Comparison of kPCA and PCA

Let's compare PCA and kPCA on the digits data from the textbook.

Let's verify that we can move between the covariance space $X'X$ and the neighborhood space $XX'$ to obtain low-rank representations of the data.

Yes, we can get the low-rank representation from either the covariance between features or the inner products between observations (with rescaling and projections).
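Below is a minimal sketch of this duality, assuming the digits sit in a numeric $n \times p$ matrix `X` (the variable name is an assumption). The link is the SVD $X = UDV'$: then $X'X = VD^2V'$, $XX' = UD^2U'$, and the scores are $XV = UD$, so the same low-rank representation can be read off either decomposition.

```r
# Sketch: the same 2-d scores from the covariance space X'X and the Gram space XX'.
# Assumes the digits are in a numeric matrix X (variable name is an assumption).
Xc <- scale(X, center = TRUE, scale = FALSE)      # column-centered data

cov_eig  <- eigen(t(Xc) %*% Xc)                   # p x p: eigenvectors = loadings V
gram_eig <- eigen(Xc %*% t(Xc))                   # n x n: eigenvectors = U
d <- sqrt(pmax(gram_eig$values, 0))               # singular values of Xc

scores_from_cov  <- Xc %*% cov_eig$vectors[, 1:2]             # project onto loadings
scores_from_gram <- gram_eig$vectors[, 1:2] %*% diag(d[1:2])  # rescale Gram eigenvectors

# The two agree up to sign flips of the columns:
head(round(abs(scores_from_cov) - abs(scores_from_gram), 6))
```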

Let's look at just the 0s and see what PCA picks up.
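A sketch of what such a PCA could look like, assuming the pixel matrix is called `X` and the digit class of each row is in `labels` (both names are assumptions), with 16 x 16 pixel images as in the textbook zip-code data.

```r
# Ordinary PCA on the 0 digits only (X, labels and the 16 x 16 image size are assumptions)
X0 <- X[labels == 0, ]
pca0 <- prcomp(X0, center = TRUE, scale. = FALSE)

# Scores on the first two principal components
plot(pca0$x[, 1], pca0$x[, 2], pch = 16,
     xlab = "PC1", ylab = "PC2", main = "PCA of the 0 digits")

# Look at the loadings as images to see what each component captures
image(matrix(pca0$rotation[, 1], 16, 16), main = "PC1 loading")
image(matrix(pca0$rotation[, 2], 16, 16), main = "PC2 loading")
```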

PC1 and PC2 pick up different aspects of the 0 digits.

Run this a couple of times.

Let's try kernel PCA and see how that separates the data.
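One way to do this is with `kpca()` from the kernlab package (using kernlab, an RBF kernel and this particular sigma are assumptions, not necessarily the choices made in class).

```r
# Kernel PCA on the 0 digits with an RBF kernel (package and sigma value are assumptions)
library(kernlab)

kpc0 <- kpca(X0, kernel = "rbfdot", kpar = list(sigma = 0.01), features = 2)

# rotated() returns the data projected onto the kernel principal components
Z0 <- rotated(kpc0)
plot(Z0[, 1], Z0[, 2], pch = 16,
     xlab = "kPC1", ylab = "kPC2", main = "kPCA of the 0 digits")
```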

Let's now apply kPCA to the full digits data set.

Notice how your kernel and bandwidth choices change what is captured by the first kPCA components! It really depends on what you decide counts as similar or not - see how the zero or one digits are either "compressed" or "blown up".
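A quick way to see this is to redo the embedding for a few bandwidths; the sigma values below are arbitrary illustrations, and `labels` is again an assumed name for the digit classes (0-9).

```r
# Re-fit the RBF kernel PCA for a few bandwidths and compare the embeddings
library(kernlab)

par(mfrow = c(1, 3))
for (s in c(0.001, 0.01, 0.1)) {
  kpc <- kpca(X, kernel = "rbfdot", kpar = list(sigma = s), features = 2)
  Z <- rotated(kpc)
  plot(Z[, 1], Z[, 2], col = labels + 1, pch = 16,   # assumes numeric labels 0-9
       main = paste("rbfdot kPCA, sigma =", s))
}
par(mfrow = c(1, 1))
```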

You can look at more kPCA components of course.

Try this on some of the other data sets from class and see what happens!

Multi-dimensional Scaling

Self-organizing maps were a way to summarize the data in a low-dimensional space by creating a grid representation where each grid point was represented by a prototype (centroid), and neighboring grid prototypes were iteratively moved toward each other so that the grid captures some global structure in the data.

MDS (multi-dimensional scaling) also tries to represent the data in a lower-dimensional space, but not through an artificial grid. Here, we instead try to ensure that the distances in the original space are captured by the distances in the low-dimensional space.
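A classical MDS sketch with `cmdscale()` from base R; `X` and `labels` are the same assumed names as above. With Euclidean distances, classical MDS reproduces the PCA scores, which makes it a nice sanity check.

```r
# Classical (metric) MDS on Euclidean distances
D <- dist(X)                 # n x n Euclidean distance matrix
mds <- cmdscale(D, k = 2)    # 2-dimensional configuration

plot(mds[, 1], mds[, 2], col = labels + 1, pch = 16,
     xlab = "MDS 1", ylab = "MDS 2", main = "Classical MDS of the digits")
```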

Notice how we can feed MDS a distance matrix. Why not be a bit more adventurous then?
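For example, we could hand it Manhattan distances instead of Euclidean ones (this particular choice is just an illustration).

```r
# Same MDS machinery, different notion of distance
D_man <- dist(X, method = "manhattan")
mds_man <- cmdscale(D_man, k = 2)

plot(mds_man[, 1], mds_man[, 2], col = labels + 1, pch = 16,
     xlab = "MDS 1", ylab = "MDS 2", main = "MDS on Manhattan distances")
```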

ISOMAP is based on the idea of turning distance matrices into graphs by only keeping the smallest distances, i.e. the nearest neighbors of each observation. We compute the kNN graph and then take the distance between objects to be the path length between them on the graph. Depending on our choice of k we can be very local or more conservative in what we consider "close".
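A sketch using `isomap()` from the vegan package (the package choice and k = 10 are assumptions).

```r
# ISOMAP: kNN graph on Euclidean distances, then MDS on the graph (shortest-path) distances
library(vegan)

iso <- isomap(dist(X), ndim = 2, k = 10)   # k controls how local "close" is

plot(iso$points[, 1], iso$points[, 2], col = labels + 1, pch = 16,
     xlab = "ISOMAP 1", ylab = "ISOMAP 2", main = "ISOMAP of the digits (k = 10)")
```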

tSNE

tSNE is a very popular method where local distances are also key to representing the data.
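A sketch with the Rtsne package (the package choice and perplexity = 30 are assumptions); the perplexity plays a similar role to k above in deciding how local the neighborhoods are.

```r
# tSNE embedding of the digits (Rtsne and the perplexity value are assumptions)
library(Rtsne)

set.seed(1)                                            # tSNE is stochastic
ts <- Rtsne(X, dims = 2, perplexity = 30, check_duplicates = FALSE)

plot(ts$Y[, 1], ts$Y[, 2], col = labels + 1, pch = 16,
     xlab = "tSNE 1", ylab = "tSNE 2", main = "tSNE of the digits")
```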

Spectral clustering

From the lecture you saw that you could use a spectral decomposition of the similarity graphs to cluster data. Let's try this on the digits data and use the corresponding kPCA for visualization.
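As a quick sketch before the full implementation below, kernlab also ships a spectral clustering routine, `specc()` (using it here is an assumption; the function pasted further down is the one discussed on rpubs).

```r
# Spectral clustering with kernlab::specc, visualized in the kPCA coordinates
library(kernlab)

sc <- specc(X, centers = 10, kernel = "rbfdot", kpar = list(sigma = 0.01))

# Compare the clusters with the true digit labels
table(cluster = sc@.Data, digit = labels)

# Plot the clusters in the first two kernel principal components
kpc <- kpca(X, kernel = "rbfdot", kpar = list(sigma = 0.01), features = 2)
plot(rotated(kpc), col = sc@.Data, pch = 16,
     xlab = "kPC1", ylab = "kPC2", main = "Spectral clusters in kPCA space")
```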

An alternative implementation in R can be found here: https://rpubs.com/nurakawa/spectral-clustering?msclkid=6ebd0ffccee211ecbc0e25ee7ddf2f70 The main function is pasted in the next cell.

You can easily go in and change this function to work with other distance metrics in lines 5-26 (the kNN graph construction).

Summary

Now, the methods we have talked about here are mainly for data exploration. It is sometimes not so straightforward to apply them to new data points (the out-of-sample problem). For MDS there are formulas for how to project new data onto the lower-dimensional space, and for ISOMAP similar approaches have been derived. The key is that you don't want the test data to require a re-computation of the neighborhood graph for the training data. Here is a well-cited paper that discusses some of these out-of-sample predictions: https://www.iro.umontreal.ca/~lisa/pointeurs/tr1238.pdf?msclkid=fd6aa8d2ced611ec9cc4e01568c6c722 In R there is the following setup for MDS/ISOMAP: https://recipes.tidymodels.org/reference/step_isomap.html In Python, ISOMAP predictions are implemented through the transform() method of sklearn.manifold.Isomap: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.Isomap.html?msclkid=e214b7edcedc11ec86522141b973c11d
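As a sketch of the recipes route, something along these lines should project new observations onto an ISOMAP embedding learned from the training data (the data frame names `train_df`/`test_df`, the outcome `digit`, and the parameter values are all assumptions; step_isomap relies on the dimRed package under the hood).

```r
# Out-of-sample ISOMAP with recipes::step_isomap (names and settings are assumptions)
library(recipes)

iso_rec <- recipe(digit ~ ., data = train_df) |>
  step_isomap(all_numeric_predictors(), num_terms = 2, neighbors = 10) |>
  prep()

train_embedded <- bake(iso_rec, new_data = NULL)      # embedding of the training data
test_embedded  <- bake(iso_rec, new_data = test_df)   # new points projected, no re-fit of the graph
```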

For tSNE this is even more complicated. The author of the original tSNE method discusses some options here: https://lvdmaaten.github.io/tsne/ There is ongoing research toward making tSNE bridge local and global structure - check out Google Scholar - lots of activity!