Demo on SVD, NMF, and SOMs

I will use the digits data and the heart disease data to illustrate the use of SVD, sparse SVD, NMF, and SOMs (self-organizing maps) for data representation and data exploration.

The first PCs do a pretty good job separating 0s and 1s from the rest.
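Here is a minimal sketch of that projection; the names `X` (one flattened image per row) and `y` (digit labels 0-9) are placeholders for however you have loaded the digits.

```r
## Assumes X: matrix of flattened digit images (one per row) and
## y: numeric digit labels 0-9 -- placeholders for your loaded data.
X_c <- scale(X, center = TRUE, scale = FALSE)   # center the pixels
s   <- svd(X_c)
Z   <- s$u %*% diag(s$d)                        # principal components Z = UD
plot(Z[, 1], Z[, 2], col = y + 1, pch = as.character(y),
     xlab = "PC 1", ylab = "PC 2",
     main = "Digits projected onto the first two PCs")
```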

Let's examine how well SVD approximates the digits with few components.
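A sketch of a rank-$k$ reconstruction, reusing `X_c` and `s` from above; the 16x16 raster layout (and hence the image orientation) is an assumption about how the digits are stored.

```r
k    <- 2
Xhat <- s$u[, 1:k] %*% diag(s$d[1:k], k) %*% t(s$v[, 1:k])  # rank-k approximation

## Helper to draw one flattened digit as a 16x16 image
## (the flip/transpose may need adjusting to your raster convention)
show_digit <- function(v) image(t(matrix(v, 16, 16)[16:1, ]),
                                col = grey(seq(0, 1, length.out = 256)),
                                axes = FALSE)
op <- par(mfrow = c(1, 2))
show_digit(X_c[1, ])   # original (centered) digit
show_digit(Xhat[1, ])  # rank-k reconstruction
par(op)
```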

With only two components you can still discern the digits. Try with more components to see how much that improves the reconstruction.

Sparse SVD

We can make SVD/PCA more interpretable by generating sparse loadings - that is, by asking which small subset of features can generate directions that capture the variation in the data. A couple of different variants of sparse SVD have been proposed.

We want to find a sparse SVD. For now, assume we have $X=UDV'$; call the principal components $Z=UD$ and the loadings $V$. Let's look at the $i$th PC, $Z_i = U_i D_{ii}$.

Ridge penalty: $$\min_{\beta} ||Z_i - X\beta||^2 + \lambda ||\beta||^2$$

Solve the ridge problem (using $X' = VDU'$ and $U'U_i = e_i$, the $i$th unit vector): $$\beta_r = (X'X+\lambda I)^{-1} X'Z_i = V(D^2+\lambda I)^{-1}V'\,VDU'\,U_i D_{ii} = V(D^2+\lambda I)^{-1} D\, e_i\, D_{ii} = V_i \frac{D_{ii}^2}{D_{ii}^2+\lambda},$$ which means that $V_i = \beta_r/||\beta_r||$: ridge regression of a PC on $X$ recovers the corresponding loading vector up to scale.
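We can verify the identity numerically on a small simulated matrix:

```r
set.seed(1)
Xsim <- matrix(rnorm(50 * 8), 50, 8)
s2   <- svd(Xsim)
i    <- 1; lambda <- 2
Zi   <- s2$u[, i] * s2$d[i]                     # i-th principal component
br   <- solve(t(Xsim) %*% Xsim + lambda * diag(8),
              t(Xsim) %*% Zi)                   # ridge solution
## beta_r should equal V_i * d_i^2 / (d_i^2 + lambda) ...
all.equal(as.numeric(br), s2$v[, i] * s2$d[i]^2 / (s2$d[i]^2 + lambda))
## ... so normalizing recovers the loading vector
all.equal(as.numeric(br / sqrt(sum(br^2))), s2$v[, i])
```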

Recall the elastic net formulation $$||Y-X\beta||^2 + (1-\alpha)\lambda ||\beta||^2 + \alpha \lambda ||\beta||_1.$$ Adding the L1 penalty to the ridge problem above yields sparse loadings. Of course, in this formulation we already need the SVD to form $Z_i$, so in practice this becomes an iterative method: alternate between solving for the sparse loadings and recomputing the components.

Alternatively, write the whole problem as $$\min_{A,B} \sum_{i=1}^N ||x_i - AB'x_i||^2 \quad \text{subject to } A'A=I,$$ with an elastic-net penalty on the columns of $B$; without the penalty, the columns of $B$ are proportional to the loadings $V$. We can solve this by alternating over $A$ and $B$.

For fixed $A$, solving for $B$ is just a bunch of independent elastic-net problems, one per column of $B$!

For fixed $B$, $A$ solves $$\min_{A:\ A'A=I} ||X - (XB)A'||^2,$$ a reduced-rank Procrustes problem with a closed-form solution: if $X'XB = UDV'$, then $\hat{A} = UV'$.

In R you can use the PMA and nsprcomp packages.
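A sketch of both, on the digit matrix `X` from before; the sparsity parameters (`sumabsv`, `k`) are arbitrary choices you should tune.

```r
library(PMA)       # penalized matrix decomposition (SPC)
library(nsprcomp)  # sparse (and non-negative) PCA

## PMA::SPC: 'sumabsv' is an L1 bound on each loading vector
fit_pma <- SPC(as.matrix(X), sumabsv = 3, K = 2)
fit_pma$v                        # sparse loadings

## nsprcomp: 'k' caps the number of non-zero loadings per component
fit_ns <- nsprcomp(as.matrix(X), ncomp = 2, k = 30)
fit_ns$rotation                  # sparse loadings
```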

Let's use sparse SVD on the full digits data set. This generates sparse loadings (subsets of pixels) onto which you can project the digit images. Of course, the price of sparsity is less explained variance for a fixed number of components, but you gain interpretability.

Let's see how the projection onto the sparse principal components worked for separating the digits.

Let's explore the heart disease data from the textbook.

Self-organizing maps

We can visualize large data sets by looking at the leading principal components. Self-organizing maps (SOMs) are a very different way of looking at data.

We're now going to update the prototypes to better summarize the data; this corresponds to bending the PC plane so that it can be mapped to a rectangular grid.

We can be a bit more clever with the updating, taking neighborhood distance into account.
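A minimal sketch of one online update with a Gaussian neighborhood (not the exact algorithm of any particular package): the best-matching prototype and its grid neighbors are all pulled toward the observation.

```r
## M: prototypes (units x p), g: grid coordinates (units x 2), x: one observation
som_step <- function(M, g, x, alpha = 0.05, sigma = 1) {
  bmu   <- which.min(colSums((t(M) - x)^2))        # best-matching unit
  dist2 <- rowSums((g - matrix(g[bmu, ], nrow(g), 2, byrow = TRUE))^2)
  h     <- exp(-dist2 / (2 * sigma^2))             # neighborhood weights on the grid
  M + alpha * h * (matrix(x, nrow(M), ncol(M), byrow = TRUE) - M)
}
```

In practice you loop over the data many times while shrinking `alpha` and `sigma`, so the map first organizes globally and then fine-tunes locally.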

We can also use supervised techniques, where some variables (dimensions) matter more in the distance calculation, and other distance metrics can be used (more appropriate for categorical data).
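In the kohonen package, `xyf` trains on the features and the class labels jointly (a supervised SOM); this sketch again assumes the placeholder matrix `X` and labels `y`, and the grid size is arbitrary.

```r
library(kohonen)
fit_xyf <- xyf(as.matrix(X), classvec2classmat(factor(y)),
               grid = somgrid(6, 6, "hexagonal"))
```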

PCA can fail miserably for mixed data, since variances and Euclidean distances are not meaningful for arbitrarily coded categorical variables.

Let's try SOM on this data set instead.

Let's try this on the digits data set as well.
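A sketch using the kohonen package with a 25 x 25 grid (625 units, matching the codebook size discussed below); plotting options vary a bit across package versions.

```r
library(kohonen)
set.seed(1)
som_fit <- som(as.matrix(X), grid = somgrid(25, 25, "hexagonal"), rlen = 100)
plot(som_fit, type = "mapping", col = y + 1, pch = ".",
     main = "Digits mapped onto the SOM grid")
```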

The codes are not as easy to interpret in a high-dimensional setting (each code is the raster scan of the representative image in the codebook). Let's look at these as images instead (not all 625, of course).
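Reusing `show_digit` from earlier, we can draw a handful of codebook vectors; the unit indices here are arbitrary picks across the grid.

```r
codes <- som_fit$codes[[1]]          # one prototype (codebook vector) per unit
op <- par(mfrow = c(2, 3))
for (u in c(1, 120, 250, 380, 500, 625)) show_digit(codes[u, ])
par(op)
```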

Now we have found representatives for the classes on the grid.

One can use these for prediction, for example.
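For instance (a sketch; `X_test` is a placeholder for held-out digit images): label each unit by the majority digit among the training points mapped to it, then classify a new point by the label of its best-matching unit.

```r
## Majority training label per unit (units with no training points give NA)
unit_label <- tapply(y, som_fit$unit.classif,
                     function(l) names(which.max(table(l))))
test_units <- map(som_fit, as.matrix(X_test))$unit.classif
pred       <- unit_label[as.character(test_units)]
```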

Non-negative Matrix Factorization

Pros and cons of (sparse) SVD: the loadings and components can take negative values even when the data are non-negative, which makes the components hard to interpret as additive "parts" of the data. This motivates a factorization with non-negativity constraints.


NMF: $$X \approx WH, \quad W\geq 0, \ \ H \geq 0 \ \ \text{(elementwise)},$$ where $W$ is $N \times K$ and $H$ is $K \times p$.
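A minimal sketch of the classical Lee-Seung multiplicative updates for the Frobenius loss (the CRAN package NMF provides tested implementations via `nmf()`); it assumes $X \ge 0$.

```r
nmf_mu <- function(X, K, iters = 200, eps = 1e-9) {
  stopifnot(all(X >= 0))
  W <- matrix(runif(nrow(X) * K), nrow(X), K)
  H <- matrix(runif(K * ncol(X)), K, ncol(X))
  for (it in seq_len(iters)) {     # multiplicative updates keep W, H >= 0
    H <- H * (t(W) %*% X) / (t(W) %*% W %*% H + eps)
    W <- W * (X %*% t(H)) / (W %*% H %*% t(H) + eps)
  }
  list(W = W, H = H)
}
```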

Example from the handwritten digits: Let's say we choose to approximate the data with rank $K$

Both SVD and NMF try to find a linear dimension reduction of the data to summarize the data well. The difference lies in the assumed structure of the dimension reduction.

Let's say we have found a rank-$K$ approximation. We can approximate the $j$th observation by $$ \hat{x}_j = \sum_{l=1}^K W_{jl} H_{l\cdot},$$ a non-negative combination of the $K$ non-negative rows of $H$.
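In code, using the `nmf_mu` sketch from above (`pmax` guards against any negative pixel values; `K = 10` is an arbitrary choice):

```r
fit  <- nmf_mu(pmax(as.matrix(X), 0), K = 10)
j    <- 1
xhat <- fit$W[j, ] %*% fit$H        # = sum_l W[j, l] * H[l, ]
```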

Try different $K$ at home and see which works best for SVD and NMF.

The world is low rank....

Let's try to summarize an image.
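As a stand-in for a photograph, the built-in `volcano` matrix (87 x 61 elevation values) works as a grayscale image; here is a rank-$k$ SVD summary of it.

```r
sv   <- svd(volcano)
k    <- 5
vhat <- sv$u[, 1:k] %*% diag(sv$d[1:k], k) %*% t(sv$v[, 1:k])
op <- par(mfrow = c(1, 2))
image(volcano, col = grey.colors(256), axes = FALSE, main = "original")
image(vhat,    col = grey.colors(256), axes = FALSE, main = paste("rank", k))
par(op)
```

Try a few values of `k` to see how quickly the reconstruction sharpens.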