High-dimensional inference

Embedded regularization worked well for us for selecting features in a (semi-)automated fashion. We still needed to be careful when it came to selecting the tuning parameter(s), and we have seen that selection can be quite unstable in the presence of noise (weak signal) and correlated features.

In addition, because selection is baked into the procedure, it is not straightforward to obtain p-values for the estimated parameters.

In this lecture we explore some re-sampling-based methods as well as de-biasing approaches.

Let's work with the heart disease data to make the demonstration easy to follow. Note, however, that this data set is not very high-dimensional per se. We will use an extended version of it with extra noise features below.

Let's use the lasso to select features for this data set and select $\lambda$ via cross-validation.
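A minimal sketch of that step, assuming the data frame is called `heart` with binary response `chd` (these names and the glmnet-based workflow are my assumptions, not necessarily the exact code from class):

```r
library(glmnet)

## Assumed setup: 'heart' is the heart disease data frame with binary
## response 'chd' and the remaining columns as predictors.
x <- model.matrix(chd ~ ., data = heart)[, -1]  # drop the intercept column
y <- heart$chd

set.seed(1)
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # lasso
plot(cvfit)

## Features with nonzero coefficients at the CV-optimal lambda
sel <- which(coef(cvfit, s = "lambda.min")[-1] != 0)
colnames(x)[sel]
```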

We can now refit the selected model to the data, i.e., an unpenalized fit of the features with nonzero coefficients at the optimal $\lambda$ from the cross-validation.
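Continuing the sketch above, the refit could look like this (how the refit is done, here via an ordinary logistic regression, is again my assumption):

```r
## Refit an unpenalized logistic regression using only the selected features
## (names follow the sketch above).
sel_names <- colnames(x)[sel]
refit <- glm(reformulate(sel_names, response = "chd"),
             data = data.frame(chd = y, x),
             family = binomial)
summary(refit)  # Wald p-values for the selected features
```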

Is this OK? We are now using the data for two purposes: selection and validation (inference). We can get overly optimistic p-values for the selected features (somewhat compensated for by the less complex model fit, but the p-values are not valid).

Simple splitting strategy

What if we split the data in two and use one half for training and selection (via cross-validation) and the other half for validating (drawing inference from) the selected model?

There are two main advantages to this: the inference is based on data that was not used for selection, so the p-values are valid, and the refitted model contains only the selected features, so classical low-dimensional theory applies. A sketch of the procedure is given below.
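Something along these lines, using the same assumed `x` and `y` as above:

```r
## A minimal sketch of the single-split idea.
n <- nrow(x)
set.seed(2)
idx_train <- sample(n, floor(n / 2))

## 1) Select on the first half via cross-validated lasso
cv_tr <- cv.glmnet(x[idx_train, ], y[idx_train], family = "binomial")
sel   <- which(coef(cv_tr, s = "lambda.min")[-1] != 0)

## 2) Inference on the second half with an unpenalized refit
dat_te <- data.frame(chd = y[-idx_train], x[-idx_train, sel, drop = FALSE])
fit_te <- glm(chd ~ ., data = dat_te, family = binomial)
summary(fit_te)$coefficients[, "Pr(>|z|)"]
```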

However, there is also a drawback. If you run the above code cell a couple of times (remember to remove the set.seed), you may see that the results differ, both in terms of the p-values and even in terms of which features appear among the selected ones that we get p-values for!

What should we do?

Multi-splitting strategy

Re-sampling to the rescue!

Let's just run the simple splitting strategy several times, each time recording the p-values on the test set.

Note that in each run I correct the p-values from the test set by the number of features selected on the training set.
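A sketch of that loop, building on the assumed objects from the splitting sketch above (the number of runs is arbitrary):

```r
## Multi-splitting: repeat the single split B times and store
## Bonferroni-corrected test-set p-values (1 for non-selected features).
B <- 50
pvals <- matrix(1, nrow = B, ncol = ncol(x),
                dimnames = list(NULL, colnames(x)))

for (b in 1:B) {
  idx   <- sample(n, floor(n / 2))
  cv_tr <- cv.glmnet(x[idx, ], y[idx], family = "binomial")
  sel   <- which(coef(cv_tr, s = "lambda.min")[-1] != 0)
  if (length(sel) == 0) next
  dat_te <- data.frame(chd = y[-idx], x[-idx, sel, drop = FALSE])
  fit_te <- glm(chd ~ ., data = dat_te, family = binomial)
  p_raw  <- summary(fit_te)$coefficients[-1, "Pr(>|z|)"]
  ## correct by the number of selected features (Bonferroni within each run)
  pvals[b, sel] <- pmin(p_raw * length(sel), 1)
}
```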

Let's investigate what the p-values look like across the multiple runs.

As you can see, for some features the p-values are essentially always 1 - this is because they are rarely or never selected. For other features you see p-values away from 1, mixed in with a subset of runs (where the feature was not selected) in which the p-value is 1.

This is tricky to aggregate.

One possibility is to use a summary statistic like the median (or some other quantile) of the p-values from the multiple runs. We have to be a bit careful since the data splits partially overlap between runs, so the p-values are actually dependent. How to aggregate properly is discussed in the paper I posted.

Here I illustrate with the median aggregate.
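With the `pvals` matrix from the sketch above, the median aggregation could look like this (the factor 2 reflects the quantile-based aggregation rule, of which the median is the $\gamma = 0.5$ case):

```r
## Median aggregation of the multi-split p-values, capped at 1.
p_med <- apply(pvals, 2, function(p) min(1, 2 * median(p)))
sort(p_med)
```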

Let's compare with the package 'hdi' in R (I posted a link to the Python package in the Canvas module).
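Roughly, the call could look like the sketch below; the binomial-specific arguments (`glm.pval`, `args.model.selector`, `args.classical.fit`) are my reading of the hdi documentation, so check `?multi.split` before relying on them:

```r
library(hdi)

## Multi-sample splitting via hdi; the family must be passed through to both
## the selector (cross-validated lasso) and the classical test-set fit.
fit_ms <- multi.split(x, y, B = 50, ci = FALSE,
                      args.model.selector = list(family = "binomial"),
                      classical.fit = glm.pval,
                      args.classical.fit = list(family = "binomial"))
fit_ms$pval.corr
```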

The results are quite similar - some variability due to the sampling procedure is expected.

The 'hdi' package has a built-in tool for selecting the best quantile for p-value estimation for each feature. Try different numbers of runs $B$, different splitting fractions between training and test, etc., at home.

Stability selection

In class I talked about stability selection. This method, and variants thereof, is quite popular for, e.g., large-scale network modeling using lasso-type methods. We will again use re-sampling to improve on the selection performance of the lasso.

This is not a p-value estimation method but rather an alternative to tuning-parameter selection, with an eye on testing (controlling false positives).

I start by running lasso on the original data to illustrate the selection paths and the cross-validation errors.
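For example (same assumed `x` and `y` as before):

```r
## Lasso coefficient paths and cross-validation curve on the original data.
fit_path <- glmnet(x, y, family = "binomial", alpha = 1)
plot(fit_path, xvar = "lambda", label = TRUE)

cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
plot(cvfit)  # note the wide error band and the flat region around the minimum
```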

It's not easy to see from the selection paths which $\lambda$ to select, and the cross-validation error band is quite wide with a rather flat curve around the minimum - it's difficult to pick the best amount of regularization here.

Does it matter? Isn't this one step removed from what we're after - the tuning parameter is a tool, but it's the selection we're after.

Let's try multiple splitting and check the proportion of times a feature is included in the model.
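A sketch of how such stability paths can be computed by subsampling (half-samples here; the grid and the number of runs are arbitrary choices):

```r
## Subsample half the data many times, run the lasso over a fixed lambda
## grid, and record how often each feature gets a nonzero coefficient.
lambda_grid <- fit_path$lambda
B <- 100
sel_freq <- matrix(0, nrow = ncol(x), ncol = length(lambda_grid),
                   dimnames = list(colnames(x), NULL))

for (b in 1:B) {
  idx   <- sample(n, floor(n / 2))
  fit_b <- glmnet(x[idx, ], y[idx], family = "binomial", lambda = lambda_grid)
  sel_freq <- sel_freq + (as.matrix(coef(fit_b)[-1, ]) != 0)
}
sel_prob <- sel_freq / B

## Stability paths: selection proportion vs lambda, with a 0.8 threshold
matplot(log(lambda_grid), t(sel_prob), type = "l", lty = 1,
        xlab = "log(lambda)", ylab = "selection proportion")
abline(h = 0.8, lty = 2)
```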

The new path plot is called the stability path. On the y-axis are the selection probabilities (proportions), and we can now consider keeping only the features whose stability values exceed a threshold (here 0.8, drawn in the plot as a dashed line).

In the stability selection paper (posted on Canvas) the authors prove that we can control the expected number of falsely selected features by controlling the threshold value and the range of the tuning parameter $\lambda$. They provide an explicit form for the relationship between the number of false positives, the stability threshold, and the $\lambda$ range.
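For reference, their bound (under an exchangeability assumption, in slightly simplified notation) reads

$$\mathbb{E}[V] \;\le\; \frac{1}{2\pi_{\mathrm{thr}} - 1}\,\frac{q_\Lambda^2}{p},$$

where $V$ is the number of falsely selected features, $\pi_{\mathrm{thr}} \in (\tfrac12, 1]$ is the stability threshold, $q_\Lambda$ is the average number of features selected over the $\lambda$ range $\Lambda$, and $p$ is the total number of features.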

In practice you might say that you believe the true model contains at most $\sim \sqrt{p}$ features. You can then choose a constraint on the number of false positives, and this gives you the range of $\lambda$ to investigate - or you can start with the range and work out the threshold. This is all implemented in the 'stability' function in the 'hdi' package.
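As a quick numerical illustration of the arithmetic (toy numbers, not the heart data):

```r
## Rearranging the bound: the largest average model size q compatible with a
## target E[V] at a given threshold.
p_tot <- 100; EV <- 2.5; thr <- 0.8
q_max <- sqrt(EV * (2 * thr - 1) * p_tot)
q_max  # ~12: restrict the lambda range so that ~12 features enter on average
```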

The method is geared toward high-dimensional settings so let's create a more challenging case from the heart disease data by adding some noise features.
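Something like this, where the 100 noise columns are an arbitrary choice on my part:

```r
## Expand the design matrix with pure-noise predictors to mimic a
## higher-dimensional setting (keeping n > p for later use).
set.seed(3)
noise <- matrix(rnorm(nrow(x) * 100), ncol = 100,
                dimnames = list(NULL, paste0("noise", 1:100)))
x_big <- cbind(x, noise)
dim(x_big)
```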

The stability paths identify a few of the features as quite stable across a range of $\lambda$ values and thresholds. Some of the original features do not reach the threshold, but this is expected from the original fit (check the p-values in the top cell of the notebook).

Let's play with some constraints on the false positives.
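A hedged sketch of the corresponding hdi call (the way I pass the binomial family through `args.model.selector` is an assumption; see `?stability`):

```r
## Stability selection with a bound on the expected number of false
## positives (EV) and a selection threshold.
library(hdi)
fit_stab <- stability(x_big, y, EV = 25, threshold = 0.75,
                      args.model.selector = list(family = "binomial"))
fit_stab
```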

To control the expected number of false positives at around 25 we can use a threshold around 0.75.

Here you see that stability selection adjusts the range of $\lambda$ (q = average model size) to control EV. I used a fixed range in my code.

Let's run a recap of multiple testing on the expanded data set.

In class we also talked about various de-sparsifying/de-biasing approaches for obtaining p-values for the features in a high-dimensional setting. Here are some runs with the methods that were presented in the lectures.
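For instance, the de-sparsified lasso and the ridge projection method in hdi could be called roughly like this (the `family` argument for the binary response is my assumption; the default multiplicity correction is of FWER type):

```r
## De-sparsified (de-biased) lasso and ridge projection p-values via hdi.
fit_lp <- lasso.proj(x_big, y, family = "binomial")
fit_rp <- ridge.proj(x_big, y, family = "binomial")

## Features significant after the default multiplicity correction
which(fit_lp$pval.corr < 0.05)
which(fit_rp$pval.corr < 0.05)
```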

These procedures select very few features for this data set, on par with stability selection controlling the false positives at around 1-5. This is because the default uses FWER-controlling procedures. You can change this to, e.g., BH if you want.

In the 'hdi' package we can also test groups of features. For the expanded heart disease data we get the following result. You see that the groups are refined top-to-bottom, and in the end the selected results agree quite well with the above findings (famhist and age are the strongest selected predictors). Age is correlated with many of the other features, so they are grouped with it in the tests; famhist is not grouped with any of the original predictors.
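The groups come from a hierarchical clustering of the features; just to see which variables end up grouped with age, here is that clustering step on its own (the $1 - |\text{correlation}|$ distance is a common choice and, as far as I recall, the hdi default):

```r
## Hierarchical clustering of the features based on absolute correlation.
d  <- as.dist(1 - abs(cor(x_big)))
hc <- hclust(d, method = "average")
plot(hc, labels = colnames(x_big), cex = 0.5)
```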

To demonstrate the clustering, I add an additional feature strongly correlated with age and rerun the above steps.
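For example (assuming the column is named `age` in the expanded design matrix; the noise level is arbitrary):

```r
## Add a near-duplicate of age so the group tests have an obvious cluster.
set.seed(4)
x_corr <- cbind(x_big, age2 = x_big[, "age"] + rnorm(nrow(x_big), sd = 1))
cor(x_corr[, "age"], x_corr[, "age2"])
```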

Finally, let's just apply the multiple testing procedures to the raw p-values from the full fit. Note that we can do so here because $n > p$; otherwise we would have to use the methods above.
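A sketch of that final step on the expanded data (assuming the objects defined above):

```r
## Full (unpenalized) logistic fit -- possible only because n > p here --
## followed by standard multiple-testing corrections of the raw p-values.
fit_full <- glm(chd ~ ., data = data.frame(chd = y, x_big), family = binomial)
p_raw <- summary(fit_full)$coefficients[-1, "Pr(>|z|)"]

adj <- cbind(bonferroni = p.adjust(p_raw, method = "bonferroni"),
             holm       = p.adjust(p_raw, method = "holm"),
             BH         = p.adjust(p_raw, method = "BH"))
adj[p_raw < 0.05, ]
```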

Summary

If $n<p$ we can use regularized methods for feature selection.