Model selection demo

Lasso and Adaptive lasso

In the lecture we talked about ridge and lasso regression. Both of these methods address the high estimation variance that results from either high-dimensional settings or high correlations between features. Estimation variance directly impacts predictive performance, so suppressing estimation variance can be essential.

However, both ridge and lasso lead to biased estimates of the model coefficients. Increased bias can lead to a deterioration of predictive performance. Thus, we need to find a balance between increasing bias and decreasing estimation variance. This is usually done by selecting our penalty/regularization tuning parameter via cross-validation.

An alternative is to de-bias the estimate.

Let's consider the form for the ridge and lasso estimates (for the simple case of uncorrelated features):

$$\hat{\beta}_{ridge} = \frac{\hat{\beta}}{1+\lambda}$$

$$\hat{\beta}_{lasso} = \left(\mid\hat{\beta}\mid - \lambda\right)^{+} \operatorname{sgn}(\hat{\beta})$$
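These two shrinkage rules are easy to verify numerically. The sketch below (a minimal NumPy implementation, assuming an orthonormal design so the closed forms apply) shows that ridge shrinks every coefficient proportionally, while lasso subtracts a constant $\lambda$ and sets small coefficients exactly to zero:

```python
import numpy as np

def ridge_shrink(beta_hat, lam):
    """Ridge estimate under an orthonormal design: proportional shrinkage."""
    return beta_hat / (1.0 + lam)

def lasso_shrink(beta_hat, lam):
    """Lasso estimate under an orthonormal design: soft-thresholding."""
    return np.sign(beta_hat) * np.maximum(np.abs(beta_hat) - lam, 0.0)

beta_hat = np.array([-3.0, -0.5, 0.2, 2.0])
lam = 1.0

print(ridge_shrink(beta_hat, lam))  # [-1.5  -0.25  0.1   1.  ]
print(lasso_shrink(beta_hat, lam))  # [-2. -0.  0.  1.]
```

Note how the large coefficient $-3$ keeps a bias of exactly $\lambda = 1$ under lasso, while ridge's bias ($1.5$ here) grows with the size of the coefficient.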

The bias of the ridge estimates is larger for larger $\beta$, whereas the bias for lasso is a constant $\lambda$ (for the coefficients it retains). Still, if we could keep the selection property of lasso and remove the bias, that would be better still!

If we knew which coefficients were truly non-zero, we shouldn't penalize these but only those that we want to "select out" of the model. Of course, we don't know this, but by running e.g. ridge first we do get a sense of which coefficients are large and which are near 0.

This is the motivation for adaptive lasso.

Let's assume we have a (near) unbiased estimate, $\hat{\beta}_u$, which we obtain from either OLS or ridge regression with a small degree of regularization. We then use these estimates to adjust the penalty for each coefficient $j$ as $$\lambda_j = \frac{\lambda}{\mid \hat{\beta}_{u,j} \mid^{\gamma}},$$ where $\gamma > 0$ leads to slightly different debiasing and selection results (see the paper I posted with the lecture).
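One convenient way to fit the adaptive lasso is to absorb the coefficient-specific penalties into the design matrix: rescaling column $j$ by $\mid\hat{\beta}_{u,j}\mid^{\gamma}$ and running an ordinary lasso is equivalent to penalizing coefficient $j$ by $\lambda / \mid\hat{\beta}_{u,j}\mid^{\gamma}$. A sketch using scikit-learn (an assumption — the demo's own code may use different tooling; the data-generating settings here are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge, LassoCV

rng = np.random.default_rng(0)
n, p = 200, 10
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0])
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)

# Step 1: (near) unbiased pilot estimate via lightly regularized ridge.
beta_u = Ridge(alpha=0.01).fit(X, y).coef_

# Step 2: rescale each column by |beta_u_j|^gamma; a plain lasso on the
# rescaled design then penalizes coefficient j by lambda / |beta_u_j|^gamma.
gamma = 1.0
w = np.abs(beta_u) ** gamma
lasso = LassoCV(cv=5).fit(X * w, y)

# Step 3: map the coefficients back to the original scale.
beta_adaptive = lasso.coef_ * w
print(np.round(beta_adaptive, 2))
```

Coefficients with large pilot estimates are barely shrunk, while those with pilot estimates near 0 receive a huge penalty and are typically set exactly to 0.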

This means that those ridge coefficient estimates that are large are penalized less in the adaptive lasso step and those with small ridge coefficients are penalized more and may be set to 0 (sparse model).

Note that the ridge estimates are not sparse. All lasso variants tended to correctly remove the zero coefficients for the model with these settings (try other settings at home). The bias reduction is usually more noticeable for the larger coefficients. The cross-validation may affect "who's the winner" across different runs.

The solution paths for lasso and adaptive lasso showcase the bias reduction for the large coefficients.

Now try with different correlations, sample sizes etc.

Below, I make several runs and look at the bias for the different estimation methods as well as predictive performance and selection performance.

For this run, adaptive lasso resulted in a better model than plain lasso, both in terms of selection performance and predictive performance. Still, ridge is best in terms of prediction. Remember from class: interpretability and prediction performance are not necessarily the same goal.

Next, I add some correlation to the feature groups and increase the sample size.

Of course, with this much data both lasso and adaptive lasso do a great job selecting the true coefficients. The predictive performance of all methods is good (check a scatter plot of predicted vs. true values to verify).

Let's now use the same simulation set up and compare lasso and elastic net. Remember from class that elastic net incorporates both a ridge and a lasso penalty - where the ridge penalty tends to allow for correlated features to enter the model together.
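A small sketch of this comparison (again assuming scikit-learn; the group structure — three groups of four features with groups 1 and 3 internally correlated, one true predictor per correlated group — mirrors the demo's description but the exact parameters are my assumptions):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV

rng = np.random.default_rng(2)
n, p, rho = 150, 12, 0.9

def corr_block(rho, size):
    """Equicorrelated covariance block with unit variances."""
    return rho * np.ones((size, size)) + (1 - rho) * np.eye(size)

# Groups 1 (features 0-3) and 3 (features 8-11) are internally correlated.
Sigma = np.eye(p)
Sigma[0:4, 0:4] = corr_block(rho, 4)
Sigma[8:12, 8:12] = corr_block(rho, 4)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

beta_true = np.zeros(p)
beta_true[[0, 8]] = 2.0  # one true predictor in each correlated group
y = X @ beta_true + rng.standard_normal(n)

lasso = LassoCV(cv=5).fit(X, y)
enet = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, y)  # mix of ridge and lasso

print("lasso selects:", np.flatnonzero(np.abs(lasso.coef_) > 1e-8))
print("enet selects: ", np.flatnonzero(np.abs(enet.coef_) > 1e-8))
```

Typically the elastic net pulls in several members of each correlated group, while the lasso tends to pick a single representative per group.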

Here, groups 1 and 3 contained correlated features. Notice how ridge and elastic net tend to select all the features in a group as part of the model in a coordinated fashion.

Notice that elastic net will have a larger FPR because features may join the model based on their correlation with true predictors. Depending on the sample size, correlation structure, etc., you may get a better TPR with elastic net.

Notice how the lasso estimates are all over the place!

Try with different sample sizes, group correlations etc.

The black and green curves are the 1st and 3rd group of features, respectively. Notice how the elastic net (right panel) produced solution paths that tend to select many of the green and black features. The wider lines are the true features. Notice that while elastic net picks many features, the true ones tend to be included early on.

The lasso (left panel) really struggles here. Notice that it "randomly" selects one of the group features (not the true one) early on while the true ones are not even part of the model until almost no penalty is applied.
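Solution-path figures like these can be reproduced with scikit-learn's path utilities (an assumption about tooling; the design below is a small equicorrelated stand-in for the demo's group structure):

```python
import numpy as np
from sklearn.linear_model import lasso_path, enet_path

rng = np.random.default_rng(3)
n, p, rho = 100, 8, 0.9
Sigma = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)  # one correlated group
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta_true = np.zeros(p)
beta_true[0] = 2.0  # single true predictor inside the group
y = X @ beta_true + rng.standard_normal(n)

# Each coefs array has shape (n_features, n_alphas), with alphas descending.
alphas_l, coefs_l, _ = lasso_path(X, y, n_alphas=50)
alphas_e, coefs_e, _ = enet_path(X, y, l1_ratio=0.5, n_alphas=50)

# Plotting coefs against -log(alpha) (e.g. with matplotlib) reproduces the
# two-panel solution-path figure from the demo.
print(coefs_l.shape, coefs_e.shape)  # (8, 50) (8, 50)
```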

Try with less correlation, bigger/smaller groups etc and see what happens.