Big n methods

As we discussed in class, when the sample size is large we run into a couple of new problems.

What are we going to do?

Let's start by looking at a big data set on house prices from King County (Seattle, WA).

We have 20000+ samples and 21 dimensions. I will limit the study to a few of the features here to keep things simple. Explore more of them at home.

I start by eliminating indices, dates, and zip codes. Note that these can be important predictors of house prices (location, and time of sale if there are trends in the data). In fact, with this much data you could use the zip code as a high-level categorical feature, for example.
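A minimal sketch of this clean-up step in Python; the file name and the column names ('id', 'date', 'zipcode') are assumptions about how your local copy of the King County data is stored, so adjust them as needed.

```python
import pandas as pd

# Assumed file and column names for the King County sales data;
# adjust these to match your local copy.
housing = pd.read_csv("kc_house_data.csv")
housing = housing.drop(columns=["id", "date", "zipcode"])
print(housing.shape)
```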

Let's run a simple visual exploration of the data: scatter plots. You will notice this takes a while because you have a lot of data. Also, the scatter plots are really dense, which makes them difficult to explore.
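One way to produce those scatter plots, reusing the `housing` frame from the sketch above; the column list is an assumed subset, and a small alpha (or a random subsample) helps with the overplotting.

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Pairwise scatter plots of a few (assumed) columns against price.
# alpha < 1 makes the dense point clouds easier to read.
cols = ["price", "sqft_living", "bedrooms", "grade", "condition"]
scatter_matrix(housing[cols], alpha=0.1, figsize=(10, 10), diagonal="hist")
plt.show()
```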

After some (very simple) data exploration steps we are ready to run a linear model fit.
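A sketch of the fit using statsmodels, which is one convenient choice because its summary prints coefficients and p-values together; the feature list is an assumed subset of the columns.

```python
import statsmodels.api as sm

# Assumed subset of predictors; price is the response.
features = ["sqft_living", "sqft_lot", "bedrooms", "bathrooms", "condition", "grade"]
X = sm.add_constant(housing[features])
y = housing["price"]

fit = sm.OLS(y, X).fit()
print(fit.summary())
```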

With 20000+ observations, all coefficient estimates are highly significant! Compare this to the patterns you see in the scatter plots above...

There is some indication that you might want to turn, e.g., condition into a categorical/ordinal feature, and for low grade the linear trend appears broken. Note also that lot size might be better truncated.

Explore this at home but for now let's move on to explore the big n problems.

First - random projections were introduced in class. If the true data matrix is (approximately) low rank, you can estimate its low-rank summary quite fast via random projections.

For data of this size we don't have a big problem running the SVD as is. We will push the limits a bit further down.
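A sketch of a randomized SVD with scikit-learn, which implements the random-projection idea (project onto a few random directions, orthonormalize, then do the small SVD); the number of components is an arbitrary choice here.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

# Randomized SVD of the (assumed) feature matrix from the fit above.
Xmat = housing[features].to_numpy(dtype=float)
U, S, Vt = randomized_svd(Xmat, n_components=5, random_state=0)

# Compare against the exact singular values for this moderate-size matrix.
S_exact = np.linalg.svd(Xmat, full_matrices=False)[1][:5]
print(S, S_exact)
```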

In class we talked about leveraging. Here, we reduce the size of the data based on leverage (the diagonal entries of the hat matrix) to retain the observations that most drive the fit of the model.

As we saw in class, we don't need to fit the model to get the leverages. We only need the SVD (or randomized SVD) of the data matrix!

We can now use leverage (without computing the fit) to reduce the data set to a set of informative observations with some effective sample size that's more reasonable to analyze/explore.
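A sketch of leverage-based subsampling: the leverages come straight from the thin SVD of the design matrix (the i-th leverage is the squared norm of the i-th row of U), and the subsample size of 2000 is an arbitrary choice.

```python
import numpy as np
import statsmodels.api as sm

# Leverages without fitting the model: for X = U S V^T, h_i = ||U_i||^2,
# i.e. the i-th diagonal entry of the hat matrix.
Xd = sm.add_constant(housing[features]).to_numpy(dtype=float)
U, _, _ = np.linalg.svd(Xd, full_matrices=False)
leverage = (U ** 2).sum(axis=1)

# Keep ~2000 observations, sampled with probability proportional to leverage.
rng = np.random.default_rng(0)
idx = rng.choice(len(leverage), size=2000, replace=False,
                 p=leverage / leverage.sum())
housing_sub = housing.iloc[idx]

# Refit on the reduced sample (used in the comparison below).
fit_sub = sm.OLS(housing_sub["price"],
                 sm.add_constant(housing_sub[features])).fit()
print(fit_sub.rsquared)
```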

I run the regression on the reduced sample. You get a better fit with leverage sampling by design - you asked for the observations at the extremes of x, which of course increases the R-squared.

Leverage can be used to subsample the data for fast model exploration.

Leverage was a way to subsample the data for exploration and modeling. Still, when you subsample a large data set it is difficult to interpret the resulting p-values.

How about we focus on effect size or explained variance instead?

Effect size just means looking at the magnitudes of the coefficient estimates (easier to compare if you standardize the data where appropriate - perhaps not for binary or categorical features...).
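A sketch of standardized coefficients as an effect-size measure, reusing the variables from the earlier fit; whether to standardize an ordinal feature like 'condition' is exactly the judgment call mentioned above.

```python
import statsmodels.api as sm

# Z-score the (assumed) continuous features and the response so that the
# coefficient magnitudes are directly comparable.
Xz = (housing[features] - housing[features].mean()) / housing[features].std()
yz = (housing["price"] - housing["price"].mean()) / housing["price"].std()

fit_std = sm.OLS(yz, sm.add_constant(Xz)).fit()
print(fit_std.params.drop("const").abs().sort_values(ascending=False))
```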

R-squared is another way of looking at the "usefulness" of the model. However, if we want to translate this to individual coefficients we have to think about R-squared per feature.

The package 'relaimpo' in R and 'pingouin' for Python (https://pingouin-stats.org/generated/pingouin.linear_regression.html) have several metrics to do this. The 'relaimpo' paper: https://www.jstatsoft.org/article/view/v017i01
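relaimpo's LMG metric averages each feature's R-squared contribution over all orderings of the features. As a cruder, do-it-yourself stand-in, you can look at the drop in R-squared when a single feature is removed; a sketch, reusing `features` and `y` from the earlier fit.

```python
import statsmodels.api as sm

# Drop-one R-squared: how much does the fit degrade without each feature?
# (relaimpo's LMG metric is a more careful decomposition; this is a crude proxy.)
full_r2 = sm.OLS(y, sm.add_constant(housing[features])).fit().rsquared
for feat in features:
    rest = [f for f in features if f != feat]
    r2 = sm.OLS(y, sm.add_constant(housing[rest])).fit().rsquared
    print(f"{feat:>12s}  delta R^2 = {full_r2 - r2:.4f}")
```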

You can use the bootstrap to assess the stability of the feature importance measures.
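A sketch of that bootstrap: resample the rows with replacement, recompute the standardized coefficients each time, and look at their spread (200 resamples is an arbitrary choice).

```python
import pandas as pd
import statsmodels.api as sm

boot = []
for b in range(200):  # number of bootstrap resamples; increase for smoother estimates
    s = housing.sample(frac=1.0, replace=True, random_state=b)
    Xb = (s[features] - s[features].mean()) / s[features].std()
    yb = (s["price"] - s["price"].mean()) / s["price"].std()
    boot.append(sm.OLS(yb, sm.add_constant(Xb)).fit().params)

boot = pd.DataFrame(boot)
print(boot.describe().loc[["mean", "std"]].T)  # bootstrap mean and spread per coefficient
```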

Let's switch focus to p-values.

So, when the sample size is large, p-values become essentially meaningless: even tiny effects come out statistically significant. We can either use the feature importance metrics from above or investigate how the p-values depend on the sample size. See the paper I posted on Canvas.

Let's visualize how the p-values depend on the sample size and the effect size (the true mean of the simulated data). Notice how rapidly the p-values approach 0 as the sample size grows.
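A minimal simulation of this effect: draw n observations from a normal with a small true mean mu, test H0: mean = 0, and plot the p-value against n for a few choices of mu (the specific effect sizes and sample-size grid are arbitrary).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
sample_sizes = np.unique(np.logspace(1, 4, 30).astype(int))

for mu in [0.02, 0.05, 0.1]:  # small "true" effect sizes
    # One draw per n; averaging over repetitions would smooth the curves.
    pvals = [stats.ttest_1samp(rng.normal(mu, 1.0, size=n), 0.0).pvalue
             for n in sample_sizes]
    plt.plot(sample_sizes, pvals, marker="o", label=f"mu = {mu}")

plt.xscale("log")
plt.xlabel("sample size n")
plt.ylabel("p-value (H0: mean = 0)")
plt.legend()
plt.show()
```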

Let's investigate the pricing data using the same technique. We will chart how both the p-values and the coefficient estimates evolve as a function of sample size.
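A sketch of that chart for the housing regression: fit the same model on growing random subsamples and record the p-values and coefficients at each sample size (the grid of sizes is arbitrary).

```python
import pandas as pd
import statsmodels.api as sm

sizes = [50, 100, 200, 500, 1000, 2000, 5000, 10000, len(housing)]
pvals, coefs = [], []
for n in sizes:
    s = housing.sample(n=n, random_state=0)
    fit_n = sm.OLS(s["price"], sm.add_constant(s[features])).fit()
    pvals.append(fit_n.pvalues)
    coefs.append(fit_n.params)

pvals = pd.DataFrame(pvals, index=sizes)   # rows: sample size, columns: features
coefs = pd.DataFrame(coefs, index=sizes)
print(pvals.round(4))                      # trace when each feature "becomes" significant
```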

Now you can trace when coefficients "become" significant - here you see that 'bedrooms' needs 5000+ observations and 'condition' around 1000, whereas 'grade' and 'sqft_living' enter the model as significant after about 100 observations!