Multiple Testing and High-dimensional Inference

In this lecture we will discuss methods for accounting for multiple tests, to safeguard against rejecting too many null hypotheses (reporting too many false positives) for useful interpretation or follow-up analysis.

Let's start by reviewing simple statistical tests.

I generate normally distributed data $X \sim N(0,1)$ and $Y \sim N(0,1)$ and test the hypothesis $H_0: \mu_X = \mu_Y$ against the one-sided alternative $H_1: \mu_X - \mu_Y > 0$. I use a one-sided hypothesis because it's simpler to illustrate.

I use a one-sided t-test. What if we don't want to make distributional assumptions? I also run a simple permutation-based test to illustrate. That is, I use permutations of the $X$ and $Y$ values to estimate the distribution of the test statistic under the null hypothesis - here actually under the stronger assumption that $X$ and $Y$ have the same distribution.
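Below is a minimal sketch of both tests, here in Python with scipy; the sample size, seed, and number of permutations are arbitrary illustration choices rather than values from the lecture.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 30
x = rng.normal(0, 1, n)   # X ~ N(0, 1)
y = rng.normal(0, 1, n)   # Y ~ N(0, 1), so the null is true

# One-sided two-sample t-test, alternative: mu_X - mu_Y > 0
t_obs, p_t = stats.ttest_ind(x, y, alternative="greater")

# Permutation test: permute the pooled values to approximate the null
# distribution of the difference in means (null: equal distributions)
B = 10_000
pooled = np.concatenate([x, y])
diff_obs = x.mean() - y.mean()
diff_perm = np.empty(B)
for b in range(B):
    z = rng.permutation(pooled)
    diff_perm[b] = z[:n].mean() - z[n:].mean()
p_perm = np.mean(diff_perm >= diff_obs)

print(f"t-test p = {p_t:.3f}, permutation p = {p_perm:.3f}")
```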

OK - that was running one test under the null.

What do we usually do? We analyze data, e.g. with a model, view the model summary (which often includes p-values), and draw conclusions. I run a logistic regression on the South African heart disease data and view the model summary.
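As a sketch of such a fit, assuming the SAheart data is available locally as a CSV with the usual nine predictors and the binary outcome `chd` (the file name here is hypothetical, and the lecture's own code may differ):

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical local copy of the South African heart disease data
sa = pd.read_csv("SAheart.csv")
sa["famhist"] = (sa["famhist"] == "Present").astype(int)   # recode the factor

X = sm.add_constant(sa.drop(columns=["chd"]))
fit = sm.GLM(sa["chd"], X, family=sm.families.Binomial()).fit()
print(fit.summary())                        # Wald z-statistics and p-values per coefficient
coef_pvals = fit.pvalues.drop("const")      # the nine coefficient tests
```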

Here we performed 9 tests on coefficients. If we use $\alpha=0.05$ as our significance cutoff, 5 coefficients are found to be significantly different from 0. From previous classes you (hopefully) know to be cautious about interpreting p-values one by one. First, if there are correlations among the predictors, this affects the estimation variance and hence the p-values. Second, since this is a generalized linear model, the p-values are derived from an asymptotic result based on the last linear approximation in the iteratively reweighted least squares fit.

Here we now also consider one more complication - the fact that you performed multiple tests. Ignoring the possible correlation between the estimates (and therefore the p-values), and supposing all 9 null hypotheses were true, we know that $$P(\text{reject at least one of the 9 null hypotheses at level} \ \alpha) = 1-(1-\alpha)^9 \simeq 0.37$$ for $\alpha=0.05$. So it's quite likely that at least one of the "discoveries" here is false.

How does $P(\text{reject at least one of the } n \text{ null hypotheses at level} \ \alpha)$ scale with the number of tests $n$? Let's plot it.
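For $n$ independent tests with all nulls true this probability is $1-(1-\alpha)^n$; a quick sketch of the plot:

```python
import numpy as np
import matplotlib.pyplot as plt

alpha = 0.05
n_tests = np.arange(1, 201)
p_any = 1 - (1 - alpha) ** n_tests   # P(at least one rejection) when all n nulls are true

plt.plot(n_tests, p_any)
plt.xlabel("number of tests n")
plt.ylabel("P(at least one false rejection)")
plt.show()
```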

Once you reach 100 tests you are essentially guaranteed to make at least one false discovery!

More on p-values

What can we expect to see if we perform multiple tests? What's the distribution of the p-values?

Let's start with the case of all null hypotheses being true.
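A sketch of the simulation: repeat the two-sample comparison many times with both groups drawn from N(0,1) and look at the histogram of p-values (the number of tests and the sample size are arbitrary choices here):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
m, n = 1000, 30                              # m tests, all with the null true
x = rng.normal(0, 1, (m, n))
y = rng.normal(0, 1, (m, n))
pvals_null = stats.ttest_ind(x, y, axis=1, alternative="greater").pvalue

plt.hist(pvals_null, bins=20)
plt.xlabel("p-value")
plt.ylabel("count")
plt.show()
```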

Notice that the distribution of the p-values is essentially uniform! That is, when the null is true, a p-value can fall anywhere in the interval (0,1). If you use the threshold $\alpha$, you can expect about $n \alpha$ of the $n$ p-values to fall below it.

What if the null isn't true? Let's generate data where, for a few of the tests, the samples come from a different distribution.
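A sketch where the first $m_1 = 50$ of the $m = 1000$ tests are non-null ($X$ gets a mean shift of 1; both the shift and $m_1$ are arbitrary choices):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
m, m1, n = 1000, 50, 30
mu = np.where(np.arange(m) < m1, 1.0, 0.0)   # mean shift for the non-null tests
x = rng.normal(mu[:, None], 1, (m, n))
y = rng.normal(0.0, 1, (m, n))
pvals = stats.ttest_ind(x, y, axis=1, alternative="greater").pvalue

plt.hist(pvals, bins=20)
plt.xlabel("p-value")
plt.ylabel("count")
plt.show()
```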

Notice how there is now an "excess" of small p-values compared to the uniform distribution (and a slightly fatter tail of the test statistic). This is an indication that some of the null hypotheses are not true. Remember, though, that for the $n_0$ true nulls the p-values still come from U(0,1), so some of them may well also land in the low end of the distribution.

One way to safeguard against making too many false rejections is to use a more stringent level for each test. The Bonferroni correction uses $\alpha/n$ as the adjusted level of the test.

Let's try it.
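A sketch, reusing `pvals`, `m`, and `m1` from the simulation above:

```python
alpha = 0.05
reject = pvals < alpha / m                 # Bonferroni: compare to alpha / (number of tests)
print(f"{reject.sum()} rejections, of which "
      f"{reject[:m1].sum()} are among the {m1} true signals")
```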

This is really conservative!

In the lecture we talked about changing focus from safeguarding against making any false rejections to instead focusing on the proportion of rejections that are false. That is, among your detections, how many are likely to be false?

Plotting the p-values on a log10 scale puts more focus on the small p-values.

Let's zoom in on the small p-values and add the Bonferroni threshold to the figure (in blue).
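A sketch of the figure, reusing `pvals` and `alpha` from above; looking at the 100 smallest p-values is an arbitrary zoom choice:

```python
import numpy as np
import matplotlib.pyplot as plt

k = 100                                       # zoom: the 100 smallest p-values
sorted_p = np.sort(pvals)
below = sorted_p[:k] < alpha / len(pvals)     # below the Bonferroni cutoff
plt.scatter(np.arange(1, k + 1), np.log10(sorted_p[:k]),
            s=10, c=np.where(below, "blue", "grey"))
plt.axhline(np.log10(alpha / len(pvals)), color="blue", linestyle="--")
plt.xlabel("rank")
plt.ylabel("log10 p-value")
plt.show()
```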

Here you can see the few detections that we got using the Bonferroni correction (blue points below the blue dashed line at $\alpha/n$).

Benjamini-Hochberg and False Discovery Rate

Let's try controlling the false discovery rate instead of the FWER (familywise error rate, probability of making at least one false rejection).

The BH procedure compares the sorted p-values to the diagonal cutoff with slope $\alpha$, i.e. the $r$-th smallest p-value should not exceed $\alpha (r/n)$ if you want to control the false discovery rate at level $\alpha$. Note that here you don't have to use $\alpha=0.05$ or $0.01$ but can use a threshold that results in an "affordable" number of discoveries/follow-up experiments.
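A sketch of the step-up rule: find the largest rank $r$ with $p_{(r)} \le \alpha r/n$ and reject the $r$ smallest p-values. (The same decision is available from statsmodels via `multipletests(pvals, alpha, method="fdr_bh")`; the helper below is just to make the rule explicit, applied to the simulated `pvals` and `m1` from above.)

```python
import numpy as np

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: reject the r smallest p-values, where r is
    the largest rank such that p_(r) <= alpha * r / n."""
    n = len(pvals)
    order = np.argsort(pvals)
    sorted_p = pvals[order]
    passes = sorted_p <= alpha * np.arange(1, n + 1) / n
    reject = np.zeros(n, dtype=bool)
    if passes.any():
        r = np.max(np.nonzero(passes)[0]) + 1    # largest rank passing the cutoff
        reject[order[:r]] = True
    return reject

reject_bh = bh_reject(pvals, alpha=0.05)
print(f"{reject_bh.sum()} BH discoveries, "
      f"{reject_bh[:m1].sum()} among the true signals")
```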

At $\alpha=0.05$ we obtain more discoveries, but of course some are not true. The observed false discovery proportion (FDP) may not equal $\alpha$, since we only control the FDR, i.e. the FDP in expectation.

Let's run this multiple times to observe.
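A sketch that repeats the whole simulation and records the false discovery proportion of the BH rejections in each repetition (using the `bh_reject` helper above; 200 repetitions is an arbitrary choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
m, m1, n, alpha = 1000, 50, 30, 0.05
fdps = []
for rep in range(200):
    mu = np.where(np.arange(m) < m1, 1.0, 0.0)
    x = rng.normal(mu[:, None], 1, (m, n))
    y = rng.normal(0.0, 1, (m, n))
    p = stats.ttest_ind(x, y, axis=1, alternative="greater").pvalue
    rej = bh_reject(p, alpha)
    false_rej = rej[m1:].sum()                   # rejections among the true nulls
    fdps.append(false_rej / max(rej.sum(), 1))   # FDP (0 if nothing rejected)

print(f"average FDP over repetitions (estimates the FDR): {np.mean(fdps):.3f}")
```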

The BH procedure controls the expected FDP, the False Discovery Rate.

Let's have another look at the digits data. Now, there are a lot of digits, so we will use only a few of them to test which pixels differ between the digits.

In the next few lectures we will discuss testing when the sample size is large - when p-values are difficult to use because everything becomes significant...

Let's start by just comparing 0s and 1s.

Let's plot the pixel-wise p-values and the significant pixels for the 0-1 comparison.
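A sketch using scikit-learn's 8x8 digits as a stand-in for the lecture's digits data (the actual dataset, the number of samples used, and the threshold choice may differ):

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

digits = load_digits()
X0 = digits.images[digits.target == 0][:25].reshape(25, -1)   # a few 0s
X1 = digits.images[digits.target == 1][:25].reshape(25, -1)   # a few 1s

# Pixel-wise two-sample t-tests: does the mean intensity differ between 0s and 1s?
pvals_pix = stats.ttest_ind(X0, X1, axis=0).pvalue
pvals_pix = np.nan_to_num(pvals_pix, nan=1.0)   # constant pixels give NaN; treat as non-significant

signif = pvals_pix < 0.05 / pvals_pix.size      # Bonferroni threshold, as one choice

fig, axes = plt.subplots(1, 2, figsize=(6, 3))
axes[0].imshow(-np.log10(pvals_pix).reshape(8, 8))
axes[0].set_title("-log10 p-value")
axes[1].imshow(signif.reshape(8, 8))
axes[1].set_title("significant pixels")
plt.show()
```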

Try another set of digits.

Finally, let's revisit the SA heart disease data. Does a multiple testing correction of the coefficient p-values make a difference here?
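A sketch that applies Bonferroni and BH to the nine coefficient p-values from the logistic regression sketch above (`coef_pvals`); `multipletests` is statsmodels' multiple-testing helper:

```python
import pandas as pd
from statsmodels.stats.multitest import multipletests

alpha = 0.05
comparison = pd.DataFrame({
    "p": coef_pvals,
    "raw": coef_pvals < alpha,
    "bonferroni": multipletests(coef_pvals, alpha=alpha, method="bonferroni")[0],
    "bh": multipletests(coef_pvals, alpha=alpha, method="fdr_bh")[0],
})
print(comparison)
```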