Am J Physiol Regul Integr Comp Physiol 279: R1-R8, 2000;
0363-6119/00 $5.00
Vol. 279, Issue 1, R1-R8, July 2000

INVITED REVIEW
Multiple comparisons: philosophies and illustrations

Douglas Curran-Everett

Departments of Preventive Medicine and Biometrics and of Physiology and Biophysics, School of Medicine, University of Colorado Health Sciences Center, Denver, Colorado 80262


    ABSTRACT
TOP
ABSTRACT
INTRODUCTION
THE ISSUE EMBEDDED IN...
PHILOSOPHIES ABOUT MULTIPLE...
THE GENERAL STRATEGY
SIMULATED SAMPLE OBSERVATIONS
NEWMAN-KEULS PROCEDURE
BONFERRONI PROCEDURE
LEAST SIGNIFICANT DIFFERENCE...
FALSE DISCOVERY RATE PROCEDURE:...
SUMMARY
APPENDIX
REFERENCES

Statistical procedures underpin the process of scientific discovery. As researchers, one way we use these procedures is to test the validity of a null hypothesis. Often, we test the validity of more than one null hypothesis. If we fail to use an appropriate procedure to account for this multiplicity, then we are more likely to reach a wrong scientific conclusion---we are more likely to make a mistake. In physiology, experiments that involve multiple comparisons are common: of the original articles published in 1997 by the American Physiological Society, ~40% cite a multiple comparison procedure. In this review, I demonstrate the statistical issue embedded in multiple comparisons, and I summarize the philosophies of handling this issue. I also illustrate the three procedures---Newman-Keuls, Bonferroni, least significant difference---cited most often in my literature review; each of these procedures is of limited practical value. Last, I demonstrate the false discovery rate procedure, a promising development in multiple comparisons. The false discovery rate procedure may be the best practical solution to the problems of multiple comparisons that exist within physiology and other scientific disciplines.

Bonferroni inequality, false discovery rate, least significant difference, Newman-Keuls, statistics


    INTRODUCTION

STATISTICAL PROCEDURES are inherent to scientific discovery. As researchers, we use these procedures for two main reasons: to obtain point and interval estimates about the value of a population parameter, and to test the validity of a null hypothesis (5). Point and interval estimates emphasize the magnitude and uncertainty of the experimental results. The test of a null hypothesis helps guard against an unwarranted scientific conclusion, or it helps argue for a real experimental effect (18). When more than one hypothesis is tested---when multiple comparisons are made---the validity of our scientific conclusions may be weakened if we fail to use an appropriate multiple comparison procedure (6, 8, 11, 14, 19, 20).

In studies published recently by the American Physiological Society (APS), the citation of a multiple comparison procedure is common (Table 1). This finding raises an important question: do physiologists understand the philosophies and assumptions behind competing multiple comparison procedures? This question is relevant for three reasons: there are many procedures available, textbooks of statistics (for example, Refs. 1, 13, and 18) provide little more than a cursory description of the procedures themselves, and there can be several solutions to the problem created by multiple comparisons.

                              
Table 1.   Manuscripts of APS journals in 1997: use of multiple comparison procedures

In this paper, I summarize the statistical issue embedded in multiple comparisons, and I review the philosophies of handling this issue. Then, I illustrate the three procedures---Newman-Keuls, Bonferroni, least significant difference---cited most often in my literature review. Last, I review the false discovery rate, a promising development in multiple comparisons.

Glossary


$\alpha$   Error rate for a single comparison
$\alpha_{\mathcal{F}}$   Error rate for a family of $k$ comparisons
$H_0$   Null hypothesis
$\mu$   Population mean
$P$   Achieved significance level
$\Pr\{A\}$   Probability of event $A$
$\bar{y}$   Sample mean
$\Delta\bar{y}^*$   Critical difference between two sample means


    THE ISSUE EMBEDDED IN MULTIPLE COMPARISONS

To test a null hypothesis, we must formulate the hypothesis beforehand. Then, using data collected during the experiment, we must compute the observed value T of some test statistic. Last, we must compare the observed value T to a critical value T*, chosen from the distribution of the test statistic that is based on the null hypothesis. If T is more extreme than T*, then that is surprising if the null hypothesis is true, and we are entitled to become skeptical about the scientific validity of the null hypothesis.

Suppose we want to assess renal blood flow in two independent samples. If our objective is to compare the underlying population means, µ1 and µ2, then one pair of null and alternative hypotheses, H0 and H1, is
$$H_0\colon \mu_1 = \mu_2$$

$$H_1\colon \mu_1 \ne \mu_2$$
The probability that we reject $H_0$ given that $H_0$ is true is the error rate $\alpha$. We can use mathematical notation¹ to write this statement as
$$\Pr\{\text{reject } H_0 \mid H_0 \text{ is true}\} = \alpha \tag{1}$$
Note that the critical value $T^*$ is the $100[1 - (\alpha/2)]$th percentile from the distribution of the test statistic given that the null hypothesis is true. Equation 1 can be rewritten as
$$1 - \Pr\{\text{fail to reject } H_0 \mid H_0 \text{ is true}\} = 1 - (1 - \alpha) = \alpha \tag{2}$$

Multiple comparisons. Suppose we want to assess renal blood flow in three independent samples.² In this setting, there are three alternative hypotheses, $H_1$ to $H_3$, that correspond to the comparisons among population means:
$$H_0\colon \mu_1 = \mu_2 = \mu_3$$

$$H_1\colon \mu_1 \ne \mu_2$$

$$H_2\colon \mu_1 \ne \mu_3$$

$$H_3\colon \mu_2 \ne \mu_3$$
Associated with each of these comparisons is an error rate of magnitude $\alpha$. If the three comparisons are considered to be a family, then the family will have an error rate $\alpha_{\mathcal{F}}$, where $\alpha_{\mathcal{F}} > \alpha$. As a result, it is more likely that a true null hypothesis will be rejected erroneously. This is the statistical issue that lies at the heart of multiple comparison procedures.

To see why this issue warrants our attention, imagine that each of $k$ independent comparisons is tested at an error rate of $\alpha$. Assume that the underlying populations are identical and that each of the $k$ null hypotheses is true. What is $\alpha_{\mathcal{F}}$, the probability that at least one of the $k$ comparisons will reject a true null hypothesis? As in Eq. 2, the probability of rejecting at least one $H_0$ given that all $H_0$ are true can be written
$$1 - \Pr\{\text{fail to reject all } H_0 \mid \text{all } k\ H_0 \text{ are true}\} = 1 - (1 - \alpha)^k = \alpha_{\mathcal{F}}$$
For a single comparison, $\alpha_{\mathcal{F}} = \alpha$. When the number of comparisons increases, $\alpha$ remains constant, but $\alpha_{\mathcal{F}}$ increases. For example, if $\alpha = 0.05$, then for $k = 1, 2, 3, 4, 5, \ldots, 10$,
$$\begin{array}{l|ccccccc} k & 1 & 2 & 3 & 4 & 5 & \cdots & 10 \\ \hline \alpha_{\mathcal{F}} & 0.05 & 0.10 & 0.14 & 0.19 & 0.23 & \cdots & 0.40 \end{array}$$
For $k = 10$ comparisons, there is a 40% chance that we will reject erroneously at least one true null hypothesis.
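
The growth of the family error rate can be checked numerically. The following is a minimal Python sketch, not part of the original article; the function name `family_error_rate` is an illustrative choice, and the calculation assumes, as in the text, $k$ independent comparisons for which every null hypothesis is true.

```python
def family_error_rate(alpha: float, k: int) -> float:
    """Probability of rejecting at least one of k true, independent null hypotheses."""
    return 1.0 - (1.0 - alpha) ** k


if __name__ == "__main__":
    # Single comparison error rate of 0.05, as in the worked example.
    for k in (1, 2, 3, 4, 5, 10):
        print(f"k = {k:2d}   family error rate = {family_error_rate(0.05, k):.2f}")
```

Rounded to two decimals, the values for $k$ = 1, 2, 3, 4, 5, and 10 agree with the tabulated family error rates.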

Misguided multiple comparisons. In many of the studies tallied in Table 1, a multiple comparison procedure was used to analyze several groups of observations made on the same subjects. In general, this use of a multiple comparison procedure is misguided: most procedures assume that the groups are independent, but repeated observations on a subject, for example, observations made during baseline and then during several periods after some intervention, create correlation among the groups (9). As a result, the true error variability is underestimated, and the observed values for the standard deviations of the group means underestimate the true variabilities (9). When most multiple comparison procedures are used to analyze groups of repeated observations, the outcome will be an inflated number of statistically significant differences among the group means (see APPENDIX).


    PHILOSOPHIES ABOUT MULTIPLE COMPARISONS

Would you tell me, please, which way I ought to go from here? (Alice)

That depends a good deal on where you want to get to. (The Cat)

L. Carroll in Alice's Adventures in Wonderland (1865)

When we decide the validity of a single comparison, we can make a mistake: we can reject a true null hypothesis, or we can fail to reject a false null hypothesis. When we decide the validity of k comparisons---this happens in most experiments---we are more likely to reject a true null hypothesis. The challenge for any multiple comparison procedure is to satisfy two conflicting requirements: reduce the risk that we reject a true null hypothesis but maintain the likelihood that we detect an experimental effect if it exists (7, 12, 17). The relative importance assigned to these requirements has produced opposing philosophies about how to handle the issue of multiple comparisons.

Focus on individual comparisons. Proponents of this philosophy argue it is sufficient to control the single comparison error rate $\alpha$, the probability that we reject a true null hypothesis. They base this philosophy on the assumption that most scientific comparisons are preplanned (2, 15, 16). This assumption is naive and unrealistic: many experimental effects are discovered only after an investigator explores (rummages through) the data.

Control for multiple comparisons. In general, physiologists examine the impact of an intervention on a set (a family) of related comparisons: for example, the impact of some drug on renal blood flow and urinary excretion of hormones and electrolytes, or a series of paired comparisons among several groups of observations. In these situations, we base our scientific conclusions on a family of comparisons: that is, multiple comparisons considered as a single entity. As a result, it is not the single comparison error rate $\alpha$ that we must control but the family error rate $\alpha_{\mathcal{F}}$, the probability that we reject at least one true null hypothesis in the family of comparisons (7, 8, 11-13, 17, 19, 20). Multiple comparison procedures provide control of the family error rate $\alpha_{\mathcal{F}}$.


    THE GENERAL STRATEGY

Most multiple comparison procedures use the same basic strategy: to make inferences about the population means for two groups, $\mu_\ell$ and $\mu_\phi$, they compare the magnitude of the difference between the sample means $\bar{y}_\ell$ and $\bar{y}_\phi$ to a critical difference $\Delta\bar{y}^*$. If
$$|\bar{y}_\phi - \bar{y}_\ell| > \Delta\bar{y}^*$$
where
$$\Delta\bar{y}^* = c \cdot \mathrm{SE}\{u\} \tag{3}$$
and where $\mathrm{SE}\{u\}$ is the standard error of the quantity $u$, then that is statistical evidence that $\mu_\ell \ne \mu_\phi$. Procedures differ in the statistics substituted for the coefficient $c$ and the quantity $u$. Table 2 lists the statistics for the Newman-Keuls, Bonferroni, and least significant difference tests.

                              
Table 2.   Calculation of the critical difference between sample means, $\Delta\bar{y}^*$


    SIMULATED SAMPLE OBSERVATIONS

An article published recently in the Journal provides an ideal framework with which to illustrate multiple comparison procedures. In the experiment, Koch et al. (10) explored the heritability of running endurance, measured as distance run, in rats. I used the observed sample statistics from 10 experimental groups (Fig. 1) as the empirical foundation for the simulated sample observations.³


Fig. 1.   Experimental groups 1-10 associated with the simulated sample observations and derived sample statistics listed in Table 3. This diagram is based on the selective breeding procedure described in Ref. 10. The initial generation is generation 0. In each generation, the 2 female (open circle) and 2 male rats at the extremes of observed running endurance were paired and bred to produce the subsequent generation.

This is how I generated the simulated sample observations (the data). Let the random variable $Y_j$ represent the distance run by a rat in group $j$, where $j = 1, 2, \ldots, 10$. Assume that each $Y_j$ is distributed normally with mean $\mu_j$ and variance $\sigma_j^2$
$$Y_j \sim N(\mu_j, \sigma_j^2)$$
I estimated each $\mu_j$ and $\sigma_j$ using approximate values for the observed group means and standard deviations (see Ref. 10, Tables 1 and 2). For simplicity, I limited each sample to 10 observations. One set of 10 simulated samples is listed in Table 3. For the rest of the review, I use the resulting sample means
$$\bar{y}_1 = 474,\ \bar{y}_2 = 291,\ \ldots,\ \bar{y}_{10} = 612$$
and the resulting sample standard deviations
$$s_1 = 100,\ s_2 = 102,\ \ldots,\ s_{10} = 65$$
as the basis for my illustration of specific multiple comparison procedures.
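
The sampling scheme described above can be sketched in a few lines of Python. This is an illustration only, not the code used for the review: the helper names are hypothetical, the seed is arbitrary, and the $(\mu_j, \sigma_j)$ pairs are assumptions borrowed from the three sample means and standard deviations quoted above, because the population values taken from Ref. 10 are not listed in the text.

```python
import random
import statistics


def simulate_group(mu, sigma, n=10, rng=None):
    """Draw n observations from a normal distribution N(mu, sigma^2)."""
    rng = rng or random.Random()
    return [rng.gauss(mu, sigma) for _ in range(n)]


def sample_statistics(observations):
    """Return the sample mean and the sample standard deviation (n - 1 divisor)."""
    return statistics.mean(observations), statistics.stdev(observations)


if __name__ == "__main__":
    rng = random.Random(2000)  # arbitrary seed, chosen only to make the run repeatable
    # Assumed (mu, sigma) pairs for groups 1, 2, and 10; not the values from Ref. 10.
    assumed_parameters = {1: (474, 100), 2: (291, 102), 10: (612, 65)}
    for group, (mu, sigma) in assumed_parameters.items():
        mean, sd = sample_statistics(simulate_group(mu, sigma, rng=rng))
        print(f"group {group}: mean = {mean:.0f}, sd = {sd:.0f}")
```

Because each sample holds only 10 observations, the simulated means and standard deviations will scatter around the assumed population values from run to run.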

                              
Table 3.   Simulated sample observations and derived sample statistics


    NEWMAN-KEULS PROCEDURE

The Newman-Keuls procedure⁴ is a multiple range test that compares the underlying population means of $r$ experimental groups. That is, it evaluates the null hypothesis
$$H_0\colon \mu_1 = \mu_2 = \cdots = \mu_r \tag{4}$$
The procedure sets the family error rate $\alpha_{\mathcal{F}}$ at $\alpha$, the single comparison error rate, by using studentized range distributions to calculate critical differences (see Eq. 5).

Another multiple range test is the Duncan procedure.⁵ It is only the specification of $\alpha_{\mathcal{F}}$ that differentiates the method of Duncan from that of Newman-Keuls. The Duncan family error rate is $\alpha_{\mathcal{F}} = 1 - (1 - \alpha)^{m-1}$, where $m$ is the number of means being compared. The Duncan multiple range test is a noted ancestor of modern multiple comparison procedures, but because $\alpha_{\mathcal{F}}$ grows with $m$, the test violates a basic tenet of multiple comparisons: the control of $\alpha_{\mathcal{F}}$ despite a large number of comparisons (see Ref. 12, p. 87-89).
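
To make the growth of the Duncan family error rate concrete, the short Python sketch below evaluates $1 - (1 - \alpha)^{m-1}$ for a few values of $m$. The function name is hypothetical and the choice $\alpha = 0.05$ is an assumption made only for illustration.

```python
def duncan_family_error_rate(alpha, m):
    """Family error rate 1 - (1 - alpha)**(m - 1) when m means are compared."""
    return 1.0 - (1.0 - alpha) ** (m - 1)


if __name__ == "__main__":
    for m in (2, 3, 5, 10):
        print(f"m = {m:2d}   family error rate = {duncan_family_error_rate(0.05, m):.3f}")
```

For $m = 2$ the rate equals the single comparison error rate of 0.05; for $m = 10$ it is about 0.37.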

The example. To make inferences about the equality of two population means, $\mu_\ell$ and $\mu_\phi$, the Newman-Keuls procedure uses the critical difference $\Delta\bar{y}^*_m$, defined as
$$\Delta\bar{y}^*_m = q^{\alpha_{\mathcal{F}}}_{m,\nu} \cdot \mathrm{SE}\{\bar{y}\} \tag{5}$$
In Eq. 5, the coefficient $q^{\alpha_{\mathcal{F}}}_{m,\nu}$ is the $100[1 - \alpha_{\mathcal{F}}]$th percentile from a studentized range distribution with $m$ means and $\nu$ degrees of freedom, and $\mathrm{SE}\{\bar{y}\}$ is the standard error of the sample mean. Using the pooled sample variance $s^2 = 6{,}883$ (see Table 3), the standard error of the sample mean is estimated as
$$\mathrm{SE}\{\bar{y}\} = s/\sqrt{n} = 83/\sqrt{10} = 26.2$$
Suppose we define $\alpha_{\mathcal{F}} = 0.05$. In this simulated experiment, there are $\nu = 90$ degrees of freedom (see Table 3). Because there can be groups of $m = 2, 3, \ldots, 10$ consecutive sample means, there are nine critical differences to be calculated using Eq. 5 (Table 4).
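
The ingredients of Eq. 5 can be reproduced with a short Python sketch. The function names are hypothetical, and the degrees of freedom are computed as $r(n - 1)$ on the assumption of $r = 10$ groups of $n = 10$ observations each, which matches the value of 90 quoted above. The studentized range percentile is left as an input: it has to be looked up in tables or statistical software and is not computed here.

```python
import math


def standard_error_of_mean(pooled_variance, n):
    """Standard error of a sample mean, s / sqrt(n), from the pooled variance s^2."""
    return math.sqrt(pooled_variance / n)


def error_degrees_of_freedom(r, n):
    """Degrees of freedom for the pooled variance of r groups of n observations."""
    return r * (n - 1)


def critical_difference(q, se):
    """Eq. 5: a studentized range percentile q (supplied by the caller) times SE."""
    return q * se


if __name__ == "__main__":
    se = standard_error_of_mean(6883, 10)
    print(f"SE of the sample mean = {se:.1f}")
    print(f"degrees of freedom    = {error_degrees_of_freedom(10, 10)}")
```

Working from the unrounded standard deviation rather than the rounded value of 83 gives the same standard error to one decimal place.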

                              
Table 4.   Critical differences for the Newman-Keuls procedure

A simple graphical technique can communicate the inferences based on these critical differences. First, we list the sample means in ascending order (see Table 3)
$$\begin{array}{l|cccccccccc} \text{Group } j & 2 & 8 & 7 & 4 & 1 & 3 & 6 & 10 & 5 & 9 \\ \hline \bar{y}_j & 291 & 370 & 373 & 404 & 474 & 487 & 503 & 612 & 632 & 770 \end{array}$$
Then, for each group of $m$ consecutive means, progressing from largest to smallest $m$, we compare the magnitude of the $m$-mean range, $\bar{y}_\phi - \bar{y}_\ell$, to its corresponding critical difference $\Delta\bar{y}^*_m$. If
$$\bar{y}_\phi - \bar{y}_\ell \le \Delta\bar{y}^*_m$$
then we underline the group of $m$ means: we are unable to discriminate among them. If
$$\bar{y}_\phi - \bar{y}_\ell > \Delta\bar{y}^*_m$$
then we draw no line: we have identified at least one difference. At the end of this process, it is only those means that remain unconnected that we can discriminate statistically.

To illustrate this technique, we begin with $m = 10$. The initial step is
$$770 - 291 = 479 > 120, \text{ draw no line}$$
In fact, for $m = 9, 8, \ldots, 4$, $\bar{y}_\phi - \bar{y}_\ell > \Delta\bar{y}^*_m$; therefore, draw no lines.

The next step is to evaluate groups of $m = 3$ consecutive means
$$770 - 612 = 158 > 88, \text{ draw no line;}$$
$$632 - 503 = 129 > 88, \text{ draw no line;}$$
$$612 - 487 = 125 > 88, \text{ draw no line;}$$
$$503 - 474 = 29 < 88, \text{ underline;}$$
$$487 - 404 = 83 < 88, \text{ underline;}$$
$$474 - 373 = 101 > 88, \text{ draw no line;}$$
$$404 - 370 = 34 < 88, \text{ underline;}$$
$$373 - 291 = 82 < 88, \text{ underline}$$
The final step is to evaluate pairs ($m = 2$) of adjacent means
$$770 - 632 = 138 > 74, \text{ draw no line;}$$
$$632 - 612 = 20 < 74, \text{ underline;}$$
$$612 - 503 = 109 > 74, \text{ draw no line}$$
At this point, we can stop: all remaining pairs of consecutive means were underlined in the preceding step, when $m = 3$.
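
The underlining steps above can be sketched in Python. This is an illustrative reading of the technique, not the author's code, and the function name is hypothetical. It uses only the three critical differences quoted in the text (120 for $m = 10$, 88 for $m = 3$, 74 for $m = 2$). The values for $m = 4$ to 9 belong to Table 4, which is not reproduced here; they are omitted on the strength of the statement that every range at those sizes exceeded its critical difference.

```python
SORTED_MEANS = [291, 370, 373, 404, 474, 487, 503, 612, 632, 770]


def newman_keuls_underlines(sorted_means, critical_by_m):
    """Return (smallest, largest) mean for every group of means that gets underlined.

    critical_by_m maps a group size m to its critical difference. Group sizes
    are visited from largest to smallest, and a group that lies inside an
    already underlined group is not tested again.
    """
    underlined = []  # inclusive (start, end) index pairs into sorted_means
    for m in sorted(critical_by_m, reverse=True):
        for start in range(len(sorted_means) - m + 1):
            end = start + m - 1
            if any(s <= start and end <= e for s, e in underlined):
                continue  # already connected by a longer underline
            if sorted_means[end] - sorted_means[start] <= critical_by_m[m]:
                underlined.append((start, end))
    return [(sorted_means[s], sorted_means[e]) for s, e in underlined]


if __name__ == "__main__":
    quoted_critical_differences = {10: 120, 3: 88, 2: 74}
    for low, high in newman_keuls_underlines(SORTED_MEANS, quoted_critical_differences):
        print(f"underline from {low} to {high}")
```

With these inputs the sketch underlines four groups of three means (291 to 373, 370 to 404, 404 to 487, 474 to 503) and one pair (612 to 632), and it tests only the three pairs evaluated in the final step above.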

The Newman-Keuls procedure leads to these conclusions about the 10 sample means
$$\begin{array}{l|cccccccccc} \text{Group } j & 2 & 8 & 7 & 4 & 1 & 3 & 6 & 10 & 5 & 9 \\ \hline \bar{y}_j & 291 & 370 & 373 & 404 & 474 & 487 & 503 & 612 & 632 & 770 \end{array}$$
The underlines that result from the steps above connect groups 2, 8, and 7; groups 8, 7, and 4; groups 4, 1, and 3; groups 1, 3, and 6; and groups 10 and 5.
These are examples of inferences based on this data graphic: $\mu_2$ resembles $\mu_8$ and $\mu_7$ but differs from $\mu_4, \mu_1, \ldots, \mu_9$; and $\mu_9$ differs from all other means. Table 5 lists the inferences for the 16 preplanned group comparisons.

                              
Table 5.   Statistical inferences based on preplanned group comparisons

Practical considerations. The Newman-Keuls procedure evaluates all $r(r - 1)/2$ paired comparisons among $r$ sample means from a balanced design. The test assumes the $r$ means are independent and are based on identical numbers of observations (Ref. 12, p. 86). When it compares more than three means, the Newman-Keuls procedure no longer caps the family error rate $\alpha_{\mathcal{F}}$ at $\alpha$; instead, $\alpha_{\mathcal{F}} > \alpha$ (Ref. 8, p. 127). For this reason, the Newman-Keuls procedure is of limited value for multiple comparisons.
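
The number of paired comparisons, $r(r - 1)/2$, grows quickly with the number of groups. As a small illustration (the helper name `paired_comparisons` is hypothetical), the Python sketch below enumerates the pairs with the standard library.

```python
from itertools import combinations


def paired_comparisons(groups):
    """List every unordered pair of groups: r(r - 1)/2 pairs for r groups."""
    return list(combinations(groups, 2))


if __name__ == "__main__":
    for r in (3, 5, 10):
        print(f"r = {r:2d}   paired comparisons = {len(paired_comparisons(range(1, r + 1)))}")
```

For the 10 groups of the simulated experiment there are 45 possible pairs, of which the review examines 16 preplanned comparisons.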


    BONFERRONI PROCEDURE

The Bonferroni inequality is a probability inequality that does control the family error rate $\alpha_{\mathcal{F}}$. For a family of $k$ comparisons, the Bonferroni inequality defines the upper bound of the family error rate to be
$$\alpha_{\mathcal{F}} = 1 - (1 - \alpha)^k \le k \cdot \alpha$$
where $\alpha$ is the error rate for each comparison. In other words, the inequality assigns an error rate of $\alpha_{\mathcal{F}}/k$ to each comparison within the family. Because $\alpha$ can vary among comparisons, the general expression for the family error rate is
$$\alpha_{\mathcal{F}} = \alpha_1 + \alpha_2 + \cdots + \alpha_k$$

The example. To make inferences about the equality of two population means, $\mu_\ell$ and $\mu_\phi$, the Bonferroni procedure relies on the critical difference $\Delta\bar{y}^*$, defined as
$$\Delta\bar{y}^* = t_{\alpha/2,\nu} \cdot \mathrm{SE}\{\bar{y}_\phi - \bar{y}_\ell\} \tag{6}$$
In Eq. 6, the coefficient $t_{\alpha/2,\nu}$ is the $100[1 - (\alpha/2)]$th percentile from a $t$ distribution with $\nu$ degrees of freedom, and $\mathrm{SE}\{\bar{y}_\phi - \bar{y}_\ell\}$ is the standard error of the difference between the sample means.

If we define $\alpha_{\mathcal{F}} = 0.05$, then for each of the 16 preplanned comparisons listed in Table 5
$$\alpha = \alpha_{\mathcal{F}}/k = 0.05/16 = 0.0031$$
Therefore, because there are $\nu = 90$ degrees of freedom (see Table 3), $t_{\alpha/2,\nu} = 3.04$. Using the pooled sample variance $s^2 = 6{,}883$, the standard error of the difference between sample means is estimated as
$$\mathrm{SE}\{\bar{y}_\phi - \bar{y}_\ell\} = \sqrt{(s^2 + s^2)/n} = 37.1 \tag{7}$$
By virtue of Eq. 6, the resulting critical difference for the Bonferroni procedure is
$$\Delta\bar{y}^* = 3.04 \times 37.1 = 113$$
Therefore, the Bonferroni procedure leads to these conclusions about the 10 sample means
$$\begin{array}{l|cccccccccc} \text{Group } j & 2 & 8 & 7 & 4 & 1 & 3 & 6 & 10 & 5 & 9 \\ \hline \bar{y}_j & 291 & 370 & 373 & 404 & 474 & 487 & 503 & 612 & 632 & 770 \end{array}$$
Table 5 lists the resulting inferences for the 16 preplanned group comparisons.
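
The arithmetic behind the Bonferroni critical difference can be retraced with the Python sketch below. The function names are hypothetical. The $t$ percentile of 3.04 is taken from the text as given rather than computed, because the standard library has no $t$ distribution.

```python
import math


def bonferroni_alpha(family_alpha, k):
    """Single comparison error rate that splits the family rate evenly over k comparisons."""
    return family_alpha / k


def se_of_difference(pooled_variance, n):
    """Eq. 7: standard error of the difference between two means of n observations each."""
    return math.sqrt((pooled_variance + pooled_variance) / n)


def bonferroni_critical_difference(t_percentile, se):
    """Eq. 6: t percentile (supplied by the caller) times the standard error."""
    return t_percentile * se


if __name__ == "__main__":
    alpha = bonferroni_alpha(0.05, 16)
    se = se_of_difference(6883, 10)
    delta = bonferroni_critical_difference(3.04, se)  # 3.04 is quoted in the text
    print(f"alpha = {alpha:.4f}, SE = {se:.1f}, critical difference = {delta:.0f}")
```

The sketch returns a single comparison error rate of about 0.0031, a standard error of 37.1, and a critical difference that rounds to 113, in agreement with the values above.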

Practical considerations. Although it is not a multiple comparison procedure per se, the Bonferroni inequality can be used for multiple comparison problems. The technique is valid regardless of whether the $r$ sample means are independent or correlated (Ref. 12, p. 67). The Bonferroni inequality is appealing because it is versatile and simple. Unfortunately, its appeal is diminished by the strict protection of the single comparison error rate $\alpha$. As a consequence, the Bonferroni inequality is conservative: it will be unable to detect some of the actual differences among a family of $k$ comparisons (see Table 5).


    LEAST SIGNIFICANT DIFFERENCE PROCEDURE

The least significant difference (LSD) procedure, developed by Sir R. A. Fisher, preceded the Newman-Keuls multiple range test. Like the Newman-Keuls test, the LSD procedure compares the underlying population means of $r$ experimental groups (see Eq. 4), and it sets the family error rate $\alpha_{\mathcal{F}}$ at the single comparison error rate $\alpha$.

The example. To make inferences about the equality of two population means, $\mu_\ell$ and $\mu_\phi$, the LSD procedure uses the critical difference $\Delta\bar{y}^*$, defined as
$$\Delta\bar{y}^* = t_{\alpha_{\mathcal{F}}/2,\nu} \cdot \mathrm{SE}\{\bar{y}_\phi - \bar{y}_\ell\} \tag{8}$$
In Eq. 8, the coefficient $t_{\alpha_{\mathcal{F}}/2,\nu}$ is the $100[1 - (\alpha_{\mathcal{F}}/2)]$th percentile from a $t$ distribution with $\nu$ degrees of freedom, and $\mathrm{SE}\{\bar{y}_\phi - \bar{y}_\ell\}$ is the standard error of the difference between the sample means.⁶

If we define $\alpha_{\mathcal{F}} = 0.05$, then because there are $\nu = 90$ degrees of freedom (see Table 3), $t_{\alpha_{\mathcal{F}}/2,\nu} = 1.99$. As shown in Eq. 7, $\mathrm{SE}\{\bar{y}_\phi - \bar{y}_\ell\} = 37.1$. Therefore, by virtue of Eq. 8, the resulting critical difference for the LSD procedure is
$$\Delta\bar{y}^* = 1.99 \times 37.1 = 74$$
The LSD procedure leads to these conclusions about the 10 sample means
$$\begin{array}{l|cccccccccc} \text{Group } j & 2 & 8 & 7 & 4 & 1 & 3 & 6 & 10 & 5 & 9 \\ \hline \bar{y}_j & 291 & 370 & 373 & 404 & 474 & 487 & 503 & 612 & 632 & 770 \end{array}$$
Table 5 lists the resulting inferences for the 16 preplanned group comparisons.
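
Because the LSD procedure applies one critical difference to every pair, the comparison of sample means is easy to sketch in Python. The function name is hypothetical, and the sketch screens all 45 possible pairs rather than the 16 preplanned comparisons of Table 5, so it illustrates the decision rule and is not a reproduction of that table.

```python
from itertools import combinations

# Sample means by group number, as listed in ascending order in the text.
SAMPLE_MEANS = {2: 291, 8: 370, 7: 373, 4: 404, 1: 474,
                3: 487, 6: 503, 10: 612, 5: 632, 9: 770}


def pairs_within_critical_difference(means_by_group, critical_difference):
    """Return the pairs of groups whose sample means differ by no more than the critical difference."""
    ordered = sorted(means_by_group.items(), key=lambda item: item[1])
    return [
        (group_a, group_b)
        for (group_a, mean_a), (group_b, mean_b) in combinations(ordered, 2)
        if abs(mean_b - mean_a) <= critical_difference
    ]


if __name__ == "__main__":
    for group_a, group_b in pairs_within_critical_difference(SAMPLE_MEANS, 74):
        print(f"groups {group_a} and {group_b}: no statistical evidence of a difference")
```

With the LSD critical difference of 74, the sketch flags 8 of the 45 pairs as indistinguishable; the remaining 37 pairs exceed the critical difference.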

Practical considerations. The LSD procedure evaluates all $r(r - 1)/2$ paired comparisons among $r$ sample means. In its protected form, the procedure is done only if a preliminary analysis of variance is statistically significant (18). When it compares more than three means, the LSD procedure fails to maintain the family error rate $\alpha_{\mathcal{F}}$ at $\alpha$ (Ref. 8, p. 139). The solution to this problem is to replace $t_{\alpha_{\mathcal{F}}/2,\nu}$ in Eq. 8 with a percentile from a studentized range distribution: $q^{\alpha_{\mathcal{F}}}_{r-1,\nu}$ (Ref. 8, p. 139) or $q^{\alpha_{\mathcal{F}}}_{r,\nu}$ (Ref. 12, p. 92).⁷


    FALSE DISCOVERY RATE PROCEDURE: A RECENT DEVELOPMENT

In most experiments, scientists strive to make a discovery: to reject a null hypothesis. When an experiment involves a family of $k$ comparisons, a scientist is more likely to make a mistaken discovery. The false discovery rate procedure⁸ is a promising solution to the problem of multiple comparisons. This procedure controls not the family error rate $\alpha_{\mathcal{F}}$ but the false discovery rate $f_{\mathcal{F}}$, the expected fraction of null hypotheses rejected mistakenly
$$f_{\mathcal{F}} = \frac{\text{number of mistaken } H_0 \text{ rejections}}{\text{total number of } H_0 \text{ rejections}}$$
If all $k$ null hypotheses are true,⁹ then $f_{\mathcal{F}} = \alpha_{\mathcal{F}}$; if at least one null hypothesis is not true, then $f_{\mathcal{F}} \le \alpha_{\mathcal{F}}$ (3). When we define the family error rate $\alpha_{\mathcal{F}}$, we also set an upper bound on the false discovery rate $f_{\mathcal{F}}$. But if we control $f_{\mathcal{F}}$ rather than $\alpha_{\mathcal{F}}$, we gain statistical power, the ability to detect an experimental effect if it exists (3, 4, 22).

The example. Unlike the preceding methods, the false discovery rate procedure operates on achieved significance levels (P values) to make inferences about a family of k comparisons. Let Pi represent the significance level associated with comparison i. To execute this procedure, we must complete three steps:

Step 1.  Order the $k$ comparisons by decreasing magnitude of $P_i$.

Step 2.  For $i = k, k - 1, \ldots, 1$, calculate the critical significance level $d_i^*$ as
$$d_i^* = (i/k) \cdot f_{\mathcal{F}} \tag{9}$$
Step 3.  If $P_i \le d_i^*$, then reject the null hypotheses associated with the remaining $i$ comparisons.¹⁰
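
The three steps can be expressed as a short Python sketch. It is an illustration under stated assumptions, not the author's implementation: the function name is hypothetical, and the P values in the demonstration are invented for the example; they are not the values of Table 6.

```python
def false_discovery_rate_rejections(p_values, fdr=0.05):
    """Return the indices (into p_values) of the null hypotheses that are rejected.

    Step 1 orders the comparisons by decreasing P, so the largest P has rank i = k.
    Step 2 computes the critical level d_i = (i / k) * fdr for each rank (Eq. 9).
    Step 3 stops at the first rank with P_i <= d_i and rejects that comparison
    together with every comparison that has a smaller P.
    """
    k = len(p_values)
    order = sorted(range(k), key=lambda idx: p_values[idx], reverse=True)
    for position, idx in enumerate(order):
        i = k - position  # rank of this comparison, counting down from k
        critical_level = (i / k) * fdr
        if p_values[idx] <= critical_level:
            return sorted(order[position:])
    return []


if __name__ == "__main__":
    # Hypothetical P values, chosen only to exercise the procedure.
    example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
    print(false_discovery_rate_rejections(example, fdr=0.05))
```

For these invented values the procedure stops at rank $i = 2$, where $0.008 \le (2/8)(0.05) = 0.0125$, and rejects the two null hypotheses with the smallest P values.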

In the simulation, we selected $k = 16$ comparisons of interest. For each comparison, we evaluate the null hypothesis $H_0\colon \mu_\ell = \mu_\phi$ by doing a $t$ test. The $P$ values associated with the resulting $t$ statistics vary from 0.723 to less than 0.001 (Table 6). If we define the false discovery rate $f_{\mathcal{F}} = 0.05$, the magnitude of the family error rate $\alpha_{\mathcal{F}}$ we have been using, then the critical significance level $d_i^*$ varies from 0.050 to 0.003. In step 3, we declare comparisons 1-14 to be statistically significant (see Table 6). Table 5 lists the inferences for all 16 comparisons.

                              
Table 6.   Calculations for the false discovery rate procedure

Practical considerations. Because the false discovery rate procedure operates on actual P values, it is quite versatile. For example, the procedure can be employed when a family of k comparisons involves different test statistics such as Student t and Wilcoxon signed rank statistics (3, 4). The false discovery rate procedure is valid when the k comparisons are independent (a sample mean is part of only one comparison) or correlated (a sample mean is part of more than one comparison, as in the example) (3, 4, 22).

The false discovery rate procedure has two important benefits. First, it allows us to make an inference, with 100[1 - (fℱ / 2)]% confidence, about the direction of a statistical difference (4, 22). For example, because fℱ = 0.05, we can declare, with 97.5% confidence, that µ2 < µ8 (see Table 6). This is a stronger inference than the simple declaration µ2 ≠ µ8 (Ref. 8, p. 27-39). Second, the statistical results for a set of primary comparisons are largely consistent despite substantial changes in the number of secondary comparisons included within the family (22).


    SUMMARY

We dare not seek a single multiple comparison procedure for all experiments.

Adapted from John W. Tukey (1994)

This remark, written by a pioneer in the area of multiple comparisons, reflects the range of multiple comparison problems that manifest themselves in scientific research. Over the last 50-60 years, statisticians have explored numerous approaches in an effort to address these problems (8, 12). In physiology, as in other disciplines, experiments that involve problems of multiple comparisons are common.

In this review, I have shown that, as researchers, we are more likely to reject a true null hypothesis if we fail to use a multiple comparison procedure when we analyze a family of comparisons. I have also illustrated the three procedures cited most often in APS journals: Newman-Keuls, Bonferroni, and LSD. Unfortunately, each of these is of limited value. In many experimental situations, the Newman-Keuls and LSD procedures fail to control the family error rate, the probability that we reject at least one true null hypothesis. In contrast, the Bonferroni inequality is overly conservative: it fails to detect some of the actual differences that exist within the family.

Finally, I have reviewed the false discovery rate: a versatile, simple, and powerful approach to multiple comparisons. As Tukey suggests, it is perhaps unrealistic to expect that a single multiple comparison procedure will suffice for all situations: a statistical procedure designed specifically for a particular experimental situation will perform better than a general procedure. Nevertheless, there is growing evidence (4, 22) that the false discovery rate procedure may be the best practical solution to the problems of multiple comparisons that exist within science.


    APPENDIX

For all but one of the multiple comparison procedures listed in Table 1, an important assumption is that the r experimental groups are independent (12).11 In many studies that use these multiple comparison procedures, however, the r groups are not independent. This happens because investigators make repeated observations on each subject: these observations are correlated by virtue of individual biological makeup (9). Therefore, the true error variability is underestimated, and the observed values for the standard deviations of the group means underestimate the true variabilities (9).

To appreciate the impact of correlation on variability, imagine an investigation in a sample of n subjects. In each subject, some random variable X is measured during two experimental conditions: a control period and a subsequent intervention period. Let the random variable measured during the control period be designated X1 and that during the intervention period be designated X2. Assume that X1 and X2 are distributed normally