Dear All
Could anyone tell me the distinction between the multiple comparison
tests - Tukey's and Student-Newman-Keuls?
How is Tukey's test better, and why? (It seems that it is.)
Regards
Sulagna
Hi Sulagna,
The Student-Newman-Keuls test has more power than Tukey's test.
Having said that, it does not control the Type I error rate (i.e. the
rate can be greater than 5%), whereas with Tukey's test you can
adjust for multiple comparisons. Also, the Student-Newman-Keuls test
does not generate 95% CIs for each difference.
I would suggest using Tukey's test.
Best regards,
Nav
--
Nav Randhawa, MSc
Biostatistician, Pharmacokinetics and Statistics
Biovail Contract Research
Email: navdeep.randhawa.aaa.biovail.com
URL: www.biovail-cro.com
The multiple comparison tests (of means) you cite are brought into
play when you want to compare all pairs of means: the Tukey and
Student-Newman-Keuls tests are related and report identical results
when comparing the largest mean with the smallest. For other
comparisons, Tukey's method is more conservative, but may miss real
differences too often. The Student-Newman-Keuls method, on the other
hand, is more discriminating, but may mistakenly find differences too
often. There may not be general agreement about which of the two
tests (of the means) is the better one.
Perhaps the best approach is to test both ways and report the results
with an interpretive commentary appropriate for the science under
consideration.
Hope above helps,
Angus McLean Ph.D,
8125 Langport Terrace,
Suite 100,
Gaithersburg,
MD 20877
tel 301-869-1009
fax 301-869-5737
BioPharm Global Inc.
Dear Sulagna,
The Newman-Keuls test has more power. This means it can find that a
difference between two groups is 'statistically significant' in some
cases where the Tukey test would conclude that the difference is 'not
statistically significant'. But this extra power comes at a price.
Although the whole point of multiple comparison post tests is to keep
the chance of a Type I error in any comparison to 5%, in fact the
Newman-Keuls test doesn't do this. In some cases, the chance of a
Type I error can be greater than 5%. Another problem is that, because
the Newman-Keuls test works in a sequential fashion, it cannot
produce 95% confidence intervals for each difference. Because the
Newman-Keuls test doesn't control the error rate and doesn't generate
confidence intervals, the Tukey test is better.
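[To see these properties concretely, here is a minimal sketch in
Python using the pairwise_tukeyhsd function from statsmodels; the
three groups and their data are hypothetical. Tukey's procedure
reports an adjusted p value and a 95% confidence interval for every
pairwise difference, which is what the sequential Newman-Keuls
procedure cannot provide.]

```python
# Minimal sketch: Tukey's HSD on three hypothetical groups.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# Hypothetical data: three treatment groups, 10 observations each.
values = np.concatenate([
    rng.normal(80, 10, 10),    # group A
    rng.normal(90, 10, 10),    # group B
    rng.normal(100, 10, 10),   # group C
])
groups = ["A"] * 10 + ["B"] * 10 + ["C"] * 10

# Tukey's test controls the familywise Type I error rate across all
# pairwise comparisons and reports a 95% CI for each difference.
result = pairwise_tukeyhsd(endog=values, groups=groups, alpha=0.05)
print(result.summary())
```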
The following message was posted to: PharmPK
Dear Angus,
You wrote:
> Tukey's method is more conservative, but may miss real
> differences too often.
This sounds quite strange. How do you know whether you have missed a
'real difference' (and, by the way, what is a 'real difference')?
> Perhaps the best approach is to test both ways and report the results
> with an interpretive commentary appropriate for the science under
> consideration.
This also sounds strange. Using two different tests is considered bad
practice in statistics. The appropriate statistical test must be
selected before the experiment is performed, or at least before the
data are known.
Others have pointed out that the Newman-Keuls test does not restrict
the Type I error to the chosen alpha (usually 0.05). This indeed
means that it may mistakenly conclude 'too often' that there is a
significant difference. In my opinion this also means that the
Newman-Keuls test is not an appropriate test.
Any comments?
Johannes H. Proost
Dept. of Pharmacokinetics and Drug Delivery
University Centre for Pharmacy
Antonius Deusinglaan 1
9713 AV Groningen, The Netherlands
tel. 31-50 363 3292
fax 31-50 363 3247
Email: j.h.proost.aaa.rug.nl
To Johannes H. Proost:
From a regulatory perspective one should definitely define a priori,
in a clinical or preclinical protocol, the statistical test and the
criteria for a difference one will use, and be able to justify them:
it is not at all good practice to apply multiple tests one after
another. That is for sure, since it makes it look as though one is
searching for a test that allows the data to be interpreted "the way
you want to." In other words, you are selecting a statistical test
that supports an argument you have already decided to make.
On the other hand, for inspection of preliminary experimental data I
have not seen a lot wrong with exploring different statistical
options and discussing them up front. Perhaps at that point one would
justify selecting a statistical test for future studies. The
commentary provided by yourself and others in the discussion group on
Newman-Keuls versus Tukey is most helpful information and could be
referred to at that point; it is generally in favor of Tukey.
One wonders whether there are any instances, depending on the type of
data you have and the comparison you are making (where you have other
relevant information available), in which it would be more
appropriate, despite the limitations described, to use the more
discriminating test, i.e. Newman-Keuls; or should our position for
multiple comparisons be to discard the Newman-Keuls test completely
in favor of the Tukey test?
Best Regards,
Angus McLean Ph.D,
8125 Langport Terrace,
Suite 100,
Gaithersburg,
MD 20877
tel 301-869-1009
fax 301-869-5737
The following message was posted to: PharmPK
Dear Angus,
Thank you for your reply. I agree with your view on statistics for
the inspection of preliminary experimental data, i.e. this is indeed
the usual way, I presume.
But my comment was also intended to point to this practice, which is
at the least a dangerous one. The results of such statistical testing
are no more than 'preliminary statistical data', which is almost a
'contradictio in terminis'. We apply statistics to make clear
decisions; they are always arbitrary (e.g. alpha = 0.05 is completely
arbitrary), but should not be subjective. There is no such thing as
'tends to be statistically different' or 'likely to be statistically
different'.
Please note that I am not an expert in statistics, and I did not
provide the information on the Newman-Keuls and Tukey tests, but I
trust this information was correct. So I would like to hear the
answer to your final question from the experts.
Johannes H. Proost
Dept. of Pharmacokinetics and Drug Delivery
University Centre for Pharmacy
Antonius Deusinglaan 1
9713 AV Groningen, The Netherlands
tel. 31-50 363 3292
fax 31-50 363 3247
Email: j.h.proost.at.rug.nl
Dear Johannes H. Proost,
I am not a statistical expert either; I put the question reproduced
below to Harvey Motulsky, MD, president of GraphPad Software in San
Diego.
Again, my question relating to the optional multiple comparison tests
is:
"One wonders whether there are any instances, depending on the type
of data you have and the comparison you are making (where you have
other relevant information available), in which it would be more
appropriate, despite the limitations described, to use the more
discriminating test, i.e. Newman-Keuls; or should our position for
multiple comparisons be to discard the Newman-Keuls test completely
in favor of the Tukey test?"
Harvey referred me to his Web site, and I reproduce the relevant
entry here:
" How do I decide between the Tukey and Newman-Keuls multiple
comparison test? FAQ# 1093
Both the Tukey test (also called Tukey-Kramer test) and the
Newman-Keuls (also called Student-Newman-Keuls test) are used to
compare all pairs of means following one-way ANOVA. Although these
are called post tests, they can be performed regardless of the
results of the overall ANOVA results.
The Newman-Keuls test has more power. This means it can find that
a difference between two groups is 'statistically significant' in
some cases where the Tukey test would conclude that the difference is
'not statistically significant'. But this extra power comes at a
price. Although the whole point of multiple comparison post tests is
to keep the chance of a Type I error in any comparison to be 5%, in
fact the Newman-Keuls test doesn't do this1. In some cases, the
chance of a Type I error can be greater than 5%. Another problem is
because the Newman-Keuls test works in a sequential fashion, it can
not produce 95% confidence intervals for each difference.
Because the Newman-Keuls test has two strikes against it
(doesn't control error rate, doesn't generate confidence intervals)
we recommend that you use the Tukey test instead.
1 MA Seaman, JR LEvin and RC Serlin, Psychological Bulletin
110:577-586, 1991."
Additionally, Harvey wrote to me personally on this topic of multiple
comparison tests (see below):
"My understanding (really just repeating what I've read, and not an
independent judgement) is that Newman-Keuls belongs in the same
category as Fisher's LSD and Duncan's: of historical interest, and
not to be used.
So what should you use? If you want both significance testing and
confidence intervals, then I don't think anything beats Tukey's (all
comparisons) or Dunnett's (all vs. control). If you just want
significance testing and don't want or need confidence intervals,
then I believe that Holm's test is what you want. I do plan to add
this to Prism 5, but haven't done so yet. It isn't hard to do by
hand. Glantz's "Primer of Biostatistics" explains it well."
Perhaps this question should be posted to a statistical discussion
group.
Hope above helps,
Best Regards,
Angus McLean Ph.D,
8125 Langport Terrace,
Suite 100,
Gaithersburg,
MD 20877
tel 301-869-1009
fax 301-869-5737
[The Graph Pad Prism URL http://www.graphpad.com/prism/Prism.htm and
the link to the FAQ http://www.graphpad.com/faq/viewfaq.cfm?faq=1093
- db]
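[Holm's test, mentioned above as easy to do by hand, is also
available in standard software; here is a minimal sketch in Python
using multipletests from statsmodels, with made-up p values. By hand,
one sorts the m raw p values, compares the smallest to alpha/m, the
next to alpha/(m-1), and so on, stopping at the first non-rejection.]

```python
# Minimal sketch: Holm's step-down adjustment of raw p values.
from statsmodels.stats.multitest import multipletests

# Hypothetical unadjusted p values from three pairwise comparisons.
raw_p = [0.010, 0.020, 0.200]

reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
for p, pa, r in zip(raw_p, adj_p, reject):
    print(f"raw p = {p:.3f}  Holm-adjusted p = {pa:.3f}  reject: {r}")
```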
The following message was posted to: PharmPK
In response to Dr. Proost's comment that: "We apply statistics to make
clear decisions; they are always arbitrary (e.g. alpha = 0.05 is
completely arbitrary)"
A p value threshold is often chosen based on what is typically done,
without much thought about why.
The decision to select a p value should not be "completely
arbitrary"; it should be based on several considerations:
1. A p value of 0.05 is commonly used as an acceptable risk that the
experimental differences arose by chance alone. An alpha of 0.05
implies that if one did the same study 20 times when the null
hypothesis is true, only once, on average, would the difference be
attributable to chance; so one can fairly reject the null hypothesis
with a probability of 0.05 of falsely rejecting it when it is true
(Type I error). One can also determine the probability of accepting
the null hypothesis when it is false (Type II error, beta), and
consequently the probability of not making a Type II error (1 - beta,
the power).
2. Statistical power, 1 - beta, determines how many subjects are
needed. The number of subjects needed for adequate power can be
calculated given the expected means and standard deviations; the
larger the mean difference, the greater the power. Power is the
probability of detecting a true difference, and it should be at least
0.80. (A sketch of such a calculation follows this list.)
3. One should also know whether a one-tailed or a two-tailed test is
appropriate. If one knows in which direction the effect should occur,
then a one-tailed test requires fewer subjects for the same
significance level and power.
4. How important is the outcome? For example, in determining the
effectiveness of a new anesthetic, which carries the possibility of
an adverse outcome (death), one would want the p value to be much
smaller than in determining the preferred color of a new medication.
5. Multiple statistical comparisons on the same data set carry the
penalty of correcting for the number of tests (e.g. a Bonferroni
correction). This is an acceptable procedure for pilot studies, whose
findings should be confirmed with another study.
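[A minimal sketch of the calculations in points 2 and 3, in Python
with statsmodels' TTestIndPower; the mean difference and standard
deviation below are hypothetical. It solves for the per-group sample
size giving 80% power at alpha = 0.05, two-sided and one-sided.]

```python
# Minimal sketch: sample size for a two-sample t-test at 80% power.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Hypothetical design: detect a mean difference of 20 with SD 25,
# i.e. a standardized effect size (Cohen's d) of 0.8.
effect_size = 20 / 25

n_two_sided = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                                   power=0.80, alternative="two-sided")
n_one_sided = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                                   power=0.80, alternative="larger")
print(f"two-sided: {n_two_sided:.1f} subjects per group")
print(f"one-sided: {n_one_sided:.1f} subjects per group")
# As point 3 notes, the one-sided design needs fewer subjects.
```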
James D. Prah, PhD
US EPA
Human Studies Division MD (58B)
Research Triangle Park, NC, 27711
919 966 6244
919 966 6367 FAX
I've noticed in all the previous discussions that Duncan's Multiple
Range Test (or Cramer's modified Duncan's Multiple Range Test) no
longer seems to be in vogue. At the risk of showing my advancing age,
would someone please tell me the current status of this particular
test? At one time it was very popular (it is demonstrated in
Federer's Experimental Design), and I even have a paper somewhere
showing that this test was the 'best' (however that was defined)
after an extensive set of computer simulations involving all the
multiple range tests. Obviously something happened to it along the
way. If it won't take up too much time, would someone please comment
on the fate of this historical multiple range test?
Thank you.
Edmond B. Edwards, Ph.D.
Hi Edmond,
In brief, Duncan's test does not adjust for multiple comparisons as
some of the other tests do. This means that each of a number of tests
is conducted at a significance level of 0.05 (alpha, the Type I error
rate), and therefore the probability of finding a significant
difference by chance alone is inflated across the family of tests
(the small simulation below illustrates this). When one uses a test
such as Tukey's, by contrast, the significance level is adjusted,
penalizing according to the number of tests being conducted, and this
in turn controls the familywise Type I error rate.
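[The inflation is easy to demonstrate by simulation; the sketch
below, with made-up parameters, draws five groups from the same
normal distribution, runs all ten pairwise t-tests, and counts how
often at least one comes out 'significant', with and without a
Bonferroni-style adjustment.]

```python
# Simulation: familywise Type I error with and without adjustment.
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
k, n, alpha, n_sims = 5, 10, 0.05, 2000
m = k * (k - 1) // 2                       # 10 pairwise comparisons
hits_raw = hits_bonf = 0
for _ in range(n_sims):
    # All k group means are equal, so any "significant" pair is a
    # false positive.
    groups = [rng.normal(0.0, 1.0, n) for _ in range(k)]
    pvals = [stats.ttest_ind(a, b).pvalue
             for a, b in combinations(groups, 2)]
    hits_raw += min(pvals) < alpha         # unadjusted threshold
    hits_bonf += min(pvals) < alpha / m    # Bonferroni-adjusted threshold
print(f"unadjusted familywise error rate: {hits_raw / n_sims:.2f}")
print(f"Bonferroni familywise error rate: {hits_bonf / n_sims:.2f}")
```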
You might find useful an article by Gerard Dallal that discusses
various procedures for multiple comparisons:
http://www.tufts.edu/~gdallal/mc.htm
Best regards,
Nav
--
Nav Randhawa, MSc
Biostatistician, Pharmacokinetics and Statistics
Biovail Contract Research
Tel: (416) 752-3636 Ext. 369
Email: navdeep.randhawa.-a-.biovail.com
URL: www.biovail-cro.com
The following message was posted to: PharmPK
Dear Angus,
Thanks for your reply. I agree with most of your comments. I
understand your view with respect to 'inspection of preliminary
experimental data', but my counter-question is: do we need a
statistical test for these data? What do we learn from a statistical
test saying that there is a statistical difference with test A, but
not with test B? This creates a grey area. We have observed a
difference in our 'preliminary experimental data', and we can draw
our (preliminary) conclusions from the order of magnitude of the
difference. But this could be concluded also from test A or test B
alone, and very likely also from inspection of the data without any
statistical test.
Statistical tests are designed to draw conclusions from complete
experimental data. I do not see any rationale for a statistical test
at this stage (but I admit that I do it myself quite often ... it is
a good thing to be aware of one's strange habits). In short, one
should use an (or the) appropriate test, and nothing more, and apply
statistical tests only to complete experimental data.
Best regards,
Hans Proost
Johannes H. Proost
Dept. of Pharmacokinetics and Drug Delivery
University Centre for Pharmacy
Antonius Deusinglaan 1
9713 AV Groningen, The Netherlands
tel. 31-50 363 3292
fax 31-50 363 3247
Email: j.h.proost.-at-.rug.nl
Dear James,
Thank you for your extensive comments. A few remarks:
ad 1: OK. But the value of 0.05 remains arbitrary. It could be 0.03,
0.06 or whatever. Intuitively 0.05 is a nice value, but it is
arbitrary. The most important thing is that almost everybody uses
0.05, so the meaning of 'significantly different' is, at least on
this point, identical in more than 99% of the (biomedical)
literature.
ad 2:
> Power is the probability of detecting a true difference, and it
> should be at least 0.80.
This is a specific demand that is used (or published) only in special
cases. Should one always do this? I guess that in the majority of
'non-significant differences' the power is less than 0.80. This
implies that no conclusion can be drawn. How often do we read in the
results: 'the difference between A and B was not significant' (with,
say, a mean value of 80 for A and 100 for B), and then in the
conclusions and abstract (and in each citation): 'there was no
difference in effect between A and B', or even 'A and B were
similar'?
ad 3:
> 3. One should also know whether a one-tailed or a two-tailed test is
> appropriate. If one knows in which direction the effect should
> occur, then a one-tailed test requires fewer subjects for the same
> significance level and power.
I am rather sceptical about one-tailed testing, unless there is
really a valid reason for it. E.g. assuming that the effect of a drug
is in the 'therapeutic direction' is not allowed, in my opinion.
Best regards,
Hans Proost
Johannes H. Proost
Dept. of Pharmacokinetics and Drug Delivery
University Centre for Pharmacy
Antonius Deusinglaan 1
9713 AV Groningen, The Netherlands
tel. 31-50 363 3292
fax 31-50 363 3247
Email: j.h.proost.-at-.rug.nl
Johannes: an example of the preliminary work I have in mind would be
a preclinical study not done under GLP conditions. The purpose of
such an up-front study is exploratory, for a number of reasons,
particularly from the point of view of establishing study conditions
[ "The best laid schemes o' Mice an' Men, gang aft agley." ] prior
to designing a larger formal study under GLP.
Often people do such studies first to evaluate the prospect of
performing a larger study under GLP conditions. Usually such a
preliminary study (at least within the limits of my experience) is
not subject to any formal statistical criteria. As such, you may not
bother performing statistics at all, and sometimes I do not. Usually
the sample size in such a study is a priori too small to allow
meaningful tests and conclusions at the p value you need. I confess
that often, despite being well aware of the limitations on drawing
conclusions, I have done some statistical testing at this stage
(sometimes altering the p value) just to satisfy my curiosity and see
what I get.
Best Regards,
Angus McLean Ph.D,
8125 Langport Terrace,
Suite 100,
Gaithersburg,
MD 20877
tel 301-869-1009
fax 301-869-5737
Good Morning Nav,
Thank you for putting this old chestnut in perspective for me. I
particularly appreciated the reference to Gerard Dallal's URL - are
you a graduate of Tufts by any chance? It looks like this man
provides a superior education in statistical analysis.
Best Wishes,
Edmond
Hi Edmond,
I am glad you found the reference useful. You might also like his
'The Little Handbook of Statistical Practice'; it's on the net.
Best regards,
Nav
[See http://www.tufts.edu/~gdallal/LHSP.HTM - db]
The following message was posted to: PharmPK
Dear Dr. Proost,
You wrote:
> ad 1: OK. But the value of 0.05 remains arbitrary. It could be 0.03,
> 0.06 or whatever. Intuitively 0.05 is a nice value, but it is
> arbitrary. The most important thing is that almost everybody uses
> 0.05, so the meaning of 'significantly different' is, at least on
> this point, identical in more than 99% of the (biomedical)
> literature.
The choice of 0.05 is made by convention and is based on the
differences between the experimental and control conditions. It
should not be an arbitrary choice. One should have knowledge of the
statistical background for applying p values to medical research and
not rely on what one regards as an arbitrary decision. While one uses
"significantly different" for p values of both 0.05 and 0.01, the two
thresholds differ by a factor of 5 (0.05/0.01 = 5) - thus the
"meaning" is very different. To say that everyone does it, and so the
rational basis for the a priori selection of a p value can be
ignored, begs the question.
You also wrote:
> ad 2:
>> Power is the probability of detecting a true difference, and it
>> should be at least 0.80.
> This is a specific demand that is used (or published) only in special
> cases. Should one always do this? I guess that in the majority of
> 'non-significant differences' the power is less than 0.80. This
> implies that no conclusion can be drawn. How often do we read in the
> results: 'the difference between A and B was not significant' (with,
> say, a mean value of 80 for A and 100 for B), and then in the
> conclusions and abstract (and in each citation): 'there was no
> difference in effect between A and B', or even 'A and B were
> similar'?
Power lends insight into how much confidence one can have in the
outcome. Most research articles do not publish the power, but they
should. In fact, prior to doing a study, the statistical plan should
be laid out with justification, including the tests, the p value, the
power, and the number of subjects needed for an adequate study.
Couple this with knowledge of the direction of the outcome, and one
can indeed apply a one-sided test to determine whether the result is
beneficial, assuming the goal is a unidirectional outcome. If one
just wants to know whether the experimental condition is different
from the control, then, of course, use a two-tailed test.
James D. Prah, PhD
US EPA
Human Studies Division MD (58B)
Research Triangle Park, NC, 27711
919 966 6244
919 966 6367 FAX
The following message was posted to: PharmPK
Dear James,
Thank you for your reply.
Ad 1: I agree that the choice of 0.05 is made by convention, but I
don't see how it can be based on the differences between the
experimental and control conditions. And I still see no argument by
which p = 0.05 can be justified firmly, i.e. not intuitively and not
by convention ('are doctors happy because 95% of their patients will
not die due to their intervention?'). I fully agree that one cannot
ignore the rational basis for selecting a p value, but I don't see
how this can be done in a non-intuitive way. And I repeat that the
convention 'p < 0.05' is not bad in this case.
ad 2: OK. But again there are some practical problems.
(a) For prior estimations of power, e.g. for sample size
determinations, one should have some reasonable estimate of
variability, which is often unknown.
(b) Determination of power requires that the true difference between
A and B that is to be detected be defined. This difference should be
based on a relevant difference between A and B, and will differ
between situations. A value of 20% is useful in many cases, but
certainly not in all. Of course one can calculate a graph of power
against the true difference between A and B (an operating
characteristic chart), but I guess this is impossible in practice.
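[Such a graph has in fact become easy to tabulate with modern
software; here is a minimal sketch in Python using statsmodels'
TTestIndPower, with a hypothetical fixed design of 20 subjects per
group and SD 25.]

```python
# Minimal sketch: power as a function of the assumed true difference.
import numpy as np
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group, sd = 20, 25.0                 # hypothetical fixed design
for diff in np.arange(5.0, 30.1, 5.0):
    power = analysis.power(effect_size=diff / sd, nobs1=n_per_group,
                           alpha=0.05, alternative="two-sided")
    print(f"true difference {diff:5.1f}  ->  power {power:.2f}")
```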
> Couple this with knowledge of the direction of the
> outcome, and one can indeed apply a one-sided test
> to determine whether the result is beneficial, assuming
> the goal is a unidirectional outcome.
As explained in my previous mail, one should be aware of
misleading conclusions when applying one-sided tests. I
agree that such knowledge can indeed be used in some
cases, but I do not agree with your remark as a general
statement.
Best regards,
Hans Proost
Johannes H. Proost
Dept. of Pharmacokinetics and Drug Delivery
University Centre for Pharmacy
Antonius Deusinglaan 1
9713 AV Groningen, The Netherlands
tel. 31-50 363 3292
fax 31-50 363 3247
Email: j.h.proost.-a-.rug.nl
The following message was posted to: PharmPK
Dear Dr. Proost and Dr. Prah,
Very interesting conversation.
I'm not a statistical expert, but please let me add one comment
regarding the significance level of 0.05 and expressions like
"significantly different". I think that finding "statistically
significant" differences is only a matter of the sample size you use,
regardless of the significance level considered. More difficult, in
my view, is establishing "relevant" differences with the classical
statistical tools. How can we take decisions if the statistical tool
we use (p value or CI) ignores technical opinion on the particular
matter (e.g. what constitutes a "relevant" decrease in blood
pressure) and is not able to "learn", or accumulate experience, from
one experiment to another? What do you think about the Bayesian
approach and the possibility of its "extensive" implementation?
Best Regards,
Jose-Antonio Cordero
Barcelona
SPAIN
Dear Dr. Proost,
My comments are embedded below.
You wrote:
> Ad 1: I agree that the choice of 0.05 is made by convention, but I
> don't see how it can be based on the differences between the
> experimental and control conditions. And I still see no argument by
> which p = 0.05 can be justified firmly, i.e. not intuitively and not
> by convention ('are doctors happy because 95% of their patients will
> not die due to their intervention?'). I fully agree that one cannot
> ignore the rational basis for selecting a p value, but I don't see
> how this can be done in a non-intuitive way. And I repeat that the
> convention 'p < 0.05' is not bad in this case.
Prah's REPLY: Intuition is defined as "the immediate knowing or
learning of something without the conscious use of reasoning". I
would hope that the choice of a p value, and other scientific
decisions, are not made on this basis. The conventional use of p
values of 0.01 and 0.05 came about before the ready availability of
computers. Now it is much easier to have a statistics program provide
an exact p value. Depending upon one's willingness to risk making
Type I or Type II errors, power calculations should also be made. If
one knows the difference one would conclude is clinically
significant, or that confirms one's hypothesis, then one can easily
determine the number of subjects needed for the study for a given
statistical power. Given the importance of statistics in ultimate
decision making in drug development, this subject should be
understood as thoroughly as one understands the drug's biological
mechanism or how one's analytical instruments are calibrated.
As an aside, a doctor might be happy if 95% of the patients don't die
if 100 percent would die without treatment.
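[For illustration, obtaining an exact p value is a one-line
calculation in any modern package; a minimal sketch in Python with
scipy, using made-up data.]

```python
# Minimal sketch: an exact p value rather than just "p < 0.05".
from scipy import stats

a = [78, 85, 92, 80, 88, 75, 90, 84]    # hypothetical control data
b = [95, 99, 88, 102, 91, 97, 93, 100]  # hypothetical treated data
t, p = stats.ttest_ind(a, b)
print(f"t = {t:.2f}, exact p = {p:.4f}")
```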
You also wrote:
> ad 2: OK. But again there are some practical problems.
> (a) For prior estimations of power, e.g. for sample size
> determinations, one should have some reasonable estimate of
> variability, which is often unknown.
> (b) Determination of power requires that the true difference between
> A and B that is to be detected be defined. This difference should be
> based on a relevant difference between A and B, and will differ
> between situations. A value of 20% is useful in many cases, but
> certainly not in all. Of course one can calculate a graph of power
> against the true difference between A and B (an operating
> characteristic chart), but I guess this is impossible in practice.
>
>> Couple this with knowledge of the direction of the
>> outcome, and one can indeed apply a one-sided test
>> to determine whether the result is beneficial, assuming
>> the goal is a unidirectional outcome.
>
> As explained in my previous mail, one should be aware of misleading
> conclusions when applying one-sided tests. I agree that such
> knowledge can indeed be used in some cases, but I do not agree with
> your remark as a general statement.
Prah's REPLY: The statement wasn't intended as a general statement,
but one-sided tests can be used in some circumstances, which could
reduce the number of subjects tested, thereby reducing risk and cost.
James D. Prah, PhD
US EPA
Human Studies Division MD (58B)
Research Triangle Park, NC, 27711
919 966 6244
919 966 6367 FAX