Statistical Methods in Psychology
Journals:
Guidelines and
Explanations
http://www.apa.org/journals/amp/amp548594.html
American
Psychologist
August 1999, Vol. 54, No. 8, 594–604
© 1999 by the American Psychological Association
For personal use only—not for distribution
Leland Wilkinson and Task Force on Statistical
Inference
APA Board of Scientific Affairs
Correspondence concerning this article should be
addressed to Task Force on
Statistical Inference, c/o Sangeeta Panicker, APA
Science Directorate, 750 First
Street, NE, Washington, DC 20002-4242. Email may be
addressed to
spanicker@apa.org
In the light of continuing debate over the applications
of significance testing in psychology journals and following
the publication of Cohen's (1994) article, the Board of
Scientific Affairs (BSA) of the American Psychological
Association (APA) convened a committee called the Task Force
on Statistical Inference (TFSI) whose charge was "to
elucidate some of the controversial issues surrounding
applications of statistics including significance testing and
its alternatives; alternative underlying models and data
transformation; and newer methods made possible by powerful
computers" (BSA, personal communication, February 28,
1996). Robert Rosenthal, Robert Abelson, and Jacob Cohen (cochairs)
met initially and agreed on the desirability of having several
types of specialists on the task force: statisticians,
teachers of statistics, journal editors, authors of statistics
books, computer experts, and wise elders. Nine individuals were
subsequently invited to join and all agreed. These were Leona
Aiken, Mark Appelbaum, Gwyneth Boodoo, David A. Kenny, Helena
Kraemer, Donald Rubin, Bruce Thompson, Howard Wainer, and
Leland Wilkinson. In addition, Lee Cronbach, Paul Meehl,
Frederick Mosteller, and John Tukey served as Senior Advisors
to the Task Force and commented on written
materials.
The TFSI met twice in two years and corresponded
throughout that period. After the first meeting, the task
force circulated a preliminary report indicating its intention
to examine issues beyond null hypothesis significance testing.
The task force invited comments and used this feedback in the
deliberations during its second meeting.
After the second meeting, the task force recommended
several possibilities for further action, chief of which would
be to revise the statistical sections of the American
Psychological Association Publication Manual (APA, 1994).
After extensive discussion, the BSA recommended that
"before the TFSI undertook a revision of the APA
Publication Manual, it might want to consider publishing an
article in American Psychologist, as a way to initiate
discussion in the field about changes in current practices of
data analysis and reporting" (BSA, personal
communication, November 17, 1997).
This report follows that request. The sections in
italics are proposed guidelines that the TFSI recommends could
be used for revising the APA publication manual or for
developing other BSA supporting materials. Following each
guideline are comments, explanations, or elaborations
assembled by Leland Wilkinson for the task force and under its
review. This report is concerned with the use of statistical
methods only and is not meant as an assessment of research
methods in general. Psychology is a broad science. Methods
appropriate in one area may be inappropriate in another.
The title and format of this report are adapted from a
similar article by Bailar and Mosteller (1988). That article
should be consulted, because it overlaps somewhat with this
one and discusses some issues relevant to research in
psychology. Further detail can also be found in the
publications on this topic by several committee members
(Abelson, 1995, 1997; Rosenthal, 1994; Thompson, 1996; Wainer,
in press; see also articles in Harlow, Mulaik, & Steiger,
1997).
Method
Design
Make
clear at the outset what type of study you are doing. Do not
cloak a study in one guise to try to give it the assumed
reputation of another. For studies that have multiple goals,
be sure to define and prioritize those goals.
There are many forms of empirical studies in
psychology, including case reports, controlled experiments,
quasi-experiments, statistical simulations, surveys,
observational studies, and studies of studies (meta-analyses).
Some are hypothesis generating: They explore data to form or
sharpen hypotheses about a population for assessing future
hypotheses. Some are hypothesis testing: They assess specific
a priori hypotheses or estimate parameters by random sampling from that population.
Some are meta-analytic: They assess specific a priori hypotheses
or estimate parameters (or both) by synthesizing the results of
available studies.
Some researchers have the impression or have been
taught to believe that
some of these forms yield information that is more
valuable or credible
than others (see Cronbach, 1975, for a discussion).
Occasionally
proponents of some research methods disparage others.
In fact, each form
of research has its own strengths, weaknesses, and
standards of practice.
Population
The interpretation of the results of any study depends
on the
characteristics of the population intended for
analysis. Define the
population (participants, stimuli, or studies) clearly.
If control or
comparison groups are part of the design, present how
they are
defined.
Psychology students sometimes think that a statistical
population is the
human race or, at least, college sophomores. They also
have some
difficulty distinguishing a class of objects from a
statistical population, that is, recognizing that
sometimes we make inferences about a population through
statistical
methods, and other times we make inferences about a
class through logical
or other nonstatistical methods. Populations may be
sets of potential
observations on people, adjectives, or even research
articles. How a
population is defined in an article affects almost
every conclusion in that
article.
Sample
Describe the sampling procedures and emphasize any
inclusion or
exclusion criteria. If the sample is stratified (e.g.,
by site or gender),
describe fully the method and rationale. Note the
proposed sample
size for each subgroup.
Interval estimates for clustered and stratified random
samples differ from
those for simple random samples. Statistical software
is now becoming
available for these purposes. If you are using a
convenience sample
(whose members are not selected at random), be sure to
make that
procedure clear to your readers. Using a convenience
sample does not
automatically disqualify a study from publication, but
it harms your
objectivity to try to conceal this by implying that you
used a random
sample. Sometimes the case for the representativeness
of a convenience
sample can be strengthened by explicit comparison of
sample
characteristics with those of a defined population
across a wide range of
variables.
Assignment
Random assignment.
For research involving causal
inferences, the assignment of units to levels of the
causal variable is
critical. Random assignment (not to be confused with
random
selection) allows for the strongest possible causal
inferences free of
extraneous assumptions. If random assignment is
planned, provide
enough information to show that the process for making
the actual
assignments is random.
There is a strong research tradition and many exemplars
for random
assignment in various fields of psychology. Even those
who have
elucidated quasi-experimental designs in psychological
research (e.g.,
Cook & Campbell, 1979) have repeatedly emphasized
the superiority of
random assignment as a method for controlling bias and
lurking variables.
"Random" does not mean "haphazard."
Randomization is a fragile
condition, easily corrupted deliberately, as we see
when a skilled magician
flips a fair coin repeatedly to heads, or innocently,
as we saw when the
drum was not turned sufficiently to randomize the picks
in the Vietnam
draft lottery. As psychologists, we also know that
human participants are
incapable of producing a random process (digits,
spatial arrangements,
etc.) or of recognizing one. It is best not to trust
the random behavior of a
physical device unless you are an expert in these
matters. It is safer to use
the pseudorandom sequence from a well-designed computer
generator or
from published tables of random numbers. The added
benefit of such a
procedure is that you can supply a random number seed
or starting number in a table that other researchers
can use to check your methods later.
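As a minimal sketch of seeded assignment, the following Python fragment
assigns participants to conditions from a documented seed. The
participant IDs, condition labels, seed value, and the
assign_conditions helper are illustrative assumptions and are not part
of the task force's recommendations.

```python
import random

def assign_conditions(unit_ids, conditions, seed):
    """Randomly assign units to conditions in (near-)equal group sizes.

    Reporting the seed lets other researchers regenerate the assignment.
    """
    rng = random.Random(seed)      # dedicated generator; leaves global state alone
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)          # random permutation of the units
    # Deal the shuffled units round-robin into the conditions.
    return {unit: conditions[i % len(conditions)]
            for i, unit in enumerate(shuffled)}

# Hypothetical participant IDs, condition labels, and seed.
participants = [f"P{n:02d}" for n in range(1, 13)]
assignment = assign_conditions(participants, ["treatment", "control"],
                               seed=19990801)
```

Because the same seed may yield a different sequence under a different
generator or software version, it seems prudent to report the software
and version along with the seed.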
Nonrandom assignment.
For some research questions,
random assignment is not feasible. In such cases, we
need to minimize
effects of variables that affect the observed
relationship between a
causal variable and an outcome. Such variables are
commonly called
confounds or covariates. The researcher needs to
attempt to
determine the relevant covariates, measure them
adequately, and
adjust for their effects either by design or by
analysis. If the effects of
covariates are adjusted by analysis, the strong
assumptions that are
made must be explicitly stated and, to the extent
possible, tested and
justified. Describe methods used to attenuate sources
of bias,
including plans for minimizing dropouts, noncompliance,
and missing
data.
Authors have used the term "control group" to
describe, among other
things, (a) a comparison group, (b) members of pairs
matched or blocked
on one or more nuisance variables, (c) a group not
receiving a particular
treatment, (d) a statistical sample whose values are
adjusted post hoc by
the use of one or more covariates, or (e) a group for
which the
experimenter acknowledges bias exists and perhaps hopes
that this
admission will allow the reader to make appropriate
discounts or other
mental
adjustments. None of these is an instance of a fully adequate
control group.
If we can neither implement randomization nor approach
total control of
variables that modify effects (outcomes), then we
should use the term
"control group" cautiously. In most of these
cases, it would be better to
forgo the term and use "contrast group"
instead. In any case, we should
describe exactly which confounding variables have been
explicitly
controlled and speculate about which unmeasured ones
could lead to
incorrect inferences. In the absence of randomization,
we should do our
best to investigate sensitivity to various untestable
assumptions.
Measurement
Variables. Explicitly
define the variables in the study, show how
they are related to the goals of the study, and explain
how they are
measured. The units of measurement of all variables,
causal and
outcome, should fit the language you use in the
introduction and
discussion sections of your report.
A variable is a method for assigning to a set of
observations a value from a
set of possible outcomes. For example, a variable
called "gender" might
assign each of 50 observations to one of the values
male or female. When
we define a variable, we are declaring what we are
prepared to represent
as a valid observation and what we must consider as
invalid. If we define
the range of a particular variable (the set of possible
outcomes) to be from
1 to 7 on a Likert scale, for example, then a value of
9 is not an outlier (an
unusually extreme value). It is an illegal value. If we
declare the range of a
variable to be positive real numbers and the domain to
be observations of
reaction time (in milliseconds) to an administration of
electric shock, then a
value of 3,000 is not illegal; it is an outlier.
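The distinction between an illegal value and an outlier can be sketched
in a few lines of Python. The classify_value helper and the
1,500-millisecond screening threshold are assumptions made for
illustration; what counts as "unusually extreme" is a judgment the
researcher must make and report.

```python
import math

def classify_value(value, is_legal, is_extreme):
    """Label a value as 'illegal', 'outlier', or 'ok'.

    is_legal   -- encodes the declared range of the variable
    is_extreme -- an analyst-chosen screening rule (an assumption here)
    """
    if not is_legal(value):
        return "illegal"   # outside the declared range: not a valid observation
    if is_extreme(value):
        return "outlier"   # a valid observation, but unusually extreme
    return "ok"

# A 9 on a 1-to-7 Likert scale falls outside the declared range.
likert_label = classify_value(
    9,
    is_legal=lambda v: v in range(1, 8),
    is_extreme=lambda v: False,
)

# A reaction time of 3,000 ms is a positive real number, so it is legal;
# the 1,500 ms cutoff is a hypothetical threshold for "unusually extreme".
rt_label = classify_value(
    3000.0,
    is_legal=lambda v: v > 0 and math.isfinite(v),
    is_extreme=lambda v: v > 1500,
)
```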
Naming a variable is almost as important as measuring
it. We do well to
select a name that reflects how a variable is measured.
On this basis, the
name "IQ test score" is preferable to
"intelligence" and "retrospective
self-report of childhood sexual abuse" is
preferable to "childhood sexual
abuse." Without such precision, ambiguity in
defining variables can give a
theory an unfortunate resistance to empirical
falsification. Being precise
does not make us operationalists. It simply means that
we try to avoid
excessive generalization.
Editors and reviewers should be suspicious when they
notice authors
changing definitions or names of variables, failing to
make clear what
would be contrary evidence, or using measures with no
history and thus no
known properties. Researchers should be suspicious when
code books
and scoring systems are inscrutable or more voluminous
than the research
articles on which they are based. Everyone should worry
when a system
offers to code a specific observation in two or more
ways for the same
variable.
Instruments. If
a questionnaire is used to collect data,
summarize the psychometric properties of its scores
with specific
regard to the way the instrument is used in a
population.
Psychometric properties include measures of validity,
reliability, and any other qualities affecting conclusions.
If a physical apparatus is used, provide enough information
(brand, model, design specifications) to allow another
experimenter to replicate your measurement process.
There are many methods for constructing instruments and
psychometrically validating scores from such measures.
Traditional true-score theory and
item–response test theory provide appropriate
frameworks for assessing
reliability and internal validity. Signal
detection theory and various
coefficients of association can be used to assess
external validity. Messick
(1989) provides a comprehensive guide to validity.
It is important to remember that a test is not reliable
or unreliable.
Reliability is a property of the scores on a test for a
particular population
of examinees (Feldt & Brennan, 1989). Thus,
authors should provide
reliability coefficients of the scores for the data
being analyzed even when
the focus of their research is not psychometric.
Interpreting the size of
observed effects requires an assessment of the
reliability of the scores.
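One way to report reliability for the data in hand is to compute a
coefficient directly from the item scores being analyzed. The sketch
below uses coefficient alpha as one common choice; the text does not
prescribe a particular coefficient, and the cronbach_alpha helper and
the response matrix are made up for illustration.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Coefficient alpha for a respondents-by-items score matrix.

    item_scores -- list of respondents, each a list of k item scores
    """
    k = len(item_scores[0])
    items = list(zip(*item_scores))                   # one tuple per item
    item_var_sum = sum(variance(item) for item in items)
    total_var = variance([sum(person) for person in item_scores])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical responses: 5 respondents by 3 items.
scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 4, 5],
    [3, 3, 4],
]
alpha = cronbach_alpha(scores)
```

The resulting value describes these scores from these respondents, not
the instrument in general.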
Besides showing that an instrument is reliable, we need
to show that it
does not correlate strongly with other key constructs.
It is just as important
to establish that a measure does not measure what it
should not measure
as it is to show that it does measure what it should.
Researchers occasionally encounter a measurement
problem that has no
obvious solution. This happens when they decide to
explore a new and
rapidly growing research area that is based on a
previous researcher's
well-defined construct implemented with a poorly
developed psychometric
instrument. Innovators, in the excitement of their
discovery, sometimes give
insufficient attention to the quality of their
instruments. Once a defective
measure enters the literature, subsequent researchers
are reluctant to
change it. In these cases, editors and reviewers should
pay special
attention to the psychometric properties of the
instruments used, and they
might want to encourage revisions (even if not by the
scale's author) to
prevent the accumulation of results based on relatively
invalid or unreliable
measures.
Procedure. Describe
any anticipated sources of attrition due to
noncompliance, dropout, death, or other factors.
Indicate how such
attrition may affect the generalizability of the
results. Clearly
describe the conditions under which measurements are
taken (e.g.,
format, time, place, personnel who collected data).
Describe the
specific methods used to deal with experimenter bias,
especially if you
collected the data yourself.
Despite the long-established findings of the effects of
experimenter bias (Rosenthal, 1966), many published
studies appear to
ignore or discount
these problems. For example, some authors or their
assistants with
knowledge of hypotheses or study goals screen
participants (through
personal interviews or telephone conversations) for
inclusion in their
studies. Some authors administer questionnaires. Some
authors give
instructions to participants. Some authors perform
experimental
manipulations. Some tally or code responses. Some rate videotapes.
An author's self-awareness, experience, or resolve does
not eliminate
experimenter bias. In short, there are no valid
excuses, financial or
otherwise, for avoiding an opportunity to double-blind.
Researchers
looking for guidance on this matter should consult the
classic book of
Webb, Campbell, Schwartz, and Sechrest (1966) and an
exemplary
dissertation (performed on a modest budget) by Baker
(1969).
Power and sample size.
Provide information on sample size
and the process that led to sample size decisions.
Document the effect
sizes, sampling and measurement assumptions, as well as
analytic
procedures used in power calculations. Because power
computations
are most meaningful when done before data are collected
and
examined, it is important to show how effect-size
estimates have been
derived from previous research and theory in order to
dispel
suspicions that they might have been taken from data
used in the
study or, even worse, constructed to justify a
particular sample size.
Once the study is analyzed, confidence intervals
replace calculated
power in describing results.
Largely because of the work of Cohen (1969, 1988),
psychologists have
become aware of the need to consider power in the
design of their studies,
before they collect data. The intellectual exercise
required to do this
stimulates authors to take seriously prior research and
theory in their field,
and it gives an opportunity, with incumbent risk, for a
few to offer the
challenge that there is no applicable research behind a
given study. If
exploration were not disguised in hypothetico-deductive
language, then it
might have the opportunity to influence subsequent
research constructively.
Computer programs that calculate power for various
designs and
distributions are now available. One can use them to
conduct power
analyses for a range of reasonable alpha values and
effect sizes. Doing so
reveals how power changes across this range and
overcomes a tendency
to regard a single power estimate as being absolutely
definitive.
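A power analysis over a range of values can be sketched without
specialized software. The fragment below uses a normal approximation
for a two-sided, two-group comparison with equal group sizes; the
approximation, the approx_power helper, the group size of 50, and the
grids of alpha values and standardized effect sizes are all assumptions
chosen for illustration. Dedicated power software will give somewhat
different numbers for small samples.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha):
    """Approximate power of a two-sided, two-sample test.

    Normal approximation (an assumption of this sketch):
    d is the standardized mean difference; groups are equal in size.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # two-sided critical value
    delta = d * (n_per_group / 2) ** 0.5   # expected value of the test statistic
    # Probability of landing in either rejection region.
    return (1 - z.cdf(z_crit - delta)) + z.cdf(-z_crit - delta)

# Hypothetical grid: power for each combination of alpha and effect size.
alphas = (0.01, 0.05, 0.10)
effect_sizes = (0.2, 0.5, 0.8)
power_table = {
    (alpha, d): approx_power(d, n_per_group=50, alpha=alpha)
    for alpha in alphas
    for d in effect_sizes
}
```

Reading across power_table shows how strongly the answer depends on the
inputs, which is the reason for not treating any single entry as
definitive.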
Many of us encounter power issues when applying for
grants. Even when
not asking for money, think about power. Statistical
power does not
corrupt.
Results
Complications
Before presenting results, report complications,
protocol violations,
and other unanticipated events in data collection.
These include
missing data,
attrition, and nonresponse. Discuss analytic techniques
devised to ameliorate these problems. Describe
nonrepresentativeness statistically by reporting
patterns and
distributions of missing data and contaminations.
Document how the
actual analysis differs from the analysis planned
before complications
arose. The use of techniques to ensure that the
reported results are
not produced by anomalies in the data (e.g., outliers,
points of high
influence, nonrandom missing data, selection bias,
attrition problems)
should be a standard component of all analyses.
As soon as you have collected your data, before you
compute any
statistics, look at your data. Data screening is not
data snooping. It is not
an opportunity to discard data or change values to
favor your hypotheses.
However, if you assess hypotheses without examining
your data, you risk
publishing nonsense.
Computer malfunctions tend to be catastrophic: A system
crashes; a file
fails to import; data are lost. Less well known are
more subtle bugs that
can be more catastrophic in the long run. For example,
a single value in a
file may be corrupted in reading or writing (often in
the first or last record).
This circumstance usually produces a major value error,
the kind of
singleton that can make large correlations change sign
and small
correlations become large.
Graphical inspection of data offers an excellent
possibility for detecting
serious compromises to data integrity. The reason is
simple: Graphics
broadcast; statistics narrowcast. Indeed, some
international corporations
that must defend themselves against rapidly evolving
fraudulent schemes
use real-time graphic displays as their first line of
defense and statistical
analyses as a distant second. The following example
shows why.
Figure 1 shows a scatterplot matrix (SPLOM) of three
variables from a
national survey of approximately 3,000 counseling
clients (Chartrand,
1997). This display, consisting of pairwise scatter
plots arranged in a
matrix, is found in most modern statistical packages.
The diagonal cells
contain dot plots of each variable (with the dots
stacked like a histogram)
and scales used for each variable. The three variables
shown are
questionnaire measures of respondent's age (AGE),
gender (SEX), and
number of years together in current relationship
(TOGETHER). The
graphic in Figure 1 is not intended for final
presentation of results; we use it
instead to locate coding errors and other anomalies
before we analyze our
data. Figure 1 is a selected portion of a computer
screen display that
offers tools for zooming in and out, examining points,
and linking to
information in other graphical
displays and data editors. SPLOM displays
can be used to recognize unusual patterns in 20 or more
variables
simultaneously. We focus on these three only.
There are several anomalies in this graphic. The AGE
histogram shows a
spike at the right end, which corresponds to the value
99 in the data. This
coded value most likely signifies a missing value,
because it is unlikely that
this many people in a sample of 3,000 would have an age
of 99 or greater.
Using numerical values for missing value codes is a
risky practice (Kahn
& Udry, 1986).
The histogram for SEX shows an unremarkable division
into two values.
The histogram for TOGETHER is highly skewed, with a
spike at the lower
end presumably signifying no relationship. The most
remarkable pattern is
the triangular joint distribution of TOGETHER and AGE.
Triangular joint
distributions often (but not necessarily) signal an
implication or a relation
rather than a linear function with error. In this case,
it makes sense that the
span of a relationship should not exceed a person's
age. Closer
examination shows that something is wrong here,
however. We find some
respondents (in the upper left triangular area of the
TOGETHER–AGE
panel) claiming that they have been in a significant
relationship longer than
they have been alive! Had we computed
statistics or fit models
before examining the raw data, we would likely have
missed these
reporting errors. There is little reason to expect that
TOGETHER would
show any anomalous behavior with other variables, and
even if AGE and
TOGETHER appeared jointly in certain models, we may not
have known
anything was amiss, regardless of our care in examining
residual or other
diagnostic plots.
The main point of this example is that the type of
"atheoretical" search for
patterns that we are sometimes warned against in
graduate school can
save us from the humiliation of having to retract
conclusions we might
ultimately make on the basis of contaminated data. We
are warned against
fishing expeditions for understandable reasons, but
blind application of
models without screening our data is a far graver
error.
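Once a display has revealed anomalies of this kind, simple consistency
checks can document them. The following sketch flags the two problems
described in the example; the record layout, the screen_records helper,
and the sample rows are invented for illustration, and treating 99 as a
missing-value code is the suspicion raised in the text rather than a
documented fact about the survey. Such checks record what the graphics
found; they are not a substitute for looking at the data.

```python
MISSING_AGE_CODE = 99   # assumed missing-value code, as suspected in the text

def screen_records(records):
    """Return (row index, description) pairs for suspicious records."""
    problems = []
    for i, rec in enumerate(records):
        age, together = rec["AGE"], rec["TOGETHER"]
        if age == MISSING_AGE_CODE:
            problems.append((i, "AGE equals the suspected missing-value code"))
        elif together > age:
            # A relationship cannot have lasted longer than the person's life.
            problems.append((i, "TOGETHER exceeds AGE"))
    return problems

# Hypothetical rows using the variable names from Figure 1.
records = [
    {"AGE": 34, "SEX": 1, "TOGETHER": 10},
    {"AGE": 99, "SEX": 2, "TOGETHER": 5},
    {"AGE": 25, "SEX": 2, "TOGETHER": 40},
    {"AGE": 51, "SEX": 1, "TOGETHER": 0},
]
flags = screen_records(records)
```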
Graphics cannot solve all our problems. Special issues
arise in modeling
when we have missing data. The two popular methods for
dealing with
missing data that are found in basic statistics
packages, listwise and
pairwise deletion of missing values, are among the
worst methods available
for practical applications. Little and Rubin (1987)
have discussed these
issues in more detail and offer alternative approaches.
Analysis
Choosing a minimally sufficient analysis.
The enormous variety
of modern quantitative methods leaves researchers with the
nontrivial task of matching analysis and design to the
research
question. Although complex designs and state-of-the-art
methods are
sometimes necessary to address research questions
effectively,
simpler classical approaches often can provide elegant
and sufficient
answers to important questions. Do not choose an
analytic method to
impress your readers or to deflect criticism. If the
assumptions and
strength of a simpler method are reasonable for your
data and
research problem, use it. Occam's razor applies to
methods as well as
to theories.
We should follow the advice of Fisher (1935):
Experimenters should remember that they and their
colleagues usually
know more about the kind of material they are dealing
with than do the
authors of textbooks written without such personal
experience, and that a
more complex, or less intelligible, test is not likely
to serve their purpose
better, in any sense, than those of proved value in
their own subject. (p.
49)
There is nothing wrong with using state-of-the-art
methods, as long as you
and your readers understand how they work and what they
are doing. On
the other hand, don't cling to obsolete methods (e.g.,
Newman–Keuls or
Duncan post hoc tests) out of fear of learning the new.
In any case, listen
to Fisher. Begin with an idea. Then pick a method.
Computer programs.
There are many good computer programs
for analyzing data. More important than choosing a
specific
statistical package is verifying your results,
understanding what they
mean, and knowing how they are computed. If you cannot
verify your
results by intelligent "guesstimates," you
should check them against
the output of another program. You will not be happy if
a vendor
reports a bug after your data are in print (not an
infrequent event).
Do not report statistics found on a printout without
understanding
how they are computed or what they mean. Do not report
statistics to
a greater precision than is supported by your data
simply because
they are printed that way by the program. Using the
computer is an
opportunity for you to control your analysis and
design. If a computer
program does not provide the analysis
you need, use another
program rather than let the computer shape your
thinking.
There is no substitute for common sense. If you cannot
use rules of thumb
to detect whether the result of a computation makes
sense to you, then
you should ask yourself whether the procedure you are
using is
appropriate for your research. Graphics can help you to
make some of
these determinations; theory can help in other cases.
But never assume that
using a highly regarded program absolves you of the
responsibility for
judging whether your results are plausible. Finally,
when documenting the
use of a statistical procedure, refer to the
statistical literature rather than a
computer manual; when documenting the use of a program,
refer to the
computer manual rather than the statistical literature.
Assumptions. You
should make an effort to ensure that the
underlying assumptions required for the analysis are reasonable
given the data. Examine residuals carefully. Do not use
distributional
tests and statistical indexes of shape (e.g., skewness,
kurtosis) as a
substitute for examining your residuals graphically.
Using a statistical test to diagnose problems in model
fitting has several
shortcomings. First, diagnostic significance tests
based on summary
statistics (such as tests for homogeneity of variance)
are often impractically
sensitive; our statistical tests of models are often
more robust than our
statistical tests of assumptions. Second, statistics
such as skewness and
kurtosis often fail to detect distributional
irregularities in the residuals.
Third, statistical tests depend on sample size, and as
sample size increases,
the tests often will reject innocuous assumptions. In
general, there is no
substitute for graphical analysis of assumptions.
Modern statistical packages offer graphical diagnostics
for helping to
determine whether a model appears to fit data
appropriately. Most users
are familiar with residual plots for linear regression
modeling. Fewer are
aware that John Tukey's paradigmatic equation, data =
fit + residual,
applies to a more general class of models and has broad
implications for
graphical analysis of assumptions. Stem-and-leaf plots,
box plots,
histograms, dot plots, spread/level plots, probability
plots, spectral plots,
autocorrelation and cross-correlation plots, coplots,
and trellises (Chambers, Cleveland, Kleiner, & Tukey, 1983;
Cleveland, 1995; Tukey,
1977) all serve at various times for displaying
residuals, whether they arise
from analysis of variance (ANOVA), nonlinear modeling,
factor analysis,
latent variable modeling, multidimensional scaling,
hierarchical linear
modeling, or other procedures.
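The decomposition data = fit + residual can be made concrete with a
small sketch. The fragment below fits a straight line by least squares
to made-up, deliberately curved data and extracts the residuals that
one would then plot. The fit_line and residuals helpers and the data
are illustrative assumptions, and the crude sign pattern stands in for
a real residual plot, which remains the recommended tool.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares intercept and slope for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

def residuals(x, y):
    """residual = data - fit, for the straight-line fit."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Made-up data that follow a curve rather than a line.
x = [1, 2, 3, 4, 5, 6]
y = [1.1, 3.9, 9.2, 15.8, 25.1, 36.0]
res = residuals(x, y)

# Crude text display: sign of each residual, ordered by x.
pattern = "".join("+" if r > 0 else "-" for r in res)
```

A long run of same-signed residuals in the middle of the range, flanked
by residuals of the opposite sign, is the kind of systematic pattern
that suggests the straight-line model is missing curvature.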
Hypothesis tests.
It is hard to imagine a situation in which a
dichotomous accept–reject decision is better than
reporting an actual
p value or, better still, a confidence interval. Never
use the
unfortunate expression "accept the null
hypothesis." Always provide
some effect-size estimate when reporting a p value.
Cohen (1994) has
written on this subject in this journal. All
psychologists would benefit from
reading his insightful article.
Effect sizes. Always
present effect sizes for primary outcomes. If
the units of measurement are meaningful on a practical
level (e.g.,
number of cigarettes smoked per day), then we usually
prefer an
unstandardized measure (regression coefficient or mean
difference)
to a standardized measure (r or d). It helps to add
brief comments
that place these effect sizes in a practical and
theoretical context.
APA's (1994) publication manual included an important
new
"encouragement" (p. 18) to report effect
sizes. Unfortunately, empirical
studies of various journals indicate that the effect
size of this
encouragement has been negligible (Keselman et al.,
1998; Kirk, 1996;
Thompson & Snyder, 1998). We must stress again that
reporting and
interpreting effect sizes in the context of previously
reported effects is
essential to good research. It enables readers to
evaluate the stability of
results across samples, designs, and analyses.
Reporting effect sizes also
informs power analyses and meta-analyses needed in
future research.
Fleiss (1994), Kirk (1996), Rosenthal (1994), and
Snyder and Lawson
(1993) have summarized various measures of effect sizes
used in
psychological research. Consult these articles for
information on computing
them. For a simple, general purpose display of the
practical meaning of an
effect size, see Rosenthal and Rubin (1982). Consult
Rosenthal and Rubin
(1994) for information on the use of "counternull
intervals" for effect sizes,
as alternatives to confidence intervals.
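To illustrate the difference between an unstandardized and a
standardized effect size, the sketch below computes a raw mean
difference and a pooled-standard-deviation d for two groups. The
cigarettes-per-day data are invented, and the mean_difference and
cohens_d helpers are assumptions of this example; the articles cited
above describe these and other measures in detail.

```python
from statistics import mean, variance

def mean_difference(group_a, group_b):
    """Unstandardized effect: difference in means, in the original units."""
    return mean(group_a) - mean(group_b)

def cohens_d(group_a, group_b):
    """Standardized effect: mean difference divided by the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (((n_a - 1) * variance(group_a)
                   + (n_b - 1) * variance(group_b))
                  / (n_a + n_b - 2))
    return mean_difference(group_a, group_b) / pooled_var ** 0.5

# Hypothetical cigarettes smoked per day in two groups.
control = [20, 18, 25, 22, 15]
treated = [14, 12, 19, 16, 9]

raw_diff = mean_difference(control, treated)   # in cigarettes per day
d = cohens_d(control, treated)                 # in pooled-SD units
```

When the units are meaningful, as here, the raw difference is usually
the easier quantity for readers to interpret.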
Interval estimates.
Interval estimates should be given for any
effect sizes involving principal outcomes. Provide
intervals for
correlations and other coefficients of association or
variation
whenever possible.
Confidence intervals are usually available in
statistical software; otherwise,
confidence intervals for basic statistics can be
computed from typical
output. Comparing confidence intervals from a current
study to intervals
from previous, related studies helps focus attention on
stability across
studies (Schmidt, 1996). Collecting intervals across
studies also helps in
constructing plausible regions for population
parameters. This practice
should help prevent the common mistake of assuming a
parameter is
contained in a confidence interval.
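As an example of computing an interval from typical output, the sketch
below builds an approximate confidence interval for a correlation from
only the reported r and sample size. It uses the Fisher z
transformation, which is one standard approach and an assumption of
this example; the correlation_ci helper and the input values are
hypothetical.

```python
import math
from statistics import NormalDist

def correlation_ci(r, n, confidence=0.95):
    """Approximate confidence interval for a correlation (Fisher z method)."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z = math.atanh(r)               # transform r to the z scale
    se = 1 / math.sqrt(n - 3)       # approximate standard error on the z scale
    # Build the interval on the z scale, then transform back to r.
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Hypothetical reported values: r = .30 from a sample of 100.
low, high = correlation_ci(r=0.30, n=100)
```

Note that the interval is not symmetric around r, because the
transformation back to the correlation scale is nonlinear.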
Multiplicities. Multiple
outcomes require special handling. There
are many ways to conduct reasonable inference when
faced with
multiplicity (e.g., Bonferroni correction of p values,
multivariate test
statistics, empirical Bayes methods). It is your
responsibility to define
and justify the methods used.
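Of the approaches listed, the Bonferroni correction is the simplest to
show. The sketch below multiplies each p value by the number of tests
and caps the result at 1. The bonferroni_adjust helper and the p values
are made up for illustration, and this is only one of the reasonable
methods named above, not a recommendation over the others.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical p values from four tests on the same data set.
p_values = [0.004, 0.020, 0.030, 0.400]
adjusted = bonferroni_adjust(p_values)

# Number of tests still below .05 after adjustment.
n_below = sum(p < 0.05 for p in adjusted)
```

Whatever method is used, the number of tests entering the correction
should be stated so that readers can check the adjustment.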
Statisticians speak of the
curse of dimensionality. To paraphrase,
multiplicities are the curse of the social sciences. In
many areas of
psychology, we cannot do research on important problems
without
encountering multiplicity. We often encounter many
variables and many
relationships.
One of the most prevalent strategies psychologists use
to handle
multiplicity is to follow an ANOVA with pairwise
multiplecomparison
tests. This approach is usually wrong for several
reasons. First, pairwise
