
Statistical Methods in Psychology Journals: Guidelines and Explanations

http://www.apa.org/journals/amp/amp548594.html

American Psychologist
Selected Article
August 1999, Vol. 54, No. 8, 594–604
© 1999 by the American Psychological Association
For personal use only—not for distribution


Leland Wilkinson and Task Force on Statistical Inference
APA Board of Scientific Affairs

Correspondence concerning this article should be addressed to Task Force on Statistical Inference, c/o Sangeeta Panicker, APA Science Directorate, 750 First Street, NE, Washington, DC 20002-4242. Email may be addressed to spanicker@apa.org.

In the light of continuing debate over the applications of significance testing in psychology journals and following the publication of Cohen's (1994) article, the Board of Scientific Affairs (BSA) of the American Psychological Association (APA) convened a committee called the Task Force on Statistical Inference (TFSI) whose charge was "to elucidate some of the controversial issues surrounding applications of statistics including significance testing and its alternatives; alternative underlying models and data transformation; and newer methods made possible by powerful computers" (BSA, personal communication, February 28, 1996). Robert Rosenthal, Robert Abelson, and Jacob Cohen (cochairs) met initially and agreed on the desirability of having several types of specialists on the task force: statisticians, teachers of statistics, journal editors, authors of statistics books, computer experts, and wise elders. Nine individuals were subsequently invited to join and all agreed. These were Leona Aiken, Mark Appelbaum, Gwyneth Boodoo, David A. Kenny, Helena Kraemer, Donald Rubin, Bruce Thompson, Howard Wainer, and Leland Wilkinson. In addition, Lee Cronbach, Paul Meehl, Frederick Mosteller, and John Tukey served as Senior Advisors to the Task Force and commented on written materials.

           The TFSI met twice in two years and corresponded throughout that period. After the first meeting, the task force circulated a preliminary report indicating its intention to examine issues beyond null hypothesis significance testing. The task force invited comments and used this feedback in the deliberations during its second meeting.

           After the second meeting, the task force recommended several possibilities for further action, chief of which would be to revise the statistical sections of the American Psychological Association Publication Manual (APA, 1994). After extensive discussion, the BSA recommended that "before the TFSI undertook a revision of the APA Publication Manual, it might want to consider publishing an article in American Psychologist, as a way to initiate discussion in the field about changes in current practices of data analysis and reporting" (BSA, personal communication, November 17, 1997).

           This report follows that request. The sections in italics are proposed guidelines that the TFSI recommends could be used for revising the APA publication manual or for developing other BSA supporting materials. Following each guideline are comments, explanations, or elaborations assembled by Leland Wilkinson for the task force and under its review. This report is concerned with the use of statistical methods only and is not meant as an assessment of research methods in general. Psychology is a broad science. Methods appropriate in one area may be inappropriate in another.

The title and format of this report are adapted from a similar article by Bailar and Mosteller (1988). That article should be consulted, because it overlaps somewhat with this one and discusses some issues relevant to research in psychology. Further detail can also be found in the publications on this topic by several committee members (Abelson, 1995, 1997; Rosenthal, 1994; Thompson, 1996; Wainer, in press; see also articles in Harlow, Mulaik, & Steiger, 1997).

Method

Design

                      Make clear at the outset what type of study you are doing. Do not cloak a study in one guise to try to give it the assumed reputation of another. For studies that have multiple goals, be sure to define and prioritize those goals.

There are many forms of empirical studies in psychology, including case reports, controlled experiments, quasi-experiments, statistical simulations, surveys, observational studies, and studies of studies (meta-analyses). Some are hypothesis generating: They explore data to form or sharpen hypotheses about a population for assessing future hypotheses. Some are hypothesis testing: They assess specific a priori hypotheses or estimate parameters by random sampling from that population. Some are meta-analytic: They assess specific a priori hypotheses or estimate parameters (or both) by synthesizing the results of available studies.

Some researchers have the impression or have been taught to believe that some of these forms yield information that is more valuable or credible than others (see Cronbach, 1975, for a discussion). Occasionally proponents of some research methods disparage others. In fact, each form of research has its own strengths, weaknesses, and standards of practice.

Population

The interpretation of the results of any study depends on the characteristics of the population intended for analysis. Define the population (participants, stimuli, or studies) clearly. If control or comparison groups are part of the design, present how they are defined.

Psychology students sometimes think that a statistical population is the human race or, at least, college sophomores. They also have some difficulty distinguishing a class of objects from a statistical population: sometimes we make inferences about a population through statistical methods, and other times we make inferences about a class through logical or other nonstatistical methods. Populations may be sets of potential observations on people, adjectives, or even research articles. How a population is defined in an article affects almost every conclusion in that article.

 

Sample

Describe the sampling procedures and emphasize any inclusion or exclusion criteria. If the sample is stratified (e.g., by site or gender), describe fully the method and rationale. Note the proposed sample size for each subgroup.

Interval estimates for clustered and stratified random samples differ from those for simple random samples. Statistical software is now becoming available for these purposes. If you are using a convenience sample (whose members are not selected at random), be sure to make that procedure clear to your readers. Using a convenience sample does not automatically disqualify a study from publication, but it harms your objectivity to try to conceal this by implying that you used a random sample. Sometimes the case for the representativeness of a convenience sample can be strengthened by explicit comparison of sample characteristics with those of a defined population across a wide range of variables.

 

Assignment

Random assignment. For research involving causal inferences, the assignment of units to levels of the causal variable is critical. Random assignment (not to be confused with random selection) allows for the strongest possible causal inferences free of extraneous assumptions. If random assignment is planned, provide enough information to show that the process for making the actual assignments is random.

There is a strong research tradition and many exemplars for random assignment in various fields of psychology. Even those who have elucidated quasi-experimental designs in psychological research (e.g., Cook & Campbell, 1979) have repeatedly emphasized the superiority of random assignment as a method for controlling bias and lurking variables. "Random" does not mean "haphazard." Randomization is a fragile condition, easily corrupted deliberately, as we see when a skilled magician flips a fair coin repeatedly to heads, or innocently, as we saw when the drum was not turned sufficiently to randomize the picks in the Vietnam draft lottery. As psychologists, we also know that human participants are incapable of producing a random process (digits, spatial arrangements, etc.) or of recognizing one. It is best not to trust the random behavior of a physical device unless you are an expert in these matters. It is safer to use the pseudorandom sequence from a well-designed computer generator or from published tables of random numbers. The added benefit of such a procedure is that you can supply a random number seed or starting number in a table that other researchers can use to check your methods later.
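For illustration, the following is a minimal sketch of seeded random assignment with NumPy; the participant identifiers and group sizes are hypothetical, and the point is only that reporting the seed lets other researchers reproduce the assignment.

    # Minimal sketch: reproducible random assignment to two arms (hypothetical IDs).
    import numpy as np

    rng = np.random.default_rng(seed=20240317)  # report this seed in the Method section
    participant_ids = [f"P{i:03d}" for i in range(1, 41)]

    shuffled = rng.permutation(participant_ids)  # seeded pseudorandom permutation
    treatment, control = shuffled[:20], shuffled[20:]
    print("Treatment:", sorted(treatment))
    print("Control:  ", sorted(control))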

 

Nonrandom assignment. For some research questions, random assignment is not feasible. In such cases, we need to minimize effects of variables that affect the observed relationship between a causal variable and an outcome. Such variables are commonly called confounds or covariates. The researcher needs to attempt to determine the relevant covariates, measure them adequately, and adjust for their effects either by design or by analysis. If the effects of covariates are adjusted by analysis, the strong assumptions that are made must be explicitly stated and, to the extent possible, tested and justified. Describe methods used to attenuate sources of bias, including plans for minimizing dropouts, noncompliance, and missing data.

Authors have used the term "control group" to describe, among other things, (a) a comparison group, (b) members of pairs matched or blocked on one or more nuisance variables, (c) a group not receiving a particular treatment, (d) a statistical sample whose values are adjusted post hoc by the use of one or more covariates, or (e) a group for which the experimenter acknowledges bias exists and perhaps hopes that this admission will allow the reader to make appropriate discounts or other mental adjustments. None of these is an instance of a fully adequate control group.

If we can neither implement randomization nor approach total control of variables that modify effects (outcomes), then we should use the term "control group" cautiously. In most of these cases, it would be better to forgo the term and use "contrast group" instead. In any case, we should describe exactly which confounding variables have been explicitly controlled and speculate about which unmeasured ones could lead to incorrect inferences. In the absence of randomization, we should do our best to investigate sensitivity to various untestable assumptions.

 

Measurement

Variables. Explicitly define the variables in the study, show how they are related to the goals of the study, and explain how they are measured. The units of measurement of all variables, causal and outcome, should fit the language you use in the introduction and discussion sections of your report.

A variable is a method for assigning to a set of observations a value from a set of possible outcomes. For example, a variable called "gender" might assign each of 50 observations to one of the values male or female. When we define a variable, we are declaring what we are prepared to represent as a valid observation and what we must consider as invalid. If we define the range of a particular variable (the set of possible outcomes) to be from 1 to 7 on a Likert scale, for example, then a value of 9 is not an outlier (an unusually extreme value). It is an illegal value. If we declare the range of a variable to be positive real numbers and the domain to be observations of reaction time (in milliseconds) to an administration of electric shock, then a value of 3,000 is not illegal; it is an outlier.
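As a small illustration of the distinction between an illegal value and an outlier, the sketch below (with hypothetical data and the pandas library) flags out-of-range Likert responses separately from extreme but admissible reaction times.

    # Sketch: illegal values versus outliers (hypothetical variables and data).
    import pandas as pd

    likert = pd.Series([4, 7, 2, 9, 5])              # declared range: 1 to 7
    reaction_ms = pd.Series([420.0, 515.0, 3000.0])  # declared range: positive reals

    illegal = likert[~likert.between(1, 7)]          # 9 is illegal, not an outlier
    print("Illegal Likert values:", illegal.tolist())

    # 3000 ms is admissible; whether it is an outlier is a judgment about the distribution.
    q1, q3 = reaction_ms.quantile([0.25, 0.75])
    flagged = reaction_ms[reaction_ms > q3 + 1.5 * (q3 - q1)]
    print("Flagged reaction times:", flagged.tolist())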

 

Naming a variable is almost as important as measuring it. We do well to select a name that reflects how a variable is measured. On this basis, the name "IQ test score" is preferable to "intelligence" and "retrospective self-report of childhood sexual abuse" is preferable to "childhood sexual abuse." Without such precision, ambiguity in defining variables can give a theory an unfortunate resistance to empirical falsification. Being precise does not make us operationalists. It simply means that we try to avoid excessive generalization.

Editors and reviewers should be suspicious when they notice authors changing definitions or names of variables, failing to make clear what would be contrary evidence, or using measures with no history and thus no known properties. Researchers should be suspicious when code books and scoring systems are inscrutable or more voluminous than the research articles on which they are based. Everyone should worry when a system offers to code a specific observation in two or more ways for the same variable.

 

Instruments. If a questionnaire is used to collect data, summarize the psychometric properties of its scores with specific regard to the way the instrument is used in a population. Psychometric properties include measures of validity, reliability, and any other qualities affecting conclusions. If a physical apparatus is used, provide enough information (brand, model, design specifications) to allow another experimenter to replicate your measurement process.

There are many methods for constructing instruments and psychometrically validating scores from such measures. Traditional true-score theory and item–response test theory provide appropriate frameworks for assessing reliability and internal validity. Signal detection theory and various coefficients of association can be used to assess external validity. Messick (1989) provides a comprehensive guide to validity.

It is important to remember that a test is not reliable or unreliable. Reliability is a property of the scores on a test for a particular population of examinees (Feldt & Brennan, 1989). Thus, authors should provide reliability coefficients of the scores for the data being analyzed even when the focus of their research is not psychometric. Interpreting the size of observed effects requires an assessment of the reliability of the scores.
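As one way to follow this advice, the sketch below computes coefficient alpha for the scores actually being analyzed; the item matrix is hypothetical, and alpha is only one of several reliability coefficients an author might report.

    # Sketch: coefficient alpha for a hypothetical 4-item scale (respondents x items).
    import numpy as np

    def cronbach_alpha(items: np.ndarray) -> float:
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)          # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)      # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    scores = np.array([[3, 4, 3, 4],
                       [2, 2, 3, 2],
                       [5, 4, 5, 5],
                       [4, 4, 4, 3],
                       [1, 2, 1, 2]])
    print(f"alpha = {cronbach_alpha(scores):.2f}")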

 

Besides showing that an instrument is reliable, we need to show that it does not correlate strongly with other key constructs. It is just as important to establish that a measure does not measure what it should not measure as it is to show that it does measure what it should.

Researchers occasionally encounter a measurement problem that has no obvious solution. This happens when they decide to explore a new and rapidly growing research area that is based on a previous researcher's well-defined construct implemented with a poorly developed psychometric instrument. Innovators, in the excitement of their discovery, sometimes give insufficient attention to the quality of their instruments. Once a defective measure enters the literature, subsequent researchers are reluctant to change it. In these cases, editors and reviewers should pay special attention to the psychometric properties of the instruments used, and they might want to encourage revisions (even if not by the scale's author) to prevent the accumulation of results based on relatively invalid or unreliable measures.

 

Procedure. Describe any anticipated sources of attrition due to noncompliance, dropout, death, or other factors. Indicate how such attrition may affect the generalizability of the results. Clearly describe the conditions under which measurements are taken (e.g., format, time, place, personnel who collected data). Describe the specific methods used to deal with experimenter bias, especially if you collected the data yourself.

Despite the long-established findings of the effects of experimenter bias (Rosenthal, 1966), many published studies appear to ignore or discount these problems. For example, some authors or their assistants with knowledge of hypotheses or study goals screen participants (through personal interviews or telephone conversations) for inclusion in their studies. Some authors administer questionnaires. Some authors give instructions to participants. Some authors perform experimental manipulations. Some tally or code responses. Some rate videotapes.

An author's self-awareness, experience, or resolve does not eliminate experimenter bias. In short, there are no valid excuses, financial or otherwise, for avoiding an opportunity to double-blind. Researchers looking for guidance on this matter should consult the classic book of Webb, Campbell, Schwartz, and Sechrest (1966) and an exemplary dissertation (performed on a modest budget) by Baker (1969).

 

Power and sample size. Provide information on sample size and the process that led to sample size decisions. Document the effect sizes, sampling and measurement assumptions, as well as analytic procedures used in power calculations. Because power computations are most meaningful when done before data are collected and examined, it is important to show how effect-size estimates have been derived from previous research and theory in order to dispel suspicions that they might have been taken from data used in the study or, even worse, constructed to justify a particular sample size. Once the study is analyzed, confidence intervals replace calculated power in describing results.

Largely because of the work of Cohen (1969, 1988), psychologists have become aware of the need to consider power in the design of their studies, before they collect data. The intellectual exercise required to do this stimulates authors to take seriously prior research and theory in their field, and it gives an opportunity, with incumbent risk, for a few to offer the challenge that there is no applicable research behind a given study. If exploration were not disguised in hypothetico-deductive language, then it might have the opportunity to influence subsequent research constructively.

Computer programs that calculate power for various designs and distributions are now available. One can use them to conduct power analyses for a range of reasonable alpha values and effect sizes. Doing so reveals how power changes across this range and overcomes a tendency to regard a single power estimate as being absolutely definitive.
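A brief sketch of such a calculation, assuming the statsmodels library, a two-sample t test, and a hypothetical group size of 64, shows how power varies across a range of alpha values and effect sizes rather than yielding a single definitive number.

    # Sketch: power of a two-sample t test across a grid of alphas and effect sizes.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    for alpha in (0.01, 0.05):
        for d in (0.2, 0.5, 0.8):   # effect sizes drawn from prior research and theory
            power = analysis.power(effect_size=d, nobs1=64, alpha=alpha, ratio=1.0)
            print(f"alpha={alpha:.2f}, d={d:.1f}: power={power:.2f}")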

 

Many of us encounter power issues when applying for grants. Even when not asking for money, think about power. Statistical power does not corrupt.

 

Results

Complications

Before presenting results, report complications, protocol violations, and other unanticipated events in data collection. These include missing data, attrition, and nonresponse. Discuss analytic techniques devised to ameliorate these problems. Describe nonrepresentativeness statistically by reporting patterns and distributions of missing data and contaminations. Document how the actual analysis differs from the analysis planned before complications arose. The use of techniques to ensure that the reported results are not produced by anomalies in the data (e.g., outliers, points of high influence, nonrandom missing data, selection bias, attrition problems) should be a standard component of all analyses.

As soon as you have collected your data, before you compute any statistics, look at your data. Data screening is not data snooping. It is not an opportunity to discard data or change values to favor your hypotheses. However, if you assess hypotheses without examining your data, you risk publishing nonsense.

Computer malfunctions tend to be catastrophic: A system crashes; a file fails to import; data are lost. Less well-known are more subtle bugs that can be more catastrophic in the long run. For example, a single value in a file may be corrupted in reading or writing (often in the first or last record). This circumstance usually produces a major value error, the kind of singleton that can make large correlations change sign and small correlations become large.

Graphical inspection of data offers an excellent possibility for detecting serious compromises to data integrity. The reason is simple: Graphics broadcast; statistics narrowcast. Indeed, some international corporations that must defend themselves against rapidly evolving fraudulent schemes use real-time graphic displays as their first line of defense and statistical analyses as a distant second. The following example shows why.

 

Figure 1 shows a scatter-plot matrix (SPLOM) of three variables from a national survey of approximately 3,000 counseling clients (Chartrand, 1997). This display, consisting of pairwise scatter plots arranged in a matrix, is found in most modern statistical packages. The diagonal cells contain dot plots of each variable (with the dots stacked like a histogram) and scales used for each variable. The three variables shown are questionnaire measures of respondent's age (AGE), gender (SEX), and number of years together in current relationship (TOGETHER). The graphic in Figure 1 is not intended for final presentation of results; we use it instead to locate coding errors and other anomalies before we analyze our data. Figure 1 is a selected portion of a computer screen display that offers tools for zooming in and out, examining points, and linking to information in other graphical displays and data editors. SPLOM displays can be used to recognize unusual patterns in 20 or more variables simultaneously. We focus on these three only.

There are several anomalies in this graphic. The AGE histogram shows a spike at the right end, which corresponds to the value 99 in the data. This coded value most likely signifies a missing value, because it is unlikely that this many people in a sample of 3,000 would have an age of 99 or greater. Using numerical values for missing value codes is a risky practice (Kahn & Udry, 1986).

The histogram for SEX shows an unremarkable division into two values. The histogram for TOGETHER is highly skewed, with a spike at the lower end presumably signifying no relationship. The most remarkable pattern is the triangular joint distribution of TOGETHER and AGE. Triangular joint distributions often (but not necessarily) signal an implication or a relation rather than a linear function with error. In this case, it makes sense that the span of a relationship should not exceed a person's age. Closer examination shows that something is wrong here, however. We find some respondents (in the upper left triangular area of the TOGETHER–AGE panel) claiming that they have been in a significant relationship longer than they have been alive! Had we computed statistics or fit models before examining the raw data, we would likely have missed these reporting errors. There is little reason to expect that TOGETHER would show any anomalous behavior with other variables, and even if AGE and TOGETHER appeared jointly in certain models, we may not have known anything was amiss, regardless of our care in examining residual or other diagnostic plots.
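The same kind of screening can be scripted. The sketch below, assuming pandas, seaborn, and a hypothetical data file containing the AGE, SEX, and TOGETHER columns described above, draws a scatter-plot matrix and flags the two anomalies (the 99 missing-value code and respondents reporting TOGETHER greater than AGE) before any models are fit.

    # Sketch: pre-analysis data screening (hypothetical file and column names).
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("counseling_survey.csv")   # hypothetical file

    # Numeric missing-value codes are risky: count them before modeling.
    print("AGE coded 99:", (df["AGE"] == 99).sum())

    # Logical consistency check: relationship length cannot exceed age.
    impossible = df[df["TOGETHER"] > df["AGE"]]
    print("Rows with TOGETHER > AGE:", len(impossible))

    # Scatter-plot matrix for visual screening of coding errors and anomalies.
    sns.pairplot(df[["AGE", "SEX", "TOGETHER"]])
    plt.show()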

 

The main point of this example is that the type of "atheoretical" search for patterns that we are sometimes warned against in graduate school can save us from the humiliation of having to retract conclusions we might ultimately make on the basis of contaminated data. We are warned against fishing expeditions for understandable reasons, but blind application of models without screening our data is a far graver error.

Graphics cannot solve all our problems. Special issues arise in modeling when we have missing data. The two popular methods for dealing with missing data that are found in basic statistics packages (listwise and pairwise deletion of missing values) are among the worst methods available for practical applications. Little and Rubin (1987) have discussed these issues in more detail and offer alternative approaches.
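As one example of an alternative to deletion, the following sketch applies model-based (iterative) imputation from scikit-learn to a small hypothetical data matrix; this stands in for the broader class of approaches Little and Rubin discuss and is not a recommendation of any single method.

    # Sketch: model-based imputation instead of listwise/pairwise deletion.
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables the estimator)
    from sklearn.impute import IterativeImputer

    X = np.array([[23.0, 2.1],
                  [31.0, np.nan],
                  [45.0, 4.0],
                  [np.nan, 3.2]])
    imputed = IterativeImputer(random_state=0).fit_transform(X)
    print(imputed)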

 

Analysis

Choosing a minimally sufficient analysis. The enormous variety of modern quantitative methods leaves researchers with the nontrivial task of matching analysis and design to the research question. Although complex designs and state-of-the-art methods are sometimes necessary to address research questions effectively, simpler classical approaches often can provide elegant and sufficient answers to important questions. Do not choose an analytic method to impress your readers or to deflect criticism. If the assumptions and strength of a simpler method are reasonable for your data and research problem, use it. Occam's razor applies to methods as well as to theories.

We should follow the advice of Fisher (1935):

Experimenters should remember that they and their colleagues usually know more about the kind of material they are dealing with than do the authors of text-books written without such personal experience, and that a more complex, or less intelligible, test is not likely to serve their purpose better, in any sense, than those of proved value in their own subject. (p. 49)

There is nothing wrong with using state-of-the-art methods, as long as you and your readers understand how they work and what they are doing. On the other hand, don't cling to obsolete methods (e.g., Newman–Keuls or Duncan post hoc tests) out of fear of learning the new. In any case, listen to Fisher. Begin with an idea. Then pick a method.

 

Computer programs. There are many good computer programs for analyzing data. More important than choosing a specific statistical package is verifying your results, understanding what they mean, and knowing how they are computed. If you cannot verify your results by intelligent "guesstimates," you should check them against the output of another program. You will not be happy if a vendor reports a bug after your data are in print (not an infrequent event). Do not report statistics found on a printout without understanding how they are computed or what they mean. Do not report statistics to a greater precision than is supported by your data simply because they are printed that way by the program. Using the computer is an opportunity for you to control your analysis and design. If a computer program does not provide the analysis you need, use another program rather than let the computer shape your thinking.

There is no substitute for common sense. If you cannot use rules of thumb to detect whether the result of a computation makes sense to you, then you should ask yourself whether the procedure you are using is appropriate for your research. Graphics can help you to make some of these determinations; theory can help in other cases. But never assume that using a highly regarded program absolves you of the responsibility for judging whether your results are plausible. Finally, when documenting the use of a statistical procedure, refer to the statistical literature rather than a computer manual; when documenting the use of a program, refer to the computer manual rather than the statistical literature.
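The sketch below illustrates this kind of verification: a two-sample t statistic from SciPy is checked against the textbook pooled-variance formula computed by hand. The data are hypothetical.

    # Sketch: verify a library's t statistic against a hand computation.
    import numpy as np
    from scipy import stats

    a = np.array([5.1, 4.8, 5.6, 5.0, 4.9])
    b = np.array([4.2, 4.5, 4.1, 4.6, 4.4])

    t_lib, p_lib = stats.ttest_ind(a, b, equal_var=True)

    # Hand computation of the pooled-variance t statistic.
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    t_hand = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / na + 1 / nb))

    print(f"library t = {t_lib:.3f}, hand t = {t_hand:.3f}, p = {p_lib:.3f}")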

 

Assumptions. You should take efforts to assure that the underlying assumptions required for the analysis are reasonable given the data. Examine residuals carefully. Do not use distributional tests and statistical indexes of shape (e.g., skewness, kurtosis) as a substitute for examining your residuals graphically.

Using a statistical test to diagnose problems in model fitting has several shortcomings. First, diagnostic significance tests based on summary statistics (such as tests for homogeneity of variance) are often impractically sensitive; our statistical tests of models are often more robust than our statistical tests of assumptions. Second, statistics such as skewness and kurtosis often fail to detect distributional irregularities in the residuals. Third, statistical tests depend on sample size, and as sample size increases, the tests often will reject innocuous assumptions. In general, there is no substitute for graphical analysis of assumptions.

Modern statistical packages offer graphical diagnostics for helping to determine whether a model appears to fit data appropriately. Most users are familiar with residual plots for linear regression modeling. Fewer are aware that John Tukey's paradigmatic equation, data = fit + residual, applies to a more general class of models and has broad implications for graphical analysis of assumptions. Stem-and-leaf plots, box plots, histograms, dot plots, spread/level plots, probability plots, spectral plots, autocorrelation and cross-correlation plots, co-plots, and trellises (Chambers, Cleveland, Kleiner, & Tukey, 1983; Cleveland, 1995; Tukey, 1977) all serve at various times for displaying residuals, whether they arise from analysis of variance (ANOVA), nonlinear modeling, factor analysis, latent variable modeling, multidimensional scaling, hierarchical linear modeling, or other procedures.
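A minimal sketch of such graphical diagnostics, using simulated data with statsmodels and matplotlib, plots residuals against fitted values and a normal probability plot for an ordinary least squares fit.

    # Sketch: graphical assumption checking for a simple linear model (simulated data).
    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 200)
    y = 2.0 + 0.5 * x + rng.normal(0, 1, 200)

    model = sm.OLS(y, sm.add_constant(x)).fit()

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
    ax1.scatter(model.fittedvalues, model.resid, s=10)   # residuals vs. fitted values
    ax1.axhline(0, color="gray")
    ax1.set(xlabel="Fitted values", ylabel="Residuals")
    sm.qqplot(model.resid, line="45", fit=True, ax=ax2)  # normal probability plot
    plt.tight_layout()
    plt.show()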

 

Hypothesis tests. It is hard to imagine a situation in which a dichotomous accept–reject decision is better than reporting an actual p value or, better still, a confidence interval. Never use the unfortunate expression "accept the null hypothesis." Always provide some effect-size estimate when reporting a p value. Cohen (1994) has written on this subject in this journal. All psychologists would benefit from reading his insightful article.

Effect sizes. Always present effect sizes for primary outcomes. If the units of measurement are meaningful on a practical level (e.g., number of cigarettes smoked per day), then we usually prefer an unstandardized measure (regression coefficient or mean difference) to a standardized measure (r or d). It helps to add brief comments that place these effect sizes in a practical and theoretical context.

APA's (1994) publication manual included an important new "encouragement" (p. 18) to report effect sizes. Unfortunately, empirical studies of various journals indicate that the effect size of this encouragement has been negligible (Keselman et al., 1998; Kirk, 1996; Thompson & Snyder, 1998). We must stress again that reporting and interpreting effect sizes in the context of previously reported effects is essential to good research. It enables readers to evaluate the stability of results across samples, designs, and analyses. Reporting effect sizes also informs power analyses and meta-analyses needed in future research.

 

Fleiss (1994), Kirk (1996), Rosenthal (1994), and Snyder and Lawson (1993) have summarized various measures of effect sizes used in psychological research. Consult these articles for information on computing them. For a simple, general purpose display of the practical meaning of an effect size, see Rosenthal and Rubin (1982). Consult Rosenthal and Rubin (1994) for information on the use of "counternull intervals" for effect sizes, as alternatives to confidence intervals.
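For a concrete illustration, the sketch below reports both an unstandardized effect (a mean difference in cigarettes smoked per day, with a 95% confidence interval) and a standardized one (Cohen's d). The data and group labels are hypothetical.

    # Sketch: unstandardized and standardized effect sizes (hypothetical data).
    import numpy as np
    from scipy import stats

    treated = np.array([12, 9, 15, 10, 8, 11, 13], dtype=float)
    control = np.array([18, 16, 20, 17, 19, 15, 21], dtype=float)

    diff = treated.mean() - control.mean()          # unstandardized effect (cigarettes/day)

    na, nb = len(treated), len(control)
    sp2 = ((na - 1) * treated.var(ddof=1) + (nb - 1) * control.var(ddof=1)) / (na + nb - 2)
    se = np.sqrt(sp2 * (1 / na + 1 / nb))
    t_crit = stats.t.ppf(0.975, na + nb - 2)        # 95% CI for the mean difference
    ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

    d = diff / np.sqrt(sp2)                         # Cohen's d (standardized effect)
    print(f"mean difference = {diff:.1f}, 95% CI [{ci_low:.1f}, {ci_high:.1f}], d = {d:.2f}")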

 

Interval estimates. Interval estimates should be given for any effect sizes involving principal outcomes. Provide intervals for correlations and other coefficients of association or variation whenever possible.

Confidence intervals are usually available in statistical software; otherwise, confidence intervals for basic statistics can be computed from typical output. Comparing confidence intervals from a current study to intervals from previous, related studies helps focus attention on stability across studies (Schmidt, 1996). Collecting intervals across studies also helps in constructing plausible regions for population parameters. This practice should help prevent the common mistake of assuming a parameter is contained in a confidence interval.
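The sketch below shows two such computations from quantities that appear in typical output: a t-based interval for a mean given its standard error and degrees of freedom, and a Fisher-z interval for a correlation. The numerical values are hypothetical.

    # Sketch: confidence intervals computed from typical output (hypothetical values).
    import numpy as np
    from scipy import stats

    # 95% CI for a mean from its estimate, standard error, and degrees of freedom.
    mean, se, df = 4.2, 0.35, 29
    t_crit = stats.t.ppf(0.975, df)
    print(f"mean 95% CI: [{mean - t_crit * se:.2f}, {mean + t_crit * se:.2f}]")

    # 95% CI for a correlation via Fisher's z transformation.
    r, n = 0.42, 120
    z = np.arctanh(r)
    z_se = 1 / np.sqrt(n - 3)
    lo, hi = np.tanh([z - 1.96 * z_se, z + 1.96 * z_se])
    print(f"r 95% CI: [{lo:.2f}, {hi:.2f}]")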

 

Multiplicities. Multiple outcomes require special handling. There are many ways to conduct reasonable inference when faced with multiplicity (e.g., Bonferroni correction of p values, multivariate test statistics, empirical Bayes methods). It is your responsibility to define and justify the methods used.
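As a small illustration, the sketch below adjusts a hypothetical family of p values with the Bonferroni and Holm procedures using statsmodels; the choice among such procedures still has to be defined and justified for the study at hand.

    # Sketch: adjusting a family of p values for multiplicity (hypothetical p values).
    from statsmodels.stats.multitest import multipletests

    p_values = [0.004, 0.020, 0.035, 0.110, 0.450]
    for method in ("bonferroni", "holm"):
        reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method=method)
        print(method, [f"{p:.3f}" for p in p_adj], reject.tolist())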

 

Statisticians speak of the curse of dimensionality. To paraphrase, multiplicities are the curse of the social sciences. In many areas of psychology, we cannot do research on important problems without encountering multiplicity. We often encounter many variables and many relationships.

One of the most prevalent strategies psychologists use to handle multiplicity is to follow an ANOVA with pairwise multiple-comparison tests. This approach is usually wrong for several reasons. First, pairwise