The Journal of the American Dental Association

J Am Dent Assoc, Vol 140, No 1, 48-54.
© 2009 American Dental Association


CLINICAL PRACTICE

JADA Continuing Education

The clinician’s guide to the literature

Interpreting results



Barbara L. Greenberg, MSc, PhD and Mel L. Kantor, DDS, MPH, PhD


   ABSTRACT
 
Background. Oral health care professionals must have up-to-date information to guide clinical practice. The peer-reviewed literature contains the most reliable and current information. The clinical research literature relies on statistical analysis of data to make inferences and draw conclusions. In this article, the authors explore the fundamental principles that underlie statistical testing.

Conclusions and Practice Implications. Having the fundamental tools to critically interpret the results presented in the literature is one of the essential elements for appropriately translating clinical research into practice.

Key Words: Variability; hypothesis testing; statistical interpretation; confidence intervals; sample size

Abbreviations: N: Noise. • S: Signal.

Knowledge is power.

Sir Francis Bacon (1597)

Oral health care professionals rely on up-to-date information to guide clinical practice. The most reliable and current information can be found in the peer-reviewed literature. The clinical research literature relies on statistical analysis of data to make inferences and draw conclusions. Although many statistical tests are available to the researcher, the same fundamental concepts underlie statistical testing. Understanding these concepts will greatly facilitate critical review of the clinical research literature by the busy clinician. This article addresses these fundamental principles.

Clinicians and clinical researchers try to make inferences about the state of affairs using available information. Given that this information often is incomplete or uncertain, what tools can be used to ensure that the inferences made are valid? Clinicians use diagnostic tests and researchers use statistics. When clinicians use diagnostic tests, they would like test results that are always positive when the disease is truly present and results that are always negative when the disease is truly absent. However, diagnostic test results do not always reflect the patient’s true and often hidden status; there almost always is some uncertainty and, as a result, some misclassification (false-positive or false-negative results). Nonetheless, on the basis of the diagnostic test results (and even a physical examination can be thought of as a diagnostic test), clinicians draw conclusions and often must take action based on those conclusions. They use measures of diagnostic efficacy that quantify the uncertainty of the test, such as sensitivity and specificity, to interpret the test results to draw the "correct" conclusion.
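As an illustration of the measures of diagnostic efficacy mentioned above, the following minimal sketch computes sensitivity and specificity from the four cells of a two-by-two table. The function name and the patient counts are hypothetical and are not taken from the article.

```python
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Return (sensitivity, specificity) from the four cells of a 2x2 table."""
    sensitivity = tp / (tp + fn)  # share of diseased patients who test positive
    specificity = tn / (tn + fp)  # share of disease-free patients who test negative
    return sensitivity, specificity


# Hypothetical counts: 100 patients with the disease, 100 without it
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=80, fp=20)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

In this invented example, 10 false-negative and 20 false-positive results are the misclassifications that the two measures quantify.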

Clinical researchers draw a study sample of subjects or patients from a target population of interest. Researchers would like it if the results observed in the sample were an accurate reflection of the results that would be observed if the entire population could be studied. Similar to diagnostic testing by clinicians, researchers conduct statistical tests on the observable data to make inferences about some underlying truth. The statistical test provides a measure of the likelihood of misclassification (false-positive or false-negative results) or uncertainty regarding the results obtained from the study sample. Researchers use this information to interpret the study results and to draw the "correct" conclusion.

The purpose of statistical testing is to use data collected in a study to make inferences about the relevant overall population. One of the strengths of statistical testing is the ability to quantify the uncertainty of the results when making inferences. Researchers use two approaches to quantify this uncertainty: hypothesis testing, which is associated with P values, and estimation, which is associated with confidence intervals (CIs). Fundamental to quantifying uncertainty is estimating variability.


   VARIABILITY
 
Variability is the source of uncertainty in clinical studies. Variability is a result of the limits of precision when a measurement is made, as well as the natural variation or randomness within a group. Variability commonly is quantified by two measures: the standard deviation (SD) and the standard error (SE).

The SD quantifies the variability of individual measurements in a study sample. The SD characterizes the distribution of all of the sample data points around the sample mean (that is, the arithmetic average). For data that are normally distributed (Figure 1), the area under the curve represents all of the observations in the distribution, and the mean ± 1.96 SD defines the area under the curve where 95 percent of the individual observations lie. A sample data set with a mean of 15.5 and an SD of 3.0 indicates that approximately 95 percent of the sample data points fall between 9.62 [15.5 − (1.96 × 3.0)] and 21.38 [15.5 + (1.96 × 3.0)].
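The arithmetic in this example can be checked with a short sketch. It uses the mean of 15.5 and SD of 3.0 given in the text; the function name is hypothetical, and the check of the enclosed area assumes an exactly normal distribution.

```python
from statistics import NormalDist


def range_95(mean: float, sd: float) -> tuple[float, float]:
    """Return the limits mean - 1.96 SD and mean + 1.96 SD."""
    half_width = 1.96 * sd
    return mean - half_width, mean + half_width


low, high = range_95(15.5, 3.0)
print(f"{low:.2f} to {high:.2f}")  # 9.62 to 21.38

# Area under a normal curve between the two limits
dist = NormalDist(mu=15.5, sigma=3.0)
share = dist.cdf(high) - dist.cdf(low)
print(f"share of observations inside the limits: {share:.3f}")
```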



 
Figure 1. With normally distributed sample data points, 95 percent of the observations lie within ± 1.96 standard deviations of the mean. The remaining 5 percent are distributed evenly between the two tails (shaded portions). Moreover, in a theoretical normal distribution of all possible sample means for a given population, 95 percent of the observations lie within ± 1.96 standard errors of the mean.

 
A small SD indicates little variability in the sample data, with the data points being tightly clustered around the sample mean. A large SD indicates a lot of variability in the sample data, with the data points being widely distributed around the mean. The SD is the most appropriate statistic for reporting measurement variability in a study for normally distributed data.

While the SD refers to individual observations in a study sample, the SE quantifies the variability of observed sample means if the study were repeated many times. Consider the following example: A sample is drawn from a population, the sample mean is calculated and the sample is returned to the population. Another sample of the same size is drawn from the population, its mean is calculated and this sample is returned to the population. Repeating this process many times and plotting the sample means would yield a normal curve (Figure 1). The SE quantifies the variability of sample means that would be obtained. In a theoretical normal distribution of all possible sample means for a given population, the mean ± 1.96 SE defines the area under the curve where 95 percent of the repeated sample means lie. The remaining 5 percent is distributed equally in the two tails.
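The repeated-sampling thought experiment can be simulated. The sketch below is an illustration under stated assumptions: the population (10,000 normally distributed values), the sample size of 25 and the number of repetitions are all invented. It compares the spread of the simulated sample means with the population SD divided by the square root of the sample size.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 values with mean near 15.5 and SD near 3.0
population = [random.gauss(15.5, 3.0) for _ in range(10_000)]
n = 25  # size of each sample (an arbitrary choice)

# Draw a sample, record its mean, "return" it to the population, and repeat
sample_means = [
    statistics.mean(random.sample(population, n)) for _ in range(2_000)
]

empirical_se = statistics.stdev(sample_means)              # spread of the sample means
theoretical_se = statistics.pstdev(population) / n ** 0.5  # population SD / sqrt(n)

print(f"SD of the simulated sample means: {empirical_se:.3f}")
print(f"population SD / sqrt(n):          {theoretical_se:.3f}")
```

The two printed values should be close to each other, which is the sense in which the SE describes the variability of sample means rather than of individual observations.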

Investigators should always report measurement variability in a research paper. The SD is the most appropriate statistic for reporting the variability of the measurements that underlie a sample statistic such as a mean. In some cases, authors incorrectly use the SE to indicate measurement variability in the study sample. Because the SE of the mean is derived by dividing the SD by the square root of the sample size, the SE is smaller than the SD whenever the sample contains more than one observation, and it shrinks further as the sample size grows. It is misleading to use the smaller SE to describe the measurement variability in a study sample, and researchers should avoid this practice. In summary, the SD is used when describing a sample and the SE is used when making inferences either through hypothesis testing or estimation. The table presents a theoretical data set with the mean, SD and SE.



 
TABLE Hypothetical data for the number of decayed surfaces detected in 10 subjects.
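A minimal sketch of how the mean, SD and SE for such a data set could be computed follows. The 10 values below are invented for illustration and are not the values from the article's table.

```python
import math
import statistics

# Invented counts of decayed surfaces in 10 subjects (not the article's data)
decayed_surfaces = [2, 4, 4, 5, 6, 6, 7, 8, 9, 9]

n = len(decayed_surfaces)
mean = statistics.mean(decayed_surfaces)
sd = statistics.stdev(decayed_surfaces)  # sample SD (n - 1 in the denominator)
se = sd / math.sqrt(n)                   # SE of the mean = SD / sqrt(n)

print(f"n = {n}, mean = {mean:.2f}, SD = {sd:.2f}, SE = {se:.2f}")
```

With 10 subjects the SE is the SD divided by about 3.16, which shows why reporting the SE in place of the SD understates the spread of the individual measurements.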

 
Statistical tests involve the use of the SE to determine whether an observed difference between two groups or an observed association between two factors is greater than what would be expected by chance alone. Given that the SE can be calculated for many sample statistics in addition to the mean, it plays a major role in hypothesis testing and estimation for making statistical inferences.


   HYPOTHESIS TESTING AND THE P VALUE
 
Hypothesis testing is one of the fundamental approaches for making inferences about a proposed relationship between an exposure or intervention and an outcome. There are two types of hypotheses: the null hypothesis and the alternative or research hypothesis. The null hypothesis is the assumption that there is no association between a factor of interest (an exposure or intervention) and the outcome of interest (a disease). The alternative or research hypothesis states that there is a "real" or "true" association between the exposure and the outcome. Box 1 presents an example of null and research hypotheses. The ultimate goal is to test the validity of the proposed hypothesis. The question is this: Are the observed data due to chance alone or is it more likely that the data reflect a true relationship?



 
BOX 1 Example of null and research hypotheses.

 
Statistical testing is aimed at rejecting or not rejecting the null hypothesis rather than proving or disproving the alternative hypothesis. If the data suggest that the null hypothesis is false, then the null hypothesis is rejected, a de facto acceptance of the alternative hypothesis that the data indicate a true relationship. However, the alternative hypothesis is never tested directly.

Statistical inference commonly is done by means of hypothesis testing using P values. The P value answers this question: If the null hypothesis were true, how likely is it that chance alone would produce results at least as extreme as those observed in the study? If that likelihood is "sufficiently low," the null hypothesis is rejected. The investigator must establish a threshold for "sufficiently low." The threshold is known as the α level and is most commonly set at 5 percent.

Two-tailed versus one-tailed test. In a two-tailed test, the 5 percent is distributed equally in the high and low ends (or tails) of the sampling distribution (Figure 1). The two-tailed test allows for a significant difference in either direction; for example, the new treatment could be better or worse than the standard treatment or placebo. Alternatively, in a one-tailed test, the entire 5 percent is placed in one of the two tails. The most conservative approach to hypothesis testing is to use the two-tailed test. The one-tailed test can be used only when a difference in the opposite direction can be ruled out in advance (for example, when it is known that the new treatment cannot be worse than the standard treatment or placebo), an assumption that rarely can be made.
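The difference between the two kinds of test can be sketched numerically. The example assumes a test statistic that follows a standard normal distribution and an observed effect in the hypothesized direction; the function name is hypothetical.

```python
from statistics import NormalDist


def p_values(z: float) -> tuple[float, float]:
    """Return (one-tailed, two-tailed) P values for a standard normal statistic z."""
    upper_tail = 1.0 - NormalDist().cdf(abs(z))  # area beyond |z| in a single tail
    return upper_tail, 2.0 * upper_tail          # two tails double that area


one_tailed, two_tailed = p_values(1.96)
print(f"one-tailed P = {one_tailed:.3f}, two-tailed P = {two_tailed:.3f}")
```

For the same data the one-tailed P value is half the two-tailed value, which is why the two-tailed test is the more conservative choice.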

If a P value is below the threshold (P < .05), the null hypothesis is rejected and the observed effect is considered a "true" relationship. A P value less than .05 indicates that, if the null hypothesis were true, an outcome at least as extreme as the observed effect would occur by chance alone less than 5 percent of the time, suggesting that the effect is not a result of chance. The more certain an investigator wants to be, the lower the α level should be set. To use our earlier analogy of diagnostic testing in clinical practice, the α level is the accepted likelihood of drawing a false-positive conclusion from a statistical test when no true relationship exists.

Interpreting a P value. Interpretation of the P value is basically a binary decision. When the P value is less than the α level, the result is statistically significant; when the P value is equal to or greater than the α level, the result is not statistically significant. It is incorrect to say a smaller P value is more significant than a larger P value. However, many investigators do not follow this interpretation and incorrectly refer to results as "very" or "extremely" significant when P values are small (P < .001). A small P value means the data are less likely to be due to chance alone, thus reducing the chance of misclassification or making a false-positive conclusion. A P value greater than the α level is not proof that the null hypothesis is true, but it does indicate that the evidence is not strong enough to reject it. Box 2 is an example of hypothesis testing.



 
BOX 2 Example of hypothesis testing.

 
Hypothesis testing using the P value also can be conceptualized as a signal detection task that can be represented as a signal-to-noise ratio. Statistical tests compare the strength of an association or difference between two variables (the signal) with the natural variability in the data (the background noise). A large signal-to-noise ratio is unlikely to occur by chance alone and is associated with a small P value, while a small signal-to-noise ratio is likely due to chance alone and is associated with a large P value (Figure 2). In the earlier fluoride example, the difference in the number of carious lesions between subjects using the fluoride rinse and those using the placebo is the signal, and the SE of the difference is the noise. If the signal-to-noise ratio is large, the P value will be small, and there will be a statistically significant difference between the groups.
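The signal-to-noise idea can be sketched as follows. The lesion counts are invented, the function name is hypothetical, and the P value uses a large-sample normal approximation chosen for simplicity; with groups this small, an analysis would ordinarily use a t distribution instead.

```python
import math
from statistics import NormalDist, mean, stdev


def signal_to_noise_test(group_a: list[float], group_b: list[float]) -> tuple[float, float]:
    """Return (signal-to-noise ratio, two-tailed P) under a normal approximation."""
    signal = mean(group_a) - mean(group_b)  # difference between the sample means
    noise = math.sqrt(
        stdev(group_a) ** 2 / len(group_a) + stdev(group_b) ** 2 / len(group_b)
    )  # SE of the difference
    ratio = signal / noise
    p = 2.0 * (1.0 - NormalDist().cdf(abs(ratio)))
    return ratio, p


# Invented numbers of new carious lesions per subject
placebo = [3, 4, 2, 5, 4, 3, 6, 4, 5, 3, 4, 5]
fluoride = [2, 3, 1, 3, 2, 2, 4, 3, 2, 1, 3, 2]

ratio, p = signal_to_noise_test(placebo, fluoride)
print(f"signal-to-noise ratio = {ratio:.2f}, P = {p:.4f}")
```

Either a larger difference between the group means or a smaller SE of the difference raises the ratio and lowers the P value.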



 
Figure 2. A. Hypothesis testing is akin to evaluating the signal-to-noise ratio, where the signal (S) might be the difference between sample means and the noise (N) is a result of the sample variability. If the signal-to-noise ratio is close to 1, then the P value is going to be greater than .05 and the observed difference is not statistically significant. B. The signal-to-noise ratio will be larger if the signal is larger given the same amount of noise. C. The signal-to-noise ratio also will be larger if the noise is reduced given the same signal.

 

   ESTIMATION AND CONFIDENCE INTERVALS
 
While statistical inference is done most commonly with hypothesis testing using a P value, an alternative approach is to use the CI. The CI not only indicates significance, it also quantifies variability or uncertainty of the result used to make the statistical inference.

It is important to recognize the fundamental difference between statistical inference using hypothesis testing and CI. Hypothesis testing with the use of a P value is based on a single value or point estimate derived from the data, and it results in a binary "reject" or "not reject" decision. For example, when the means of two groups are compared, the means and the difference between the means are point estimates. Given that data from any single study are derived from a sample of the larger population of all eligible subjects, a different study sample could yield different point estimates. If the study were repeated using different samples, the resulting multiple point estimates could be tightly clustered or widely distributed. Therefore, a potentially more informative approach involves the use of an interval estimate that incorporates a range that quantifies the natural variability in the data; this is known as a CI. The strength of the CI is that the range of the interval reflects the variability that would be observed if the study were repeated many times.

The CI gives upper and lower limits for the population parameter that are consistent with the variability in the sample data. Given that the SE quantifies the variability of the study sample statistic (for example, sample mean) around the true population parameter (for example, population mean), it is used to calculate the CI. The sample statistic (for example, mean) + 1.96 SE defines the upper limit of the 95 percent CI, and the sample statistic − 1.96 SE defines the lower limit of the 95 percent CI. The greater the variability in the sample data, the wider the CI; that is, the less precisely we have estimated the population mean.
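A minimal sketch of the calculation for a sample mean follows, using the 1.96 multiplier given in the text. The data values and the function name are invented; for a sample this small, many analyses would use a slightly larger multiplier from the t distribution.

```python
import math
import statistics


def ci_95_mean(values: list[float]) -> tuple[float, float]:
    """Return the limits mean - 1.96 SE and mean + 1.96 SE."""
    m = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))  # SE of the mean
    return m - 1.96 * se, m + 1.96 * se


# Invented measurements from a hypothetical study sample
values = [12.1, 15.3, 14.8, 16.9, 13.5, 17.2, 15.0, 18.4, 14.1, 16.7]
low, high = ci_95_mean(values)
print(f"mean = {statistics.mean(values):.2f}, 95% CI = {low:.2f} to {high:.2f}")
```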

The width of the interval also is a function of the degree of confidence one wants to have in the interval. Similar to the convention of setting α at .05, a 95 percent CI is used most commonly. Thus, if the study were repeated many times, the 95 percent CIs so constructed would include the true population value 95 times out of 100; the true population value would lie outside the interval five times out of 100. This is the margin of error that the investigator is willing to accept. The margin of error that is widely reported for the results of opinion or political polls is nothing more than the half-width of the 95 percent CI for the percentage of respondents who feel a certain way or will vote for a particular candidate.
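The "95 times out of 100" interpretation can be illustrated by simulation. In the sketch below the true mean, the true SD, the sample size and the number of simulated studies are all invented; each simulated study builds its own interval as the sample mean ± 1.96 SE, and the code counts how often the interval contains the true mean.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population values and study design
TRUE_MEAN, TRUE_SD, N, STUDIES = 15.5, 3.0, 50, 2_000

covered = 0
for _ in range(STUDIES):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)
    # Does this study's 95 percent CI include the true population mean?
    if m - 1.96 * se <= TRUE_MEAN <= m + 1.96 * se:
        covered += 1

coverage = covered / STUDIES
print(f"{coverage:.1%} of the simulated intervals include the true mean")
```

The printed share should be close to 95 percent; it can fall slightly below because the SE is itself estimated from each sample.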

Interpreting a CI. The CI is used to test significance and to estimate the magnitude of an observed association. In epidemiologic and clinical studies, estimates of the magnitude of an observed association are referred to as measures of disease association. To illustrate how to interpret a CI, we use the measure of disease association known as the relative risk (RR).

The RR provides information about the strength of the association as well as the direction of the association. The RR is the risk of developing disease among those exposed divided by the risk of developing disease among those not exposed. Given that the RR is a ratio, if the risks are the same between the exposure groups, then RR equals 1. An RR of 1 is the same as the null hypothesis of "no association." If the risk among exposed subjects is greater than the risk among nonexposed subjects, the RR is greater than 1 and indicates a positive association between the exposure and the outcome. If the risk among exposed subjects is less than the risk among nonexposed subjects, the RR is less than 1 and indicates a negative association between the exposure and the outcome; the exposure then is considered a putative protective factor.
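The definition of the RR translates directly into a short calculation. The cohort counts and function names below are hypothetical.

```python
def relative_risk(
    exposed_cases: int, exposed_total: int, unexposed_cases: int, unexposed_total: int
) -> float:
    """Risk of disease among the exposed divided by risk among the nonexposed."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed


def describe(rr: float) -> str:
    """Translate an RR into the direction of the association."""
    if rr > 1:
        return "positive association"
    if rr < 1:
        return "negative association (putative protective factor)"
    return "no association"


# Hypothetical cohort: 30 of 100 exposed and 15 of 100 nonexposed subjects develop disease
rr = relative_risk(30, 100, 15, 100)
print(f"RR = {rr:.2f}: {describe(rr)}")
```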

Using the observed RR (our sample statistic of interest), one calculates the upper and lower limits of the 95 percent CI around the RR by using the formula (sample statistic ± 1.96 SE). A 95 percent CI around the RR that includes 1 indicates that the true value of the RR could be 1. Therefore, the RR is not considered to be statistically significant and the evidence does not support an association between exposure and disease. If the 95 percent CI does not include 1, the result is considered to be statistically significant and provides evidence of a significant association between exposure and disease. Box 3 shows an example of interpreting a CI.



 
BOX 3 Example of interpreting a confidence interval (CI).
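The following sketch shows one way a 95 percent CI around an RR might be computed and checked against the null value of 1. The article gives only the general form (sample statistic ± 1.96 SE); applying that form to the natural logarithm of the RR and then converting back is a common textbook approach that is assumed here, not one stated in the article. The counts and function name are hypothetical.

```python
import math


def rr_with_ci(a: int, n1: int, c: int, n0: int) -> tuple[float, float, float]:
    """Return (RR, lower limit, upper limit) for a cases among n1 exposed
    subjects and c cases among n0 nonexposed subjects."""
    rr = (a / n1) / (c / n0)
    # SE of ln(RR): an assumption of this sketch (log-scale method)
    se_log_rr = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n0)
    low = math.exp(math.log(rr) - 1.96 * se_log_rr)
    high = math.exp(math.log(rr) + 1.96 * se_log_rr)
    return rr, low, high


# Hypothetical cohort: 30 of 100 exposed and 15 of 100 nonexposed subjects develop disease
rr, low, high = rr_with_ci(30, 100, 15, 100)
significant = not (low <= 1.0 <= high)  # significant only if the CI excludes 1
print(f"RR = {rr:.2f}, 95% CI = {low:.2f} to {high:.2f}, significant: {significant}")
```

Because the interval is built on the log scale, it is not symmetric around the RR, and its lower limit cannot fall below zero.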

 
Further interpretation of the 95 percent CI. The upper and lower limits of the CI denote the possible effect size and potential clinical significance. In some instances, the results may not be considered statistically significant (that is, the 95 percent CI around the RR includes 1), but the potential magnitude of the effect may be great. For example, a 95 percent CI of 1 to 35 is not statistically significant, but it does suggest that the magnitude of the effect could be large (upper limit of 35). In contrast, a 95 percent CI of 1.07 to 1.09 is considered statistically significant, but the magnitude of the effect is very modest (upper limit of 1.09), and it may be of relatively little clinical importance. A statistically significant result does not necessarily imply a clinically important result. Therefore, it is advisable to look at the potential effect size as well as the significance of the results. All investigators should indicate what is considered a clinically important outcome, and practitioners should use clinical judgment when interpreting results.


   STATISTICAL INFERENCE AND SAMPLE SIZE
 
For the statistical inference to be valid, an adequate sample size is necessary to ensure enough power to minimize misclassification and to correctly decide whether to reject the null hypothesis. Power is the probability that the study will be able to detect a difference or an association if, in fact, one exists (by analogy, a true positive diagnostic test result). By convention, the desired level of power is 80 percent or greater. Without sufficient power, a study result of no statistical significance could be a false-negative finding caused by an inadequate sample size rather than a valid reflection of the underlying truth.
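The dependence of power on sample size can be sketched with a normal approximation for comparing the means of two equally sized groups. The true difference, the SD and the function name are hypothetical, and the approximation ignores the negligible chance of a significant result in the wrong direction.

```python
import math
from statistics import NormalDist


def approx_power(delta: float, sd: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-tailed comparison of two group means."""
    z = NormalDist()
    se_diff = sd * math.sqrt(2.0 / n_per_group)  # SE of the difference in means
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)        # about 1.96 when alpha is .05
    return z.cdf(abs(delta) / se_diff - z_crit)


# Hypothetical true difference of 1.0 with an SD of 2.0 in each group
for n in (10, 20, 40, 80):
    print(f"n per group = {n:3d}, power = {approx_power(1.0, 2.0, n):.2f}")
```

Under these invented values, only the largest of the four sample sizes gives power above the conventional 80 percent.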

Therefore, if the study reveals no significant outcome (P ≥ .05 or a 95 percent CI that includes the null value), two questions should be asked: Was there enough power? Was the sample size large enough? Sample size calculations should always be done at the outset of the study during the design phase. Establishing the clinically important difference or outcome is fundamental for sample size calculations, and the outcome should be generated on the basis of data in the established literature combined with clinical expertise.
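A sample size calculation of the kind described might look like the sketch below. It uses a standard normal-approximation formula for comparing two group means, which is an assumption of this example rather than a method given in the article; the clinically important difference, the SD and the function name are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(delta: float, sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate subjects needed per group to detect a difference of delta."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # two-tailed critical value
    z_power = z.inv_cdf(power)              # value matching the desired power
    return math.ceil(2.0 * ((z_alpha + z_power) * sd / delta) ** 2)


# Hypothetical clinically important difference of 1.0 with an SD of 2.0
print(n_per_group(delta=1.0, sd=2.0))  # 63
# Halving the difference to be detected roughly quadruples the required sample
print(n_per_group(delta=0.5, sd=2.0))
```

The calculation makes explicit why the clinically important difference must be established first: the required sample size depends on the square of its inverse.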

In general, the larger the sample size, the smaller the difference or the weaker the association that can be detected as statistically significant. However, there also is danger from the other side of the spectrum, where very large samples can result in a statistically significant finding for even a very small effect that may be of minimal or no real clinical significance. A statistically significant difference does not necessarily imply a clinically important difference.
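The effect of very large samples can be sketched by holding a small difference fixed and letting the sample size grow. The difference of 0.1, the SD of 2.0 and the function name are invented, and the P value uses a normal approximation.

```python
import math
from statistics import NormalDist


def two_tailed_p(delta: float, sd: float, n_per_group: int) -> float:
    """Two-tailed P value for an observed difference delta between two group means."""
    se_diff = sd * math.sqrt(2.0 / n_per_group)  # noise shrinks as n grows
    z = delta / se_diff
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# The same small observed difference (0.1) at increasing sample sizes
for n in (50, 500, 5_000, 50_000):
    print(f"n per group = {n:6d}, P = {two_tailed_p(0.1, 2.0, n):.4f}")
```

The observed difference never changes, yet the result crosses the .05 threshold once the sample is large enough, which is why statistical significance alone does not establish clinical importance.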


   SUMMARY
 
This article covered several of the basic concepts that underlie the majority of statistical analyses reported in the clinical research literature: variability, hypothesis testing, CIs and sample size. Having the fundamental tools to critically interpret the results presented in the literature is one of the essential elements for appropriately translating clinical research into practice.


   FOOTNOTES
 

Dr. Greenberg is interim associate dean for research, Office of Research, New Jersey Dental School, University of Medicine and Dentistry of New Jersey, 110 Bergen St., Newark, N.J. 07103, e-mail "greenbbl{at}umdnj.edu". Address reprint requests to Dr. Greenberg.


Dr. Kantor is a professor, Department of Diagnostic Sciences, New Jersey Dental School, University of Medicine and Dentistry of New Jersey, Newark.


Disclosure. Drs. Greenberg and Kantor did not report any disclosures.


The authors acknowledge the valuable input of Dr. Curt Bay, A.T. Still University, Mesa, Ariz., and the inspiration of friend and colleague Dr. Leo Lee Lewis.


 




