Introduction:
Statistical tests and their misinterpretation have been a topic of concern for decades; some scientific journals have even banned their use. Despite this, statistical methods remain prevalent, so improving their basic teaching and understanding is important. To clarify the meaning of significance tests, confidence intervals, and statistical power, a recent article published in the European Journal of Epidemiology reviews 25 common misconceptions and emphasizes the importance of examining and synthesizing all results relating to a scientific question.
The article also highlights why statistical tests should not be the sole input for inferences or decisions about associations or effects. While more detailed discussions of the topic can be found in many sources, the article offers a critical and comprehensive guide to improving the use and interpretation of statistical tests in research.
The challenge of defining the scope of statistical models for accurate representation of observed and hypothetical data:
One of the main challenges in statistical modeling is defining the scope of the model, which should allow for a good representation of both observed and hypothetical alternative data. This is often difficult to achieve, particularly in cases where multiple outcome measures or predictive factors have been measured, and many analysis choices have been made after the data were collected.
Another challenge is understanding and assessing the underlying assumptions, which are often presented in a highly compressed and abstract form. As a result, many assumptions go unrecognized and unremarked upon by users and consumers of statistics. Nonetheless, all statistical methods and interpretations are premised on the model assumptions, which must provide a valid representation of the variation we would expect to see across data sets.
In most applications of statistical testing, one of the assumptions in the model is a hypothesis that a particular effect has a specific size, and this is the assumption targeted by the analysis. The targeted assumption is called the study hypothesis or test hypothesis, and the statistical methods used to evaluate it are called statistical hypothesis tests. Most often, the targeted effect size is a “null” value representing zero effect, which is tested through a null hypothesis. However, other effect sizes can also be tested, such as hypotheses that the effect falls within a specific range.
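As a minimal sketch of this idea (a hypothetical example with invented numbers, not taken from the article), the code below tests the same data against two different test hypotheses under a simple normal sampling model: the usual null value of zero and a non-null value of 5.

```python
import numpy as np
from scipy import stats

# Hypothetical data: an observed mean difference and its standard error.
observed_effect = 3.2   # e.g., mean difference between two groups (invented)
standard_error = 1.5    # invented for illustration

def p_value(test_hypothesis):
    """Two-sided p-value for the hypothesis that the true effect
    equals `test_hypothesis`, assuming a normal sampling model."""
    z = (observed_effect - test_hypothesis) / standard_error
    return 2 * stats.norm.sf(abs(z))

print(p_value(0.0))  # test of the null hypothesis (zero effect), ~0.033
print(p_value(5.0))  # test of a non-null hypothesis, ~0.23
```

The point of the sketch is that the machinery is the same whichever effect size is targeted; nothing privileges zero except convention.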
Unfortunately, much of statistical teaching and practice has developed a strong focus on testing null hypotheses, which can contribute to misunderstandings about statistical tests. This focus on null hypotheses has been referred to as “Null Hypothesis Significance Testing” (NHST), which can lead to a narrow and unhealthy perspective on the role of statistical testing in clinical research.
The limitations of dichotomous interpretation of p-values in statistical analysis:
In conventional statistical methods, probability refers to hypothetical frequencies of data patterns under an assumed statistical model, known as frequency probabilities. Even many statistically educated scientists misinterpret these probabilities as hypothesis probabilities.
One of the most widely used frequency probabilities is the p-value, which measures the compatibility between the observed data and the entire statistical model. The p-value is a continuous measure, ranging from 0 for complete incompatibility to 1 for perfect compatibility, but it is often degraded into a dichotomy: results are declared statistically significant if the p-value falls below a cut-off (usually 0.05) and nonsignificant otherwise. This dichotomous view discards information and overlooks the fact that the p-value measures the fit of the whole model to the data; a small p-value does not, by itself, indicate which assumption is incorrect.
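To see concretely how the dichotomy discards information, here is a small hypothetical sketch (the z-statistics are invented for illustration) comparing two nearly identical results that fall on opposite sides of the 0.05 cut-off:

```python
from scipy import stats

# Two hypothetical studies with almost identical z-statistics.
for label, z in [("study A", 1.97), ("study B", 1.95)]:
    p = 2 * stats.norm.sf(abs(z))
    verdict = "significant" if p < 0.05 else "nonsignificant"
    print(f"{label}: p = {p:.3f} -> declared {verdict}")

# study A: p = 0.049 -> declared significant
# study B: p = 0.051 -> declared nonsignificant
# The evidence is essentially identical, yet the dichotomy
# forces opposite conclusions.
```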
Furthermore, traditional definitions of p-value and statistical significance have focused on null hypotheses, treating all other assumptions used to compute the p-value as if they were known to be correct. However, recognizing that these other assumptions are often questionable, a more general view of the p-value is needed as a statistical summary of the compatibility between the observed data and what would be expected if the entire statistical model were correct.
Embracing Confidence Intervals for Robust Statistical Inference:
In statistical hypothesis testing, it is possible to vary the test hypothesis and observe how the p-value changes across competing hypotheses. The effect size that produces a p-value of 1 is the size most compatible with the data, provided all other assumptions used in the test are correct; this result provides a point estimate of the effect under those assumptions. In turn, the effect sizes that produce p-values greater than 0.05 define a range of sizes more compatible with the data than those outside it, and this range corresponds to a 95% confidence interval.
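A minimal sketch of this test inversion, reusing the hypothetical estimate and standard error from the earlier example: scanning candidate effect sizes, the p-value peaks at 1 at the point estimate, and the sizes with p > 0.05 recover the familiar interval estimate ± 1.96 × SE.

```python
import numpy as np
from scipy import stats

observed_effect, standard_error = 3.2, 1.5  # hypothetical values

def p_value(h):
    """Two-sided p-value for the test hypothesis `h` under a normal model."""
    z = (observed_effect - h) / standard_error
    return 2 * stats.norm.sf(abs(z))

# Scan a grid of test hypotheses and keep those with p > 0.05.
grid = np.linspace(-5, 10, 15001)
compatible = grid[np.array([p_value(h) for h in grid]) > 0.05]

print(p_value(observed_effect))            # 1.0 at the point estimate
print(compatible.min(), compatible.max())  # ~ estimate -/+ 1.96 * SE
print(observed_effect - 1.96 * standard_error,
      observed_effect + 1.96 * standard_error)  # closed-form check
```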
Confidence intervals are interval estimates that have the property of coverage probability, meaning that if we calculate, for instance, 95% confidence intervals repeatedly in valid applications, on average, 95% of them will contain the true effect size. This property is a feature of a long sequence of confidence intervals computed from valid models, rather than any single confidence interval.
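This long-run property can be checked by simulation. The sketch below (a hypothetical setup with a known true mean, assuming a normal model) repeatedly draws samples, computes a 95% interval from each, and counts how often the true value is captured:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, sigma, n, n_sims = 10.0, 2.0, 50, 10_000

covered = 0
for _ in range(n_sims):
    sample = rng.normal(true_mean, sigma, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)       # estimated standard error
    lo = sample.mean() - 1.96 * se
    hi = sample.mean() + 1.96 * se
    covered += lo <= true_mean <= hi           # did this interval cover?

print(covered / n_sims)  # close to 0.95 across the long run of intervals
```

The proportion printed describes the procedure across many repetitions; it says nothing about whether any single computed interval contains the true value.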
However, many textbooks and studies only discuss p-values for the null hypothesis of no effect, which contributes to misunderstanding of tests and neglect of estimation. This focus on null hypotheses also obscures the close relationship between p-values and confidence intervals, and the weaknesses they share.
Many journals now require confidence intervals, which provide a more informative and dependable summary of statistical results than p-values alone. Therefore, it is essential for statisticians to understand the strengths and limitations of both p-values and confidence intervals, and to educate others on their proper interpretation and use, so that statistical analyses remain valid and reliable.