Linear regression and the normality assumption

Bibliographic Details
Published in	Journal of clinical epidemiology Vol. 98; pp. 146 - 151
Main Authors	Schmidt, Amand F., Finan, Chris
Format	Journal Article
Language	English
Published	United States: Elsevier Inc, 01.06.2018; Elsevier Limited
Subjects	Bias; Big data; Computer simulation; Confidence intervals; Diabetes mellitus; Diabetes mellitus (non-insulin dependent); Economic models; Empirical analysis; Epidemiological methods; Estimates; Hemoglobin; Hypotheses; Impact tests; Linear Models; Linear regression; Mathematical functions; Modeling assumptions; Normal distribution; Normality; Regression analysis; Regression models; Researchers; Sample Size; Statistical analysis; Statistical inference; Transformations; Variables; Violations

More Information
Summary:	Researchers often perform arbitrary outcome transformations to fulfill the normality assumption of a linear regression model. This commentary explains and illustrates that in large data settings, such transformations are often unnecessary and, worse, may bias model estimates. Linear regression assumptions are illustrated using simulated data and an empirical example on the relation between time since type 2 diabetes diagnosis and glycated hemoglobin levels. Simulation results were evaluated on coverage, i.e., the number of times the 95% confidence interval included the true slope coefficient. Although outcome transformations bias point estimates, violations of the normality assumption in linear regression analyses do not. The normality assumption is necessary to unbiasedly estimate standard errors, and hence confidence intervals and P-values. However, in large sample sizes (e.g., where the number of observations per variable is >10), violations of this normality assumption often do not noticeably impact results. In contrast, assumptions on the parametric model, absence of extreme observations, homoscedasticity, and independence of the errors remain influential even in large sample size settings. Given that modern healthcare research typically includes thousands of subjects, focusing on the normality assumption is often unnecessary, does not guarantee valid results, and, worse, may bias estimates due to the practice of outcome transformations.
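The coverage criterion described in the summary can be illustrated with a small simulation. The sketch below is not the authors' code: the function name `simulate_coverage`, the centered exponential error distribution, the true slope of 0.5, and the use of a normal critical value for the 95% interval are all assumptions chosen for illustration. It fits a simple linear regression to data with strongly skewed (non-normal) errors and reports how often the 95% confidence interval contains the true slope, together with the average slope estimate.

```python
import math
import random
from statistics import NormalDist


def simulate_coverage(n, n_sims=1000, true_slope=0.5, seed=1):
    """Return (coverage, mean slope estimate) for OLS with skewed errors.

    Hypothetical illustration: all parameter values are assumptions,
    not taken from the article.
    """
    rng = random.Random(seed)
    # Normal approximation to the 95% critical value (reasonable for large n).
    z = NormalDist().inv_cdf(0.975)
    covered = 0
    slope_sum = 0.0
    for _ in range(n_sims):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Centered exponential errors: mean zero, right-skewed, non-normal.
        y = [1.0 + true_slope * xi + (rng.expovariate(1.0) - 1.0) for xi in x]

        # Closed-form least-squares estimates for a single predictor.
        x_bar = sum(x) / n
        y_bar = sum(y) / n
        sxx = sum((xi - x_bar) ** 2 for xi in x)
        sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        slope = sxy / sxx
        intercept = y_bar - slope * x_bar

        # Standard error of the slope from the residual variance.
        rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
        se = math.sqrt(rss / (n - 2) / sxx)

        slope_sum += slope
        if abs(slope - true_slope) <= z * se:
            covered += 1
    return covered / n_sims, slope_sum / n_sims


if __name__ == "__main__":
    coverage, mean_slope = simulate_coverage(n=500)
    print(f"coverage: {coverage:.3f}, mean slope estimate: {mean_slope:.3f}")
```

Under these assumptions, a coverage close to 0.95 and a mean slope estimate close to the true value would be consistent with the summary's claim that non-normal errors do not bias point estimates and have little effect on interval coverage in large samples. The sketch keeps the errors homoscedastic and independent, so it says nothing about violations of those assumptions.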
ISSN:	0895-4356, 1878-5921
DOI:	10.1016/j.jclinepi.2017.12.006