I’ve seen quite a lot of cases where people try to check the normality of the response variable in a linear regression setting either with formal tests, be it the Kolmogorov–Smirnov, Anderson–Darling, or Shapiro–Wilk test, or by drawing a histogram and a kernel density plot. But really, think about when those tests work, or when a density estimate can approximate a distribution: you need many observed values from a single distribution! Consider the following linear model.

y_{i}=\mathbf{x}_{i}'\boldsymbol{\beta}+e_{i},\qquad e_{i}\overset{\text{i.i.d.}}{\sim}\mathcal{N}\left(0,\sigma^{2}\right)

What is the distribution of each y_{i}? It’s y_{i}\sim \mathcal{N}\left(\mathbf{x}_{i}'\boldsymbol{\beta},\sigma^{2}\right)! The means are potentially different for every one of the y_{i}s. So what is the kernel density plot really approximating? Is it \mathcal{N}\left(0,\sigma^{2}\right)? NO! It’s a mixture of n normal distributions.

\widehat{f}\left(x\right) \approx \dfrac{1}{n}\displaystyle\sum_{i=1}^{n}f\left(x\,|\,\mathbf{x}_{i}'\boldsymbol{\beta},\sigma^{2}\right),\qquad \text{where } f\left(\cdot\,|\,\mu,\sigma^{2}\right) \text{ is the } \mathcal{N}\left(\mu,\sigma^{2}\right) \text{ density}

So the density plot will rarely resemble a bell-shaped curve; in particular, symmetry about zero will almost never hold. The tests will typically reject the null hypothesis, and even when they don’t, you should not take the result at face value, because such cases are theoretically rare.

(Example) Imagine the following situation.

y_{i}=\beta_{1}x_{i}+e_{i},\quad x_{i}\in\left\{0,1\right\}

Then the two groups of y_{j}, C_{0}=\left\{y_{j}: x_{j}=0,\;j=1,\ldots,n\right\} and C_{1}=\left\{y_{j}: x_{j}=1,\;j=1,\ldots,n\right\}, respectively follow y_{j}\in C_{0}\sim\mathcal{N}\left(0,\sigma^{2}\right) and y_{j}\in C_{1}\sim \mathcal{N}\left(\beta_{1},\sigma^{2}\right). The response variables in this case come from two different population distributions, which is thus called a mixture of 2 normal components. There is no way we can test the normality of the y_{i}s as a whole.
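A quick simulation makes this concrete (a sketch in Python; the values n = 500, β₁ = 5, σ = 1 are arbitrary choices for the demonstration): the Shapiro–Wilk test soundly rejects normality for the pooled y_{i}s, while the residuals, which really are i.i.d. normal, show no such problem.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, beta1, sigma = 500, 5.0, 1.0

# Binary predictor splits the responses into two normal components
x = rng.integers(0, 2, size=n).astype(float)
y = beta1 * x + rng.normal(0.0, sigma, size=n)

# OLS through the origin: beta_hat = sum(x*y) / sum(x*x)
beta_hat = (x @ y) / (x @ x)
resid = y - beta_hat * x

# Shapiro-Wilk on the raw responses vs. on the residuals
_, p_pooled = stats.shapiro(y)      # tiny: y is a two-component mixture
_, p_resid = stats.shapiro(resid)   # typically large: residuals are ~normal

print(f"p-value on pooled y:  {p_pooled:.2e}")
print(f"p-value on residuals: {p_resid:.3f}")
```

The larger β₁ is relative to σ, the more clearly bimodal the pooled sample becomes and the faster the test rejects.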

(Conclusion) Let’s go back to basics. What actually matters in the linear setting? Is it the distribution of the response variable? We don’t even need normality of the error term: the ordinary least squares (OLS) estimator does not assume normal errors. The conditions are

  1. \mathrm{E}\left(e_{i}\,|\,\mathbf{x}_{i}\right)=0 (strict exogeneity)
  2. \mathrm{E}\left(e_{i}^{2}\right)=\sigma^{2} (homoskedasticity)
  3. \mathrm{E}\left(\mathbf{x}_{i}e_{i}\right)=0 (uncorrelated error and predictors)
  4. Additionally, for the asymptotic results below via the Central Limit Theorem (CLT), Q_{xx}=\mathrm{E}\left(\mathbf{x}_{i}\mathbf{x}_{i}'\right) should be positive definite.

What really matters is the consistency and asymptotic distribution of the estimators.

\begin{array}{rcl}\widehat{\boldsymbol{\beta}} &=& \left(\dfrac{1}{n}\displaystyle\sum_{i=1}^{n}\mathbf{x}_{i}\mathbf{x}_{i}'\right)^{-1}\left(\dfrac{1}{n}\displaystyle\sum_{i=1}^{n}\mathbf{x}_{i}y_{i}\right) \xrightarrow{P}\boldsymbol{\beta}\\ \sqrt{n}\left(\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta}\right)&\xrightarrow{D}&\mathcal{N}\left(\mathbf{0},\sigma^{2}Q_{xx}^{-1}\right)\end{array}

We don’t need normality for these to hold. The normal distribution is a convenient approximation of reality to start with because it lays the foundation of the distributional approach to inference (which I believe is essentially Bayesian), but why care so much about normality if we’re going to use OLS anyway?
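To see the asymptotics at work without normal errors, here is a small Monte Carlo sketch (my own illustrative setup, not from any particular textbook: centered exponential errors with a standard normal predictor, so that \sigma^{2}=1 and Q_{xx}=I, making the predicted standard deviation of \sqrt{n}(\widehat{\beta}_{1}-\beta_{1}) equal to 1):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 2000
beta0, beta1 = 1.0, 2.0

slope_hats = np.empty(reps)
for r in range(reps):
    x = rng.standard_normal(n)
    e = rng.exponential(1.0, size=n) - 1.0  # mean 0, variance 1, clearly non-normal
    y = beta0 + beta1 * x + e
    X = np.column_stack([np.ones(n), x])
    bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
    slope_hats[r] = bhat[1]

# Consistency: the estimates concentrate around the true slope
print("mean of beta1_hat:", slope_hats.mean())

# Asymptotic normality: with E(x)=0, Var(x)=1, sigma^2=1, we get Q_xx = I,
# so sqrt(n)(beta1_hat - beta1) should have standard deviation close to 1
z = np.sqrt(n) * (slope_hats - beta1)
print("sd of sqrt(n)(beta1_hat - beta1):", z.std())
```

Despite the skewed errors, the sampling distribution of the slope estimate matches the \mathcal{N}\left(\mathbf{0},\sigma^{2}Q_{xx}^{-1}\right) prediction closely.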

[CAVEAT] It is fine to draw density plots and histograms of variables as part of exploratory data analysis (EDA), but normality should be checked only after the mean effect has been removed. This is why every linear regression textbook places it, under the name of a residual test, in the post-estimation diagnostic step.
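A minimal sketch of that workflow (assumed toy data: a uniform predictor, so the marginal distribution of y is far from normal even though the errors are exactly normal) is to run the normality test on the residuals, never on y itself:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 300
x = rng.uniform(0.0, 10.0, size=n)                 # continuous predictor
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)   # normal errors, non-normal y

# Fit OLS with an intercept and extract the residuals
X = np.column_stack([np.ones(n), x])
bhat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ bhat

# The raw response fails the test; the residuals do not
_, p_y = stats.shapiro(y)       # small: y inherits the roughly uniform shape of 2*x
_, p_resid = stats.shapiro(resid)
print(f"Shapiro-Wilk p-value, raw y:     {p_y:.2e}")
print(f"Shapiro-Wilk p-value, residuals: {p_resid:.3f}")
```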