


How To Test If Data Is Normally Distributed

Do my data follow a normal distribution? A note on the most widely used distribution and how to test for normality in R

  • What is a normal distribution?
  • Empirical rule
  • Parameters
  • Probabilities and standard normal distribution
    • Areas under the normal distribution in R and by hand
      • Ex. 1
        • In R
        • By hand
      • Ex. 2
        • In R
        • By hand
      • Ex. 3
        • In R
        • By hand
      • Ex. 4
        • In R
        • By hand
      • Ex. 5
  • Why is the normal distribution so crucial in statistics?
  • How to test the normality assumption
    • Histogram
    • Density plot
    • QQ-plot
    • Normality test
  • References

What is a normal distribution?

The normal distribution is a function that defines how a set of measurements is distributed around the center of these measurements (i.e., the mean). Many natural phenomena in real life can be approximated by a bell-shaped frequency distribution known as the normal distribution or the Gaussian distribution.

The normal distribution is a mound-shaped, unimodal and symmetric distribution where most measurements gather around the mean. Moreover, the farther a measure deviates from the mean, the lower the probability of occurring. In this sense, for a given variable, it is common to find values close to the mean, but less and less likely to find values as we move away from the mean. Last but not least, since the normal distribution is symmetric around its mean, extreme values in both tails of the distribution are equivalently unlikely. For example, given that adult height follows a normal distribution, most adults are close to the average height and extremely short adults occur as infrequently as extremely tall adults.

In this article, the focus is on understanding the normal distribution, the associated empirical rule, its parameters and how to compute \(Z\) scores to find probabilities under the curve (illustrated with examples). As it is a requirement in some statistical tests, we also show 4 complementary methods to test the normality assumption in R.

Empirical rule

Data possessing an approximately normal distribution have a definite variation, as expressed by the following empirical rule:

  • \(\mu \pm \sigma\) includes approximately 68% of the observations
  • \(\mu \pm 2 \cdot \sigma\) includes approximately 95% of the observations
  • \(\mu \pm 3 \cdot \sigma\) includes almost all of the observations (99.7% to be more precise)

Normal distribution & empirical rule (68-95-99.7% rule)

where \(\mu\) and \(\sigma\) correspond to the population mean and population standard deviation, respectively.

The empirical rule, also known as the 68-95-99.7% rule, is illustrated by the following 2 examples. Suppose that the scores of an exam in statistics given to all students in a Belgian university are known to have, approximately, a normal distribution with mean \(\mu = 67\) and standard deviation \(\sigma = 9\). It can then be deduced that approximately 68% of the scores are between 58 and 76, that approximately 95% of the scores are between 49 and 85, and that almost all of the scores (99.7%) are between 40 and 94. Thus, knowing the mean and the standard deviation gives us a fairly good picture of the distribution of scores. Now suppose that a single university student is randomly selected from those who took the exam. What is the probability that her score will be between 49 and 85? Based on the empirical rule, we find that 0.95 is a reasonable answer to this probability question.
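
This answer is easy to verify numerically. Below is a minimal sketch using pnorm() (a function introduced more formally later in this article), with the exam parameters from the example above:

    # P(49 <= X <= 85) for X ~ N(67, 9^2), i.e., mean +/- 2 standard deviations
    pnorm(85, mean = 67, sd = 9) - pnorm(49, mean = 67, sd = 9)
    # close to 0.95 (more precisely, about 0.9545)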

The utility and value of the empirical rule are due to the common occurrence of approximately normal distributions of measurements in nature. For example, IQ, shoe size, height, birth weight, etc. are approximately normally distributed. You will find that approximately 95% of these measurements will be within \(2\sigma\) of their mean (Wackerly, Mendenhall, and Scheaffer 2014).

Parameters

Like many probability distributions, the shape and probabilities of the normal distribution are defined entirely by some parameters. The normal distribution has two parameters: (1) the mean \(\mu\) and (2) the variance \(\sigma^2\) (i.e., the square of the standard deviation \(\sigma\)). The mean \(\mu\) locates the center of the distribution, that is, the central tendency of the observations, and the variance \(\sigma^2\) defines the width of the distribution, that is, the spread of the observations.

The mean \(\mu\) can take on any finite value (i.e., \(-\infty < \mu < \infty\)), whereas the variance \(\sigma^2\) can assume any positive finite value (i.e., \(\sigma^2 > 0\)). The shape of the normal distribution changes based on these two parameters. Since there is an infinite number of combinations of the mean and variance, there is an infinite number of normal distributions, and thus an infinite number of forms.

For example, see how the shapes of the normal distributions vary when the two parameters change:

As you can see on the second graph, when the variance (or the standard deviation) decreases, the observations are closer to the mean. On the contrary, when the variance (or standard deviation) increases, it is more likely that observations will be further away from the mean.

A random variable \(X\) which follows a normal distribution with a mean of 430 and a variance of 17 is denoted \(X \sim \mathcal{N}(\mu = 430, \sigma^2 = 17)\).

We have seen that, although different normal distributions have different shapes, all normal distributions have common characteristics:

  • They are symmetric, 50% of the population is above the mean and 50% of the population is below the mean
  • The mean, median and mode are equal
  • The empirical rule detailed earlier is applicable to all normal distributions

Probabilities and standard normal distribution

Probabilities and quantiles for random variables with normal distributions are easily found using R via the functions pnorm() and qnorm(). Probabilities associated with a normal distribution can also be found using this Shiny app. However, before computing probabilities, we need to learn more about the standard normal distribution and the \(Z\) score.

Although there are infinitely many normal distributions (since there is a normal distribution for every combination of mean and variance), we need only one table to find the probabilities under the normal curve: the standard normal distribution. The standard normal distribution is a special case of the normal distribution where the mean is equal to 0 and the variance is equal to 1. A normal random variable \(X\) can always be transformed to a standard normal random variable \(Z\), a process known as "scaling" or "standardization," by subtracting the mean from the observation, and dividing the result by the standard deviation. Formally:

\[Z = \frac{X - \mu}{\sigma}\]

where \(X\) is the observation, \(\mu\) and \(\sigma\) the mean and standard deviation of the population from which the observation was drawn. So the mean of the standard normal distribution is 0, and its variance is 1, denoted \(Z \sim \mathcal{N}(\mu = 0, \sigma^2 = 1)\).

From this formula, we see that \(Z\), referred to as the standard score or \(Z\) score, allows to see how far away one specific observation is from the mean of all observations, with the distance expressed in standard deviations. In other words, the \(Z\) score corresponds to the number of standard deviations one observation is away from the mean. A positive \(Z\) score means that the specific observation is above the mean, whereas a negative \(Z\) score means that the specific observation is below the mean. \(Z\) scores are often used to compare an individual to her peers, or more generally, a measurement compared to its distribution.

For instance, suppose a student scoring 60 at a statistics exam with the mean score of the class being 40, and scoring 65 at an economics exam with the mean score of the class being 80. Given the "raw" scores, one would say that the student performed better in economics than in statistics. However, taking into consideration her peers, it is clear that the student performed relatively better in statistics than in economics. Computing \(Z\) scores allows to take into consideration all other students (i.e., the entire distribution) and gives a better measure of comparison. Let's compute the \(Z\) scores for the 2 exams (a quick check in R follows the hand computations below), assuming that the scores for both exams follow a normal distribution with the following parameters:

                     Statistics   Economics
Mean                     40           80
Standard deviation        8           12.5
Student's score          60           65

\(Z\) scores for:

  • Statistics: \(Z_{stat} = \frac{60 - 40}{8} = 2.5\)
  • Economics: \(Z_{econ} = \frac{65 - 80}{12.5} = -1.2\)
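
The same computations can be reproduced in R (a minimal sketch; the values are taken from the table above):

    # Z = (X - mu) / sigma for each exam
    z_stat <- (60 - 40) / 8    # statistics: 2.5
    z_econ <- (65 - 80) / 12.5 # economics: -1.2
    c(z_stat, z_econ)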

On the one hand, the \(Z\) score for the exam in statistics is positive (\(Z_{stat} = 2.5\)) which means that she performed better than average. On the other hand, her score for the exam in economics is negative (\(Z_{econ} = -1.2\)) which means that she performed worse than average. Below an illustration of her grades in a standard normal distribution for better comparison:

Although the score in economics is better in absolute terms, the score in statistics is actually relatively better when comparing each score within its own distribution.

Furthermore, the \(Z\) score also enables to compare observations that would otherwise be impossible to compare because they have different units for example. Suppose you want to compare a salary in € with a weight in kg. Without standardization, there is no way to conclude whether someone is more extreme in terms of her wage or in terms of her weight. Thanks to \(Z\) scores, we can compare two values that were in the first place not comparable to each other.

Final remark regarding the interpretation of a \(Z\) score: a rule of thumb is that an observation with a \(Z\) score between -3 and -2 or between 2 and 3 is considered a rare value. An observation with a \(Z\) score smaller than -3 or larger than 3 is considered an extremely rare value. A value with any other \(Z\) score is considered neither rare nor extremely rare.

Areas under the normal distribution in R and by hand

Now that we have covered the \(Z\) score, we are going to use it to determine the area under the curve of a normal distribution.

Note that there are several ways to arrive at the solution in the following exercises. You may therefore use other steps than the ones presented to obtain the same result.

Ex. 1

Let \(Z\) denote a normal random variable with mean 0 and standard deviation 1, find \(P(Z > 1)\).

We are actually looking for the shaded area in the following figure:

Standard normal distribution: \(P(Z > 1)\)

In R

    pnorm(1,
      mean = 0,
      sd = 1, # sd stands for standard deviation
      lower.tail = FALSE
    )
    ## [1] 0.1586553

We look for the probability of \(Z\) being larger than 1 so we set the argument lower.tail = FALSE. The default lower.tail = TRUE would give the result for \(P(Z < 1)\). Note that \(P(Z = 1) = 0\) so writing \(P(Z > 1)\) or \(P(Z \ge 1)\) is equivalent.
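
Since the total area under the curve is 1, the same result can also be obtained from the default lower tail (a small sketch for illustration):

    1 - pnorm(1) # P(Z > 1) = 1 - P(Z < 1), also 0.1586553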

By hand

Note that the random variable \(Z\) already has a mean of 0 and a standard deviation of 1, so no transformation is required. To find the probabilities by hand, we need to refer to the standard normal distribution table shown below:

Standard normal distribution table (Wackerly, Mendenhall, and Scheaffer 2014).

From the illustration at the top of the table, we see that the values inside the table correspond to the area under the normal curve above a certain \(z\). Since we are looking precisely at the probability above \(z = 1\) (since we look for \(P(Z > 1)\)), we can simply proceed down the first (\(z\)) column in the table until \(z = 1.0\). The probability is 0.1587. Thus, \(P(Z > 1) = 0.1587\). This is similar to what we found using R, except that values in the table are rounded to 4 digits.

Ex. 2

Let \(Z\) denote a normal random variable with mean 0 and standard deviation 1, find \(P(−1 \le Z \le 1)\).

We are looking for the shaded area in the following figure:

Standard normal distribution: \(P(−1 \le Z \le 1)\)

In R

    pnorm(1, lower.tail = TRUE) - pnorm(-1, lower.tail = TRUE)
    ## [1] 0.6826895

Note that the arguments by default for the mean and the standard deviation are mean = 0 and sd = 1. Since this is what we need, we can omit them.

By hand

For this exercise we proceed by steps:

  1. The shaded area corresponds to the entire area under the normal curve minus the two white areas in both tails of the curve.
  2. We know that the normal distribution is symmetric.
  3. Therefore, the shaded area is the entire area under the curve minus 2 times the white area in the right tail of the curve, the white area in the right tail being \(P(Z > 1)\).
  4. We also know that the entire area under the normal curve is 1.
  5. Thus, the shaded area is 1 minus 2 times \(P(Z > 1)\):

\[P(-1 \le Z \le 1) = 1 - 2 \cdot P(Z > 1)\] \[= 1 - 2 \cdot 0.1587 = 0.6826\]

where \(P(Z > 1) = 0.1587\) has been found in the previous exercise.
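
The same by-hand reasoning can be checked in R (a small sketch; compare it with the result of the In R subsection above):

    1 - 2 * pnorm(1, lower.tail = FALSE) # 1 - 2 * P(Z > 1) = 0.6826895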

Ex. 3

Let \(Z\) denote a normal random variable with mean 0 and standard deviation 1, find \(P(0 \le Z \le 1.37)\).

We are looking for the shaded area in the following figure:

Standard normal distribution: \(P(0 \le Z \le 1.37)\)

In R

    pnorm(0, lower.tail = FALSE) - pnorm(1.37, lower.tail = FALSE)
    ## [1] 0.4146565

By hand

Once again we proceed by steps for this exercise:

  1. We know that \(P(Z > 0) = 0.5\) since the entire area under the curve is 1, half of it is 0.5.
  2. The shaded area is half of the entire area under the curve minus the area from 1.37 to infinity.
  3. The area under the curve from 1.37 to infinity corresponds to \(P(Z > 1.37)\).
  4. Therefore, the shaded area is \(0.5 - P(Z > 1.37)\).
  5. To find \(P(Z > 1.37)\), proceed down the \(z\) column in the table to the entry 1.3 and then across the top of the table to the column labeled .07 to read \(P(Z > 1.37) = .0853\)
  6. Thus,

\[P(0 \le Z \le 1.37) = P(Z > 0) - P(Z > 1.37)\] \[ = 0.5 - 0.0853 = 0.4147\]
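
Again, this by-hand decomposition is easy to verify in R (a small sketch mirroring steps 4 and 5 above):

    0.5 - pnorm(1.37, lower.tail = FALSE) # P(Z > 0) - P(Z > 1.37) = 0.4146565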

Ex. 4

Recap the example presented in the empirical rule: Suppose that the scores of an exam in statistics given to all students in a Belgian university are known to have a normal distribution with mean \(\mu = 67\) and standard deviation \(\sigma = 9\). What fraction of the scores lies between 70 and 80?

We are looking for the shaded area in the following figure:

\(P(70 \le X \le 80)\) where \(X \sim \mathcal{N}(\mu = 67, \sigma^2 = 9^2)\)

In R

    pnorm(70, mean = 67, sd = 9, lower.tail = FALSE) - pnorm(80, mean = 67, sd = 9, lower.tail = FALSE)
    ## [1] 0.2951343

By hand

Remember that we are looking for \(P(70 \le X \le 80)\) where \(X \sim \mathcal{N}(\mu = 67, \sigma^2 = 9^2)\). The random variable \(X\) is in its "raw" format, meaning that it has not been standardized yet since the mean is 67 and the variance is \(9^2\). We thus first need to apply the transformation to standardize the endpoints 70 and 80 with the following formula:

\[Z = \frac{X - \mu}{\sigma}\]

After the standardization, \(x = 70\) becomes (in terms of \(z\), so in terms of deviation from the mean expressed in standard deviations):

\[z = \frac{70 - 67}{9} = 0.3333\]

and \(x = 80\) becomes:

\[z = \frac{80 - 67}{9} = 1.4444\]

The figure above in terms of \(X\) is now in terms of \(Z\):

\(P(0.3333 \le Z \le 1.4444)\) where \(Z \sim \mathcal{N}(\mu = 0, \sigma^2 = 1)\)

Finding the probability \(P(0.3333 \le Z \le 1.4444)\) is similar to exercises 1 to 3:

  1. The shaded area corresponds to the area under the curve from \(z = 0.3333\) to \(z = 1.4444\).
  2. In other words, the shaded area is the area under the curve from \(z = 0.3333\) to infinity minus the area under the curve from \(z = 1.4444\) to infinity.
  3. From the table, \(P(Z > 0.3333) = 0.3707\) and \(P(Z > 1.4444) = 0.0749\)
  4. Thus:

\[P(0.3333 \le Z \le 1.4444)\] \[= P(Z > 0.3333) - P(Z > 1.4444)\] \[= 0.3707 - 0.0749 = 0.2958\]

The difference with the probability found using R comes from the rounding.
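
To see this, here is a small sketch that redoes the computation in R with the standardized endpoints kept at full precision (no rounding):

    # standardize the endpoints, then use the standard normal distribution
    z_lo <- (70 - 67) / 9
    z_hi <- (80 - 67) / 9
    pnorm(z_lo, lower.tail = FALSE) - pnorm(z_hi, lower.tail = FALSE)
    # identical to the result obtained with pnorm(70, ...) - pnorm(80, ...) above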

To conclude this exercise, we can say that, given that the mean score is 67 and the standard deviation is 9, 29.58% of the students scored between 70 and 80.

Ex. 5

See another example in context here.

Why is the normal distribution so crucial in statistics?

The normal distribution is important for three main reasons:

  • Some statistical hypothesis tests assume that the data follow a normal distribution
  • The central limit theorem states that, for a large number of observations (usually \(n > 30\)), no matter what the underlying distribution of the original variable is, the distribution of the sample mean (\(\overline{X}_n\)) and of the sum (\(S_n = \sum_{i = 1}^n X_i\)) may be approximated by a normal distribution (Stevens 2013); see the small simulation sketch after this list
  • Linear and nonlinear regression assume that the residuals are normally distributed (for small sample sizes)
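
Below is a minimal simulation sketch of the central limit theorem (an assumed illustration, not taken from the original article): even though the individual draws come from a strongly skewed exponential distribution, the distribution of the sample means is roughly bell-shaped.

    # means of 1,000 samples of size n = 100 drawn from a skewed distribution
    set.seed(42)
    sample_means <- replicate(1000, mean(rexp(100, rate = 1)))
    hist(sample_means) # roughly bell-shaped, centered around 1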

It is therefore useful to know how to test for normality in R, which is the topic of the next sections.

How to test the normality assumption

As mentioned above, some statistical tests require that the data follow a normal distribution, or the results of the test may be flawed.

In this section, we show 4 complementary methods to determine whether your data follow a normal distribution in R.

Histogram

A histogram displays the spread and shape of a distribution, so it is a good starting point to evaluate normality. Let's have a look at the histogram of a distribution that we would expect to follow a normal distribution, the height of 1,000 adults in cm:

The normal curve with the corresponding mean and variance has been added to the histogram. The histogram follows the normal curve so the data seem to follow a normal distribution.

Below is the minimal code for a histogram in R with the dataset iris:

    data(iris)
    hist(iris$Sepal.Length)

In {ggplot2}:

    library(ggplot2) # load the package first
    ggplot(iris) +
      aes(x = Sepal.Length) +
      geom_histogram()

Histograms are however not sufficient, particularly in the case of small samples, because the number of bins greatly changes their appearance. Histograms are not recommended when the number of observations is less than 20 because they do not always correctly illustrate the distribution. See two examples below with datasets of 10 and 12 observations:

Can you tell whether these datasets follow a normal distribution? Surprisingly, both follow a normal distribution!

In the remainder of the article, we will use the dataset of the 12 adults. If you would like to follow my code in your own script, here is how I generated the data:

    set.seed(42)
    dat_hist <- data.frame(
      value = rnorm(12, mean = 165, sd = 5)
    )

The rnorm() function generates random numbers from a normal distribution (12 random numbers with a mean of 165 and standard deviation of 5 in this case). These 12 observations are then saved in the dataset called dat_hist under the variable value. Note that set.seed(42) is important to obtain the exact same data as me.
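
As a quick sanity check (a small sketch; remember that sample statistics will differ somewhat from the population parameters passed to rnorm()):

    mean(dat_hist$value) # close to, but not exactly, 165
    sd(dat_hist$value)   # close to, but not exactly, 5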

Density plot

Density plots also provide a visual judgment about whether the data follow a normal distribution. They are similar to histograms as they also allow to analyze the spread and the shape of the distribution. However, they are a smoothed version of the histogram. Here is the density plot drawn from the dataset on the height of the 12 adults discussed above:

    plot(density(dat_hist$value))

In {ggpubr}:

                library("ggpubr") # package must be installed first ggdensity(dat_hist$value,   master = "Density plot of adult tiptop",   xlab = "Tiptop (cm)" )              

Since it is hard to test for normality from histograms and density plots only, it is recommended to corroborate these graphs with a QQ-plot. The QQ-plot, also known as normality plot, is the third method presented to evaluate normality.

QQ-plot

Like histograms and density plots, QQ-plots allow to visually evaluate the normality assumption. Here is the QQ-plot drawn from the dataset on the height of the 12 adults discussed above:

    library(car)
    qqPlot(dat_hist$value)

    ## [1] 12  2

In {ggpubr}:

    library(ggpubr)
    ggqqplot(dat_hist$value)

Instead of looking at the spread of the data (as is the case with histograms and density plots), with QQ-plots we only need to ascertain whether the data points follow the line (sometimes referred to as Henry's line).

If points are close to the reference line and within the confidence bands, the normality assumption can be considered as met. The bigger the deviation between the points and the reference line and the more they lie outside the confidence bands, the less likely that the normality condition is met. The height of these 12 adults seems to follow a normal distribution because all points lie within the confidence bands.

When facing a non-normal distribution as shown by the QQ-plot below (systematic deviation from the reference line), the first step is usually to apply the logarithm transformation on the data and recheck to see whether the log-transformed data are normally distributed. Applying the logarithm transformation can be done with the log() function.
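
Here is a small sketch of this workflow on hypothetical right-skewed data (the dat_skewed variable is an assumption for illustration, not part of the original article):

    # hypothetical right-skewed data: log-normal draws
    set.seed(42)
    dat_skewed <- rlnorm(100, meanlog = 5, sdlog = 0.5)
    library(car)
    qqPlot(dat_skewed)      # systematic deviation from the reference line
    qqPlot(log(dat_skewed)) # the log-transformed data follow the line much better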

Note that QQ-plots are also a convenient way to assess whether residuals from linear regression follow a normal distribution.
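
For instance, a minimal sketch of this check on an assumed example model fitted to the iris dataset used earlier:

    # check the normality of residuals from a simple linear regression
    mod <- lm(Sepal.Length ~ Petal.Length, data = iris)
    library(car)
    qqPlot(resid(mod))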

Normality test

The 3 tools presented above were a visual inspection of normality. Nevertheless, visual inspection may sometimes be unreliable so it is also possible to formally test whether the data follow a normal distribution with statistical tests. These normality tests compare the distribution of the data to a normal distribution in order to assess whether observations show an important deviation from normality.

The two most common normality tests are Shapiro-Wilk's test and Kolmogorov-Smirnov's test. Both tests have the same hypotheses, that is:

  • \(H_0\): the data follow a normal distribution
  • \(H_1\): the data do not follow a normal distribution

The Shapiro-Wilk test is recommended for normality testing as it provides better power than the Kolmogorov-Smirnov test. In R, the Shapiro-Wilk test of normality can be done with the function shapiro.test():

    shapiro.test(dat_hist$value)
    ## 
    ##  Shapiro-Wilk normality test
    ## 
    ## data:  dat_hist$value
    ## W = 0.93968, p-value = 0.4939

From the output, we see that the \(p\)-value \(> 0.05\), implying that we do not reject the null hypothesis that the data follow a normal distribution. This test goes in the same direction as the QQ-plot, which showed no significant deviation from normality (as all points lay within the confidence bands).

It is important to note that, in practice, normality tests are often considered as too conservative in the sense that for large sample sizes (\(n > 50\)), a small deviation from normality may cause the normality condition to be violated. A normality test is a hypothesis test, so as the sample size increases, its capacity of detecting smaller differences increases. So as the number of observations increases, the Shapiro-Wilk test becomes very sensitive even to a small deviation from normality. As a consequence, it happens that according to the normality test the data do not follow a normal distribution although the departures from the normal distribution are negligible, so the data could in fact be considered to follow approximately a normal distribution. For this reason, it is often the case that the normality condition is verified based on a combination of all methods presented in this article, that is, visual inspections (with histograms and QQ-plots) and a formal inspection (with the Shapiro-Wilk test for instance).
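
This sensitivity is easy to demonstrate with a small sketch (an assumed illustration: a \(t\) distribution with 10 degrees of freedom is very close to normal, yet with \(n = 5000\) the Shapiro-Wilk test will often reject normality):

    # nearly normal data, large sample: the test tends to flag the mild heavy tails
    set.seed(42)
    nearly_normal <- rt(5000, df = 10) # note: shapiro.test() accepts at most 5000 observations
    shapiro.test(nearly_normal) # the p-value is typically very small here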

I personally tend to prefer QQ-plots over histograms and normality tests so I do not have to bother about the sample size. This article showed the different methods that are available; your choice will of course depend on the type of your data and the context of your analyses.

Thanks for reading. I hope the article helped you to learn more about the normal distribution and how to test for normality in R.

As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion.

References

Stevens, James P. 2013. Intermediate Statistics: A Modern Approach. Routledge.

Wackerly, Dennis, William Mendenhall, and Richard L. Scheaffer. 2014. Mathematical Statistics with Applications. Cengage Learning.


