Lab03 - Normal Distribution in R

Continuous distributions
- Uniform distribution
- Normal distribution
Application
- Normal

Many of the standard probability distributions have functions in R to calculate:

probability density/mass function (d)
cumulative distribution function (p)
quantile function (q)
random draws (r)

The letter after each indicates the letter the function will start with for all distributions.

Continuous distributions

In constrast to discrete random variables, continuous random variables can take on an uncountably infinite number of values. The easiest way for this to happen is that the random variable can take on any value between two specified values (and infinity counts), i.e. an interval.

Continuous random variables have a probability density function (pdf) instead of a pmf. When integrated from a to b, this pdf gives the probability the random variable will take on a value between a and b. Continuous random variables still have a cdf, quantile function, and random generator that all still have the same interpretation.

Uniform distribution

The simplest continuous random variable is the uniform random variable. If \(Y\) is a random variable and it is uniformly distributed between a and b, then we write \(Y\sim Unif(a,b)\) and this means that \(Y\) can take on any value between a and b with equal probability.

The probability density function for a uniform random variable is zero outside of a and b (indicating that the random variable cannot take values below a or above b) and is constant at a value of 1/(b-a) from a to b.

a <- 0
b <- 1

# The curve function expects you to give a function of `x` and then it 
# (internally) creates a sequence of values from `from` and to `to` and creates
# plots similar to what we had before, but using a line rather than points.
curve(dunif(x, min = a, max = b), 
      from = -1, to = 2, n = 1001, 
      xlab='y', ylab='f(y)', 
      main='Probability density function for Unif(0,1)')

The cumulative distribution function is the integral from negative infinite up to y of the probability density function. Thus it indicates the probability the random variables will take on values less than y, i.e. \(P(Y<y)\). Since the probability density function for the uniform is constant, the integral is simply zero from negative infinite up to a, then a straight line from (a,0) up to (b,1) and then constant after that.

curve(punif(x, min = a, max = b), 
      from = -1, to = 2, n = 1001,
      xlab='y', ylab='F(y)', 
      main='Cumulative distribution function for Unif(0,1)')

As mentioned previously, the cumulative distribution function provides the area under the pdf, i.e. the integral, from negative infinity up to a value for the random variable.

# Visualize the CDF
curve(dunif(x, min = a, max = b), 
      from = -1, to = 2, n = 1001,
      xlab='y', ylab='f(y)', 
      main='CDF is area under the PDF (integral)')

U <- 0.3
x <- seq(-1, U, length = 1001)
polygon(c(x, rev(x)), 
          c(rep(0, length(x)), dunif(rev(x), min = a, max = b)),
          col = 'red', lty = 0)

punif(U, min = a, max = b)

## [1] 0.3

We can also visualize areas under the curve that are differences in the CDF.

# Visualize area under the curve
curve(dunif(x, min = a, max = b), 
      from = -1, to = 2, n = 1001,
      xlab='y', ylab='f(y)', 
      main='Area under the curve (integral)')

L <- 0.3; U <- 0.7
x <- seq(L, U, length = 1001)
polygon(c(x, rev(x)), 
          c(rep(0, length(x)), dunif(rev(x), min = a, max = b)),
          col = 'red', lty = 0)

punif(U, min = a, max = b) - punif(L, min = a, max = b)

## [1] 0.4

To draw random values from the distribution, use the r version of the function. For instance, here is 100 random Unif(a,b) draws represented as a histogram.

random_uniforms <- runif(100, min = a, max = b)

hist(random_uniforms, 
     probability = TRUE, 
     main = 'Random draws from Unif(0,1)')

curve(dunif(x, min = a, max = b), 
      add = TRUE, col='red')

Activity

Plot the pdf, cdf, and quantile function for a uniform distribution on the interval (13, 65), i.e. \(X\sim Unif(13,65)\). Then sample 999 random uniforms on the interval (13,65) and plot a histogram of these draws with the probability density function. Bonus: plot area under the uniform pdf and calculate the appropriate values using the cdf.

Click for solution

a <- 13
b <- 65

opar = par(mfrow=c(2,2)) # Create a 2x2 grid of figures
curve(dunif(x, min = a, max = b), 
      from = a - 1, to = b + 1, n = 1001,
      xlab='y', ylab='f(y)', 
      main='Probability density function for Unif(13,65)')

curve(punif(x, min = a, max = b), 
      from = a - 1, to = b + 1, n = 1001,
      xlab='y', ylab='f(y)', 
      main='Cumulative distribution function for Unif(13,65)')

curve(qunif(x, min = a, max = b), 
      from = 0, to = 1,
      xlab='y', ylab='f(y)', 
      main='Quantile function for Unif(13,65)')

random_uniforms <- runif(999, min = a, max = b)

hist(random_uniforms, probability = TRUE, 
     main = 'Random draws from Unif(13,65)')

curve(dunif(x, min = a, max = b), 
      add=TRUE, col='red')

par(opar) # Revert back to 1x1 grid

Normal distribution

The most important distribution for this course is the normal (Gaussian) distribution. The normal distribution has two parameters: the mean (\(\mu\)) and the variance (\(\sigma^2\)) and we write \(X\sim N(\mu,\sigma^2)\). If \(\mu=0\) and \(\sigma^2=1\), we call the associated random variable a standard normal. The probability density function for the normal distribution is the well-known bell-shaped curve.

mu    <- 0
sigma <- 1 # standard deviation

curve(dnorm(x, mean = mu, sd = sigma), # notice the 3rd argument is the sd
      from = mu - 4*sigma, to = mu + 4*sigma, n = 1001,
      ylab = 'f(x)', xlab = 'x',
      main = 'PDF for a standard normal')

The cdf function has a sigmoid shape.

curve(pnorm(x, mean = mu, sd = sigma), 
      from = mu - 4*sigma, to = mu + 4*sigma, n = 1001,
      main = 'CDF for a standard normal',
      ylab = 'F(x)')

As mentioned previously, the cumulative distribution function provides the area under the pdf, i.e. the integral, from negative infinity up to a value for the random variable.

# Visualize the CDF
curve(dnorm(x, mean = mu, sd = sigma), 
      from = mu - 4*sigma, to = mu + 4*sigma, n = 1001,
      xlab='y', ylab='f(y)', 
      main='CDF is area under the PDF (integral)')

U <- 0.3
x <- seq(mu - 4*sigma, U, length = 1001)
polygon(c(x, rev(x)), 
          c(rep(0, length(x)), dnorm(rev(x), mean = mu, sd = sigma)),
          col = 'red', lty = 0)

pnorm(U, mean = mu, sd = sigma)

## [1] 0.6179114

We can also visualize areas under the curve that are differences in the CDF.

# Visualize area under the curve
curve(dnorm(x, mean = mu, sd = sigma), 
      from = mu - 4*sigma, to = mu + 4*sigma, n = 1001,
      xlab='y', ylab='f(y)', 
      main='Area under the curve (integral)')

L <- -0.2; U <- 0.3
x <- seq(L, U, length = 1001)
polygon(c(x, rev(x)), 
          c(rep(0, length(x)), dnorm(rev(x), mean = mu, sd = sigma)),
          col = 'red', lty = 0)

pnorm(U, mean = mu, sd = sigma) - pnorm(L, mean = mu, sd = sigma)

## [1] 0.1971711

The quantile function will be used later in the semester to help construct confidence/credible intervals.

curve(qnorm(x, mean = mu, sd = sigma),
      from = 0, to = 1, n = 1001, 
      main = 'Quantile function for a standard normal')

Then we can take random draws

draws <- rnorm(100, mean = mu, sd = sigma)
hist(draws, 20, probability = TRUE)
curve(dnorm(x, mean = mu, sd = sigma), n = 1001, 
      add = TRUE, col = 'red')

Activity

Plot the pdf, cdf, and quantile function for a normal distribution with mean -4 and variance 3, i.e. \(X\sim N(-4,3)\). Then sample 999 random N(-4,3) and plot a histogram of these draws with the probability density function. Bonus: plot area under the normal pdf and calculate the appropriate values using the cdf.

Click for solution

mu    <- -4
sigma <- sqrt(3) # standard deviation!!!

opar = par(mfrow=c(2,2))
curve(dnorm(x, mean = mu, sd = sigma), 
      from = mu-4*sigma, to = mu+4*sigma, n = 1001, 
      main = 'PDF for a normal')

curve(pnorm(x, mean = mu, sd = sigma), 
      from = mu-4*sigma, to = mu+4*sigma, n = 1001,
      main = 'CDF for a normal')

curve(qnorm(x, mean = mu, sd = sigma),
      from = 0, to = 1, n = 1001,
      main = 'Quantile function for a normal')

draws <- rnorm(999, mean = mu, sd = sigma)

hist(draws, probability = TRUE)
curve(dnorm(x, mean = mu, sd = sigma), n = 1001,
      add = TRUE, col = 'red')

par(opar)

Quantiles of a normal distribution

For any normal random variable, determine the probability it is

within 1 standard deviation of its mean
within 2 standard deviation of its mean
within 3 standard deviation of its mean

Click for solution

Since this is apparently for any normal, we will just use the standard normal. So if \(X\sim N(\mu, \sigma^2)\), we must determine \(P(|X|<c) = P(-c < X < c) = P(X<c) - P(X< -c)\) for \(c\) in 1,2,3.

# Try any values for mu and sigma
mu    <- 0
sigma <- 1

pnorm(mu + 1 * sigma, mean = mu, sd = sigma) - pnorm(mu - 1 * sigma, mean = mu, sd = sigma)

## [1] 0.6826895

pnorm(mu + 2 * sigma, mean = mu, sd = sigma) - pnorm(mu - 2 * sigma, mean = mu, sd = sigma)

## [1] 0.9544997

pnorm(mu + 3 * sigma, mean = mu, sd = sigma) - pnorm(mu - 3 * sigma, mean = mu, sd = sigma)

## [1] 0.9973002

Application

Suppose a manufacturing line produces temperature sensors and has a sensor failure proportion equal to 1.5%. In a given day, the plant tests 70 sensors.

Normal

The Central Limit Theorem says that with sufficient sample size, a binomial random variable is approximately a normal random variable. If we approximate the binomial distribution with a normal distribution, then \(X\stackrel{\cdot}{\sim} No(70\times 0.015, 70\times 0.015 \times (1-0.015))\). Notice we just match the mean and variance of the two distributions.

Using this normal distribution, answer the following questions.

What is the probability no sensors fail in one day?
What is the probability 3 or more sensors fail in one day?

Click for solution

The first question is tricky, because we need \(P(X=0)\), but this is 0 for any continuous random variable. Thus, we will approximate this probability by calculate the probability that 0 or fewer sensors fail in one day.

n <- 70
p <- 0.015

mn <- n*p # mean
vr <- n*p*(1-p)

dnorm(  0, mean = mn, sd = sqrt(vr)) # INCORRECT (this not even a probability)

## [1] 0.2302081

pnorm(    0, mean = mn, sd = sqrt(vr)) # Approximate probability of 0 failures

## [1] 0.1509265

1 - pnorm(3, mean = mn, sd = sqrt(vr)) # Approximate probability of 3 or more sensor failures

## [1] 0.02759101

We can utilize the continuity correction to improve our approximation. How can you use the continuity correction to improve the approximation of the two probabilities above?

Click for solution

The continuity correction involves adding or subtracting 1/2 to the quantity you are trying to calculate the probability of. The key is trying to determine whether to add or subtract 1/2.

pnorm(    0 + 1/2, mean = mn, sd = sqrt(vr)) # Approximate probability of 0 failures

## [1] 0.294317

1 - pnorm(3 - 1/2, mean = mn, sd = sqrt(vr)) # Approximate probability of 3 or more sensor failures

## [1] 0.07696464

How good is this normal approximation (based on CLT) to the true probabilities calculated using a binomial distribution? (See lab02.)

Lab03 - Normal Distribution in R

Jarad Niemi

2024-09-12

Continuous distributions

Uniform distribution

Activity

Normal distribution

Activity

Quantiles of a normal distribution

Application

Normal