Many of the standard probability distributions have functions in R to calculate:
The letter after each indicates the letter the function will start with for all distributions.
In constrast to discrete random variables, continuous random variables can take on an uncountably infinite number of values. The easiest way for this to happen is that the random variable can take on any value between two specified values (and infinity counts), i.e. an interval.
Continuous random variables have a probability density function (pdf) instead of a pmf. When integrated from a to b, this pdf gives the probability the random variable will take on a value between a and b. Continuous random variables still have a cdf, quantile function, and random generator that all still have the same interpretation.
The simplest continuous random variable is the uniform random variable. If \(Y\) is a random variable and it is uniformly distributed between a and b, then we write \(Y\sim Unif(a,b)\) and this means that \(Y\) can take on any value between a and b with equal probability.
The probability density function for a uniform random variable is zero outside of a and b (indicating that the random variable cannot take values below a or above b) and is constant at a value of 1/(b-a) from a to b.
a <- 0
b <- 1
# The curve function expects you to give a function of `x` and then it
# (internally) creates a sequence of values from `from` and to `to` and creates
# plots similar to what we had before, but using a line rather than points.
curve(dunif(x, min = a, max = b),
from = -1, to = 2, n = 1001,
xlab='y', ylab='f(y)',
main='Probability density function for Unif(0,1)')
The cumulative distribution function is the integral from negative infinite up to y of the probability density function. Thus it indicates the probability the random variables will take on values less than y, i.e. \(P(Y<y)\). Since the probability density function for the uniform is constant, the integral is simply zero from negative infinite up to a, then a straight line from (a,0) up to (b,1) and then constant after that.
curve(punif(x, min = a, max = b),
from = -1, to = 2, n = 1001,
xlab='y', ylab='F(y)',
main='Cumulative distribution function for Unif(0,1)')
As mentioned previously, the cumulative distribution function provides the area under the pdf, i.e. the integral, from negative infinity up to a value for the random variable.
# Visualize the CDF
curve(dunif(x, min = a, max = b),
from = -1, to = 2, n = 1001,
xlab='y', ylab='f(y)',
main='CDF is area under the PDF (integral)')
U <- 0.3
x <- seq(-1, U, length = 1001)
polygon(c(x, rev(x)),
c(rep(0, length(x)), dunif(rev(x), min = a, max = b)),
col = 'red', lty = 0)
punif(U, min = a, max = b)
## [1] 0.3
We can also visualize areas under the curve that are differences in the CDF.
# Visualize area under the curve
curve(dunif(x, min = a, max = b),
from = -1, to = 2, n = 1001,
xlab='y', ylab='f(y)',
main='Area under the curve (integral)')
L <- 0.3; U <- 0.7
x <- seq(L, U, length = 1001)
polygon(c(x, rev(x)),
c(rep(0, length(x)), dunif(rev(x), min = a, max = b)),
col = 'red', lty = 0)
punif(U, min = a, max = b) - punif(L, min = a, max = b)
## [1] 0.4
To draw random values from the distribution, use the r
version of the function. For instance, here is 100 random Unif(a,b)
draws represented as a histogram.
random_uniforms <- runif(100, min = a, max = b)
hist(random_uniforms,
probability = TRUE,
main = 'Random draws from Unif(0,1)')
curve(dunif(x, min = a, max = b),
add = TRUE, col='red')
Plot the pdf, cdf, and quantile function for a uniform distribution on the interval (13, 65), i.e. \(X\sim Unif(13,65)\). Then sample 999 random uniforms on the interval (13,65) and plot a histogram of these draws with the probability density function. Bonus: plot area under the uniform pdf and calculate the appropriate values using the cdf.
a <- 13
b <- 65
opar = par(mfrow=c(2,2)) # Create a 2x2 grid of figures
curve(dunif(x, min = a, max = b),
from = a - 1, to = b + 1, n = 1001,
xlab='y', ylab='f(y)',
main='Probability density function for Unif(13,65)')
curve(punif(x, min = a, max = b),
from = a - 1, to = b + 1, n = 1001,
xlab='y', ylab='f(y)',
main='Cumulative distribution function for Unif(13,65)')
curve(qunif(x, min = a, max = b),
from = 0, to = 1,
xlab='y', ylab='f(y)',
main='Quantile function for Unif(13,65)')
random_uniforms <- runif(999, min = a, max = b)
hist(random_uniforms, probability = TRUE,
main = 'Random draws from Unif(13,65)')
curve(dunif(x, min = a, max = b),
add=TRUE, col='red')
par(opar) # Revert back to 1x1 grid
The most important distribution for this course is the normal (Gaussian) distribution. The normal distribution has two parameters: the mean (\(\mu\)) and the variance (\(\sigma^2\)) and we write \(X\sim N(\mu,\sigma^2)\). If \(\mu=0\) and \(\sigma^2=1\), we call the associated random variable a standard normal. The probability density function for the normal distribution is the well-known bell-shaped curve.
mu <- 0
sigma <- 1 # standard deviation
curve(dnorm(x, mean = mu, sd = sigma), # notice the 3rd argument is the sd
from = mu - 4*sigma, to = mu + 4*sigma, n = 1001,
ylab = 'f(x)', xlab = 'x',
main = 'PDF for a standard normal')
The cdf function has a sigmoid shape.
curve(pnorm(x, mean = mu, sd = sigma),
from = mu - 4*sigma, to = mu + 4*sigma, n = 1001,
main = 'CDF for a standard normal',
ylab = 'F(x)')
As mentioned previously, the cumulative distribution function provides the area under the pdf, i.e. the integral, from negative infinity up to a value for the random variable.
# Visualize the CDF
curve(dnorm(x, mean = mu, sd = sigma),
from = mu - 4*sigma, to = mu + 4*sigma, n = 1001,
xlab='y', ylab='f(y)',
main='CDF is area under the PDF (integral)')
U <- 0.3
x <- seq(mu - 4*sigma, U, length = 1001)
polygon(c(x, rev(x)),
c(rep(0, length(x)), dnorm(rev(x), mean = mu, sd = sigma)),
col = 'red', lty = 0)
pnorm(U, mean = mu, sd = sigma)
## [1] 0.6179114
We can also visualize areas under the curve that are differences in the CDF.
# Visualize area under the curve
curve(dnorm(x, mean = mu, sd = sigma),
from = mu - 4*sigma, to = mu + 4*sigma, n = 1001,
xlab='y', ylab='f(y)',
main='Area under the curve (integral)')
L <- -0.2; U <- 0.3
x <- seq(L, U, length = 1001)
polygon(c(x, rev(x)),
c(rep(0, length(x)), dnorm(rev(x), mean = mu, sd = sigma)),
col = 'red', lty = 0)
pnorm(U, mean = mu, sd = sigma) - pnorm(L, mean = mu, sd = sigma)
## [1] 0.1971711
The quantile function will be used later in the semester to help construct confidence/credible intervals.
curve(qnorm(x, mean = mu, sd = sigma),
from = 0, to = 1, n = 1001,
main = 'Quantile function for a standard normal')
Then we can take random draws
draws <- rnorm(100, mean = mu, sd = sigma)
hist(draws, 20, probability = TRUE)
curve(dnorm(x, mean = mu, sd = sigma), n = 1001,
add = TRUE, col = 'red')
Plot the pdf, cdf, and quantile function for a normal distribution with mean -4 and variance 3, i.e. \(X\sim N(-4,3)\). Then sample 999 random N(-4,3) and plot a histogram of these draws with the probability density function. Bonus: plot area under the normal pdf and calculate the appropriate values using the cdf.
mu <- -4
sigma <- sqrt(3) # standard deviation!!!
opar = par(mfrow=c(2,2))
curve(dnorm(x, mean = mu, sd = sigma),
from = mu-4*sigma, to = mu+4*sigma, n = 1001,
main = 'PDF for a normal')
curve(pnorm(x, mean = mu, sd = sigma),
from = mu-4*sigma, to = mu+4*sigma, n = 1001,
main = 'CDF for a normal')
curve(qnorm(x, mean = mu, sd = sigma),
from = 0, to = 1, n = 1001,
main = 'Quantile function for a normal')
draws <- rnorm(999, mean = mu, sd = sigma)
hist(draws, probability = TRUE)
curve(dnorm(x, mean = mu, sd = sigma), n = 1001,
add = TRUE, col = 'red')
par(opar)
For any normal random variable, determine the probability it is
Since this is apparently for any normal, we will just use the standard normal. So if \(X\sim N(\mu, \sigma^2)\), we must determine \(P(|X|<c) = P(-c < X < c) = P(X<c) - P(X< -c)\) for \(c\) in 1,2,3.
# Try any values for mu and sigma
mu <- 0
sigma <- 1
pnorm(mu + 1 * sigma, mean = mu, sd = sigma) - pnorm(mu - 1 * sigma, mean = mu, sd = sigma)
## [1] 0.6826895
pnorm(mu + 2 * sigma, mean = mu, sd = sigma) - pnorm(mu - 2 * sigma, mean = mu, sd = sigma)
## [1] 0.9544997
pnorm(mu + 3 * sigma, mean = mu, sd = sigma) - pnorm(mu - 3 * sigma, mean = mu, sd = sigma)
## [1] 0.9973002
Suppose a manufacturing line produces temperature sensors and has a sensor failure proportion equal to 1.5%. In a given day, the plant tests 70 sensors.
The Central Limit Theorem says that with sufficient sample size, a binomial random variable is approximately a normal random variable. If we approximate the binomial distribution with a normal distribution, then \(X\stackrel{\cdot}{\sim} No(70\times 0.015, 70\times 0.015 \times (1-0.015))\). Notice we just match the mean and variance of the two distributions.
Using this normal distribution, answer the following questions.
The first question is tricky, because we need \(P(X=0)\), but this is 0 for any continuous random variable. Thus, we will approximate this probability by calculate the probability that 0 or fewer sensors fail in one day.
n <- 70
p <- 0.015
mn <- n*p # mean
vr <- n*p*(1-p)
dnorm( 0, mean = mn, sd = sqrt(vr)) # INCORRECT (this not even a probability)
## [1] 0.2302081
pnorm( 0, mean = mn, sd = sqrt(vr)) # Approximate probability of 0 failures
## [1] 0.1509265
1 - pnorm(3, mean = mn, sd = sqrt(vr)) # Approximate probability of 3 or more sensor failures
## [1] 0.02759101
The continuity correction involves adding or subtracting 1/2 to the quantity you are trying to calculate the probability of. The key is trying to determine whether to add or subtract 1/2.
pnorm( 0 + 1/2, mean = mn, sd = sqrt(vr)) # Approximate probability of 0 failures
## [1] 0.294317
1 - pnorm(3 - 1/2, mean = mn, sd = sqrt(vr)) # Approximate probability of 3 or more sensor failures
## [1] 0.07696464
How good is this normal approximation (based on CLT) to the true
probabilities calculated using a binomial distribution? (See lab02.)