library("tidyverse")
## ── Attaching core tidyverse packages ─── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4.9000 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ───────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
For reproducibility, set the seed.
set.seed(20190219)
Let \(y\sim Bin(n,\theta)\). Recall that the MLE is \(\widehat{\theta}_{MLE} = \frac{y}{n}\). If we assume the prior \(\theta \sim Be(a,b)\), then the posterior is \(\theta|y \sim Be(a+y,b+n-y)\) and the posterior expectation is \[ \widehat{\theta}_{Bayes} = E[\theta|y] = \frac{a+y}{a+b+n}. \] Specifically, if we assume a uniform prior on (0, 1), i.e. \(a=b=1\), then \(\theta|y \sim Be(1+y,1+n-y)\) and the posterior expectation is \[ \widehat{\theta}_{Bayes} = E[\theta|y] = \frac{1+y}{2+n}. \] This is equivalent to adding one success and one failure to the data.
Suppose you randomly sample 22 buildings in Ames and find that 15 of them are residential buildings. Then the MLE for the probability of a building in Ames being residential is
n <- 22
y <- 15
(mle <- y/n)
## [1] 0.6818182
the Bayes estimator is
a <- b <- 1
(bayes <- (a+y)/(a+b+n))
## [1] 0.6666667
and a 95% credible interval is
qbeta(c(.025,.975), a+y, b+n-y)
## [1] 0.4708083 0.8362364
Suppose that you take a larger sample of buildings and find that out of 222 buildings 135 of them are residential. Calculate the MLE, Bayes estimator, and a 90% credible interval for the probability of a house in Ames being residential.
n <- 222
y <- 135
(mle <- y/n)
## [1] 0.6081081
(bayes <- (a+y)/(a+b+n))
## [1] 0.6071429
qbeta(c(.05,.95), a+y, b+n-y)
## [1] 0.5530313 0.6601642
Can you calculate a posterior median?
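Since the posterior is \(Be(a+y, b+n-y)\), the posterior median is its 0.5 quantile; a minimal sketch using the larger sample above:
# Posterior median: the 0.5 quantile of the Be(a+y, b+n-y) posterior
qbeta(0.5, a + y, b + n - y)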
Recall that estimators have the properties of bias and consistency.
In addition, uncertainty intervals, i.e. credible intervals, have a property called coverage.
The bias of an estimator is the expected value of the estimator minus the true value. Generally, we don’t know the true value, but we can simulate data with a known truth. We can estimate the bias by performing repeated simulations and taking the average value of the estimator across those simulations. This average value is an estimate of the expected value. If we take enough values in the average, then the Central Limit Theorem tells us the distribution of this average. Specifically,
\[\overline{x} \sim N(\mu,\sigma^2/n)\] where \(\mu=E[\overline{x}]\) and we estimate \(\sigma^2\) with the sample variance.
Let’s use Monte Carlo simulations to estimate the bias of the MLE and Bayes estimator. To do this, we are going to repeatedly 1) simulate a binomial random variable and 2) compute the MLE and the Bayes estimator. Then we will average each estimator across the simulations and subtract the true value of \(\theta\).
# Binomial model with
# n attempts and probability of success theta
n <- 10
theta <- 0.5
# Storage for estimators
n_reps <- 1e4
mle <- numeric(n_reps)
bayes <- numeric(n_reps)
# Simulate n_reps binomial random variables and
# compute the MLE and Bayes estimators
for (i in 1:n_reps) {
  y <- rbinom(1, size = n, prob = theta)
  mle[i] <- y / n
  bayes[i] <- (a + y) / (a + b + n)
}
mean(mle) - theta # estimate of MLE bias
## [1] 0.00096
mean(bayes) - theta # estimate of Bayes bias
## [1] 8e-04
Now, we probably want some idea of how close we are to the true bias. We will use the CLT. \[ \overline{\widehat{\theta}} \pm z_{0.025} SE(\overline{\widehat{\theta}}) \] where the \(SE(\overline{\widehat{\theta}})\) is estimated by the sample standard deviation of \(\widehat{\theta}\) divided by the square root of the number of values in the average, i.e. the sample size. The formula for a 95% interval is then \[ \overline{\widehat{\theta}} \pm z_{0.025} s_{\widehat{\theta}}/\sqrt{n} \]
# Binomial bias with Monte Carlo uncertainty
mean(mle ) + c(-1, 1) * qnorm(.975) * sd(mle ) / sqrt(length(mle )) - theta
## [1] -0.002153667 0.004073667
mean(bayes) + c(-1, 1) * qnorm(.975) * sd(bayes) / sqrt(length(bayes)) - theta
## [1] -0.001794722 0.003394722
We could have written the code a bit more succinctly (and, perhaps, obtusely).
# Succinct binomial Monte Carlo study
y <- rbinom(n_reps, size = n, prob = theta)
mle <- y / n
bayes <- (a + y) / (a + b + n)
# Calculate bias with Monte Carlo uncertainty
mean(mle ) + c(-1, 1) * qnorm(.975) * sd(mle ) / sqrt(length(mle )) - theta
## [1] -0.002545906 0.003605906
mean(bayes) + c(-1, 1) * qnorm(.975) * sd(bayes) / sqrt(length(bayes)) - theta
## [1] -0.002121588 0.003004922
Repeat the simulation procedure we had above but using \(n=5\) and \(\theta=0.2\). What do you find for the bias of the MLE and the Bayes estimator?
# Adjust n and theta
n <- 5
theta <- 0.2
# Simulate binomials and calculate estimators
y <- rbinom(n_reps, size = n, prob = theta)
mle <- y/n
bayes <- (a+y)/(a+b+n)
# Calculate bias with Monte Carlo uncertainty
mean(mle ) + c(-1, 1) * qnorm(.975) * sd(mle ) / sqrt(length(mle )) - theta
## [1] -0.004077641 0.002877641
mean(bayes) + c(-1, 1) * qnorm(.975) * sd(bayes) / sqrt(length(bayes)) - theta
## [1] 0.08280169 0.08776974
Now the Bayes estimator appears to be quite biased.
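In fact, since \(E[y]=n\theta\), the exact bias of the Bayes estimator is \(\frac{a+n\theta}{a+b+n}-\theta\), which serves as a quick check on the Monte Carlo interval above:
# Exact bias of the Bayes estimator, since E[y] = n * theta
(a + n * theta) / (a + b + n) - theta
## [1] 0.08571429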
Let’s study what happens to the bias of the Bayes estimator as we change \(n\) and \(\theta\). The following will look at \(n=1,10,100,1000\) and \(\theta\) from \(0\) to \(1\) in increments of 0.1.
# Perform the bias calculation for different combinations
# of n and theta
settings <- expand.grid(n = 10^(0:3),
                        theta = seq(0, 1, by = 0.1))
We will use the dplyr package to help us iterate over all of these values of \(n\) and \(\theta\). First we need to construct the function to do a simulation study for every value of n and theta.
# A function to estimate the bias based on a data.frame x
# that contains a single row with two columns: n and theta
sim_function <- function(x) {
  y <- rbinom(1e4, size = x$n, prob = x$theta)
  mle <- y / x$n
  bayes <- (a + y) / (a + b + x$n)

  d <- data.frame(
    estimator = c("mle", "bayes"),
    bias = c(mean(mle), mean(bayes)) - x$theta,
    var = c(var( mle), var( bayes)))

  # d$se <- sqrt(d$var / x$n)
  # d$lower <- d$bias - qnorm(.975) * d$se
  # d$upper <- d$bias + qnorm(.975) * d$se
  return(d)
}
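To see what this function returns, we can call it once on a single setting (example values):
# Quick check: one simulation study at a single combination of n and theta
sim_function(data.frame(n = 10, theta = 0.5))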
Then we can loop over all the values of n and theta.
# Run the simulation study
sim_study <- settings |>
  group_by(n, theta) |> # for each combination of n and theta
  do(sim_function(.))
Let’s plot the Monte Carlo estimated bias.
# Plot estimated bias from the simulation study
ggplot(sim_study,
       aes(
         x = theta,
         y = bias, # plotting bias
         color = estimator,
         linetype = estimator)) +
  geom_line() +
  facet_wrap(~n, scales = "free_y") +
  theme_bw()
We can see that the Bayes estimator has the most bias when 1) the sample size is small and 2) \(\theta\) is far away from 0.5. But the variance of the Bayes estimator is smaller than the variance of the MLE estimator.
# Plot estimated variance from the simulation study
ggplot(sim_study,
       aes(
         x = theta,
         y = var, # plotting variance
         color = estimator,
         linetype = estimator)) +
  geom_line() +
  facet_wrap(~n, scales = "free_y") +
  theme_bw()
We can see that the Bayes estimator always has smaller variance than the MLE.
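This can be verified analytically: since \(Var(y)=n\theta(1-\theta)\), \[ Var(\widehat{\theta}_{MLE}) = \frac{\theta(1-\theta)}{n} \quad\mbox{ and }\quad Var(\widehat{\theta}_{Bayes}) = \frac{n\theta(1-\theta)}{(a+b+n)^2} = \left(\frac{n}{a+b+n}\right)^2 Var(\widehat{\theta}_{MLE}), \] so the Bayes estimator’s variance is smaller by the factor \((n/(a+b+n))^2 < 1\).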
Thus, we often use a new property of an estimator called the mean squared error (MSE): \[ MSE(\widehat{\theta}) = E[(\widehat\theta-\theta)^2] = Var(\widehat{\theta}) + Bias(\widehat{\theta},\theta)^2 \] This property combines both the bias and the variance into a single term. Generally the estimator with the smallest MSE should be utilized.
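This decomposition follows by adding and subtracting \(E[\widehat\theta]\) inside the square: \[ E[(\widehat\theta-\theta)^2] = E[(\widehat\theta-E[\widehat\theta])^2] + (E[\widehat\theta]-\theta)^2 = Var(\widehat{\theta}) + Bias(\widehat{\theta},\theta)^2 \] since the cross term vanishes because \(E[\widehat\theta-E[\widehat\theta]]=0\).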
# Plot estimated mse from the simulation study
sim_study$mse <- sim_study$var + sim_study$bias^2
ggplot(sim_study,
       aes(
         x = theta,
         y = mse, # plotting MSE
         color = estimator,
         linetype = estimator)) +
  geom_line() +
  facet_wrap(~n, scales = "free_y") +
  theme_bw()
Generate a curve for the MSE for the MLE and Bayes estimators for a binomial probability of success when \(n=13\) for various values of \(\theta\).
settings <- expand.grid(n = 13,
                        theta = seq(0, 1, by = 0.1))
sim_study <- settings |>
  group_by(n, theta) |>
  do(sim_function(.))
sim_study$mse <- sim_study$var + sim_study$bias^2
ggplot(sim_study, aes(x = theta, y = mse, color = estimator)) +
  geom_line() +
  facet_wrap(~n) + # This line isn't necessary since n takes only one value
  theme_bw()
Recall that an estimator is consistent if \[ P(|\widehat\theta_n-\theta|<\epsilon) \to 1 \quad\mbox{ as }\quad n\to\infty \] for any \(\epsilon>0\). Here I have made explicit that the estimator depends on \(n\).
To evaluate consistency by simulation, we will perform a sequence of experiments, i.e. one coin flip at a time, and calculate the estimator after each coin flip. Then we will see how many times the estimator is more than \(\epsilon\) away from \(\theta\).
theta <- 0.51
n_max <- 999
sim_function <- function(d) {
  # The argument d is unused; it is required by do() below

  # Simulate a sequence of Bernoulli trials (coin flips)
  x <- rbinom(n_max, size = 1, prob = theta)

  # Storage structure for estimators
  mle <- bayes <- numeric(n_max)

  # Compute estimators for all sample sizes from 1 to n_max
  for (n in 1:n_max) {
    y <- sum(x[1:n])
    mle[n] <- y / n
    bayes[n] <- (a + y) / (a + b + n)
  }

  # Record results
  data.frame(n = 1:n_max,
             mle = mle,
             bayes = bayes)
}
# Repeat 1000 times
d <- data.frame(rep = 1:1e3) |>
  group_by(rep) |>
  do(sim_function(.))
# Calculate probability
epsilon <- 0.05 # feel free to change this
sum <- d |>
  gather(estimator, estimate, mle, bayes) |>
  group_by(n, estimator) |>
  summarize(prob = mean(abs(estimate - theta) < epsilon))
## `summarise()` has grouped output by 'n'. You can override
## using the `.groups` argument.
Now plot the results.
ggplot(sum,
       aes(
         x = n,
         y = prob,
         color = estimator,
         linetype = estimator)) +
  geom_line() +
  theme_bw()
This is an illustration of the property called consistency. For both the Bayes estimator and the MLE, the probability of being within \(\epsilon\) of \(\theta\) (the true success probability) converges to 1 as the number of observations increases.
Via simulation, show that if \(\theta=0.01\) that both the MLE and Bayes estimators are consistent.
theta <- 0.01
d <- data.frame(rep = 1:1e3) |>
  group_by(rep) |>
  do(sim_function(.))
sum <- d |>
  gather(estimator, estimate, mle, bayes) |>
  group_by(n, estimator) |>
  summarize(prob = mean(abs(estimate - theta) < epsilon))
## `summarise()` has grouped output by 'n'. You can override
## using the `.groups` argument.
ggplot(sum, aes(x = n, y = prob, color = estimator, linetype = estimator)) +
  geom_line() +
  theme_bw()
Uncertainty intervals have a property called coverage. Coverage is the probability the interval will contain the truth when the method used to construct the interval is used repeatedly on different data sets.
We will evaluate coverage using simulations by repeatedly 1) simulating data, 2) constructing a credible interval from each simulated data set, and 3) recording whether the interval contains the true value of \(\theta\).
# Binomial parameters
n <- 100
theta <- 0.5
# Simulate binomial random variables
n_reps <- 1e4
y <- rbinom(n_reps, size = n, prob = theta)
# Calculate all endpoints
lower <- qbeta(.025, a + y, b + n - y)
upper <- qbeta(.975, a + y, b + n - y)
# Calculate coverage
mean( lower <= theta & theta <= upper )
## [1] 0.9407
If we want to understand our Monte Carlo uncertainty in this coverage, we can use the CLT for a proportion.
# Monte Carlo uncertainty
p <- mean( lower <= theta & theta <= upper)
p + c(-1, 1) * qnorm(.975) * sqrt(p * (1 - p) / n_reps)
## [1] 0.9360709 0.9453291
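Because \(y\) can only take the values \(0,1,\ldots,n\), the coverage can also be computed exactly by summing the binomial probabilities of the values of \(y\) whose intervals contain \(\theta\); a minimal sketch:
# Exact coverage: sum P(Y = y) over the y whose interval covers theta
y_all <- 0:n
lower_all <- qbeta(.025, a + y_all, b + n - y_all)
upper_all <- qbeta(.975, a + y_all, b + n - y_all)
sum(dbinom(y_all, size = n, prob = theta) * (lower_all <= theta & theta <= upper_all))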
Via simulation, determine the coverage for a credible interval when \(n=10\) and \(\theta = 0.2\) and use the CLT to provide an uncertainty on this coverage.
n <- 10
theta <- 0.2
n_reps <- 1e4
y <- rbinom(n_reps, size = n, prob = theta)
lower <- qbeta(.025, a + y, b + n - y)
upper <- qbeta(.975, a + y, b + n - y)
(p <- mean( lower <= theta & theta <= upper))
## [1] 0.9695
p + c(-1, 1) * qnorm(.975) * sqrt(p * (1 - p)/n_reps)
## [1] 0.9661297 0.9728703
Now try with \(\theta=0.01\).
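A sketch of one solution, reusing the code above:
theta <- 0.01
y <- rbinom(n_reps, size = n, prob = theta)
lower <- qbeta(.025, a + y, b + n - y)
upper <- qbeta(.975, a + y, b + n - y)
(p <- mean(lower <= theta & theta <= upper))
p + c(-1, 1) * qnorm(.975) * sqrt(p * (1 - p) / n_reps)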
Let’s now look at coverage across a wide range of \(n\) and \(\theta\).
# Binomial coverage function
binomial_coverage_function <- function(d) {
  y <- rbinom(1e4, size = d$n, prob = d$theta)

  # Calculate endpoints of credible intervals
  lower <- qbeta(.025, a + y, b + d$n - y)
  upper <- qbeta(.975, a + y, b + d$n - y)

  data.frame(coverage = mean(lower <= d$theta & d$theta <= upper))
}
sim_study <- expand.grid(n = 10^(0:3),
                         theta = seq(0, 1, by = 0.1)) |>
  group_by(n, theta) |>
  do(binomial_coverage_function(.))
Now plot the results.
# Binomial coverage plot
ggplot(sim_study,
       aes(
         x = theta,
         y = coverage)) +
  geom_line() +
  facet_wrap(~n) +
  geom_hline(
    yintercept = 0.95,
    color = "red") +
  ylim(0, 1) +
  theme_bw()
The coverage is extremely poor when \(\theta\) is 0 (or 1) because \(y\) is always 0 (or \(n\)). In these cases, a simple way to fix the interval is to use a one-sided interval when \(y\) is 0 (or \(n\)). Specifically,
binomial_coverage_function_fixed <- function(d) {
  n     <- d$n
  theta <- d$theta

  y <- rbinom(1e4, size = n, prob = theta)
  lower <- qbeta(.025, a + y, b + n - y)
  upper <- qbeta(.975, a + y, b + n - y)

  # Fix intervals when y = 0: use a one-sided interval [0, 95th percentile]
  lower[y == 0] <- 0
  upper[y == 0] <- qbeta(.95, a + 0, b + n)

  # Fix intervals when y = n: use a one-sided interval [5th percentile, 1]
  upper[y == n] <- 1
  lower[y == n] <- qbeta(.05, a + n, b + 0)

  data.frame(coverage = mean(lower <= theta & theta <= upper))
}
sim_study <- expand.grid(n = 10^(0:3),
                         theta = seq(0, 1, by = 0.1)) |>
  group_by(n, theta) |>
  do(binomial_coverage_function_fixed(.))
Now plot the results.
ggplot(sim_study,
       aes(
         x = theta,
         y = coverage)) +
  geom_line() +
  facet_wrap(~n) +
  geom_hline(
    yintercept = 0.95,
    color = "red") +
  ylim(0, 1) +
  theme_bw()
The plot here has been constructed to have y-axis limits set to 0 and 1 for comparison with the previous plot.
In this class, we will use a uniform prior for the probability. An alternative is to use Jeffreys prior, which is \(Be(0.5,0.5)\). Run a simulation study to look at coverage for Jeffreys prior.
a <- b <- 0.5
sim_study <- expand.grid(n = 10^(0:3),
                         theta = seq(0, 1, by = 0.1)) |>
  group_by(n, theta) |>
  do(binomial_coverage_function_fixed(.))
ggplot(sim_study, aes(x = theta, y = coverage)) +
  geom_line() +
  facet_wrap(~n) +
  geom_hline(yintercept = 0.95, color = "red") +
  ylim(0, 1) +
  theme_bw()