Validating predictInterval() against brms

Why compare to brms?

merTools::predictInterval() was written to put prediction intervals on mixed-effect models that are too large or too slow to bootstrap. It does this by simulating from the estimated distribution of the fixed and random effects – fast, but an approximation. Since then, brms has become a standard for fully Bayesian multilevel modeling, and its posterior predictive distribution is about as close to a gold standard as we have. This vignette asks a simple question: how close are predictInterval()’s intervals to the ones brms produces?

We look at three things: a Gaussian random-slopes model, what happens for an entirely unobserved group, and a binomial GLMM. This vignette is precompiled – reproducing it requires the brms package and a working Stan toolchain.

A random-slopes model

We use the bundled hsb data (7,185 students in 160 schools) and model math achievement from student SES, school-mean SES, and a school-varying SES slope. We hold out 20% of students for testing, plus six whole schools to stand in for “new groups” later.

data(hsb)
hsb$schid <- factor(hsb$schid)
new_schools <- sample(levels(hsb$schid), 6)
hsb$.set <- "train"
hsb$.set[hsb$schid %in% new_schools] <- "test_new"
seen <- which(hsb$.set == "train")
hsb$.set[sample(seen, round(0.2 * length(seen)))] <- "test_seen"
train     <- droplevels(hsb[hsb$.set == "train", ])
test_seen <- hsb[hsb$.set == "test_seen" & hsb$schid %in% levels(train$schid), ]
test_new  <- hsb[hsb$.set == "test_new", ]

f1 <- mathach ~ ses + meanses + (ses | schid)
m_lme <- lmer(f1, data = train)
brms_fit_seconds <- system.time(
  m_brm <- fit_brm(f1, train, gaussian(), "brms_hsb_meanses"))[["elapsed"]]

We draw a predictive distribution for each held-out student from each method – predictInterval(returnSims = TRUE) and brms::posterior_predict() – and compute everything (point estimates, intervals, coverage) from those draws.

mt_seen <- mt_draws(m_lme, test_seen)
pp_seen <- posterior_predict(m_brm, newdata = test_seen, allow_new_levels = TRUE)

Point estimates are essentially identical

Both methods condition on the same estimated fixed effects and school BLUPs, so the conditional-mean predictions should agree almost exactly – and they do.

lme_mean <- predict(m_lme, newdata = test_seen)
brm_mean <- colMeans(posterior_epred(m_brm, newdata = test_seen, allow_new_levels = TRUE))
c(correlation = cor(lme_mean, brm_mean),
  mean_abs_diff = mean(abs(lme_mean - brm_mean)),
  response_sd = sd(test_seen$mathach))

#>   correlation mean_abs_diff   response_sd 
#>    0.99984045    0.03670134    6.71745425

ggplot(data.frame(merTools = lme_mean, brms = brm_mean), aes(merTools, brms)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, color = "grey50") +
  geom_point(alpha = 0.3, size = 0.8, color = "#1B9E77") + coord_equal() +
  labs(x = "lme4 predict()", y = "brms posterior_epred()") +
  theme_minimal(base_size = 12)

The prediction intervals agree

b_mt <- qband(mt_seen, 0.90); b_brm <- qband(pp_seen, 0.90)
c(width_merTools = median(b_mt$upr - b_mt$lwr),
  width_brms = median(b_brm$upr - b_brm$lwr),
  cor_lower = cor(b_mt$lwr, b_brm$lwr),
  cor_upper = cor(b_mt$upr, b_brm$upr))

#> width_merTools     width_brms      cor_lower      cor_upper 
#>     20.1789145     20.2344075      0.9901934      0.9899334

endpts <- rbind(data.frame(bound = "lower", merTools = b_mt$lwr, brms = b_brm$lwr),
                data.frame(bound = "upper", merTools = b_mt$upr, brms = b_brm$upr))
ggplot(endpts, aes(merTools, brms, color = bound)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, color = "grey50") +
  geom_point(alpha = 0.3, size = 0.8) + coord_equal() +
  labs(title = "90% prediction-interval endpoints", x = "merTools", y = "brms") +
  theme_minimal(base_size = 12)

And they are equally well calibrated

The real test is out-of-sample: do nominal intervals cover held-out scores at the nominal rate? Both methods track the diagonal, and each other.

data.frame(nominal = NOMINAL,
           merTools = coverage(mt_seen, test_seen$mathach, NOMINAL),
           brms     = coverage(pp_seen, test_seen$mathach, NOMINAL))

#>   nominal  merTools      brms
#> 1    0.50 0.4569776 0.4598698
#> 2    0.80 0.7895879 0.7932032
#> 3    0.90 0.9168474 0.9168474
#> 4    0.95 0.9660159 0.9703543

At a fraction of the cost

t_lme <- system.time(lmer(f1, data = train))[["elapsed"]]
t_mt  <- system.time(mt_draws(m_lme, test_seen))[["elapsed"]]
t_pp  <- system.time(posterior_predict(m_brm, newdata = test_seen,
                                       allow_new_levels = TRUE))[["elapsed"]]
data.frame(  # brms_fit_seconds was captured when the model was fit, above
  step = c("lme4 fit", "predictInterval", "brms fit (compile+sample)", "posterior_predict"),
  seconds = round(c(t_lme, t_mt, brms_fit_seconds, t_pp), 2))

#>                        step seconds
#> 1                  lme4 fit    0.05
#> 2           predictInterval    0.53
#> 3 brms fit (compile+sample)  398.97
#> 4         posterior_predict    1.52

For this model, the entire lme4 + predictInterval() path runs in about a second, while a single brms fit takes several minutes – a couple of orders of magnitude difference, before the (much larger) models predictInterval() was really built for.

What about an entirely new group?

For a school that was never seen during fitting there is a genuine modeling choice: do you drop the school’s effect and predict from the fixed effects only, or sample a plausible effect for it from the estimated school-level distribution? Both packages support either, and the comparison is only fair when the choices match:

treatment of an unseen group	`predictInterval()`	brms
drop the effect (population prediction)	`new.levels = "zero"` (default)	`re_formula = NA`
sample a new effect	`new.levels = "draw"`	`allow_new_levels = TRUE` (default)

Matched up, the two tools agree on both the width and the coverage of the new-school intervals:

nd <- test_new
pp_pop  <- posterior_predict(m_brm, newdata = nd, re_formula = NA)
pp_draw <- posterior_predict(m_brm, newdata = nd, allow_new_levels = TRUE)
mt_zero <- mt_draws(m_lme, nd, new.levels = "zero")
mt_draw <- mt_draws(m_lme, nd, new.levels = "draw")
w <- function(d) median(qband(d, 0.90)$upr - qband(d, 0.90)$lwr)
data.frame(
  unseen_group   = c("drop effect", "sample effect"),
  merTools_width = c(w(mt_zero), w(mt_draw)),
  brms_width     = c(w(pp_pop),  w(pp_draw)),
  merTools_cov90 = c(coverage(mt_zero, nd$mathach, 0.90), coverage(mt_draw, nd$mathach, 0.90)),
  brms_cov90     = c(coverage(pp_pop,  nd$mathach, 0.90), coverage(pp_draw, nd$mathach, 0.90)))

#>    unseen_group merTools_width brms_width merTools_cov90 brms_cov90
#> 1   drop effect       19.96801   20.02855      0.8708487  0.8745387
#> 2 sample effect       20.67910   20.76924      0.8892989  0.8856089

So the interval width for an unseen group is set by the choice, not the method. Earlier I compared predictInterval()’s default against brms’s default, which is an apples-to-oranges pairing – the apparent “gap” was the two packages making different default assumptions, not a deficiency in either.

How much the choice matters depends on how much group variance is left after the fixed effects. Here meanses already absorbs most of the between-school variation, so dropping vs. sampling the school effect barely changes the width. Remove meanses and the school SD roughly doubles, so the two options diverge:

m0 <- lmer(mathach ~ ses + (ses | schid), data = train)
data.frame(
  model = c("with meanses", "without meanses"),
  school_SD = c(attr(VarCorr(m_lme)$schid, "stddev")[["(Intercept)"]],
                attr(VarCorr(m0)$schid,    "stddev")[["(Intercept)"]]),
  width_zero = c(w(mt_draws(m_lme, nd, new.levels = "zero")),
                 w(mt_draws(m0,    nd, new.levels = "zero"))),
  width_draw = c(w(mt_draws(m_lme, nd, new.levels = "draw")),
                 w(mt_draws(m0,    nd, new.levels = "draw"))))

#>             model school_SD width_zero width_draw
#> 1    with meanses  1.598503   19.96801   20.67910
#> 2 without meanses  2.125770   19.94993   21.22576

The practical guidance: predictInterval()’s default (new.levels = "zero") answers “what is the expected outcome for a typical member of an unseen group,” while new.levels = "draw" answers “what is the outcome for an unseen group drawn from the population” – the same quantity brms returns by default. Pick the one that matches your question.

A generalized linear mixed model

The story carries over to GLMMs. We fit a binomial model for whether grouse had ticks and compare predicted probabilities – predictInterval(type = "probability") against brms::posterior_epred().

data(grouseticks)
grouseticks$TICKS_BIN <- as.integer(grouseticks$TICKS >= 1)
grouseticks$cHEIGHT   <- as.numeric(scale(grouseticks$HEIGHT))
gk <- grouseticks
gk$.set <- ifelse(runif(nrow(gk)) < 0.2, "test", "train")
gk_tr <- droplevels(gk[gk$.set == "train", ])
gk_te <- gk[gk$.set == "test" & gk$BROOD %in% levels(gk_tr$BROOD) &
            gk$LOCATION %in% unique(gk_tr$LOCATION), ]

gf <- TICKS_BIN ~ YEAR + cHEIGHT + (1 | BROOD) + (1 | LOCATION)
g_lme <- glmer(gf, family = binomial, data = gk_tr,
               control = glmerControl(optimizer = "bobyqa"))
g_brm <- fit_brm(gf, gk_tr, bernoulli(), "brms_grouseticks")

mt_pr <- mt_draws(g_lme, gk_te, type = "probability", include.resid.var = FALSE)
if (max(mt_pr) > 1 || min(mt_pr) < 0) mt_pr <- plogis(mt_pr)
brm_pr <- posterior_epred(g_brm, newdata = gk_te, allow_new_levels = TRUE)
mt_p <- apply(mt_pr, 2, median); brm_p <- apply(brm_pr, 2, median)
c(correlation = cor(mt_p, brm_p), mean_abs_diff = mean(abs(mt_p - brm_p)))

#>   correlation mean_abs_diff 
#>    0.99488587    0.02704796

ggplot(data.frame(merTools = mt_p, brms = brm_p), aes(merTools, brms)) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, color = "grey50") +
  geom_point(alpha = 0.4, color = "#7570B3") + coord_equal(xlim = c(0, 1), ylim = c(0, 1)) +
  labs(title = "Predicted P(tick): merTools vs brms",
       x = "predictInterval(type = 'probability')", y = "posterior_epred()") +
  theme_minimal(base_size = 12)

Takeaways

For observed groups, predictInterval() reproduces brms’s point estimates and prediction intervals almost exactly, with the same out-of-sample coverage, at a tiny fraction of the computational cost.
For new groups, you choose how to treat the unseen effect: new.levels = "zero" (the default) drops it and predicts from the fixed effects, while new.levels = "draw" samples it from the estimated covariance – matching brms’s re_formula = NA and allow_new_levels = TRUE respectively. Either way the two packages agree; pick the convention that fits your question.
The same agreement holds for GLMMs on the probability scale.

In short, when bootstrapping or full MCMC is impractical, predictInterval() is a fast, well-calibrated stand-in for the cases it was designed to handle.