using Random
using Distributions
using DifferentialEquations
using DataFrames
using Plots
using StatsPlots
using Turing
using MCMCChains
using CSV
using DrWatson
using StatsBase
Model Checking and Validation
Estimated time: 1.5-2 hours
Introduction
Where we are: You’ve learned to calculate posteriors (Introduction), sample them with MCMC, and check convergence (Diagnostics). Now we ask: even if the sampler worked, does the model make sense?
You’ve learned how to fit models and diagnose MCMC convergence. But a converged chain doesn’t mean your model is correct - it just means the sampler has explored the posterior properly. This session focuses on a different question: does your model actually describe the data well?
Model checking is about building trust in your inferences. A model that doesn’t generate data resembling your observations shouldn’t be trusted to make predictions or inform policy.
Objectives
In this session you will:
- Use prior predictive checks to validate model specification
- Use posterior predictive checks to assess model fit
- Interpret residual patterns to diagnose model problems
- Understand simulation-based calibration for validating inference pipelines
Source file
The source file of this session is located at sessions/model_checking.qmd.
Load libraries
Initialisation
Random.seed!(1234)
TaskLocalRNG()
The Bayesian workflow
Model checking isn’t a one-time step at the end - it’s part of the process. Gelman et al. (2020) describe this as the “Bayesian workflow”:
- Specify model - choose structure, priors, likelihood
- Prior predictive check - do simulations from priors look reasonable?
- Fit model - run MCMC or other inference
- Diagnose sampler - check convergence (previous session)
- Posterior predictive check - does the fitted model generate realistic data?
- Iterate - if checks fail, revise the model
So far we’ve focused on steps 1, 3, and 4. This session covers the checking steps (2 and 5) that tell you whether the model itself makes sense - not just whether the sampler worked.
Prior predictive checks
Before fitting any data, you should ask: do my priors make sense? Prior predictive checking simulates data purely from your prior assumptions, without conditioning on observations.
Let’s use the SIR model from the introduction:
function sir_ode!(du, u, p, t)
R_0, D_inf = p
β = R_0 / D_inf
γ = 1.0 / D_inf
S, I, R, C = u
N = S + I + R
du[1] = -β * S * I / N
du[2] = β * S * I / N - γ * I
du[3] = γ * I
du[4] = β * S * I / N # Cumulative incidence
end
function simulate_sir(R_0, D_inf, S0, I0, times)
times_vec = collect(times)
u0 = [S0, I0, 0.0, 0.0]
prob = ODEProblem(sir_ode!, u0, (times_vec[1], times_vec[end]), [R_0, D_inf])
sol = solve(prob, Tsit5(), saveat=times_vec)
inc_cumulative = sol[4, :]
inc_daily = [0.0; diff(inc_cumulative)]
return inc_daily
end
simulate_sir (generic function with 1 method)
Now let’s define a model with priors and sample from the prior predictive distribution:
@model function sir_model(times, S0, I0, n_obs)
# Priors
R_0 ~ Uniform(0.5, 10.0)
D_inf ~ Uniform(1.0, 14.0)
ρ ~ Uniform(0.0, 1.0) # Reporting rate
# Simulate
inc = simulate_sir(R_0, D_inf, S0, I0, times)
# Likelihood
lambdas = [max(ρ * inc[i], 1e-10) for i in 1:n_obs]
obs ~ arraydist(Poisson.(lambdas))
end
sir_model (generic function with 2 methods)
max(..., 1e-10) everywhere?
You’ll see max(value, 1e-10) throughout the course wherever we create a Poisson distribution. Poisson(λ) requires λ > 0 — passing exactly zero causes an error. In practice, the model can produce zero expected incidence (e.g. before the epidemic starts or after it ends), and numerical noise from the ODE solver can even produce tiny negative values. The max(..., 1e-10) guard ensures we always pass a valid (tiny but positive) rate. The value 1e-10 is small enough to be effectively zero for all practical purposes.
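A minimal illustration of the failure mode and the guard (the tiny negative `λ` stands in for ODE-solver noise):

```julia
using Distributions

λ = -1e-12               # tiny negative value, e.g. from solver round-off
# Poisson(λ)             # would throw a DomainError: λ must be non-negative
d = Poisson(max(λ, 1e-10))  # safe: a valid, effectively-zero rate
```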
# Prior predictive sampling
times = 0.0:1.0:59.0
S0, I0 = 279.0, 5.0
# Sample from priors (no data conditioning — obs will be sampled from the model)
prior_samples = sample(sir_model(times, S0, I0, 60), Prior(), 200, progress=false)
Chains MCMC chain (200×66×1 Array{Float64, 3}):
Iterations = 1:1:200
Number of chains = 1
Samples per chain = 200
Wall duration = 0.9 seconds
Compute duration = 0.9 seconds
parameters = R_0, D_inf, ρ, obs[1], obs[2], …, obs[60]
internals = logprior, loglikelihood, logjoint
Use `describe(chains)` for summary statistics and quantiles.
- `sir_model(times, S0, I0, 60)` — creates a model instance without conditioning on data. Because no observations are provided via `|`, the `obs ~ arraydist(Poisson.(lambdas))` statement acts generatively — Turing samples new observations rather than computing likelihoods. This is different from `decondition()` (which you’ll see below for posterior predictive checks) — here we never conditioned the model in the first place.
- `Prior()` — a sampler that draws from prior distributions only, ignoring the likelihood. This gives us parameter combinations purely from our prior beliefs.
This pattern has no direct R equivalent. In Stan, you’d write a separate generated quantities block for prior predictive simulation. In plain R, you’d write a separate simulation function. In Turing, the same model works for both fitting (condition with `| (; obs = data)`) and simulation (leave unconditioned) — no separate code needed.
# Generate trajectories from prior samples
p_prior = plot(title="Prior Predictive Check",
xlabel="Time (days)", ylabel="Expected daily cases",
legend=false)
for i in 1:100
R_0 = prior_samples[:R_0][i]
D_inf = prior_samples[:D_inf][i]
ρ = prior_samples[:ρ][i]
inc = simulate_sir(R_0, D_inf, S0, I0, times)
obs_expected = ρ .* inc
plot!(p_prior, collect(times), obs_expected, alpha=0.2, color=:blue)
end
plot!(p_prior, ylim=(0, 150))
Questions to ask:
- Do the simulated epidemics look plausible given what you know about influenza?
- Is the range of peak sizes reasonable?
- Are there impossible trajectories (negative values, explosions)?
If prior predictive simulations look unreasonable, revise your priors before fitting data. This is much easier than debugging a fitted model.
The current priors allow \(R_0\) up to 10, which is unrealistic for influenza. Try:
- Change the prior to `R_0 ~ Uniform(1.0, 5.0)` in the model definition above
- Re-run the prior predictive sampling
- Plot the new trajectories alongside the originals
How does the range of epidemic curves change? Are the new trajectories more realistic?
If your prior predictive check shows wildly varying trajectories - some with 0 cases, others with thousands - your priors may be too vague. Consider using domain knowledge to constrain them. E.g. for influenza, \(R_0\) between 1 and 5 might be more realistic than 0.5 to 10.
Posterior predictive checks
After fitting the model, posterior predictive checks ask: can the fitted model generate data that looks like what we observed?
The SIR model we’re about to fit will not capture the Tristan da Cunha data well - specifically, it will miss the second wave around day 25-35. SIR assumes permanent immunity after recovery, so it cannot produce two waves from a single introduction. Posterior predictive checks reveal when our model structure is inadequate, motivating the SEITL/SEIT4L models in later sessions.
# Load data and fit model
flu_tdc = CSV.read(datadir("flu_tdc_1971.csv"), DataFrame)
model = sir_model(times, S0, I0, length(flu_tdc.obs)) | (; obs = flu_tdc.obs)
chain = sample(model, NUTS(), 1000, progress=false)
┌ Info: Found initial step size
└ ϵ = 0.2
Chains MCMC chain (1000×17×1 Array{Float64, 3}):
Iterations = 501:1:1500
Number of chains = 1
Samples per chain = 1000
Wall duration = 12.55 seconds
Compute duration = 12.55 seconds
parameters = R_0, D_inf, ρ
internals = n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size, logprior, loglikelihood, logjoint
Use `describe(chains)` for summary statistics and quantiles.
# Generate posterior predictive samples using predict + decondition
predictions = predict(decondition(model), chain)
# Extract into a list of trajectories for plotting and analysis
n_sims = min(100, size(predictions, 1))
n_obs = length(flu_tdc.obs)
ppc_samples = [
[predictions[Symbol("obs[$j]")][i, 1] for j in 1:n_obs]
for i in 1:n_sims
]
100-element Vector{Vector{Float64}}:
[0.0, 2.0, 4.0, 0.0, 3.0, 1.0, 5.0, 4.0, 14.0, 11.0 … 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 1.0, 3.0, 2.0, 6.0, 5.0, 6.0, 8.0, 12.0, 9.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
⋮
[0.0, 5.0, 4.0, 4.0, 7.0, 3.0, 4.0, 6.0, 8.0, 12.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
predict(decondition(model), chain)
This generates posterior predictive samples in two steps:
- `decondition(model)` — removes the conditioning on observed data. Our `model` was created with `sir_model(...) | (; obs = flu_tdc.obs)`, so `decondition()` reverses this, turning the `obs ~ arraydist(Poisson.(lambdas))` statement from a likelihood computation back into a generative sample.
- `predict(..., chain)` — for each parameter sample in the chain, runs the model forward and samples new observations from the Poisson distribution.
The result is a chain-like object where predictions[Symbol("obs[$j]")] gives the predicted values for the j-th observation across all posterior samples.
# Plot posterior predictive check
p_ppc = plot(title="Posterior Predictive Check",
xlabel="Time (days)", ylabel="Daily cases")
for sim in ppc_samples
plot!(p_ppc, flu_tdc.time, sim, alpha=0.1, color=:lightblue, label="")
end
scatter!(p_ppc, flu_tdc.time, flu_tdc.obs,
color=:red, label="Observed", markersize=4)
plot!(p_ppc, legend=:topright)
What to look for:
- Do the simulated data envelopes contain the observations?
- Are there systematic patterns the model misses?
- Does the model capture the variability in the data?
For the Tristan da Cunha data with a simple SIR model, you’ll likely see the model fails to capture the second wave. This is expected - SIR doesn’t have temporary immunity. This motivates the SEITL/SEIT4L models in later sessions.
Quantifying posterior predictive fit
Visual checks are valuable but subjective. We can also compute summary statistics to quantify fit.
# Compute summary statistics for each posterior predictive sample
function compute_summaries(trajectory)
peak_value = maximum(trajectory)
peak_time = argmax(trajectory) - 1 # Convert to 0-indexed time
total_cases = sum(trajectory)
return (peak=peak_value, peak_time=peak_time, total=total_cases)
end
# Summaries for observed data
obs_summary = compute_summaries(flu_tdc.obs)
println("Observed: peak=$(obs_summary.peak), peak_time=$(obs_summary.peak_time), total=$(obs_summary.total)")
# Summaries for posterior predictive samples
ppc_summaries = [compute_summaries(sim) for sim in ppc_samples]
ppc_peaks = [s.peak for s in ppc_summaries]
ppc_peak_times = [s.peak_time for s in ppc_summaries]
ppc_totals = [s.total for s in ppc_summaries]
Observed: peak=47, peak_time=6, total=312
100-element Vector{Float64}:
212.0
197.0
⋮
207.0
# Plot distributions of summary statistics
p1 = histogram(ppc_peaks, bins=20, alpha=0.7, label="Posterior pred.",
xlabel="Peak cases", ylabel="Count", title="Peak Size")
vline!(p1, [obs_summary.peak], color=:red, linewidth=2, label="Observed")
p2 = histogram(ppc_peak_times, bins=20, alpha=0.7, label="Posterior pred.",
xlabel="Peak time (days)", ylabel="Count", title="Peak Timing")
vline!(p2, [obs_summary.peak_time], color=:red, linewidth=2, label="Observed")
p3 = histogram(ppc_totals, bins=20, alpha=0.7, label="Posterior pred.",
xlabel="Total cases", ylabel="Count", title="Total Cases")
vline!(p3, [obs_summary.total], color=:red, linewidth=2, label="Observed")
plot(p1, p2, p3, layout=(1, 3), size=(900, 300), legend=:topright)
If the observed value (red line) falls in the tails of the posterior predictive distribution, the model struggles to explain that aspect of the data.
- Look at the three histograms above. For which summary statistics (peak size, peak timing, total cases) does the observed value fall well within the posterior predictive distribution?
- For which does it fall in the tails?
- What does this tell you about which aspects of the epidemic the SIR model captures well vs poorly?
We can quantify this with a posterior predictive p-value: the proportion of simulated values that are as extreme as or more extreme than the observed value.
# Posterior predictive p-values
p_peak = mean(ppc_peaks .>= obs_summary.peak)
p_time = mean(ppc_peak_times .>= obs_summary.peak_time)
p_total = mean(ppc_totals .>= obs_summary.total)
println("P-values:")
println(" Peak size: $p_peak")
println(" Peak timing: $p_time")
println("  Total cases: $p_total")
P-values:
Peak size: 0.0
Peak timing: 1.0
Total cases: 0.0
Values close to 0 or 1 indicate the model struggles to reproduce that aspect of the data. A p-value of 0.02 for peak timing, for example, would mean only 2% of simulated epidemics peaked as late as or later than the observed data — suggesting the model systematically predicts earlier peaks.
These p-values are descriptive, not hypothesis tests — there is no “reject if p < 0.05” rule here. They quantify how surprised you should be, given the model. A value of 0.15 doesn’t mean the model is “fine”; a value of 0.03 doesn’t mean it’s “wrong”. Use them alongside visual checks and scientific judgement.
Residual analysis
Residuals are the difference between observed and predicted values. Patterns in residuals can reveal model deficiencies.
# Compute posterior mean predictions
mean_R_0 = mean(chain[:R_0])
mean_D_inf = mean(chain[:D_inf])
mean_ρ = mean(chain[:ρ])
# Use the actual time points from the data
obs_times = flu_tdc.time
mean_inc = simulate_sir(mean_R_0, mean_D_inf, S0, I0, obs_times)
mean_pred = mean_ρ .* mean_inc
# Compute residuals (use explicit indexing to match lengths)
n_obs = length(flu_tdc.obs)
resid = flu_tdc.obs .- mean_pred[1:n_obs]
59-element Vector{Float64}:
0.0
-1.5381319486561553
-3.0837130275101603
⋮
0.9631698699269886
# Plot residuals over time
p_resid = plot(obs_times, resid,
xlabel="Time (days)", ylabel="Residual (obs - pred)",
title="Residuals Over Time", legend=false,
marker=:circle, markersize=3)
hline!(p_resid, [0], color=:black, linestyle=:dash)
What patterns indicate problems?
If residuals show a systematic trend — consistently positive or negative over certain time periods — the model is missing important temporal structure. If residuals at one time point predict residuals at the next (autocorrelation), the model dynamics are incomplete. And if the spread of residuals changes depending on the predicted value (heteroscedasticity — literally “different scatter”), the observation model’s variance structure is probably wrong. For example, with a Poisson observation model the variance equals the mean, so residuals should be larger when predicted cases are high; if they’re much larger than expected, the data may be overdispersed.
For the SIR fit to Tristan da Cunha, you should see clear structure: positive residuals during the second wave that the SIR model can’t capture.
# Check for lag-1 autocorrelation in residuals
resid_t = resid[1:end-1]   # residual at time t
resid_t1 = resid[2:end]    # residual at time t+1
p_acf = scatter(resid_t, resid_t1,
xlabel="Residual at t", ylabel="Residual at t+1",
title="Residual Autocorrelation", legend=false,
alpha=0.6)
Strong correlation here suggests the model is missing important dynamics - exactly what we’d expect when fitting SIR to two-wave data.
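The visual check can be backed by a single number. A minimal sketch (assuming the `resid` vector computed above) that quantifies lag-1 autocorrelation with `cor` from the Statistics standard library:

```julia
using Statistics

# Lag-1 autocorrelation: correlation between each residual
# and the one that follows it
lag1_corr = cor(resid[1:end-1], resid[2:end])
println("Lag-1 residual autocorrelation: $(round(lag1_corr, digits=2))")
# Values near 0 are consistent with independent errors;
# values near 1 indicate strong systematic structure
```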
Simulation-based calibration
A more rigorous check asks: does my inference procedure recover true parameters when I know them?
Simulation-based calibration (SBC) works as follows:
- Draw parameters from the prior: \(\theta^* \sim p(\theta)\)
- Simulate data from the model: \(y^* \sim p(y|\theta^*)\)
- Fit the model to \(y^*\) and get posterior samples
- Check if \(\theta^*\) falls within the posterior as expected
If your inference is correct, the true parameter should be uniformly distributed within the posterior samples (i.e., the 95% credible interval should contain the truth 95% of the time).
# SBC for R_0 (simplified example with fewer iterations for speed)
function run_sbc(n_sbc=20)
ranks = Int[]
for i in 1:n_sbc
# 1. Draw from prior
R_0_true = rand(Uniform(0.5, 10.0))
D_inf_true = rand(Uniform(1.0, 14.0))
ρ_true = rand(Uniform(0.0, 1.0))
# 2. Simulate data
inc = simulate_sir(R_0_true, D_inf_true, S0, I0, times)
obs_sim = rand.(Poisson.(max.(ρ_true .* inc, 1e-10)))
# 3. Fit model (short chain for speed)
model_sim = sir_model(times, S0, I0, length(obs_sim)) | (; obs = obs_sim)
chain_sim = sample(model_sim, NUTS(), 200, progress=false)
# 4. Compute rank of true value
R_0_samples = vec(chain_sim[:R_0])
rank = sum(R_0_samples .< R_0_true)
push!(ranks, rank)
end
return ranks
end
# This takes a while - in practice you'd run more iterations
sbc_ranks = run_sbc(20)
# Plot SBC histogram
histogram(sbc_ranks, bins=10,
xlabel="Rank", ylabel="Count",
title="SBC Rank Histogram for R_0",
legend=false)
A uniform histogram indicates well-calibrated inference. Peaks at the edges suggest the posterior is too narrow (overconfident); a peak in the middle suggests it is too wide (underconfident).
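With only 20 SBC iterations the histogram will be noisy, so it helps to know how much variation uniformity itself allows. A sketch (assuming the `sbc_ranks` vector and 10 bins used above) that computes a 95% band for each bin count under perfect uniformity, using a Binomial quantile:

```julia
using Distributions

n_sbc, n_bins = length(sbc_ranks), 10
# Under uniformity, each bin count follows Binomial(n_sbc, 1/n_bins)
bin_dist = Binomial(n_sbc, 1 / n_bins)
lo, hi = quantile(bin_dist, 0.025), quantile(bin_dist, 0.975)
println("Expected per-bin count band under uniformity: [$lo, $hi]")
# Bins falling well outside this band flag miscalibration
```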
Full SBC requires many simulations (100+) and is computationally expensive. It is particularly important when implementing new models, using complex inference algorithms, or when you need high confidence that your computational pipeline is correct. For established models with well-tested software, prior and posterior predictive checks catch most problems, but SBC provides the strongest guarantee that inference is working as intended.
When models fail checks
What should you do when model checks reveal problems?
If prior predictive simulations look unreasonable, start by examining your priors. Priors that are too vague will generate wildly varying trajectories, so use your knowledge of the system to constrain them. If you see impossible values (negative counts, probabilities outside [0,1]), add appropriate bounds or switch to a distribution family that respects the constraints.
When posterior predictive checks fail, the model itself needs attention. Systematic bias such as predictions consistently too high or too low suggests missing mechanisms. If the model captures the mean but not the variability, your observation model may be wrong; switching from Poisson to Negative Binomial, for instance, allows for overdispersion. When specific features are missed (like a second epidemic wave), consider what biological mechanisms could explain them.
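As a concrete illustration of the observation-model switch, here is a hedged sketch of how a mean/overdispersion pair maps onto Distributions.jl's `NegativeBinomial(r, p)` parameterisation (the names `ϕ` and `mean_to_negbin` are ours, not from the course code):

```julia
using Distributions

# Map (mean μ, overdispersion ϕ) to NegativeBinomial(r, p),
# where mean = r(1-p)/p and variance = μ + μ²/ϕ
function mean_to_negbin(μ, ϕ)
    p = ϕ / (ϕ + μ)
    return NegativeBinomial(ϕ, p)
end

d = mean_to_negbin(10.0, 2.0)
mean(d)  # ≈ 10.0
var(d)   # ≈ 10 + 100/2 = 60.0
# Inside the Turing model one would add a prior such as ϕ ~ Exponential(1)
# and replace Poisson.(lambdas) with mean_to_negbin.(lambdas, ϕ)
```

Unlike the Poisson, whose variance equals its mean, this allows the variance to exceed the mean, absorbing overdispersion in the counts.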
Residual patterns point to specific fixes. Temporal trends in residuals suggest time-varying parameters or additional compartments. Autocorrelated residuals indicate the model dynamics are incomplete. And if residual variance grows with predicted values (heteroscedasticity), the observation model’s variance structure needs adjustment.
Remember that model failure is information. A model that fails to capture the second wave in Tristan da Cunha tells us something important: simple SIR dynamics are insufficient, motivating the temporary immunity models we explore in later sessions.
There is no universal threshold. All models are wrong — the question is whether yours is useful for your specific purpose. A model is good enough when:
- It captures the features you care about (e.g. peak timing for planning surge capacity)
- Posterior predictive checks don’t show systematic failures in those features
- Adding complexity doesn’t meaningfully improve fit for your question
- You’ve documented what the model does and doesn’t capture
A model that misses a second wave is fine if you only care about the first wave’s peak. It’s not fine if you’re trying to understand long-term immunity.
The SIR model fails to capture the second wave in the Tristan da Cunha data. Based on the residual patterns and posterior predictive checks above:
- What biological mechanism might explain a second wave after the first has subsided?
- What structural change to the model would be needed to accommodate this?
- Would changing the observation model (e.g., Poisson to Negative Binomial) help, or is this a problem with the underlying dynamics?
The next session on SEITL models explores one potential answer to these questions.
Summary
Converged MCMC chains don’t guarantee a good model - they only tell us the sampler worked. Model checking asks the deeper question: does our model actually describe the data?
Prior predictive checks simulate from priors before seeing observations, catching unreasonable assumptions early. Posterior predictive checks simulate from the fitted model and compare to data - systematic discrepancies reveal model misspecification.
Residual patterns diagnose specific problems: trends suggest time-varying parameters, autocorrelation suggests missing dynamics. Model checking is inherently iterative - each failure guides the next improvement.
- Prior predictive checks validate that your priors generate plausible data before you see observations
- Posterior predictive checks test whether the fitted model can reproduce key features of the observed data
- Residual patterns diagnose specific model deficiencies: trends, autocorrelation, heteroscedasticity
- Simulation-based calibration rigorously tests whether inference recovers known parameters
- Model failure is informative - it guides model improvement and reveals missing mechanisms
Going further
Prior sensitivity analysis
A key question in Bayesian inference: how much do your conclusions depend on prior choices? Prior sensitivity analysis systematically varies priors and checks if the posterior changes substantially.
Why it matters: If switching from Uniform(1, 20) to Normal(5, 3) for \(R_0\) dramatically changes your posterior, then your data isn’t very informative about \(R_0\) - the prior is doing most of the work. This is important to know before drawing scientific conclusions.
How to do it:
- Vary prior widths: Try priors that are 2× wider and 2× narrower
- Vary prior families: Try Uniform vs Normal vs LogNormal for positive parameters
- Vary prior locations: Shift means if using informative priors
- Compare posteriors: Plot overlapping posterior densities; compute summary statistics
Interpretation:
- If posteriors are similar → conclusions are robust to prior choices
- If posteriors differ substantially → data is weakly informative; report sensitivity
- If posteriors differ only in tails → typical; report credible intervals from multiple priors
For the SIR model, try fitting with R_0 ~ Uniform(1, 10) vs R_0 ~ Uniform(1, 50) and compare the posteriors. The peak timing in our data strongly constrains \(R_0\), so you should see similar posteriors - a sign that the data is informative.
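A sketch of that comparison, assuming the `sir_model`, `simulate_sir`, `times`, `S0`, `I0`, `flu_tdc`, and fitted `chain` objects defined above; `sir_model_wide` is a hypothetical copy of `sir_model` with only the `R_0` prior widened:

```julia
# Hypothetical sensitivity check: refit under a wider R_0 prior
@model function sir_model_wide(times, S0, I0, n_obs)
    R_0 ~ Uniform(1.0, 50.0)   # widened prior
    D_inf ~ Uniform(1.0, 14.0)
    ρ ~ Uniform(0.0, 1.0)
    inc = simulate_sir(R_0, D_inf, S0, I0, times)
    lambdas = [max(ρ * inc[i], 1e-10) for i in 1:n_obs]
    obs ~ arraydist(Poisson.(lambdas))
end

model_wide = sir_model_wide(times, S0, I0, length(flu_tdc.obs)) | (; obs = flu_tdc.obs)
chain_wide = sample(model_wide, NUTS(), 1000, progress=false)

# Overlay the two marginal posteriors for R_0 (density from StatsPlots)
density(vec(chain[:R_0]), label="Uniform(0.5, 10) prior", xlabel="R_0")
density!(vec(chain_wide[:R_0]), label="Uniform(1, 50) prior")
# Near-identical densities indicate the data, not the prior, drives the posterior
```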
From surveillance data to model-ready format
This course uses clean, pre-processed datasets. Real surveillance data requires preparation before modelling. Here are the main considerations:
Data quality issues:
- Missing values: Days with no reports may mean zero cases or missing data. Clarify with data providers.
- Reporting delays: Cases reported today may have symptom onset days or weeks ago. For real-time forecasting, you need to model this delay explicitly or use “nowcasting” methods.
- Backfill corrections: Historical data may be revised as delayed reports arrive. Use version-controlled datasets and document which version you used.
- Definition changes: What counts as a “case” may change during an outbreak (clinical vs. PCR-confirmed vs. rapid test). Document definitions and consider whether time-varying detection rates are needed.
Temporal alignment:
- Ensure your observation times match what your model predicts. If your model tracks infection times but your data reports symptom onset or report dates, you need to account for these delays.
- Weekly data requires careful handling of partial weeks at the start/end of series.
- For models with continuous time, decide how to aggregate (daily incidence vs. point prevalence).
Population and geographic scope:
- Confirm that population denominators match your data source (resident population? Including visitors?).
- For spatial models, ensure geographic boundaries are consistent across data sources.
Practical checklist before fitting:
- Plot the raw data - look for obvious anomalies, zeros, and outliers
- Document the data source, access date, and any transformations applied
- Check that case counts are plausible given population size
- Verify time alignment between model and data
- Consider whether a reporting rate parameter (ρ) is needed
For more on data preprocessing for epidemic models, see Reich et al. (2019) on collaborative forecasting infrastructure and Gostic et al. (2020) on practical identifiability.
Next session
The next session marks a shift in the course. So far, we’ve focused on inference methods using a simple SIR model as a teaching example. You’ve learned:
- How to define models and compute posteriors (Introduction)
- How to sample from posteriors efficiently (MCMC)
- How to verify convergence (MCMC Diagnostics)
- How to check if your model makes sense (this session)
Now we’ll apply these methods to a real scientific problem: explaining the unusual two-wave influenza outbreak on Tristan da Cunha in 1971. This requires a more complex model (SEITL) and, because the population was small, stochastic simulation.
The next session introduces new concepts (stochastic models, Gillespie algorithm, Erlang distributions), but the inference framework remains the same - you’ll use everything you’ve learned.
In the SEITL session, we introduce the Tristan da Cunha dataset and the SEITL model with temporary immunity. This sets up the remaining sessions where we’ll fit stochastic models using particle methods.
References
- Gelman et al. (2020). Bayesian Workflow. arXiv. (The definitive guide to iterative model building)
- Gelman et al. (2013). Bayesian Data Analysis, Chapter 6. (Comprehensive treatment of model checking)
- Gabry et al. (2019). Visualization in Bayesian workflow. JRSS A 182(2), 389-402.
- Talts et al. (2018). Validating Bayesian Inference Algorithms with Simulation-Based Calibration. arXiv.
- Turing.jl documentation - Prior and posterior predictive sampling