using Random
using Distributions
using DifferentialEquations
using DataFrames
using Plots
using StatsPlots
using Turing
using MCMCChains
using CSV
using DrWatson
using StatsBase
Model Checking and Validation
Estimated time: 1.5-2 hours
Introduction
Where we are: You’ve learned to calculate posteriors (Introduction), sample them with MCMC, and check convergence (Diagnostics). Now we ask: even if the sampler worked, does the model make sense?
You’ve learned how to fit models and diagnose MCMC convergence. But a converged chain doesn’t mean your model is correct - it just means the sampler has explored the posterior properly. This session focuses on a different question: does your model actually describe the data well?
Model checking is about building trust in your inferences. A model that doesn’t generate data resembling your observations shouldn’t be trusted to make predictions or inform policy.
Objectives
In this session you will:
- Use prior predictive checks to validate model specification
- Use posterior predictive checks to assess model fit
- Interpret residual patterns to diagnose model problems
- Understand simulation-based calibration for validating inference pipelines
Source file
The source file of this session is located at sessions/model_checking.qmd.
Load libraries
Initialisation
Random.seed!(1234)
TaskLocalRNG()
The Bayesian workflow
Model checking isn’t a one-time step at the end - it’s part of the process. Gelman et al. (2020) describe this as the “Bayesian workflow”:
- Specify model - choose structure, priors, likelihood
- Prior predictive check - do simulations from priors look reasonable?
- Fit model - run MCMC or other inference
- Diagnose sampler - check convergence (previous session)
- Posterior predictive check - does the fitted model generate realistic data?
- Iterate - if checks fail, revise the model
So far we’ve focused on steps 1, 3, and 4. This session covers the checking steps (2 and 5) that tell you whether the model itself makes sense - not just whether the sampler worked.
Prior predictive checks
Before fitting any data, you should ask: do my priors make sense? Prior predictive checking simulates data purely from your prior assumptions, without conditioning on observations.
Let’s use the SIR model from the introduction:
function sir_ode!(du, u, p, t)
R_0, D_inf = p
β = R_0 / D_inf
γ = 1.0 / D_inf
S, I, R, C = u
N = S + I + R
du[1] = -β * S * I / N
du[2] = β * S * I / N - γ * I
du[3] = γ * I
du[4] = β * S * I / N # Cumulative incidence
end
function simulate_sir(R_0, D_inf, S0, I0, times)
times_vec = collect(times)
u0 = [S0, I0, 0.0, 0.0]
prob = ODEProblem(sir_ode!, u0, (times_vec[1], times_vec[end]), [R_0, D_inf])
sol = solve(prob, Tsit5(), saveat=times_vec)
inc_cumulative = sol[4, :]
inc_daily = [0.0; diff(inc_cumulative)]
return inc_daily
end
simulate_sir (generic function with 1 method)
Now let’s define a model with priors and sample from the prior predictive distribution:
@model function sir_model(times, S0, I0, n_obs)
# Priors
R_0 ~ Uniform(0.5, 10.0)
D_inf ~ Uniform(1.0, 14.0)
ρ ~ Uniform(0.0, 1.0) # Reporting rate
# Simulate
inc = simulate_sir(R_0, D_inf, S0, I0, times)
# Likelihood
lambdas = [max(ρ * inc[i], 1e-10) for i in 1:n_obs]
obs ~ arraydist(Poisson.(lambdas))
end
sir_model (generic function with 2 methods)
max(..., 1e-10) everywhere?
You’ll see max(value, 1e-10) throughout the course wherever we create a Poisson distribution. Poisson(λ) requires λ > 0 — passing exactly zero causes an error. In practice, the model can produce zero expected incidence (e.g. before the epidemic starts or after it ends), and numerical noise from the ODE solver can even produce tiny negative values. The max(..., 1e-10) guard ensures we always pass a valid (tiny but positive) rate. The value 1e-10 is small enough to be effectively zero for all practical purposes.
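A minimal illustration of the failure mode and the guard (the tiny negative `λ` stands in for ODE-solver noise):

```julia
using Distributions

λ = -1e-12               # tiny negative value, e.g. from solver round-off
# Poisson(λ)             # would throw a DomainError: λ must be non-negative
d = Poisson(max(λ, 1e-10))  # safe: a valid, effectively-zero rate
```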
# Prior predictive sampling
times = 0.0:1.0:59.0
S0, I0 = 279.0, 5.0
# Sample from priors (no data conditioning — obs will be sampled from the model)
prior_samples = sample(sir_model(times, S0, I0, 60), Prior(), 200, progress=false)
Chains MCMC chain (200×66×1 Array{Float64, 3}):
Iterations = 1:1:200
Number of chains = 1
Samples per chain = 200
Wall duration = 0.9 seconds
Compute duration = 0.9 seconds
parameters = R_0, D_inf, ρ, obs[1], obs[2], …, obs[60]
internals = logprior, loglikelihood, logjoint
Use `describe(chains)` for summary statistics and quantiles.
- `sir_model(times, S0, I0, 60)` — creates a model instance without conditioning on data. Because no observations are provided via `|`, the `obs ~ arraydist(Poisson.(lambdas))` statement acts generatively — Turing samples new observations rather than computing likelihoods. This is different from `decondition()` (which you’ll see below for posterior predictive checks) — here we never conditioned the model in the first place.
- `Prior()` — a sampler that draws from prior distributions only, ignoring the likelihood. This gives us parameter combinations purely from our prior beliefs.
This pattern has no direct R equivalent. In Stan, you’d write a separate generated quantities block for prior predictive simulation. In plain R, you’d write a separate simulation function. In Turing, the same model works for both fitting (condition with `| (; obs = data)`) and simulation (leave unconditioned) — no separate code needed.
# Generate trajectories from prior samples
p_prior = plot(title="Prior Predictive Check",
xlabel="Time (days)", ylabel="Expected daily cases",
legend=false)
for i in 1:100
R_0 = prior_samples[:R_0][i]
D_inf = prior_samples[:D_inf][i]
ρ = prior_samples[:ρ][i]
inc = simulate_sir(R_0, D_inf, S0, I0, times)
obs_expected = ρ .* inc
plot!(p_prior, collect(times), obs_expected, alpha=0.2, color=:blue)
end
plot!(p_prior, ylim=(0, 150))
Questions to ask:
- Do the simulated epidemics look plausible given what you know about influenza?
- Is the range of peak sizes reasonable?
- Are there impossible trajectories (negative values, explosions)?
If prior predictive simulations look unreasonable, revise your priors before fitting data. This is much easier than debugging a fitted model.
The current priors allow \(R_0\) up to 10, which is unrealistic for influenza. Try:
- Change the prior to `R_0 ~ Uniform(1.0, 5.0)` in the model definition above
- Re-run the prior predictive sampling
- Plot the new trajectories alongside the originals
How does the range of epidemic curves change? Are the new trajectories more realistic?
If your prior predictive check shows wildly varying trajectories - some with 0 cases, others with thousands - your priors may be too vague. Consider using domain knowledge to constrain them. E.g. for influenza, \(R_0\) between 1 and 5 might be more realistic than 0.5 to 10.
Posterior predictive checks
After fitting the model, posterior predictive checks ask: can the fitted model generate data that looks like what we observed?
The SIR model we’re about to fit will not capture the Tristan da Cunha data well - specifically, it will miss the second wave around day 25-35. SIR assumes permanent immunity after recovery, so it cannot produce two waves from a single introduction. Posterior predictive checks reveal when our model structure is inadequate, motivating the SEITL/SEIT4L models in later sessions.
# Load data and fit model
flu_tdc = CSV.read(datadir("flu_tdc_1971.csv"), DataFrame)
model = sir_model(times, S0, I0, length(flu_tdc.obs)) | (; obs = flu_tdc.obs)
chain = sample(model, NUTS(), 1000, progress=false)
┌ Info: Found initial step size
└ ϵ = 0.2
Chains MCMC chain (1000×17×1 Array{Float64, 3}):
Iterations = 501:1:1500
Number of chains = 1
Samples per chain = 1000
Wall duration = 12.55 seconds
Compute duration = 12.55 seconds
parameters = R_0, D_inf, ρ
internals = n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size, logprior, loglikelihood, logjoint
Use `describe(chains)` for summary statistics and quantiles.
# Generate posterior predictive samples using predict + decondition
predictions = predict(decondition(model), chain)
# Extract into a list of trajectories for plotting and analysis
n_sims = min(100, size(predictions, 1))
n_obs = length(flu_tdc.obs)
ppc_samples = [
[predictions[Symbol("obs[$j]")][i, 1] for j in 1:n_obs]
for i in 1:n_sims
]
100-element Vector{Vector{Float64}}:
[0.0, 2.0, 4.0, 0.0, 3.0, 1.0, 5.0, 4.0, 14.0, 11.0 … 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 1.0, 3.0, 2.0, 6.0, 5.0, 6.0, 8.0, 12.0, 9.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
⋮
[0.0, 5.0, 4.0, 4.0, 7.0, 3.0, 4.0, 6.0, 8.0, 12.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
predict(decondition(model), chain)
This generates posterior predictive samples in two steps:
- `decondition(model)` — removes the conditioning on observed data. Our `model` was created with `sir_model(...) | (; obs = flu_tdc.obs)`, so `decondition()` reverses this, turning the `obs ~ arraydist(Poisson.(lambdas))` statement from a likelihood computation back into a generative sample.
- `predict(..., chain)` — for each parameter sample in the chain, runs the model forward and samples new observations from the Poisson distribution.
The result is a chain-like object where predictions[Symbol("obs[$j]")] gives the predicted values for the j-th observation across all posterior samples.
# Plot posterior predictive check
p_ppc = plot(title="Posterior Predictive Check",
xlabel="Time (days)", ylabel="Daily cases")
for sim in ppc_samples
plot!(p_ppc, flu_tdc.time, sim, alpha=0.1, color=:lightblue, label="")
end
scatter!(p_ppc, flu_tdc.time, flu_tdc.obs,
color=:red, label="Observed", markersize=4)
plot!(p_ppc, legend=:topright)
What to look for:
- Do the simulated data envelopes contain the observations?
- Are there systematic patterns the model misses?
- Does the model capture the variability in the data?
For the Tristan da Cunha data with a simple SIR model, you’ll likely see the model fails to capture the second wave. This is expected - SIR doesn’t have temporary immunity. This motivates the SEITL/SEIT4L models in later sessions.
Quantifying posterior predictive fit
Visual checks are valuable but subjective. We can also compute summary statistics to quantify fit.
# Compute summary statistics for each posterior predictive sample
function compute_summaries(trajectory)
peak_value = maximum(trajectory)
peak_time = argmax(trajectory) - 1 # Convert to 0-indexed time
total_cases = sum(trajectory)
return (peak=peak_value, peak_time=peak_time, total=total_cases)
end
# Summaries for observed data
obs_summary = compute_summaries(flu_tdc.obs)
println("Observed: peak=$(obs_summary.peak), peak_time=$(obs_summary.peak_time), total=$(obs_summary.total)")
# Summaries for posterior predictive samples
ppc_summaries = [compute_summaries(sim) for sim in ppc_samples]
ppc_peaks = [s.peak for s in ppc_summaries]
ppc_peak_times = [s.peak_time for s in ppc_summaries]
ppc_totals = [s.total for s in ppc_summaries]
Observed: peak=47, peak_time=6, total=312
100-element Vector{Float64}:
212.0
197.0
⋮
207.0
# Plot distributions of summary statistics
p1 = histogram(ppc_peaks, bins=20, alpha=0.7, label="Posterior pred.",
xlabel="Peak cases", ylabel="Count", title="Peak Size")
vline!(p1, [obs_summary.peak], color=:red, linewidth=2, label="Observed")
p2 = histogram(ppc_peak_times, bins=20, alpha=0.7, label="Posterior pred.",
xlabel="Peak time (days)", ylabel="Count", title="Peak Timing")
vline!(p2, [obs_summary.peak_time], color=:red, linewidth=2, label="Observed")
p3 = histogram(ppc_totals, bins=20, alpha=0.7, label="Posterior pred.",
xlabel="Total cases", ylabel="Count", title="Total Cases")
vline!(p3, [obs_summary.total], color=:red, linewidth=2, label="Observed")
plot(p1, p2, p3, layout=(1, 3), size=(900, 300), legend=:topright)
If the observed value (red line) falls in the tails of the posterior predictive distribution, the model struggles to explain that aspect of the data.
- Look at the three histograms above. For which summary statistics (peak size, peak timing, total cases) does the observed value fall well within the posterior predictive distribution?
- For which does it fall in the tails?
- What does this tell you about which aspects of the epidemic the SIR model captures well vs poorly?
We can quantify this with a posterior predictive p-value: the proportion of simulated values that are as extreme as or more extreme than the observed value.
# Posterior predictive p-values
p_peak = mean(ppc_peaks .>= obs_summary.peak)
p_time = mean(ppc_peak_times .>= obs_summary.peak_time)
p_total = mean(ppc_totals .>= obs_summary.total)
println("P-values:")
println(" Peak size: $p_peak")
println(" Peak timing: $p_time")
println("  Total cases: $p_total")
P-values:
Peak size: 0.0
Peak timing: 1.0
Total cases: 0.0
Values close to 0 or 1 indicate the model struggles to reproduce that aspect of the data. A p-value of 0.02 for peak timing, for example, would mean only 2% of simulated epidemics peaked as late as or later than the observed data — suggesting the model systematically predicts earlier peaks.
These p-values are descriptive, not hypothesis tests — there is no “reject if p < 0.05” rule here. They quantify how surprised you should be, given the model. A value of 0.15 doesn’t mean the model is “fine”; a value of 0.03 doesn’t mean it’s “wrong”. Use them alongside visual checks and scientific judgement.
Residual analysis
Residuals are the difference between observed and predicted values. Patterns in residuals can reveal model deficiencies.
# Compute posterior mean predictions
mean_R_0 = mean(chain[:R_0])
mean_D_inf = mean(chain[:D_inf])
mean_ρ = mean(chain[:ρ])
# Use the actual time points from the data
obs_times = flu_tdc.time
mean_inc = simulate_sir(mean_R_0, mean_D_inf, S0, I0, obs_times)
mean_pred = mean_ρ .* mean_inc
# Compute residuals (use explicit indexing to match lengths)
n_obs = length(flu_tdc.obs)
resid = flu_tdc.obs .- mean_pred[1:n_obs]
59-element Vector{Float64}:
0.0
-1.5381319486561553
-3.0837130275101603
⋮
0.9631698699269886
# Plot residuals over time
p_resid = plot(obs_times, resid,
xlabel="Time (days)", ylabel="Residual (obs - pred)",
title="Residuals Over Time", legend=false,
marker=:circle, markersize=3)
hline!(p_resid, [0], color=:black, linestyle=:dash)
What patterns indicate problems?
If residuals show a systematic trend — consistently positive or negative over certain time periods — the model is missing important temporal structure. If residuals at one time point predict residuals at the next (autocorrelation), the model dynamics are incomplete. And if the spread of residuals changes depending on the predicted value (heteroscedasticity — literally “different scatter”), the observation model’s variance structure is probably wrong. For example, with a Poisson observation model the variance equals the mean, so residuals should be larger when predicted cases are high; if they’re much larger than expected, the data may be overdispersed.
For the SIR fit to Tristan da Cunha, you should see clear structure: positive residuals during the second wave that the SIR model can’t capture.
# Check for lag-1 autocorrelation in residuals
resid_t = resid[1:end-1]   # residual at time t
resid_t1 = resid[2:end]    # residual at time t+1
p_acf = scatter(resid_t, resid_t1,
xlabel="Residual at t", ylabel="Residual at t+1",
title="Residual Autocorrelation", legend=false,
alpha=0.6)
Strong correlation here suggests the model is missing important dynamics - exactly what we’d expect when fitting SIR to two-wave data.
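The visual check can be backed by a single number. A minimal sketch (assuming the `resid` vector computed above) that quantifies lag-1 autocorrelation with `cor` from the Statistics standard library:

```julia
using Statistics

# Lag-1 autocorrelation: correlation between each residual
# and the one that follows it
lag1_corr = cor(resid[1:end-1], resid[2:end])
println("Lag-1 residual autocorrelation: $(round(lag1_corr, digits=2))")
# Values near 0 are consistent with independent errors;
# values near 1 indicate strong systematic structure
```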
Simulation-based calibration
A more rigorous check asks: does my inference procedure recover true parameters when I know them?
Simulation-based calibration (SBC) works as follows:
- Draw parameters from the prior: \(\theta^* \sim p(\theta)\)
- Simulate data from the model: \(y^* \sim p(y|\theta^*)\)
- Fit the model to \(y^*\) and get posterior samples
- Check if \(\theta^*\) falls within the posterior as expected
If your inference is correct, the true parameter should be uniformly distributed within the posterior samples (i.e., the 95% credible interval should contain the truth 95% of the time).
# SBC for R_0 (simplified example with fewer iterations for speed)
function run_sbc(n_sbc=20)
ranks = Int[]
for i in 1:n_sbc
# 1. Draw from prior
R_0_true = rand(Uniform(0.5, 10.0))
D_inf_true = rand(Uniform(1.0, 14.0))
ρ_true = rand(Uniform(0.0, 1.0))
# 2. Simulate data
inc = simulate_sir(R_0_true, D_inf_true, S0, I0, times)
obs_sim = rand.(Poisson.(max.(ρ_true .* inc, 1e-10)))
# 3. Fit model (short chain for speed)
model_sim = sir_model(times, S0, I0, length(obs_sim)) | (; obs = obs_sim)
chain_sim = sample(model_sim, NUTS(), 200, progress=false)
# 4. Compute rank of true value
R_0_samples = vec(chain_sim[:R_0])
rank = sum(R_0_samples .< R_0_true)
push!(ranks, rank)
end
return ranks
end
# This takes a while - in practice you'd run more iterations
sbc_ranks = run_sbc(20)
# Plot SBC histogram
histogram(sbc_ranks, bins=10,
xlabel="Rank", ylabel="Count",
title="SBC Rank Histogram for R_0",
legend=false)
A uniform histogram indicates well-calibrated inference. Peaks at the edges suggest the posterior is too narrow (overconfident); a peak in the middle suggests it is too wide (underconfident).
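With only 20 SBC iterations the histogram will be noisy, so it helps to know how much variation uniformity itself allows. A sketch (assuming the `sbc_ranks` vector and 10 bins used above) that computes a 95% band for each bin count under perfect uniformity, using a Binomial quantile:

```julia
using Distributions

n_sbc, n_bins = length(sbc_ranks), 10
# Under uniformity, each bin count follows Binomial(n_sbc, 1/n_bins)
bin_dist = Binomial(n_sbc, 1 / n_bins)
lo, hi = quantile(bin_dist, 0.025), quantile(bin_dist, 0.975)
println("Expected per-bin count band under uniformity: [$lo, $hi]")
# Bins falling well outside this band flag miscalibration
```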
Full SBC requires many simulations (100+) and is computationally expensive. It is particularly important when implementing new models, using complex inference algorithms, or when you need high confidence that your computational pipeline is correct. For established models with well-tested software, prior and posterior predictive checks catch most problems, but SBC provides the strongest guarantee that inference is working as intended.
When models fail checks
What should you do when model checks reveal problems?
If prior predictive simulations look unreasonable, start by examining your priors. Priors that are too vague will generate wildly varying trajectories, so use your knowledge of the system to constrain them. If you see impossible values (negative counts, probabilities outside [0,1]), add appropriate bounds or switch to a distribution family that respects the constraints.
When posterior predictive checks fail, the model itself needs attention. Systematic bias such as predictions consistently too high or too low suggests missing mechanisms. If the model captures the mean but not the variability, your observation model may be wrong; switching from Poisson to Negative Binomial, for instance, allows for overdispersion. When specific features are missed (like a second epidemic wave), consider what biological mechanisms could explain them.
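As a concrete illustration of the observation-model switch, here is a hedged sketch of how a mean/overdispersion pair maps onto Distributions.jl's `NegativeBinomial(r, p)` parameterisation (the names `ϕ` and `mean_to_negbin` are ours, not from the course code):

```julia
using Distributions

# Map (mean μ, overdispersion ϕ) to NegativeBinomial(r, p),
# where mean = r(1-p)/p and variance = μ + μ²/ϕ
function mean_to_negbin(μ, ϕ)
    p = ϕ / (ϕ + μ)
    return NegativeBinomial(ϕ, p)
end

d = mean_to_negbin(10.0, 2.0)
mean(d)  # ≈ 10.0
var(d)   # ≈ 10 + 100/2 = 60.0
# Inside the Turing model one would add a prior such as ϕ ~ Exponential(1)
# and replace Poisson.(lambdas) with mean_to_negbin.(lambdas, ϕ)
```

Unlike the Poisson, whose variance equals its mean, this allows the variance to exceed the mean, absorbing overdispersion in the counts.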
Residual patterns point to specific fixes. Temporal trends in residuals suggest time-varying parameters or additional compartments. Autocorrelated residuals indicate the model dynamics are incomplete. And if residual variance grows with predicted values (heteroscedasticity), the observation model’s variance structure needs adjustment.
Remember that model failure is information. A model that fails to capture the second wave in Tristan da Cunha tells us something important: simple SIR dynamics are insufficient, motivating the temporary immunity models we explore in later sessions.
There is no universal threshold. All models are wrong — the question is whether yours is useful for your specific purpose. A model is good enough when:
- It captures the features you care about (e.g. peak timing for planning surge capacity)
- Posterior predictive checks don’t show systematic failures in those features
- Adding complexity doesn’t meaningfully improve fit for your question
- You’ve documented what the model does and doesn’t capture
A model that misses a second wave is fine if you only care about the first wave’s peak. It’s not fine if you’re trying to understand long-term immunity.
The SIR model fails to capture the second wave in the Tristan da Cunha data. Based on the residual patterns and posterior predictive checks above:
- What biological mechanism might explain a second wave after the first has subsided?
- What structural change to the model would be needed to accommodate this?
- Would changing the observation model (e.g., Poisson to Negative Binomial) help, or is this a problem with the underlying dynamics?
The next session on SEITL models explores one potential answer to these questions.
Summary
Converged MCMC chains don’t guarantee a good model - they only tell us the sampler worked. Model checking asks the deeper question: does our model actually describe the data?
Prior predictive checks simulate from priors before seeing observations, catching unreasonable assumptions early. Posterior predictive checks simulate from the fitted model and compare to data - systematic discrepancies reveal model misspecification.
Residual patterns diagnose specific problems: trends suggest time-varying parameters, autocorrelation suggests missing dynamics. Model checking is inherently iterative - each failure guides the next improvement.
- Prior predictive checks validate that your priors generate plausible data before you see observations
- Posterior predictive checks test whether the fitted model can reproduce key features of the observed data
- Residual patterns diagnose specific model deficiencies: trends, autocorrelation, heteroscedasticity
- Simulation-based calibration rigorously tests whether inference recovers known parameters
- Model failure is informative - it guides model improvement and reveals missing mechanisms
Going further
Prior sensitivity analysis
A key question in Bayesian inference: how much do your conclusions depend on prior choices? Prior sensitivity analysis systematically varies priors and checks if the posterior changes substantially.
Why it matters: If switching from Uniform(1, 20) to Normal(5, 3) for \(R_0\) dramatically changes your posterior, then your data isn’t very informative about \(R_0\) - the prior is doing most of the work. This is important to know before drawing scientific conclusions.
How to do it:
- Vary prior widths: Try priors that are 2× wider and 2× narrower
- Vary prior families: Try Uniform vs Normal vs LogNormal for positive parameters
- Vary prior locations: Shift means if using informative priors
- Compare posteriors: Plot overlapping posterior densities; compute summary statistics
Interpretation:
- If posteriors are similar → conclusions are robust to prior choices
- If posteriors differ substantially → data is weakly informative; report sensitivity
- If posteriors differ only in tails → typical; report credible intervals from multiple priors
For the SIR model, try fitting with R_0 ~ Uniform(1, 10) vs R_0 ~ Uniform(1, 50) and compare the posteriors. The peak timing in our data strongly constrains \(R_0\), so you should see similar posteriors - a sign that the data is informative.
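A sketch of that comparison, assuming the `sir_model`, `simulate_sir`, `times`, `S0`, `I0`, `flu_tdc`, and fitted `chain` objects defined above; `sir_model_wide` is a hypothetical copy of `sir_model` with only the `R_0` prior widened:

```julia
# Hypothetical sensitivity check: refit under a wider R_0 prior
@model function sir_model_wide(times, S0, I0, n_obs)
    R_0 ~ Uniform(1.0, 50.0)   # widened prior
    D_inf ~ Uniform(1.0, 14.0)
    ρ ~ Uniform(0.0, 1.0)
    inc = simulate_sir(R_0, D_inf, S0, I0, times)
    lambdas = [max(ρ * inc[i], 1e-10) for i in 1:n_obs]
    obs ~ arraydist(Poisson.(lambdas))
end

model_wide = sir_model_wide(times, S0, I0, length(flu_tdc.obs)) | (; obs = flu_tdc.obs)
chain_wide = sample(model_wide, NUTS(), 1000, progress=false)

# Overlay the two marginal posteriors for R_0 (density from StatsPlots)
density(vec(chain[:R_0]), label="Uniform(0.5, 10) prior", xlabel="R_0")
density!(vec(chain_wide[:R_0]), label="Uniform(1, 50) prior")
# Near-identical densities indicate the data, not the prior, drives the posterior
```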
From surveillance data to model-ready format
This course uses clean, pre-processed datasets. Real surveillance data requires preparation before modelling. Here are the main considerations:
Data quality issues:
- Missing values: Days with no reports may mean zero cases or missing data. Clarify with data providers.
- Reporting delays: Cases reported today may have symptom onset days or weeks ago. For real-time forecasting, you need to model this delay explicitly or use “nowcasting” methods.
- Backfill corrections: Historical data may be revised as delayed reports arrive. Use version-controlled datasets and document which version you used.
- Definition changes: What counts as a “case” may change during an outbreak (clinical vs. PCR-confirmed vs. rapid test). Document definitions and consider whether time-varying detection rates are needed.
Temporal alignment:
- Ensure your observation times match what your model predicts. If your model tracks infection times but your data reports symptom onset or report dates, you need to account for these delays.
- Weekly data requires careful handling of partial weeks at the start/end of series.
- For models with continuous time, decide how to aggregate (daily incidence vs. point prevalence).
Population and geographic scope:
- Confirm that population denominators match your data source (resident population? Including visitors?).
- For spatial models, ensure geographic boundaries are consistent across data sources.
Practical checklist before fitting:
- Plot the raw data - look for obvious anomalies, zeros, and outliers
- Document the data source, access date, and any transformations applied
- Check that case counts are plausible given population size
- Verify time alignment between model and data
- Consider whether a reporting rate parameter (ρ) is needed
For more on data preprocessing for epidemic models, see Reich et al. (2019) on collaborative forecasting infrastructure and Gostic et al. (2020) on practical identifiability.
Next session
The next session marks a shift in the course. So far, we’ve focused on inference methods using a simple SIR model as a teaching example. You’ve learned:
- How to define models and compute posteriors (Introduction)
- How to sample from posteriors efficiently (MCMC)
- How to verify convergence (MCMC Diagnostics)
- How to check if your model makes sense (this session)
Now we’ll apply these methods to a real scientific problem: explaining the unusual two-wave influenza outbreak on Tristan da Cunha in 1971. This requires a more complex model (SEITL) and, because the population was small, stochastic simulation.
The next session introduces new concepts (stochastic models, Gillespie algorithm, Erlang distributions), but the inference framework remains the same - you’ll use everything you’ve learned.
In the SEITL session, we introduce the Tristan da Cunha dataset and the SEITL model with temporary immunity. This sets up the remaining sessions where we’ll fit stochastic models using particle methods.
References
- Gelman et al. (2020). Bayesian Workflow. arXiv. (The definitive guide to iterative model building)
- Gelman et al. (2013). Bayesian Data Analysis, Chapter 6. (Comprehensive treatment of model checking)
- Gabry et al. (2019). Visualization in Bayesian workflow. JRSS A 182(2), 389-402.
- Talts et al. (2018). Validating Bayesian Inference Algorithms with Simulation-Based Calibration. arXiv.
- Turing.jl documentation - Prior and posterior predictive sampling