MCMC Diagnostics

Estimated time: 1.5-2 hours

Lecture slides

Introduction

In the previous session, you learned how MCMC algorithms explore posterior distributions by taking random walks through parameter space. You implemented Metropolis-Hastings yourself and saw how Turing.jl’s NUTS algorithm automates everything with gradient-based sampling. But there’s a critical question we haven’t yet fully addressed: How do you know if the inference procedure has worked?

MCMC algorithms don’t come with guarantees. A chain will always give you numbers, but those numbers might not be representative posterior samples. The chain could be stuck exploring only part of the parameter space, still working towards the target distribution, or mixing so slowly that your “samples” are nearly duplicates. MCMC only guarantees unbiased sampling in the limit of infinitely many samples. Using unreliable MCMC output leads to incorrect inference - wrong parameter estimates, overconfident intervals, and bad scientific conclusions.

This session teaches you how to diagnose MCMC chains systematically. You will learn to spot problems visually with trace plots, quantify convergence with R̂ and effective sample size, and understand when your posterior samples are trustworthy. These diagnostic skills are essential for any serious Bayesian analysis.

Objectives

In this session, you will learn how to assess whether your MCMC chains have converged and whether your posterior samples are reliable. This is an important component of trustworthy Bayesian analysis.

You will learn to:

  1. Visually diagnose chains using trace plots and other visualisations
  2. Calculate quantitative diagnostics like R̂ and effective sample size
  3. Identify common problems and how to fix them
  4. Use Turing.jl’s built-in diagnostics for automated checking
  5. Apply best practices for robust MCMC in practice

Source file

The source file of this session is located at sessions/mcmc_diagnostics.qmd.

Load libraries

using Random ## for random numbers
using Distributions ## for probability distributions
using DifferentialEquations ## for differential equations
using DataFrames ## for data frames
using Plots ## for plots
using StatsPlots ## for statistical plots
using Turing ## for probabilistic programming and MCMC
using MCMCChains ## for MCMC diagnostics
using StatsBase ## for autocorrelation

Initialisation

We set a random seed for reproducibility. This ensures that you get exactly the same results on your computer as we do.

Random.seed!(12345)
TaskLocalRNG()

Load data and model from previous session

We’ll use the same SIR model and epidemic data from previous sessions.

# Initial state and time range
init_state = [999, 1, 0]
t0, tend = (0, 30)

# Epidemic data
epi1 = DataFrame(
    time = 1:1:30,
    obs = [
        1, 3, 4, 11, 19, 38, 67, 112, 183, 210, 237, 182, 178, 154, 95, 68, 51,
        30, 32, 18, 19, 9, 2, 2, 2, 0, 0, 1, 0, 0
    ]
)

# Define our SIR model in Turing.jl (same as MCMC session)
@model function sir_model(init_state, times, n_obs)
    # Priors
    R_0 ~ Uniform(1.0, 20.0)
    D_inf ~ Uniform(1.0, 14.0)

    # Simulate ODE
    prob = ODEProblem(sir_ode!, init_state, (times[1], times[end]), [R_0, D_inf])
    sol = solve(prob, saveat = times)

    # Likelihood
    I = sol[2, :]
    # Use max to handle tiny numerical errors from automatic differentiation
    # Use max to guard against tiny negative values from numerical error in the ODE solve
    lambdas = [max(I[i], 0.0) for i in 1:n_obs]
    obs ~ arraydist(Poisson.(lambdas))
end

# Create model instance
model = sir_model(init_state, epi1.time, length(epi1.obs)) | (; obs = epi1.obs)

Visual diagnostics

1. Trace plots: The most important diagnostic

Trace plots show parameter values over iterations. A well-mixed chain looks like a “hairy caterpillar” - densely filled with rapid up-and-down fluctuations around a stable mean:

# Good chain: well-mixing NUTS
chain_good = sample(model, NUTS(), 2000)
# chain_good[:R_0] extracts the samples for R_0 from the chain object
# You can use this with any parameter name defined in your @model function
plot(
    chain_good[:R_0], title = "Good Chain: R_0", xlabel = "Iteration",
    ylabel = "R_0", linewidth = 1
)
# For comparison, let's look at the other parameter too
p1 = plot(
    chain_good[:R_0], title = "R_0 Trace", xlabel = "Iteration",
    ylabel = "R_0", linewidth = 1
)
p2 = plot(
    chain_good[:D_inf], title = "D_inf Trace", xlabel = "Iteration",
    ylabel = "D_inf", linewidth = 1
)

plot(p1, p2, layout = (2, 1), size = (800, 600))

The features that you’ll want to look out for in trace plots are:

  • Good mixing: Chain moves freely around parameter space
  • Stationarity: No obvious trends or drifts
  • Fat hairy caterpillar: Rapid exploration of the posterior

Features to avoid are:

  • Poor mixing: Chain gets stuck in regions
  • Trends: Chain still climbing/descending
  • Thin worm: Chain not exploring enough
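To see these failure modes concretely, here is an illustrative sketch (a toy Normal target, not the SIR model): a random-walk Metropolis sampler with deliberately mis-tuned proposal step sizes.

```julia
# Illustrative sketch (a toy target, not the SIR model): random-walk
# Metropolis on a standard Normal, with deliberately mis-tuned step sizes.
using Random, Statistics

logtarget(x) = -0.5 * x^2  # standard Normal log-density, up to a constant

function rw_metropolis(step; n = 2000, x0 = 0.0, rng = Random.default_rng())
    chain = Vector{Float64}(undef, n)
    x, lp = x0, logtarget(x0)
    accepted = 0
    for i in 1:n
        x_new = x + step * randn(rng)
        lp_new = logtarget(x_new)
        if log(rand(rng)) < lp_new - lp  # Metropolis acceptance step
            x, lp = x_new, lp_new
            accepted += 1
        end
        chain[i] = x
    end
    return chain, accepted / n
end

Random.seed!(1)
chain_tiny, acc_tiny = rw_metropolis(0.01)  # "thin worm": barely explores
chain_huge, acc_huge = rw_metropolis(50.0)  # "stuck": long flat stretches
println("acceptance rate, step = 0.01: ", round(acc_tiny, digits = 2))
println("acceptance rate, step = 50:   ", round(acc_huge, digits = 2))
```

Plotting `chain_tiny` and `chain_huge` as trace plots shows the two pathologies: near-100% acceptance with almost no movement, and near-0% acceptance with long flat segments.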

2. Running averages

Running averages should stabilise if the chain has converged:

function running_mean(x)
    return [mean(x[1:i]) for i in eachindex(x)]
end

R_0_samples = vec(chain_good[:R_0])
p1 = plot(
    running_mean(R_0_samples), title = "Running Mean: R_0",
    xlabel = "Iteration", ylabel = "Running Mean R_0", linewidth = 2
)
hline!(
    p1, [mean(R_0_samples)], color = :red, linestyle = :dash,
    label = "Final mean"
)

D_inf_samples = vec(chain_good[:D_inf])
p2 = plot(
    running_mean(D_inf_samples), title = "Running Mean: D_inf",
    xlabel = "Iteration", ylabel = "Running Mean D_inf", linewidth = 2
)
hline!(
    p2, [mean(D_inf_samples)], color = :red, linestyle = :dash,
    label = "Final mean"
)

plot(p1, p2, layout = (2, 1))

3. Autocorrelation

MCMC samples are correlated. Good chains have autocorrelation that decays quickly:

function autocorr_plot(chain, param_name; max_lag = 100)
    samples = Array(chain[param_name])  # Convert to plain array
    lags = 0:max_lag
    # autocor from StatsBase expects a vector and returns scalar for each lag
    autocorrs = [autocor(samples, [lag])[1] for lag in lags]

    plot(
        lags, autocorrs, title = "Autocorrelation: $param_name",
        xlabel = "Lag", ylabel = "Autocorrelation", linewidth = 2
    )
    hline!([0], color = :gray, linestyle = :dash, label = "")
    hline!(
        [-0.05, 0.05], color = :red, linestyle = :dash, label = "±5% threshold"
    )
end

p1 = autocorr_plot(chain_good, :R_0)
p2 = autocorr_plot(chain_good, :D_inf)

plot(p1, p2, layout = (2, 1), size = (800, 600))

We can see that the autocorrelation drops quickly to near zero and both parameters show good mixing: after ~20-30 MCMC steps, samples are essentially independent.

Exercise: Interpret the visual diagnostics

Look back at the trace plots, running mean plots, and autocorrelation plots above.

  1. What would a bad trace plot look like? Sketch or describe what you would see if the chain was stuck or not mixing well.
  2. If the running mean was still trending upward after 2000 iterations, what would that suggest?
  3. If autocorrelation remained high (say > 0.5) even at lag 100, what would that indicate about your effective sample size?

4. Burn-in and warmup

MCMC chains often start from arbitrary initial values and need time to “find” the high-probability regions of the posterior. Samples from this initial exploration phase (called “burn-in”) are discarded because they don’t represent the target distribution.

Modern samplers like NUTS have automatic warmup periods where they adapt the step size to achieve good acceptance rates, learn the geometry of the posterior and explore to find typical parameter values.

In our earlier example, when we ran sample(model, NUTS(), 2000), Turing actually ran ~3000 iterations total: 1000 warmup iterations (automatically discarded) plus 2000 sampling iterations. The returned chain contains only the 2000 post-warmup samples.

By default, NUTS uses 1000 warmup iterations regardless of how many sampling iterations you request. You can change this with the n_adapts keyword: NUTS(0.65; n_adapts=2000) uses 2000 warmup iterations instead. For most models, the default 1000 is sufficient.

With NUTS, you typically don’t need to manually discard burn-in samples. The warmup is automatic and adaptive. However, you should still check trace plots to ensure the chain has converged - if you see trends even after warmup, you may need longer adaptation.
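NUTS discards its warmup automatically, but for samplers without adaptive warmup (like the Metropolis-Hastings sampler you wrote in the previous session) burn-in removal is a manual step. A minimal sketch:

```julia
# Drop the first `frac` of a raw chain as burn-in. Only needed for samplers
# without automatic warmup; NUTS chains returned by Turing are already
# post-warmup.
drop_burnin(x::AbstractVector; frac = 0.25) = x[(floor(Int, frac * length(x)) + 1):end]

raw = collect(1:2000)                # stand-in for a raw chain
kept = drop_burnin(raw; frac = 0.25)
println(length(kept))                # 1500 samples remain
```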

Quantitative diagnostics

1. R̂ statistic (Gelman-Rubin diagnostic)

R̂ compares within-chain and between-chain variance. It should be < 1.1 (ideally < 1.01):
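To make the within/between comparison concrete, here is a hand-rolled sketch of the basic (non-split) Gelman-Rubin statistic; `rhat()` from MCMCChains computes the more robust split version.

```julia
# Hand-rolled sketch of the basic (non-split) Gelman-Rubin R̂: compare the
# mean within-chain variance W to the between-chain variance B.
using Random, Statistics

function basic_rhat(chains::Vector{<:AbstractVector})
    n = length(first(chains))
    W = mean(var.(chains))             # mean within-chain variance
    B = n * var(mean.(chains))         # between-chain variance of the means
    var_hat = (n - 1) / n * W + B / n  # pooled posterior variance estimate
    return sqrt(var_hat / W)
end

Random.seed!(42)
converged = [randn(1000) for _ in 1:4]                # chains agree: R̂ ≈ 1
two_modes = [randn(1000) .+ 5 * (i % 2) for i in 1:4] # chains disagree: R̂ >> 1
println("R̂, converged: ", round(basic_rhat(converged), digits = 3))
println("R̂, two modes: ", round(basic_rhat(two_modes), digits = 3))
```

When chains sample the same distribution, between-chain variance matches within-chain variance and R̂ ≈ 1; when chains sit in different regions, B dominates and R̂ grows well above 1.1.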

Note: Running Julia with multiple threads

To run multiple chains in parallel, Julia must be started with multiple threads. Check your current thread count with Threads.nthreads(). If it shows 1, restart Julia with:

julia -t auto          # Use all available CPU cores
julia -t 4             # Use exactly 4 threads

In VS Code, you can set this permanently in settings: search for “Julia: Num Threads” and set it to “auto” or a specific number. Without multiple threads, chains will run sequentially (slower but still correct).

# Run multiple chains for R̂ calculation
# Detect available threads to avoid overwhelming CI systems
n_chains = max(2, min(4, Threads.nthreads()))
println("Running $n_chains chains in parallel ($(Threads.nthreads()) threads available)")
chain_multi = sample(model, NUTS(), MCMCThreads(), 2000, n_chains)
  • MCMCThreads() - tells Turing to run multiple chains in parallel using Julia threads. Much faster than running sequentially on multi-core machines.
  • sample(model, sampler, MCMCThreads(), n_samples, n_chains) - the extended signature for parallel sampling. Here we request n_chains chains, each with n_samples samples.
  • Threads.nthreads() - returns how many threads Julia has available. Start Julia with julia -t auto to use all CPU cores.
# Check what each individual chain found
println("Individual chain summaries:")
for i in 1:n_chains
    chain_i = chain_multi[:, :, i]
    r0_mean = mean(chain_i[:R_0])
    dinf_mean = mean(chain_i[:D_inf])
    println(
        "  Chain $i: R_0 = $(round(r0_mean, digits = 2)), ",
        "D_inf = $(round(dinf_mean, digits = 2))"
    )
end

# Calculate R̂
println("\nR̂ values:")
rhat_values = DataFrame(rhat(chain_multi))
for row in eachrow(rhat_values)
    param = row.parameters
    value = row.rhat
    status = value < 1.1 ? "good" : "bad"
    println("  $param: $(round(value, digits = 3)) $status")
end
Individual chain summaries:
  Chain 1: R_0 = 2.45, D_inf = 1.94
  Chain 2: R_0 = 2.45, D_inf = 1.95

R̂ values:
  R_0: 1.004 good
  D_inf: 1.004 good
  • chain_multi[:, :, i] - extracts the i-th chain from a multi-chain object. The three dimensions are: samples × parameters × chains.
  • rhat(chain) - calculates the R̂ (Gelman-Rubin) diagnostic. Compares variance within chains to variance between chains. Values < 1.1 indicate convergence; < 1.01 is even better.
# Visualise multiple chains
plot(chain_multi)

2. Effective Sample Size (ESS)

ESS tells you how many independent samples your correlated MCMC samples are equivalent to:

ess_values = DataFrame(ess(chain_multi))
n_samples = length(chain_multi)
2000
  • ess(chain) - calculates Effective Sample Size. MCMC samples are correlated, so N samples don’t give you N independent draws. ESS estimates how many independent samples your chain is equivalent to.
println("Effective Sample Size:")
for row in eachrow(ess_values)
    param = row.parameters
    value = row.ess
    efficiency = value / n_samples * 100
    status = value > 100 ? "good" : "bad"  # Rule of thumb: ESS > 100
    println(
        "  $param: $(round(value, digits = 0)) / $n_samples ",
        "($(round(efficiency, digits = 1))%) $status"
    )
end
Effective Sample Size:
  R_0: 1040.0 / 2000 (52.0%) good
  D_inf: 1087.0 / 2000 (54.4%) good

ESS interpretation:

  • ESS ≈ N: Very efficient, little autocorrelation
  • ESS << N: High autocorrelation, need more samples
  • ESS < 100: Generally insufficient for reliable inference
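Conceptually, ESS divides the sample size by the autocorrelation "cost": ESS = N / (1 + 2 Σₖ ρₖ). A pedagogical sketch (the `ess()` estimator in MCMCChains is more sophisticated, but the idea is the same):

```julia
# Pedagogical sketch of ESS = N / (1 + 2 Σ ρ_k), truncating the sum at the
# first non-positive autocorrelation.
using Random, Statistics

function basic_ess(x::AbstractVector)
    n = length(x)
    μ, v = mean(x), var(x)
    ρsum = 0.0
    for k in 1:(n - 1)
        ρ = sum((x[i] - μ) * (x[i + k] - μ) for i in 1:(n - k)) / ((n - k) * v)
        ρ <= 0 && break  # simple initial-positive-sequence truncation
        ρsum += ρ
    end
    return n / (1 + 2ρsum)
end

Random.seed!(7)
iid = randn(2000)                 # independent draws: ESS ≈ N
ar = zeros(2000)                  # AR(1) with strong autocorrelation: ESS << N
for t in 2:2000
    ar[t] = 0.9 * ar[t - 1] + randn()
end
println("ESS, iid:     ", round(Int, basic_ess(iid)))
println("ESS, AR(0.9): ", round(Int, basic_ess(ar)))
```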

3. Monte Carlo Standard Error (MCSE)

MCSE quantifies uncertainty in your posterior estimates due to finite sampling:

mcse_values = DataFrame(mcse(chain_multi))
posterior_means = DataFrame(mean(chain_multi))

println("Posterior means ± MCSE:")
for row in eachrow(mcse_values)
    param = row.parameters
    mcse_val = row.mcse
    # Get mean for this parameter
    mean_val = posterior_means[posterior_means.parameters .== param, :mean][1]
    rel_error = mcse_val / abs(mean_val) * 100
    println(
        "  $param: $(round(mean_val, digits = 3)) ± ",
        "$(round(mcse_val, digits = 4)) ",
        "($(round(rel_error, digits = 2))% relative error)"
    )
end
Posterior means ± MCSE:
  R_0: 2.446 ± 0.0011 (0.05% relative error)
  D_inf: 1.945 ± 0.001 (0.05% relative error)
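MCSE connects the two previous diagnostics: the MCSE of the posterior mean is approximately the posterior standard deviation divided by √ESS. A sketch using the iid case, where ESS = N:

```julia
# MCSE of the posterior mean ≈ posterior sd / sqrt(ESS). For independent
# draws ESS = N, which recovers the familiar standard error of the mean.
using Random, Statistics

mcse_mean(x, ess) = std(x) / sqrt(ess)

Random.seed!(3)
draws = randn(10_000)                    # stand-in for posterior draws
println("MCSE ≈ ", round(mcse_mean(draws, length(draws)), digits = 4))
```

For a correlated chain you would plug in the (smaller) ESS instead of N, which inflates the MCSE accordingly.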

Common problems and solutions

Problem 1: Poor mixing (chain gets stuck)

Common causes:

  • Algorithm too slow: Basic samplers get stuck in regions
  • Poor step size: Proposals too small or too large
  • Difficult geometry: Highly correlated parameters, narrow ridges
  • Multimodality: Chain trapped in one mode

Solutions:

  1. Use better algorithms: NUTS (which we’re using!) is usually excellent
  2. Run longer: Increase number of samples if mixing is just slow
  3. Reparametrise: Transform parameters to improve geometry
  4. Multiple chains: Run several chains to detect stuck chains

For modern gradient-based samplers like NUTS, poor mixing is rare if:

  • Your model is specified correctly
  • The posterior is reasonably well-behaved
  • You’re using appropriate priors

Problem 2: Divergent transitions (for HMC/NUTS)

NUTS simulates a ball rolling over the posterior surface. A divergent transition occurs when the simulation becomes numerically unstable — the ball “flies off” to implausible values instead of following the contour. This typically happens when the posterior has sharp ridges or funnels: the step size that works in flat regions is too large for the narrow parts, causing numerical errors to explode.

A small number of divergences (say <1% of samples) may be acceptable, but many divergences indicate biased sampling — the sampler is systematically avoiding regions it can’t navigate, so the samples underrepresent those areas.
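The mechanism can be seen in a toy example (a sketch, not the session's model): leapfrog integration of Hamiltonian dynamics for a standard Normal target, i.e. potential U(q) = q²/2, conserves energy for small step sizes, but past the stability limit the trajectory "flies off" and the energy error explodes. That blow-up is exactly what NUTS flags as a divergence.

```julia
# Toy illustration of why divergences happen: leapfrog integration for a
# standard Normal target (potential U(q) = q²/2). Below the stability limit
# the energy is nearly conserved; above it the energy error explodes.
function leapfrog_energy_error(ε; steps = 50, q = 1.0, p = 1.0)
    H0 = 0.5 * (q^2 + p^2)              # initial energy
    for _ in 1:steps
        p -= 0.5 * ε * q                # half-step momentum (∇U(q) = q)
        q += ε * p                      # full-step position
        p -= 0.5 * ε * q                # half-step momentum
    end
    return abs(0.5 * (q^2 + p^2) - H0)  # energy error after the trajectory
end

println("energy error, ε = 0.1: ", leapfrog_energy_error(0.1))
println("energy error, ε = 2.5: ", leapfrog_energy_error(2.5))
```

A single step size must work everywhere; in a funnel-shaped posterior the ε that is stable in the wide regions exceeds the stability limit in the narrow ones, which is why divergences cluster there.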

# Check for divergent transitions (Turing's NUTS records them as
# :numerical_error in the chain internals)
if :numerical_error in names(chain_good, :internals)
    n_divergent = sum(chain_good[:numerical_error])
    println("Divergent transitions: $(Int(n_divergent)) / $(length(chain_good))")
    if n_divergent > 0
        println("Consider increasing adapt_delta or reparametrising")
    else
        println("No divergent transitions")
    end
end

Problem 3: Multi-modal posteriors

Sometimes posteriors have multiple peaks, and chains might get trapped in one mode. Running multiple chains with different starting points helps detect this - if chains converge to different modes, R̂ will be high.

Solutions for multimodal posteriors:

  1. Run multiple chains with different starting points
  2. Use more advanced samplers (parallel tempering, etc.)
  3. Check R̂: it will be high if chains are in different modes

Here’s an artificial example with two modes to show how multiple chains help:

# Create an artificial multimodal example
@model function multimodal_example()
    x ~ Normal(0, 5)  # Wide prior

    # Likelihood with two modes
    mode1_loglik = logpdf(Normal(-2, 0.5), x)
    mode2_loglik = logpdf(Normal(2, 0.5), x)
    Turing.@addlogprob! log(0.5 * exp(mode1_loglik) + 0.5 * exp(mode2_loglik))
end
multimodal_example (generic function with 2 methods)
  • Turing.@addlogprob! manually adds a value to the log-probability of the model. This is useful when you can’t express the likelihood using ~ syntax — here we need a mixture of two Normals, which requires computing the log-probability by hand.
  • The ~ syntax (e.g. x ~ Normal(0, 5)) automatically adds the log-density to the model’s total log-probability. @addlogprob! does the same thing, but you compute the value yourself.
# Single chain might miss one mode
chain_single = sample(multimodal_example(), NUTS(), 1000)

# Multiple chains more likely to find both modes
chains_multi = [sample(multimodal_example(), NUTS(), 1000) for _ in 1:4]
p = plot(title = "Multimodal Posterior")
for (i, chain) in enumerate(chains_multi)
    density!(p, chain[:x], alpha = 0.7, label = "Chain $i")
end
plot!(p)

Different chains find different modes, which would show up as high R̂ values.

Posterior predictive checks

Convergence diagnostics tell you if the algorithm worked. Posterior predictive checks tell you if the model makes sense.

Why check predictions? Even if MCMC converged perfectly, your model might still be wrong. The model could be mis-specified, missing important features, or based on incorrect assumptions. Posterior predictive checks reveal whether your model actually generates data that looks like what you observed.

# Generate predictions from the posterior
# decondition() removes the observed data, so Turing samples new observations
predictions = predict(decondition(model), chain_multi)
Chains MCMC chain (2000×30×2 Array{Float64, 3}):

Iterations        = 1:1:2000
Number of chains  = 2
Samples per chain = 2000
parameters        = obs[1], obs[2], obs[3], obs[4], obs[5], obs[6], obs[7], obs[8], obs[9], obs[10], obs[11], obs[12], obs[13], obs[14], obs[15], obs[16], obs[17], obs[18], obs[19], obs[20], obs[21], obs[22], obs[23], obs[24], obs[25], obs[26], obs[27], obs[28], obs[29], obs[30]
internals         = 

Use `describe(chains)` for summary statistics and quantiles.
  • predict(model, chain) - generates new data from the model using parameter values from the posterior chain. This is the basis for posterior predictive checks.
  • decondition(model) - removes the conditioning (observed data) from the model, so that obs ~ arraydist(Poisson.(lambdas)) samples new observations instead of computing likelihoods against the data. This is how you generate posterior predictive samples from a conditioned model.
# Plot posterior predictive samples
p = plot(title = "Posterior Predictive Check", xlabel = "Time", ylabel = "Cases")

# Plot multiple posterior predictive samples
for i in 1:50
    # Extract predictions for each time point
    pred_obs = [predictions[Symbol("obs[$j]")][i, 1] for j in 1:length(epi1.obs)]
    plot!(p, epi1.time, pred_obs, alpha = 0.3, color = :lightblue, label = "")
end

# Add real data
scatter!(p, epi1.time, epi1.obs, label = "Real data", markersize = 4, color = :red)
plot!(p)
  1. decondition(model) removes the observed data from the model
    • Our model was created with sir_model(...) | (; obs = epi1.obs), which conditions on the observed data. decondition() reverses this, telling Turing to sample new observations rather than computing likelihoods against the data.
  2. predict(decondition(model), chain_multi) uses parameters from the posterior chain to simulate new data
    • For each MCMC sample: takes R₀ and D_inf from the chain → solves ODE → samples new obs from Poisson
  3. The result is stored with keys like Symbol("obs[1]"), Symbol("obs[2]"), etc.
    • Because our model uses obs ~ arraydist(Poisson.(lambdas)), Turing names each element of the predicted vector obs[1], obs[2], etc. The brackets mean these names can't be written with the : literal syntax, so we construct them with Symbol("obs[$j]").
    • predictions[Symbol("obs[$j]")][i, 1] then extracts the i-th posterior sample for the j-th observation; the final 1 selects the first chain.

Do the simulated epidemics (blue lines) look similar to the real data (red points)? They should capture the timing, peak size, and overall shape. If they don’t, your model needs improvement - no amount of MCMC will fix a fundamentally wrong model.
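Beyond eyeballing, you can quantify the check with interval coverage. A sketch, assuming the predictive draws have been collected into a draws × observations matrix (e.g. via Array(predictions)):

```julia
# Numeric complement to the visual check (illustrative sketch): the fraction
# of observations inside the central 95% posterior predictive interval.
# `pred` is a draws × observations matrix.
using Statistics

function ppc_coverage(pred::AbstractMatrix, obs::AbstractVector; level = 0.95)
    α = (1 - level) / 2
    inside = count(
        j -> quantile(pred[:, j], α) <= obs[j] <= quantile(pred[:, j], 1 - α),
        eachindex(obs)
    )
    return inside / length(obs)
end

# Toy usage: predictive draws centred on the "data" should cover it
pred = randn(1000, 30) .* 2 .+ 10
obs = fill(10.0, 30)
println("coverage: ", ppc_coverage(pred, obs))
```

Coverage far below the nominal 95% suggests the predictive intervals are too narrow, or the model is mis-specified.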

The complete workflow

  1. Run multiple chains (at least 4)
  2. Check trace plots visually
  3. Calculate R̂ and ESS
  4. Check for divergences (NUTS)
  5. Do posterior predictive checks
  6. Re-run if needed with more samples or different settings

Advanced diagnostics

This section covers refinements to the basic diagnostics above. If you’re short on time, you can skip ahead to the troubleshooting guide — the core diagnostics (R̂, ESS, trace plots) from the previous section are sufficient for most analyses.

Split-R̂

Modern R̂ splits each chain in half for more robust diagnostics:

# This is what rhat() actually computes - split chains for better diagnostics
rhat_split = DataFrame(rhat(chain_multi))
println("Split-R̂ values (recommended):")
for row in eachrow(rhat_split)
    println("  $(row.parameters): $(round(row.rhat, digits = 4))")
end
Split-R̂ values (recommended):
  R_0: 1.0038
  D_inf: 1.004

Bulk and tail ESS

Turing.jl provides different ESS measures:

# Different types of ESS
ess_bulk = DataFrame(ess(chain_multi, kind = :bulk))    # Central part of distribution
ess_tail = DataFrame(ess(chain_multi, kind = :tail))    # Tails of distribution

println("Bulk ESS (central posterior):")
for row in eachrow(ess_bulk)
    println("  $(row.parameters): $(round(row.ess, digits = 0))")
end

println("\nTail ESS (posterior tails):")
for row in eachrow(ess_tail)
    println("  $(row.parameters): $(round(row.ess, digits = 0))")
end
Bulk ESS (central posterior):
  R_0: 1040.0
  D_inf: 1087.0

Tail ESS (posterior tails):
  R_0: 1180.0
  D_inf: 1241.0

Energy diagnostics (for HMC/NUTS)

NUTS uses Hamiltonian dynamics. Energy plots help diagnose geometric problems:

# Energy diagnostic plot (NUTS stores :hamiltonian_energy in the chain internals)
if :hamiltonian_energy in names(chain_good, :internals)
    energy = vec(chain_good[:hamiltonian_energy])
    histogram(
        energy, title = "Energy Distribution", xlabel = "Energy",
        ylabel = "Frequency", alpha = 0.7
    )
end

Troubleshooting guide

  • Poor mixing. Symptoms: trace plots show trends, low ESS. Solutions: use NUTS, tune proposals, longer runs
  • Multimodality. Symptoms: high R̂, different chain means. Solutions: multiple chains, better initialisation
  • Divergences. Symptoms: NUTS warnings, biased sampling. Solutions: increase adapt_delta, reparametrise
  • Slow convergence. Symptoms: high autocorrelation, low ESS. Solutions: better algorithm, longer runs, reparametrise
  • Non-stationarity. Symptoms: trending trace plots. Solutions: longer warmup, check model specification

Practical example: Debugging a problematic model

Let’s create and diagnose a problematic model:

# Poorly specified model with correlations and poor scaling
@model function problematic_model()
    # Poor scaling: one parameter much larger than others
    R_0 ~ Normal(0, 100)     # Wrong scale
    D_inf ~ Normal(0, 0.01)  # Very tight

    # Add some correlation structure to make it harder
    x1 ~ Normal(R_0, 1)
    x2 ~ Normal(R_0, 1)

    # Some fake observations
    for i in 1:10
        0.0 ~ Normal(x1 + x2 - 2*R_0, 0.1)
    end
end
problematic_model (generic function with 2 methods)
# This model will have problems
chain_problematic = sample(problematic_model(), NUTS(), 1000)
# Diagnose it
println("R̂ values:")
display(DataFrame(rhat(chain_problematic)))

println("\nESS values:")
display(DataFrame(ess(chain_problematic)))
R̂ values:

ESS values:
4×2 DataFrame
 Row │ parameters  rhat
     │ Symbol      Float64
─────┼──────────────────────
   1 │ R_0         1.21678
   2 │ D_inf       1.01
   3 │ x1          1.21652
   4 │ x2          1.2173
4×3 DataFrame
 Row │ parameters  ess        ess_per_sec
     │ Symbol      Float64    Float64
─────┼─────────────────────────────────────
   1 │ R_0           4.49209     0.767484
   2 │ D_inf       149.025      25.4612
   3 │ x1            4.49439     0.767878
   4 │ x2            4.48912     0.766977
# Look at the trace plots
plot(chain_problematic)
Exercise: Diagnose and fix the problematic model
  1. Look at the trace plots and quantitative diagnostics above. What specific problems can you identify?
  2. Which parameters show the worst R̂ and ESS values?
  3. Try modifying the problematic_model above to improve its behaviour. Hint: Change the prior scales to be more similar across parameters, or remove the correlation structure.
  4. Run the modified model and compare the diagnostics.

The problems are:

  • Mismatched scales: R_0 ~ Normal(0, 100) is very wide while D_inf ~ Normal(0, 0.01) is extremely tight. The sampler struggles to adapt its step size to work well for both simultaneously.
  • Redundant parameters: x1 and x2 both depend on R_0 and appear only as x1 + x2 - 2*R_0, making them poorly identified — many combinations of x1 and x2 give the same sum.

A fixed version:

@model function fixed_model()
    R_0 ~ Normal(0, 1)       # Moderate scale
    D_inf ~ Normal(0, 1)     # Same scale as R_0

    x1 ~ Normal(R_0, 1)
    x2 ~ Normal(R_0, 1)

    for i in 1:10
        0.0 ~ Normal(x1 + x2 - 2*R_0, 0.1)
    end
end

chain_fixed = sample(fixed_model(), NUTS(), 1000)
plot(chain_fixed)

After fixing the scales, R̂ should be much closer to 1.0 and ESS should increase substantially.

How to fix problematic models:

  1. Reparametrise: Use non-centred parameterisations for difficult geometries
  2. Better priors: Match scales of parameters
  3. Longer adaptation: Increase warmup period
  4. Check model: Maybe the model specification is wrong

How the pieces fit together

After three sessions, you now have a complete workflow for Bayesian inference with deterministic models. In the Introduction, you learned the core concepts: priors encode beliefs before seeing data, likelihoods measure how well parameters explain observations, and Bayes’ theorem combines them into a posterior. In MCMC, you saw how to sample from posteriors that are too complex for grid search - from basic Metropolis-Hastings through adaptive methods (RAM) to gradient-based samplers (NUTS). This session covered diagnostics: trace plots for visual assessment, R̂ for convergence, and ESS for effective sample size.

The workflow: Define model → Run MCMC → Check convergence → Check model fit → Iterate

But MCMC convergence doesn’t mean your model is correct - that’s next in Model Checking.

Summary

MCMC always produces numbers, but those numbers might not represent the posterior. Visual diagnostics (trace plots, autocorrelation) give quick intuition; quantitative diagnostics (R̂, ESS) make assessment rigorous.

R̂ compares within-chain and between-chain variance - values above 1.1 indicate non-convergence. ESS estimates how many independent samples you actually have after accounting for autocorrelation. Even 10,000 samples might give ESS of only 100 if the chain mixes slowly.

Common problems have standard solutions: poor mixing (reparametrise or use NUTS), divergences (increase adapt_delta), multimodality (multiple chains with different starts). Run at least four chains and check diagnostics before trusting any results.

Tip: Learning points
  • Always check convergence - never trust MCMC without diagnostics
  • Multiple chains are essential for robust inference
  • R̂ < 1.1 and ESS > 100 are minimum requirements
  • NUTS is usually better than basic Metropolis-Hastings
  • When in doubt, run longer and use more chains

MCMC only works if you use it correctly. Diagnostics aren’t optional.


Next session: Converged chains don’t guarantee a good model. In Model Checking and Validation, we’ll learn how to check whether your model actually describes the data well.