MCMC Diagnostics
Estimated time: 1.5-2 hours
Introduction
In the previous session, you learned how MCMC algorithms explore posterior distributions by taking random walks through parameter space. You implemented Metropolis-Hastings yourself and saw how Turing.jl’s NUTS algorithm automates everything with gradient-based sampling. But there’s a critical question we haven’t yet fully addressed: How do you know if the inference procedure has worked?
MCMC algorithms don’t come with guarantees. A chain will always give you numbers, but those numbers might not be representative posterior samples. The chain could be stuck exploring only part of the parameter space, still working towards the target distribution, or mixing so slowly that your “samples” are nearly duplicates. MCMC only guarantees unbiased sampling in the limit of infinitely many samples. Using unreliable MCMC output leads to incorrect inference - wrong parameter estimates, overconfident intervals, and bad scientific conclusions.
This session teaches you how to diagnose MCMC chains systematically. You will learn to spot problems visually with trace plots, quantify convergence with R̂ and effective sample size, and understand when your posterior samples are trustworthy. These diagnostic skills are essential for any serious Bayesian analysis.
Objectives
In this session, you will learn how to assess whether your MCMC chains have converged and whether your posterior samples are reliable. This is an important component of trustworthy Bayesian analysis.
You will learn to:
- Visually diagnose chains using trace plots and other visualisations
- Calculate quantitative diagnostics like R̂ and effective sample size
- Identify common problems and how to fix them
- Use Turing.jl’s built-in diagnostics for automated checking
- Apply best practices for robust MCMC in practice
Source file
The source file of this session is located at sessions/mcmc_diagnostics.qmd.
Load libraries
using Random ## for random numbers
using Distributions ## for probability distributions
using DifferentialEquations ## for differential equations
using DataFrames ## for data frames
using Plots ## for plots
using StatsPlots ## for statistical plots
using Turing ## for probabilistic programming and MCMC
using MCMCChains ## for MCMC diagnostics
using StatsBase ## for autocorrelation
Initialisation
We set a random seed for reproducibility. Setting this ensures that you should get exactly the same results on your computer as we do.
Random.seed!(12345)
TaskLocalRNG()
Load data and model from previous session
We’ll use the same SIR model and epidemic data from previous sessions.
# Load the ODE model code from src/sir_model.jl (defines sir_ode! and simulate_sir)
include("src/sir_model.jl")
# Initial state and time range
init_state = [999, 1, 0]
t0, tend = (0, 30)
# Epidemic data
epi1 = DataFrame(
time = 1:1:30,
obs = [
1, 3, 4, 11, 19, 38, 67, 112, 183, 210, 237, 182, 178, 154, 95, 68, 51,
30, 32, 18, 19, 9, 2, 2, 2, 0, 0, 1, 0, 0
]
)
# Define our SIR model in Turing.jl (same as MCMC session)
@model function sir_model(init_state, times, n_obs)
# Priors
R_0 ~ Uniform(1.0, 20.0)
D_inf ~ Uniform(1.0, 14.0)
# Simulate ODE
prob = ODEProblem(sir_ode!, init_state, (times[1], times[end]), [R_0, D_inf])
sol = solve(prob, saveat = times)
# Likelihood
I = sol[2, :]
# Use max to handle tiny numerical errors from automatic differentiation
lambdas = [max(I[i], 0.0) for i in 1:n_obs]
obs ~ arraydist(Poisson.(lambdas))
end
# Create model instance
model = sir_model(init_state, epi1.time, length(epi1.obs)) | (; obs = epi1.obs)
Visual diagnostics
1. Trace plots: The most important diagnostic
Trace plots show parameter values over iterations. A well-mixed chain looks like a “hairy caterpillar” - densely filled with rapid up-and-down fluctuations around a stable mean:
# Good chain: well-mixing NUTS
chain_good = sample(model, NUTS(), 2000)
# chain_good[:R_0] extracts the samples for R_0 from the chain object
# You can use this with any parameter name defined in your @model function
plot(
chain_good[:R_0], title = "Good Chain: R_0", xlabel = "Iteration",
ylabel = "R_0", linewidth = 1
)
# For comparison, let's look at the other parameter too
p1 = plot(
chain_good[:R_0], title = "R_0 Trace", xlabel = "Iteration",
ylabel = "R_0", linewidth = 1
)
p2 = plot(
chain_good[:D_inf], title = "D_inf Trace", xlabel = "Iteration",
ylabel = "D_inf", linewidth = 1
)
plot(p1, p2, layout = (2, 1), size = (800, 600))
The features that you’ll want to look out for in trace plots are:
- Good mixing: Chain moves freely around parameter space
- Stationarity: No obvious trends or drifts
- Fat hairy caterpillar: Rapid exploration of the posterior
The kind of features to avoid are:
- Poor mixing: Chain gets stuck in regions
- Trends: Chain still climbing/descending
- Thin worm: Chain not exploring enough
2. Running averages
Running averages should stabilise if the chain has converged:
function running_mean(x)
return [mean(x[1:i]) for i in eachindex(x)]
end
R_0_samples = vec(chain_good[:R_0])
p1 = plot(
running_mean(R_0_samples), title = "Running Mean: R_0",
xlabel = "Iteration", ylabel = "Running Mean R_0", linewidth = 2
)
hline!(
p1, [mean(R_0_samples)], color = :red, linestyle = :dash,
label = "Final mean"
)
D_inf_samples = vec(chain_good[:D_inf])
p2 = plot(
running_mean(D_inf_samples), title = "Running Mean: D_inf",
xlabel = "Iteration", ylabel = "Running Mean D_inf", linewidth = 2
)
hline!(
p2, [mean(D_inf_samples)], color = :red, linestyle = :dash,
label = "Final mean"
)
plot(p1, p2, layout = (2, 1))
3. Autocorrelation
MCMC samples are correlated. Good chains have autocorrelation that decays quickly:
function autocorr_plot(chain, param_name; max_lag = 100)
samples = Array(chain[param_name]) # Convert to plain array
lags = 0:max_lag
# autocor from StatsBase expects a vector and returns scalar for each lag
autocorrs = [autocor(samples, [lag])[1] for lag in lags]
plot(
lags, autocorrs, title = "Autocorrelation: $param_name",
xlabel = "Lag", ylabel = "Autocorrelation", linewidth = 2
)
hline!([0], color = :gray, linestyle = :dash, label = "")
hline!(
[-0.05, 0.05], color = :red, linestyle = :dash, label = "±5% threshold"
)
end
p1 = autocorr_plot(chain_good, :R_0)
p2 = autocorr_plot(chain_good, :D_inf)
plot(p1, p2, layout = (2, 1), size = (800, 600))
We can see that autocorrelation drops quickly to near zero and both parameters show good mixing. After ~20-30 MCMC steps, samples are essentially independent.
Look back at the trace plots, running mean plots, and autocorrelation plots above.
- What would a bad trace plot look like? Sketch or describe what you would see if the chain was stuck or not mixing well.
- If the running mean was still trending upward after 2000 iterations, what would that suggest?
- If autocorrelation remained high (say > 0.5) even at lag 100, what would that indicate about your effective sample size?
4. Burn-in and warmup
MCMC chains often start from arbitrary initial values and need time to “find” the high-probability regions of the posterior. Samples from this initial exploration phase (called “burn-in”) are discarded because they don’t represent the target distribution.
Modern samplers like NUTS have automatic warmup periods where they adapt the step size to achieve good acceptance rates, learn the geometry of the posterior and explore to find typical parameter values.
In our earlier example, when we ran sample(model, NUTS(), 2000), Turing actually ran ~3000 iterations total: 1000 warmup iterations (automatically discarded) plus 2000 sampling iterations. The returned chain contains only the 2000 post-warmup samples.
By default, NUTS uses 1000 warmup iterations regardless of how many sampling iterations you request. You can change this with the n_adapts keyword: NUTS(0.65; n_adapts=2000) uses 2000 warmup iterations instead. For most models, the default 1000 is sufficient.
With NUTS, you typically don’t need to manually discard burn-in samples. The warmup is automatic and adaptive. However, you should still check trace plots to ensure the chain has converged - if you see trends even after warmup, you may need longer adaptation.
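To make the burn-in idea concrete, here is a small self-contained sketch (not part of the session's SIR example) using a synthetic AR(1) chain started far from its stationary mean. The starting value, decay rate, and burn-in length are all arbitrary choices for illustration:

```julia
using Random, Statistics

Random.seed!(42)

# Synthetic AR(1) "chain" started far from its stationary mean of 0:
# x[t] = 0.95 * x[t-1] + noise, with an arbitrary start at 200
n = 2000
x = zeros(n)
x[1] = 200.0
for t in 2:n
    x[t] = 0.95 * x[t - 1] + randn()
end

# Early samples reflect the starting point, not the target distribution
mean_all = mean(x)                    # inflated by the initial transient
mean_post_burnin = mean(x[501:end])   # close to the true mean of 0

println("Mean using all samples:        ", round(mean_all, digits = 2))
println("Mean after 500-sample burn-in: ", round(mean_post_burnin, digits = 2))
```

Discarding the first 500 samples removes the transient, which is exactly what the automatic NUTS warmup does for you (while also adapting the step size).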
Quantitative diagnostics
1. R̂ statistic (Gelman-Rubin diagnostic)
R̂ compares within-chain and between-chain variance. It should be < 1.1 (ideally < 1.01):
To run multiple chains in parallel, Julia must be started with multiple threads. Check your current thread count with Threads.nthreads(). If it shows 1, restart Julia with:
julia -t auto # Use all available CPU cores
julia -t 4    # Use exactly 4 threads
In VS Code, you can set this permanently in settings: search for “Julia: Num Threads” and set it to “auto” or a specific number. Without multiple threads, chains will run sequentially (slower but still correct).
# Run multiple chains for R̂ calculation
# Detect available threads to avoid overwhelming CI systems
n_chains = max(2, min(4, Threads.nthreads()))
println("Running $n_chains chains in parallel ($(Threads.nthreads()) threads available)")
chain_multi = sample(model, NUTS(), MCMCThreads(), 2000, n_chains)
- MCMCThreads() - tells Turing to run multiple chains in parallel using Julia threads. Much faster than running sequentially on multi-core machines.
- sample(model, sampler, MCMCThreads(), n_samples, n_chains) - the extended signature for parallel sampling. Here we request n_chains chains, each with n_samples samples.
- Threads.nthreads() - returns how many threads Julia has available. Start Julia with julia -t auto to use all CPU cores.
# Check what each individual chain found
println("Individual chain summaries:")
for i in 1:n_chains
chain_i = chain_multi[:, :, i]
r0_mean = mean(chain_i[:R_0])
dinf_mean = mean(chain_i[:D_inf])
println(
" Chain $i: R_0 = $(round(r0_mean, digits = 2)), ",
"D_inf = $(round(dinf_mean, digits = 2))"
)
end
# Calculate R̂
println("\nR̂ values:")
rhat_values = DataFrame(rhat(chain_multi))
for row in eachrow(rhat_values)
param = row.parameters
value = row.rhat
status = value < 1.1 ? "good" : "bad"
println(" $param: $(round(value, digits = 3)) $status")
end
Individual chain summaries:
Chain 1: R_0 = 2.45, D_inf = 1.94
Chain 2: R_0 = 2.45, D_inf = 1.95
R̂ values:
R_0: 1.004 good
D_inf: 1.004 good
- chain_multi[:, :, i] - extracts the i-th chain from a multi-chain object. The three dimensions are: samples × parameters × chains.
- rhat(chain) - calculates the R̂ (Gelman-Rubin) diagnostic. Compares variance within chains to variance between chains. Values < 1.1 indicate convergence; < 1.01 is even better.
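To demystify the formula, here is a minimal by-hand implementation of the basic (non-split) R̂, assuming each chain is a plain vector of draws. This is only a sketch — the rhat() used above additionally splits each chain in half and rank-normalises (Vehtari et al. 2021):

```julia
using Random, Statistics

# Basic Gelman-Rubin R̂ for m chains of n draws each
function rhat_basic(chains::Vector{Vector{Float64}})
    m = length(chains)          # number of chains
    n = length(first(chains))   # draws per chain
    chain_means = mean.(chains)
    B = n * var(chain_means)    # between-chain variance (scaled by n)
    W = mean(var.(chains))      # average within-chain variance
    # Pooled estimate of the posterior variance
    var_plus = (n - 1) / n * W + B / n
    return sqrt(var_plus / W)
end

Random.seed!(1)
# Chains sampling the same distribution → R̂ ≈ 1
good = [randn(1000) for _ in 1:4]
# Chains stuck in different places → R̂ ≫ 1
bad = [randn(1000) .+ 5.0 * i for i in 1:4]
println(rhat_basic(good))
println(rhat_basic(bad))
```

If all chains explore the same distribution, the between-chain variance matches the within-chain variance and R̂ is close to 1; stuck chains inflate the between-chain term.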
# Visualise multiple chains
plot(chain_multi)
2. Effective Sample Size (ESS)
ESS tells you how many independent samples your correlated MCMC samples are equivalent to:
ess_values = DataFrame(ess(chain_multi))
n_samples = length(chain_multi)
2000
- ess(chain) - calculates Effective Sample Size. MCMC samples are correlated, so N samples don’t give you N independent draws. ESS estimates how many independent samples your chain is equivalent to.
println("Effective Sample Size:")
for row in eachrow(ess_values)
param = row.parameters
value = row.ess
efficiency = value / n_samples * 100
status = value > 100 ? "good" : "bad" # Rule of thumb: ESS > 100
println(
" $param: $(round(value, digits = 0)) / $n_samples ",
"($(round(efficiency, digits = 1))%) $status"
)
end
Effective Sample Size:
R_0: 1040.0 / 2000 (52.0%) good
D_inf: 1087.0 / 2000 (54.4%) good
ESS interpretation:
- ESS ≈ N: Very efficient, little autocorrelation
- ESS << N: High autocorrelation, need more samples
- ESS < 100: Generally insufficient for reliable inference
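The link between autocorrelation and ESS can be sketched directly with the classic estimator ESS ≈ N / (1 + 2 Σₖ ρₖ), truncating the sum at the first non-positive autocorrelation. The ess() used above is more sophisticated (rank-normalised, with a smarter truncation rule), so treat this as illustration only:

```julia
using Random, Statistics

# Rough ESS estimate: N / (1 + 2 * sum of positive autocorrelations)
function ess_rough(x::Vector{Float64}; max_lag = 100)
    n = length(x)
    xc = x .- mean(x)
    v = sum(abs2, xc) / n          # variance (biased normalisation)
    s = 0.0
    for k in 1:max_lag
        rho = sum(xc[1:n-k] .* xc[k+1:n]) / (n * v)  # lag-k autocorrelation
        rho <= 0 && break          # truncate at first non-positive value
        s += rho
    end
    return n / (1 + 2s)
end

Random.seed!(2)
iid = randn(5000)          # independent draws: ESS ≈ N
ar = zeros(5000)           # strongly autocorrelated AR(1) chain: ESS ≪ N
for t in 2:5000
    ar[t] = 0.9 * ar[t - 1] + randn()
end
println(ess_rough(iid))
println(ess_rough(ar))
```

The autocorrelated chain yields far fewer effective samples than its nominal length, which is exactly the "ESS << N" case described above.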
3. Monte Carlo Standard Error (MCSE)
MCSE quantifies uncertainty in your posterior estimates due to finite sampling:
mcse_values = DataFrame(mcse(chain_multi))
posterior_means = DataFrame(mean(chain_multi))
println("Posterior means ± MCSE:")
for row in eachrow(mcse_values)
param = row.parameters
mcse_val = row.mcse
# Get mean for this parameter
mean_val = posterior_means[posterior_means.parameters .== param, :mean][1]
rel_error = mcse_val / abs(mean_val) * 100
println(
" $param: $(round(mean_val, digits = 3)) ± ",
"$(round(mcse_val, digits = 4)) ",
"($(round(rel_error, digits = 2))% relative error)"
)
end
Posterior means ± MCSE:
R_0: 2.446 ± 0.0011 (0.05% relative error)
D_inf: 1.945 ± 0.001 (0.05% relative error)
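These quantities are linked by a simple relationship: MCSE ≈ posterior standard deviation / √ESS. A quick sanity check using the values reported above — note that the posterior sd of 0.035 for R_0 is an assumed value chosen for illustration, not taken from the chain:

```julia
# MCSE ≈ posterior sd / sqrt(ESS)
posterior_sd = 0.035   # assumed posterior sd for R_0 (illustrative)
ess_est = 1040.0       # ESS reported for R_0 above
mcse_est = posterior_sd / sqrt(ess_est)
println(mcse_est)      # on the order of the 0.0011 reported above
```

This is why low ESS matters: halving the effective sample size inflates the Monte Carlo error by a factor of √2, regardless of how many raw samples you drew.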
Common problems and solutions
Problem 1: Poor mixing (chain gets stuck)
Common causes:
- Algorithm too slow: Basic samplers get stuck in regions
- Poor step size: Proposals too small or too large
- Difficult geometry: Highly correlated parameters, narrow ridges
- Multimodality: Chain trapped in one mode
Solutions:
- Use better algorithms: NUTS (which we’re using!) is usually excellent
- Run longer: Increase number of samples if mixing is just slow
- Reparametrise: Transform parameters to improve geometry
- Multiple chains: Run several chains to detect stuck chains
For modern gradient-based samplers like NUTS, poor mixing is rare if:
- Your model is specified correctly
- The posterior is reasonably well-behaved
- You’re using appropriate priors
Problem 2: Divergent transitions (for HMC/NUTS)
NUTS simulates a ball rolling over the posterior surface. A divergent transition occurs when the simulation becomes numerically unstable — the ball “flies off” to implausible values instead of following the contour. This typically happens when the posterior has sharp ridges or funnels: the step size that works in flat regions is too large for the narrow parts, causing numerical errors to explode.
A small number of divergences (say <1% of samples) may be acceptable, but many divergences indicate biased sampling — the sampler is systematically avoiding regions it can’t navigate, so the samples underrepresent those areas.
# Check for divergent transitions
# Turing records divergences in the chain internals as :numerical_error
# (a value of 1.0 marks a divergent transition)
n_divergent = sum(chain_good[:numerical_error])
println("Divergent transitions: $n_divergent / $(length(chain_good))")
if n_divergent > 0
println("Consider increasing adapt_delta or reparametrising")
else
println("No divergent transitions")
end
Problem 3: Multi-modal posteriors
Sometimes posteriors have multiple peaks, and chains might get trapped in one mode. Running multiple chains with different starting points helps detect this - if chains converge to different modes, R̂ will be high.
Solutions for multimodal posteriors: 1. Run multiple chains with different starting points 2. Use more advanced samplers (parallel tempering, etc.) 3. Check R̂: Will be high if chains are in different modes
Here’s an artificial example with two modes to show how multiple chains help:
# Create an artificial multimodal example
@model function multimodal_example()
x ~ Normal(0, 5) # Wide prior
# Likelihood with two modes
mode1_loglik = logpdf(Normal(-2, 0.5), x)
mode2_loglik = logpdf(Normal(2, 0.5), x)
Turing.@addlogprob! log(0.5 * exp(mode1_loglik) + 0.5 * exp(mode2_loglik))
end
multimodal_example (generic function with 2 methods)
- Turing.@addlogprob! manually adds a value to the log-probability of the model. This is useful when you can’t express the likelihood using ~ syntax — here we need a mixture of two Normals, which requires computing the log-probability by hand.
- The ~ syntax (e.g. x ~ Normal(0, 5)) automatically adds the log-density to the model’s total log-probability. @addlogprob! does the same thing, but you compute the value yourself.
# Single chain might miss one mode
chain_single = sample(multimodal_example(), NUTS(), 1000)
# Multiple chains more likely to find both modes
chains_multi = [sample(multimodal_example(), NUTS(), 1000) for _ in 1:4]
p = plot(title = "Multimodal Posterior")
for (i, chain) in enumerate(chains_multi)
density!(p, chain[:x], alpha = 0.7, label = "Chain $i")
end
plot!(p)
Different chains find different modes, which would show up as high R̂ values.
Posterior predictive checks
Convergence diagnostics tell you if the algorithm worked. Posterior predictive checks tell you if the model makes sense.
Why check predictions? Even if MCMC converged perfectly, your model might still be wrong. The model could be mis-specified, missing important features, or based on incorrect assumptions. Posterior predictive checks reveal whether your model actually generates data that looks like what you observed.
# Generate predictions from the posterior
# decondition() removes the observed data, so Turing samples new observations
predictions = predict(decondition(model), chain_multi)
Chains MCMC chain (2000×30×2 Array{Float64, 3}):
Iterations = 1:1:2000
Number of chains = 2
Samples per chain = 2000
parameters = obs[1], obs[2], obs[3], obs[4], obs[5], obs[6], obs[7], obs[8], obs[9], obs[10], obs[11], obs[12], obs[13], obs[14], obs[15], obs[16], obs[17], obs[18], obs[19], obs[20], obs[21], obs[22], obs[23], obs[24], obs[25], obs[26], obs[27], obs[28], obs[29], obs[30]
internals =
Use `describe(chains)` for summary statistics and quantiles.
- predict(model, chain) - generates new data from the model using parameter values from the posterior chain. This is the basis for posterior predictive checks.
- decondition(model) - removes the conditioning (observed data) from the model, so that obs ~ arraydist(Poisson.(lambdas)) samples new observations instead of computing likelihoods against the data. This is how you generate posterior predictive samples from a conditioned model.
# Plot posterior predictive samples
p = plot(title = "Posterior Predictive Check", xlabel = "Time", ylabel = "Cases")
# Plot multiple posterior predictive samples
for i in 1:50
# Extract predictions for each time point
pred_obs = [predictions[Symbol("obs[$j]")][i, 1] for j in 1:length(epi1.obs)]
plot!(p, epi1.time, pred_obs, alpha = 0.3, color = :lightblue, label = "")
end
# Add real data
scatter!(p, epi1.time, epi1.obs, label = "Real data", markersize = 4, color = :red)
plot!(p)
- decondition(model) removes the observed data from the model
  - Our model was created with sir_model(...) | (; obs = epi1.obs), which conditions on the observed data. decondition() reverses this, telling Turing to sample new observations rather than computing likelihoods against the data.
- predict(decondition(model), chain_multi) uses parameters from the posterior chain to simulate new data
  - For each MCMC sample: takes R₀ and D_inf from the chain → solves ODE → samples new obs from Poisson
- The result is stored with keys like Symbol("obs[1]"), Symbol("obs[2]"), etc.
  - Because our model uses obs ~ arraydist(Poisson.(lambdas)), Turing names each element of the predicted vector obs[1], obs[2], etc. To access these in Julia, we convert the string to a Symbol: Symbol("obs[$j]") becomes :obs[1], :obs[2], etc.
  - predictions[Symbol("obs[$j]")][i, 1] then extracts the i-th posterior sample for the j-th observation. The 1 is the chain index (here we plot samples from the first chain).
Do the simulated epidemics (blue lines) look similar to the real data (red points)? They should capture the timing, peak size, and overall shape. If they don’t, your model needs improvement - no amount of MCMC will fix a fundamentally wrong model.
The complete workflow
- Run multiple chains (at least 4)
- Check trace plots visually
- Calculate R̂ and ESS
- Check for divergences (NUTS)
- Do posterior predictive checks
- Re-run if needed with more samples or different settings
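Steps 3 and 6 can be partly automated. Here is a hypothetical helper (not part of the session code) that applies the R̂ < 1.1 and ESS > 100 rules of thumb; in practice you would fill the Dicts from the DataFrames produced by rhat() and ess():

```julia
# Hypothetical convenience check applying the rules of thumb from this session.
function diagnostics_ok(rhats::Dict{Symbol, Float64}, esss::Dict{Symbol, Float64};
                        rhat_max = 1.1, ess_min = 100.0)
    problems = String[]
    for (p, r) in rhats
        r < rhat_max || push!(problems, "$p: R̂ = $r (>= $rhat_max)")
    end
    for (p, e) in esss
        e > ess_min || push!(problems, "$p: ESS = $e (<= $ess_min)")
    end
    return isempty(problems), problems
end

# Values taken from the multi-chain run above
ok, problems = diagnostics_ok(
    Dict(:R_0 => 1.004, :D_inf => 1.004),
    Dict(:R_0 => 1040.0, :D_inf => 1087.0),
)
println(ok ? "All diagnostics pass" : join(problems, "\n"))
```

A helper like this is no substitute for looking at trace plots and posterior predictive checks, but it makes the "re-run if needed" decision explicit and scriptable.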
Advanced diagnostics
This section covers refinements to the basic diagnostics above. If you’re short on time, you can skip ahead to the troubleshooting guide — the core diagnostics (R̂, ESS, trace plots) from the previous section are sufficient for most analyses.
Split-R̂
Modern R̂ splits each chain in half for more robust diagnostics:
# This is what rhat() actually computes - split chains for better diagnostics
rhat_split = DataFrame(rhat(chain_multi))
println("Split-R̂ values (recommended):")
for row in eachrow(rhat_split)
println(" $(row.parameters): $(round(row.rhat, digits = 4))")
end
Split-R̂ values (recommended):
R_0: 1.0038
D_inf: 1.004
Bulk and tail ESS
Turing.jl provides different ESS measures:
# Different types of ESS
ess_bulk = DataFrame(ess(chain_multi, kind = :bulk)) # Central part of distribution
ess_tail = DataFrame(ess(chain_multi, kind = :tail)) # Tails of distribution
println("Bulk ESS (central posterior):")
for row in eachrow(ess_bulk)
println(" $(row.parameters): $(round(row.ess, digits = 0))")
end
println("\nTail ESS (posterior tails):")
for row in eachrow(ess_tail)
println(" $(row.parameters): $(round(row.ess, digits = 0))")
end
Bulk ESS (central posterior):
R_0: 1040.0
D_inf: 1087.0
Tail ESS (posterior tails):
R_0: 1180.0
D_inf: 1241.0
Energy diagnostics (for HMC/NUTS)
NUTS uses Hamiltonian dynamics. Energy plots help diagnose geometric problems:
# Energy diagnostic plot
# NUTS records the Hamiltonian energy in the chain internals as :hamiltonian_energy
energy = vec(chain_good[:hamiltonian_energy])
histogram(
energy, title = "Energy Distribution", xlabel = "Energy",
ylabel = "Frequency", alpha = 0.7
)
Troubleshooting guide
| Problem | Symptoms | Solutions |
|---|---|---|
| Poor mixing | Trace plots show trends, low ESS | Use NUTS, tune proposals, longer runs |
| Multimodality | High R̂, different chain means | Multiple chains, better initialisation |
| Divergences | NUTS warnings, biased sampling | Increase adapt_delta, reparametrise |
| Slow convergence | High autocorrelation, low ESS | Better algorithm, longer runs, reparametrise |
| Non-stationarity | Trending trace plots | Longer warmup, check model specification |
Practical example: Debugging a problematic model
Let’s create and diagnose a problematic model:
# Poorly specified model with correlations and poor scaling
@model function problematic_model()
# Poor scaling: one parameter much larger than others
R_0 ~ Normal(0, 100) # Wrong scale
D_inf ~ Normal(0, 0.01) # Very tight
# Add some correlation structure to make it harder
x1 ~ Normal(R_0, 1)
x2 ~ Normal(R_0, 1)
# Some fake observations
for i in 1:10
0.0 ~ Normal(x1 + x2 - 2*R_0, 0.1)
end
end
problematic_model (generic function with 2 methods)
# This model will have problems
chain_problematic = sample(problematic_model(), NUTS(), 1000)
# Diagnose it
println("R̂ values:")
display(DataFrame(rhat(chain_problematic)))
println("\nESS values:")
display(DataFrame(ess(chain_problematic)))
R̂ values:
| Row | parameters | rhat |
|---|---|---|
| 1 | R_0 | 1.21678 |
| 2 | D_inf | 1.01 |
| 3 | x1 | 1.21652 |
| 4 | x2 | 1.2173 |

ESS values:

| Row | parameters | ess | ess_per_sec |
|---|---|---|---|
| 1 | R_0 | 4.49209 | 0.767484 |
| 2 | D_inf | 149.025 | 25.4612 |
| 3 | x1 | 4.49439 | 0.767878 |
| 4 | x2 | 4.48912 | 0.766977 |
# Look at the trace plots
plot(chain_problematic)
- Look at the trace plots and quantitative diagnostics above. What specific problems can you identify?
- Which parameters show the worst R̂ and ESS values?
- Try modifying the problematic_model above to improve its behaviour. Hint: Change the prior scales to be more similar across parameters, or remove the correlation structure.
- Run the modified model and compare the diagnostics.
The problems are:
- Mismatched scales: R_0 ~ Normal(0, 100) is very wide while D_inf ~ Normal(0, 0.01) is extremely tight. The sampler struggles to adapt its step size to work well for both simultaneously.
- Redundant parameters: x1 and x2 both depend on R_0 and appear only as x1 + x2 - 2*R_0, making them poorly identified — many combinations of x1 and x2 give the same sum.
A fixed version:
@model function fixed_model()
R_0 ~ Normal(0, 1) # Moderate scale
D_inf ~ Normal(0, 1) # Same scale as R_0
x1 ~ Normal(R_0, 1)
x2 ~ Normal(R_0, 1)
for i in 1:10
0.0 ~ Normal(x1 + x2 - 2*R_0, 0.1)
end
end
chain_fixed = sample(fixed_model(), NUTS(), 1000)
plot(chain_fixed)
After fixing the scales, R̂ should be much closer to 1.0 and ESS should increase substantially.
How to fix problematic models:
- Reparametrise: Use a non-centred parameterisation when the geometry is difficult (e.g. funnels)
- Better priors: Match scales of parameters
- Longer adaptation: Increase warmup period
- Check model: Maybe the model specification is wrong
How the pieces fit together
After three sessions, you now have a complete workflow for Bayesian inference with deterministic models. In the Introduction, you learned the core concepts: priors encode beliefs before seeing data, likelihoods measure how well parameters explain observations, and Bayes’ theorem combines them into a posterior. In MCMC, you saw how to sample from posteriors that are too complex for grid search - from basic Metropolis-Hastings through adaptive methods (RAM) to gradient-based samplers (NUTS). This session covered diagnostics: trace plots for visual assessment, R̂ for convergence, and ESS for effective sample size.
The workflow: Define model → Run MCMC → Check convergence → Check model fit → Iterate
But MCMC convergence doesn’t mean your model is correct - that’s next in Model Checking.
Summary
MCMC always produces numbers, but those numbers might not represent the posterior. Visual diagnostics (trace plots, autocorrelation) give quick intuition; quantitative diagnostics (R̂, ESS) make assessment rigorous.
R̂ compares within-chain and between-chain variance - values above 1.1 indicate non-convergence. ESS estimates how many independent samples you actually have after accounting for autocorrelation. Even 10,000 samples might give ESS of only 100 if the chain mixes slowly.
Common problems have standard solutions: poor mixing (reparameterise or use NUTS), divergences (increase adapt_delta), multimodality (multiple chains with different starts). Run at least four chains and check diagnostics before trusting any results.
- Always check convergence - never trust MCMC without diagnostics
- Multiple chains are essential for robust inference
- R̂ < 1.1 and ESS > 100 are minimum requirements
- NUTS is usually better than basic Metropolis-Hastings
- When in doubt, run longer and use more chains
MCMC only works if you use it correctly. Diagnostics aren’t optional.
References
- Vehtari et al. (2021). Rank-normalization, folding, and localization: An improved R̂ for assessing convergence of MCMC. Bayesian Analysis 16(2), 667-718. (Modern R̂ and ESS)
- Gelman et al. (2013). Bayesian Data Analysis, 3rd ed. Chapman & Hall/CRC. (Chapter 11 covers diagnostics)
- MCMCChains.jl documentation - Chain diagnostics in Julia
- ArviZ - Exploratory analysis of Bayesian models (Python, with Julia bindings via ArviZ.jl)
Next session: Converged chains don’t guarantee a good model. In Model Checking and Validation, we’ll learn how to check whether your model actually describes the data well.