Markov Chain Monte Carlo (MCMC) sampling
Estimated time: 2.5-3 hours
Lecture slides: PowerPoint / PDF
Introduction
In the previous session, you learned Bayesian inference through manual calculations - computing priors, likelihoods, and posteriors and exploring the posterior distribution using grid search. This approach introduced you to the core components of conducting inference, but it also exposed a fundamental problem: grid search doesn’t scale.
This session introduces Markov Chain Monte Carlo (MCMC), the algorithmic solution that makes Bayesian inference practical for real problems. Rather than exhaustively searching parameter space, MCMC intelligently explores it by taking random walks that spend more time in high-probability regions.
You’ll build understanding by coding a simple MCMC algorithm yourself, then see how modern tools like Turing.jl automate everything with state-of-the-art methods. By the end, you’ll understand both what MCMC does under the hood and how to use it effectively in practice.
Objectives
The aim of this session is to learn how to sample from posterior distributions using MCMC. By building the algorithm yourself first, you’ll understand what modern tools like Turing.jl do under the hood.
In this session, you will:
- Implement the Metropolis-Hastings MCMC algorithm and apply it to a simple problem
- Experience the challenges of using MCMC: proposal tuning and mixing
- See how Turing.jl automates Metropolis-Hastings for you
- Learn why NUTS is the modern upgrade that solves some of the problems with Metropolis-Hastings
- Apply NUTS to conduct Bayesian inference on the SIR model
Source file
The source file of this session is located at sessions/mcmc.qmd.
Load libraries
using Random ## for random numbers
using Distributions ## for probability distributions
using DifferentialEquations ## for differential equations
using DataFrames ## for data frames
using Plots ## for plots
using StatsPlots ## for statistical plots
using Turing ## for probabilistic programming and MCMC
using MCMCChains ## for MCMC diagnostics
using AdvancedMH ## for adaptive MCMC samplers
Initialisation
We set a random seed for reproducibility. Setting this ensures that you should get exactly the same results on your computer as we do.
Random.seed!(1234)
Load data and model from previous session
We’ll use the same SIR model and epidemic data from the introduction session.
The model code is loaded from src/sir_model.jl.
Key functions:
- sir_ode!(du, u, p, t) - ODE system for S, I, R compartments
- simulate_sir(R_0, D_inf, S0, I0, times) - Returns DataFrame with trajectory
# Initial state and time range
init_state = [999, 1, 0]
t0, tend = (0, 30)
# Epidemic data
epi1 = DataFrame(
time = 1:1:30,
obs = [
1, 3, 4, 11, 19, 38, 67, 112, 183, 210, 237, 182, 178, 154, 95, 68, 51,
30, 32, 18, 19, 9, 2, 2, 2, 0, 0, 1, 0, 0
]
)
From manual exploration to algorithmic sampling
In the previous session, we manually explored the posterior by trying different parameter combinations. This worked for our simple SIR model with 2 parameters, but we also saw that the number of combinations to explore can be prohibitive.
Instead of trying all combinations, Markov chain Monte Carlo works in the following way:
- Start somewhere in parameter space
- Propose a small move to nearby parameters
- Accept or reject the move based on how much better/worse the posterior is
- Repeat many times to build up a sample
This is the Metropolis-Hastings algorithm - let’s implement it ourselves!
Metropolis-Hastings for a simple case
Let’s start by sampling from a simple 1D Normal distribution to understand the algorithm:
# Target distribution: Normal(5, 2)
target_mean = 5.0
target_std = 2.0
target_dist = Normal(target_mean, target_std)
# Helper function to evaluate log-density of target
function eval_log_density(x)
return logpdf(target_dist, x)
end
# Simple Metropolis-Hastings sampler
function metropolis_hastings_1d(log_density_fn, n_samples, initial_value,
proposal_std)
# Initialise sample vector
samples = Float64[]
# current_x tracks the state of the chain
# initially, set to given initial value
current_x = initial_value
# Evaluate log probability at the initial value
current_log_prob = log_density_fn(current_x)
# Track number of acceptances
n_accepted = 0
for i in 1:n_samples
# Propose new state
proposal_x = current_x + randn() * proposal_std
# Evaluate log probability at target
proposal_log_prob = log_density_fn(proposal_x)
# Calculate acceptance probability (in log space)
# Note: This is the Metropolis ratio (symmetric proposal)
# Full Metropolis-Hastings would include proposal ratio q(x|x')/q(x'|x)
# But Normal proposal is symmetric, so this ratio = 1 and cancels out
log_alpha = proposal_log_prob - current_log_prob
alpha = exp(log_alpha)
# Accept or reject
if rand() < alpha
current_x = proposal_x
current_log_prob = proposal_log_prob
n_accepted += 1
end # if reject, current_x stays what it is
push!(samples, current_x)
end
acceptance_rate = n_accepted / n_samples
return samples, acceptance_rate
end
# Test our sampler
samples_mh, acceptance_rate = metropolis_hastings_1d(
eval_log_density, 5000, 0.0, 1.0
)
println("Acceptance rate: $(round(acceptance_rate * 100, digits = 1))%")
println("Sample mean: $(round(mean(samples_mh), digits = 2)) (true: $(target_mean))")
println("Sample std: $(round(std(samples_mh), digits = 2)) (true: $(target_std))")
Acceptance rate: 84.0%
Sample mean: 5.12 (true: 5.0)
Sample std: 2.08 (true: 2.0)
# Visualise the results
p1 = plot(
samples_mh[1:1000], title = "Trace Plot (first 1000 samples)",
xlabel = "Iteration", ylabel = "x", linewidth = 1
)
hline!(p1, [target_mean], color = :red, linewidth = 2, label = "True mean")
p2 = histogram(
samples_mh[1001:end], normalize = :pdf, alpha = 0.7,
title = "Sampled Distribution", xlabel = "x", ylabel = "Density",
label = "MCMC samples"
)
# Overlay true distribution
x_range = range(-5, 15, length = 100)
plot!(
p2, x_range, pdf.(target_dist, x_range), linewidth = 3, color = :red,
label = "True distribution"
)
plot(p1, p2, layout = (2, 1), size = (800, 600))
The general principle: MCMC as a universal random number generator
What we just built solves a fundamental problem in computational statistics: given a probability distribution where we can evaluate the density at any point, how do we generate random samples from it?
Many useful probability distributions don’t have simple sampling algorithms. Easy distributions like Normal (randn()) and Uniform (rand()) have built-in samplers. But posteriors, mixtures, and custom models have no simple formula.
MCMC solves this by only requiring that we can evaluate the log-probability at any point. We used a Normal distribution above (even though we could just use randn()!) to demonstrate this general principle.
Even “simple” distributions like the Normal don’t have a direct sampling formula. When you call randn() in Julia or rnorm() in R, the computer uses transformation methods:
Common approaches:
Box-Muller transform: Convert two uniform random numbers to two normal random numbers
# Simplified version
u1, u2 = rand(), rand()
z1 = sqrt(-2 * log(u1)) * cos(2π * u2)
z2 = sqrt(-2 * log(u1)) * sin(2π * u2)
# z1 and z2 are standard normal
Inverse CDF method: Transform uniform → normal via the inverse cumulative distribution function
Ziggurat algorithm: Fast, more complex method (what Julia actually uses)
Even “easy” distributions require clever tricks. For complex posteriors with correlations, ODEs, and constraints, no such formula exists, and we need a method like MCMC.
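The inverse-CDF idea can be sketched in a few lines (an illustration only; as noted above, Julia's randn actually uses the Ziggurat algorithm):

```julia
using Distributions, Random, Statistics

Random.seed!(1)

# Inverse CDF method: push uniform draws through the normal quantile function
u = rand(10_000)                # Uniform(0, 1) draws
z = quantile.(Normal(0, 1), u)  # transformed to standard normal draws

# The transformed draws have (approximately) mean 0 and std 1
println(round(mean(z), digits = 2))
println(round(std(z), digits = 2))
```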
The posterior distribution is just another probability distribution. We can evaluate it at any point: log_posterior(θ) = log_prior(θ) + log_likelihood(θ|data). Because of this, MCMC can sample from it. The samples give us everything we need: visualisation (plot the distribution shape), summaries (mean, median, credible intervals), and forward simulation (with parameter correlations preserved).
When we apply MCMC to the SIR model later, we will be doing exactly the same thing as we did here for the Normal example - just with a more complex probability distribution that happens to be a Bayesian posterior. MCMC doesn’t “know” about Bayesian inference; it just knows how to sample from distributions where we can evaluate the density.
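To make this concrete, the same metropolis_hastings_1d function we wrote above can sample from any distribution whose log-density we can evaluate, for example a two-component Normal mixture, which has no simple direct sampler (a sketch; the mixture and the proposal std of 4.0 are illustrative choices, with the wide proposal helping the chain jump between the two modes):

```julia
using Distributions

# A bimodal mixture: no simple direct sampler, but logpdf is easy to evaluate
mixture = MixtureModel([Normal(0, 1), Normal(8, 1)], [0.5, 0.5])
log_density_mix(x) = logpdf(mixture, x)

# Reuse the sampler from above; a wide proposal lets the chain cross between modes
samples_mix, acc_mix = metropolis_hastings_1d(log_density_mix, 20_000, 0.0, 4.0)

println("Acceptance rate: $(round(acc_mix * 100, digits = 1))%")
# The sample mean should be near the mixture mean of 4.0
println("Sample mean: $(round(mean(samples_mix), digits = 2))")
```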
Understanding proposal distributions
The choice of proposal standard deviation matters:
# Test different proposal standard deviations
proposal_stds = [0.1, 0.5, 1.0, 5.0, 10.0, 50.0]
results = []
for prop_std in proposal_stds
samples, acc_rate = metropolis_hastings_1d(eval_log_density, 5000, 0.0, prop_std)
push!(results, (std = prop_std, samples = samples, acc_rate = acc_rate))
end
# Print acceptance rates
println("Acceptance rates for different proposal std:")
for result in results
println(
" Proposal std = $(result.std): ",
"$(round(result.acc_rate*100, digits = 1))%"
)
end
# Plot trace plots for different proposal stds
plots = []
for (i, result) in enumerate(results)
p = plot(
result.samples,
title = "Prop. std = $(result.std), " *
"Acc = $(round(result.acc_rate*100, digits = 1))%",
xlabel = "Iteration", ylabel = "x", linewidth = 1, ylim = (-5, 15)
)
hline!(p, [target_mean], color = :red, linewidth = 2, label = "")
push!(plots, p)
end
plot(plots..., layout = (length(plots), 1), size = (800, 1000))
Acceptance rates for different proposal std:
Proposal std = 0.1: 98.5%
Proposal std = 0.5: 91.8%
Proposal std = 1.0: 85.1%
Proposal std = 5.0: 42.6%
Proposal std = 10.0: 24.2%
Proposal std = 50.0: 5.0%
- Too small proposals → High acceptance, but slow exploration
- Too large proposals → Low acceptance, wasted computation
- Optimal: ~23% acceptance rate for random walk Metropolis
For the normal distribution, sampling was straightforward. But efficient sampling from complex posterior distributions requires a good proposal distribution, which can be difficult to find. This is where more sophisticated algorithms that tune the proposal distribution automatically to a given problem really help.
You’ve now implemented Metropolis-Hastings from scratch and understand its core mechanics. The next sections show how modern tools automate this. If you’re working through this in one sitting, this is a natural pause point.
Automating MCMC with Turing.jl
Now that you understand how MCMC works internally, let’s see how Turing.jl automates everything you just coded.
Turing.jl on the Normal distribution
First, let’s use Turing.jl to sample from the exact same Normal distribution you just coded manually:
# Define the same Normal(5, 2) distribution in Turing
@model function normal_model()
x ~ Normal(5.0, 2.0)
end
- @model - a Turing.jl macro that defines a probabilistic model. Everything inside becomes part of the model specification.
- x ~ Normal(5.0, 2.0) - the ~ operator means “is distributed as”. This declares that x is a random variable drawn from a Normal distribution with mean 5 and standard deviation 2. Turing will infer the value of x during sampling.
If you’ve used probabilistic programming languages, you’ll recognise the ~ syntax. Turing’s @model is similar to model definitions in Stan, NIMBLE, and BUGS/JAGS:
# NIMBLE / BUGS / JAGS
model({
x ~ dnorm(5, sd = 2)
})
// Stan
model {
x ~ normal(5, 2);
}
The key advantage of Turing is that the model is written in native Julia, so you can use loops, conditionals, and call any Julia function (like our ODE solver) directly inside the model — no separate modelling language needed.
# Sample using Metropolis-Hastings (normal proposals with std 10.0)
chain_normal_mh = sample(normal_model(), MH([10.0;;]), 50000)
- sample(model, sampler, n) - runs the MCMC sampler on the model for n iterations, returning a chain of samples.
- MH(proposal) - Metropolis-Hastings sampler. MH expects a matrix for the proposal covariance. The [10.0;;] syntax creates a 1×1 matrix — the double semicolon ;; tells Julia “this is a matrix, not a vector”. Without it, [10.0] would be a 1-element vector, which MH wouldn’t accept. The value 10.0 sets the proposal variance (so the proposal standard deviation is √10 ≈ 3.2).
# Compare to your manual implementation
println("Manual M-H:")
println(" Mean: $(round(mean(samples_mh), digits = 2))")
println(" Std: $(round(std(samples_mh), digits = 2))")
println("\nTuring.jl M-H:")
describe(chain_normal_mh)
Manual M-H:
Mean: 5.12
Std: 2.08
Turing.jl M-H:
Chains MCMC chain (50000×4×1 Array{Float64, 3}):
Iterations = 1:1:50000
Number of chains = 1
Samples per chain = 50000
Wall duration = 4.33 seconds
Compute duration = 4.33 seconds
parameters = x
internals = logprior, loglikelihood, logjoint
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯
x 5.0080 1.9952 0.0212 8887.3055 10687.6710 1.0001 ⋯
1 column omitted
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
x 1.1032 3.6642 5.0047 6.3546 8.9090
- describe(chain) - prints summary statistics for all parameters in the chain, including mean, standard deviation, and quantiles.
- chain[:x] - extracts the samples for parameter x from the chain. You can use this to access any parameter defined in your @model function.
If you’ve used Stan, the workflow is similar: sample() runs inference and returns a chain object, describe() gives summary statistics, and chain[:x] extracts samples for a parameter.
| Action | rstan | cmdstanr | Turing.jl |
|---|---|---|---|
| Run sampler | sampling(model, ...) | model$sample(...) | sample(model, NUTS(), n) |
| Summary | summary(fit) | fit$summary() | describe(chain) |
| Extract parameter | as.array(fit, pars="x") | fit$draws("x") | chain[:x] |
# Visualise - same result!
p1 = histogram(
samples_mh, normalize = :pdf, alpha = 0.7,
title = "Manual M-H", xlabel = "x", ylabel = "Density", label = "Manual"
)
p2 = histogram(
chain_normal_mh[:x], normalize = :pdf, alpha = 0.7,
title = "Turing M-H", xlabel = "x", ylabel = "Density", label = "Turing"
)
plot(p1, p2, layout = (1, 2), size = (800, 300))
Turing.jl did exactly what we coded earlier, but in a single call to the sample() function.
Applying Turing.jl to the SIR model
Turing.jl uses the @model macro to define Bayesian models. This is the same structure we’ve been using manually - priors, simulation, and likelihood - but now in a form that Turing can automatically sample from:
# Define our SIR model in Turing.jl
@model function sir_model(init_state, times, n_obs)
# Weakly informative priors
R_0 ~ truncated(Normal(2.5, 1.5), lower=1.0)
D_inf ~ truncated(Normal(5.0, 3.0), lower=0.5)
# Simulate ODE (same as manual approach)
prob = ODEProblem(sir_ode!, init_state, (times[1], times[end]), [R_0, D_inf])
sol = solve(prob, saveat = times)
# Likelihood (Poisson observation model of prevalence)
I = sol[2, :]
# max(..., 0.0) guards against tiny negative values from the ODE solver
# (numerical noise can push I slightly below zero near the end of an epidemic)
lambdas = [max(I[i], 0.0) for i in 1:n_obs]
obs ~ arraydist(Poisson.(lambdas))
end
# Create model instance
model = sir_model(init_state, epi1.time, length(epi1.obs)) | (; obs = epi1.obs)
arraydist and | conditioning
Two new pieces of syntax appear here:
obs ~ arraydist(Poisson.(lambdas)) — declares a vector of observations where each element has its own distribution. arraydist creates an independent product distribution: obs[1] ~ Poisson(lambdas[1]), obs[2] ~ Poisson(lambdas[2]), etc. The dot in Poisson.(lambdas) is Julia’s broadcasting — it applies Poisson to each element of lambdas.
sir_model(...) | (; obs = epi1.obs) — the | operator conditions the model on observed data. It tells Turing “don’t sample obs — use these observed values instead and compute the likelihood”. The (; obs = epi1.obs) is a named tuple matching the variable name obs inside the model to the data — the semicolon is Julia’s syntax for a named tuple with no positional arguments.
This separation of model structure from data is a key Turing pattern: the @model function defines the generative process, and | supplies the observations. The same model can be used unconditioned (for simulation), conditioned (for inference), or deconditioned after fitting for posterior predictive checks (decondition() — you’ll see this in the Model Checking session).
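The factorisation arraydist performs can be checked directly: the joint log-density of independent components is the sum of the element-wise log-densities. (The sketch below uses Distributions.jl's product_distribution, which behaves the same way for independent components; arraydist is Turing's more general version.)

```julia
using Distributions

lambdas = [1.0, 2.0, 3.0]
counts = [0, 1, 4]

# Joint distribution of independent Poisson observations
joint = product_distribution(Poisson.(lambdas))

# Joint log-density = sum of individual log-densities
lhs = logpdf(joint, counts)
rhs = sum(logpdf(Poisson(lambdas[i]), counts[i]) for i in eachindex(counts))
println(lhs ≈ rhs)   # true
```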
Sample the SIR model with Metropolis-Hastings
Now let’s use the Metropolis-Hastings algorithm on our epidemiological model:
# Same Metropolis-Hastings algorithm, but automated
chain_mh = sample(model, MH([1.0 0.0; 0.0 1.0]), 5000)
- [1.0 0.0; 0.0 1.0] - a 2×2 matrix in Julia. Spaces separate columns, semicolons separate rows. This is the identity matrix, used here as the proposal covariance for MH. The diagonal values (1.0) set the proposal variance for each parameter; off-diagonal zeros mean proposals are independent.
println("Turing.jl Metropolis-Hastings results:")
describe(chain_mh)
Turing.jl Metropolis-Hastings results:
Chains MCMC chain (5000×5×1 Array{Float64, 3}):
Iterations = 1:1:5000
Number of chains = 1
Samples per chain = 5000
Wall duration = 3.37 seconds
Compute duration = 3.37 seconds
parameters = R_0, D_inf
internals = logprior, loglikelihood, logjoint
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat e ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯
R_0 2.4220 0.1247 0.0174 17.3560 11.4341 1.3178 ⋯
D_inf 1.9168 0.1012 0.0125 16.2724 11.4341 1.3185 ⋯
1 column omitted
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
R_0 2.3892 2.3892 2.3892 2.3892 2.5760
D_inf 1.8947 1.8947 1.8947 1.8947 2.0119
# Visualise
plot(
chain_mh[:R_0], title = "Turing MH: R_0 Trace", xlabel = "Iteration",
ylabel = "R_0", linewidth = 1
)Finding good proposals by hand is difficult. The goal is to experience the challenge, not to find a perfect solution.
Suggested approach:
- Spend 5-10 minutes experimenting with different proposal matrices
- If you’re stuck after 10 minutes, reveal the solution - you’ve learned the key lesson
- The frustration is the point: manual tuning doesn’t scale to real problems
The trace plot above shows poor mixing - the chain gets stuck and doesn’t explore efficiently. Your task is to improve it by adjusting the proposal covariance matrix.
The proposal covariance matrix controls how the sampler proposes new parameter values:
# Current: identity matrix (proposal std = 1.0 for both parameters)
MH([1.0 0.0; 0.0 1.0])
# Structure: [var_R0 cov; cov var_Dinf]
# Diagonal = variance (std²) for each parameter
# Off-diagonal = covariance between parameters
Try these experiments (run each for 5000 samples):
- Smaller proposals: MH([0.1 0.0; 0.0 0.1]) - what happens to acceptance rate?
- Larger proposals: MH([5.0 0.0; 0.0 5.0]) - does it explore faster or slower?
- Different scales: MH([0.5 0.0; 0.0 0.1]) - R₀ and D_inf may need different step sizes
- Add correlation: MH([1.0 0.5; 0.5 1.0]) - if parameters are correlated in the posterior, correlated proposals may help
Questions to answer:
- What acceptance rate gives the best mixing? (Hint: aim for ~23% for random walk MH)
- Can you find settings that give a well-mixed trace for both parameters simultaneously?
- How long did tuning take you? Imagine doing this for 10+ parameters…
Acceptance rate too high (>50%)? Your proposals are too small. The chain accepts most moves but takes tiny steps - you’ll see the trace plot moving very slowly.
Acceptance rate too low (<10%)? Your proposals are too large. Most proposed jumps land in low-probability regions and get rejected - you’ll see the trace plot “stuck” for many iterations.
Sweet spot (~20-30%): A healthy balance of exploration and acceptance.
Check your acceptance rate (the fraction of iterations where the chain moved) with: mean(diff(vec(chain[:R_0])) .!= 0)
Look at the posterior uncertainty for each parameter:
- R₀ might range from ~1.5 to ~3.5 (spread of ~2)
- D_inf might range from ~1.5 to ~3.0 (spread of ~1.5)
If you use the same proposal std for both, one will be well-tuned while the other is either too large or too small. Try making the proposal std roughly proportional to the posterior width.
Rule of thumb: proposal std ≈ 0.1-0.3 × posterior width
Run the chain and look at a scatter plot of R₀ vs D_inf samples. If they’re negatively correlated (higher R₀ → lower D_inf), then independent proposals are inefficient.
Imagine the posterior is a narrow diagonal ridge. Independent proposals move horizontally or vertically, often stepping off the ridge. Correlated proposals can move along the ridge.
Try a small negative off-diagonal value: MH([0.3 -0.05; -0.05 0.1])
# Space for your experiments - modify the proposal matrix and re-run
# chain_tuned = sample(model, MH([??? ???; ??? ???]), 5000)
# plot(chain_tuned)
After experimentation, a proposal like this works reasonably well:
# Different scales for each parameter, with negative correlation
chain_tuned = sample(model, MH([0.3 -0.1; -0.1 0.1]), 5000)
Why this works:
Smaller proposals (0.3 for R₀, 0.1 for D_inf): The posterior is concentrated in a small region. Large proposals jump outside this region and get rejected.
Different scales: R₀ has a larger range (~1-20) than D_inf (~1-14), and the posterior uncertainty differs. Matching proposal scale to posterior scale improves efficiency.
Negative correlation (-0.1): Higher R₀ values are consistent with lower D_inf values (both increase transmission rate). Correlated proposals explore this ridge more efficiently.
The key insight: Finding these values required trial and error. Now imagine doing this for a model with 10 parameters (55 covariance values to tune). This is why adaptive methods like RAM and gradient-based methods like NUTS exist.
The difficulty you just experienced is why manual tuning doesn’t scale. With 2 parameters, you have 3 values to tune (two variances, one covariance). With 10 parameters, you’d have 55 values. This motivates adaptive methods that learn the proposal automatically.
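The quoted counts follow from the size of a symmetric covariance matrix: n parameters give n(n+1)/2 distinct entries to tune.

```julia
# Distinct entries in a symmetric n×n proposal covariance matrix
n_proposal_entries(n) = n * (n + 1) ÷ 2

println(n_proposal_entries(2))    # 3 (two variances, one covariance)
println(n_proposal_entries(10))   # 55
```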
Adaptive MCMC - Learning better proposals
The problem with standard MH is that we need to manually tune the proposal covariance. If parameters are correlated, independent proposals (diagonal covariance) will mix poorly.
Robust Adaptive Metropolis (RAM) solves this by automatically learning the posterior covariance during sampling:
# Adaptive MCMC - learns proposal covariance automatically
chain_ram = sample(
model,
externalsampler(AdvancedMH.RobustAdaptiveMetropolis()),
5000;
check_model=false
)
- RobustAdaptiveMetropolis() - from AdvancedMH.jl, this sampler adapts its proposal covariance based on the samples seen so far. It learns the correlation structure of the posterior and proposes moves that respect it.
- externalsampler() - wraps an external sampler (from AdvancedMH.jl) for use with Turing models.
- check_model=false - skips model validation (needed for some external samplers).
# Compare MH vs Adaptive MH
p1 = plot(chain_mh[:R_0], title="Standard MH: R_0", xlabel="Iteration", ylabel="R_0", legend=false)
p2 = plot(chain_ram[:R_0], title="Adaptive MH (RAM): R_0", xlabel="Iteration", ylabel="R_0", legend=false)
plot(p1, p2, layout=(1, 2), size=(800, 300))
Notice how the adaptive sampler explores the parameter space more efficiently.
NUTS - A gradient-based upgrade
For models where we can compute gradients (like our ODE model), NUTS (No U-Turn Sampler) offers even better efficiency:
# Modern gradient-based MCMC
# Expect: ~30-60 seconds on first run (includes compilation), ~10-20 seconds on subsequent runs
chain_nuts = sample(model, NUTS(), 2000)
- NUTS() - the No U-Turn Sampler, a gradient-based MCMC algorithm that automatically tunes its step size and trajectory length. Much more efficient than Metropolis-Hastings for continuous parameters. The algorithm uses automatic differentiation to compute gradients of the log-posterior.
How do gradients help with sampling? Metropolis-Hastings proposes random steps and hopes for the best — like exploring a dark room by stumbling around. NUTS computes the gradient (slope) of the log-posterior, which tells it which direction leads to higher probability. It then simulates a ball rolling along this surface, following the natural contours of the posterior. This lets it take large, efficient steps along ridges and valleys instead of the small, random steps of MH. The “No U-Turn” part automatically decides when to stop rolling, so you don’t need to tune a trajectory length.
println("NUTS results:")
describe(chain_nuts)
NUTS results:
Chains MCMC chain (2000×16×1 Array{Float64, 3}):
Iterations = 1001:1:3000
Number of chains = 1
Samples per chain = 2000
Wall duration = 23.67 seconds
Compute duration = 23.67 seconds
parameters = R_0, D_inf
internals = n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size, logprior, loglikelihood, logjoint
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat e ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯
R_0 2.4468 0.0391 0.0016 589.7937 407.7591 1.0003 ⋯
D_inf 1.9458 0.0382 0.0016 571.4452 407.4433 1.0007 ⋯
1 column omitted
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
R_0 2.3670 2.4215 2.4476 2.4733 2.5229
D_inf 1.8681 1.9216 1.9452 1.9723 2.0179
# Compare Metropolis-Hastings vs NUTS
p1 = plot(
chain_mh[:R_0], title = "Metropolis-Hastings: R_0", xlabel = "Iteration",
ylabel = "R_0", linewidth = 1, legend = false
)
p2 = plot(
chain_nuts[:R_0], title = "NUTS: R_0", xlabel = "Iteration",
ylabel = "R_0", linewidth = 1, legend = false
)
p3 = plot(
chain_mh[:D_inf], title = "Metropolis-Hastings: D_inf",
xlabel = "Iteration", ylabel = "D_inf", linewidth = 1, legend = false
)
p4 = plot(
chain_nuts[:D_inf], title = "NUTS: D_inf", xlabel = "Iteration",
ylabel = "D_inf", linewidth = 1, legend = false
)
plot(p1, p2, p3, p4, layout = (2, 2), size = (800, 600))
Notice that:
- NUTS mixes better (explores parameter space more efficiently)
- NUTS needs fewer samples for same quality
- NUTS had automatic warmup (no manual burn-in needed — see below)
MCMC chains start at an arbitrary point in parameter space, which may be far from the high-probability region. The initial samples as the chain “finds its way” to the posterior are called burn-in (or warmup). These samples don’t represent the posterior and are typically discarded.
With Metropolis-Hastings, you need to decide how many burn-in samples to discard by inspecting trace plots. NUTS handles this automatically: it uses the warmup period to adapt its step size and mass matrix, then discards those samples for you.
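For the Metropolis-Hastings chain above, discarding burn-in by hand amounts to dropping the early iterations before summarising. A sketch using MCMCChains range indexing (the cut-off of 1,000 is an arbitrary illustrative choice you would justify by inspecting the trace plot):

```julia
# Keep everything after the first 1,000 iterations (iterations, parameters, chains)
chain_mh_burned = chain_mh[1001:end, :, :]
describe(chain_mh_burned)
```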
What NUTS gives you automatically:
- No tuning required - automatically adapts step size and mass matrix during warmup
- Efficient exploration - uses gradients to navigate parameter space intelligently
- Automatic warmup - built-in adaptation period (no manual burn-in decisions)
- Better convergence - typically needs far fewer samples than Metropolis-Hastings
# Built-in convergence diagnostics
using MCMCChains
summarystats(chain_nuts)
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯
R_0 2.4468 0.0391 0.0016 589.7937 407.7591 1.0003 ⋯
D_inf 1.9458 0.0382 0.0016 571.4452 407.4433 1.0007 ⋯
1 column omitted
- summarystats(chain) - returns a table with mean, std, MCSE (Monte Carlo Standard Error), ESS (Effective Sample Size), and R̂ (convergence diagnostic). We’ll learn what these mean in the next session.
| Sampler | Best for | Limitations |
|---|---|---|
| MH (Metropolis-Hastings) | Simple models, discrete parameters, learning MCMC | Requires manual tuning, slow for many parameters |
| RAM (Robust Adaptive Metropolis) | Models where gradients aren’t available (e.g. stochastic simulations) | Still random-walk, less efficient than gradient methods |
| NUTS (No U-Turn Sampler) | Continuous parameters with differentiable models (ODEs, most Turing models) | Can’t handle discrete parameters; needs gradients |
Rule of thumb: Start with NUTS(). If your model has discrete parameters or stochastic components that break automatic differentiation, fall back to MH or RAM.
Posterior analysis with Turing.jl
Now we can analyse the posterior distribution:
# Parameter correlations
corner_plot = corner(chain_nuts, [:R_0, :D_inf])
- corner(chain, params) - creates a “corner plot” showing the marginal distributions (diagonal) and pairwise joint distributions (off-diagonal) for the selected parameters. This visualises parameter correlations.
# Credible intervals
quantile(chain_nuts[:R_0], [0.025, 0.25, 0.5, 0.75, 0.975])
5-element Vector{Float64}:
2.367011139325599
2.4215085019088867
2.4475837433696648
2.473317665361791
2.5229458296181546
These summaries give us point estimates and uncertainty, but should we trust these results? The next session (MCMC Diagnostics) teaches you how to check whether your chains converged and whether your model makes sense.
Summary
We started by seeing why grid search fails in higher dimensions - the curse of dimensionality makes exhaustive exploration impossible. MCMC solves this by constructing a random walk that, over time, visits parameter values in proportion to their posterior probability.
The Metropolis-Hastings algorithm is the foundation: propose a move, accept or reject based on the posterior ratio. The challenge is choosing good proposals - too small and the chain moves slowly, too large and most moves are rejected. We saw how Robust Adaptive Metropolis (RAM) learns the posterior covariance automatically, eliminating manual tuning.
For models with gradients (like our deterministic ODE), NUTS goes further - using gradient information to make large, efficient moves through parameter space. Finally, we saw how Turing.jl wraps all of this in an elegant interface: define your model with @model, call sample(), and analyse the results.
- MCMC is a universal sampler: It works for any distribution where you can evaluate density
- Metropolis-Hastings is simple and robust but requires tuning of proposal distributions
- Adaptive MCMC (RAM) learns proposal covariance automatically - essential for correlated parameters
- Turing.jl automates the mechanics of MCMC and other inference methods
- NUTS is more efficient when gradients are available (e.g., deterministic models)
- The Bayesian workflow: Define model → Choose sampler → Analyse samples
References
- Turing.jl documentation - Probabilistic programming in Julia
- Vihola (2012). Robust Adaptive Metropolis Algorithm with Coerced Acceptance Rate. Statistics and Computing 22, 997-1008. (The RAM algorithm)
- Hoffman & Gelman (2014). The No-U-Turn Sampler. JMLR 15, 1593-1623. (The NUTS algorithm)
- Betancourt (2017). A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv. (Excellent intuition for HMC/NUTS)
- MCMCChains.jl - Chain analysis and diagnostics
Next session: MCMC diagnostics - how to tell if your chains have converged and your results are trustworthy!
Appendix: Coming from R?
The Julia MH sampler above translates to R as follows:
# Target distribution: Normal(5, 2)
target_mean <- 5.0
target_std <- 2.0
eval_log_density <- function(x) {
dnorm(x, mean = target_mean, sd = target_std, log = TRUE)
}
metropolis_hastings_1d <- function(log_density_fn, n_samples,
initial_value, proposal_std) {
samples <- numeric(n_samples)
current_x <- initial_value
current_log_prob <- log_density_fn(current_x)
n_accepted <- 0
for (i in 1:n_samples) {
proposal_x <- current_x + rnorm(1, mean = 0, sd = proposal_std)
proposal_log_prob <- log_density_fn(proposal_x)
log_alpha <- proposal_log_prob - current_log_prob
if (runif(1) < exp(log_alpha)) {
current_x <- proposal_x
current_log_prob <- proposal_log_prob
n_accepted <- n_accepted + 1
}
samples[i] <- current_x
}
list(samples = samples, acceptance_rate = n_accepted / n_samples)
}
result <- metropolis_hastings_1d(eval_log_density, 5000, 0.0, 1.0)
The core algorithm is identical - only syntax differs.