Markov Chain Monte Carlo (MCMC) sampling
Estimated time: 2.5-3 hours
Lecture slides: PowerPoint / PDF
Introduction
In the previous session, you learned Bayesian inference through manual calculations - computing priors, likelihoods, and posteriors and exploring the posterior distribution using grid search. This approach introduced you to the core components of conducting inference, but it also exposed a fundamental problem: grid search doesn’t scale.
This session introduces Markov Chain Monte Carlo (MCMC), the algorithmic solution that makes Bayesian inference practical for real problems. Rather than exhaustively searching parameter space, MCMC intelligently explores it by taking random walks that spend more time in high-probability regions.
You’ll build understanding by coding a simple MCMC algorithm yourself, then see how modern tools like Turing.jl automate everything with state-of-the-art methods. By the end, you’ll understand both what MCMC does under the hood and how to use it effectively in practice.
Objectives
The aim of this session is to learn how to sample from posterior distributions using MCMC. By building the algorithm yourself first, you’ll understand what modern tools like Turing.jl do under the hood.
In this session, you will:
- Implement the Metropolis-Hastings MCMC algorithm and apply it to a simple problem
- Experience the challenges of using MCMC: proposal tuning and mixing
- See how Turing.jl automates Metropolis-Hastings for you
- Learn why NUTS is the modern upgrade that solves some of the problems with Metropolis-Hastings
- Apply NUTS to conduct Bayesian inference on the SIR model
Source file
The source file of this session is located at sessions/mcmc.qmd.
Load libraries
using Random ## for random numbers
using Distributions ## for probability distributions
using DifferentialEquations ## for differential equations
using DataFrames ## for data frames
using Plots ## for plots
using StatsPlots ## for statistical plots
using Turing ## for probabilistic programming and MCMC
using MCMCChains ## for MCMC diagnostics
using AdvancedMH ## for adaptive MCMC samplers
Initialisation
We set a random seed for reproducibility. Setting this ensures that you should get exactly the same results on your computer as we do.
Random.seed!(1234)
Load data and model from previous session
We’ll use the same SIR model and epidemic data from the introduction session.
The model code is loaded from src/sir_model.jl.
Key functions:
- sir_ode!(du, u, p, t) - ODE system for S, I, R compartments
- simulate_sir(R_0, D_inf, S0, I0, times) - Returns DataFrame with trajectory
# Initial state and time range
init_state = [999, 1, 0]
t0, tend = (0, 30)
# Epidemic data
epi1 = DataFrame(
time = 1:1:30,
obs = [
1, 3, 4, 11, 19, 38, 67, 112, 183, 210, 237, 182, 178, 154, 95, 68, 51,
30, 32, 18, 19, 9, 2, 2, 2, 0, 0, 1, 0, 0
]
)
From manual exploration to algorithmic sampling
In the previous session, we manually explored the posterior by trying different parameter combinations. This worked for our simple SIR model with 2 parameters, but we also saw that the number of combinations to explore can be prohibitive.
Instead of trying all combinations, Markov chain Monte Carlo works in the following way:
- Start somewhere in parameter space
- Propose a small move to nearby parameters
- Accept or reject the move based on how much better/worse the posterior is
- Repeat many times to build up a sample
This is the Metropolis-Hastings algorithm - let’s implement it ourselves!
Metropolis-Hastings for a simple case
Let’s start by sampling from a simple 1D Normal distribution to understand the algorithm:
# Target distribution: Normal(5, 2)
target_mean = 5.0
target_std = 2.0
target_dist = Normal(target_mean, target_std)
# Helper function to evaluate log-density of target
function eval_log_density(x)
return logpdf(target_dist, x)
end
# Simple Metropolis-Hastings sampler
function metropolis_hastings_1d(log_density_fn, n_samples, initial_value,
proposal_std)
# Initialise sample vector
samples = Float64[]
# current_x tracks the state of the chain
# initially, set to given initial value
current_x = initial_value
# Evaluate log probability at the initial value
current_log_prob = log_density_fn(current_x)
# Track number of acceptances
n_accepted = 0
for i in 1:n_samples
# Propose new state
proposal_x = current_x + randn() * proposal_std
# Evaluate log probability at target
proposal_log_prob = log_density_fn(proposal_x)
# Calculate acceptance probability (in log space)
# Note: This is the Metropolis ratio (symmetric proposal)
# Full Metropolis-Hastings would include proposal ratio q(x|x')/q(x'|x)
# But Normal proposal is symmetric, so this ratio = 1 and cancels out
log_alpha = proposal_log_prob - current_log_prob
alpha = exp(log_alpha)
# Accept or reject
if rand() < alpha
current_x = proposal_x
current_log_prob = proposal_log_prob
n_accepted += 1
end # if reject, current_x stays what it is
push!(samples, current_x)
end
acceptance_rate = n_accepted / n_samples
return samples, acceptance_rate
end
# Test our sampler
samples_mh, acceptance_rate = metropolis_hastings_1d(
eval_log_density, 5000, 0.0, 1.0
)
println("Acceptance rate: $(round(acceptance_rate * 100, digits = 1))%")
println("Sample mean: $(round(mean(samples_mh), digits = 2)) (true: $(target_mean))")
println("Sample std: $(round(std(samples_mh), digits = 2)) (true: $(target_std))")
Acceptance rate: 84.0%
Sample mean: 5.12 (true: 5.0)
Sample std: 2.08 (true: 2.0)
# Visualise the results
p1 = plot(
samples_mh[1:1000], title = "Trace Plot (first 1000 samples)",
xlabel = "Iteration", ylabel = "x", linewidth = 1
)
hline!(p1, [target_mean], color = :red, linewidth = 2, label = "True mean")
p2 = histogram(
samples_mh[1001:end], normalize = :pdf, alpha = 0.7,
title = "Sampled Distribution", xlabel = "x", ylabel = "Density",
label = "MCMC samples"
)
# Overlay true distribution
x_range = range(-5, 15, length = 100)
plot!(
p2, x_range, pdf.(target_dist, x_range), linewidth = 3, color = :red,
label = "True distribution"
)
plot(p1, p2, layout = (2, 1), size = (800, 600))
The general principle: MCMC as a universal random number generator
What we just built solves a fundamental problem in computational statistics: given a probability distribution where we can evaluate the density at any point, how do we generate random samples from it?
Many useful probability distributions don’t have simple sampling algorithms. Easy distributions like Normal (randn()) and Uniform (rand()) have built-in samplers. But posteriors, mixtures, and custom models have no simple formula.
MCMC solves this by only requiring that we can evaluate the log-probability at any point. We used a Normal distribution above (even though we could just use randn()!) to demonstrate this general principle.
Even “simple” distributions like the Normal don’t have a direct sampling formula. When you call randn() in Julia or rnorm() in R, the computer uses transformation methods:
Common approaches:
Box-Muller transform: Convert two uniform random numbers to two normal random numbers
# Simplified version
u1, u2 = rand(), rand()
z1 = sqrt(-2 * log(u1)) * cos(2π * u2)
z2 = sqrt(-2 * log(u1)) * sin(2π * u2)
# z1 and z2 are standard normal
Inverse CDF method: Transform uniform → normal via the inverse cumulative distribution function
Ziggurat algorithm: Fast, more complex method (what Julia actually uses)
Even “easy” distributions require clever tricks. For complex posteriors with correlations, ODEs, and constraints, no such formula exists, and we need a method like MCMC.
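The inverse-CDF idea can be sketched in a few lines (an illustration only; as noted above, Julia's randn actually uses the Ziggurat algorithm):

```julia
using Distributions, Random, Statistics

Random.seed!(1)

# Inverse CDF method: push uniform draws through the normal quantile function
u = rand(10_000)                # Uniform(0, 1) draws
z = quantile.(Normal(0, 1), u)  # transformed to standard normal draws

# The transformed draws have (approximately) mean 0 and std 1
println(round(mean(z), digits = 2))
println(round(std(z), digits = 2))
```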
The posterior distribution is just another probability distribution. We can evaluate it at any point: log_posterior(θ) = log_prior(θ) + log_likelihood(θ|data). Because of this, MCMC can sample from it. The samples give us everything we need: visualisation (plot the distribution shape), summaries (mean, median, credible intervals), and forward simulation (with parameter correlations preserved).
When we apply MCMC to the SIR model later, we will be doing exactly the same thing as we did here for the Normal example - just with a more complex probability distribution that happens to be a Bayesian posterior. MCMC doesn’t “know” about Bayesian inference; it just knows how to sample from distributions where we can evaluate the density.
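To make this concrete, the same metropolis_hastings_1d function we wrote above can sample from any distribution whose log-density we can evaluate, for example a two-component Normal mixture, which has no simple direct sampler (a sketch; the mixture and the proposal std of 4.0 are illustrative choices, with the wide proposal helping the chain jump between the two modes):

```julia
using Distributions

# A bimodal mixture: no simple direct sampler, but logpdf is easy to evaluate
mixture = MixtureModel([Normal(0, 1), Normal(8, 1)], [0.5, 0.5])
log_density_mix(x) = logpdf(mixture, x)

# Reuse the sampler from above; a wide proposal lets the chain cross between modes
samples_mix, acc_mix = metropolis_hastings_1d(log_density_mix, 20_000, 0.0, 4.0)

println("Acceptance rate: $(round(acc_mix * 100, digits = 1))%")
# The sample mean should be near the mixture mean of 4.0
println("Sample mean: $(round(mean(samples_mix), digits = 2))")
```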
Understanding proposal distributions
The choice of proposal standard deviation matters:
# Test different proposal standard deviations
proposal_stds = [0.1, 0.5, 1.0, 5.0, 10.0, 50.0]
results = []
for prop_std in proposal_stds
samples, acc_rate = metropolis_hastings_1d(eval_log_density, 5000, 0.0, prop_std)
push!(results, (std = prop_std, samples = samples, acc_rate = acc_rate))
end
# Print acceptance rates
println("Acceptance rates for different proposal std:")
for result in results
println(
" Proposal std = $(result.std): ",
"$(round(result.acc_rate*100, digits = 1))%"
)
end
# Plot trace plots for different proposal stds
plots = []
for (i, result) in enumerate(results)
p = plot(
result.samples,
title = "Prop. std = $(result.std), " *
"Acc = $(round(result.acc_rate*100, digits = 1))%",
xlabel = "Iteration", ylabel = "x", linewidth = 1, ylim = (-5, 15)
)
hline!(p, [target_mean], color = :red, linewidth = 2, label = "")
push!(plots, p)
end
plot(plots..., layout = (length(plots), 1), size = (800, 1000))
Acceptance rates for different proposal std:
Proposal std = 0.1: 98.5%
Proposal std = 0.5: 91.8%
Proposal std = 1.0: 85.1%
Proposal std = 5.0: 42.6%
Proposal std = 10.0: 24.2%
Proposal std = 50.0: 5.0%
- Too small proposals → High acceptance, but slow exploration
- Too large proposals → Low acceptance, wasted computation
- Optimal: ~23% acceptance rate for random walk Metropolis
For the normal distribution, sampling was straightforward. But efficient sampling from complex posterior distributions requires a good proposal distribution, which can be difficult to find. This is where more sophisticated algorithms that tune the proposal distribution automatically to a given problem really help.
You’ve now implemented Metropolis-Hastings from scratch and understand its core mechanics. The next sections show how modern tools automate this. If you’re working through this in one sitting, this is a natural pause point.
Automating MCMC with Turing.jl
Now that you understand how MCMC works internally, let’s see how Turing.jl automates everything you just coded.
Turing.jl on the Normal distribution
First, let’s use Turing.jl to sample from the exact same Normal distribution you just coded manually:
# Define the same Normal(5, 2) distribution in Turing
@model function normal_model()
x ~ Normal(5.0, 2.0)
end
- @model - a Turing.jl macro that defines a probabilistic model. Everything inside becomes part of the model specification.
- x ~ Normal(5.0, 2.0) - the ~ operator means “is distributed as”. This declares that x is a random variable drawn from a Normal distribution with mean 5 and standard deviation 2. Turing will infer the value of x during sampling.
If you’ve used probabilistic programming languages, you’ll recognise the ~ syntax. Turing’s @model is similar to model definitions in Stan, NIMBLE, and BUGS/JAGS:
# NIMBLE / BUGS / JAGS
model({
x ~ dnorm(5, sd = 2)
})
// Stan
model {
x ~ normal(5, 2);
}
The key advantage of Turing is that the model is written in native Julia, so you can use loops, conditionals, and call any Julia function (like our ODE solver) directly inside the model — no separate modelling language needed.
# Sample using Metropolis-Hastings (normal proposals with std 10.0)
chain_normal_mh = sample(normal_model(), MH([10.0;;]), 50000)
- sample(model, sampler, n) - runs the MCMC sampler on the model for n iterations, returning a chain of samples.
- MH(proposal) - Metropolis-Hastings sampler. MH expects a matrix for the proposal covariance. The [10.0;;] syntax creates a 1×1 matrix — the double semicolon ;; tells Julia “this is a matrix, not a vector”. Without it, [10.0] would be a 1-element vector, which MH wouldn’t accept. The value 10.0 sets the proposal variance (so the proposal standard deviation is √10 ≈ 3.2).
# Compare to your manual implementation
println("Manual M-H:")
println(" Mean: $(round(mean(samples_mh), digits = 2))")
println(" Std: $(round(std(samples_mh), digits = 2))")
println("\nTuring.jl M-H:")
describe(chain_normal_mh)
Manual M-H:
Mean: 5.12
Std: 2.08
Turing.jl M-H:
Chains MCMC chain (50000×4×1 Array{Float64, 3}):
Iterations = 1:1:50000
Number of chains = 1
Samples per chain = 50000
Wall duration = 4.33 seconds
Compute duration = 4.33 seconds
parameters = x
internals = logprior, loglikelihood, logjoint
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯
x 5.0080 1.9952 0.0212 8887.3055 10687.6710 1.0001 ⋯
1 column omitted
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
x 1.1032 3.6642 5.0047 6.3546 8.9090
- describe(chain) - prints summary statistics for all parameters in the chain, including mean, standard deviation, and quantiles.
- chain[:x] - extracts the samples for parameter x from the chain. You can use this to access any parameter defined in your @model function.
If you’ve used Stan, the workflow is similar: sample() runs inference and returns a chain object, describe() gives summary statistics, and chain[:x] extracts samples for a parameter.
| Action | rstan | cmdstanr | Turing.jl |
|---|---|---|---|
| Run sampler | sampling(model, ...) | model$sample(...) | sample(model, NUTS(), n) |
| Summary | summary(fit) | fit$summary() | describe(chain) |
| Extract parameter | as.array(fit, pars="x") | fit$draws("x") | chain[:x] |
# Visualise - same result!
p1 = histogram(
samples_mh, normalize = :pdf, alpha = 0.7,
title = "Manual M-H", xlabel = "x", ylabel = "Density", label = "Manual"
)
p2 = histogram(
chain_normal_mh[:x], normalize = :pdf, alpha = 0.7,
title = "Turing M-H", xlabel = "x", ylabel = "Density", label = "Turing"
)
plot(p1, p2, layout = (1, 2), size = (800, 300))
Turing.jl did exactly what we coded earlier, but in a single call to the sample() function.
Applying Turing.jl to the SIR model
Turing.jl uses the @model macro to define Bayesian models. This is the same structure we’ve been using manually - priors, simulation, and likelihood - but now in a form that Turing can automatically sample from:
# Define our SIR model in Turing.jl
@model function sir_model(init_state, times, n_obs)
# Weakly informative priors
R_0 ~ truncated(Normal(2.5, 1.5), lower=1.0)
D_inf ~ truncated(Normal(5.0, 3.0), lower=0.5)
# Simulate ODE (same as manual approach)
prob = ODEProblem(sir_ode!, init_state, (times[1], times[end]), [R_0, D_inf])
sol = solve(prob, saveat = times)
# Likelihood (Poisson observation model of prevalence)
I = sol[2, :]
# max(..., 0.0) guards against tiny negative values from the ODE solver
# (numerical noise can push I slightly below zero near the end of an epidemic)
lambdas = [max(I[i], 0.0) for i in 1:n_obs]
obs ~ arraydist(Poisson.(lambdas))
end
# Create model instance
model = sir_model(init_state, epi1.time, length(epi1.obs)) | (; obs = epi1.obs)
arraydist and | conditioning
Two new pieces of syntax appear here:
obs ~ arraydist(Poisson.(lambdas)) — declares a vector of observations where each element has its own distribution. arraydist creates an independent product distribution: obs[1] ~ Poisson(lambdas[1]), obs[2] ~ Poisson(lambdas[2]), etc. The dot in Poisson.(lambdas) is Julia’s broadcasting — it applies Poisson to each element of lambdas.
sir_model(...) | (; obs = epi1.obs) — the | operator conditions the model on observed data. It tells Turing “don’t sample obs — use these observed values instead and compute the likelihood”. The (; obs = epi1.obs) is a named tuple matching the variable name obs inside the model to the data — the semicolon is Julia’s syntax for a named tuple with no positional arguments.
This separation of model structure from data is a key Turing pattern: the @model function defines the generative process, and | supplies the observations. The same model can be used unconditioned (for simulation), conditioned (for inference), or deconditioned after fitting for posterior predictive checks (decondition() — you’ll see this in the Model Checking session).
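The factorisation arraydist performs can be checked directly: the joint log-density of independent components is the sum of the element-wise log-densities. (The sketch below uses Distributions.jl's product_distribution, which behaves the same way for independent components; arraydist is Turing's more general version.)

```julia
using Distributions

lambdas = [1.0, 2.0, 3.0]
counts = [0, 1, 4]

# Joint distribution of independent Poisson observations
joint = product_distribution(Poisson.(lambdas))

# Joint log-density = sum of individual log-densities
lhs = logpdf(joint, counts)
rhs = sum(logpdf(Poisson(lambdas[i]), counts[i]) for i in eachindex(counts))
println(lhs ≈ rhs)   # true
```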
Sample the SIR model with Metropolis-Hastings
Now let’s use the Metropolis-Hastings algorithm on our epidemiological model:
# Same Metropolis-Hastings algorithm, but automated
chain_mh = sample(model, MH([1.0 0.0; 0.0 1.0]), 5000)
- [1.0 0.0; 0.0 1.0] - a 2×2 matrix in Julia. Spaces separate columns, semicolons separate rows. This is the identity matrix, used here as the proposal covariance for MH. The diagonal values (1.0) set the proposal variance for each parameter; off-diagonal zeros mean proposals are independent.
println("Turing.jl Metropolis-Hastings results:")
describe(chain_mh)
Turing.jl Metropolis-Hastings results:
Chains MCMC chain (5000×5×1 Array{Float64, 3}):
Iterations = 1:1:5000
Number of chains = 1
Samples per chain = 5000
Wall duration = 3.37 seconds
Compute duration = 3.37 seconds
parameters = R_0, D_inf
internals = logprior, loglikelihood, logjoint
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat e ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯
R_0 2.4220 0.1247 0.0174 17.3560 11.4341 1.3178 ⋯
D_inf 1.9168 0.1012 0.0125 16.2724 11.4341 1.3185 ⋯
1 column omitted
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
R_0 2.3892 2.3892 2.3892 2.3892 2.5760
D_inf 1.8947 1.8947 1.8947 1.8947 2.0119
# Visualise
plot(
chain_mh[:R_0], title = "Turing MH: R_0 Trace", xlabel = "Iteration",
ylabel = "R_0", linewidth = 1
)Finding good proposals by hand is difficult. The goal is to experience the challenge, not to find a perfect solution.
Suggested approach:
- Spend 5-10 minutes experimenting with different proposal matrices
- If you’re stuck after 10 minutes, reveal the solution - you’ve learned the key lesson
- The frustration is the point: manual tuning doesn’t scale to real problems
The trace plot above shows poor mixing - the chain gets stuck and doesn’t explore efficiently. Your task is to improve it by adjusting the proposal covariance matrix.
The proposal covariance matrix controls how the sampler proposes new parameter values:
# Current: identity matrix (proposal std = 1.0 for both parameters)
MH([1.0 0.0; 0.0 1.0])
# Structure: [var_R0 cov; cov var_Dinf]
# Diagonal = variance (std²) for each parameter
# Off-diagonal = covariance between parameters
Try these experiments (run each for 5000 samples):
- Smaller proposals: MH([0.1 0.0; 0.0 0.1]) - what happens to acceptance rate?
- Larger proposals: MH([5.0 0.0; 0.0 5.0]) - does it explore faster or slower?
- Different scales: MH([0.5 0.0; 0.0 0.1]) - R₀ and D_inf may need different step sizes
- Add correlation: MH([1.0 0.5; 0.5 1.0]) - if parameters are correlated in the posterior, correlated proposals may help
Questions to answer:
- What acceptance rate gives the best mixing? (Hint: aim for ~23% for random walk MH)
- Can you find settings that give a well-mixed trace for both parameters simultaneously?
- How long did tuning take you? Imagine doing this for 10+ parameters…
Acceptance rate too high (>50%)? Your proposals are too small. The chain accepts most moves but takes tiny steps - you’ll see the trace plot moving very slowly.
Acceptance rate too low (<10%)? Your proposals are too large. Most proposed jumps land in low-probability regions and get rejected - you’ll see the trace plot “stuck” for many iterations.
Sweet spot (~20-30%): A healthy balance of exploration and acceptance.
Check your acceptance rate (the fraction of iterations where the chain moved) with: mean(diff(vec(chain[:R_0])) .!= 0)
Look at the posterior uncertainty for each parameter:
- R₀ might range from ~1.5 to ~3.5 (spread of ~2)
- D_inf might range from ~1.5 to ~3.0 (spread of ~1.5)
If you use the same proposal std for both, one will be well-tuned while the other is either too large or too small. Try making the proposal std roughly proportional to the posterior width.
Rule of thumb: proposal std ≈ 0.1-0.3 × posterior width
Run the chain and look at a scatter plot of R₀ vs D_inf samples. If they’re negatively correlated (higher R₀ → lower D_inf), then independent proposals are inefficient.
Imagine the posterior is a narrow diagonal ridge. Independent proposals move horizontally or vertically, often stepping off the ridge. Correlated proposals can move along the ridge.
Try a small negative off-diagonal value: MH([0.3 -0.05; -0.05 0.1])
# Space for your experiments - modify the proposal matrix and re-run
# chain_tuned = sample(model, MH([??? ???; ??? ???]), 5000)
# plot(chain_tuned)
After experimentation, a proposal like this works reasonably well:
# Different scales for each parameter, with negative correlation
chain_tuned = sample(model, MH([0.3 -0.1; -0.1 0.1]), 5000)
Why this works:
Smaller proposals (0.3 for R₀, 0.1 for D_inf): The posterior is concentrated in a small region. Large proposals jump outside this region and get rejected.
Different scales: R₀ has a larger range (~1-20) than D_inf (~1-14), and the posterior uncertainty differs. Matching proposal scale to posterior scale improves efficiency.
Negative correlation (-0.1): Higher R₀ values are consistent with lower D_inf values (both increase transmission rate). Correlated proposals explore this ridge more efficiently.
The key insight: Finding these values required trial and error. Now imagine doing this for a model with 10 parameters (55 covariance values to tune). This is why adaptive methods like RAM and gradient-based methods like NUTS exist.
The difficulty you just experienced is why manual tuning doesn’t scale. With 2 parameters, you have 3 values to tune (two variances, one covariance). With 10 parameters, you’d have 55 values. This motivates adaptive methods that learn the proposal automatically.
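The quoted counts follow from the size of a symmetric covariance matrix: n parameters give n(n+1)/2 distinct entries to tune.

```julia
# Distinct entries in a symmetric n×n proposal covariance matrix
n_proposal_entries(n) = n * (n + 1) ÷ 2

println(n_proposal_entries(2))    # 3 (two variances, one covariance)
println(n_proposal_entries(10))   # 55
```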
Adaptive MCMC - Learning better proposals
The problem with standard MH is that we need to manually tune the proposal covariance. If parameters are correlated, independent proposals (diagonal covariance) will mix poorly.
Robust Adaptive Metropolis (RAM) solves this by automatically learning the posterior covariance during sampling:
# Adaptive MCMC - learns proposal covariance automatically
chain_ram = sample(
model,
externalsampler(AdvancedMH.RobustAdaptiveMetropolis()),
5000;
check_model=false
)
- RobustAdaptiveMetropolis() - from AdvancedMH.jl, this sampler adapts its proposal covariance based on the samples seen so far. It learns the correlation structure of the posterior and proposes moves that respect it.
- externalsampler() - wraps an external sampler (from AdvancedMH.jl) for use with Turing models.
- check_model=false - skips model validation (needed for some external samplers).
# Compare MH vs Adaptive MH
p1 = plot(chain_mh[:R_0], title="Standard MH: R_0", xlabel="Iteration", ylabel="R_0", legend=false)
p2 = plot(chain_ram[:R_0], title="Adaptive MH (RAM): R_0", xlabel="Iteration", ylabel="R_0", legend=false)
plot(p1, p2, layout=(1, 2), size=(800, 300))
Notice how the adaptive sampler explores the parameter space more efficiently.
NUTS - A gradient-based upgrade
For models where we can compute gradients (like our ODE model), NUTS (No U-Turn Sampler) offers even better efficiency:
# Modern gradient-based MCMC
# Expect: ~30-60 seconds on first run (includes compilation), ~10-20 seconds on subsequent runs
chain_nuts = sample(model, NUTS(), 2000)
- NUTS() - the No U-Turn Sampler, a gradient-based MCMC algorithm that automatically tunes its step size and trajectory length. Much more efficient than Metropolis-Hastings for continuous parameters. The algorithm uses automatic differentiation to compute gradients of the log-posterior.
How do gradients help with sampling? Metropolis-Hastings proposes random steps and hopes for the best — like exploring a dark room by stumbling around. NUTS computes the gradient (slope) of the log-posterior, which tells it which direction leads to higher probability. It then simulates a ball rolling along this surface, following the natural contours of the posterior. This lets it take large, efficient steps along ridges and valleys instead of the small, random steps of MH. The “No U-Turn” part automatically decides when to stop rolling, so you don’t need to tune a trajectory length.
println("NUTS results:")
describe(chain_nuts)
NUTS results:
Chains MCMC chain (2000×16×1 Array{Float64, 3}):
Iterations = 1001:1:3000
Number of chains = 1
Samples per chain = 2000
Wall duration = 23.67 seconds
Compute duration = 23.67 seconds
parameters = R_0, D_inf
internals = n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size, logprior, loglikelihood, logjoint
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat e ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯
R_0 2.4468 0.0391 0.0016 589.7937 407.7591 1.0003 ⋯
D_inf 1.9458 0.0382 0.0016 571.4452 407.4433 1.0007 ⋯
1 column omitted
Quantiles
parameters 2.5% 25.0% 50.0% 75.0% 97.5%
Symbol Float64 Float64 Float64 Float64 Float64
R_0 2.3670 2.4215 2.4476 2.4733 2.5229
D_inf 1.8681 1.9216 1.9452 1.9723 2.0179
# Compare Metropolis-Hastings vs NUTS
p1 = plot(
chain_mh[:R_0], title = "Metropolis-Hastings: R_0", xlabel = "Iteration",
ylabel = "R_0", linewidth = 1, legend = false
)
p2 = plot(
chain_nuts[:R_0], title = "NUTS: R_0", xlabel = "Iteration",
ylabel = "R_0", linewidth = 1, legend = false
)
p3 = plot(
chain_mh[:D_inf], title = "Metropolis-Hastings: D_inf",
xlabel = "Iteration", ylabel = "D_inf", linewidth = 1, legend = false
)
p4 = plot(
chain_nuts[:D_inf], title = "NUTS: D_inf", xlabel = "Iteration",
ylabel = "D_inf", linewidth = 1, legend = false
)
plot(p1, p2, p3, p4, layout = (2, 2), size = (800, 600))
Notice that:
- NUTS mixes better (explores parameter space more efficiently)
- NUTS needs fewer samples for same quality
- NUTS had automatic warmup (no manual burn-in needed — see below)
MCMC chains start at an arbitrary point in parameter space, which may be far from the high-probability region. The initial samples as the chain “finds its way” to the posterior are called burn-in (or warmup). These samples don’t represent the posterior and are typically discarded.
With Metropolis-Hastings, you need to decide how many burn-in samples to discard by inspecting trace plots. NUTS handles this automatically: it uses the warmup period to adapt its step size and mass matrix, then discards those samples for you.
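For the Metropolis-Hastings chain above, discarding burn-in by hand amounts to dropping the early iterations before summarising. A sketch using MCMCChains range indexing (the cut-off of 1,000 is an arbitrary illustrative choice you would justify by inspecting the trace plot):

```julia
# Keep everything after the first 1,000 iterations (iterations, parameters, chains)
chain_mh_burned = chain_mh[1001:end, :, :]
describe(chain_mh_burned)
```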
What NUTS gives you automatically:
- No tuning required - automatically adapts step size and mass matrix during warmup
- Efficient exploration - uses gradients to navigate parameter space intelligently
- Automatic warmup - built-in adaptation period (no manual burn-in decisions)
- Better convergence - typically needs far fewer samples than Metropolis-Hastings
# Built-in convergence diagnostics
using MCMCChains
summarystats(chain_nuts)
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat ⋯
Symbol Float64 Float64 Float64 Float64 Float64 Float64 ⋯
R_0 2.4468 0.0391 0.0016 589.7937 407.7591 1.0003 ⋯
D_inf 1.9458 0.0382 0.0016 571.4452 407.4433 1.0007 ⋯
1 column omitted
- summarystats(chain) - returns a table with mean, std, MCSE (Monte Carlo Standard Error), ESS (Effective Sample Size), and R̂ (convergence diagnostic). We’ll learn what these mean in the next session.
| Sampler | Best for | Limitations |
|---|---|---|
| MH (Metropolis-Hastings) | Simple models, discrete parameters, learning MCMC | Requires manual tuning, slow for many parameters |
| RAM (Robust Adaptive Metropolis) | Models where gradients aren’t available (e.g. stochastic simulations) | Still random-walk, less efficient than gradient methods |
| NUTS (No U-Turn Sampler) | Continuous parameters with differentiable models (ODEs, most Turing models) | Can’t handle discrete parameters; needs gradients |
Rule of thumb: Start with NUTS(). If your model has discrete parameters or stochastic components that break automatic differentiation, fall back to MH or RAM.
Posterior analysis with Turing.jl
Now we can analyse the posterior distribution:
# Parameter correlations
corner_plot = corner(chain_nuts, [:R_0, :D_inf])
- corner(chain, params) - creates a “corner plot” showing the marginal distributions (diagonal) and pairwise joint distributions (off-diagonal) for the selected parameters. This visualises parameter correlations.
# Credible intervals
quantile(chain_nuts[:R_0], [0.025, 0.25, 0.5, 0.75, 0.975])
5-element Vector{Float64}:
2.367011139325599
2.4215085019088867
2.4475837433696648
2.473317665361791
2.5229458296181546
These summaries give us point estimates and uncertainty, but should we trust these results? The next session (MCMC Diagnostics) teaches you how to check whether your chains converged and whether your model makes sense.
Summary
We started by seeing why grid search fails in higher dimensions - the curse of dimensionality makes exhaustive exploration impossible. MCMC solves this by constructing a random walk that, over time, visits parameter values in proportion to their posterior probability.
The Metropolis-Hastings algorithm is the foundation: propose a move, accept or reject based on the posterior ratio. The challenge is choosing good proposals - too small and the chain moves slowly, too large and most moves are rejected. We saw how Robust Adaptive Metropolis (RAM) learns the posterior covariance automatically, eliminating manual tuning.
For models with gradients (like our deterministic ODE), NUTS goes further - using gradient information to make large, efficient moves through parameter space. Finally, we saw how Turing.jl wraps all of this in an elegant interface: define your model with @model, call sample(), and analyse the results.
- MCMC is a universal sampler: It works for any distribution where you can evaluate density
- Metropolis-Hastings is simple and robust but requires tuning of proposal distributions
- Adaptive MCMC (RAM) learns proposal covariance automatically - essential for correlated parameters
- Turing.jl automates the mechanics of MCMC and other inference methods
- NUTS is more efficient when gradients are available (e.g., deterministic models)
- The Bayesian workflow: Define model → Choose sampler → Analyse samples
References
- Turing.jl documentation - Probabilistic programming in Julia
- Vihola (2012). Robust Adaptive Metropolis Algorithm with Coerced Acceptance Rate. Statistics and Computing 22, 997-1008. (The RAM algorithm)
- Hoffman & Gelman (2014). The No-U-Turn Sampler. JMLR 15, 1593-1623. (The NUTS algorithm)
- Betancourt (2017). A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv. (Excellent intuition for HMC/NUTS)
- MCMCChains.jl - Chain analysis and diagnostics
Next session: MCMC diagnostics - how to tell if your chains have converged and your results are trustworthy!
Appendix: Coming from R?
The Julia MH sampler above translates to R as follows:
# Target distribution: Normal(5, 2)
target_mean <- 5.0
target_std <- 2.0
eval_log_density <- function(x) {
dnorm(x, mean = target_mean, sd = target_std, log = TRUE)
}
metropolis_hastings_1d <- function(log_density_fn, n_samples,
initial_value, proposal_std) {
samples <- numeric(n_samples)
current_x <- initial_value
current_log_prob <- log_density_fn(current_x)
n_accepted <- 0
for (i in 1:n_samples) {
proposal_x <- current_x + rnorm(1, mean = 0, sd = proposal_std)
proposal_log_prob <- log_density_fn(proposal_x)
log_alpha <- proposal_log_prob - current_log_prob
if (runif(1) < exp(log_alpha)) {
current_x <- proposal_x
current_log_prob <- proposal_log_prob
n_accepted <- n_accepted + 1
}
samples[i] <- current_x
}
list(samples = samples, acceptance_rate = n_accepted / n_samples)
}
result <- metropolis_hastings_1d(eval_log_density, 5000, 0.0, 1.0)
The core algorithm is identical - only syntax differs.