Observation Models and Distance Metrics
Understanding the bridge between likelihood-based and likelihood-free inference
Estimated time: 2-2.5 hours | Advanced session | Requires: MCMC
Introduction
Do this session if you want to understand why we use Poisson or Negative Binomial likelihoods (not just how), or you’re curious about Approximate Bayesian Computation (ABC) as an alternative to likelihood-based inference.
So far in this course, we’ve used likelihood functions to connect our mathematical models to observed data. For example, we’ve assumed that observed case counts follow a Poisson distribution:
\[\text{obs}_t \sim \text{Poisson}(\rho \cdot I_t)\]
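As a concrete sketch of what this observation model does, here is one realisation of observed counts from a hypothetical incidence trajectory (the values of `I` and `ρ` below are made up for illustration):

```julia
using Distributions, Random

Random.seed!(42)

# Hypothetical model incidence I_t and reporting rate ρ (illustrative values only)
I = [2.0, 8.0, 25.0, 40.0, 30.0, 12.0, 4.0]
ρ = 0.6

# One realisation of the observation model: obs_t ~ Poisson(ρ * I_t)
obs = rand.(Poisson.(ρ .* I))
```

Each day's observed count is an independent Poisson draw whose mean is the reported fraction of that day's true incidence.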
But this raises important questions:
- Why Poisson? Is this mechanistically justified?
- What if we use Negative Binomial instead? How much does it matter?
- Are we doing statistics or just curve fitting?
In this session, we’ll explore these questions and discover that the distinction between “likelihood-based” and “likelihood-free” inference is less sharp than it initially appears.
Objectives
In this session you will:
- Critically evaluate observation model choices (Poisson, Negative Binomial, etc.)
- Understand negative log-likelihood as an implicit distance metric
- Implement Approximate Bayesian Computation (ABC) as an explicit alternative
- Compare likelihood-based and distance-based inference approaches
- Make informed choices about observation models for your own research
Source file
The source file of this session is located at sessions/observation_models.qmd.
Load libraries
using DifferentialEquations
using Distributions
using DataFrames
using CSV
using Plots
using StatsPlots
using Turing
using MCMCChains
using Random
using LinearAlgebra
using DrWatson
using StatsBase # For Weights in ABC-SMC
Initialisation
We set a random seed for reproducibility.
Random.seed!(1234)
Load Tristan da Cunha data
We’ll use the same influenza outbreak data from 1971:
# Load data
flu_tdc = CSV.read(datadir("flu_tdc_1971.csv"), DataFrame)
# Plot
scatter(
flu_tdc.time, flu_tdc.obs,
xlabel = "Time (days)", ylabel = "Daily incidence",
label = "Observed cases", markersize = 4, color = :red,
title = "Tristan da Cunha 1971 Influenza Outbreak",
legend = :topright
)
We have 59 daily observations of case counts.
Simple SIR model for comparison
For this session, let’s use a simpler SIR model to make the concepts clearer:
"""
Simple SIR model for pedagogical purposes.
"""
function simulate_sir(θ, init_state, times)
R_0, D_inf = θ[:R_0], θ[:D_inf]
function sir_ode!(du, u, p, t)
S, I, R = u
β, γ = p
N = S + I + R
du[1] = -β * S * I / N
du[2] = β * S * I / N - γ * I
du[3] = γ * I
end
β = R_0 / D_inf
γ = 1.0 / D_inf
u0 = [init_state[:S], init_state[:I], init_state[:R]]
prob = ODEProblem(sir_ode!, u0, (times[1], times[end]), [β, γ])
sol = solve(prob, Tsit5(), saveat=times)
# Extract incidence (new infections per day)
S_values = sol[1, :]
incidence = [0.0; -diff(S_values)]
return DataFrame(time=times, S=S_values,
I=sol[2, :],
R=sol[3, :],
Inc=incidence)
end
# Try some parameters
times = flu_tdc.time
θ_test = Dict(:R_0 => 10.0, :D_inf => 2.5)
init_test = Dict(:S => 270.0, :I => 3.0, :R => 0.0)
traj_test = simulate_sir(θ_test, init_test, times)
# Plot
plot(traj_test.time, traj_test.Inc,
label="Predicted incidence", linewidth=2, color=:blue)
scatter!(flu_tdc.time, flu_tdc.obs,
label="Observed cases", markersize=4, color=:red,
xlabel="Time (days)", ylabel="Cases",
title="SIR Model vs Observed Data")
Not a perfect fit, but good enough for illustration!
Approach 1: Poisson Likelihood (Traditional)
The standard approach
This is what we’ve been doing throughout the course:
@model function sir_poisson(times, init_state, n_obs)
# Priors
R_0 ~ Uniform(1.0, 20.0)
D_inf ~ Uniform(1.0, 10.0)
ρ ~ Uniform(0.1, 1.0) # Reporting rate
# Simulate
θ = Dict(:R_0 => R_0, :D_inf => D_inf)
traj = simulate_sir(θ, init_state, times)
# Likelihood: Poisson observation model
lambdas = [max(ρ * traj.Inc[i], 1e-10) for i in 1:n_obs]
obs ~ arraydist(Poisson.(lambdas))
return traj
end
arraydist and | conditioning
obs ~ arraydist(Poisson.(lambdas)) declares a vector of independent Poisson observations, one per time point. The | operator, applied when creating the model instance below, conditions the model on the actual data. See the MCMC session for a full explanation of this pattern.
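`arraydist` (from Turing) broadly plays the same role as `product_distribution` in plain Distributions.jl: the joint log-density of the vector is the sum of the per-element log-densities. A minimal check of that additivity, written with `product_distribution` so it runs without Turing (the values are arbitrary):

```julia
using Distributions

lambdas = [2.0, 5.0, 9.0]
obs = [1, 6, 8]

# Joint log-density of a vector of independent Poisson observations...
joint = logpdf(product_distribution(Poisson.(lambdas)), obs)

# ...equals the sum of the element-wise log-densities
elementwise = sum(logpdf.(Poisson.(lambdas), obs))

@assert isapprox(joint, elementwise)
```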
We fit this model:
# Fit with NUTS
model_poisson = sir_poisson(times, init_test, length(flu_tdc.obs)) | (; obs = flu_tdc.obs)
chain_poisson = sample(model_poisson, NUTS(0.65), 500, progress=false)
# The 0.65 is the target acceptance rate for NUTS — higher values (closer to 1)
# make the sampler more conservative, lower values allow larger steps.
# The default 0.65 works well for most models.
┌ Info: Found initial step size
└   ϵ = 0.2
Chains MCMC chain (500×17×1 Array{Float64, 3}):
Iterations = 251:1:750
Number of chains = 1
Samples per chain = 500
Wall duration = 14.19 seconds
Compute duration = 14.19 seconds
parameters = R_0, D_inf, ρ
internals = n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size, logprior, loglikelihood, logjoint
Use `describe(chains)` for summary statistics and quantiles.
# Plot posteriors
plot(chain_poisson[[:R_0, :D_inf, :ρ]], size=(900, 600))
# Summary statistics
summarystats(chain_poisson)
Summary Statistics

| parameters | mean | std | mcse | ess_bulk | ess_tail | rhat |
|---|---|---|---|---|---|---|
| R_0 | 1.7456 | 0.1141 | 0.0070 | 300.1356 | 314.8640 | 0.9993 |
| D_inf | 3.2594 | 0.3184 | 0.0190 | 299.1105 | 298.2930 | 0.9989 |
| ρ | 0.9920 | 0.0082 | 0.0006 | 114.3943 | 91.5676 | 1.0037 |

(1 column omitted)
What does Poisson assume?
The Poisson distribution has a key property: variance = mean
# Illustrate Poisson assumption
λ_values = [5, 20, 50]
p = plot(title="Poisson Distribution: Variance = Mean",
xlabel="Count", ylabel="Probability")
for λ in λ_values
dist = Poisson(λ)
x_range = max(0, λ-20):min(λ+20, λ*2)
plot!(p, x_range, [pdf(dist, x) for x in x_range],
label="λ=$λ (σ²=$λ)", linewidth=2)
end
p
Question: Is this assumption justified for epidemic data?
- Under-reporting might be systematic (not random)
- Behavioural changes might create clustering
- Surveillance quality might vary over time
Approach 2: Negative Binomial Likelihood (Pragmatic)
Adding overdispersion
When Poisson doesn’t fit well (variance > mean), we often use Negative Binomial:
\[\text{obs}_t \sim \text{NegativeBinomial}(\mu_t, \phi)\]
where:
- \(\mu_t = \rho \cdot I_t\) (mean)
- \(\phi\) controls overdispersion (variance = \(\mu + \mu^2/\phi\))
@model function sir_negbin(times, init_state, n_obs)
# Priors
R_0 ~ Uniform(1.0, 20.0)
D_inf ~ Uniform(1.0, 10.0)
ρ ~ Uniform(0.1, 1.0)
φ ~ Exponential(10.0) # Overdispersion parameter
# Simulate
θ = Dict(:R_0 => R_0, :D_inf => D_inf)
traj = simulate_sir(θ, init_state, times)
# Likelihood: Negative Binomial observation model
mus = [max(ρ * traj.Inc[i], 1e-10) for i in 1:n_obs]
# NegativeBinomial(r, p) where r=φ, p=φ/(φ+μ)
obs ~ arraydist([NegativeBinomial(φ, φ/(φ+μ)) for μ in mus])
return traj
end
`NegativeBinomial(r, p)`: Julia's parameterisation uses `r` (number of successes) and `p` (success probability). To get mean `μ` with overdispersion `φ`, use `r = φ` and `p = φ/(φ+μ)`. This gives variance = `μ + μ²/φ`, so larger `φ` approaches Poisson and smaller `φ` gives more overdispersion.
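A quick numerical check that this parameterisation recovers the intended mean and variance (the values of `μ` and `φ` are arbitrary):

```julia
using Distributions

μ, φ = 12.0, 5.0

# Julia's NegativeBinomial(r, p) with r = φ and p = φ/(φ+μ)
d = NegativeBinomial(φ, φ / (φ + μ))

# Mean recovers μ; variance is μ + μ²/φ (overdispersed relative to Poisson)
@assert isapprox(mean(d), μ)
@assert isapprox(var(d), μ + μ^2 / φ)
```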
Honest question: Why Negative Binomial?
- Mechanistic justification: Weak (no clear biological process)
- Empirical justification: Fits better than Poisson
- Convenience: Easy to compute, works with MCMC
Is this really different from choosing a distance metric?
Interlude: Likelihood as Distance
Likelihood as distance
The negative log-likelihood is actually a distance function between observed and predicted data:
\[\text{Distance} = -\log p(\text{obs} | \text{predicted})\]
We can see what distance each likelihood implies:
# Compare different observation models as distance functions
obs_val = 10.0
predicted_range = 1.0:0.5:20.0
# Poisson: -log P(obs | λ)
poisson_distance = [-logpdf(Poisson(λ), obs_val) for λ in predicted_range]
# Negative Binomial with different φ (convert to r,p parametrization)
# NegativeBinomial(r, p) where r=φ, p=φ/(φ+μ)
negbin_distance_phi5 = [-logpdf(NegativeBinomial(5.0, 5.0/(5.0+λ)), obs_val) for λ in predicted_range]
negbin_distance_phi20 = [-logpdf(NegativeBinomial(20.0, 20.0/(20.0+λ)), obs_val) for λ in predicted_range]
# Gaussian (for comparison)
gaussian_distance = [-logpdf(Normal(λ, 3.0), obs_val) for λ in predicted_range]
# Plot
plot(predicted_range, poisson_distance,
label="Poisson", linewidth=2,
xlabel="Predicted value", ylabel="-log(likelihood)",
title="Different Likelihoods as Distance Functions (obs=10)",
legend=:topright)
plot!(predicted_range, negbin_distance_phi5,
label="NegBin(φ=5)", linewidth=2)
plot!(predicted_range, negbin_distance_phi20,
label="NegBin(φ=20)", linewidth=2)
plot!(predicted_range, gaussian_distance,
label="Gaussian(σ=3)", linewidth=2, linestyle=:dash)This reveals that different likelihoods are just different ways of measuring distance:
- Poisson: Penalises deviations inversely proportional to predicted value
- Negative Binomial: Similar but allows more flexibility (overdispersion)
- Gaussian: Constant penalty regardless of scale
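For the Gaussian case this correspondence is exact: the negative log-likelihood is a scaled squared (L2) distance plus a constant that does not depend on the prediction. A small numerical check (values chosen arbitrarily):

```julia
using Distributions

obs, σ = 10.0, 3.0
predictions = [4.0, 8.0, 15.0]

# Negative log-likelihood under Normal(λ, σ)
nll = [-logpdf(Normal(λ, σ), obs) for λ in predictions]

# Scaled squared distance
sq = [(obs - λ)^2 / (2σ^2) for λ in predictions]

# The two differ only by the constant log(σ√(2π)), independent of λ
const_term = log(σ * sqrt(2π))
@assert all(isapprox.(nll, sq .+ const_term))
```

Minimising the Gaussian negative log-likelihood is therefore identical to least-squares fitting.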
Approach 3: Approximate Bayesian Computation
ABC: Making the distance explicit
Instead of choosing a parametric likelihood, ABC directly chooses:
- Summary statistics: What features of the data matter?
- Distance metric: How do we measure similarity?
- Threshold: How close is “close enough”?
Here is a simple ABC rejection sampler from scratch:
"""
Simple ABC rejection sampler.
Samples from prior, simulates data, accepts if distance < threshold.
"""
function abc_rejection(prior_sampler, simulator, distance_fn, observed_data,
threshold, n_samples)
samples = []
n_attempts = 0
max_attempts = n_samples * 1000 # Give up after many attempts
while length(samples) < n_samples && n_attempts < max_attempts
n_attempts += 1
# Sample from prior
θ = prior_sampler()
# Simulate data
simulated_data = simulator(θ)
# Compute distance
d = distance_fn(simulated_data, observed_data)
# Accept if close enough
if d < threshold
push!(samples, θ)
end
end
acceptance_rate = length(samples) / n_attempts
println("Acceptance rate: $(round(acceptance_rate * 100, digits=2))%")
println("Attempts needed: $n_attempts")
return samples
end
Now let’s define the components:
# Prior sampler
function sample_from_prior()
R_0 = rand(Uniform(1.0, 20.0))
D_inf = rand(Uniform(1.0, 10.0))
ρ = rand(Uniform(0.1, 1.0))
return [R_0, D_inf, ρ]
end
# Simulator
function simulate_sir_abc(params)
R_0, D_inf, ρ = params
θ = Dict(:R_0 => R_0, :D_inf => D_inf)
traj = simulate_sir(θ, init_test, times)
return ρ .* traj.Inc
end
# Distance functions
distance_L1(sim, obs) = sum(abs.(sim .- obs))
function distance_L2(sim, obs)
return sqrt(sum((sim .- obs).^2))
end
function distance_weighted(sim, obs)
# Weight by 1/observation (similar to Poisson variance structure)
weights = 1.0 ./ (obs .+ 1.0)
return sum(weights .* abs.(sim .- obs))
end
# Observed data
observed_data = flu_tdc.obs
59-element Vector{Int64}:
0
1
0
10
6
32
47
37
29
11
⋮
2
0
0
0
0
0
0
0
1
We’ve defined three distance functions, each with different properties:
- **L1 (Manhattan)**: `sum(|sim - obs|)`. Sums absolute deviations. Robust to outliers because large errors contribute linearly: a mismatch of 20 cases counts twice as much as a mismatch of 10.
- **L2 (Euclidean)**: `sqrt(sum((sim - obs)²))`. Root of squared deviations. Penalises large errors more heavily than L1: before the square root, a mismatch of 20 cases counts four times as much as a mismatch of 10 (20² = 400 vs 10² = 100).
- **Weighted**: `sum(w × |sim - obs|)` where `w = 1/(obs + 1)`. Weights inversely by observation size, mimicking Poisson variance scaling: a mismatch of 5 when obs = 2 is penalised more than when obs = 50. Useful when you care about relative rather than absolute fit.
When to use each:
- L1: Good default. Robust to occasional large mismatches.
- L2: When large deviations are particularly bad (e.g., outbreak peaks).
- Weighted: When fit quality should be scale-independent (similar to Poisson likelihood).
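The scaling claims above are easy to verify numerically: doubling a single mismatch doubles its L1 contribution but quadruples its squared-error contribution (the vectors below are toy values):

```julia
# Same distance definitions as in this session, applied to toy vectors
distance_L1(sim, obs) = sum(abs.(sim .- obs))
distance_L2(sim, obs) = sqrt(sum((sim .- obs).^2))

obs = zeros(5)
mismatch10 = [10.0, 0.0, 0.0, 0.0, 0.0]  # one error of 10
mismatch20 = [20.0, 0.0, 0.0, 0.0, 0.0]  # one error of 20

# L1: doubling the error doubles the distance
@assert distance_L1(mismatch20, obs) == 2 * distance_L1(mismatch10, obs)

# Squared error (L2 before the square root): doubling the error quadruples it
@assert sum((mismatch20 .- obs).^2) == 4 * sum((mismatch10 .- obs).^2)
```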
Run ABC with L1 distance:
ABC rejection sampling is sensitive to the threshold choice:
- Too strict (e.g., 100): Long wait times, few or zero accepted samples
- Too loose (e.g., 1000): Fast, but approximate posterior is close to the prior
- Just right (e.g., 200-400): Reasonable acceptance rate with meaningful posterior concentration
We start with a relaxed threshold (300) here.
If you get 0 samples: Increase the threshold (try 400, then 500). The sampler may be struggling to find parameter combinations that produce simulations close to the data. This can happen due to random variation - ABC is inherently stochastic.
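The threshold/acceptance trade-off can be illustrated without running the full model: treat the distances of prior-predictive simulations as a fixed sample and count how many fall under each candidate threshold. The distance distribution below is a made-up stand-in, not output from this session's model:

```julia
using Distributions, Random, Statistics

Random.seed!(1)

# Stand-in for L1 distances of 10,000 prior-predictive simulations to the data
distances = rand(Gamma(4.0, 100.0), 10_000)

# Acceptance rate rises as the threshold loosens
for ε in (100.0, 300.0, 1000.0)
    rate = mean(distances .< ε)
    println("threshold = $ε: acceptance rate ≈ $(round(100rate, digits=1))%")
end
```

In practice you would compute such a pilot sample of distances once, then pick a threshold giving a workable acceptance rate.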
# Start with a relaxed threshold to get some samples
threshold = 300.0 # Relaxed for faster acceptance
println("Running ABC rejection sampler...")
abc_samples = abc_rejection(
sample_from_prior,
simulate_sir_abc,
distance_L1,
observed_data,
threshold,
200 # Target number of samples
)
println("\nObtained $(length(abc_samples)) samples")
Running ABC rejection sampler...
Acceptance rate: 35.59%
Attempts needed: 562
Obtained 200 samples
# Extract parameters
if length(abc_samples) > 0
abc_df = DataFrame(
R_0 = [s[1] for s in abc_samples],
D_inf = [s[2] for s in abc_samples],
ρ = [s[3] for s in abc_samples]
)
# Plot ABC results
histogram(abc_df.R_0, bins=30, alpha=0.6,
label="ABC (L1 distance)",
xlabel="R_0", ylabel="Frequency",
title="ABC Posterior Distribution for R_0")
else
println("No samples accepted - threshold too strict!")
end
Compare distance metrics
Different distance functions give different results:
# Test on a candidate parameter set
test_params = [10.0, 2.5, 0.8]
test_sim = simulate_sir_abc(test_params)
println("Distance metrics for test parameters:")
println(" L1: ", round(distance_L1(test_sim, observed_data), digits=2))
println(" L2: ", round(distance_L2(test_sim, observed_data), digits=2))
println("  Weighted: ", round(distance_weighted(test_sim, observed_data), digits=2))
Distance metrics for test parameters:
L1: 502.82
L2: 168.89
Weighted: 197.64
Notice that choosing a distance metric in ABC is philosophically similar to choosing a parametric likelihood.
ABC-SMC: Sequential Monte Carlo ABC
The rejection sampler above is inefficient - most samples from the prior are rejected, wasting computation. ABC-SMC (Sequential Monte Carlo ABC) dramatically improves efficiency by gradually tightening the acceptance threshold across multiple generations.
The key idea, introduced by Toni et al. (2009), is:
- Start permissive: Begin with a loose threshold that accepts many samples
- Gradually tighten: Reduce the threshold across generations
- Reuse good samples: Instead of sampling from the prior each time, perturb samples that passed the previous threshold
This is much more efficient because particles that are already “close” to the data are refined, rather than searching blindly from the prior each time. Typical speedups are 20-50× compared to rejection sampling.
This is a simplified ABC-SMC for pedagogical purposes. Production implementations (like ApproxBayes.jl or GpABC.jl) have adaptive thresholds, better perturbation kernels, and importance weighting.
Note on Threads.@threads: The code below uses Threads.@threads to parallelise the particle loop — this macro distributes iterations across available CPU threads, so each particle can be simulated concurrently. For this to have any effect, Julia must be started with multiple threads (e.g., julia -t auto). See the threading note in MCMC Diagnostics for setup instructions. Without multiple threads, the loop runs sequentially (slower but still correct).
"""
Simple ABC-SMC implementation.
Uses sequence of decreasing thresholds with resampling.
"""
function abc_smc(prior_sampler, simulator, distance_fn, observed_data,
thresholds, n_particles)
particles = []
weights = ones(n_particles) ./ n_particles
for (t_idx, threshold) in enumerate(thresholds)
println("\nGeneration $t_idx: threshold = $threshold")
new_particles = Vector{Vector{Float64}}(undef, n_particles)
if t_idx == 1
# First generation: sample from prior (parallelised)
Threads.@threads for i in 1:n_particles
while true
θ = prior_sampler()
sim = simulator(θ)
d = distance_fn(sim, observed_data)
if d < threshold
new_particles[i] = θ
break
end
end
end
else
# Later generations: perturb from previous (parallelised)
Threads.@threads for i in 1:n_particles
while true
# Sample from previous generation (weighted)
idx = sample(1:length(particles), Weights(weights))
θ_old = particles[idx]
# Perturb (simple Gaussian perturbation)
θ_new = θ_old .+ randn(length(θ_old)) .* 0.1 .* θ_old
# Check bounds (reject if outside prior)
if θ_new[1] < 1.0 || θ_new[1] > 20.0 continue end
if θ_new[2] < 1.0 || θ_new[2] > 10.0 continue end
if θ_new[3] < 0.1 || θ_new[3] > 1.0 continue end
# Simulate and check distance
sim = simulator(θ_new)
d = distance_fn(sim, observed_data)
if d < threshold
new_particles[i] = θ_new
break
end
end
end
end
particles = collect(new_particles)
weights = ones(n_particles) ./ n_particles # Uniform weights (simplified)
println(" Mean R_0: ", round(mean([p[1] for p in particles]), digits=2))
end
return particles
end
# Run ABC-SMC with decreasing thresholds
println("Running ABC-SMC...")
thresholds = [400.0, 300.0, 250.0] # Decreasing threshold sequence
abc_smc_samples = abc_smc(
sample_from_prior,
simulate_sir_abc,
distance_L1,
observed_data,
thresholds,
100 # Number of particles
)
abc_smc_df = DataFrame(
R_0 = [s[1] for s in abc_smc_samples],
D_inf = [s[2] for s in abc_smc_samples],
ρ = [s[3] for s in abc_smc_samples]
)
Running ABC-SMC...
Generation 1: threshold = 400.0
Mean R_0: 8.52
Generation 2: threshold = 300.0
Mean R_0: 4.85
Generation 3: threshold = 250.0
Mean R_0: 5.45
| Row | R_0 | D_inf | ρ |
|---|---|---|---|
| | Float64 | Float64 | Float64 |
| 1 | 3.88821 | 3.53716 | 0.819859 |
| 2 | 5.67084 | 5.0937 | 0.691977 |
| 3 | 4.96967 | 4.7495 | 0.520559 |
| 4 | 1.89791 | 1.39827 | 0.569458 |
| 5 | 5.78802 | 9.10336 | 0.471406 |
| 6 | 1.5424 | 1.34745 | 0.68343 |
| 7 | 6.04643 | 6.93671 | 0.740918 |
| 8 | 3.82062 | 4.53842 | 0.696085 |
| 9 | 5.85712 | 9.43924 | 0.575105 |
| 10 | 5.47996 | 4.70649 | 0.454627 |
| 11 | 5.80899 | 8.30352 | 0.493937 |
| 12 | 8.48744 | 9.77517 | 0.492892 |
| 13 | 5.82513 | 8.56091 | 0.412056 |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 89 | 4.0472 | 3.34504 | 0.86704 |
| 90 | 5.59588 | 5.03534 | 0.744871 |
| 91 | 3.3375 | 3.409 | 0.811286 |
| 92 | 6.89871 | 8.87654 | 0.388308 |
| 93 | 7.003 | 8.91391 | 0.796584 |
| 94 | 9.48499 | 9.62259 | 0.503329 |
| 95 | 8.60448 | 8.30083 | 0.40489 |
| 96 | 5.30674 | 8.33293 | 0.544474 |
| 97 | 4.75883 | 7.29285 | 0.547479 |
| 98 | 5.60151 | 6.90065 | 0.449812 |
| 99 | 8.89455 | 8.47311 | 0.577395 |
| 100 | 3.78894 | 3.12373 | 0.755425 |
Comparing rejection ABC and ABC-SMC
The following compares the two approaches on our problem:
# Compare efficiency: how many simulations were needed?
println("Comparison: Rejection ABC vs ABC-SMC")
println("="^50)
# Rejection ABC stats (from earlier run)
rejection_samples = length(abc_samples)
# The acceptance rate (35.59%) and attempt count (562) were printed by abc_rejection above
# ABC-SMC: n_particles × n_generations simulations (plus rejections within each)
smc_generations = length(thresholds)
smc_particles = 100
println("Rejection ABC:")
println(" Samples obtained: $rejection_samples")
println(" Threshold used: 300.0")
println("\nABC-SMC:")
println(" Samples obtained: $(length(abc_smc_samples))")
println(" Final threshold: $(thresholds[end])")
println("  Generations: $smc_generations")
Comparison: Rejection ABC vs ABC-SMC
==================================================
Rejection ABC:
Samples obtained: 200
Threshold used: 300.0
ABC-SMC:
Samples obtained: 100
Final threshold: 250.0
Generations: 3
# Compare posterior quality
println("\nPosterior summaries:")
println("="^50)
println("Rejection ABC (threshold=300):")
println(" R_0: mean=$(round(mean(abc_df.R_0), digits=2)), std=$(round(std(abc_df.R_0), digits=2))")
println(" D_inf: mean=$(round(mean(abc_df.D_inf), digits=2)), std=$(round(std(abc_df.D_inf), digits=2))")
println("\nABC-SMC (final threshold=250):")
println(" R_0: mean=$(round(mean(abc_smc_df.R_0), digits=2)), std=$(round(std(abc_smc_df.R_0), digits=2))")
println(" D_inf: mean=$(round(mean(abc_smc_df.D_inf), digits=2)), std=$(round(std(abc_smc_df.D_inf), digits=2))")
Posterior summaries:
==================================================
Rejection ABC (threshold=300):
R_0: mean=4.88, std=2.76
D_inf: mean=6.71, std=2.35
ABC-SMC (final threshold=250):
R_0: mean=5.45, std=2.35
D_inf: mean=6.07, std=2.84
# Visual comparison
p1 = histogram(abc_df.R_0, bins=20, alpha=0.6, normalize=:pdf,
label="Rejection (ε=300)", xlabel="R_0", ylabel="Density")
histogram!(p1, abc_smc_df.R_0, bins=20, alpha=0.6, normalize=:pdf,
label="ABC-SMC (ε=250)")
p2 = histogram(abc_df.D_inf, bins=20, alpha=0.6, normalize=:pdf,
label="Rejection (ε=300)", xlabel="D_inf", ylabel="Density")
histogram!(p2, abc_smc_df.D_inf, bins=20, alpha=0.6, normalize=:pdf,
label="ABC-SMC (ε=250)")
plot(p1, p2, layout=(1, 2), size=(900, 400),
title=["R_0: Rejection vs SMC" "D_inf: Rejection vs SMC"])
We see that ABC-SMC achieves a tighter threshold (250 vs 300) with fewer wasted simulations. The resulting posteriors are more concentrated - ABC-SMC has effectively "zoomed in" on the high-probability region. This efficiency gain becomes even more pronounced in higher dimensions, where rejection sampling becomes impractical.
Unlike MCMC, ABC does not produce chains, so standard diagnostics like ESS (effective sample size) and R-hat do not apply. Instead, assess ABC posterior quality by:
- Acceptance rate: Very low rates (< 0.1%) suggest the threshold is too strict or the model is a poor fit. Very high rates (> 10%) suggest the threshold is too loose and the posterior is close to the prior.
- Visual comparison with likelihood-based posteriors: When a likelihood is available (as in this session), run both methods and compare. Large discrepancies suggest either the ABC threshold is too loose or the distance metric captures different features than the likelihood.
- Threshold sensitivity: Run ABC with several thresholds. If the posterior changes substantially as you tighten the threshold, it has not yet converged — you need a stricter threshold (and more computation).
- Coverage of prior: If the ABC posterior looks very similar to the prior, the threshold is too loose and the data are not informing the inference.
- Replicate runs: Run ABC multiple times with different random seeds. If posteriors differ substantially between runs, you need more particles or a less strict threshold.
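One way to operationalise the threshold-sensitivity check is a toy experiment. The 1-parameter Gaussian "model", prior range, and thresholds below are all illustrative choices, not from this session: if the posterior spread keeps shrinking as the threshold tightens, the ABC posterior has not yet converged.

```julia
using Distributions, Random, Statistics

Random.seed!(7)

# Toy setup: data are 20 draws from Normal(μ_true, 1); the simulator redraws
# the same-sized dataset under a candidate μ.
μ_true = 3.0
data = rand(Normal(μ_true, 1.0), 20)

# Distance on a summary statistic (the sample mean)
dist(sim, obs) = abs(mean(sim) - mean(obs))

function abc_posterior(ε; n = 50_000)
    accepted = Float64[]
    for _ in 1:n
        μ = rand(Uniform(-10.0, 10.0))            # sample from the prior
        sim = rand(Normal(μ, 1.0), length(data))  # simulate
        dist(sim, data) < ε && push!(accepted, μ) # accept if close enough
    end
    return accepted
end

# Posterior spread should shrink, then stabilise, as ε tightens
for ε in (2.0, 0.5, 0.1)
    post = abc_posterior(ε)
    println("ε = $ε: $(length(post)) samples, sd ≈ $(round(std(post), digits=2))")
end
```

The same loop applied to this session's `abc_rejection` with a ladder of thresholds gives the threshold-sensitivity diagnostic described above.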
The big picture: Likelihood vs ABC
Now let’s step back and compare all our approaches - parametric likelihoods and ABC methods - to see what’s fundamentally different (and what’s surprisingly similar).
Summary of approaches
| Approach | What you specify | What’s arbitrary | Computational cost |
|---|---|---|---|
| Poisson likelihood | Parametric distribution | Why Poisson? | Fast (MCMC) |
| NegBin likelihood | Parametric + overdispersion | Why NegBin? | Fast (MCMC) |
| ABC (L1) | Distance metric | Why L1? | Slow (many simulations) |
| ABC (weighted) | Distance + weights | Why these weights? | Slow (many simulations) |
Do they give similar results?
Comparing the posteriors:
# Extract MCMC results
mcmc_df = DataFrame(chain_poisson)
# Plot comparison
p1 = histogram(mcmc_df.R_0, bins=30, alpha=0.5,
label="Poisson", normalize=:pdf)
histogram!(p1, abc_df.R_0, bins=30, alpha=0.5,
label="ABC (L1)", normalize=:pdf,
xlabel="R_0", ylabel="Density",
title="Posterior Comparison: R_0")
p2 = histogram(mcmc_df.D_inf, bins=30, alpha=0.5,
label="Poisson", normalize=:pdf)
histogram!(p2, abc_df.D_inf, bins=30, alpha=0.5,
label="ABC (L1)", normalize=:pdf,
xlabel="D_inf", ylabel="Density",
title="Posterior Comparison: D_inf")
plot(p1, p2, layout=(1, 2), size=(900, 400))
How to interpret these plots: Look for whether the posterior distributions overlap substantially. If the MCMC (Poisson) and ABC (L1) posteriors cover a similar range and peak in similar places, the two approaches agree — meaning the choice of observation model has little effect on inference for this dataset. If one posterior is much wider, shifted, or multimodal compared to the other, the observation model is influencing conclusions. A useful rule of thumb: if the posterior means are within one standard deviation of each other and the bulk of the distributions overlap, the methods are giving consistent answers.
Discussion: Are the posteriors similar or different? What does this tell us about the observation model choice?
When does it matter?
Sensitivity analysis
The choice between Poisson, NegBin, and ABC matters most when:
- Outliers: Extreme observations far from predictions
- Tail behaviour: Rare events (very low or very high counts)
- Overdispersion: Systematic variance > mean in data
- Model misspecification: When the mechanistic model is wrong
We can test sensitivity to outliers:
# Helper function to simulate and return incidence
function simulate_and_summarize(params)
R_0, D_inf, ρ = params
θ = Dict(:R_0 => R_0, :D_inf => D_inf)
traj = simulate_sir(θ, init_test, times)
return ρ .* traj.Inc
end
# Create synthetic data with and without outliers
function test_sensitivity(add_outlier=false)
# True parameters
true_params = [10.0, 2.5, 0.8]
true_data = simulate_and_summarize(true_params)
# Add noise
noisy_data = true_data .+ randn(length(true_data)) .* 2.0
noisy_data = max.(noisy_data, 0.0) # No negative cases
# Add outlier if requested
if add_outlier
noisy_data[30] += 30.0 # Artificial spike
end
return noisy_data
end
# Test with and without outlier
data_clean = test_sensitivity(false)
data_outlier = test_sensitivity(true)
# Plot
plot(times, data_clean, label="Clean data", linewidth=2)
plot!(times, data_outlier, label="Data with outlier", linewidth=2,
xlabel="Time", ylabel="Cases",
title="Effect of Outliers on Observation Data")
Exercise for students: Compare how the Poisson likelihood and ABC with L1 distance respond to the outlier. You can do this visually: simulate from a few parameter sets, compute the Poisson log-likelihood and L1 distance for both data_clean and data_outlier, and see which metric shifts more. If you have time, try a full fit with both methods.
Poisson likelihood is sensitive to outliers because the log-probability drops sharply for counts far from the mean. An outlier of +30 cases when the model predicts 5 contributes a large penalty to the log-likelihood, pulling parameter estimates towards values that accommodate the outlier.
ABC with L1 distance (defined earlier as sum(abs.(sim .- obs))) is more robust because the penalty grows linearly with the discrepancy, not exponentially. A single outlier affects the total distance, but doesn’t dominate as severely.
For likelihood-based inference, the Negative Binomial distribution (covered earlier in this session) offers a middle ground: its heavier tails make it less sensitive to occasional large deviations than Poisson, while still providing a proper likelihood for MCMC.
# Compare how each metric responds to the outlier
true_params = [10.0, 2.5, 0.8]
predicted = simulate_and_summarize(true_params)
# Poisson log-likelihood
function poisson_loglik(predicted, observed)
ll = 0.0
for i in eachindex(observed)
λ = max(predicted[i], 1e-10)
ll += logpdf(Poisson(λ), round(Int, observed[i]))
end
return ll
end
println("Poisson log-likelihood:")
println(" Clean data: ", round(poisson_loglik(predicted, data_clean), digits=2))
println(" Outlier data: ", round(poisson_loglik(predicted, data_outlier), digits=2))
println("\nL1 distance:")
println(" Clean data: ", round(distance_L1(predicted, data_clean), digits=2))
println(" Outlier data: ", round(distance_L1(predicted, data_outlier), digits=2))
# The Poisson log-likelihood drops much more dramatically with the outlier,
# because a count of ~35 when expecting ~5 has near-zero Poisson probability.
# The L1 distance increases by only ~30 (the size of the outlier).
The philosophical question
Is there a “correct” observation model?
No! There are only:
- More or less mechanistically motivated choices
- E.g., modelling test sensitivity explicitly vs using NegBin
- More or less flexible choices
- Poisson (rigid) → NegBin (flexible) → ABC (very flexible)
- More or less computationally convenient choices
- MCMC with parametric likelihood (fast)
- ABC (slow but general)
Practical advice
When to use parametric likelihoods (Poisson/NegBin):
- You have a tractable likelihood function
- You want efficient inference (MCMC with gradients)
- The parametric family is flexible enough
- You’re comfortable with the implicit distance metric
When to use ABC/likelihood-free:
- Likelihood is truly intractable (stochastic models, agent-based models)
- You want to make your distance metric explicit
- You care more about specific summary statistics than full data
- You want robustness to model misspecification
Best practice:
- Try multiple observation models (Poisson, NegBin, ABC with different distances)
- Check if scientific conclusions change
- If they don’t → observation model choice doesn’t matter much
- If they do → you’ve learned something about what features of the data drive inference!
Summary
Observation models are pragmatic choices, not mechanistic truths. Choosing Poisson vs Negative Binomial isn’t choosing between truth and falsehood - it’s choosing different ways to penalise deviations between model predictions and data. The negative log-likelihood is simply a distance function.
ABC makes this explicit by directly specifying distance and threshold. While often called “likelihood-free”, ABC uses an implicit likelihood defined by that distance. The boundary between likelihood-based and likelihood-free inference is more philosophical than technical.
The practical implication: run inference with different observation models. If conclusions change, you’ve learned which data features drive inference. If they don’t, your results are robust.
- Parametric likelihoods are pragmatic choices, not mechanistic truths - we choose Poisson or Negative Binomial for convenience, not because they represent reality
- Negative log-likelihood is a distance function - choosing an observation model is choosing how to measure the fit between predictions and data
- ABC makes distance explicit - but choosing L1 vs L2 vs weighted distances is conceptually similar to choosing a likelihood
- Sensitivity analysis is essential - if conclusions change with different observation models, that reveals which data features drive inference
- Flexibility vs computation: parametric likelihoods are fast (MCMC), ABC is slow but general-purpose
Going further
Summary statistics for ABC
Instead of comparing full time series, ABC often works better with carefully chosen summary statistics. For epidemic data, useful summaries include:
- Total cases: Sum of all observations
- Peak timing: Day with maximum incidence
- Peak height: Maximum daily incidence
- Growth rate: Initial doubling time or exponential growth rate
- Final size: Cumulative attack rate
Here is an example of how to define and use summary statistics for ABC:
"""Extract summary statistics from an epidemic time series."""
function epidemic_summaries(data)
return [
sum(data), # Total cases
argmax(data), # Peak timing (day index)
maximum(data), # Peak height
sum(data[1:min(5, length(data))]) # Early cumulative cases (proxy for growth rate)
]
end
# Distance on summaries rather than full time series
function distance_summaries(sim, obs)
s_sim = epidemic_summaries(sim)
s_obs = epidemic_summaries(obs)
# Normalise each summary by the observed value to make them comparable
return sum(abs.(s_sim .- s_obs) ./ (abs.(s_obs) .+ 1.0))
end
Using fewer, well-chosen summaries can make ABC more robust to noise and model misspecification. However, there is a trade-off: summaries that discard too much information will produce wider posteriors. The choice of which summaries to use is problem-specific and often requires experimentation.
Alternative observation models
The choice of observation model affects inference more than many researchers realise. Consider exploring:
- Gaussian likelihood: Appropriate when counts are large and approximately symmetric
- Zero-inflated models: When surveillance systems miss entire days
- Reporting delay models: When cases are reported with variable lag
Computing the variance/mean ratio in your data can help diagnose whether Poisson (ratio ≈ 1) or Negative Binomial (ratio > 1) is more appropriate.
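A sketch of this diagnostic (the counts below are a made-up stand-in for an observed series; in practice the ratio is most informative when computed on residuals around a fitted mean rather than on the raw trajectory, since the epidemic curve itself inflates the variance):

```julia
using Statistics

# Hypothetical daily case counts (illustrative values only)
counts = [0, 1, 0, 10, 6, 32, 47, 37, 29, 11, 5, 2, 1, 0]

# Dispersion diagnostic: variance/mean ratio
ratio = var(counts) / mean(counts)
println("variance/mean ratio = ", round(ratio, digits=2))
# ratio ≈ 1 suggests Poisson; ratio > 1 suggests overdispersion (Negative Binomial)
```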
Robust distance metrics
For ABC with outliers or model misspecification, consider robust distances:
- Median absolute deviation instead of mean squared error
- Quantile-based distances that down-weight extreme values
- Feature-based distances that focus on epidemiologically meaningful patterns
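Sketches of the first two ideas (the function names and the 10% trimming cutoff are illustrative choices, not from this session):

```julia
using Statistics

# Median absolute deviation: a single extreme residual barely moves the median
distance_mad(sim, obs) = median(abs.(sim .- obs))

# Quantile-based: sum residuals but ignore the worst 10%
function distance_trimmed(sim, obs; q = 0.9)
    r = sort(abs.(sim .- obs))
    return sum(r[1:floor(Int, q * length(r))])
end

obs = fill(10.0, 10)
clean  = fill(11.0, 10)          # uniform error of 1 at every point
spiked = [fill(11.0, 9); 50.0]   # same, plus one large outlier

# Both robust distances are unmoved by the single outlier
@assert distance_mad(clean, obs) == distance_mad(spiked, obs)
@assert distance_trimmed(spiked, obs) == distance_trimmed(clean, obs)
```

Compare with `distance_L1` from earlier in the session, which would increase by the full size of the outlier.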
References
- Toni, T. et al. (2009). Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J. R. Soc. Interface 6(31), 187-202. (The ABC-SMC paper - highly readable introduction)
- Wood, S.N. (2010). Statistical inference for noisy nonlinear ecological dynamic systems. Nature 466, 1102-1104. (Synthetic likelihood approach)
- Marin, J.-M. et al. (2012). Approximate Bayesian computational methods. Statistics and Computing 22, 1167-1180.
- Cranmer, K. et al. (2020). The frontier of simulation-based inference. PNAS 117(48), 30055-30062. (Comprehensive review of modern likelihood-free methods)
Next sessions
The following advanced sessions cover specialised topics:
- Variational Inference: Fast approximate posteriors for real-time applications
- Universal Differential Equations: Learning unknown mechanisms with neural networks