Observation Models and Distance Metrics

Understanding the bridge between likelihood-based and likelihood-free inference

Estimated time: 2-2.5 hours | Advanced session | Requires: MCMC

Introduction

Do this session if you want to understand why we use Poisson or Negative Binomial likelihoods (not just how), or you’re curious about Approximate Bayesian Computation (ABC) as an alternative to likelihood-based inference.

So far in this course, we’ve used likelihood functions to connect our mathematical models to observed data. For example, we’ve assumed that observed case counts follow a Poisson distribution:

\[\text{obs}_t \sim \text{Poisson}(\rho \cdot I_t)\]

But this raises important questions:

Why Poisson? Is this mechanistically justified?
What if we use Negative Binomial instead? How much does it matter?
Are we doing statistics or just curve fitting?

In this session, we’ll explore these questions and discover that the distinction between “likelihood-based” and “likelihood-free” inference is less sharp than it initially appears.

Objectives

In this session you will:

Critically evaluate observation model choices (Poisson, Negative Binomial, etc.)
Understand negative log-likelihood as an implicit distance metric
Implement Approximate Bayesian Computation (ABC) as an explicit alternative
Compare likelihood-based and distance-based inference approaches
Make informed choices about observation models for your own research

Setup

"100"

Source file

The source file of this session is located at sessions/observation_models.qmd.

Load libraries

using DifferentialEquations
using Distributions
using DataFrames
using CSV
using Plots
using StatsPlots
using Turing
using MCMCChains
using Random
using LinearAlgebra
using DrWatson
using StatsBase  # For Weights in ABC-SMC

Initialisation

We set a random seed for reproducibility.

Random.seed!(1234)

TaskLocalRNG()

Load Tristan da Cunha data

We’ll use the same influenza outbreak data from 1971:

# Load data
flu_tdc = CSV.read(datadir("flu_tdc_1971.csv"), DataFrame)

# Plot
scatter(
    flu_tdc.time, flu_tdc.obs,
    xlabel = "Time (days)", ylabel = "Daily incidence",
    label = "Observed cases", markersize = 4, color = :red,
    title = "Tristan da Cunha 1971 Influenza Outbreak",
    legend = :topright
)

We have 59 daily observations of case counts.

Simple SIR model for comparison

For this session, let’s use a simpler SIR model to make the concepts clearer:

"""
Simple SIR model for pedagogical purposes.
"""
function simulate_sir(θ, init_state, times)
    R_0, D_inf = θ[:R_0], θ[:D_inf]

    function sir_ode!(du, u, p, t)
        S, I, R = u
        β, γ = p
        N = S + I + R

        du[1] = -β * S * I / N
        du[2] = β * S * I / N - γ * I
        du[3] = γ * I
    end

    β = R_0 / D_inf
    γ = 1.0 / D_inf
    u0 = [init_state[:S], init_state[:I], init_state[:R]]
    prob = ODEProblem(sir_ode!, u0, (times[1], times[end]), [β, γ])
    sol = solve(prob, Tsit5(), saveat=times)

    # Extract incidence (new infections per day)
    S_values = sol[1, :]
    incidence = [0.0; -diff(S_values)]

    return DataFrame(time=times, S=S_values,
                     I=sol[2, :],
                     R=sol[3, :],
                     Inc=incidence)
end

Main.Notebook.simulate_sir

# Try some parameters
times = flu_tdc.time
θ_test = Dict(:R_0 => 10.0, :D_inf => 2.5)
init_test = Dict(:S => 270.0, :I => 3.0, :R => 0.0)

traj_test = simulate_sir(θ_test, init_test, times)

# Plot
plot(traj_test.time, traj_test.Inc,
     label="Predicted incidence", linewidth=2, color=:blue)
scatter!(flu_tdc.time, flu_tdc.obs,
         label="Observed cases", markersize=4, color=:red,
         xlabel="Time (days)", ylabel="Cases",
         title="SIR Model vs Observed Data")

Not a perfect fit, but good enough for illustration!

In your own work, you should run prior and posterior predictive checks when fitting models. We skip them here to focus on observation model choices.

Approach 1: Poisson Likelihood (Traditional)

The standard approach

This is what we’ve been doing throughout the course:

@model function sir_poisson(times, init_state, n_obs)
    # Priors
    R_0 ~ Uniform(1.0, 20.0)
    D_inf ~ Uniform(1.0, 10.0)
    ρ ~ Uniform(0.1, 1.0)  # Reporting rate

    # Simulate
    θ = Dict(:R_0 => R_0, :D_inf => D_inf)
    traj = simulate_sir(θ, init_state, times)

    # Likelihood: Poisson observation model
    lambdas = [max(ρ * traj.Inc[i], 1e-10) for i in 1:n_obs]
    obs ~ arraydist(Poisson.(lambdas))

    return traj
end

sir_poisson (generic function with 2 methods)

Reminder: arraydist and | conditioning

obs ~ arraydist(Poisson.(lambdas)) declares a vector of independent Poisson observations, one per time point. The | operator, applied when creating the model instance below, conditions the model on the actual data. See the MCMC session for a full explanation of this pattern.

We fit this model:

# Fit with NUTS
model_poisson = sir_poisson(times, init_test, length(flu_tdc.obs)) | (; obs = flu_tdc.obs)
chain_poisson = sample(model_poisson, NUTS(0.65), 500, progress=false)
# The 0.65 is the target acceptance rate for NUTS — higher values (closer to 1)
# make the sampler more conservative, lower values allow larger steps.
# The default 0.65 works well for most models.

┌ Info: Found initial step size
└   ϵ = 0.2

Chains MCMC chain (500×17×1 Array{Float64, 3}):

Iterations        = 251:1:750
Number of chains  = 1
Samples per chain = 500
Wall duration     = 10.76 seconds
Compute duration  = 10.76 seconds
parameters        = R_0, D_inf, ρ
internals         = n_steps, is_accept, acceptance_rate, log_density, hamiltonian_energy, hamiltonian_energy_error, max_hamiltonian_energy_error, tree_depth, numerical_error, step_size, nom_step_size, logprior, loglikelihood, logjoint

Use `describe(chains)` for summary statistics and quantiles.

# Plot posteriors
plot(chain_poisson[[:R_0, :D_inf, :ρ]], size=(900, 600))

# Summary statistics
summarystats(chain_poisson)

Summary Statistics

  parameters      mean       std      mcse   ess_bulk   ess_tail      rhat   e ⋯
      Symbol   Float64   Float64   Float64    Float64    Float64   Float64     ⋯

         R_0    1.7559    0.1210    0.0099   176.8710   165.2348    0.9999     ⋯
       D_inf    3.2724    0.3351    0.0268   182.8498   189.3899    1.0015     ⋯
           ρ    0.9927    0.0066    0.0004   230.9641   206.4396    0.9994     ⋯

                                                                1 column omitted

What does Poisson assume?

The Poisson distribution has a key property: variance = mean

# Illustrate Poisson assumption
λ_values = [5, 20, 50]
p = plot(title="Poisson Distribution: Variance = Mean",
         xlabel="Count", ylabel="Probability")

for λ in λ_values
    dist = Poisson(λ)
    x_range = max(0, λ-20):min(λ+20, λ*2)
    plot!(p, x_range, [pdf(dist, x) for x in x_range],
          label="λ=$λ (σ²=$λ)", linewidth=2)
end
p

Question: Is this assumption justified for epidemic data?

Under-reporting might be systematic (not random)
Behavioural changes might create clustering
Surveillance quality might vary over time

Approach 2: Negative Binomial Likelihood (Pragmatic)

Adding overdispersion

When Poisson doesn’t fit well (variance > mean), we often use Negative Binomial:

\[\text{obs}_t \sim \text{NegativeBinomial}(\mu_t, \phi)\]

where:

\(\mu_t = \rho \cdot I_t\) (mean)
\(\phi\) controls overdispersion (variance = \(\mu + \mu^2/\phi\))

@model function sir_negbin(times, init_state, n_obs)
    # Priors
    R_0 ~ Uniform(1.0, 20.0)
    D_inf ~ Uniform(1.0, 10.0)
    ρ ~ Uniform(0.1, 1.0)
    φ ~ Exponential(10.0)  # Overdispersion parameter

    # Simulate
    θ = Dict(:R_0 => R_0, :D_inf => D_inf)
    traj = simulate_sir(θ, init_state, times)

    # Likelihood: Negative Binomial observation model
    mus = [max(ρ * traj.Inc[i], 1e-10) for i in 1:n_obs]
    # NegativeBinomial(r, p) where r=φ, p=φ/(φ+μ)
    obs ~ arraydist([NegativeBinomial(φ, φ/(φ+μ)) for μ in mus])

    return traj
end

sir_negbin (generic function with 2 methods)

Code explainer

NegativeBinomial(r, p) - Julia’s parameterisation uses r (number of successes) and p (success probability). To get mean μ with overdispersion φ, use r=φ and p=φ/(φ+μ). This gives variance = μ + μ²/φ, so larger φ approaches Poisson, smaller φ gives more overdispersion.

Honest question: Why Negative Binomial?

Mechanistic justification: Weak (no clear biological process)
Empirical justification: Fits better than Poisson
Convenience: Easy to compute, works with MCMC

Is this really different from choosing a distance metric?

Interlude: Likelihood as Distance

Likelihood as distance

The negative log-likelihood is actually a distance function between observed and predicted data:

\[\text{Distance} = -\log p(\text{obs} | \text{predicted})\]

We can see what distance each likelihood implies:

# Compare different observation models as distance functions
obs_val = 10.0
predicted_range = 1.0:0.5:20.0

# Poisson: -log P(obs | λ)
poisson_distance = [-logpdf(Poisson(λ), obs_val) for λ in predicted_range]

# Negative Binomial with different φ (convert to r,p parametrization)
# NegativeBinomial(r, p) where r=φ, p=φ/(φ+μ)
negbin_distance_phi5 = [-logpdf(NegativeBinomial(5.0, 5.0/(5.0+λ)), obs_val) for λ in predicted_range]
negbin_distance_phi20 = [-logpdf(NegativeBinomial(20.0, 20.0/(20.0+λ)), obs_val) for λ in predicted_range]

# Gaussian (for comparison)
gaussian_distance = [-logpdf(Normal(λ, 3.0), obs_val) for λ in predicted_range]

# Plot
plot(predicted_range, poisson_distance,
     label="Poisson", linewidth=2,
     xlabel="Predicted value", ylabel="-log(likelihood)",
     title="Different Likelihoods as Distance Functions (obs=10)",
     legend=:topright)
plot!(predicted_range, negbin_distance_phi5,
      label="NegBin(φ=5)", linewidth=2)
plot!(predicted_range, negbin_distance_phi20,
      label="NegBin(φ=20)", linewidth=2)
plot!(predicted_range, gaussian_distance,
      label="Gaussian(σ=3)", linewidth=2, linestyle=:dash)

This reveals that different likelihoods are just different ways of measuring distance:

Poisson: Penalises deviations inversely proportional to predicted value
Negative Binomial: Similar but allows more flexibility (overdispersion)
Gaussian: Constant penalty regardless of scale

Approach 3: Approximate Bayesian Computation

ABC: Making the distance explicit

Instead of choosing a parametric likelihood, ABC directly chooses:

Summary statistics: What features of the data matter?
Distance metric: How do we measure similarity?
Threshold: How close is “close enough”?

Here is a simple ABC rejection sampler from scratch:

"""
Simple ABC rejection sampler.
Samples from prior, simulates data, accepts if distance < threshold.
"""
function abc_rejection(prior_sampler, simulator, distance_fn, observed_data,
                      threshold, n_samples)
    samples = []
    n_attempts = 0
    max_attempts = n_samples * 1000  # Give up after many attempts

    while length(samples) < n_samples && n_attempts < max_attempts
        n_attempts += 1

        # Sample from prior
        θ = prior_sampler()

        # Simulate data
        simulated_data = simulator(θ)

        # Compute distance
        d = distance_fn(simulated_data, observed_data)

        # Accept if close enough
        if d < threshold
            push!(samples, θ)
        end
    end

    acceptance_rate = length(samples) / n_attempts
    println("Acceptance rate: $(round(acceptance_rate * 100, digits=2))%")
    println("Attempts needed: $n_attempts")

    return samples
end

Main.Notebook.abc_rejection

Now let’s define the components:

# Prior sampler
function sample_from_prior()
    R_0 = rand(Uniform(1.0, 20.0))
    D_inf = rand(Uniform(1.0, 10.0))
    ρ = rand(Uniform(0.1, 1.0))
    return [R_0, D_inf, ρ]
end

# Simulator
function simulate_sir_abc(params)
    R_0, D_inf, ρ = params
    θ = Dict(:R_0 => R_0, :D_inf => D_inf)
    traj = simulate_sir(θ, init_test, times)
    return ρ .* traj.Inc
end

# Distance functions
distance_L1(sim, obs) = sum(abs.(sim .- obs))

function distance_L2(sim, obs)
    return sqrt(sum((sim .- obs).^2))
end

function distance_weighted(sim, obs)
    # Weight by 1/observation (similar to Poisson variance structure)
    weights = 1.0 ./ (obs .+ 1.0)
    return sum(weights .* abs.(sim .- obs))
end

# Observed data
observed_data = flu_tdc.obs

59-element Vector{Int64}:
  0
  1
  0
 10
  6
 32
 47
 37
 29
 11
  ⋮
  2
  0
  0
  0
  0
  0
  0
  0
  1

Understanding the distance functions

We’ve defined three distance functions, each with different properties:

L1 (Manhattan): sum(|sim - obs|) - Sums absolute deviations. Robust to outliers because large errors contribute linearly. A mismatch of 20 cases counts twice as much as a mismatch of 10.
L2 (Euclidean): sqrt(sum((sim - obs)²)) - Root of squared deviations. Penalises large errors more heavily than L1. A mismatch of 20 cases counts four times as much as 10 (because 20² = 400 vs 10² = 100).
Weighted: sum(w × |sim - obs|) where w = 1/(obs + 1) - Weights inversely by observation size. This mimics Poisson variance scaling: a mismatch of 5 when obs=2 is penalised more than when obs=50. Useful when you care about relative rather than absolute fit.

When to use each:

L1: Good default. Robust to occasional large mismatches.
L2: When large deviations are particularly bad (e.g., outbreak peaks).
Weighted: When fit quality should be scale-independent (similar to Poisson likelihood).

Run ABC with L1 distance:

Choosing the threshold

ABC rejection sampling is sensitive to the threshold choice:

Too strict (e.g., 100): Long wait times, few or zero accepted samples
Too loose (e.g., 1000): Fast, but approximate posterior is close to the prior
Just right (e.g., 200-400): Reasonable acceptance rate with meaningful posterior concentration

We start with a relaxed threshold (300) here.

If you get 0 samples: Increase the threshold (try 400, then 500). The sampler may be struggling to find parameter combinations that produce simulations close to the data. This can happen due to random variation - ABC is inherently stochastic.

# Start with a relaxed threshold to get some samples
threshold = 300.0  # Relaxed for faster acceptance

println("Running ABC rejection sampler...")
abc_samples = abc_rejection(
    sample_from_prior,
    simulate_sir_abc,
    distance_L1,
    observed_data,
    threshold,
    200  # Target number of samples
)

println("\nObtained $(length(abc_samples)) samples")

Running ABC rejection sampler...
Acceptance rate: 29.99%
Attempts needed: 667

Obtained 200 samples

# Extract parameters
if length(abc_samples) > 0
    abc_df = DataFrame(
        R_0 = [s[1] for s in abc_samples],
        D_inf = [s[2] for s in abc_samples],
        ρ = [s[3] for s in abc_samples]
    )

    # Plot ABC results
    histogram(abc_df.R_0, bins=30, alpha=0.6,
              label="ABC (L1 distance)",
              xlabel="R_0", ylabel="Frequency",
              title="ABC Posterior Distribution for R_0")
else
    println("No samples accepted - threshold too strict!")
end

Compare distance metrics

Different distance functions give different results:

# Test on a candidate parameter set
test_params = [10.0, 2.5, 0.8]
test_sim = simulate_sir_abc(test_params)

println("Distance metrics for test parameters:")
println("  L1: ", round(distance_L1(test_sim, observed_data), digits=2))
println("  L2: ", round(distance_L2(test_sim, observed_data), digits=2))
println("  Weighted: ", round(distance_weighted(test_sim, observed_data), digits=2))

Distance metrics for test parameters:
  L1: 502.82
  L2: 168.89
  Weighted: 197.64

Notice that choosing a distance metric in ABC is philosophically similar to choosing a parametric likelihood.

ABC-SMC: Sequential Monte Carlo ABC

The rejection sampler above is inefficient - most samples from the prior are rejected, wasting computation. ABC-SMC (Sequential Monte Carlo ABC) dramatically improves efficiency by gradually tightening the acceptance threshold across multiple generations.

The key idea, introduced by Toni et al. (2009), is:

Start permissive: Begin with a loose threshold that accepts many samples
Gradually tighten: Reduce the threshold across generations
Reuse good samples: Instead of sampling from the prior each time, perturb samples that passed the previous threshold

This is much more efficient because particles that are already “close” to the data are refined, rather than searching blindly from the prior each time. Typical speedups are 20-50× compared to rejection sampling.

Implementation: ABC-SMC (click to expand)

This is a simplified ABC-SMC for pedagogical purposes. Production implementations (like ApproxBayes.jl or GpABC.jl) have adaptive thresholds, better perturbation kernels, and importance weighting.

Note on Threads.@threads: The code below uses Threads.@threads to parallelise the particle loop — this macro distributes iterations across available CPU threads, so each particle can be simulated concurrently. For this to have any effect, Julia must be started with multiple threads (e.g., julia -t auto). See the threading note in MCMC Diagnostics for setup instructions. Without multiple threads, the loop runs sequentially (slower but still correct).

"""
Simple ABC-SMC implementation.
Uses sequence of decreasing thresholds with resampling.
"""
function abc_smc(prior_sampler, simulator, distance_fn, observed_data,
                thresholds, n_particles)

    particles = []
    weights = ones(n_particles) ./ n_particles

    for (t_idx, threshold) in enumerate(thresholds)
        println("\nGeneration $t_idx: threshold = $threshold")
        new_particles = Vector{Vector{Float64}}(undef, n_particles)

        if t_idx == 1
            # First generation: sample from prior (parallelised)
            Threads.@threads for i in 1:n_particles
                while true
                    θ = prior_sampler()
                    sim = simulator(θ)
                    d = distance_fn(sim, observed_data)
                    if d < threshold
                        new_particles[i] = θ
                        break
                    end
                end
            end
        else
            # Later generations: perturb from previous (parallelised)
            Threads.@threads for i in 1:n_particles
                while true
                    # Sample from previous generation (weighted)
                    idx = sample(1:length(particles), Weights(weights))
                    θ_old = particles[idx]

                    # Perturb (simple Gaussian perturbation)
                    θ_new = θ_old .+ randn(length(θ_old)) .* 0.1 .* θ_old

                    # Check bounds (reject if outside prior)
                    if θ_new[1] < 1.0 || θ_new[1] > 20.0 continue end
                    if θ_new[2] < 1.0 || θ_new[2] > 10.0 continue end
                    if θ_new[3] < 0.1 || θ_new[3] > 1.0 continue end

                    # Simulate and check distance
                    sim = simulator(θ_new)
                    d = distance_fn(sim, observed_data)
                    if d < threshold
                        new_particles[i] = θ_new
                        break
                    end
                end
            end
        end

        particles = collect(new_particles)
        weights = ones(n_particles) ./ n_particles  # Uniform weights (simplified)

        println("  Mean R_0: ", round(mean([p[1] for p in particles]), digits=2))
    end

    return particles
end

Main.Notebook.abc_smc

# Run ABC-SMC with decreasing thresholds
println("Running ABC-SMC...")
thresholds = [400.0, 300.0, 250.0]  # Decreasing threshold sequence

abc_smc_samples = abc_smc(
    sample_from_prior,
    simulate_sir_abc,
    distance_L1,
    observed_data,
    thresholds,
    100  # Number of particles
)

abc_smc_df = DataFrame(
    R_0 = [s[1] for s in abc_smc_samples],
    D_inf = [s[2] for s in abc_smc_samples],
    ρ = [s[3] for s in abc_smc_samples]
)

Running ABC-SMC...

Generation 1: threshold = 400.0
  Mean R_0: 9.44

Generation 2: threshold = 300.0
  Mean R_0: 5.08

Generation 3: threshold = 250.0
  Mean R_0: 5.57

100×3 DataFrame

75 rows omitted

Row	R_0	D_inf	ρ
	Float64	Float64	Float64
1	5.56939	6.17676	0.322194
2	5.26552	5.20008	0.630262
3	4.53991	4.53327	0.360797
4	4.1971	5.52499	0.518967
5	5.37009	5.69769	0.290655
6	4.24305	5.19638	0.915497
7	7.94109	8.05817	0.91903
8	4.02683	5.13532	0.864365
9	3.90321	4.28606	0.934857
10	5.56445	5.02015	0.777142
11	3.38997	3.46473	0.433042
12	6.5741	5.99164	0.53854
13	6.98606	6.79887	0.595985
⋮	⋮	⋮	⋮
89	4.33629	4.33125	0.934397
90	5.50104	5.70189	0.649426
91	5.50928	5.29157	0.94151
92	5.84604	7.47906	0.760527
93	4.16062	4.79352	0.819816
94	5.86297	6.40993	0.862277
95	3.58934	5.10214	0.718976
96	4.37408	6.47969	0.368322
97	4.77611	4.67202	0.776128
98	8.50027	8.30674	0.65449
99	4.86046	6.51144	0.688229
100	7.23906	8.57346	0.92686

Comparing rejection ABC and ABC-SMC

The following compares the two approaches on our problem:

# Compare efficiency: how many simulations were needed?
println("Comparison: Rejection ABC vs ABC-SMC")
println("="^50)

# Rejection ABC stats (from earlier run)
rejection_samples = length(abc_samples)
rejection_acceptance = rejection_samples / (rejection_samples * 1000 / 100)  # Rough estimate

# ABC-SMC: n_particles × n_generations simulations (plus rejections within each)
smc_generations = length(thresholds)
smc_particles = 100

println("Rejection ABC:")
println("  Samples obtained: $rejection_samples")
println("  Threshold used: 300.0")

println("\nABC-SMC:")
println("  Samples obtained: $(length(abc_smc_samples))")
println("  Final threshold: $(thresholds[end])")
println("  Generations: $smc_generations")

Comparison: Rejection ABC vs ABC-SMC
==================================================
Rejection ABC:
  Samples obtained: 200
  Threshold used: 300.0

ABC-SMC:
  Samples obtained: 100
  Final threshold: 250.0
  Generations: 3

# Compare posterior quality
println("\nPosterior summaries:")
println("="^50)

println("Rejection ABC (threshold=300):")
println("  R_0:   mean=$(round(mean(abc_df.R_0), digits=2)), std=$(round(std(abc_df.R_0), digits=2))")
println("  D_inf: mean=$(round(mean(abc_df.D_inf), digits=2)), std=$(round(std(abc_df.D_inf), digits=2))")

println("\nABC-SMC (final threshold=250):")
println("  R_0:   mean=$(round(mean(abc_smc_df.R_0), digits=2)), std=$(round(std(abc_smc_df.R_0), digits=2))")
println("  D_inf: mean=$(round(mean(abc_smc_df.D_inf), digits=2)), std=$(round(std(abc_smc_df.D_inf), digits=2))")


Posterior summaries:
==================================================
Rejection ABC (threshold=300):
  R_0:   mean=5.01, std=2.74
  D_inf: mean=6.76, std=2.13

ABC-SMC (final threshold=250):
  R_0:   mean=5.57, std=1.57
  D_inf: mean=6.36, std=1.73

Plotting code

# Visual comparison
p1 = histogram(abc_df.R_0, bins=20, alpha=0.6, normalize=:pdf,
               label="Rejection (ε=300)", xlabel="R_0", ylabel="Density")
histogram!(p1, abc_smc_df.R_0, bins=20, alpha=0.6, normalize=:pdf,
           label="ABC-SMC (ε=250)")

p2 = histogram(abc_df.D_inf, bins=20, alpha=0.6, normalize=:pdf,
               label="Rejection (ε=300)", xlabel="D_inf", ylabel="Density")
histogram!(p2, abc_smc_df.D_inf, bins=20, alpha=0.6, normalize=:pdf,
           label="ABC-SMC (ε=250)")

plot(p1, p2, layout=(1, 2), size=(900, 400),
     title=["R_0: Rejection vs SMC" "D_inf: Rejection vs SMC"])

We see that ABC-SMC achieves a tighter threshold (250 vs 300) with fewer wasted simulations. The resulting posteriors are more concentrated - ABC-SMC has effectively “zoomed in” on the high-probability region. This efficiency gain becomes even more pronounced in higher dimensions, where rejection sampling becomes impractical.

Assessing ABC posterior quality

Unlike MCMC, ABC does not produce chains, so standard diagnostics like ESS (effective sample size) and R-hat do not apply. Instead, assess ABC posterior quality by:

Acceptance rate: Very low rates (< 0.1%) suggest the threshold is too strict or the model is a poor fit. Very high rates (> 10%) suggest the threshold is too loose and the posterior is close to the prior.
Visual comparison with likelihood-based posteriors: When a likelihood is available (as in this session), run both methods and compare. Large discrepancies suggest either the ABC threshold is too loose or the distance metric captures different features than the likelihood.
Threshold sensitivity: Run ABC with several thresholds. If the posterior changes substantially as you tighten the threshold, it has not yet converged — you need a stricter threshold (and more computation).
Coverage of prior: If the ABC posterior looks very similar to the prior, the threshold is too loose and the data are not informing the inference.
Replicate runs: Run ABC multiple times with different random seeds. If posteriors differ substantially between runs, you need more particles or a less strict threshold.

The big picture: Likelihood vs ABC

Now let’s step back and compare all our approaches - parametric likelihoods and ABC methods - to see what’s fundamentally different (and what’s surprisingly similar).

Summary of approaches

Approach	What you specify	What’s arbitrary	Computational cost
Poisson likelihood	Parametric distribution	Why Poisson?	Fast (MCMC)
NegBin likelihood	Parametric + overdispersion	Why NegBin?	Fast (MCMC)
ABC (L1)	Distance metric	Why L1?	Slow (many simulations)
ABC (weighted)	Distance + weights	Why these weights?	Slow (many simulations)

Do they give similar results?

Comparing the posteriors:

# Extract MCMC results
mcmc_df = DataFrame(chain_poisson)

# Plot comparison
p1 = histogram(mcmc_df.R_0, bins=30, alpha=0.5,
               label="Poisson", normalize=:pdf)
histogram!(p1, abc_df.R_0, bins=30, alpha=0.5,
           label="ABC (L1)", normalize=:pdf,
           xlabel="R_0", ylabel="Density",
           title="Posterior Comparison: R_0")

p2 = histogram(mcmc_df.D_inf, bins=30, alpha=0.5,
               label="Poisson", normalize=:pdf)
histogram!(p2, abc_df.D_inf, bins=30, alpha=0.5,
           label="ABC (L1)", normalize=:pdf,
           xlabel="D_inf", ylabel="Density",
           title="Posterior Comparison: D_inf")

plot(p1, p2, layout=(1, 2), size=(900, 400))

How to interpret these plots: Look for whether the posterior distributions overlap substantially. If the MCMC (Poisson) and ABC (L1) posteriors cover a similar range and peak in similar places, the two approaches agree — meaning the choice of observation model has little effect on inference for this dataset. If one posterior is much wider, shifted, or multimodal compared to the other, the observation model is influencing conclusions. A useful rule of thumb: if the posterior means are within one standard deviation of each other and the bulk of the distributions overlap, the methods are giving consistent answers.

Discussion: Are the posteriors similar or different? What does this tell us about the observation model choice?

When does it matter?

Sensitivity analysis

The choice between Poisson, NegBin, and ABC matters most when:

Outliers: Extreme observations far from predictions
Tail behaviour: Rare events (very low or very high counts)
Overdispersion: Systematic variance > mean in data
Model misspecification: When the mechanistic model is wrong

We can test sensitivity to outliers:

# Helper function to simulate and return incidence
function simulate_and_summarize(params)
    R_0, D_inf, ρ = params
    θ = Dict(:R_0 => R_0, :D_inf => D_inf)
    traj = simulate_sir(θ, init_test, times)
    return ρ .* traj.Inc
end

# Create synthetic data with and without outliers
function test_sensitivity(add_outlier=false)
    # True parameters
    true_params = [10.0, 2.5, 0.8]
    true_data = simulate_and_summarize(true_params)

    # Add noise
    noisy_data = true_data .+ randn(length(true_data)) .* 2.0
    noisy_data = max.(noisy_data, 0.0)  # No negative cases

    # Add outlier if requested
    if add_outlier
        noisy_data[30] += 30.0  # Artificial spike
    end

    return noisy_data
end

# Test with and without outlier
data_clean = test_sensitivity(false)
data_outlier = test_sensitivity(true)

# Plot
plot(times, data_clean, label="Clean data", linewidth=2)
plot!(times, data_outlier, label="Data with outlier", linewidth=2,
      xlabel="Time", ylabel="Cases",
      title="Effect of Outliers on Observation Data")

Exercise for students: Compare how the Poisson likelihood and ABC with L1 distance respond to the outlier. You can do this visually: simulate from a few parameter sets, compute the Poisson log-likelihood and L1 distance for both data_clean and data_outlier, and see which metric shifts more. If you have time, try a full fit with both methods.

Hint: Outlier robustness

Poisson likelihood is sensitive to outliers because the log-probability drops sharply for counts far from the mean. An outlier of +30 cases when the model predicts 5 contributes a large penalty to the log-likelihood, pulling parameter estimates towards values that accommodate the outlier.

ABC with L1 distance (defined earlier as sum(abs.(sim .- obs))) is more robust because the penalty grows linearly with the discrepancy, not exponentially. A single outlier affects the total distance, but doesn’t dominate as severely.

For likelihood-based inference, the Negative Binomial distribution (covered earlier in this session) offers a middle ground: its heavier tails make it less sensitive to occasional large deviations than Poisson, while still providing a proper likelihood for MCMC.

Solution

# Compare how each metric responds to the outlier
true_params = [10.0, 2.5, 0.8]
predicted = simulate_and_summarize(true_params)

# Poisson log-likelihood
function poisson_loglik(predicted, observed)
    ll = 0.0
    for i in eachindex(observed)
        λ = max(predicted[i], 1e-10)
        ll += logpdf(Poisson(λ), round(Int, observed[i]))
    end
    return ll
end

println("Poisson log-likelihood:")
println("  Clean data:   ", round(poisson_loglik(predicted, data_clean), digits=2))
println("  Outlier data: ", round(poisson_loglik(predicted, data_outlier), digits=2))

println("\nL1 distance:")
println("  Clean data:   ", round(distance_L1(predicted, data_clean), digits=2))
println("  Outlier data: ", round(distance_L1(predicted, data_outlier), digits=2))

# The Poisson log-likelihood drops much more dramatically with the outlier,
# because a count of ~35 when expecting ~5 has near-zero Poisson probability.
# The L1 distance increases by only ~30 (the size of the outlier).

The philosophical question

Is there a “correct” observation model?

No! There are only:

More or less mechanistically motivated choices
- E.g., modelling test sensitivity explicitly vs using NegBin
More or less flexible choices
- Poisson (rigid) → NegBin (flexible) → ABC (very flexible)
More or less computationally convenient choices
- MCMC with parametric likelihood (fast)
- ABC (slow but general)

Practical advice

When to use parametric likelihoods (Poisson/NegBin):

You have a tractable likelihood function
You want efficient inference (MCMC with gradients)
The parametric family is flexible enough
You’re comfortable with the implicit distance metric

When to use ABC/likelihood-free:

Likelihood is truly intractable (stochastic models, agent-based models)
You want to make your distance metric explicit
You care more about specific summary statistics than full data
You want robustness to model misspecification

Best practice:

Try multiple observation models (Poisson, NegBin, ABC with different distances)
Check if scientific conclusions change
If they don’t → observation model choice doesn’t matter much
If they do → you’ve learned something about what features of the data drive inference!

Learning points

Parametric likelihoods are pragmatic choices, not mechanistic truths - we choose Poisson or Negative Binomial for convenience, not because they represent reality
Negative log-likelihood is a distance function - choosing an observation model is choosing how to measure the fit between predictions and data
ABC makes distance explicit - but choosing L1 vs L2 vs weighted distances is conceptually similar to choosing a likelihood
Sensitivity analysis is essential - if conclusions change with different observation models, that reveals which data features drive inference
Flexibility vs computation: parametric likelihoods are fast (MCMC), ABC is slow but general-purpose

Going further

Summary statistics for ABC

Instead of comparing full time series, ABC often works better with carefully chosen summary statistics. For epidemic data, useful summaries include:

Total cases: Sum of all observations
Peak timing: Day with maximum incidence
Peak height: Maximum daily incidence
Growth rate: Initial doubling time or exponential growth rate
Final size: Cumulative attack rate

Here is an example of how to define and use summary statistics for ABC:

"""Extract summary statistics from an epidemic time series."""
function epidemic_summaries(data)
    return [
        sum(data),                          # Total cases
        argmax(data),                       # Peak timing (day index)
        maximum(data),                      # Peak height
        sum(data[1:min(5, length(data))])   # Early cumulative cases (proxy for growth rate)
    ]
end

# Distance on summaries rather than full time series
function distance_summaries(sim, obs)
    s_sim = epidemic_summaries(sim)
    s_obs = epidemic_summaries(obs)
    # Normalise each summary by the observed value to make them comparable
    return sum(abs.(s_sim .- s_obs) ./ (abs.(s_obs) .+ 1.0))
end

Using fewer, well-chosen summaries can make ABC more robust to noise and model misspecification. However, there is a trade-off: summaries that discard too much information will produce wider posteriors. The choice of which summaries to use is problem-specific and often requires experimentation.

Alternative observation models

The choice of observation model affects inference more than many researchers realise. Consider exploring:

Gaussian likelihood: Appropriate when counts are large and approximately symmetric
Zero-inflated models: When surveillance systems miss entire days
Reporting delay models: When cases are reported with variable lag

Computing the variance/mean ratio in your data can help diagnose whether Poisson (ratio ≈ 1) or Negative Binomial (ratio > 1) is more appropriate.

Robust distance metrics

For ABC with outliers or model misspecification, consider robust distances:

Median absolute deviation instead of mean squared error
Quantile-based distances that down-weight extreme values
Feature-based distances that focus on epidemiologically meaningful patterns

References

Toni, T. et al. (2009). Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J. R. Soc. Interface 6(31), 187-202. (The ABC-SMC paper - highly readable introduction)
Wood, S.N. (2010). Statistical inference for noisy nonlinear ecological dynamic systems. Nature 466, 1102-1104. (Synthetic likelihood approach)
Marin, J.-M. et al. (2012). Approximate Bayesian computational methods. Statistics and Computing 22, 1167-1180.
Cranmer, K. et al. (2020). The frontier of simulation-based inference. PNAS 117(48), 30055-30062. (Comprehensive review of modern likelihood-free methods)

Next sessions

The following advanced sessions cover specialised topics:

Variational Inference: Fast approximate posteriors for real-time applications
Universal Differential Equations: Learning unknown mechanisms with neural networks