Universal Differential Equations: Learning Unknown Mechanisms

Combining mechanistic models with machine learning

Estimated time: 2-3 hours | Advanced session | Requires: Comfort with ODE-based models (e.g. the Introduction) and basic familiarity with neural networks (what layers and activation functions are)

Introduction

Do this session if you want to discover unknown mechanisms from data using neural networks, or if you have dynamics that simple parametric models can’t capture.

Warning: Coming from R?

UDEs rely on differentiable programming — automatic differentiation through ODE solvers and neural networks — which is not practically available in R. The Julia ecosystem (SciML, Lux.jl) is currently the most mature platform for this kind of work. If you are an R user, this session is still worth reading for the concepts, but the code cannot be straightforwardly translated.

We discuss when UDEs are appropriate (and when simpler methods suffice) in When to use UDEs.

From model comparison to model discovery

In the pMCMC session, we formally compared two mechanistic models for the Tristan da Cunha two-wave outbreak: SEITL and SEIT4L. Both encode the same biological hypothesis - temporary immunity followed by potential reinfection - but with different assumptions about timing. We used LOO-CV to show that SEIT4L (Erlang-distributed immunity) fits better than SEITL (exponential).

But what if we didn’t know the mechanism? What if we were seeing a two-wave pattern for the first time and had no theory to guide us?

We might hypothesise alternative explanations:

  • Behavioural relaxation: People resumed normal contact patterns after the first wave
  • Waning vigilance: Initial precautions (isolation, reduced gatherings) were abandoned
  • Population structure: Different groups (households, age groups) infected at different times

All of these would manifest as time-varying transmission β(t) rather than structural changes to the compartmental model.

Universal Differential Equations (UDEs) combine mechanistic differential equations with neural networks to learn unknown functions from data. In our case, we’ll ask: “What transmission pattern β(t) would a simple SIR model need to explain these dynamics?”

Beyond this, UDEs can discover missing terms in equations which, with careful interpretation, can amount to learning unknown mechanisms (see Going further).

The UDE approach

We keep the known epidemiological structure, but let a neural network learn the unknown parts:

dS/dt = -β(t) * S * I / N   # ← Learn β(t) from data!
dI/dt = +β(t) * S * I / N - γ * I
dR/dt = +γ * I

Replace the unknown β with a neural network that respects constraints:

β(t) = NeuralNetwork([I/N, t])  # Must be positive, bounded

What does “replace β with a neural network” actually mean? Instead of writing β = 0.5 (a fixed number) or even β(t) = β₀ * (1 + sin(t)) (a parametric function we specify), we let the neural network learn the function from data. The network takes inputs (current prevalence, time) and outputs a transmission rate. During training, we adjust the network’s internal weights so that the resulting epidemic trajectory matches observed data. The network is just a flexible function approximator - we’re not assuming any particular form for how β varies, we’re letting the data tell us.
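To make the "flexible function approximator" idea concrete, a one-hidden-layer network of the shape used later in this session can be written out as an explicit function. The weights below are random placeholders, not trained values; training means adjusting them until the resulting trajectory matches data.

```julia
using Random
Random.seed!(1)

# Illustrative network: 2 inputs → 8 tanh units → 1 softplus output
W1 = randn(8, 2); b1 = randn(8)    # first-layer weights and biases
w2 = randn(1, 8); b2 = randn(1)    # output-layer weights and bias

# softplus(x) = log(1 + exp(x)) on the output guarantees β > 0
β_nn(x) = log1p(exp((w2 * tanh.(W1 * x .+ b1) .+ b2)[1]))

β_nn([0.01, 0.2])   # β for prevalence 1% at normalised time 0.2 (always positive)
```

This is all a neural network is here: a parameterised function from (prevalence, time) to a positive transmission rate. Lux wraps the same computation in a more convenient interface.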

Objectives

By the end of this session, you will:

  1. Use UDEs to discover transmission patterns from data
  2. Apply appropriate likelihood functions (Poisson) for count data
  3. Interpret learned β(t) in terms of possible mechanisms
  4. Understand when UDEs are useful vs. when simpler models suffice
  5. Appreciate the challenges of learning from sparse data (~60 points)

Setup

using DifferentialEquations
using Distributions
using DataFrames
using CSV
using Plots
using Random
using LinearAlgebra
using DrWatson
using Lux  # Modern neural network framework (https://lux.csail.mit.edu/)
using ComponentArrays
using Optimization
using OptimizationOptimisers
using Optimisers  # For Adam optimizer

Random.seed!(1234)

Load Tristan da Cunha data

flu_tdc = CSV.read(datadir("flu_tdc_1971.csv"), DataFrame)

scatter(
    flu_tdc.time, flu_tdc.obs,
    xlabel = "Time (days)", ylabel = "Daily incidence",
    label = "Observed cases", markersize = 4, color = :red,
    title = "Tristan da Cunha 1971: Two waves?",
    legend = :topright
)
vspan!([25, 35], alpha=0.2, color=:blue, label="Second wave?")
Important: TdC is a pedagogical example, not an ideal UDE dataset

The Tristan da Cunha dataset has only ~60 data points. As we note in When to use UDEs, UDEs really benefit from hundreds to thousands of observations — for 60 points, simpler approaches such as Gaussian processes or splines are likely more appropriate. We use TdC here because it is familiar from earlier sessions and its two-wave pattern poses an interesting modelling challenge, not because it is a good candidate for UDEs in practice. Keep this limitation in mind when interpreting results.

Note: Skipped step: synthetic data validation

In a real analysis you should always validate a UDE on synthetic data first — generate observations from a known β(t), train the UDE, and check that it recovers the true function. This confirms the method works before you apply it to data where the truth is unknown. We skip this step here for time, but see Going further → Validation with synthetic data for an outline.

Notice the two distinct periods of transmission:

  • First wave (days 4-12): Sharp peak reaching ~47 cases/day
  • Trough (days 13-24): Near-zero transmission
  • Second wave (days 25-35): Smaller resurgence, peak ~12 cases/day

Question: What transmission pattern β(t) would produce this? Can we learn it from the data?

Baseline: SIR with constant β

First, let’s establish what a constant transmission rate gives us:

"""
Standard SIR model with constant transmission.
"""
function sir_constant_beta(du, u, p, t)
    S, I, R = u
    β, γ = p
    N = S + I + R

    du[1] = -β * S * I / N
    du[2] = β * S * I / N - γ * I
    du[3] = γ * I
end

# Literature-informed values for influenza (R_0 ≈ 2.5, D_inf ≈ 2 days)
R_0_guess = 2.5
D_inf_guess = 2.0
γ_guess = 1.0 / D_inf_guess
β_guess = R_0_guess * γ_guess
u0 = [281.0, 3.0, 0.0]  # N = 284 (total population of Tristan da Cunha)
tspan = (0.0, maximum(flu_tdc.time))
times = flu_tdc.time

prob_constant = ODEProblem(sir_constant_beta, u0, tspan, [β_guess, γ_guess])
sol_constant = solve(prob_constant, Tsit5(), saveat=times)

# Extract incidence
S_values = sol_constant[1, :]
incidence_constant = [0.0; -diff(S_values)]

# Plot
plot(times, incidence_constant, label="Constant β", linewidth=2)
scatter!(flu_tdc.time, flu_tdc.obs, label="Observed",
         markersize=4, color=:red,
         xlabel="Time (days)", ylabel="Cases",
         title="Baseline: Can we do better?")

Question: The fit isn’t perfect. Could a time-varying β(t) do better?

UDE approach: learn β(t) with a neural network

Step 1: Define the neural network

We’ll use a small network - remember, we only have 60 data points!

# Network architecture: 2 inputs → 8 hidden → 1 output
# Total parameters: ~33 (keeps training fast)

rng = Random.default_rng()
Random.seed!(rng, 1234)

# Define network structure - small for fast ForwardDiff autodiff
nn_architecture = Chain(
    Dense(2 => 8, tanh),       # Input: [I/N, t/tmax]
    Dense(8 => 1, softplus)     # Output: β > 0 (softplus ensures positivity)
)

# Initialise parameters
p_nn, st_nn = Lux.setup(rng, nn_architecture)
p_nn = ComponentArray(p_nn)  # Convert to ComponentArray for optimization

println("Neural network parameters: ", length(p_nn))
println("Data points: ", length(flu_tdc.obs))
println("Ratio: ", round(length(flu_tdc.obs) / length(p_nn), digits=2), " data points per parameter")
Neural network parameters: 33
Data points: 59
Ratio: 1.79 data points per parameter
  • Chain(...) - combines multiple neural network layers in sequence. Input flows through each layer in order.
  • Dense(in => out, activation) - a fully-connected layer with in inputs, out outputs, and an activation function applied element-wise.
  • tanh - activation function that squashes values to [-1, 1]. Helps learn smooth functions and keeps gradients bounded.
  • softplus - defined as log(1 + exp(x)), ensures outputs are always positive. Smoother than ReLU at zero.
  • Lux.setup(rng, model) - initialises the neural network parameters p and state st. Lux separates parameters from architecture for functional programming.
  • ComponentArray - wraps nested parameter structures into a flat array while preserving named access. Needed for gradient-based optimisation.
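As a quick numerical check of the softplus claims above (plain Julia, no Lux needed):

```julia
# softplus(x) = log(1 + exp(x)); log1p is the numerically stable form
softplus(x) = log1p(exp(x))

softplus(-10.0)   # ≈ 4.54e-5: small but strictly positive
softplus(0.0)     # log(2) ≈ 0.693: smooth at zero, unlike ReLU
softplus(10.0)    # ≈ 10.00005: approaches the identity for large inputs
```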

A few things to note about this architecture (see the code explainer above for definitions of each function):

  • Why 8 hidden units? It’s a balance: more units = more flexibility but more parameters to learn from limited data. With 60 data points, we keep the network small.

  • Why softplus output? It ensures β is always positive — transmission rates can’t be negative.

  • Why 2 inputs? We feed the network both prevalence (I/N) and normalised time (t/tmax). This lets β depend on both “how bad is the epidemic right now” and “what day is it” — capturing both behavioural responses to prevalence and time-dependent changes like weekends or policy changes.

  • Parameter count: The first layer has 2×8 weights + 8 biases = 24 parameters. The second has 8×1 weights + 1 bias = 9 parameters. Total: 33 parameters for 60 data points — roughly 2 data points per parameter. In a standard neural network, this would be a recipe for overfitting, but in a UDE the network output has to flow through an ODE solver and produce a physically plausible epidemic trajectory. The network can’t just memorise noise because arbitrary weight values would produce nonsensical dynamics (Rackauckas et al. 2020).
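The parameter count can be checked by hand:

```julia
# Dense(2 => 8): 2×8 weights + 8 biases; Dense(8 => 1): 8×1 weights + 1 bias
n_layer1 = 2 * 8 + 8    # 24
n_layer2 = 8 * 1 + 1    # 9
n_layer1 + n_layer2     # 33, matching length(p_nn) above
```

Lux also provides `Lux.parameterlength(nn_architecture)` to compute this directly.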

Step 2: Define UDE with learned β(t)

"""
SIR model where β is learned by neural network.
β = NN([I/N, t/tmax]) predicts transmission rate.
"""
function sir_ude!(du, u, p, t)
    S, I, R = u
    N = S + I + R
    γ = p.γ  # γ (and tmax) are part of the optimised parameter vector, so the optimiser adjusts them along with the network weights

    # Neural network predicts β(t, I)
    # Inputs: prevalence (I/N) and normalized time (t/tmax)
    tmax = p.tmax
    nn_input = [I/N, t/tmax]

    # Evaluate neural network
    β_learned, _ = nn_architecture(nn_input, p.nn_params, st_nn)
    β = β_learned[1]  # Extract scalar

    # Standard SIR dynamics with learned β
    du[1] = -β * S * I / N
    du[2] = β * S * I / N - γ * I
    du[3] = γ * I
end

Step 3: Prediction and loss function

"""
Predict incidence using current UDE parameters.
"""
function predict_ude(p_all)
    prob_ude = ODEProblem(sir_ude!, u0, tspan, p_all)
    sol_ude = solve(prob_ude, Tsit5(), saveat=times, dense=false)

    if sol_ude.retcode != :Success
        return fill(NaN, length(times))
    end

    S_pred = sol_ude[1, :]
    incidence_pred = [0.0; -diff(S_pred)]

    return incidence_pred
end

"""
Loss function using Poisson negative log-likelihood.
"""
function loss_ude(p_all, _)
    pred = predict_ude(p_all)

    # Return large loss if ODE solver failed (pred contains NaN)
    any(isnan, pred) && return eltype(pred)(1e10)

    # Poisson negative log-likelihood (clamp predictions to avoid log(negative))
    ε = 1e-6
    obs = flu_tdc.obs
    pred_safe = max.(pred, ε)
    nll = sum(pred_safe .- obs .* log.(pred_safe))

    return nll
end

Why Poisson and not sum of squared errors? As we learned in the observation models session, choosing a likelihood is really choosing a distance metric. SSE treats all residuals equally — a mismatch of 10 cases at the peak (expected ~40) counts the same as at the tail (expected ~2). Poisson NLL instead penalises relative errors, so predicting 0 when 5 cases occurred is heavily penalised. This forces the model to explain the second wave, where counts are small but non-zero.
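To see the difference numerically, compare the penalty each loss assigns to the same absolute error of five cases at the peak versus the tail (the values below are illustrative, not taken from the fit):

```julia
# Per-point losses: squared error, and Poisson NLL dropping the constant log(obs!)
sse(λ, obs) = (λ - obs)^2
nll(λ, obs) = λ - obs * log(λ)

sse(40.0, 45.0), sse(0.5, 5.5)        # identical: 25.0 each
nll(40.0, 45.0) - nll(45.0, 45.0)     # ≈ 0.3, mild excess penalty at the peak
nll(0.5, 5.5) - nll(5.5, 5.5)         # ≈ 8.2, heavy excess penalty at the tail
```

The excess is measured relative to a perfect prediction at the same observation, so only the Poisson loss distinguishes where the error occurs.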

As discussed above, the ODE structure constrains the network — the learned β(t) has to produce a valid epidemic trajectory, which limits what the weights can do.

Important: Set realistic expectations before training

Training UDEs is hard, especially with limited data like ours (60 points, 33 parameters). You should expect:

  • Variability: Different random seeds give different β(t) curves. This reflects genuine uncertainty.
  • Imperfect fits: The SIR structure may not be able to capture two waves regardless of β(t) - this is informative, not failure.
  • Sensitivity to hyperparameters: Results can change with different learning rates or network sizes.

If training produces NaN values, reduce the learning rate. If the loss doesn’t decrease, try different initialisation. See Practical challenges for detailed troubleshooting.

Step 4: Train the UDE

Training means finding network weights that minimise the loss function. This requires computing gradients of the loss with respect to all 33 parameters - a task handled by automatic differentiation (autodiff).

# Combine all parameters into a single object for optimisation
p_all = ComponentArray(
    nn_params = p_nn,
    γ = γ_guess,
    tmax = maximum(times)
)

println("Starting UDE training...")
println("="^60)

# Initial loss (before any training)
println("Initial loss: ", round(loss_ude(p_all, nothing), digits=2))

# Set up optimisation with ForwardDiff for gradient computation
opt_func = OptimizationFunction(loss_ude, Optimization.AutoForwardDiff())
opt_prob = OptimizationProblem(opt_func, p_all, nothing)

# Train using Adam optimiser — store loss history for convergence check
loss_history = Float64[]

println("\nTraining with Adam optimiser (500 iterations)...")
@time result = solve(opt_prob, Optimisers.Adam(0.01), maxiters=500,
                     callback = (state, l) -> begin
                         push!(loss_history, l)
                         if state.iter % 100 == 0
                             println("  Iter $(state.iter): loss = $(round(l, digits=2))")
                         end
                         return false
                     end)
Starting UDE training...
============================================================
Initial loss: 355.68

Training with Adam optimiser (500 iterations)...
  Iter 100: loss = -210.16
  Iter 200: loss = -260.85
  Iter 300: loss = -365.05
  Iter 400: loss = -371.47
  Iter 500: loss = -378.29
  Iter 500: loss = -378.29
  9.569720 seconds (39.15 M allocations: 2.453 GiB, 4.21% gc time, 95.68% compilation time)
retcode: Default
u: ComponentVector{Float64}(nn_params = (layer_1 = (weight = [0.24120001070537606 -0.5404527388496453; 2.8429429836146536 -0.23606380292863371; … ; -0.14478411774948144 -1.1507585973837953; -0.6402420917140612 3.1578061485365705], bias = [-0.24935229079269502, -1.0878469426431727, 0.010790984956251786, 0.4503786015974814, -0.06243318126603811, 0.26019744153309277, -0.6423148683260297, 0.14896707936992867]), layer_2 = (weight = [-0.05427029366831282 -0.39737915693056586 … -0.19455150744150446 -0.7248217344412756], bias = [0.01068361643666583])), γ = 0.03009710907235215, tmax = 54.73828756375623)
  • OptimizationFunction(loss, ad_backend) - wraps a loss function with automatic differentiation for gradient computation.
  • Optimization.AutoForwardDiff() - uses ForwardDiff.jl for forward-mode automatic differentiation. Efficient when the number of parameters is small (~33 here). For larger networks, reverse-mode (Zygote.jl) would be faster.
  • Adam(learning_rate) - adaptive gradient descent optimiser that adjusts per-parameter learning rates and uses momentum.
  • callback - function called after each iteration. Returns true to stop early, false to continue. Useful for monitoring progress.

println("\n" * "="^60)
println("Final loss: ", round(result.objective, digits=2))

p_trained = result.u

============================================================
Final loss: -378.29
ComponentVector{Float64}(nn_params = (layer_1 = (weight = [0.24120001070537606 -0.5404527388496453; 2.8429429836146536 -0.23606380292863371; … ; -0.14478411774948144 -1.1507585973837953; -0.6402420917140612 3.1578061485365705], bias = [-0.24935229079269502, -1.0878469426431727, 0.010790984956251786, 0.4503786015974814, -0.06243318126603811, 0.26019744153309277, -0.6423148683260297, 0.14896707936992867]), layer_2 = (weight = [-0.05427029366831282 -0.39737915693056586 … -0.19455150744150446 -0.7248217344412756], bias = [0.01068361643666583])), γ = 0.03009710907235215, tmax = 54.73828756375623)

Always check that the loss has converged — if it’s still decreasing at the end, run more iterations:

plot(loss_history, xlabel="Iteration", ylabel="Loss",
     label="Training loss", linewidth=2,
     title="Convergence check")

A few things to understand about this training process:

Automatic differentiation computes exact gradients of the loss with respect to parameters. We use ForwardDiff.jl, which is efficient for small networks like ours (33 parameters).

The Adam optimiser is a gradient descent variant that adapts learning rates for each parameter and uses momentum to smooth updates. The learning rate (0.01) controls step size: too large and training becomes unstable; too small and convergence is slow. Adam is widely used for neural network training because it works well across many problems without much tuning.

What to expect: The loss should decrease substantially (often by orders of magnitude). If it doesn’t, something is wrong - perhaps the learning rate is too high (NaN errors) or too low (no progress), or the model structure can’t fit the data.

Step 5: Compare results

# Get UDE predictions
pred_ude = predict_ude(p_trained)

# Plot comparison
plot(times, incidence_constant, label="Constant β", linewidth=2,
     xlabel="Time (days)", ylabel="Cases",
     title="UDE vs Constant β")
plot!(times, pred_ude, label="UDE (learned β)", linewidth=2)
scatter!(flu_tdc.time, flu_tdc.obs, label="Observed",
         markersize=4, color=:red)

Key question: Does the UDE capture the second wave? Look at the fit around days 25-35.

You should see the UDE trajectory (orange/green line) tracking the first-wave peak noticeably better than the constant-β baseline. Whether the second wave (days 25-35) is captured depends on the training run: the UDE may show a small uptick there, but with only 60 data points it is unlikely to reproduce the second wave precisely. If the two curves look almost identical, the network may not have learned anything useful — try increasing the number of iterations or using a different random seed.

Step 6: Visualise learned β(t)

What transmission pattern did the network learn? Since the network takes both prevalence (I/N) and time as inputs, we need to evaluate it along the actual epidemic trajectory — the β(t) the ODE solver used during integration, not at some arbitrary fixed prevalence.

# Solve the trained UDE at high resolution to get the trajectory
prob_trained = ODEProblem(sir_ude!, u0, tspan, p_trained)
sol_trained = solve(prob_trained, Tsit5(), dense=true)

# Evaluate β along the actual trajectory
t_plot = range(0.0, maximum(times), length=100)
N_total = sum(u0)

β_along_trajectory = [begin
    S_t, I_t, R_t = sol_trained(t)
    nn_input = [I_t / N_total, t / maximum(times)]
    β_val, _ = nn_architecture(nn_input, p_trained.nn_params, st_nn)
    β_val[1]
end for t in t_plot]

# Plot with wave markers
plot(t_plot, β_along_trajectory,
     label="Learned β(t)", linewidth=2,
     xlabel="Time (days)", ylabel="β",
     title="Learned Transmission Rate (along trajectory)",
     legend=:topright)
hline!([β_guess], label="Constant β baseline", linestyle=:dash)
vspan!([25, 35], alpha=0.2, color=:blue, label="Second wave period")

In a typical run you should see β(t) starting high (above the constant-β baseline), dropping during the trough around days 12-20, and then rising again around days 20-25 as the network tries to reignite transmission for the second wave. The exact shape varies between runs (see Uncertainty quantification below).

Before interpreting this curve, look back at Step 5. How well does the UDE actually fit the data? The fit may be imperfect. An imperfect fit tells us either that the network needs more capacity, or that the SIR structure itself can’t capture the dynamics. We’ll explore this in the next section.

Interpreting the learned β(t):

If β(t) shows a dip followed by recovery around days 15-25, this suggests the network learned that transmission must vary to explain both waves. This is consistent with behavioural relaxation - initial caution followed by return to normal contact patterns.

If β(t) is roughly constant, the network didn’t find time-varying transmission useful. This might mean:

  • The network is too small (try more hidden units)
  • More training iterations are needed
  • Or genuinely: time-varying β isn’t the right mechanism for this data

Uncertainty quantification in the main analysis

Everything above is a single optimisation run — the learned β(t) is a point estimate with no uncertainty attached. This is a serious limitation for drawing scientific conclusions. A quick way to get a rough sense of uncertainty is to re-run the training from several different random seeds (e.g. 5-10) and overlay the resulting β(t) curves. Where the curves agree, the signal is robust; where they diverge, the data cannot constrain the answer. See Going further → Ensemble uncertainty for more detail.

Practical challenges

Training UDEs is difficult, especially with limited data. Understanding why helps set realistic expectations.

The fundamental tension is between flexibility and data. Our network has 33 parameters and we have 60 data points - roughly 2 points per parameter. In classical statistics, you’d want at least 10-20 observations per parameter for reliable inference. With neural networks embedded in ODEs, the situation is even more constrained because the parameters interact nonlinearly through the differential equation.

Identifiability is another challenge. Many different β(t) curves could produce similar epidemic trajectories. The data doesn’t uniquely determine β(t) — there’s a family of solutions. The ODE structure constrains this family (not every β(t) produces a valid epidemic), but it doesn’t eliminate the ambiguity entirely.

Initialisation matters because optimisation can get stuck in local minima. Different random seeds for the network weights lead to different solutions. Running multiple times and comparing results is good practice.

Numerical stability can be problematic. The combination of ODE solving and gradient computation can produce NaN values if the optimiser takes steps that are too large, or if the ODE becomes stiff (rapidly changing dynamics requiring small time steps).

The deeper issue: structural constraints

If the fit is imperfect even with Poisson loss, this tells us something important: maybe time-varying β isn’t the full story.

Consider what happens during the trough (days 13-21):

  • Cases drop to near-zero (0-5 per day)
  • By this point, S has been substantially depleted by the first wave
  • Even with I > 0, the product β·S·I/N is small because S is small
  • A large β can compensate somewhat, but there is a limit

The neural network can only learn β(t), but the SIR structure itself constrains what’s possible:

  1. S only decreases: Once recovered, people are immune forever — the susceptible pool is depleted by the first wave
  2. The trough problem: With I near zero and S depleted, restarting growth requires implausibly large β
  3. No memory: The model has no way to “store” infection for later release

In the pMCMC session, we saw that the SEIT4L model — which adds an exposed class and Erlang-distributed temporary immunity — explains the two waves through structural changes, not time-varying transmission. That is the more parsimonious explanation for this data.

What other structural changes could help?

  • SEIR model: Exposed period sustains transmission chains during the trough
  • Two-population model: Different mixing groups, delayed introduction
  • Learn γ(t) too: Time-varying recovery might help, though this raises identifiability issues with β(t)

UDEs can only discover mechanisms within the constraints we impose. If we say “it must be SIR with learned β(t)”, we’re already limiting what can be learned.

An imperfect fit tells us something: time-varying transmission alone is insufficient to explain the data, and we may need different model structure.

Common failure modes

Overfitting occurs when the training loss becomes very low but the predictions look unrealistic — perhaps fitting noise or producing wildly oscillating β(t). With UDEs this is less common than in standard neural networks because the ODE structure constrains the output, but it can still happen with very flexible networks. The main solution is reducing network size (fewer hidden units).

Underfitting is the opposite: the UDE behaves almost identically to constant β, meaning the network hasn’t learned anything useful. The loss may still be high. Solutions include increasing network capacity (more hidden units) or running more training iterations.

NaN explosion happens when gradients become infinite, usually because the optimiser took too large a step that pushed the ODE into an unstable region. You’ll see NaN values in the loss. Solutions include reducing the learning rate (0.01 to 0.001), using gradient clipping, or switching to a more robust ODE solver that handles stiff equations.
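One of these mitigations, gradient clipping, can be sketched by composing rules in Optimisers.jl (the clipping threshold and learning rate here are illustrative, not tuned values):

```julia
using Optimisers

# Clip the gradient norm before the Adam update: a common fix for NaN blow-ups
opt = OptimiserChain(ClipNorm(1.0), Adam(1e-3))
# Use in place of Optimisers.Adam(0.01) in the solve(...) call above
```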

Sensitivity to initialisation

Different random initialisations of the network weights can lead to different local minima and different learned β(t) curves. Let’s check how much this matters.

# Train from 3 different random seeds and compare learned β(t)
plot(title="Learned β(t) across random seeds",
     xlabel="Time (days)", ylabel="β", legend=:topright)

for seed in [1234, 5678, 9012]
    rng_test = Random.default_rng()
    Random.seed!(rng_test, seed)
    p_nn_test, _ = Lux.setup(rng_test, nn_architecture)
    p_nn_test = ComponentArray(p_nn_test)

    p_test = ComponentArray(
        nn_params = p_nn_test,
        γ = γ_guess,
        tmax = maximum(times)
    )

    opt_func_test = OptimizationFunction(loss_ude, Optimization.AutoForwardDiff())
    opt_prob_test = OptimizationProblem(opt_func_test, p_test, nothing)
    result_test = solve(opt_prob_test, Optimisers.Adam(0.01), maxiters=500,
                        callback = (state, l) -> false)

    # Extract the learned β(t) along the trajectory for comparison
    p_test_trained = result_test.u
    prob_test = ODEProblem(sir_ude!, u0, tspan, p_test_trained)
    sol_test = solve(prob_test, Tsit5(), dense=true)
    β_test = [begin
        S_t, I_t, R_t = sol_test(t)
        nn_input = [I_t / sum(u0), t / maximum(times)]
        β_val, _ = nn_architecture(nn_input, p_test_trained.nn_params, st_nn)
        β_val[1]
    end for t in t_plot]

    plot!(t_plot, β_test, label="Seed $seed (loss=$(round(result_test.objective, digits=1)))",
          linewidth=2)
end

Compare not just the final loss values but the shape of the β(t) curves. Similar losses do not guarantee similar β(t) — two different transmission patterns can produce comparably good fits. Where the curves agree, you can be more confident in the result; where they diverge, the data cannot distinguish between explanations.

Exercises

Tip: Exercise 1: Experiment with hyperparameters

Try changing one thing at a time and observe the effect on the learned β(t) curve and the final loss:

  1. Hidden units: Change Dense(2 => 8, tanh) to Dense(2 => 4, tanh) (smaller) or Dense(2 => 16, tanh) (larger). How does the number of parameters change? Does the fit improve or does overfitting become visible?
  2. Learning rate: Replace Adam(0.01) with Adam(0.001) or Adam(0.05). What happens to training stability and convergence speed?
  3. Random seed: Change Random.seed!(1234) to a different value and re-run. How much does the learned β(t) vary between seeds? What does this tell you about identifiability?

For each change, re-run from Step 1 through Step 6 and compare the β(t) plot to your original result.

Tip: Exercise 2: Interpret the structural constraint

In the section on structural constraints, we argue that an SIR model with learned β(t) struggles to produce two waves. Can you explain, in your own words, why even a very large β during the trough is insufficient — considering both the depleted susceptible pool and near-zero prevalence? What structural change to the model would you propose, and how does this relate to the SEIT4L model from the pMCMC session?

When to use UDEs

UDEs aren’t always appropriate. Here’s when they add value.

Use UDEs when:

  • You have mechanistic knowledge about structure (e.g., compartmental dynamics) but unknown functional forms (how transmission varies)
  • You have moderate-to-large datasets - ideally hundreds to thousands of data points
  • You’re doing exploratory analysis to discover potential mechanisms
  • You’re willing to experiment with hyperparameters and validate results carefully

Don’t use UDEs when:

  • Simple parametric models fit the data adequately (prefer parsimony)
  • Your dataset is very small (under 50 points)
  • You need interpretable parameters for publication or policy
  • Computational resources are limited - training can be expensive
  • You need real-time inference - for a single dataset, MCMC may actually be faster

Comparison with other approaches

| Method                       | Flexibility | Interpretability     | Data needs         | Computational cost |
|------------------------------|-------------|----------------------|--------------------|--------------------|
| Constant β                   | Low         | High                 | Low (10-50 points) | Very low           |
| Time-varying β with splines  | Medium      | Medium               | Medium (50-200)    | Low                |
| GP for β(t)                  | Medium      | Medium (but with UQ) | Medium (50-200)    | Medium             |
| UDE                          | High        | Low                  | High (100-1000+)   | High               |

For TdC data (60 points): GPs or splines are likely more appropriate than UDEs.

Where UDEs have been applied

Real-data applications of UDEs are still relatively few. In epidemiology, Kuwahara & Bauch (2024) used a UDE to predict COVID-19 second waves from case and mobility data, Schmid et al. (2026) combined a UDE-based SEIR model with wastewater surveillance to project case counts, and Rojas-Campos et al. (2023) demonstrated the UDE → SINDy pipeline on simulated regional transmission data. Outside epidemiology, UDEs have been applied to glucose-insulin dynamics (de Rooij et al. 2025) and battery degradation modelling (Kuzhiyil et al. 2024). Philipps et al. (2025) review the broader state of UDEs in systems biology and note that most published work still uses synthetic data.

Going further

Network architecture choices

The neural network architecture affects what patterns can be learned:

  • Smaller networks (e.g., 4→1) may underfit complex dynamics but are more interpretable
  • Larger networks (e.g., 16→8→1) can capture more complexity but may overfit with limited data despite the ODE constraint
  • Input features matter: β(t) learns time-dependent patterns, β(I/N) learns prevalence-dependent patterns, β(I/N, t) can learn both

Try different architectures to understand what your data can support.

Ensemble uncertainty

See Uncertainty quantification in the main analysis for the basic idea of training from multiple random seeds. For a more rigorous version, train 10-20 runs and compute pointwise confidence bands on β(t) from the ensemble. This is particularly important for scientific claims about mechanisms.
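The band computation itself is straightforward once the trained curves are collected into a matrix; here random placeholders stand in for real training runs:

```julia
using Statistics, Random
Random.seed!(1)

n_runs, n_times = 20, 100
β_ensemble = rand(n_runs, n_times)   # placeholder: row i = β(t) from training run i

# Pointwise 90% band and median across runs at each time point
β_lo  = [quantile(β_ensemble[:, j], 0.05) for j in 1:n_times]
β_hi  = [quantile(β_ensemble[:, j], 0.95) for j in 1:n_times]
β_med = [median(β_ensemble[:, j]) for j in 1:n_times]
# plot(t_plot, β_med, ribbon = (β_med .- β_lo, β_hi .- β_med)) would draw the band
```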

Validation with synthetic data

Before trusting UDE discoveries on real data, test on synthetic data where you know the truth:

  1. Generate data with a known β(t), e.g., β(t) = β₀ × (1 + 0.3 sin(2πt/20))
  2. Train UDE on this synthetic data
  3. Compare learned β(t) to the true function
  4. If the UDE can’t recover known patterns, be cautious about interpreting patterns from real data
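A minimal sketch of the data-generation step, reusing the SIR setup from this session (β₀ = 1.25, γ = 0.5, and the population values are illustrative):

```julia
using DifferentialEquations, Distributions, Random
Random.seed!(42)

β_true(t) = 1.25 * (1 + 0.3 * sin(2π * t / 20))   # known ground truth to recover
γ_true = 0.5

function sir_synth!(du, u, p, t)
    S, I, R = u
    N = S + I + R
    du[1] = -β_true(t) * S * I / N
    du[2] =  β_true(t) * S * I / N - γ_true * I
    du[3] =  γ_true * I
end

sol = solve(ODEProblem(sir_synth!, [281.0, 3.0, 0.0], (0.0, 59.0)), Tsit5(), saveat=1.0)
incidence = max.([0.0; -diff(sol[1, :])], 0.0)     # daily incidence from S depletion
obs_synth = rand.(Poisson.(incidence .+ 1e-6))     # Poisson-noised daily counts
# Train the UDE on obs_synth, then compare the learned β(t) against β_true
```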

Symbolic regression: interpretable discovery (optional)

Note

This section introduces a related but distinct topic — symbolic regression — that extends the UDE workflow. It is included for interest and can be safely skipped.

A major limitation of neural network UDEs is interpretability - you get a function that works, but you can’t write down what it is. Symbolic regression addresses this by searching for equations rather than neural network weights.

The idea: instead of learning that β(t) is “some neural network”, discover that β(t) = β₀(1 + α·sin(ωt)) or β(t) = β₀·exp(-κI/N). The output is a human-readable formula that can be interpreted, published, and debated.

Two main approaches exist:

SINDy (Sparse Identification of Nonlinear Dynamics) builds a library of candidate terms (polynomials, trigonometric functions, etc.) and uses sparse regression to find which terms are needed. It’s fast and produces interpretable equations, but limited to the function library you define. In Julia, DataDrivenDiffEq.jl provides SINDy implementations.

Genetic programming / symbolic regression evolves equations by combining mathematical operations, selecting those that fit well. Tools like PySR (accessible from Julia via PythonCall) or SymbolicRegression.jl can discover surprising functional forms you wouldn’t have thought to include in a library.

The workflow often combines both approaches: first use a neural network UDE to confirm that a time-varying or state-dependent term improves fit, then apply symbolic regression to find an interpretable equation for that term.

For policy applications, the difference matters in principle. A hypothetical finding like “transmission increases when prevalence drops below 5%” would be actionable; “transmission follows neural network with weights [0.23, -0.41, …]” is not. In practice, recovering clean symbolic expressions from noisy epidemic data is difficult, but symbolic UDEs offer a path from discovery towards interpretation.

For more on this approach, see the SciML tutorial on missing physics and Philipps et al. (2025) for a recent review of UDEs in systems biology.

Summary

Universal Differential Equations combine mechanistic structure (SIR compartments) with neural networks that learn unknown components (how β varies).

Loss function choice matters. Poisson likelihood forces the model to explain small counts in the second wave; squared errors would be dominated by the first-wave peak. The ODE structure constrains the network — it can only learn β(t) curves that produce valid epidemic trajectories.

If the learned β(t) shows an increase around days 20-25, this is consistent with behavioural relaxation — but it’s a hypothesis needing independent validation. An imperfect fit may indicate that SIR structure with time-varying β isn’t sufficient.

Tip: Learning points
  • UDEs combine mechanistic and machine learning: Keep known structure (SIR compartments), learn unknown parts (how β varies)
  • The loss function matters: Poisson likelihood forces the model to explain small counts; SSE is dominated by peak values
  • The ODE constrains the network: The mechanistic structure limits what the network can learn, preventing it from memorising noise
  • A single training run is a point estimate: Without ensemble runs or Bayesian treatment, there is no uncertainty quantification — interpret any single β(t) curve with caution
  • Learned patterns suggest mechanisms but don’t prove them - UDEs generate hypotheses needing independent validation
  • Data requirements are high: UDEs need hundreds of data points to work well; for small datasets, simpler approaches may be better

Next steps

This session and the variational inference session round out the course’s coverage of inference methods. Together with the MCMC, particle filter and pMCMC sessions, you now have a range of approaches for fitting mechanistic models to data, and some sense of when each is appropriate.

References