Combining mechanistic models with machine learning
Estimated time: 2-3 hours | Advanced session | Requires: Comfort with ODE-based models (e.g. the Introduction) and basic familiarity with neural networks (what layers and activation functions are)
Introduction
Do this session if you want to discover unknown mechanisms from data using neural networks, or you have dynamics that simple parametric models can’t capture. Assumes basic familiarity with neural networks.
Warning: Coming from R?
UDEs rely on differentiable programming — automatic differentiation through ODE solvers and neural networks — which is not practically available in R. The Julia ecosystem (SciML, Lux.jl) is currently the most mature platform for this kind of work. If you are an R user, this session is still worth reading for the concepts, but the code cannot be straightforwardly translated.
We discuss when UDEs are appropriate (and when simpler methods suffice) in When to use UDEs.
From model comparison to model discovery
In the pMCMC session, we formally compared two mechanistic models for the Tristan da Cunha two-wave outbreak: SEITL and SEIT4L. Both encode the same biological hypothesis - temporary immunity followed by potential reinfection - but with different assumptions about timing. We used LOO-CV to show that SEIT4L (Erlang-distributed immunity) fits better than SEITL (exponential).
But what if we didn’t know the mechanism? What if we were seeing a two-wave pattern for the first time and had no theory to guide us?
We might hypothesise alternative explanations:
Behavioural relaxation: People resumed normal contact patterns after the first wave
Waning vigilance: Initial precautions (isolation, reduced gatherings) were abandoned
Population structure: Different groups (households, age groups) infected at different times
All of these would manifest as time-varying transmission β(t) rather than structural changes to the compartmental model.
Universal Differential Equations (UDEs) combine mechanistic differential equations with neural networks to learn unknown functions from data. In our case, we’ll ask: “What transmission pattern β(t) would a simple SIR model need to explain these dynamics?”
Beyond this, UDEs can discover missing terms in equations which, with careful interpretation, can amount to learning unknown mechanisms (see Going further).
The UDE approach
We keep the known epidemiological structure, but let a neural network learn the unknown parts:
dS/dt = -β(t) * S * I / N # ← Learn β(t) from data!
dI/dt = +β(t) * S * I / N - γ * I
dR/dt = +γ * I
Replace the unknown β with a neural network that respects constraints:
β(t) = NeuralNetwork([I/N, t]) # Must be positive, bounded
What does “replace β with a neural network” actually mean? Instead of writing β = 0.5 (a fixed number) or even β(t) = β₀ * (1 + sin(t)) (a parametric function we specify), we let the neural network learn the function from data. The network takes inputs (current prevalence, time) and outputs a transmission rate. During training, we adjust the network’s internal weights so that the resulting epidemic trajectory matches observed data. The network is just a flexible function approximator - we’re not assuming any particular form for how β varies, we’re letting the data tell us.
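To make "the network is just a function approximator" concrete, here is a hand-rolled network of the same shape used in this session (2 inputs, 8 tanh units, 1 softplus output). The weights are arbitrary illustrative values, not trained parameters, and this is a sketch rather than the Lux model defined below:

```julia
# A hand-written network of the same shape as the one used in this session:
# 2 inputs -> 8 tanh hidden units -> 1 softplus output.
# Weights are arbitrary illustrative values, not trained parameters.
softplus(x) = log1p(exp(x))

function beta_nn(prevalence, t_scaled, W1, b1, W2, b2)
    h = tanh.(W1 * [prevalence, t_scaled] .+ b1)  # hidden layer
    return softplus(only(W2 * h .+ b2))           # softplus guarantees β > 0
end

W1 = 0.1 .* randn(8, 2); b1 = zeros(8)
W2 = 0.1 .* randn(1, 8); b2 = zeros(1)

# Whatever the weights, the output is a positive transmission rate
β = beta_nn(0.05, 0.3, W1, b1, W2, b2)
println(β > 0)  # true
```

Training adjusts W1, b1, W2 and b2 so that the epidemic trajectory implied by this β matches the data; the functional form itself is never written down.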
Objectives
By the end of this session, you will:
Use UDEs to discover transmission patterns from data
Apply appropriate likelihood functions (Poisson) for count data
Interpret learned β(t) in terms of possible mechanisms
Understand when UDEs are useful vs. when simpler models suffice
Appreciate the challenges of learning from sparse data (~60 points)
Setup
using DifferentialEquations
using Distributions
using DataFrames
using CSV
using Plots
using Random
using LinearAlgebra
using DrWatson
using Lux               # Modern neural network framework (https://lux.csail.mit.edu/)
using ComponentArrays
using Optimization
using OptimizationOptimisers
using Optimisers        # For Adam optimizer

Random.seed!(1234)
TaskLocalRNG()
Load Tristan da Cunha data
flu_tdc = CSV.read(datadir("flu_tdc_1971.csv"), DataFrame)

scatter(
    flu_tdc.time, flu_tdc.obs,
    xlabel = "Time (days)",
    ylabel = "Daily incidence",
    label = "Observed cases",
    markersize = 4,
    color = :red,
    title = "Tristan da Cunha 1971: Two waves?",
    legend = :topright
)
vspan!([25, 35], alpha=0.2, color=:blue, label="Second wave?")
Important: TdC is a pedagogical example, not an ideal UDE dataset
The Tristan da Cunha dataset has only ~60 data points. As we note in When to use UDEs, UDEs really benefit from hundreds to thousands of observations — for 60 points, simpler approaches such as Gaussian processes or splines are likely more appropriate. We use TdC here because it is familiar from earlier sessions and its two-wave pattern poses an interesting modelling challenge, not because it is a good candidate for UDEs in practice. Keep this limitation in mind when interpreting results.
Note: Skipped step: synthetic data validation
In a real analysis you should always validate a UDE on synthetic data first — generate observations from a known β(t), train the UDE, and check that it recovers the true function. This confirms the method works before you apply it to data where the truth is unknown. We skip this step here for time, but see Going further → Validation with synthetic data for an outline.
Notice the two distinct periods of transmission:
First wave (days 4-12): Sharp peak reaching ~47 cases/day
Trough (days 13-24): Near-zero transmission
Second wave (days 25-35): Smaller resurgence, peak ~12 cases/day
Question: What transmission pattern β(t) would produce this? Can we learn it from the data?
Baseline: SIR with constant β
First, let’s establish what a constant transmission rate gives us:
"""Standard SIR model with constant transmission."""
function sir_constant_beta(du, u, p, t)
    S, I, R = u
    β, γ = p
    N = S + I + R
    du[1] = -β * S * I / N
    du[2] = β * S * I / N - γ * I
    du[3] = γ * I
end

# Literature-informed values for influenza (R_0 ≈ 2.5, D_inf ≈ 2 days)
R_0_guess = 2.5
D_inf_guess = 2.0
γ_guess = 1.0 / D_inf_guess
β_guess = R_0_guess * γ_guess

u0 = [281.0, 3.0, 0.0]  # N = 284 (total population of Tristan da Cunha)
tspan = (0.0, maximum(flu_tdc.time))
times = flu_tdc.time

prob_constant = ODEProblem(sir_constant_beta, u0, tspan, [β_guess, γ_guess])
sol_constant = solve(prob_constant, Tsit5(), saveat=times)

# Extract incidence
S_values = sol_constant[1, :]
incidence_constant = [0.0; -diff(S_values)]

# Plot
plot(times, incidence_constant, label="Constant β", linewidth=2)
scatter!(flu_tdc.time, flu_tdc.obs, label="Observed", markersize=4, color=:red,
         xlabel="Time (days)", ylabel="Cases", title="Baseline: Can we do better?")
Question: The fit isn’t perfect. Could a time-varying β(t) do better?
UDE approach: learn β(t) with a neural network
Step 1: Define the neural network
We’ll use a small network - remember, we only have 60 data points!
# Network architecture: 2 inputs → 8 hidden → 1 output
# Total parameters: ~33 (keeps training fast)
rng = Random.default_rng()
Random.seed!(rng, 1234)

# Define network structure - small for fast ForwardDiff autodiff
nn_architecture = Chain(
    Dense(2 => 8, tanh),     # Input: [I/N, t/tmax]
    Dense(8 => 1, softplus)  # Output: β > 0 (softplus ensures positivity)
)

# Initialise parameters
p_nn, st_nn = Lux.setup(rng, nn_architecture)
p_nn = ComponentArray(p_nn)  # Convert to ComponentArray for optimization

println("Neural network parameters: ", length(p_nn))
println("Data points: ", length(flu_tdc.obs))
println("Ratio: ", round(length(flu_tdc.obs) / length(p_nn), digits=2),
        " data points per parameter")
Neural network parameters: 33
Data points: 59
Ratio: 1.79 data points per parameter
Tip: Code explainer
Chain(...) - combines multiple neural network layers in sequence. Input flows through each layer in order.
Dense(in => out, activation) - a fully-connected layer with in inputs, out outputs, and an activation function applied element-wise.
tanh - activation function that squashes values into (-1, 1). Helps learn smooth functions and keeps gradients bounded.
softplus - defined as log(1 + exp(x)), ensures outputs are always positive. Smoother than ReLU at zero.
Lux.setup(rng, model) - initialises the neural network parameters p and state st. Lux separates parameters from architecture for functional programming.
ComponentArray - wraps nested parameter structures into a flat array while preserving named access. Needed for gradient-based optimisation.
A few things to note about this architecture (expand the code explainer above for definitions of each function):
Why 8 hidden units? It’s a balance: more units = more flexibility but more parameters to learn from limited data. With 60 data points, we keep the network small.
Why softplus output? It ensures β is always positive — transmission rates can’t be negative.
Why 2 inputs? We feed the network both prevalence (I/N) and normalised time (t/tmax). This lets β depend on both “how bad is the epidemic right now” and “what day is it” — capturing both behavioural responses to prevalence and time-dependent changes like weekends or policy changes.
Parameter count: The first layer has 2×8 weights + 8 biases = 24 parameters. The second has 8×1 weights + 1 bias = 9 parameters. Total: 33 parameters for 60 data points — roughly 2 data points per parameter. In a standard neural network, this would be a recipe for overfitting, but in a UDE the network output has to flow through an ODE solver and produce a physically plausible epidemic trajectory. The network can’t just memorise noise because arbitrary weight values would produce nonsensical dynamics (Rackauckas et al. 2020).
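As a quick sanity check, the parameter count can be reproduced from the layer dimensions alone:

```julia
# A Dense(in => out) layer has in*out weights plus out biases
n_params(nin, nout) = nin * nout + nout

layer1 = n_params(2, 8)   # 2×8 weights + 8 biases = 24
layer2 = n_params(8, 1)   # 8×1 weights + 1 bias  = 9
println(layer1 + layer2)  # 33, matching length(p_nn) above
```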
Step 2: Define UDE with learned β(t)
"""
SIR model where β is learned by a neural network.
β = NN([I/N, t/tmax]) predicts the transmission rate.
"""
function sir_ude!(du, u, p, t)
    S, I, R = u
    N = S + I + R
    γ = p.γ  # Fixed here for simplicity; in practice you'd estimate γ jointly

    # Neural network predicts β(t, I)
    # Inputs: prevalence (I/N) and normalized time (t/tmax)
    tmax = p.tmax
    nn_input = [I / N, t / tmax]

    # Evaluate neural network
    β_learned, _ = nn_architecture(nn_input, p.nn_params, st_nn)
    β = β_learned[1]  # Extract scalar

    # Standard SIR dynamics with learned β
    du[1] = -β * S * I / N
    du[2] = β * S * I / N - γ * I
    du[3] = γ * I
end
Main.Notebook.sir_ude!
Step 3: Prediction and loss function
"""Predict incidence using current UDE parameters."""
function predict_ude(p_all)
    prob_ude = ODEProblem(sir_ude!, u0, tspan, p_all)
    sol_ude = solve(prob_ude, Tsit5(), saveat=times, dense=false)
    if sol_ude.retcode != :Success
        return fill(NaN, length(times))
    end
    S_pred = sol_ude[1, :]
    incidence_pred = [0.0; -diff(S_pred)]
    return incidence_pred
end

"""Loss function using Poisson negative log-likelihood."""
function loss_ude(p_all, _)
    pred = predict_ude(p_all)

    # Return large loss if ODE solver failed (pred contains NaN)
    any(isnan, pred) && return eltype(pred)(1e10)

    # Poisson negative log-likelihood (clamp predictions to avoid log of non-positive values)
    ε = 1e-6
    obs = flu_tdc.obs
    pred_safe = max.(pred, ε)
    nll = sum(pred_safe .- obs .* log.(pred_safe))
    return nll
end
Main.Notebook.loss_ude
Why Poisson and not sum of squared errors? As we learned in the observation models session, choosing a likelihood is really choosing a distance metric. SSE treats all residuals equally — a mismatch of 10 cases at the peak (expected ~40) counts the same as at the tail (expected ~2). Poisson NLL instead penalises relative errors, so predicting 0 when 5 cases occurred is heavily penalised. This forces the model to explain the second wave, where counts are small but non-zero.
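A quick numerical illustration of this point, using hypothetical predicted and observed counts (the exact numbers are made up; only the contrast matters). The same absolute error of 5 cases costs far more under the Poisson NLL when the expected count is small:

```julia
# Per-observation Poisson negative log-likelihood, dropping the log(obs!)
# term, which does not depend on the prediction
pois_nll(pred, obs) = pred - obs * log(pred)

# Same absolute error at the peak: predict 35 when 40 cases observed
peak_penalty = pois_nll(35.0, 40.0) - pois_nll(40.0, 40.0)

# ...versus predicting ~0 in the tail when 5 cases occurred
tail_penalty = pois_nll(1e-3, 5.0) - pois_nll(5.0, 5.0)

# Under squared error both mismatches cost the same: 5^2 = 25
println((round(peak_penalty, digits=2), round(tail_penalty, digits=2)))  # (0.34, 37.59)
```

Predicting zero where the second wave has small but non-zero counts is roughly two orders of magnitude more costly than the same miss at the peak, which is exactly the pressure that forces the model to explain the second wave.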
As discussed above, the ODE structure constrains the network — the learned β(t) has to produce a valid epidemic trajectory, which limits what the weights can do.
Important: Set realistic expectations before training
Training UDEs is hard, especially with limited data like ours (60 points, 33 parameters). You should expect:
Variability: Different random seeds give different β(t) curves. This reflects genuine uncertainty.
Imperfect fits: The SIR structure may not be able to capture two waves regardless of β(t) - this is informative, not failure.
Sensitivity to hyperparameters: Results can change with different learning rates or network sizes.
If training produces NaN values, reduce the learning rate. If the loss doesn’t decrease, try different initialisation. See Practical challenges for detailed troubleshooting.
Step 4: Train the UDE
Training means finding network weights that minimise the loss function. This requires computing gradients of the loss with respect to all 33 parameters - a task handled by automatic differentiation (autodiff).
# Combine all parameters into a single object for optimisation
p_all = ComponentArray(
    nn_params = p_nn,
    γ = γ_guess,
    tmax = maximum(times)
)

println("Starting UDE training...")
println("="^60)

# Initial loss (before any training)
println("Initial loss: ", round(loss_ude(p_all, nothing), digits=2))

# Set up optimisation with ForwardDiff for gradient computation
opt_func = OptimizationFunction(loss_ude, Optimization.AutoForwardDiff())
opt_prob = OptimizationProblem(opt_func, p_all, nothing)

# Train using Adam optimiser — store loss history for convergence check
loss_history = Float64[]

println("\nTraining with Adam optimiser (500 iterations)...")
@time result = solve(opt_prob, Optimisers.Adam(0.01), maxiters=500,
    callback = (state, l) -> begin
        push!(loss_history, l)
        if state.iter % 100 == 0
            println("  Iter $(state.iter): loss = $(round(l, digits=2))")
        end
        return false
    end)
Starting UDE training...
============================================================
Initial loss: 355.68
Training with Adam optimiser (500 iterations)...
Iter 100: loss = -210.16
Iter 200: loss = -260.85
Iter 300: loss = -365.05
Iter 400: loss = -371.47
Iter 500: loss = -378.29
9.569720 seconds (39.15 M allocations: 2.453 GiB, 4.21% gc time, 95.68% compilation time)
Tip: Code explainer
Optimization.AutoForwardDiff() - uses ForwardDiff.jl for forward-mode automatic differentiation. Efficient when the number of parameters is small (~33 here). For larger networks, reverse-mode (Zygote.jl) would be faster.
Adam(learning_rate) - adaptive gradient descent optimiser that adjusts per-parameter learning rates and uses momentum.
callback - function called after each iteration. Returns true to stop early, false to continue. Useful for monitoring progress.
A few things to understand about this training process:
Automatic differentiation computes exact gradients of the loss with respect to parameters. We use ForwardDiff.jl, which is efficient for small networks like ours (33 parameters).
The Adam optimiser is a gradient descent variant that adapts learning rates for each parameter and uses momentum to smooth updates. The learning rate (0.01) controls step size: too large and training becomes unstable; too small and convergence is slow. Adam is widely used for neural network training because it works well across many problems without much tuning.
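Adam's update rule can be seen in miniature on a one-parameter problem. This is a from-scratch sketch of Adam minimising (x - 3)², not the Optimisers.jl implementation:

```julia
# Minimal Adam: momentum (m), adaptive scaling (v), bias correction
function adam_demo(; x = 0.0, η = 0.05, β1 = 0.9, β2 = 0.999, ϵ = 1e-8, steps = 2000)
    m, v = 0.0, 0.0
    for t in 1:steps
        g = 2 * (x - 3)              # gradient of (x - 3)^2
        m = β1 * m + (1 - β1) * g    # first moment: smoothed gradient
        v = β2 * v + (1 - β2) * g^2  # second moment: smoothed squared gradient
        m̂ = m / (1 - β1^t)           # bias corrections for the zero initialisation
        v̂ = v / (1 - β2^t)
        x -= η * m̂ / (sqrt(v̂) + ϵ)  # step size adapts per parameter
    end
    return x
end

println(adam_demo())  # converges close to the minimiser x = 3
```

The learning rate η here plays the same role as the 0.01 passed to Optimisers.Adam in the training cell.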
What to expect: The loss should decrease substantially (often by orders of magnitude). If it doesn’t, something is wrong - perhaps the learning rate is too high (NaN errors) or too low (no progress), or the model structure can’t fit the data.
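Step 5: Compare UDE fit to data

Plot the trained UDE's predicted incidence against the observations and the constant-β baseline. This is a minimal sketch continuing from the objects defined above; `result.u` holds the trained parameters returned by the optimiser:

```julia
# Extract the trained parameters and predict incidence with them
p_trained = result.u
incidence_ude = predict_ude(p_trained)

plot(times, incidence_constant, label="Constant β", linewidth=2, linestyle=:dash)
plot!(times, incidence_ude, label="UDE (learned β)", linewidth=2)
scatter!(flu_tdc.time, flu_tdc.obs, label="Observed", markersize=4, color=:red,
         xlabel="Time (days)", ylabel="Cases", title="UDE vs constant-β baseline")
```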
Key question: Does the UDE capture the second wave? Look at the fit around days 25-35.
You should see the UDE trajectory (orange/green line) tracking the first-wave peak noticeably better than the constant-β baseline. Whether the second wave (days 25-35) is captured depends on the training run: the UDE may show a small uptick there, but with only 60 data points it is unlikely to reproduce the second wave precisely. If the two curves look almost identical, the network may not have learned anything useful — try increasing the number of iterations or using a different random seed.
Step 6: Visualise learned β(t)
What transmission pattern did the network learn? Since the network takes both prevalence (I/N) and time as inputs, we need to evaluate it along the actual epidemic trajectory — the β(t) the ODE solver used during integration, not at some arbitrary fixed prevalence.
# Trained parameters from the optimisation result
p_trained = result.u

# Solve the trained UDE at high resolution to get the trajectory
prob_trained = ODEProblem(sir_ude!, u0, tspan, p_trained)
sol_trained = solve(prob_trained, Tsit5(), dense=true)

# Evaluate β along the actual trajectory
t_plot = range(0.0, maximum(times), length=100)
N_total = sum(u0)

β_along_trajectory = [begin
    S_t, I_t, R_t = sol_trained(t)
    nn_input = [I_t / N_total, t / maximum(times)]
    β_val, _ = nn_architecture(nn_input, p_trained.nn_params, st_nn)
    β_val[1]
end for t in t_plot]

# Plot with wave markers
plot(t_plot, β_along_trajectory, label="Learned β(t)", linewidth=2,
     xlabel="Time (days)", ylabel="β",
     title="Learned Transmission Rate (along trajectory)", legend=:topright)
hline!([β_guess], label="Constant β baseline", linestyle=:dash)
vspan!([25, 35], alpha=0.2, color=:blue, label="Second wave period")
In a typical run you should see β(t) starting high (above the constant-β baseline), dropping during the trough around days 12-20, and then rising again around days 20-25 as the network tries to reignite transmission for the second wave. The exact shape varies between runs (see Uncertainty quantification below).
Before interpreting this curve, look back at Step 5. How well does the UDE actually fit the data? The fit may be imperfect. An imperfect fit tells us either that the network needs more capacity, or that the SIR structure itself can’t capture the dynamics. We’ll explore this in the next section.
Interpreting the learned β(t):
If β(t) shows a dip followed by recovery around days 15-25, this suggests the network learned that transmission must vary to explain both waves. This is consistent with behavioural relaxation - initial caution followed by return to normal contact patterns.
If β(t) is roughly constant, the network didn’t find time-varying transmission useful. This might mean:
The network is too small (try more hidden units)
More training iterations are needed
Or genuinely: time-varying β isn’t the right mechanism for this data
Uncertainty quantification in the main analysis
Everything above is a single optimisation run — the learned β(t) is a point estimate with no uncertainty attached. This is a serious limitation for drawing scientific conclusions. A quick way to get a rough sense of uncertainty is to re-run the training from several different random seeds (e.g. 5-10) and overlay the resulting β(t) curves. Where the curves agree, the signal is robust; where they diverge, the data cannot constrain the answer. See Going further → Ensemble uncertainty for more detail.
Practical challenges
Training UDEs is difficult, especially with limited data. Understanding why helps set realistic expectations.
The fundamental tension is between flexibility and data. Our network has 33 parameters and we have 60 data points - roughly 2 points per parameter. In classical statistics, you’d want at least 10-20 observations per parameter for reliable inference. With neural networks embedded in ODEs, the situation is even more constrained because the parameters interact nonlinearly through the differential equation.
Identifiability is another challenge. Many different β(t) curves could produce similar epidemic trajectories. The data doesn’t uniquely determine β(t) — there’s a family of solutions. The ODE structure constrains this family (not every β(t) produces a valid epidemic), but it doesn’t eliminate the ambiguity entirely.
Initialisation matters because optimisation can get stuck in local minima. Different random seeds for the network weights lead to different solutions. Running multiple times and comparing results is good practice.
Numerical stability can be problematic. The combination of ODE solving and gradient computation can produce NaN values if the optimiser takes steps that are too large, or if the ODE becomes stiff (rapidly changing dynamics requiring small time steps).
The deeper issue: structural constraints
If the fit is imperfect even with Poisson loss, this tells us something important: maybe time-varying β isn’t the full story.
Consider what happens during the trough (days 13-24):
Cases drop to near-zero (0-5 per day)
By this point, S has been substantially depleted by the first wave
Even with I > 0, the product β·S·I/N is small because S is small
A large β can compensate somewhat, but there is a limit
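Plugging rough numbers into the transmission term makes this limit concrete (illustrative trough values, not fitted quantities):

```julia
# Hypothetical trough conditions: the first wave has depleted S,
# and prevalence is near zero
S, I, N = 60.0, 1.0, 284.0

new_infections_per_day(β) = β * S * I / N

println(new_infections_per_day(1.25))  # baseline β ≈ 1.25: about 0.26 cases/day
println(new_infections_per_day(5.0))   # even a 4× larger β: about 1.06 cases/day
```

At a fraction of a case per day, transmission chains fizzle out; no plausible β restarts exponential growth from here without somehow replenishing the susceptible pool.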
The neural network can only learn β(t), but the SIR structure itself constrains what’s possible:
S only decreases: Once recovered, people are immune forever — the susceptible pool is depleted by the first wave
The trough problem: With I near zero and S depleted, restarting growth requires implausibly large β
No memory: The model has no way to “store” infection for later release
In the pMCMC session, we saw that the SEIT4L model — which adds an exposed class and Erlang-distributed temporary immunity — explains the two waves through structural changes, not time-varying transmission. That is the more parsimonious explanation for this data.
What other structural changes could help?
SEIR model: Exposed period sustains transmission chains during the trough
Two-population model: Different mixing groups, delayed introduction
Learn γ(t) too: Time-varying recovery might help, though this raises identifiability issues with β(t)
UDEs can only discover mechanisms within the constraints we impose. If we say “it must be SIR with learned β(t)”, we’re already limiting what can be learned.
An imperfect fit tells us something: time-varying transmission alone is insufficient to explain the data, and we may need different model structure.
Common failure modes
Overfitting occurs when the training loss becomes very low but the predictions look unrealistic — perhaps fitting noise or producing wildly oscillating β(t). With UDEs this is less common than in standard neural networks because the ODE structure constrains the output, but it can still happen with very flexible networks. The main solution is reducing network size (fewer hidden units).
Underfitting is the opposite: the UDE behaves almost identically to constant β, meaning the network hasn’t learned anything useful. The loss may still be high. Solutions include increasing network capacity (more hidden units) or running more training iterations.
NaN explosion happens when gradients become infinite, usually because the optimiser took too large a step that pushed the ODE into an unstable region. You’ll see NaN values in the loss. Solutions include reducing the learning rate (0.01 to 0.001), using gradient clipping, or switching to a more robust ODE solver that handles stiff equations.
Sensitivity to initialisation
Different random initialisations of the network weights can lead to different local minima and different learned β(t) curves. Let’s check how much this matters.
# Train from 3 different random seeds and compare learned β(t)
plot(title="Learned β(t) across random seeds",
     xlabel="Time (days)", ylabel="β", legend=:topright)

for seed in [1234, 5678, 9012]
    rng_test = Random.default_rng()
    Random.seed!(rng_test, seed)
    p_nn_test, _ = Lux.setup(rng_test, nn_architecture)
    p_nn_test = ComponentArray(p_nn_test)

    p_test = ComponentArray(
        nn_params = p_nn_test,
        γ = γ_guess,
        tmax = maximum(times)
    )

    opt_func_test = OptimizationFunction(loss_ude, Optimization.AutoForwardDiff())
    opt_prob_test = OptimizationProblem(opt_func_test, p_test, nothing)
    result_test = solve(opt_prob_test, Optimisers.Adam(0.01), maxiters=500,
                        callback = (state, l) -> false)

    # Extract the learned β(t) along the trajectory for comparison
    p_test_trained = result_test.u
    prob_test = ODEProblem(sir_ude!, u0, tspan, p_test_trained)
    sol_test = solve(prob_test, Tsit5(), dense=true)
    β_test = [begin
        S_t, I_t, R_t = sol_test(t)
        nn_input = [I_t / sum(u0), t / maximum(times)]
        β_val, _ = nn_architecture(nn_input, p_test_trained.nn_params, st_nn)
        β_val[1]
    end for t in t_plot]

    plot!(t_plot, β_test,
          label="Seed $seed (loss=$(round(result_test.objective, digits=1)))",
          linewidth=2)
end
Compare not just the final loss values but the shape of the β(t) curves. Similar losses do not guarantee similar β(t) — two different transmission patterns can produce comparably good fits. Where the curves agree, you can be more confident in the result; where they diverge, the data cannot distinguish between explanations.
Exercises
Tip: Exercise 1: Experiment with hyperparameters
Try changing one thing at a time and observe the effect on the learned β(t) curve and the final loss:
Hidden units: Change Dense(2 => 8, tanh) to Dense(2 => 4, tanh) (smaller) or Dense(2 => 16, tanh) (larger). How does the number of parameters change? Does the fit improve or does overfitting become visible?
Learning rate: Replace Adam(0.01) with Adam(0.001) or Adam(0.05). What happens to training stability and convergence speed?
Random seed: Change Random.seed!(1234) to a different value and re-run. How much does the learned β(t) vary between seeds? What does this tell you about identifiability?
For each change, re-run from Step 1 through Step 6 and compare the β(t) plot to your original result.
Tip: Exercise 2: Interpret the structural constraint
In the section on structural constraints, we argue that an SIR model with learned β(t) struggles to produce two waves. Can you explain, in your own words, why even a very large β during the trough is insufficient — considering both the depleted susceptible pool and near-zero prevalence? What structural change to the model would you propose, and how does this relate to the SEIT4L model from the pMCMC session?
When to use UDEs
UDEs aren’t always appropriate. Here’s when they add value.
Use UDEs when:
You have mechanistic knowledge about structure (e.g., compartmental dynamics) but unknown functional forms (how transmission varies)
You have moderate-to-large datasets - ideally hundreds to thousands of data points
You’re doing exploratory analysis to discover potential mechanisms
You’re willing to experiment with hyperparameters and validate results carefully
Don’t use UDEs when:
Simple parametric models fit the data adequately (prefer parsimony)
Your dataset is very small (under 50 points)
You need interpretable parameters for publication or policy
Computational resources are limited - training can be expensive
You need real-time inference - for a single dataset, MCMC may actually be faster
Comparison with other approaches
| Method | Flexibility | Interpretability | Data needs | Computational cost |
|---|---|---|---|---|
| Constant β | Low | High | Low (10-50 points) | Very low |
| Time-varying β with splines | Medium | Medium | Medium (50-200) | Low |
| GP for β(t) | Medium | Medium (but with UQ) | Medium (50-200) | Medium |
| UDE | High | Low | High (100-1000+) | High |
For TdC data (60 points): GPs or splines are likely more appropriate than UDEs.
Where UDEs have been applied
Real-data applications of UDEs are still relatively few. In epidemiology, Kuwahara & Bauch (2024) used a UDE to predict COVID-19 second waves from case and mobility data, Schmid et al. (2026) combined a UDE-based SEIR model with wastewater surveillance to project case counts, and Rojas-Campos et al. (2023) demonstrated the UDE → SINDy pipeline on simulated regional transmission data. Outside epidemiology, UDEs have been applied to glucose-insulin dynamics (de Rooij et al. 2025) and battery degradation modelling (Kuzhiyil et al. 2024). Philipps et al. (2025) review the broader state of UDEs in systems biology and note that most published work still uses synthetic data.
Going further
Network architecture choices
The neural network architecture affects what patterns can be learned:
Smaller networks (e.g., 4→1) may underfit complex dynamics but are more interpretable
Larger networks (e.g., 16→8→1) can capture more complexity but may overfit with limited data despite the ODE constraint
Input features matter: β(t) learns time-dependent patterns, β(I/N) learns prevalence-dependent patterns, β(I/N, t) can learn both
Try different architectures to understand what your data can support.
Ensemble uncertainty
See Uncertainty quantification in the main analysis for the basic idea of training from multiple random seeds. For a more rigorous version, train 10-20 runs and compute pointwise confidence bands on β(t) from the ensemble. This is particularly important for scientific claims about mechanisms.
Validation with synthetic data
Before trusting UDE discoveries on real data, test on synthetic data where you know the truth:
Generate data with a known β(t), e.g., β(t) = β₀ × (1 + 0.3 sin(2πt/20))
Train UDE on this synthetic data
Compare learned β(t) to the true function
If the UDE can’t recover known patterns, be cautious about interpreting patterns from real data
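A sketch of the data-generation step, kept free of the solver machinery above by using a crude Euler loop (the population size, γ, and baseline β₀ are illustrative values):

```julia
# Known ground-truth transmission rate: baseline 1.25 with a 20-day oscillation
β_true(t) = 1.25 * (1 + 0.3 * sin(2π * t / 20))

# Crude Euler integration of SIR, accumulating daily incidence
function simulate_sir(β; N = 284.0, I0 = 3.0, days = 60, dt = 0.01, γ = 0.5)
    S, I = N - I0, I0
    daily = zeros(days)
    for t in 0.0:dt:(days - dt)
        inf = β(t) * S * I / N * dt    # new infections in this time step
        S -= inf
        I += inf - γ * I * dt
        daily[floor(Int, t) + 1] += inf
    end
    return daily
end

synthetic = simulate_sir(β_true)
# Observation noise would then come from rand.(Poisson.(synthetic)),
# using Distributions.jl
println(sum(synthetic) > 150)  # true: most of the population gets infected
```

Training the UDE on `synthetic` and comparing the learned curve against `β_true` tells you whether the pipeline can recover a known signal at this data volume.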
Symbolic regression
This section introduces a related but distinct topic — symbolic regression — that extends the UDE workflow. It is included for interest and can be safely skipped.
A major limitation of neural network UDEs is interpretability - you get a function that works, but you can’t write down what it is. Symbolic regression addresses this by searching for equations rather than neural network weights.
The idea: instead of learning that β(t) is “some neural network”, discover that β(t) = β₀(1 + α·sin(ωt)) or β(t) = β₀·exp(-κI/N). The output is a human-readable formula that can be interpreted, published, and debated.
Two main approaches exist:
SINDy (Sparse Identification of Nonlinear Dynamics) builds a library of candidate terms (polynomials, trigonometric functions, etc.) and uses sparse regression to find which terms are needed. It’s fast and produces interpretable equations, but limited to the function library you define. In Julia, DataDrivenDiffEq.jl provides SINDy implementations.
Genetic programming / symbolic regression evolves equations by combining mathematical operations, selecting those that fit well. Tools like PySR (accessible from Julia via PythonCall) or SymbolicRegression.jl can discover surprising functional forms you wouldn’t have thought to include in a library.
The workflow often combines both approaches: first use a neural network UDE to confirm that a time-varying or state-dependent term improves fit, then apply symbolic regression to find an interpretable equation for that term.
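To see how the sparse-regression step works in principle, here is a self-contained toy version of sequentially thresholded least squares (STLSQ), the core of SINDy, recovering a known sparse combination from a small candidate library. It stands in for DataDrivenDiffEq.jl's far more complete machinery:

```julia
using LinearAlgebra

# Candidate library evaluated at sample points: [1, t, sin(t), t^2]
ts = range(0, 10, length=200)
Θ = [ones(length(ts)) ts sin.(ts) ts.^2]

# "Measured" signal: secretly 0.5 + 2 sin(t) — only two library terms active
y = 0.5 .+ 2 .* sin.(ts)

# Sequentially thresholded least squares: fit, zero small coefficients, refit
function stlsq(Θ, y; λ = 0.1, iters = 10)
    ξ = Θ \ y
    for _ in 1:iters
        small = abs.(ξ) .< λ
        ξ[small] .= 0.0
        big = .!small
        ξ[big] = Θ[:, big] \ y  # refit using only the surviving terms
    end
    return ξ
end

ξ = stlsq(Θ, y)
println(round.(ξ, digits=3))  # [0.5, 0.0, 2.0, 0.0]
```

The recovered coefficient vector reads directly as an equation, 0.5 + 2 sin(t), which is exactly the interpretability that a trained neural network lacks. Real epidemic data is noisy and the library must be chosen with care, so clean recovery like this is the best case, not the typical one.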
For policy applications, the difference matters in principle. A hypothetical finding like “transmission increases when prevalence drops below 5%” would be actionable; “transmission follows neural network with weights [0.23, -0.41, …]” is not. In practice, recovering clean symbolic expressions from noisy epidemic data is difficult, but symbolic UDEs offer a path from discovery towards interpretation.
Universal Differential Equations combine mechanistic structure (SIR compartments) with neural networks that learn unknown components (how β varies).
Loss function choice matters. Poisson likelihood forces the model to explain small counts in the second wave; squared errors would be dominated by the first-wave peak. The ODE structure constrains the network — it can only learn β(t) curves that produce valid epidemic trajectories.
If the learned β(t) shows an increase around days 20-25, this is consistent with behavioural relaxation — but it’s a hypothesis needing independent validation. An imperfect fit may indicate that SIR structure with time-varying β isn’t sufficient.
Tip: Learning points
UDEs combine mechanistic and machine learning: Keep known structure (SIR compartments), learn unknown parts (how β varies)
The loss function matters: Poisson likelihood forces the model to explain small counts; SSE is dominated by peak values
The ODE constrains the network: The mechanistic structure limits what the network can learn, preventing it from memorising noise
A single training run is a point estimate: Without ensemble runs or Bayesian treatment, there is no uncertainty quantification — interpret any single β(t) curve with caution
Learned patterns suggest mechanisms but don’t prove them - UDEs generate hypotheses needing independent validation
Data requirements are high: UDEs need hundreds of data points to work well; for small datasets, simpler approaches may be better
Next steps
This session and the variational inference session round out the course’s coverage of inference methods. Together with the MCMC, particle filter and pMCMC sessions, you now have a range of approaches for fitting mechanistic models to data, and some sense of when each is appropriate.