This vignette is to introduce to the {idmodelr} package users a simple use example of the particle Markov-chain Monte Carlo (PMCMC) algorithm. We take simple deterministic/stochastic epidemic models with partial observation as model cases and show how the models are coupled with PMCMC for efficient parameter estimation.
Infectious disease modelling often involves handling of time-series datasets with partial (or modified) observation. A familiar example of the partial observation is underreporting, where only a certain fraction of infected individuals are identified and recorded in the dataset. One can account for the observation processes by explicitly incorporating them into the model, but this approach can lead to a particular class of model called “hidden Markov process (HMP)”, whose likelihood is usually intractable.
HMPs are typically represented as a Markov process with hidden inner states and its observation. Namely, the hidden variable \(x_t\) evolves over time according to the time-evolution process \(f\) and only \(y_t\), the observed variable subject to the observation process \(g\), is available as the model output.
\[\begin{matrix} x_t & \sim & f(x_t;x_{t-1}) \\ y_t &\sim &g(y_t;x_t) \end{matrix} \]
The likelihood of this HMP given the data \(\{y_1,y_2,...,y_T\}\) should theoretically be obtained by integrating out hidden variables \(x_t\)’s; however, this integration is computationally too heavy in practice due to high dimensionality.
\[ \mathcal L=\int\int...\int \left( p(x_1)g(y_1;x_1)\sum_{t=2}^T {f(x_t;x_{t-1})g(y_t;x_t)} \right) dx_1dx_2...dx_T. \] PMCMC employs the sequential Monte Carlo (SMC) to approximate this likelihood for better efficiency. The posterior distributions of \(x_t\)’s are represented by a set of samples (referred to as “particles”), which is sequentially sampled for \(t=1,2,...,T\). For each time step \(t\), SMC simulates the current state \(\{x_t\}\) based on the previous state samples \(\{x_{t-1}\}\) and weight them based on the stepwise likelihood \(g(y_t;x_t)\), so that the resulting samples provide a Monte Carlo integration. The model parameters are then searched using MCMC based on the approximated likelihood.
For further details and formal introduction to the PMCMC method, see Endo et al. (2019).
First, we take the determinisitic Susceptible-Infectious-Recovered (SIR) model as an example. Although PMCMC is not necessary to fit the deterministic SIR model to data because the likelihood can be directly computed, we will introduce it for a comparison with the stochastic SIR model example later, which PMCMC is better suited for.
The SIR model is a simple compartmental model widely used in infectious disease epidemiology. The model categorises the population into three classes: \(S\) (not yet infected but at risk), \(I\) (infected and currently infectious) and \(R\) (previously infected but not infectious anymore), and characterises the transitions between the classes by a set of ordinary differential equations.
\[\begin{cases} \frac{d}{dt}S(t)=-\beta S(t)I(t) \\ \frac{d}{dt}I(t)=\beta S(t)I(t) - \gamma I(t) \\ \frac{d}{dt}R(t)=\gamma I(t) \end{cases} \]
SIR model runs on continuous time and we can get the state of each compartment at given time \(t\).
# deterministic SIR model by {idmodelr}
sim_outbreak <- function(betatau, N, times, model = SIR_ode, sim_fn = solve_ode) {
parameters <- data.frame(beta = betatau[1], tau = betatau[2])
inits <- data.frame(S = N - 1, I = 1, R = 0)
traj <- simulate_model(model = model, sim_fn = sim_fn, inits = inits, params = parameters,
times = times)
return(traj)
}
times <- seq(0, 10, 0.1)
traj <- sim_outbreak(c(3, 1), 1000, times)
# plot the epicurve
plot_model(traj, facet = F)
However, we usually do not observe infections on continuous time: outbreak data is usually reported as incidence on discrete time steps (e.g. weekly case counts). Therefore, we need to assume certain obervation processes when we fit the model to the observed data.
One of the options is to assume the reported number of cases follows a Poisson distribution whose mean is equal to the ascertainment probability \(p\) times the cumulative amount of the newly-infected (i.e., inflow into the \(I\) compartment) in SIR model. Here we assume \(p\) is an unknown constant. Let \(C_n\) be the number of cases observed over the period between \(t=T_{n-1}\) and \(t=T_{n}\). We then get
\[ \begin{matrix} C_n & \sim & \operatorname{Pois}(pc_n), \\ c_n & = & \int_{T_{n-1}}^{T_{n}}\beta I(t)S(t)dt \\ & = & S(T_{n-1})-S(T_{n}). \\ %& = & S(T_{n-1})\left[1-\exp\left(-\int_{T_{n-1}}^{T_{n}}\beta I(t)dt\right)\right] \end{matrix} \]
We can consider this system as an HMP. For discrete time steps \(n=1,2,...,N\), \(S(T_n)\), \(I(T_n)\) and \(R(T_n)\) constitute the (hidden) states of the HMP. The ODEs provide the time-evolution process and the Poisson distributions are the observation process.
Although we introduced PMCMC as an inference tool for HMPs, we do not need to use PMCMC for this deterministic setting because the likelihood is directly available without particle approximation. Given parameters \(\beta\) and \(\gamma\), the likelihood of observing \(C_1, C_2, ..., C_N\) cases is computed as below.
# Likelihood for the deterministic SIR model
cases <- c(3, 18, 91, 96, 42, 17, 7, 2, 1, 1) # Observed case counts: {C_n}
T_n <- 0:10 * 10 + 1 # Time frame of observation
getc_n <- function(traj, T_n) {
S_T <- pmax(unlist(traj[, "S"])[T_n], 0)
c_n <- (S_T[-length(S_T)] - S_T[-1])
return(c_n)
}
logl_detSIR <- function(parms, C_n) {
beta <- parms[1]
tau <- parms[2]
p <- parms[3]
traj <- sim_outbreak(c(beta, tau), 1000, times)
c_n <- getc_n(traj, T_n)
logl <- sum(dpois(C_n, p * c_n, log = T))
return(logl)
}
Parameters can be estimated with this likelihood; e.g., by the maximum likelihood estimation or Bayesian posterior sampling.
# Maximum likelihood estimation. The true parameter values were (beta = 3.0, tau
# = 1.0, p = 0.3).
optim(numeric(3) + 0.5, logl_detSIR, lower = numeric(3) + 0.1, upper = c(5, 5, 1),
method = "L-BFGS", C_n = cases, control = list(fnscale = -1))
#> $par
#> [1] 3.1572618 1.1665207 0.3039177
#>
#> $value
#> [1] -22.99954
#>
#> $counts
#> function gradient
#> 22 22
#>
#> $convergence
#> [1] 0
#>
#> $message
#> [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
In this section, we show how the deterministic SIR model can be fitted to the observed data. For better comparison with the stochastic SIR model case in the later section, we incrementally evaluate the likelihood by following time steps rather than in one shot. Here, getSIR(n)
returns the hidden state variables at step \(n\) (i.e. \(S(t_n)\), \(I(t_n)\) and \(R(t_n)\)) and observe(C_n, c_n, p)
gives the incremental likelihood of observing \(C_n\) cases given \(\lambda_n\) and \(p\). Assuming improper uniform priors on the positive real line, MCMC is implemented targeting the likelihood function.
# Stepwise likelihood compuation and MCMC
getSIR <- function(n, SIRtraj) {
t_n <- T_n[n + 1]
SIR <- unlist(SIRtraj[t_n, ])
return(list(S = SIR["S"], I = SIR["I"], R = SIR["R"]))
}
observe <- function(C_n, c_n, p) {
logl <- dpois(C_n, c_n * p, log = T)
return(logl)
}
# MCMC implementation with {LaplacesDemon}
library(LaplacesDemon)
n_iter <- 5000
chain <- matrix(0, n_iter, 3)
Data <- list(parm.names = c("beta", "tau", "p"), mon.names = "logl", N = 1000)
Model <- function(parms, Data) {
beta <- abs(parms[1])
tau <- abs(parms[2])
p <- interval(parms[3], 0, 1)
traj <- sim_outbreak(c(beta, tau), 1000, times)
prevS <- unlist(traj[1, "S"])
logl <- 0
for (n in 1:10) {
currS <- getSIR(n, traj)$S
logl <- logl + observe(cases[n], prevS - currS, p)
prevS <- currS
}
return(list(LP = logl, Dev = -2 * logl, Monitor = logl, yhat = NULL, parm = c(beta,
tau, p)))
}
Covar <- matrix(c(0.5, 0.45, 0.03, 0.45, 0.45, 0.03, 0.03, 0.03, 0.003), 3) # Specified ad-hoc: need to use adaptive methods etc. in practice
set.seed(2020)
chain <- LaplacesDemon(Model, Data, Initial.Values = c(1, 1, 0.5), Covar = Covar,
Iterations = n_iter, Status = 1000, Algorithm = "RWM")
#>
#> Laplace's Demon was called on Wed Jan 15 18:58:42 2020
#>
#> Performing initial checks...
#> Suggestion: 1 possible instance(s) of for loops
#> were found in the Model specification. Iteration speed will
#> increase if for loops are vectorized in R or coded in a
#> faster language such as C++ via the Rcpp package.
#> Algorithm: Random-Walk Metropolis
#>
#> Laplace's Demon is beginning to update...
#> Iteration: 1000, Proposal: Multivariate, LP: -23.9
#> Iteration: 2000, Proposal: Multivariate, LP: -23.4
#> Iteration: 3000, Proposal: Multivariate, LP: -24.9
#> Iteration: 4000, Proposal: Multivariate, LP: -23.5
#> Iteration: 5000, Proposal: Multivariate, LP: -26.3
#>
#> Assessing Stationarity
#> Assessing Thinning and ESS
#> Creating Summaries
#> Estimating Log of the Marginal Likelihood
#> Creating Output
#>
#> Laplace's Demon has finished.
chain$Summary2
#> Mean SD MCSE ESS LB Median
#> beta 3.567631 0.58522096 0.09101409 93.48947 2.7805671 3.4446113
#> tau 1.666154 0.69820241 0.11052192 89.21031 0.7351573 1.5256705
#> p 0.340713 0.06092527 0.00939460 86.13116 0.2677050 0.3237893
#> Deviance 53.826001 78.62832926 4.83391214 343.17428 46.2839755 48.6346771
#> logl -26.913000 39.31416463 2.41695607 343.17428 -28.9875397 -24.3173385
#> UB
#> beta 5.1835019
#> tau 3.5933746
#> p 0.5110684
#> Deviance 57.9750794
#> logl -23.1419877
for (c in 1:3) plot(chain$Posterior1[, c], type = "l", lty = 1, col = c, xlab = "Iterations",
ylab = c("beta", "tau", "p")[c])
Because the time evolution process is deterministic, the incremental likelihood is computed without random fluctuations; getSIR(n)
returns the fixed values and therefore the fixed incremental likelihood observe(C_n, c_n, p)
. One the other hand, the time evolution randomly fluctuates in the stochastic SIR model (detailed in the next section). A possible approach to accounting for the stochasticity is to use many samples and approximate the likelihood by averaging over the samples, which is the key idea of PMCMC.
Next, we show the PMCMC implementation for the stochastic SIR model, which constitutes a typical HMP with an intractable likelihood function.
We adopt the Gillespie algorithm to incorporate stochasticity into the SIR model. The three variables \(S\), \(I\) and \(R\), are now all considered as integers, and we assume that each individual transitions between compartment randomly according to the following rates:
\[\begin{matrix} r_{S\rightarrow I}&=&\beta I(t), \\ r_{I\rightarrow R}&=&\gamma. \end{matrix}\] Assuming that transitions of more than one individuals do not happen exactly at the same time, we can characterise the time evolution of the stochastic SIR model as a random sequence of the two events:
Because the number of individuals in each compartment stays constant between the successive two events, the transition rates are also constant meanwhile. Note that the next earliest happening of an event is the earlier of either “one of the \(S\) susceptible individuals gets infected” or “one of the \(I\) infectious individuals recovers”. The time between the two successive events (where the time of the first event is \(t_k\)) therefore follows an exponential distribution with the rate \[r_{\mathrm{Next}}=r_{S\rightarrow I}S(t_k)+r_{I\rightarrow R}I(t_k)=\beta S(t_k)I(t_k)+\gamma I(t_k).\]
The Gillespie algorithm simulates the epidemic by the following steps. For complete program code examples, see epirecipes or {idmodelr} package repository.
The output of the Gillespie algorithm,\(\{S(t_k),I(t_k),R(t_k)\}\) for \(k=1,2,...\), can then be translated into a continuous-time epidemic time series. {idmodelr} supports epidemic simulations with the Stochastic SIR model.
source("../R/SIR.R")
source("../R/solve.R")
library(purrr)
#>
#> Attaching package: 'purrr'
#> The following object is masked from 'package:LaplacesDemon':
#>
#> partial
library(tibble)
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.2
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(2020)
# Stochastic SIR simulation by {idmodelr}
traj <- sim_outbreak(c(3, 1), 1000, times, model = SIR_stoch, sim_fn = solve_stoch)
# plot the epicurve
plot_model(traj, facet = F)
As we did for the deterministic SIR model, we can incorporate an observation process (underreporting) into the model. \(C_n\), the number of observed cases between time \(T_{n-1}\) and \(T_n\), is given as \[\begin{matrix} C_n&\sim&\operatorname{Bin}(c_n,p), \\ c_n&=&S(T_{n-1})-S(T_n), \end{matrix}\] where \(c_n\) is the number of “new infection” event occurrences between time \(T_{n-1}\) and \(T_n\). The system consisting of the hidden states \(S(T_n), I(T_n),R(T_n)\) and \(c_n\) for discrete time steps \(n=1,2,...,N\) is an HMP. Unlike the case of deterministic SIR model, the likelihood function for observed data \(\{C_1,C_2,...,C_N\}\) is intractable and PMCMC needs to be employed to efficiently estimate the model parameters with the approximated likelihood.
We implement PMCMC to estimate the parameters from the observed data \(\{C_1,C_2,...,C_N\}\). The function simulateSIR_stoch(n)
simulates the stochastic SIR model from \(t=T_{n-1}\) to \(T_n\) and observe_stoch(C_n, c_n, p)
gives the incremental likelihood corresponding to the observation \(C_n\). Improper uniform priors on the positive real line are assumed.
The SMC part of PMCMC starts off with a set of initial particles all corresponding to the initial state \(S(0)=999, I(0)=1, R(0)=0\). For each time step \(n\), the time evolution of the stochastic SIR model is simulated for each particle. Then the particles are weighted by the incremental likelihood and resampled to yield the samples that approximate the total likelihood. In the MCMC part, the parameter values are updated by the Metropolis Hastings algorithm based on the approximated likelihood.
PMCMC for the stochastic SIR model works in the following steps: