Bayesian analysis of longitudinal multilevel data - part 6

Data Analysis

Bayesian

Stan

Part 6 of a tutorial showing how to fit directly with Stan and cmdstanr.

Author

Andreas Handel

Published

February 16, 2024

Modified

March 20, 2024

Overview

This is an extension of a cmdstanr/Stan model to fit longitudinal data using Bayesian multilevel/hierarchical/mixed-effects models. It is a continuation of a prior series of posts. You should start at the beginning.

Here is the Stan code for this example and this is the R script that runs everything.

Introduction

As described in the prior post I want to implement an ordinary differential equation (ODE) model with Stan. I am slowly building up to that.

While I had planned to hop from the 2-parameter cmdstanr and Stan model straight to the ODE model, I ended up deciding on one more intermediate step. Namely, I know that the ODE model I want to use has 4 main parameters, as opposed to the 2 parameters used here. I figured I might first build a more complex, 4-parameter non-ODE model before I try to implement the ODE model.

New deterministic model

The obvious candidate for that 4-parameter model is the equation I mentioned previously and that we ended up using in one of our papers. This equation was introduced as a good model to fit virus load data for an acute infection in this paper.

I’m deviating from the original paper by giving the parameters different names (\(\alpha\), \(\beta\), \(\gamma\), \(\eta\) instead of \(p\), \(g\), \(d\), \(k\)) to be consistent with what I’ve been doing so far.

Also, since the original model was developed and set up with the assumptions that all 4 main parameters are positive, I need to ensure that by using the previously described approach of exponentiating the parameters.

This leads to this equation for the virus load trajectory.

\[ \begin{aligned} \mu_{i,t} = \log\left( \frac{2 \exp(\alpha_{i})}{e^{-\exp(\beta_{i}) (t_i - \exp(\gamma_{i}))} + e^{\exp(\eta_{i}) (t_i - \exp(\gamma_{i}))}}\right).\\ \end{aligned} \]

Here’s a bit of R code to explore this equation and to determine values for parameter priors that produce curves that are broadly consistent with our simulated data.

# brief plotting of model to get idea for priors
t = seq(0.1,40,length=100) 
# all parameters are the log of their original values
alph = 30; # approximately the peak of virus
bet = 1.0; # approx. growth rate
gamm = 2.5; # approx. peak time
et = 0.3; # approx. decay rate
num  = 2*exp(alph)
d1 = exp( - exp(bet)*(t - exp(gamm)) )
d2 =  exp( exp(et) * (t - exp(gamm)) )
mu = log( num /(d1 + d2) ) 
plot(t,mu, type = "l") #looks somewhat like virus load in acute

Full model structure

The rest of the model follow the previous one, just with more parameters. They all get their own equations. Here are the main components of the the new model.

\[ \begin{aligned} \textrm{Outcome} \\ Y_{i,t} \sim \mathrm{Normal}\left(\mu_{i,t}, \sigma\right) \\ \\ \textrm{main model describing the virus trajectory} \\ \mu_{i,t} = \log\left( \frac{2 \exp(\alpha_{i})}{e^{-\exp(\beta_{i}) (t_i - \exp(\gamma_{i}))} + e^{\exp(\eta_{i}) (t_i - \exp(\gamma_{i}))}}\right).\\ \\ \textrm{Deterministic models for main parameters} \\ \alpha_{i} = a_{0,i} + a_1 \left(\log (D_i) - \log (D_m)\right) \\ \beta_{i} = b_{0,i} + b_1 \left(\log (D_i) - \log (D_m)\right) \\ \gamma_{i} = g_{0,i} + g_1 \left(\log (D_i) - \log (D_m)\right) \\ \eta_{i} = e_{0,i} + e_1 \left(\log (D_i) - \log (D_m)\right) \\ \end{aligned} \]

Parameter distributions

To fully specify the model, we need to give all parameters distributions. Here are the distributions for the population-level parameters. These do not vary between individuals. I’m choosing values for the prior distributions assuming that dose will lead to an increase in peak, time to peak and growth rate, and a reduction in decay rate. I’m only indicating the values with an X below since I keep changing them in the code and then they are out of sync here. Check the R script to see the chosen values.

\[ \begin{aligned} \textrm{population-level priors} \\ \sigma \sim \mathrm{Exponential}(X) \\ a_1 \sim \mathrm{Normal}(X,X) \\ b_1 \sim \mathrm{Normal}(X,X) \\ g_1 \sim \mathrm{Normal}(X,X) \\ e_1 \sim \mathrm{Normal}(X,X) \\ \end{aligned} \]

In addition, we allow some parameters to differ between individuals, and we’ll implement hyper-parameters to allow these values to inform each other across individuals. This is again the adaptive pooling concept discussed previously.

I’m setting values for the prior distributions such that the virus load curve looks somewhat reasonable, based on the quick exploration of the model above. Again, not showing exact values here to not create confusion between what I write here and potentially different values I end up using in the code (and forgetting to update here). See the code for the actual values.

\[ \begin{aligned} \textrm{individal-level priors} \\ a_{0,i} \sim \mathrm{Normal}(\mu_a, \sigma_a) \\ b_{0,i} \sim \mathrm{Normal}(\mu_b, \sigma_b) \\ g_{0,i} \sim \mathrm{Normal}(\mu_g, \sigma_g) \\ e_{0,i} \sim \mathrm{Normal}(\mu_e, \sigma_e) \\ \\ \textrm{hyper priors} \\ \mu_a \sim \mathrm{Normal}(X, X) \\ \mu_b \sim \mathrm{Normal}(X, X) \\ \mu_g \sim \mathrm{Normal}(X, X) \\ \mu_e \sim \mathrm{Normal}(X, X) \\ \sigma_a \sim \mathrm{Exponential}(X) \\ \sigma_b \sim \mathrm{Exponential}(X) \\ \sigma_g \sim \mathrm{Exponential}(X) \\ \sigma_e \sim \mathrm{Exponential}(X) \\ \end{aligned} \]

And that’s the full model. The basic structure is the same as before, but the model is bigger because I’m now modeling the virus trajectory (given by \(\mu_{i,t}\)) with 4 main parameters.

Model implementation

We’ll follow the same setup as in the previous post. Links to the Stan and R code files are given at the top of this document.

library('here') #for file loading
library('dplyr') # for data manipulation
library('ggplot2') # for plotting
library('fs') #for file path
library('cmdstanr') #for model fitting
library('bayesplot') #for plotting results
library('posterior') #for post-processing
library('loo') #for model diagnostics

Some setup bits

############################################
# some general definitions and setup stuff
############################################
#setting random number seed for reproducibility
rngseed = 12345

# I'll be saving results so we can use them without always running the model
# Note that the files are often too large for standard Git/GitHub - where this project lives
# Git Large File Storage should be able to handle it
# I'm using a simple hack so I don't have to set up Git LFS
# I am saving these large file to a folder that is synced with Dropbox
# adjust accordingly for your setup
#filepath = fs::path("C:","Data","Dropbox","datafiles","longitudinalbayes")
filepath = fs::path("D:","Dropbox","datafiles","longitudinalbayes")
filename = "cmdstanr4par.Rds"
stanfile <- here('posts','2024-02-16-longitudinal-multilevel-bayes-6',"stancode-4par.stan")

Data

We’ll use the same data as before. I’m making one more change. Instead of hard-coding the values for the prior distributions into the Stan code, I’m passing them into the Stan code as part of the data. This makes exploring the Stan model more flexible, I don’t need to re-edit the Stan code if I want to try different values for the priors. I could do this for all parameters, but out of laziness, and because I don’t change them much, I’m hard-coding the sigma parameters inside the Stan file.

# adjust as necessary
simdatloc <- here::here('posts','2022-02-22-longitudinal-multilevel-bayes-1','simdat.Rds')
simdat <- readRDS(simdatloc)
Nind = length(unique(simdat$m3$id))
Ntot =  length(simdat$m3$id)
# values for prior distributions
# allows for exploring different values without having to edit Stan model code
priorvals = list(mu_a_mu = 1, mu_a_sd = 1,
                 mu_b_mu = 1, mu_b_sd = 1,
                 mu_g_mu = 2.5, mu_g_sd = 1,
                 mu_e_mu = 0.3, mu_e_sd = 1,
                 a1_mu = 0.5, a1_sd = 1,
                 b1_mu = 0.1, b1_sd = 1,
                 g1_mu = 0.1, g1_sd = 1,
                 e1_mu = -0.1, e1_sd = 1
)

# all data as one list, this is how Stan needs it
fitdatbase=list(id=simdat[[3]]$id,
            outcome = simdat[[3]]$outcome,
            time = simdat[[3]]$time,
            dose_adj = simdat[[3]]$dose_adj[1:Nind], #first Nind values
            Ntot =  Ntot,
            Nind = Nind
            )
fitdat = c(fitdatbase, priorvals)

Stan code

I again wrote the Stan code in a separate Stan file. Here is the code. The model is basically like the previous one, updated to reflect the model equations above. As mentioned above, instead of hard-coding values for prior distributions inside the Stan code, I’m passing some of them into the code as data. The advantage of passing them in is that I can more quickly play around with different values and see how results change. It also ensures that I use the same values in all parts of the model (e.g., model and generated quantities blocks).

//
// Stan code for fitting a hierarchical model to time-series data
//

data{
   int<lower = 1> Ntot; //number of observations
   int<lower = 1> Nind; //number of individuals
   vector[Ntot] outcome; //virus load
   vector[Ntot] time; // times at which virus load is measured
   vector[Nind] dose_adj; //dose after adjustment, 1 value per individual
   array[Ntot] int id;  //vector of person IDs to keep track which data points belong to whom
   //everything below are variables that contain values for prior distributions
   real mu_a_mu; 
   real mu_b_mu;
   real mu_g_mu;
   real mu_e_mu;
   real mu_a_sd;
   real mu_b_sd;
   real mu_g_sd;
   real mu_e_sd;
   real a1_mu; 
   real b1_mu;
   real g1_mu;
   real e1_mu;
   real a1_sd;
   real b1_sd;
   real g1_sd;
   real e1_sd;

}

parameters{
    // population variance
    real<lower=0> sigma;
    // variance of priors
    real<lower=0> sigma_a;
    real<lower=0> sigma_b;
    real<lower=0> sigma_g;
    real<lower=0> sigma_e;
    // population-level dose dependence parameters
    real a1;
    real b1;
    real g1;
    real e1;
    // hyper-parameters to implement adaptive pooling
    real mu_a;
    real mu_b;
    real mu_g;
    real mu_e;
    // individual level variation parameters
    vector[Nind] a0;
    vector[Nind] b0;
    vector[Nind] g0;
    vector[Nind] e0;
}


// Generated/intermediate parameters
transformed parameters{

    // predicted virus load from model
    vector[Ntot] virus_pred; 
    // main model parameters
    // each individual has their potentially own value 
    // I'm removing the last letter from each variable name
    // just to avoid potential conflicts with bulit-in names of Stan
    // such as beta for a beta distribution
    // it might not matter, but I wanted to be safe
    vector[Nind] alph;
    vector[Nind] bet;
    vector[Nind] gamm;
    vector[Nind] et;
    // intermediate variables for denominators
    // just to make equation easier to understand
    real d1; 
    real d2; 


    // compute main model parameters
    // multiply a by 30 to get better scaling
    for ( i in 1:Nind ) {
        alph[i] = exp( 30*( a0[i] + a1 * dose_adj[i]) );
        bet[i]  = exp( b0[i] + b1 * dose_adj[i] );
        gamm[i] = exp( g0[i] + g1 * dose_adj[i] );
        et[i]   = exp( e0[i] + e1 * dose_adj[i] );
        
    }
    // loop over all observations
    // since parameters are saved in vectors of length corresponding to number of individuals
    // we need to index with that extra id[i] notation
    for (i in 1:Ntot)
    {
      d1 = exp( -bet[id[i]] * (time[i] - gamm[id[i]]) );
      d2 = exp(   et[id[i]] * (time[i] - gamm[id[i]]) ); 
      virus_pred[i] = log(  2*alph[id[i]] / ( d1 + d2) );
     }

} // end transformed parameters block



model{

    // hyper-priors to allow for adaptive pooling among individuals 
    // values for the distributions are passed into the Stan code as part of the data
    mu_a ~ normal( mu_a_mu , mu_a_sd );
    mu_b ~ normal( mu_b_mu , mu_b_sd );
    mu_g ~ normal( mu_g_mu , mu_g_sd );
    mu_e ~ normal( mu_e_mu , mu_e_sd );
    // variance of priors
    sigma_a ~ exponential(1);
    sigma_b ~ exponential(1);
    sigma_g ~ exponential(1);
    sigma_e ~ exponential(1);
    // individual variation of each ODE model parameter
    a0 ~ normal( mu_a , sigma_a );
    b0 ~ normal( mu_b , sigma_b );
    g0 ~ normal( mu_g , sigma_g );
    e0 ~ normal( mu_e , sigma_e );
    // average dose-dependence of each ODE model parameter
    // values for the distributions are passed into the Stan code as part of the data
    a1 ~ normal( a1_mu , a1_sd); 
    b1 ~ normal( b1_mu , b1_sd);
    g1 ~ normal( g1_mu , g1_sd);
    e1 ~ normal( e1_mu , e1_sd);
    // residual population variation
    sigma ~ exponential(1); 

    // distribution of outcome (virus load)
    // all computations to get the time-series trajectory for the outcome are done  
    // inside the transformed parameters block
    outcome ~ normal( virus_pred , sigma );
}

// for model diagnostics and exploration
generated quantities {
    // define quantities that are computed in this block
    // for the individual level parameters, every individual
    // has the same prior so we only specify one
    real mu_a_prior;
    real mu_b_prior;
    real mu_g_prior;
    real mu_e_prior;
    real<lower=0> sigma_a_prior;
    real<lower=0> sigma_b_prior;
    real<lower=0> sigma_g_prior;
    real<lower=0> sigma_e_prior;
    real a0_prior;  
    real b0_prior;     
    real g0_prior;     
    real e0_prior;     

    real a1_prior;
    real b1_prior;
    real g1_prior;
    real e1_prior;

    real<lower=0> sigma_prior;
    vector[Ntot] log_lik;
    vector[Ntot] ypred;

    
    // this is so one can plot priors and compare with posterior later   
    // simulate the priors
    mu_a_prior = normal_rng( mu_a_mu , mu_a_sd);
    mu_b_prior = normal_rng( mu_b_mu , mu_b_sd);
    mu_g_prior = normal_rng( mu_g_mu , mu_g_sd);
    mu_e_prior = normal_rng( mu_e_mu , mu_e_sd);

    sigma_a_prior =  exponential_rng(  1 );
    sigma_b_prior = exponential_rng(  1 );
    sigma_g_prior = exponential_rng(  1 );
    sigma_e_prior = exponential_rng(  1 );

    a0_prior = normal_rng(mu_a_prior, sigma_a_prior);
    b0_prior = normal_rng(mu_b_prior, sigma_b_prior);
    g0_prior = normal_rng(mu_g_prior, sigma_g_prior);
    e0_prior = normal_rng(mu_e_prior, sigma_e_prior);

    a1_prior = normal_rng( a1_mu , a1_sd);
    b1_prior = normal_rng( b1_mu , b1_sd);
    g1_prior = normal_rng( g1_mu , g1_sd);
    e1_prior = normal_rng( e1_mu , e1_sd);

    sigma_prior = exponential_rng( 1 );

  
  // compute log-likelihood and predictions
    for(i in 1:Ntot)
    {
      log_lik[i] = normal_lpdf(outcome[i] | virus_pred[i], sigma);
      ypred[i] = normal_rng(virus_pred[i], sigma);
    }
} //end generated quantities block

This loads and compiles the Stan model.

# make Stan model. 
stanmod1 <- cmdstanr::cmdstan_model(stanfile, 
                                    pedantic=TRUE, 
                                    force_recompile=TRUE)

Model fitting settings

These are the settings for the model fitting routine. Basically the same as before, only more initial conditions now because we have more parameters. And of course different values, since our model changed.

#settings for fitting
fs_m1 = list(warmup = 1500,
             sampling = 2000, 
             max_td = 20, #tree depth
             adapt_delta = 0.99999,
             chains = 5,
             cores  = 5,
             seed = rngseed,
             save_warmup = TRUE)

# separate definition of initial values, added to fs_m1 structure 
# a different sample will be drawn for each chain
# there's probably a better way to do that than a for loop
set.seed(rngseed) #make inits reproducible
init_vals_1chain <- function() (list(mu_a = runif(1,1,1.5), 
                                     mu_b = runif(1,0.8,1.2),
                                     mu_g = runif(1,2,3),
                                     mu_e = runif(1,0.2,0.6),
                                     sigma_a = runif(1,0,1),
                                     sigma_b = runif(1,0,1),
                                     sigma_g = runif(1,0,1),
                                     sigma_e = runif(1,0,1),
                                     a0 = runif(Nind,1,1.5),
                                     b0 = runif(Nind,0.8,1.5),
                                     g0 = runif(Nind,1.5,2.5),
                                     e0 = runif(Nind,0,1),
                                     a1 = runif(1,0.5,0.6),
                                     b1 = runif(1,0.1,0.1),
                                     g1 = runif(1,0.1,0.1),
                                     e1 = runif(1,-0.1,-0.1),
                                     sigma = runif(1,0,1))
                                )
inits = NULL
for (n in 1:fs_m1$chains)
{
  inits[[n]] = init_vals_1chain()
}
fs_m1$init = inits

Model fitting

This runs the model. It’s not actually run here to speed up generation of this Quarto file, but the code chunk is in the R script, so you can run it.

res_m1 <- stanmod1$sample(data = fitdat,
                          chains = fs_m1$chains,
                          init = fs_m1$init,
                          seed = fs_m1$seed,
                          parallel_chains = fs_m1$chains,
                          iter_warmup = fs_m1$warmup,
                          iter_sampling = fs_m1$sampling,
                          save_warmup = fs_m1$save_warmup,
                          max_treedepth = fs_m1$max_td,
                          adapt_delta = fs_m1$adapt_delta,
                          output_dir = filepath
)

res_m1$save_object(fs::path(filepath,filename))

Model result loading

To save time, we don’t run the model each time, instead we save the results and load them.

# loading previously saved fit.
# useful if we don't want to re-fit
# every time we want to explore the results.
# since the file is too large for GitHub
# it is stored in a local cloud-synced folder
# adjust accordingly for your setup
res_m1 <- readRDS(fs::path(filepath,filename))

Model diagnostics

First, we look at diagnostics from the fitting routine to make sure nothing obviously wrong shows up.

res_m1$cmdstan_diagnose()

Processing csv files: D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-202402271819-1-39fd59.csvWarning: non-fatal error reading adaptation data
, D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-202402271819-2-39fd59.csvWarning: non-fatal error reading adaptation data
, D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-202402271819-3-39fd59.csvWarning: non-fatal error reading adaptation data
, D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-202402271819-4-39fd59.csvWarning: non-fatal error reading adaptation data
, D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-202402271819-5-39fd59.csvWarning: non-fatal error reading adaptation data


Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
210 of 17500 (1.20%) transitions ended with a divergence.
These divergent transitions indicate that HMC is not fully able to explore the posterior distribution.
Try increasing adapt delta closer to 1.
If this doesn't remove all divergences, try to reparameterize the model.

Checking E-BFMI - sampler transitions HMC potential energy.
The E-BFMI, 0.07, is below the nominal threshold of 0.30 which suggests that HMC may have trouble exploring the target distribution.
If possible, try to reparameterize the model.

Effective sample size satisfactory.

Split R-hat values satisfactory all parameters.

Processing complete.

Ok, so the sampler isn’t quite happy. We should sample more and more stringently, but that would take very long. So for the purpose of this investigation, and given that I’m only exploring this model as a stepping stone to the ODE model I’m really interested in, I’ll leave it the way it is. If this were an actual research project, I would obviously need to improve the model performance.

Another important check are to make a few diagnostic plots. We’ll first need to get the samples, both with and without warmups, to be able to make various figures.

#this uses the posterior package to get draws
samp_m1 <- res_m1$draws(inc_warmup = FALSE, format = "draws_df")
allsamp_m1 <- res_m1$draws(inc_warmup = TRUE, format = "draws_df")

Now we can look at a few figures. Here I’m again showing a trace plot and a pairs plot.

# only main parameters, excluding parameters that we have for each individual, is too much
plotpars = c("a1","b1","g1","e1","sigma")
bayesplot::color_scheme_set("viridis")
bp1 <- bayesplot::mcmc_trace(samp_m1, pars = plotpars)
bp2 <- bayesplot::mcmc_pairs(samp_m1, pars = plotpars)
plot(bp1)

plot(bp2)

The plots look reasonable. Well-mixing chains and no noticeable correlations among parameters.

Model results

Let’s also look at the results, namely the posterior distributions of the parameters. We’ll again do both a table and a figure. We can’t really compare with the values we used to generate the model since we are using a different model to fit the data, so we shouldn’t expect any parameters to be similar. Thus, I’m not focusing further on the values. Again, for a real research project, you would want to carefully evaluate the parameters (after addressing the problems with the algorithm not working well).

print(head(res_m1$summary(),15))

# A tibble: 15 × 10
   variable       mean   median      sd     mad       q5      q95  rhat ess_bulk
   <chr>         <dbl>    <dbl>   <dbl>   <dbl>    <dbl>    <dbl> <dbl>    <dbl>
 1 lp__     -323.      -3.26e+2 3.78e+1 3.86e+1 -3.78e+2 -2.56e+2  1.04     99.0
 2 sigma       5.87     5.86e+0 2.67e-1 2.68e-1  5.45e+0  6.33e+0  1.00   5947. 
 3 sigma_a     0.0240   2.13e-2 1.58e-2 1.54e-2  3.97e-3  5.39e-2  1.03    232. 
 4 sigma_b     0.0160   1.27e-2 1.30e-2 1.19e-2  1.49e-3  4.21e-2  1.04    145. 
 5 sigma_g     0.0118   1.00e-2 8.34e-3 7.92e-3  1.69e-3  2.80e-2  1.01    200. 
 6 sigma_e     0.275    2.68e-1 5.55e-2 5.31e-2  1.99e-1  3.77e-1  1.00   3843. 
 7 a1          0.112    1.12e-1 1.15e-2 1.17e-2  9.33e-2  1.32e-1  1.00   5467. 
 8 b1          0.105    1.05e-1 2.21e-2 2.21e-2  6.94e-2  1.41e-1  1.00   1531. 
 9 g1         -0.00380 -3.27e-3 1.58e-2 1.58e-2 -2.98e-2  2.16e-2  1.00   1623. 
10 e1         -0.299   -2.98e-1 3.55e-2 3.45e-2 -3.58e-1 -2.42e-1  1.00   1856. 
11 mu_a        0.869    8.69e-1 2.23e-2 2.21e-2  8.33e-1  9.06e-1  1.01   1017. 
12 mu_b        3.86     3.86e+0 4.36e-2 4.37e-2  3.79e+0  3.93e+0  1.01    514. 
13 mu_g        0.480    4.79e-1 3.13e-2 3.05e-2  4.32e-1  5.34e-1  1.02    487. 
14 mu_e        0.241    2.43e-1 6.63e-2 6.40e-2  1.28e-1  3.49e-1  1.00   3324. 
15 a0[1]       0.878    8.76e-1 3.42e-2 3.04e-2  8.26e-1  9.36e-1  1.00   2078. 
# ℹ 1 more variable: ess_tail <dbl>

bp3 <- bayesplot::mcmc_dens_overlay(samp_m1, pars = plotpars)
plot(bp3)

Priors and Posteriors

Next, we’ll compare prior and posterior distributions. This can give an indication if the priors were selected well or are too broad or overly influential. To be able to show priors, we needed all that extra information in the generated quantities block in the Stan code.

# data manipulation to get in shape for plotting
postdf1 <- samp_m1 %>% 
  select(!ends_with('prior')) %>% 
  select(!starts_with(".")) %>% 
  select(-"lp__") %>% 
  select(!contains("[")) 
# awkward way of getting some further parameters
# namely values from first individual for a0,b0,g0,e0
postdf2 <- samp_m1 %>%
           select(contains("0[1]")) %>%
           rename_with(~ gsub("[1]", "", .x, fixed = TRUE) )
postdf <- cbind(postdf1, postdf2) 
priordf <-  samp_m1 %>% 
  select(ends_with('prior')) %>% 
  rename_with(~ gsub("_prior", "", .x, fixed = TRUE) ) 
postlong <- tidyr::pivot_longer(data = postdf, cols = everything() , names_to = "parname", values_to = "value") %>% mutate(type = "posterior")
priorlong <- tidyr::pivot_longer(data = priordf, cols = everything() , names_to = "parname", values_to = "value") %>% mutate(type = "prior")
ppdf <- dplyr::bind_rows(postlong,priorlong)

m1_p1 <- ppdf %>%
  ggplot() +
  geom_density(aes(x = value, color = type), linewidth = 1) +
  facet_wrap("parname", scales = "free") +
  theme_minimal()
plot(m1_p1)

The plots show that the priors don’t overly influence the results, the data dominates the posteriors (they move and get more peaked, an indication that the data controls the posterior shape).

Observed versus predicted

Another useful plot is to look at observed versus predicted results. This is shown in the following plot.

ypred_df <- samp_m1 %>% select(starts_with("ypred"))
m1_p2 <- bayesplot::ppc_dens_overlay(fitdat$outcome, as.matrix(ypred_df))
plot(m1_p2)

# just to get a picture that can be shown together with the post
#ggsave("featured.png",m1_p2)

The data (black line, \(y\) variable) and the model (thin green line, \(y_{rep}\)) are a bit off. That indicates that the model didn’t fully capture the patterns found in the data and might need modification.

Cross-validation tests

Here’s again some further exploration via cross-validation with the loo package.

# uses loo package 
loo_m1 <- res_m1$loo(cores = fs_m1$chains, save_psis = TRUE)
print(loo_m1)


Computed from 10000 by 264 log-likelihood matrix

         Estimate   SE
elpd_loo   -862.2 11.2
p_loo        29.7  3.4
looic      1724.3 22.3
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     249   94.3%   812       
 (0.5, 0.7]   (ok)        13    4.9%   303       
   (0.7, 1]   (bad)        2    0.8%   161       
   (1, Inf)   (very bad)   0    0.0%   <NA>      
See help('pareto-k-diagnostic') for details.

plot(loo_m1)

Some values aren’t too great. This again suggests that we need to tweak the model or run it longer with more stringent settings.

Here’s some more LOO diagnostics.

ypred_df <- samp_m1 %>% select(starts_with("ypred"))
m1_p3 <- bayesplot::ppc_loo_pit_overlay(
  y = fitdat$outcome,
  yrep = as.matrix(ypred_df),
  lw = weights(loo_m1$psis_object)
)
plot(m1_p3)

The marginal posterior predictive plot suggests some improvement might be possible (so that the solid line is more on top of the green lines). See here for more.

Model predictions

Finally, we again want to look at the actual data and the model predictions. It’s exactly the same code as for the 2-parameter model.

# averages and CI for 
# estimates of deterministic model trajectory
# for each observation
# this is computed in the transformed parameters block of the Stan code
mu <- samp_m1 |>
  select(starts_with("virus_pred")) |>
  apply(2, quantile, c(0.05, 0.5, 0.95)) |>
  t() 
rownames(mu) <- NULL

# estimate and CI for prediction intervals
# the predictions factor in additional uncertainty around the mean (mu)
# as indicated by sigma
# this is computed in the predicted-quantities block of the Stan code
# the average of mu and preds should be more or less the same
# but preds will have wider uncertainty due to the residual variation sigma
preds <- samp_m1 |>
  select(starts_with("ypred")) |>
  apply(2, quantile, c(0.05, 0.5, 0.95)) |>
  t() 
rownames(preds) <- NULL

# change dose so it looks nicer in plot
dose = as.factor(fitdat$dose_adj)
levels(dose)[1] <- "low"
levels(dose)[2] <- "medium"
levels(dose)[3] <- "high"

#place everything into a data frame
fitpred = data.frame(id = as.factor(fitdat$id),
                     dose = dose,
                     time = fitdat$time,
                     Outcome = fitdat$outcome,
                     Estimate = mu[,2],
                     Qmulo = mu[,1], Qmuhi = mu[,3],
                     Qsimlo = preds[,1], Qsimhi = preds[,3]
)

#make the plot
#not including the prediction intervals since it looks too messy
predplot <- ggplot(data = fitpred, aes(x = time, y = Estimate, group = id, color = dose ) ) +
  geom_line() +
  geom_ribbon(aes(x=time, ymin=Qmulo, ymax=Qmuhi, fill = dose, color = NULL), alpha=0.3, show.legend = F) +
  #geom_ribbon(aes(x=time, ymin=Qsimlo, ymax=Qsimhi, fill = dose, color = NULL), alpha=0.1, show.legend = F) +
  geom_point(aes(x = time, y = Outcome, group = id, color = dose), shape = 1, size = 2, stroke = 2) +
  scale_y_continuous(limits = c(-30,50)) +
  labs(y = "Virus load",
       x = "days post infection") +
  theme_minimal() 
plot(predplot)

The model fits don’t look that great. I’m not sure why the model isn’t capturing the data better, if it needs more iterations and/or better choice of priors, or if there’s something inherent about the structure of this model not being able to capture the data generated by the 2-parameter model, even though it is more flexible with its 4 parameters. I suspect it’s the last one, it seems even though this new model has more parameters, and thus should be more flexible, it can’t quite capture the more gentle increase and decrease of the data generated by the simpler model.

In a reasearch/real-world setting, I would try to explore the model, priors, algorithm to make sure 1) I don’t have any remaining mistakes in my code️ 🙄, 2) The problem is not due to poor choice of priors or starting values, 3) the hierarchical structure is not constraining the model too much. If those are ruled out, change the model to something that is better able to capture the patterns seen in the data.

In fact, I did a bit of that and figured I’ll show it too.

Alternative model

Since the previous model isn’t fitting too well, I wanted to understand a bit more why that might be. I decided to implement a simpler model. It’s the same 4-parameter equation for the virus load as above, but with a simpler parameter structure. I only used individual-level parameters which are independent of each other, and there’s no dependence on dose. The complete independence of each parameter from each other is expected to make the model too flexible and overfit. The dose dependence is the scientific question, so removing them makes the model scientifically pointless. But I wanted to see if a model that basically allows each individual to have their completely independent set of main model parameters could get me something that matches the data closer. In some sense, this is the most flexible model I can make. If that doesn’t work, it suggests the underlying model for the virus load is not able to capture the pattern seen here.

This is the updated model. I would of course not need both the \(\alpha, \beta, ...\) and \(a_0, b_0,...\) parameters since they are the same here and just duplicates. I decided to keep it anyway so it’s easier to compare to the above model - and I had to make fewer changes to the code 😁.

\[ \begin{aligned} \textrm{Outcome} \\ Y_{i,t} \sim \mathrm{Normal}\left(\mu_{i,t}, \sigma\right) \\ \\ \textrm{main model describing the virus trajectory} \\ \mu_{i,t} = \log\left( \frac{2 \exp(\alpha_{i})}{e^{-\exp(\beta_{i}) (t_i - \exp(\gamma_{i}))} + e^{\exp(\eta_{i}) (t_i - \exp(\gamma_{i}))}}\right).\\ \\ \textrm{Deterministic models for main parameters} \\ \alpha_{i} = a_{0,i} \\ \beta_{i} = b_{0,i} \\ \gamma_{i} = g_{0,i} \\ \eta_{i} = e_{0,i} \\ \\ \textrm{Priors} \\ a_{0,i} \sim \mathrm{Normal}(X, X) \\ b_{0,i} \sim \mathrm{Normal}(X, X) \\ g_{0,i} \sim \mathrm{Normal}(X, X) \\ e_{0,i} \sim \mathrm{Normal}(X, X) \\ \sigma \sim \mathrm{HalfCauchy}(X) \\ \end{aligned} \]

Alternative model implementation

Here’s the code to set up and run this model. We of course have a new Stan model and need to pick new priors and new initial conditions.

############################
# Try an alternative model
############################
filepath = fs::path("D:","Dropbox","datafiles","longitudinalbayes")
filename_simple = "cmdstanr4par-simple.Rds"
stanfile <- here('posts','2024-02-16-longitudinal-multilevel-bayes-6',"stancode-4par-simple.stan")
rngseed = 12345

# need to update priors
priorvals2 = list(a0_mu = 1, a0_sd = 1,
                 b0_mu = 1, b0_sd = 1,
                 g0_mu = 2.5, g0_sd = 1,
                 e0_mu = 0.3, e0_sd = 1
                )
fitdat2 = c(fitdatbase,priorvals2)

# need different initial values
set.seed(rngseed) #make inits reproducible
init_vals_1chain <- function() (list(a0_mu = runif(Nind,1,1), 
                                    b0_mu = runif(Nind,0.8,1.2),
                                     g0_mu = runif(Nind,2,3),
                                     e0_mu = runif(Nind,0,0.5),
                                   
                                     a0_sd = runif(Nind,1,1),
                                     b0_sd = runif(Nind,1,1),
                                     g0_sd = runif(Nind,1,1),
                                     e0_sd = runif(Nind,1,1),
                                     sigma = runif(1,0,1))
                                )
inits = NULL
for (n in 1:fs_m1$chains)
{
  inits[[n]] = init_vals_1chain()
}
fs_m1$init = inits

Stan Code

This is the updated Stan code. Here is the file.

//
// Stan code for fitting a hierarchical model to time-series data
//

data{
   int<lower = 1> Ntot; //number of observations
   int<lower = 1> Nind; //number of individuals
   vector[Ntot] outcome; //virus load
   vector[Ntot] time; // times at which virus load is measured
   vector[Nind] dose_adj; //dose after adjustment, 1 value per individual
   array[Ntot] int id;  //vector of person IDs to keep track which data points belong to whom
   //everything below are variables that contain values for prior distributions
   real a0_mu; 
   real b0_mu;
   real g0_mu;
   real e0_mu;
   real a0_sd;
   real b0_sd;
   real g0_sd;
   real e0_sd;
}

parameters{
    // population variance
    real<lower=0> sigma;
    // individual-level  parameters
    vector[Nind] a0;
    vector[Nind] b0;
    vector[Nind] g0;
    vector[Nind] e0;
}


// Generated/intermediate parameters
transformed parameters{

    // predicted virus load from model
    vector[Ntot] virus_pred; 
    // main model parameters
    // each individual has their potentially own value 
    // I'm removing the last letter from each variable name
    // just to avoid potential conflicts with bulit-in names of Stan
    // such as beta for a beta distribution
    // it might not matter, but I wanted to be safe
    vector[Nind] alph;
    vector[Nind] bet;
    vector[Nind] gamm;
    vector[Nind] et;
    // intermediate variables for denominators
    // just to make equation easier to understand
    real d1; 
    real d2; 

    // compute main model parameters
    for ( i in 1:Nind ) {
        alph[i] = exp(30*a0[i]);
        bet[i] = exp(b0[i]);
        gamm[i] = exp(g0[i]);
        et[i] = exp(e0[i]);
    }
    
    // loop over all observations
    // since parameters are saved in vectors of length corresponding to number of individuals
    // we need to index with that extra id[i] notation
    for (i in 1:Ntot)
    {
      d1 = exp( -bet[id[i]] * (time[i] - gamm[id[i]]) );
      d2 = exp(   et[id[i]] * (time[i] - gamm[id[i]]) ); 
      virus_pred[i] = log(  2*alph[id[i]] / ( d1 + d2) );
     }

} // end transformed parameters block



model{

    // individual variation of each ODE model parameter
    a0 ~ normal( a0_mu , a0_sd );
    b0 ~ normal( b0_mu , b0_sd );
    g0 ~ normal( g0_mu , g0_sd );
    e0 ~ normal( e0_mu , e0_sd );

    // residual population variation
    sigma ~ cauchy(0, 1); 

    // distribution of outcome (virus load)
    // all computations to get the time-series trajectory for the outcome are done  
    // inside the transformed parameters block
    outcome ~ normal( virus_pred , sigma );
}

// for model diagnostics and exploration
generated quantities {
    // define quantities that are computed in this block
    vector[Ntot] ypred;
    vector[Ntot] log_lik;
    //real<lower=0> sigma_prior;
    real sigma_prior;
    real a0_prior;
    real b0_prior;
    real g0_prior;
    real e0_prior;
        
    // this is so one can plot priors and compare with posterior later   
    // simulate the priors
    sigma_prior = abs(cauchy_rng(0, 1));
    a0_prior = normal_rng( a0_mu , a0_sd);
    b0_prior = normal_rng( b0_mu , b0_sd);
    g0_prior = normal_rng( g0_mu , g0_sd);
    e0_prior = normal_rng( e0_mu , e0_sd);


  // compute log-likelihood and predictions
    for(i in 1:Ntot)
    {
      log_lik[i] = normal_lpdf(outcome[i] | virus_pred[i], sigma);
      ypred[i] = normal_rng(virus_pred[i], sigma);
    }
} //end generated quantities block

Compiling the Stan model.

# make Stan model. 
stanmod2 <- cmdstanr::cmdstan_model(stanfile, 
                                    pedantic=TRUE, 
                                    force_recompile=TRUE)

Model run

This runs the model and saves results.

# most of it the same as previously
res_m2 <- stanmod2$sample(data = fitdat2,
                          chains = fs_m1$chains,
                          init = fs_m1$init,
                          seed = rngseed,
                          parallel_chains = fs_m1$chains,
                          iter_warmup = fs_m1$warmup,
                          iter_sampling = fs_m1$sampling,
                          save_warmup = fs_m1$save_warmup,
                          max_treedepth = fs_m1$max_td,
                          adapt_delta = fs_m1$adapt_delta,
                          output_dir = filepath
)

res_m2$save_object(fs::path(filepath,filename_simple))

Result exploration

Load results.

res_m2 <- readRDS(fs::path(filepath,filename_simple))

Get the samples.

#this uses the posterior package to get draws
samp_m2 <- res_m2$draws(inc_warmup = FALSE, format = "draws_df")
allsamp_m2 <- res_m2$draws(inc_warmup = TRUE, format = "draws_df")

Diagnostics. Still not quite right.

res_m2$cmdstan_diagnose()

Processing csv files: D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-simple-202402271836-1-5ba055.csvWarning: non-fatal error reading adaptation data
, D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-simple-202402271836-2-5ba055.csvWarning: non-fatal error reading adaptation data
, D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-simple-202402271836-3-5ba055.csvWarning: non-fatal error reading adaptation data
, D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-simple-202402271836-4-5ba055.csvWarning: non-fatal error reading adaptation data
, D:/Dropbox/datafiles/longitudinalbayes/stancode-4par-simple-202402271836-5-5ba055.csvWarning: non-fatal error reading adaptation data


Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Effective sample size satisfactory.

Split R-hat values satisfactory all parameters.

Processing complete, no problems detected.

A few plots. They look fairly reasonable.

# only a few parameters
plotpars = c("a0[1]","b0[1]","g0[1]","e0[1]","sigma")
bayesplot::color_scheme_set("viridis")
bp1 <- bayesplot::mcmc_trace(samp_m2, pars = plotpars)
bp3 <- bayesplot::mcmc_dens_overlay(samp_m2, pars = plotpars)
plot(bp1)

plot(bp3)

Comparing priors and posteriors.

# data manipulation to get in shape for plotting
# start with manipulation of posterior parameters
postdf1 <- samp_m2 %>% 
  select(!ends_with('prior')) %>% 
  select(contains("sigma")) 
# akward way of getting some further parameters
# namely values from first individual for a0,b0,g0,e0
postdf2 <- samp_m2 %>%
  select(contains("0[1]")) %>%
  rename_with(~ gsub("[1]", "", .x, fixed = TRUE) )
postdf <- cbind(postdf1, postdf2) 
postlong <- tidyr::pivot_longer(data = postdf, 
                                cols = everything() , 
                                names_to = "parname", 
                                values_to = "value") %>% 
  dplyr::mutate(type = "posterior")
# manipulation of prior parameters
priordf <-  samp_m2 %>% 
  select(ends_with('prior')) %>% 
  rename_with(~ gsub("_prior", "", .x, fixed = TRUE) ) 
priorlong <- tidyr::pivot_longer(data = priordf, 
                                 cols = everything() , 
                                 names_to = "parname", 
                                 values_to = "value") %>% 
                        dplyr::mutate(type = "prior")

ppdf <- dplyr::bind_rows(postlong,priorlong)

m2_p1 <- ppdf %>%
  ggplot() +
  geom_density(aes(x = value, color = type), linewidth = 1) +
  facet_wrap("parname", scales = "free") +
  theme_minimal()
plot(m2_p1)

Skipping over all the other diagnostics plot we did above.

Outcome predictions

Going straight to plotting of data and model predictions.

mu2 <- samp_m2 |>
  select(starts_with("virus_pred")) |>
  apply(2, quantile, c(0.05, 0.5, 0.95)) |>
  t() 
rownames(mu) <- NULL
preds2 <- samp_m2 |>
  select(starts_with("ypred")) |>
  apply(2, quantile, c(0.05, 0.5, 0.95)) |>
  t() 
rownames(preds) <- NULL

# change dose so it looks nicer in plot
dose = as.factor(fitdat2$dose_adj)
levels(dose)[1] <- "low"
levels(dose)[2] <- "medium"
levels(dose)[3] <- "high"
fitpred2 = data.frame(id = as.factor(fitdat2$id),
                     dose = dose,
                     time = fitdat2$time,
                     Outcome = fitdat2$outcome,
                     Estimate = mu2[,2],
                     Qmulo = mu2[,1], Qmuhi = mu2[,3],
                     Qsimlo = preds2[,1], Qsimhi = preds2[,3]
)

#make the plot
#not including the CI or prediction intervals since it looks too messy
predplot2 <- ggplot(data = fitpred2, aes(x = time, y = Estimate, group = id, color = dose ) ) +
  geom_line() +
  #geom_ribbon(aes(x=time, ymin=Qmulo, ymax=Qmuhi, fill = dose, color = NULL), alpha=0.3, show.legend = F) +
  #geom_ribbon(aes(x=time, ymin=Qsimlo, ymax=Qsimhi, fill = dose, color = NULL), alpha=0.1, show.legend = F) +
  geom_point(aes(x = time, y = Outcome, group = id, color = dose), shape = 1, size = 2, stroke = 2) +
  scale_y_continuous(limits = c(-30,50)) +
  labs(y = "Virus load",
       x = "days post infection") +
  theme_minimal() 
plot(predplot2)

This doesn’t really look any better compared to the other model. So I – preliminarily – conclude that the main 4-parameter deterministic model I use here to describe the virus load trajectories just can’t capture the shape of the data generated by the simpler 2-parameter model.

Summary and continuation

This completes the 4-parameter model. I would not use this model if I really wanted to fit the data here. But I just wanted to get a working model, I’m not really interested in the model results. I just wanted to set the stage for the next version, which is the 4-parameter ODE model. So it’s finally time to tackle that one.