The problem with shortitudinal data

When the number of observations per cluster is very small, the estimate of the compositional effect is biased: it is attenuated (i.e. biased towards 0) because the mean of the observed x’s in each cluster measures the ‘true mean x’ of that cluster with error. Fortunately, the amount of error can be estimated from the within-cluster variability in x, and we can use this to obtain an unbiased estimate by taking the measurement error into account.
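In symbols (a sketch of the standard measurement-error argument, added here for reference): with \(n\) observations per cluster, the observed cluster mean measures the true mean with error variance \(\sigma^2_{within}/n\), so its reliability as a measure of the true mean is

\[
\lambda = \frac{\sigma^2_{between}}{\sigma^2_{between} + \sigma^2_{within}/n}
\]

and the estimated coefficient of the observed cluster mean tends to be roughly \(\lambda \beta_{comp} + (1 - \lambda)\beta_{within}\) rather than \(\beta_{comp}\). Since \(\sigma^2_{within}\) can be estimated from the data, so can the size of the bias.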

This is a simulation to illustrate this bias and to show how to correct for it using a measurement error model in Stan.

Simulating a data set

We will use an example that parallels the mathach-ses example but with very small samples from each school: \(n=2\) observations per school. The compositional effect is 1 and the within effect is \(-0.5\), so the contextual effect is 1.5.
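For reference, these three effects are linked by the identity that the Stan models below use to define b_compositional:

\[
\beta_{compositional} = \beta_{within} + \beta_{contextual}, \qquad 1 = -0.5 + 1.5 .
\]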

This example also illustrates how to simulate a multilevel data set, which can be useful for power analyses; a small sketch of that idea follows the simulated data below.

Loading packages:

library(spida2)
library(magrittr, pos = 1000) 
library(lattice)
library(latticeExtra)
Loading required package: RColorBrewer

Parameters for data

J <- 1000 # number of schools
n <- 2   # cluster size
N <- J*n  # number of observations
beta_comp <- 1
beta_within <- -.5

A simulation function

Wrapping the code that generates the data set in a function makes it easy to rerun.

sim <- function() {  # put everything in a function so we can repeat easily
  # note that assignments inside the function don't change anything
  # outside the function
  sim_school <- data.frame(xmean = 3 * rnorm(J), id = 1:J) # bad style -- should pass J, etc., as arguments
  sim <- data.frame(xdev = rnorm(N), id = rep(1:J, each = n))
  sim <- merge(sim, sim_school)
  sim <- within(sim, {  # note the curly brace in which statements are listed
    x <- xmean +  xdev
    y <- beta_comp * xmean + beta_within * xdev + rnorm(N)
    }
  )
  sim <- within(sim, {  # drop the 'invisible' variables that would not be observed in real data
    xdev <- NULL
    xmean <- NULL
  })
  sim # return the simulated data frame
}

set.seed(2132453)  # set the seed so the results are reproducible
dd <- sim()
head(dd)
  id          y          x
1  1  0.2136943  0.9763922
2  1  0.2932619  1.2347078
3  2  0.8467329 -0.1830278
4  2  1.7769236 -0.8847983
5  3 -1.5670531 -3.4127757
6  3 -4.0563064 -2.0009708
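
Since sim() generates a fresh data set on every call, a power analysis is just a loop: simulate, fit, and tabulate how often the effect of interest is detected. A minimal sketch (not run here; it uses the same lme model that is fit in the next section, and power_check is a hypothetical helper name):

library(nlme)  # for lme(); spida2, loaded above, supplies cvar() and dvar()

# Hypothetical helper: proportion of simulated data sets in which the
# compositional (cvar) effect is significant at the 5% level
power_check <- function(R = 20) {
  hits <- replicate(R, {
    d <- sim()
    f <- lme(y ~ cvar(x, id) + dvar(x, id), data = d, random = ~ 1 | id)
    summary(f)$tTable["cvar(x, id)", "p-value"] < 0.05
  })
  mean(hits)
}
# power_check()  # slow: refits the model R times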

Multilevel analysis using group mean and CWG (centered-within-group) variables

library(nlme)

Attaching package: 'nlme'
The following object is masked from 'package:spida2':

    getData
fit <- lme(y ~ cvar(x,id) + dvar(x,id), dd, random = ~ 1 + dvar(x,id) | id)
summary(fit)
Linear mixed-effects model fit by REML
 Data: dd 
       AIC      BIC    logLik
  6913.916 6953.112 -3449.958

Random effects:
 Formula: ~1 + dvar(x, id) | id
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev    Corr  
(Intercept) 1.0697037 (Intr)
dvar(x, id) 0.1626571 -0.051
Residual    0.9979500       

Fixed effects: y ~ cvar(x, id) + dvar(x, id) 
                 Value  Std.Error  DF   t-value p-value
(Intercept) -0.0096665 0.04052456 999  -0.23853  0.8115
cvar(x, id)  0.9139934 0.01333461 998  68.54292  0.0000
dvar(x, id) -0.5116809 0.03269087 999 -15.65210  0.0000
 Correlation: 
            (Intr) cv(,i)
cvar(x, id)  0.004       
dvar(x, id) -0.007  0.000

Standardized Within-Group Residuals:
          Min            Q1           Med            Q3           Max 
-3.3022735831 -0.5365572270  0.0008671576  0.5252513760  2.8892932280 

Number of Observations: 2000
Number of Groups: 1000 

Biased estimates of compositional effects

wald(fit)
 numDF denDF  F.value p.value
     3   998 1647.544 <.00001
             
Coefficients   Estimate Std.Error  DF    t-value p-value Lower 0.95
  (Intercept) -0.009666  0.040525 999  -0.238534 0.81152  -0.089190
  cvar(x, id)  0.913993  0.013335 998  68.542920 <.00001   0.887826
  dvar(x, id) -0.511681  0.032691 999 -15.652105 <.00001  -0.575832
             
Coefficients  Upper 0.95
  (Intercept)   0.069857
  cvar(x, id)   0.940161
  dvar(x, id)  -0.447530

Note that the ‘within’ CI (about \(-0.58\) to \(-0.45\)) covers the true value of \(-0.5\), but the compositional CI (about \(0.89\) to \(0.94\)) is far from the true value of 1.

Why? Measurement error: cvar(x, id) equals the ‘true’ xmean plus error arising from the random selection of only 2 observations per school.

Why doesn’t this affect dvar(x, id)? Because x is the ‘true’ x for the individual, and dvar(x, id) is orthogonal to cvar(x, id).
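
A quick back-of-the-envelope check (a sketch using the simulation parameters above, not part of the original post): the observed school mean has reliability \(\lambda = \sigma^2_{between} / (\sigma^2_{between} + \sigma^2_{within}/n)\), and the cvar coefficient is pulled from the compositional effect toward the within effect by roughly that factor.

var_between <- 9   # xmean = 3 * rnorm(J), so the between-school variance is 9
var_within  <- 1   # xdev = rnorm(N), so the within-school variance is 1
lambda <- var_between / (var_between + var_within / n)   # n = 2, so lambda is about 0.947
lambda * beta_comp + (1 - lambda) * beta_within          # about 0.92, close to the 0.914 estimated above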

Can we fix this? Let’s try a model in Stan that treats the ‘true’ xmean as a latent variable.

Stan model with measurement error model for ‘contextual’ variable

library(rstan)
Loading required package: ggplot2

Attaching package: 'ggplot2'
The following object is masked from 'package:latticeExtra':

    layer
The following object is masked from 'package:spida2':

    labs
Loading required package: StanHeaders
rstan (Version 2.14.2, packaged: 2017-03-19 00:42:29 UTC, GitRev: 5fa1e80eb817)
For execution on a local, multicore CPU with excess RAM we recommend calling
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

Attaching package: 'rstan'
The following object is masked from 'package:magrittr':

    extract
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
print(readLines('shortitudinal_1.stan'), quote = F)
 [1] data {                                                                        
 [2]   int N;                                                                      
 [3]   int J;                                                                      
 [4]   int id[N];                                                                  
 [5]   vector[N] x;                                                                
 [6]   vector[N] y;                                                                
 [7] }                                                                             
 [8] parameters {                                                                  
 [9]   real b_contextual;                                                          
[10]   real b_within;                                                              
[11]   real alpha;                                                                 
[12]   real<lower=0> sigma;                                                        
[13]   real<lower=0> sigma_u;                                                      
[14]   real<lower=0> sigma_x_between;                                              
[15]   real<lower=0> sigma_x_within;                                               
[16]   real mu_x;                                                                  
[17]   vector[J] u;  // random intercept                                           
[18]   vector[J] xmean;                                                            
[19] }                                                                             
[20] transformed parameters {                                                      
[21]   real b_compositional;                                                       
[22]   b_compositional = b_within + b_contextual;                                  
[23] }                                                                             
[24] model {                                                                       
[25]   xmean ~ normal(mu_x, sigma_x_between);                                      
[26]   x ~ normal(xmean[id], sigma_x_within);                                      
[27]   u ~ normal(0, sigma_u);                                                     
[28]   y ~ normal(alpha + u[id] + b_contextual * xmean[id] + b_within * x, sigma );
[29] }                                                                             

We can write a quick function to make displaying Stan files easier.

pr <- function(x) print(readLines(x), quote = FALSE) # write a function for repetitive tasks
stanfile <- 'shortitudinal_1.stan' 
# pr(stanfile)
system.time(
  shor_1_dso <- stan_model(stanfile)
)
   user  system elapsed 
  0.740   0.043   1.495 

Let’s try it on our data.

head(dd)
  id          y          x
1  1  0.2136943  0.9763922
2  1  0.2932619  1.2347078
3  2  0.8467329 -0.1830278
4  2  1.7769236 -0.8847983
5  3 -1.5670531 -3.4127757
6  3 -4.0563064 -2.0009708
dat <- list( N = nrow(dd),
             id = idn <- as.numeric(as.factor(dd$id)),
             J = max(idn),
             x = dd$x,
             y = dd$y
            )
mod <- sampling(shor_1_dso, dat)
Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
traceplot(mod)
'pars' not specified. Showing first 10 parameters by default.

Note the irony that the ‘true’ sigma_u is zero, and that is the parameter giving us trouble.

pars <- grepv('^u|^xm',names(mod), invert = T)
traceplot(mod, pars = pars)

pairs(mod, pars = pars)

But look at b_contextual and b_compositional! The bias is gone.

print(mod, pars = pars)
Inference for Stan model: shortitudinal_1.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                    mean se_mean     sd     2.5%      25%      50%
b_contextual        1.49    0.00   0.03     1.43     1.47     1.49
b_within           -0.50    0.00   0.03    -0.56    -0.52    -0.50
alpha              -0.01    0.00   0.04    -0.09    -0.04    -0.01
sigma               1.01    0.00   0.02     0.97     1.00     1.01
sigma_u             0.33    0.02   0.10     0.10     0.26     0.33
sigma_x_between     2.96    0.00   0.07     2.83     2.91     2.96
sigma_x_within      1.00    0.00   0.02     0.96     0.99     1.00
mu_x               -0.01    0.00   0.09    -0.19    -0.07    -0.01
b_compositional     0.99    0.00   0.01     0.96     0.98     0.99
lp__            -2933.67   79.71 341.76 -3450.65 -3170.78 -2984.00
                     75%    97.5% n_eff Rhat
b_contextual        1.52     1.56   171 1.02
b_within           -0.48    -0.45   261 1.01
alpha               0.02     0.07  2740 1.00
sigma               1.02     1.05  2308 1.00
sigma_u             0.40     0.55    21 1.14
sigma_x_between     3.01     3.11  4000 1.00
sigma_x_within      1.01     1.04   336 1.01
mu_x                0.05     0.18  4000 1.00
b_compositional     1.00     1.02   641 1.01
lp__            -2756.15 -1849.45    18 1.17

Samples were drawn using NUTS(diag_e) at Tue Mar 28 15:33:49 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Dealing with variance parameters that are close to 0 without dropping them

Here, we’ll put a gamma prior on sigma_u. Passing the prior’s parameters in through the data block lets us experiment with different priors without having to recompile the model.
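
To see what a given choice of gamma_par and gamma_scale implies, we can look at the prior directly in R before sampling. A quick sketch (the values match the ones passed to sampling() below; in the Stan model, gamma(gamma_par, 1/gamma_scale) has shape gamma_par and rate 1/gamma_scale, so the prior mean is gamma_par * gamma_scale):

gamma_par <- 4
gamma_scale <- .2
gamma_par * gamma_scale   # prior mean for sigma_u: 0.8
qgamma(c(.025, .5, .975), shape = gamma_par, rate = 1/gamma_scale)  # prior quantiles for sigma_u
# curve(dgamma(x, shape = gamma_par, rate = 1/gamma_scale), 0, 3)   # prior density, if desired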

stanfile <- 'shortitudinal_2.stan' 
pr(stanfile)
 [1] data {                                                                        
 [2]   int N;                                                                      
 [3]   int J;                                                                      
 [4]   int id[N];                                                                  
 [5]   vector[N] x;                                                                
 [6]   vector[N] y;                                                                
 [7]   int gamma_par;                                                              
 [8]   real gamma_scale;                                                           
 [9] }                                                                             
[10] parameters {                                                                  
[11]   real b_contextual;                                                          
[12]   real b_within;                                                              
[13]   real alpha;                                                                 
[14]   real<lower=0> sigma;                                                        
[15]   real<lower=0> sigma_u;                                                      
[16]   real<lower=0> sigma_x_between;                                              
[17]   real<lower=0> sigma_x_within;                                               
[18]   real mu_x;                                                                  
[19]   vector[J] u;  // random intercept                                           
[20]   vector[J] xmean;                                                            
[21] }                                                                             
[22] transformed parameters {                                                      
[23]   real b_compositional;                                                       
[24]   b_compositional = b_within + b_contextual;                                  
[25] }                                                                             
[26] model {                                                                       
[27]   sigma_u ~ gamma(gamma_par, 1/gamma_scale); // prior for sigma_u             
[28]   xmean ~ normal(mu_x, sigma_x_between);                                      
[29]   x ~ normal(xmean[id], sigma_x_within);                                      
[30]   u ~ normal(0, sigma_u);                                                     
[31]   y ~ normal(alpha + u[id] + b_contextual * xmean[id] + b_within * x, sigma );
[32] }                                                                             
system.time(
  shor_2_dso <- stan_model(stanfile)
)
   user  system elapsed 
  0.084   0.028   0.820 
mod2 <- sampling(shor_2_dso, c(dat, gamma_par = 4, gamma_scale = .2))
Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
pars <- grepv('^u|^xm',names(mod2), invert = T)

Note how the log-posterior trace is almost an inversion of the sigma_u trace.

traceplot(mod2, pars = pars)

pairs(mod2, pars = pars)

Bias is gone!

print(mod2, pars = pars)
Inference for Stan model: shortitudinal_2.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

                    mean se_mean     sd     2.5%      25%      50%
b_contextual        1.49    0.00   0.03     1.42     1.47     1.49
b_within           -0.50    0.00   0.03    -0.56    -0.52    -0.50
alpha              -0.01    0.00   0.04    -0.09    -0.04    -0.01
sigma               1.01    0.00   0.02     0.97     0.99     1.01
sigma_u             0.35    0.02   0.10     0.20     0.28     0.34
sigma_x_between     2.96    0.00   0.07     2.83     2.92     2.96
sigma_x_within      1.00    0.00   0.02     0.96     0.98     1.00
mu_x               -0.01    0.00   0.10    -0.20    -0.07    -0.01
b_compositional     0.99    0.00   0.02     0.96     0.98     0.99
lp__            -3021.71   53.85 265.10 -3503.88 -3219.00 -3027.16
                     75%    97.5% n_eff Rhat
b_contextual        1.52     1.56   171 1.02
b_within           -0.48    -0.44   286 1.01
alpha               0.02     0.07  2572 1.00
sigma               1.02     1.05  2377 1.00
sigma_u             0.42     0.57    25 1.16
sigma_x_between     3.01     3.10  4000 1.00
sigma_x_within      1.01     1.04   261 1.02
mu_x                0.05     0.18  4000 1.00
b_compositional     1.00     1.02   751 1.01
lp__            -2849.75 -2498.54    24 1.18

Samples were drawn using NUTS(diag_e) at Tue Mar 28 15:35:29 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).