When the number of observations per cluster is very small there is a bias in the estimation of the compositional effect, which will be attenuated (i.e. biased towards 0) because the mean of the observed x’s in the sample constitutes a measurement of the ‘true mean x’ in each cluster with error. Fortunately the amount of error can be estimated from the within-cluster variability in x and we can use this to get an unbiased estimate by taking measurement error into account.
This is a simulation to illustrate these biases and to show how to correct for them using a measurement error model in Stan.
We will use an example that parallels the mathach-ses example but with very small samples from each school: \(n=2\). The compositional effect is 1 and the within effect is -.5. Thus the contextual effect is 1.5.
This example also illustrates how to simulate a multilevel data set, which can be useful for power analyses.
Loading packages:
library(spida2)
library(magrittr, pos = 1000)
library(lattice)
library(latticeExtra)
Loading required package: RColorBrewer
J <- 1000 # number of schools
n <- 2 # cluster size
N <- J*n # number of observations
beta_comp <- 1
beta_within <- -.5
Wrapping the code that generates the data set in a function makes it easy to rerun.
sim <- function() { # put everything in a function so we can repeat easily
# not that all assignments in the function don't change anything
# outside the function
sim_school <- data.frame(xmean = 3* rnorm(J), id = 1:J) # bad style -- should pass J, etc., as arguments
sim <- data.frame(xdev = rnorm(N), id = rep(1:J, each = n))
sim <- merge(sim, sim_school)
sim <- within(sim, { # note the curly brace in which statements are listed
x <- xmean + xdev
y <- beta_comp * xmean + beta_within * xdev + rnorm(N)
}
)
sim <- within(sim, { # dropping 'invisible' variables
xdev <- NULL
xmean <- NULL
})
sim # return dd
}
set.seed(2132453) # to make sure this looks the same
dd <- sim()
head(dd)
id y x
1 1 0.2136943 0.9763922
2 1 0.2932619 1.2347078
3 2 0.8467329 -0.1830278
4 2 1.7769236 -0.8847983
5 3 -1.5670531 -3.4127757
6 3 -4.0563064 -2.0009708
library(nlme)
Attaching package: 'nlme'
The following object is masked from 'package:spida2':
getData
fit <- lme(y ~ cvar(x,id) + dvar(x,id), dd, random = ~ 1 + dvar(x,id) | id)
summary(fit)
Linear mixed-effects model fit by REML
Data: dd
AIC BIC logLik
6913.916 6953.112 -3449.958
Random effects:
Formula: ~1 + dvar(x, id) | id
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 1.0697037 (Intr)
dvar(x, id) 0.1626571 -0.051
Residual 0.9979500
Fixed effects: y ~ cvar(x, id) + dvar(x, id)
Value Std.Error DF t-value p-value
(Intercept) -0.0096665 0.04052456 999 -0.23853 0.8115
cvar(x, id) 0.9139934 0.01333461 998 68.54292 0.0000
dvar(x, id) -0.5116809 0.03269087 999 -15.65210 0.0000
Correlation:
(Intr) cv(,i)
cvar(x, id) 0.004
dvar(x, id) -0.007 0.000
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-3.3022735831 -0.5365572270 0.0008671576 0.5252513760 2.8892932280
Number of Observations: 2000
Number of Groups: 1000
wald(fit)
numDF denDF F.value p.value
3 998 1647.544 <.00001
Coefficients Estimate Std.Error DF t-value p-value Lower 0.95
(Intercept) -0.009666 0.040525 999 -0.238534 0.81152 -0.089190
cvar(x, id) 0.913993 0.013335 998 68.542920 <.00001 0.887826
dvar(x, id) -0.511681 0.032691 999 -15.652105 <.00001 -0.575832
Coefficients Upper 0.95
(Intercept) 0.069857
cvar(x, id) 0.940161
dvar(x, id) -0.447530
Note that the ‘within’ CI covers the true value but the compositional CI is far from the true value.
Why? Measurement error: cvar(x, id) is equal to ‘true’ xmean + error from the random selection of 2 observations in the school.
Why doesn’t this affect dvar(x,id)? Because, x is ‘true’ x for the individual and dvar(x,id) is orthogonal to cvar(x,id).
Can we fix this? Let’s try a model in Stan that treats ‘true’ x_mean as a latent variable.
library(rstan)
Loading required package: ggplot2
Attaching package: 'ggplot2'
The following object is masked from 'package:latticeExtra':
layer
The following object is masked from 'package:spida2':
labs
Loading required package: StanHeaders
rstan (Version 2.14.2, packaged: 2017-03-19 00:42:29 UTC, GitRev: 5fa1e80eb817)
For execution on a local, multicore CPU with excess RAM we recommend calling
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
Attaching package: 'rstan'
The following object is masked from 'package:magrittr':
extract
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
print(readLines('shortitudinal_1.stan'), quote = F)
[1] data {
[2] int N;
[3] int J;
[4] int id[N];
[5] vector[N] x;
[6] vector[N] y;
[7] }
[8] parameters {
[9] real b_contextual;
[10] real b_within;
[11] real alpha;
[12] real<lower=0> sigma;
[13] real<lower=0> sigma_u;
[14] real<lower=0> sigma_x_between;
[15] real<lower=0> sigma_x_within;
[16] real mu_x;
[17] vector[J] u; // random intercept
[18] vector[J] xmean;
[19] }
[20] transformed parameters {
[21] real b_compositional;
[22] b_compositional = b_within + b_contextual;
[23] }
[24] model {
[25] xmean ~ normal(mu_x, sigma_x_between);
[26] x ~ normal(xmean[id], sigma_x_within);
[27] u ~ normal(0, sigma_u);
[28] y ~ normal(alpha + u[id] + b_contextual * xmean[id] + b_within * x, sigma );
[29] }
We can write a quick function to make this easier
pr <- function(x) print(readLines(x), quote = FALSE) # write a function for repetitive tasks
stanfile <- 'shortitudinal_1.stan'
# pr(stanfile)
system.time(
shor_1_dso <- stan_model(stanfile)
)
user system elapsed
0.740 0.043 1.495
Let’s try it on our data.
head(dd)
id y x
1 1 0.2136943 0.9763922
2 1 0.2932619 1.2347078
3 2 0.8467329 -0.1830278
4 2 1.7769236 -0.8847983
5 3 -1.5670531 -3.4127757
6 3 -4.0563064 -2.0009708
dat <- list( N = nrow(dd),
id = idn <- as.numeric(as.factor(dd$id)),
J = max(idn),
x = dd$x,
y = dd$y
)
mod <- sampling(shor_1_dso, dat)
Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
traceplot(mod)
'pars' not specified. Showing first 10 parameters by default.
Note the irony that ‘true sigma_u’ is zero and that’s the one giving us trouble
pars <- grepv('^u|^xm',names(mod), invert = T)
traceplot(mod, pars = pars)
pairs(mod, pars = pars)
But look at b_contextual and b_compositional! The bias is gone.
print(mod, pars = pars)
Inference for Stan model: shortitudinal_1.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50%
b_contextual 1.49 0.00 0.03 1.43 1.47 1.49
b_within -0.50 0.00 0.03 -0.56 -0.52 -0.50
alpha -0.01 0.00 0.04 -0.09 -0.04 -0.01
sigma 1.01 0.00 0.02 0.97 1.00 1.01
sigma_u 0.33 0.02 0.10 0.10 0.26 0.33
sigma_x_between 2.96 0.00 0.07 2.83 2.91 2.96
sigma_x_within 1.00 0.00 0.02 0.96 0.99 1.00
mu_x -0.01 0.00 0.09 -0.19 -0.07 -0.01
b_compositional 0.99 0.00 0.01 0.96 0.98 0.99
lp__ -2933.67 79.71 341.76 -3450.65 -3170.78 -2984.00
75% 97.5% n_eff Rhat
b_contextual 1.52 1.56 171 1.02
b_within -0.48 -0.45 261 1.01
alpha 0.02 0.07 2740 1.00
sigma 1.02 1.05 2308 1.00
sigma_u 0.40 0.55 21 1.14
sigma_x_between 3.01 3.11 4000 1.00
sigma_x_within 1.01 1.04 336 1.01
mu_x 0.05 0.18 4000 1.00
b_compositional 1.00 1.02 641 1.01
lp__ -2756.15 -1849.45 18 1.17
Samples were drawn using NUTS(diag_e) at Tue Mar 28 15:33:49 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Here, we’ll put a gamma prior on sigma_u. We can add parameters through data so we can experiment without having to recompile.
stanfile <- 'shortitudinal_2.stan'
pr(stanfile)
[1] data {
[2] int N;
[3] int J;
[4] int id[N];
[5] vector[N] x;
[6] vector[N] y;
[7] int gamma_par;
[8] real gamma_scale;
[9] }
[10] parameters {
[11] real b_contextual;
[12] real b_within;
[13] real alpha;
[14] real<lower=0> sigma;
[15] real<lower=0> sigma_u;
[16] real<lower=0> sigma_x_between;
[17] real<lower=0> sigma_x_within;
[18] real mu_x;
[19] vector[J] u; // random intercept
[20] vector[J] xmean;
[21] }
[22] transformed parameters {
[23] real b_compositional;
[24] b_compositional = b_within + b_contextual;
[25] }
[26] model {
[27] sigma_u ~ gamma(gamma_par, 1/gamma_scale); // prior for sigma_u
[28] xmean ~ normal(mu_x, sigma_x_between);
[29] x ~ normal(xmean[id], sigma_x_within);
[30] u ~ normal(0, sigma_u);
[31] y ~ normal(alpha + u[id] + b_contextual * xmean[id] + b_within * x, sigma );
[32] }
system.time(
shor_2_dso <- stan_model(stanfile)
)
user system elapsed
0.084 0.028 0.820
mod2 <- sampling(shor_2_dso, c(dat, gamma_par = 4, gamma_scale = .2))
Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
pars <- grepv('^u|^xm',names(mod2), invert = T)
Note how the log-posterior trace is almost an inversion of the sigma_u trace
traceplot(mod2, pars = pars)
pairs(mod2, pars = pars)
print(mod2, pars = pars)
Inference for Stan model: shortitudinal_2.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50%
b_contextual 1.49 0.00 0.03 1.42 1.47 1.49
b_within -0.50 0.00 0.03 -0.56 -0.52 -0.50
alpha -0.01 0.00 0.04 -0.09 -0.04 -0.01
sigma 1.01 0.00 0.02 0.97 0.99 1.01
sigma_u 0.35 0.02 0.10 0.20 0.28 0.34
sigma_x_between 2.96 0.00 0.07 2.83 2.92 2.96
sigma_x_within 1.00 0.00 0.02 0.96 0.98 1.00
mu_x -0.01 0.00 0.10 -0.20 -0.07 -0.01
b_compositional 0.99 0.00 0.02 0.96 0.98 0.99
lp__ -3021.71 53.85 265.10 -3503.88 -3219.00 -3027.16
75% 97.5% n_eff Rhat
b_contextual 1.52 1.56 171 1.02
b_within -0.48 -0.44 286 1.01
alpha 0.02 0.07 2572 1.00
sigma 1.02 1.05 2377 1.00
sigma_u 0.42 0.57 25 1.16
sigma_x_between 3.01 3.10 4000 1.00
sigma_x_within 1.01 1.04 261 1.02
mu_x 0.05 0.18 4000 1.00
b_compositional 1.00 1.02 751 1.01
lp__ -2849.75 -2498.54 24 1.18
Samples were drawn using NUTS(diag_e) at Tue Mar 28 15:35:29 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).