36Doing the same multi-level model with R and stata
Published
October 26, 2023
This post is a demo for doing the same multi-level models with R and stata. R’s multilevel models are mostly done with lme4 and nlme. Stata’s multilevel models are mostly done with mixed and menl. The two packages are not exactly the same. For example, mixed does not allow complicated structure for the residual. That’s why we need menl. Likewise, in R lme4’s “lmer” does not allow this specification. We have to use nlme. But it’s good to know how to do the same thing in both packages.
For example, we want to do a consensus emergence model (CEM) as described in Lang et al. 2012(https://core.ac.uk/download/pdf/158346339.pdf). The CEM model is basically a growth model with the residual being a exponential function of the time variable. We want to do the same thing in both R and stata.
We start with simpler models.
36.1 Stata and R syntax comparison for the same three level model
We run a three level model specified in stata with mixed and menl. In R with nlme and lme4. This is a random intercept model. Only intercept is varying for each unit at each level. In this data set supposedly state is nested within region. There are situations that levels are not nested, then we need to specify cross effect models. We do not discuss that here. What bothers me is that nlme’s lme function takes the first level as higher level. So I have to put region before state. Otherwise, it seems like it treated regions nested within states?
There are different ways to specify this model in nlme. I show two ways here (including lme2 which is commented out).
36.1.1 Stata
* let's see how to use menlwebuse productivity* to replicate mixed with menlmixed gsp || region: || state:menl gsp = {b1:}, define(b1: U1[region] UU1[region>state])
(Public capital productivity)
Performing EM optimization ...
Performing gradient-based optimization:
Iteration 0: Log likelihood = 200.86162
Iteration 1: Log likelihood = 200.86162
Computing standard errors ...
Mixed-effects ML regression Number of obs = 816
Grouping information
-------------------------------------------------------------
| No. of Observations per group
Group variable | groups Minimum Average Maximum
----------------+--------------------------------------------
region | 9 51 90.7 136
state | 48 17 17.0 17
-------------------------------------------------------------
Wald chi2(0) = .
Log likelihood = 200.86162 Prob > chi2 = .
------------------------------------------------------------------------------
gsp | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
_cons | 10.65961 .2503806 42.57 0.000 10.16887 11.15035
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects parameters | Estimate Std. err. [95% conf. interval]
-----------------------------+------------------------------------------------
region: Identity |
var(_cons) | .4376123 .2697616 .13073 1.464887
-----------------------------+------------------------------------------------
state: Identity |
var(_cons) | .6080626 .1382733 .3893904 .9495357
-----------------------------+------------------------------------------------
var(Residual) | .0246633 .0012586 .0223159 .0272577
------------------------------------------------------------------------------
LR test vs. linear model: chi2(2) = 2750.56 Prob > chi2 = 0.0000
Note: LR test is conservative and provided only for reference.
Obtaining starting values by EM:
Alternating PNLS/LME algorithm:
Iteration 1: Linearization log likelihood = 200.86162
Iteration 2: Linearization log likelihood = 200.86162
Computing standard errors:
Mixed-effects ML nonlinear regression Number of obs = 816
Grouping information
------------------------------------------------------------------
| No. of Observations per group
Path | groups Minimum Average Maximum
---------------------+--------------------------------------------
region | 9 51 90.7 136
region>state | 48 17 17.0 17
------------------------------------------------------------------
Linearization log likelihood = 200.86162
b1: U1[region] UU1[region>state]
------------------------------------------------------------------------------
gsp | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
b1 |
_cons | 10.65961 .2505341 42.55 0.000 10.16857 11.15065
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects parameters | Estimate Std. err. [95% conf. interval]
-----------------------------+------------------------------------------------
region: Identity |
var(U1) | .4376119 .2686479 .1313834 1.457598
-----------------------------+------------------------------------------------
region>state: Identity |
var(UU1) | .6080625 .1381527 .3895418 .9491666
-----------------------------+------------------------------------------------
var(Residual) | .0246633 .0012586 .0223159 .0272577
------------------------------------------------------------------------------
36.1.2 R
These two R packages return same results. Note that lmer reports variance, while lme reports standard deviation.
library(tidyverse)library(haven)library(dplyr)library(readxl)library(nlme)library(lme4)data <-read_dta('https://www.stata-press.com/data/r18/productivity.dta')# lme4 for three level modellmer1 <-lmer(gsp ~1+ (1|region) + (1|state), data=data)summary(lmer1)
Linear mixed model fit by REML ['lmerMod']
Formula: gsp ~ 1 + (1 | region) + (1 | state)
Data: data
REML criterion at convergence: -400.9
Scaled residuals:
Min 1Q Median 3Q Max
-2.8328 -0.6757 0.0474 0.6382 3.3928
Random effects:
Groups Name Variance Std.Dev.
state (Intercept) 0.60775 0.7796
region (Intercept) 0.50977 0.7140
Residual 0.02466 0.1570
Number of obs: 816, groups: state, 48; region, 9
Fixed effects:
Estimate Std. Error t value
(Intercept) 10.665 0.266 40.1
lme1 <-lme(gsp ~1, random=list( region=~1, state=~1), data=data)#or#lme1 <- lme(gsp ~ 1, random= list(region = pdSymm(~1), state=pdSymm(~1)), data=data)#but#lme1 <- lme(gsp ~ 1, random= list( state=~1, region=~1), data=data)#would be a different result!#The other way to specify:#lme2 <- lme(gsp ~ 1, random= ~ 1 | region/state, data=data)summary(lme1)
Linear mixed-effects model fit by REML
Data: data
AIC BIC logLik
-392.8509 -374.0381 200.4254
Random effects:
Formula: ~1 | region
(Intercept)
StdDev: 0.7139792
Formula: ~1 | state %in% region
(Intercept) Residual
StdDev: 0.7795814 0.1570457
Fixed effects: gsp ~ 1
Value Std.Error DF t-value p-value
(Intercept) 10.66483 0.2659842 768 40.09574 0
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.83282326 -0.67572390 0.04744387 0.63817953 3.39281358
Number of Observations: 816
Number of Groups:
region state %in% region
9 48
36.2 A growth model in R and stata
We implement a growth model in R and stata. The results are somewhat different. The results may not be that reliable given that the year slope random component is tiny.
36.2.1 Stata
* same growth model with mixed and menlwebuse productivitymixed gsp year || region: year|| state:menl gsp = {b2:} + {U1[region]}*year, define(b2: year U0[region] U11[region>state])
(Public capital productivity)
Performing EM optimization ...
Performing gradient-based optimization:
Iteration 0: Log likelihood = 784.59336
Iteration 1: Log likelihood = 784.64562
Iteration 2: Log likelihood = 784.65341
Iteration 3: Log likelihood = 784.65343
Iteration 4: Log likelihood = 784.65344
Computing standard errors ...
Mixed-effects ML regression Number of obs = 816
Grouping information
-------------------------------------------------------------
| No. of Observations per group
Group variable | groups Minimum Average Maximum
----------------+--------------------------------------------
region | 9 51 90.7 136
state | 48 17 17.0 17
-------------------------------------------------------------
Wald chi2(1) = 2744.51
Log likelihood = 784.65344 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
gsp | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
year | .0274903 .0005247 52.39 0.000 .0264618 .0285188
_cons | -43.71613 1.067739 -40.94 0.000 -45.80886 -41.6234
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects parameters | Estimate Std. err. [95% conf. interval]
-----------------------------+------------------------------------------------
region: Independent |
var(year) | 5.44e-16 5.06e-13 0 .
var(_cons) | .4380961 .2700722 .1308672 1.466587
-----------------------------+------------------------------------------------
state: Identity |
var(_cons) | .6091766 .1382625 .3904355 .9504672
-----------------------------+------------------------------------------------
var(Residual) | .0053926 .0002752 .0048793 .0059598
------------------------------------------------------------------------------
LR test vs. linear model: chi2(3) = 3903.81 Prob > chi2 = 0.0000
Note: LR test is conservative and provided only for reference.
Obtaining starting values by EM:
Alternating PNLS/LME algorithm:
Iteration 1: Linearization log likelihood = 784.65134
Iteration 2: Linearization log likelihood = 784.65343
Iteration 3: Linearization log likelihood = 784.65343
Computing standard errors:
Mixed-effects ML nonlinear regression Number of obs = 816
Grouping information
------------------------------------------------------------------
| No. of Observations per group
Path | groups Minimum Average Maximum
---------------------+--------------------------------------------
region | 9 51 90.7 136
region>state | 48 17 17.0 17
------------------------------------------------------------------
Wald chi2(1) = 2737.78
Linearization log likelihood = 784.65343 Prob > chi2 = 0.0000
b2: year U0[region] U11[region>state]
------------------------------------------------------------------------------
gsp | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
b2 |
year | .0274903 .0005254 52.32 0.000 .0264605 .02852
_cons | -43.71615 1.069039 -40.89 0.000 -45.81143 -41.62087
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects parameters | Estimate Std. err. [95% conf. interval]
-----------------------------+------------------------------------------------
region: Independent |
var(U0) | .4378219 .2687857 .1314413 1.458355
var(U1) | 5.40e-13 2.67e-10 0 .
-----------------------------+------------------------------------------------
region>state: Identity |
var(U11) | .6091976 .1381402 .3906086 .9501115
-----------------------------+------------------------------------------------
var(Residual) | .0053926 .0002752 .0048793 .0059598
------------------------------------------------------------------------------
36.2.2 R
# lme4 for three level model, growth modellmer_growth1 <-lmer(gsp ~1+ year + (1|region: year) + (1|region) + (1|state), data=data, REML=FALSE)summary(lmer_growth1)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: gsp ~ 1 + year + (1 | region:year) + (1 | region) + (1 | state)
Data: data
AIC BIC logLik -2*log(L) df.resid
-1772.4 -1744.1 892.2 -1784.4 810
Scaled residuals:
Min 1Q Median 3Q Max
-3.8497 -0.4675 -0.0397 0.5098 4.0558
Random effects:
Groups Name Variance Std.Dev.
region:year (Intercept) 0.002360 0.04858
state (Intercept) 0.609333 0.78060
region (Intercept) 0.437471 0.66142
Residual 0.003019 0.05494
Number of obs: 816, groups: region:year, 153; state, 48; region, 9
Fixed effects:
Estimate Std. Error t value
(Intercept) -4.190e+01 1.802e+00 -23.25
year 2.657e-02 9.021e-04 29.46
Correlation of Fixed Effects:
(Intr)
year -0.990
# nlme for three level model, growth model#lme_growth1<-lme(gsp ~ year, random = list(region=pdSymm(~year), state=pdIdent(~1)),data=data, method="ML")lme_growth1<-lme(gsp ~ year, random =list(region=~year, state=~1),data=data, method="ML")summary(lme_growth1)
Linear mixed-effects model fit by maximum likelihood
Data: data
AIC BIC logLik
-1555.307 -1522.376 784.6534
Random effects:
Formula: ~year | region
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 6.61523e-01 (Intr)
year 7.96178e-09 0.001
Formula: ~1 | state %in% region
(Intercept) Residual
StdDev: 0.7805102 0.0734343
Fixed effects: gsp ~ year
Value Std.Error DF t-value p-value
(Intercept) -43.71617 1.0690287 767 -40.89336 0
year 0.02749 0.0005254 767 52.32366 0
Correlation:
(Intr)
year -0.972
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-3.24606517 -0.59087173 0.03087456 0.62062441 4.27790886
Number of Observations: 816
Number of Groups:
region state %in% region
9 48
36.3 cem in R and stata
“CEM” is a growth model with the residual being a exponential function of the time variable.
36.3.1 stata
This mode has some convergence problem. I still use it for demonstration purpose.
webuse productivity, clear* this replicates the results from Rmenl gsp = {b2:} + {U1[region]}*year, define(b2: year U0[region] U11[region>state]) resvariance(exponential year) nolog
(Public capital productivity)
Mixed-effects ML nonlinear regression Number of obs = 816
Grouping information
------------------------------------------------------------------
| No. of Observations per group
Path | groups Minimum Average Maximum
---------------------+--------------------------------------------
region | 9 51 90.7 136
region>state | 48 17 17.0 17
------------------------------------------------------------------
Wald chi2(1) = 107.29
Linearization log likelihood = 877.96184 Prob > chi2 = 0.0000
b2: year U0[region] U11[region>state]
------------------------------------------------------------------------------
gsp | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
b2 |
year | .0274256 .0026478 10.36 0.000 .0222361 .0326151
_cons | -43.54047 5.400007 -8.06 0.000 -54.12429 -32.95665
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects parameters | Estimate Std. err. [95% conf. interval]
-----------------------------+------------------------------------------------
region: Independent |
var(U0) | 263.5659 128.3252 101.4975 684.4206
var(U1) | .0000633 .0000309 .0000243 .0001646
-----------------------------+------------------------------------------------
region>state: Identity |
var(U11) | .6046714 .1369848 .3878686 .942658
-----------------------------+------------------------------------------------
Residual variance: |
Exponential year |
sigma2 | 9.96e-99 . . .
gamma | .0556293 .000013 .0556039 .0556548
------------------------------------------------------------------------------
36.3.2 R
As in stata, it has some convergence problem. So the results can differ somewhat between R and stata.
lme_growth1<-lme(gsp ~ year, random =list(region=pdSymm(~year), state=pdIdent(~1)),data=data, method="ML")
Error in `lme()`:
! could not find function "lme"
lme_growth2<-update(lme_growth1,weights=varExp( form =~ year), method="ML", control =lmeControl(opt ="optim"))
Error in `lme.formula()`:
! could not find function "lme.formula"
---title: "Doing the same multi-level model with R and stata"date: "2023-10-26"---This post is a demo for doing the same multi-level models with R and stata. R's multilevel models are mostly done with lme4 and nlme. Stata's multilevel models are mostly done with mixed and menl. The two packages are not exactly the same. For example, mixed does not allow complicated structure for the residual. That's why we need menl. Likewise, in R lme4's "lmer" does not allow this specification. We have to use nlme. But it's good to know how to do the same thing in both packages.For example, we want to do a consensus emergence model (CEM) as described in Lang et al. 2012(https://core.ac.uk/download/pdf/158346339.pdf). The CEM model is basically a growth model with the residual being a exponential function of the time variable. We want to do the same thing in both R and stata.We start with simpler models.```{r}#| label: setup#| include: falseknitr::opts_chunk$set(echo =TRUE)```## Stata and R syntax comparison for the same three level modelWe run a three level model specified in stata with mixed and menl. In R with nlme and lme4. This is a random intercept model. Only intercept is varying for each unit at each level. In this data set supposedly state is nested within region. There are situations that levels are not nested, then we need to specify cross effect models. We do not discuss that here. What bothers me is that nlme's lme function takes the first level as higher level. So I have to put region before state. Otherwise, it seems like it treated regions nested within states? There are different ways to specify this model in nlme. I show two ways here (including lme2 which is commented out). ### Stata```{r}#| include: falselibrary(Statamarkdown)stataexe <-find_stata()#stataexe <- "/usr/local/bin/stata"knitr::opts_chunk$set(engine.path=list(stata=stataexe))``````{stata}*| label: stata1*| echo: true*| cache: true*| collectcode: true* let's see how to use menlwebuse productivity* to replicate mixed with menlmixed gsp || region: || state:menl gsp = {b1:}, define(b1: U1[region] UU1[region>state])```### RThese two R packages return same results. Note that lmer reports variance, while lme reports standard deviation.```{r}#| label: r1#| echo: true#| cache: true#| collectcode: truelibrary(tidyverse)library(haven)library(dplyr)library(readxl)library(nlme)library(lme4)data <-read_dta('https://www.stata-press.com/data/r18/productivity.dta')# lme4 for three level modellmer1 <-lmer(gsp ~1+ (1|region) + (1|state), data=data)summary(lmer1)lme1 <-lme(gsp ~1, random=list( region=~1, state=~1), data=data)#or#lme1 <- lme(gsp ~ 1, random= list(region = pdSymm(~1), state=pdSymm(~1)), data=data)#but#lme1 <- lme(gsp ~ 1, random= list( state=~1, region=~1), data=data)#would be a different result!#The other way to specify:#lme2 <- lme(gsp ~ 1, random= ~ 1 | region/state, data=data)summary(lme1)```## A growth model in R and stataWe implement a growth model in R and stata. The results are somewhat different. The results may not be that reliable given that the year slope random component is tiny. ### Stata```{stata}*| label: stata2*| echo: true*| cache: true*| collectcode: true* same growth model with mixed and menlwebuse productivitymixed gsp year || region: year|| state:menl gsp = {b2:} + {U1[region]}*year, define(b2: year U0[region] U11[region>state])```### R```{r}#| label: r2#| echo: true#| cache: true#| collectcode: true# lme4 for three level model, growth modellmer_growth1 <-lmer(gsp ~1+ year + (1|region: year) + (1|region) + (1|state), data=data, REML=FALSE)summary(lmer_growth1)# nlme for three level model, growth model#lme_growth1<-lme(gsp ~ year, random = list(region=pdSymm(~year), state=pdIdent(~1)),data=data, method="ML")lme_growth1<-lme(gsp ~ year, random =list(region=~year, state=~1),data=data, method="ML")summary(lme_growth1)```## cem in R and stata"CEM" is a growth model with the residual being a exponential function of the time variable. ### stataThis mode has some convergence problem. I still use it for demonstration purpose.```{stata}*| label: stata3*| echo: true*| cache: true*| collectcode: truewebuse productivity, clear* this replicates the results from Rmenl gsp = {b2:} + {U1[region]}*year, define(b2: year U0[region] U11[region>state]) resvariance(exponential year) nolog```### RAs in stata, it has some convergence problem. So the results can differ somewhat between R and stata.```{r}#| label: cem1lme_growth1<-lme(gsp ~ year, random =list(region=pdSymm(~year), state=pdIdent(~1)),data=data, method="ML")lme_growth2<-update(lme_growth1,weights=varExp( form =~ year), method="ML", control =lmeControl(opt ="optim"))summary(lme_growth2)delta1<-lme_growth2$modelStruct$varStructdelta1```