28 Mediation analysis in R and Stata

Published

August 6, 2019

28.1 Mediation analysis

Traditionally mediation model can be represented in the following equestions:

\[ Y = a X + b M + \epsilon_1 \] \[ M = c X + \epsilon_2 \]

That is, we’d like to study the effect of \(X\) on \(Y\), and we see the effect can be a direct effect, and an indirect effect, through \(M\).

Baron and Kenny’s (http://davidakenny.net/cm/mediate.htm) method is done in four steps. Modern approach tends to use SEM (structural equation modeling) to model these two equations directly.

R’s lavaan and Stata’s sem commands are powerful tools.

A simple example:

library(lavaan)
set.seed(1234)
n <- 10000
X <- rnorm(n)
M <- 0.5*X + rnorm(n)
Y <- 0.7*M + 0.3*X + rnorm(n)
Data <- data.frame(X = X, Y = Y, M = M)
model <- '   Y ~ a*X + b*M
             M ~ c*X
           # direct effect (a)
           # indirect effect (b*c)
             bc := b*c
           # total effect
             total := a + (b*c)
         '
fit <- sem(model, data = Data)
summary(fit)

lavaan 0.6-21 ended normally after 1 iteration

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         5

  Number of observations                         10000

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  Y ~                                                 
    X          (a)    0.286    0.011   25.233    0.000
    M          (b)    0.699    0.010   70.384    0.000
  M ~                                                 
    X          (c)    0.511    0.010   50.161    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .Y                 0.998    0.014   70.711    0.000
   .M                 1.012    0.014   70.711    0.000

Defined Parameters:
                   Estimate  Std.Err  z-value  P(>|z|)
    bc                0.357    0.009   40.849    0.000
    total             0.643    0.012   51.956    0.000

Now let’s do it in stata.

set obs 10000
gen x= rnormal()
gen m= .5*x + rnormal()
gen y= .7*m + .3*x + .4*m*x +rnormal()
sem (y <- x m) (m <- x)
estat teffects

Number of observations (_N) was 0, now 10,000.





Endogenous variables
  Observed: y m

Exogenous variables
  Observed: x

Fitting target model:
Iteration 0:  Log likelihood =  -43667.84  
Iteration 1:  Log likelihood =  -43667.84  

Structural equation model                               Number of obs = 10,000
Estimation method: ml

Log likelihood = -43667.84

------------------------------------------------------------------------------
             |                 OIM
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
Structural   |
  y          |
           m |   .7186314   .0113407    63.37   0.000     .6964041    .7408587
           x |   .3054332   .0127003    24.05   0.000     .2805411    .3303253
       _cons |    .199257   .0112834    17.66   0.000     .1771418    .2213721
  -----------+----------------------------------------------------------------
  m          |
           x |   .5038605   .0100014    50.38   0.000     .4842581    .5234628
       _cons |   .0187204   .0099478     1.88   0.060    -.0007769    .0382177
-------------+----------------------------------------------------------------
     var(e.y)|   1.272711   .0179989                      1.237919    1.308482
     var(e.m)|   .9895863   .0139949                      .9625336    1.017399
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0) = 0.00                 Prob > chi2 = .



Direct effects
------------------------------------------------------------------------------
             |                 OIM
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
Structural   |
  y          |
           m |   .7186314   .0113407    63.37   0.000     .6964041    .7408587
           x |   .3054332   .0127003    24.05   0.000     .2805411    .3303253
  -----------+----------------------------------------------------------------
  m          |
           x |   .5038605   .0100014    50.38   0.000     .4842581    .5234628
------------------------------------------------------------------------------


Indirect effects
------------------------------------------------------------------------------
             |                 OIM
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
Structural   |
  y          |
           m |          0  (no path)
           x |     .36209    .009182    39.43   0.000     .3440937    .3800863
  -----------+----------------------------------------------------------------
  m          |
           x |          0  (no path)
------------------------------------------------------------------------------


Total effects
------------------------------------------------------------------------------
             |                 OIM
             | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
Structural   |
  y          |
           m |   .7186314   .0113407    63.37   0.000     .6964041    .7408587
           x |   .6675231   .0134277    49.71   0.000     .6412053     .693841
  -----------+----------------------------------------------------------------
  m          |
           x |   .5038605   .0100014    50.38   0.000     .4842581    .5234628
------------------------------------------------------------------------------

The above examples should have direct effect of .3 and indirect effect of .35, and total effect of .65.

28.2 Causal Mediation analysis

The traditional mediation analysis has been criticized for the lack of causal interpretation. Without manipulation of the mediator, it is hard to interpret the effects causally, because even if the treatment is from random experiments, the mediator is often not. Therefore there could be an unmeasured confounder that is causing both \(M\) and \(Y\).

R’s “mediation” package is for causal mediation analysis. It uses simulation to estimate the causal effects of treatment, under assumptions of sequential ignorability. It estimates the following quantities:

\[\tau_i = Y_i(1, M_i(1)) - Y_i(0, Mi(0))\]

This is the total treatment effefct, which is to say, what’s the change in \(Y\) if we change each unit from control to treated, hypothetically?

Then this can be decomposed into the causal mediation effects:

\[\delta_i(t) = Y_i(t, M_i(1)) - Y_i(t, M_i(0))\]

for treatment status \(t=0,1\). This is to say, given treatment status, what’s the mediation effect?

\[\eta_i(t) = Y_i(1, M_i(t)) - Y_i(0, M_i(t)) \]

for treatment status \(t=0,1\). This is to say, given mediator stauts for each treatment status, what’s the direct effect?

Therefore there are four quantities estimated, direct and mediation effect for treated and control.

R’s “mediation” needs users to feed two models, outcome model and mediation model.

If we study the same data, we would expect it returns the same estimates as the tranditional methods. However, the causal mediation models can be much more flexible in outcome and mediation models.

library(mediation)
med.fit <- lm(M ~ X, data=Data)
summary(med.fit)


Call:
lm(formula = M ~ X, data = Data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1134 -0.6972  0.0086  0.6837  3.7309 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.007814   0.010060  -0.777    0.437    
X            0.510954   0.010187  50.156   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.006 on 9998 degrees of freedom
Multiple R-squared:  0.201, Adjusted R-squared:  0.2009 
F-statistic:  2516 on 1 and 9998 DF,  p-value: < 2.2e-16

out.fit <- lm(Y ~ X + M,data=Data)
summary(out.fit)


Call:
lm(formula = Y ~ X + M, data = Data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0680 -0.6771 -0.0081  0.6643  3.8304 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.004170   0.009992   0.417    0.676    
X           0.285582   0.011319  25.229   <2e-16 ***
M           0.699006   0.009933  70.373   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9991 on 9997 degrees of freedom
Multiple R-squared:  0.4734,    Adjusted R-squared:  0.4733 
F-statistic:  4494 on 2 and 9997 DF,  p-value: < 2.2e-16

set.seed(123)
med.out <- mediate(med.fit, out.fit, treat = "X", mediator = "M", sims = 100)
summary(med.out)


Causal Mediation Analysis 

Quasi-Bayesian Confidence Intervals

               Estimate 95% CI Lower 95% CI Upper   p-value    
ACME            0.35684      0.34208      0.37741 < 2.2e-16 ***
ADE             0.28540      0.26522      0.30604 < 2.2e-16 ***
Total Effect    0.64224      0.62091      0.66752 < 2.2e-16 ***
Prop. Mediated  0.55530      0.53571      0.57897 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Sample Size Used: 10000 


Simulations: 100

28.3 Binary outcome

For example, in the case of binary outcome, the traditional approach will have difficulties. We can estimate the outcome model and mediator model jointly, but the total effects are not easy to decompose into direct and indirect effect (see Imai et al, page 320 https://imai.fas.harvard.edu/research/files/BaronKenny.pdf).

The causal mediation analysis framework is much more general.

library(mediation)
data(framing)
med.fit <- lm(emo ~ treat + age + educ + gender + income, data = framing)
out.fit <- glm(cong_mesg ~ emo + treat + age + educ + gender + income,
               data = framing, family = binomial("probit"))
set.seed(123)
med.out <- mediate(med.fit, out.fit, treat = "treat", mediator = "emo",
                     robustSE = TRUE, sims = 100)
summary(med.out)


Causal Mediation Analysis 

Quasi-Bayesian Confidence Intervals

                           Estimate 95% CI Lower 95% CI Upper p-value    
ACME (control)             0.081341     0.037680     0.122101  <2e-16 ***
ACME (treated)             0.082417     0.036606     0.126460  <2e-16 ***
ADE (control)              0.018299    -0.105898     0.158469    0.72    
ADE (treated)              0.019375    -0.115315     0.172002    0.72    
Total Effect               0.100716    -0.028036     0.259054    0.24    
Prop. Mediated (control)   0.632609   -13.705651     4.305615    0.24    
Prop. Mediated (treated)   0.661761   -12.422330     4.036601    0.24    
ACME (average)             0.081879     0.037143     0.124080  <2e-16 ***
ADE (average)              0.018837    -0.110606     0.165198    0.72    
Prop. Mediated (average)   0.647185   -13.063991     4.171108    0.24    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Sample Size Used: 265 


Simulations: 100

“mediation” package has more functionalities, such as multilevel, interaction of treatment and mediator, etc.

Stata’s sem and gsem commands can model different situations, but the direct effect and indirect effects are not easy to compute, especially when you have binary outcome, or other non-continuous outcome situations. They are not designed for causal mediation analysis.

28.4 Going further: formal causal mediation

The mediation package above implements sequential ignorability — a specific set of identifying assumptions. The causal mediation chapter covers the assumptions more formally, introduces the controlled direct effect (CDE) and natural direct/indirect effects (NDE/NIE) using potential-outcome notation, and discusses identification with unmeasured mediator–outcome confounding. If your setting involves a non-binary outcome, treatment–mediator interaction, or you need a doubly-robust estimator, start there.