8  Nonparametric Causal Methods

using DataFrames
using Distributions
using Random
using Statistics
using MLJ
using EvoTrees
using MLJLinearModels
using DecisionTree
using CausalEstimate
import CausalEstimate: estimate, confint, pvalue
using Printf

Related reading: For an R-based TMLE simulation that compares super-learner libraries and stress-tests model misspecification, see the companion blog chapter on TMLE in Topics on Econometrics and Causal Inference.

8.1 Why Do We Need Nonparametric Methods?

  • Parametric models are often mis-specified
  • Nonparametric models are more flexible; less assumptions
  • What kind of nonparametrics are good?

Considerations:

  • Plug-in estimators are easy
  • But they are biased, or inefficient, because they are trying to estimate \(E[Y(0)]\) and \(E[Y(1)]\), not the difference.
  • They can be consistent, but convergence rate can be terrible, especially when the dimension is high
  • Ideally we can correct the biases, and get efficiency
  • Influence function based estimators are shown to be efficient.

8.2 Influence Function Based Estimators

Estimators based on efficient influence function are often \(\sqrt n\) consistent, meaning they converge to the truth as fast as \(\sqrt n\) rate. This means the bias not only goes to 0 with \(n\) goes to infinity, but also as fast as \(1/ \sqrt n\) goes to 0. Why does this matter? Because we have limited sample size, when we are based on asymptotics, the faster the bias goes to 0 the smaller sample size we need.

If we have an estimator such that \[ \sqrt n (\hat \psi - \psi) = \sqrt n P_n(\phi) + o_P(1) \rightarrow N(0, var(\phi))\]

Then this estimator is \(\sqrt n\) consistent and asymptotically normal; and \(\phi\) is the influence function. The efficient influence function is the one with variance at the lower bound (similar to parametric case for Cramer-Rao lower bound).

Unfortunately there is no universal formula for efficient influence function for every problem. We need to derive it for each problem. For example, ATE has an IF based estimator, but ATT needs a different formula; and LATE has a different formula too, etc. Deriving them is hard!

For the treated potential-outcome mean, \(\psi_1 = E\{E(Y \mid X, A=1)\}\), the efficient influence function is \[\phi_1 (Z; P) = \frac{A}{\pi (X)} \{ Y - \mu_1 (X) \} + \mu_1(X) - \psi_1\] where \(\pi (X) = P(A=1 | X)\) is the propensity score model and \(\mu_1 (X) = E(Y |X, A=1)\) is the treated outcome model. The ATE influence function is the difference between the treated and untreated components, \(\phi_1-\phi_0\).

For example, an IF-based bias-corrected estimator \[\hat \psi = P_n \left[ \frac{A}{\hat \pi (X)} \{ Y - \hat \mu (X) \} + \hat \mu(X) \right]\] where \(P_n\) means sample mean. This is the familiar AIPW estimator. It is known to be doubly robust.

Since we know the influence function, we can calculate its variance. This variance is the same as variance of \(\hat \psi\) since they differ by a constant \(\psi\). We can use any estimators to estimate the nuisance parameters, then get \(\hat \psi\). Then the sample variance is the variance of the influence function; and the sample mean is the ATE.

In this estimator, both \(\hat \mu\) and \(\hat \pi\) can be done using any estimator, such as machine learning.

8.2.1 Simulation Setup

Random.seed!(1)
n = 1000
w1 = rand(Binomial(1, 0.5), n)
w2 = rand(Binomial(1, 0.5), n)
w3 = round.(rand(Uniform(0, 4), n), digits=3)
w4 = round.(rand(Uniform(0, 5), n), digits=3)

logit(x) = 1.0 ./ (1.0 .+ exp.(-x))

prob_A = logit.(-0.4 .+ 0.2.*w2 .+ 0.15.*w3 .+ 0.2.*w4 .+ 0.15.*w2.*w4)
A = rand.(Binomial.(1, prob_A))

prob_Y1 = logit.(-1.0 .+ 1.0 .- 0.1.*w1 .+ 0.3.*w2 .+ 0.25.*w3 .+ 0.2.*w4 .+ 0.15.*w2.*w4)
prob_Y0 = logit.(-1.0 .+ 0.0 .- 0.1.*w1 .+ 0.3.*w2 .+ 0.25.*w3 .+ 0.2.*w4 .+ 0.15.*w2.*w4)

Y_1 = rand.(Binomial.(1, prob_Y1))
Y_0 = rand.(Binomial.(1, prob_Y0))
Y = Y_1 .* A .+ Y_0 .* (1 .- A)

W = DataFrame(w1=w1, w2=w2, w3=w3, w4=w4)
df = DataFrame(w1=w1, w2=w2, w3=w3, w4=w4, A=A, Y=Y, Y_1=Y_1, Y_0=Y_0)

True_Psi = mean(df.Y_1 .- df.Y_0)
@printf "True ATE = %.4g\n" True_Psi
True ATE = 0.184
# Load MLJ models to form our equivalent of SuperLearner
LogisticClassifier = @load LogisticClassifier pkg=MLJLinearModels verbosity=0
EvoTreeClassifier = @load EvoTreeClassifier pkg=EvoTrees verbosity=0
RandomForestClassifier = @load RandomForestClassifier pkg=DecisionTree verbosity=0

# Define a Stack (SuperLearner)
stack = Stack(
    metalearner = LogisticClassifier(),
    resampling = CV(nfolds=5),
    evo = EvoTreeClassifier(max_depth=4, nrounds=100),
    rf = RandomForestClassifier(n_trees=100, max_depth=5),
    glm = LogisticClassifier()
)
ProbabilisticStack(
  metalearner = LogisticClassifier(
        lambda = 2.220446049250313e-16, 
        gamma = 0.0, 
        penalty = :l2, 
        fit_intercept = true, 
        penalize_intercept = false, 
        scale_penalty_with_samples = true, 
        solver = nothing), 
  resampling = CV(
        nfolds = 5, 
        shuffle = false, 
        rng = TaskLocalRNG()), 
  measures = nothing, 
  cache = true, 
  acceleration = CPU1{Nothing}(nothing), 
  evo = EvoTreeClassifier(
        loss = :mlogloss, 
        metric = :mlogloss, 
        nrounds = 100, 
        bagging_size = 1, 
        early_stopping_rounds = 9223372036854775807, 
        L2 = 1.0, 
        lambda = 0.0, 
        gamma = 0.0, 
        eta = 0.1, 
        max_depth = 4, 
        min_weight = 1.0, 
        rowsample = 1.0, 
        colsample = 1.0, 
        nbins = 64, 
        tree_type = :binary, 
        seed = 123, 
        device = :cpu), 
  rf = RandomForestClassifier(
        max_depth = 5, 
        min_samples_leaf = 1, 
        min_samples_split = 2, 
        min_purity_increase = 0.0, 
        n_subfeatures = -1, 
        n_trees = 100, 
        sampling_fraction = 0.7, 
        feature_importance = :impurity, 
        rng = TaskLocalRNG()), 
  glm = LogisticClassifier(
        lambda = 2.220446049250313e-16, 
        gamma = 0.0, 
        penalty = :l2, 
        fit_intercept = true, 
        penalize_intercept = false, 
        scale_penalty_with_samples = true, 
        solver = nothing))

8.2.2 Plug-in Estimator from Outcome Regression

Random.seed!(1)
X_mu = select(df, :w1, :w2, :w3, :w4, :A)
mach_mu = machine(stack, X_mu, categorical(df.Y))
MLJ.fit!(mach_mu, verbosity=0)

# predict the PO Y^1
X_mu1 = copy(X_mu); X_mu1.A .= 1
X_mu0 = copy(X_mu); X_mu0.A .= 0

# Predict probability of Y=1 (assuming 1 is the positive class)
mu1hat = pdf.(MLJ.predict(mach_mu, X_mu1), 1)
mu0hat = pdf.(MLJ.predict(mach_mu, X_mu0), 1)

plug_in_ate = mean(mu1hat .- mu0hat)
@printf "ATE (plug-in) = %.4g\n" plug_in_ate
ATE (plug-in) = 0.1759

8.2.3 AIPW ATE

Random.seed!(123)
X_pi = select(df, :w1, :w2, :w3, :w4)
mach_pi = machine(stack, X_pi, categorical(df.A))
MLJ.fit!(mach_pi, verbosity=0)

pi1hat = pdf.(MLJ.predict(mach_pi, X_pi), 1)

# Bound propensity scores to avoid extreme weights
pi1hat = max.(min.(pi1hat, 0.999), 0.001)
pi0hat = 1.0 .- pi1hat

# Current prediction on observed data
mu1hat_obs = mu1hat
mu0hat_obs = mu0hat

IF_1 = ((df.A ./ pi1hat) .* (df.Y .- mu1hat_obs) .+ mu1hat_obs)
IF_0 = (((1 .- df.A) ./ pi0hat) .* (df.Y .- mu0hat_obs) .+ mu0hat_obs)
IF = IF_1 .- IF_0

aipw_1 = mean(IF_1); aipw_0 = mean(IF_0)
aipw_manual = aipw_1 - aipw_0

ci_lb = mean(IF) - quantile(Normal(), 0.975) * std(IF) / sqrt(length(IF))
ci_ub = mean(IF) + quantile(Normal(), 0.975) * std(IF) / sqrt(length(IF))
res_manual_aipw = [aipw_manual, ci_lb, ci_ub]
@printf "ATE (AIPW)   = %.4g\n" aipw_manual
@printf "95%% CI       = [%.4g, %.4g]\n" ci_lb ci_ub
ATE (AIPW)   = 0.1918
95% CI       = [0.1311, 0.2525]

8.2.4 AIPW ATE Using CausalEstimate.jl

using CausalEstimate

r(x) = round(x, sigdigits=4)

psi_ate = ATE(outcome=:Y, treatment=:A, confounders=[:w1, :w2, :w3, :w4])
aipw_result = estimate(psi_ate, AIPW(crossfit=5), df)

(
    ATE    = r(estimate(aipw_result)),
    CI     = r.(confint(aipw_result)),
    pvalue = r(pvalue(aipw_result)),
)
(ATE = 0.2072, CI = (0.1206, 0.2939), pvalue = 2.776e-6)

8.2.5 AIPW ATT

Random.seed!(1)
df2 = copy(df)
df2.pi0hat = pi0hat
df2.pi1hat = pi1hat

df0 = filter(row -> row.A == 0, df2)
df1 = filter(row -> row.A == 1, df2)

# Fit outcome model only on controls
X_mu0 = select(df0, :w1, :w2, :w3, :w4)
mach_mu0 = machine(stack, X_mu0, categorical(df0.Y))
MLJ.fit!(mach_mu0, verbosity=0)

# Predict counterfactual for treated
X_mu1 = select(df1, :w1, :w2, :w3, :w4)
mu0hat_treated = pdf.(MLJ.predict(mach_mu0, X_mu1), 1)

IF_0_att = (((1 .- df1.A) ./ df1.pi0hat) .* (df1.Y .- mu0hat_treated) .+ mu0hat_treated)
IF_att = df1.Y .- IF_0_att

aipw_manual_att = mean(IF_att)
ci_lb_att = mean(IF_att) - quantile(Normal(), 0.975) * std(IF_att) / sqrt(length(IF_att))
ci_ub_att = mean(IF_att) + quantile(Normal(), 0.975) * std(IF_att) / sqrt(length(IF_att))
res_manual_aipw_att = [aipw_manual_att, ci_lb_att, ci_ub_att]
res_manual_aipw_att
3-element Vector{Float64}:
 0.20145553037952954
 0.1700985774823039
 0.2328124832767552

8.2.6 AIPW ATT Using CausalEstimate.jl

psi_att = ATT(outcome=:Y, treatment=:A, confounders=[:w1, :w2, :w3, :w4])
att_result = estimate(psi_att, AIPW(crossfit=5), df)

(
    ATT    = r(estimate(att_result)),
    CI     = r.(confint(att_result)),
    pvalue = r(pvalue(att_result)),
)
(ATT = 0.1861, CI = (-0.04244, 0.4146), pvalue = 0.1105)

8.3 TMLE

TMLE is a variant of IF based estimator. It is also a doubly robust estimator. See Mark van der Laan and his coauthors for the recent development of TMLE.

We should expect AIPW and TMLE to give similar answers.

Random.seed!(42)
tmle_result = estimate(psi_ate, TMLE(crossfit=5), df)

(
    TMLE   = r(estimate(tmle_result)),
    CI     = r.(confint(tmle_result)),
    pvalue = r(pvalue(tmle_result)),
)
(TMLE = 0.1775, CI = (0.08638, 0.2687), pvalue = 0.0001349)
@printf "True ATE: %.4g\n" True_Psi
True ATE: 0.184

8.4 DoubleML

DoubleML (Double/Debiased Machine Learning) was proposed by Chernozhukov et al. (2018). An R package DoubleML implements the framework; as of this writing there is no native Julia package, but the algorithm is straightforward to implement directly. The double machine learning framework consists of three key ingredients: Neyman orthogonality, high-quality machine learning estimation, and sample splitting.

They consider a partially linear model:

\[ y_i = \theta d_i + g_0(x_i) + \eta_i \]

\[ d_i = m_0(x_i) + v_i \]

This model is quite general, except it does not allow interaction of \(d\) and \(x\); therefore no heterogeneous treatment effect across \(x\). But “DoubleML” implements more than partially linear model, it actually allows for heterogeneous treatment effects, in models such as interactive regression model.

The basic idea of doubleML is to use any machine learning algorithm to estimate outcome equation (\(l_0(X) = E(Y | X)\)) and treatment equation (\(m_0(X) = E(D | X)\)). Then get the residuals, namely \(\hat W=Y-\hat l_0(X)\) and \(\hat V = D - \hat m_0(X)\).

Then regress \(\hat W\) on \(\hat V\). Based on FWL theorem, you get \(\hat \theta\).

An important component here is to specify a Neyman-orthogonal score function. In the case of PLR, it can be the product of the two residuals:

\[\psi (W; \theta, \eta) = (Y-D\theta -g(X))(D-m(X)) \]

The estimator \(\hat \theta\) is to solve the equation that the sample mean of this score function being 0.

And the variance of this score function is used as the variance of the doubleML estimator’s variance.

using MLJ
using Random

# We implement DoubleML PLR (Partially Linear Regression) manually with cross-fitting
Random.seed!(123)
nsplits = 5
n_samples = nrow(df)
s = shuffle(1:n_samples)
folds = [s[1 + floor(Int, (k-1)*n_samples/nsplits) : floor(Int, k*n_samples/nsplits)] for k in 1:nsplits]

# Base learner (Random Forest for regression)
RandomForestRegressor = @load RandomForestRegressor pkg=DecisionTree verbosity=0
learner = RandomForestRegressor(n_trees=500, max_depth=5)

W_res = zeros(n_samples)
V_res = zeros(n_samples)

for fold in 1:nsplits
    test_idx = folds[fold]
    train_idx = setdiff(1:n_samples, test_idx)
    
    # Fit outcome model l_0(X) = E(Y | X)
    mach_l = machine(learner, W[train_idx, :], df.Y[train_idx])
    MLJ.fit!(mach_l, verbosity=0)
    y_pred = MLJ.predict(mach_l, W[test_idx, :])
    W_res[test_idx] = df.Y[test_idx] .- y_pred
    
    # Fit treatment model m_0(X) = E(D | X)
    mach_m = machine(learner, W[train_idx, :], df.A[train_idx])
    MLJ.fit!(mach_m, verbosity=0)
    a_pred = MLJ.predict(mach_m, W[test_idx, :])
    V_res[test_idx] = df.A[test_idx] .- a_pred
end
# FWL theorem: regress residual W_res on residual V_res
theta_hat = sum(W_res .* V_res) / sum(V_res .^ 2)

# Variance of score function (Neyman-orthogonal score)
psi = (W_res .- theta_hat .* V_res) .* V_res
var_theta = var(psi) / (mean(V_res .^ 2)^2)
se_theta = sqrt(var_theta / n_samples)

ci_lb = theta_hat - 1.96 * se_theta
ci_ub = theta_hat + 1.96 * se_theta
@printf "DML PLR Estimate = %.4g (SE = %.4g)\n" theta_hat se_theta
@printf "95%% CI           = [%.4g, %.4g]\n" ci_lb ci_ub
DML PLR Estimate = 0.192 (SE = 0.03206)
95% CI           = [0.1292, 0.2549]