9 Heterogeneous Treatment Effects with Machine Learning

using DataFrames
using Distributions
using Random
using Statistics
using LinearAlgebra
using Printf
using MLJ
using MLJDecisionTreeInterface
using GLM
using CairoMakie

The estimands chapter defined the conditional average treatment effect

\[ \text{CATE}(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x] \]

as the effect of treatment for units with covariates \(X=x\). In simulation we can compute CATE from known potential outcomes. In real data we observe only one potential outcome per unit, so CATE has to be estimated from \((X,D,Y)\).

Here I focus on meta-learners: recipes that turn a regression method into a CATE estimator. I implement S-, T-, X-, R-, and DR-learners using MLJ and DecisionTree.jl random forests.

Julia does not currently have an equivalent of R’s grf::causal_forest with honest sample splitting and CATE confidence intervals. For that workflow, use the R companion chapter.

9.1 The data-generating process

I use the same CATE function as the estimands chapter, \(\tau(x)=1+2x_1\) (the rest of the DGP differs: five covariates and confounding through the observed \(X_2\)).

Random.seed!(42)
n = 5000
p = 5

X    = rand(n, p)
# Treatment depends on the OBSERVED covariate X2 (a backdoor confounder we
# adjust for), so unconfoundedness given X holds and the estimators below are
# consistent for the true CATE/ATE.
ps   = @. 1 / (1 + exp(-(-0.3 + 1.5 * X[:, 2])))
D    = Float64.(rand(n) .< ps)
tau  = @. 1 + 2 * X[:, 1]
Y0   = @. 0.5 * X[:, 2] + randn()
Y1   = Y0 .+ tau
Y    = ifelse.(D .== 1, Y1, Y0)

@printf("n = %d, true ATE = %.3f\n", n, mean(tau))
@printf("True CATE at X1 = 0.2: %.2f\n", 1 + 2 * 0.2)
@printf("True CATE at X1 = 0.8: %.2f\n", 1 + 2 * 0.8)

n = 5000, true ATE = 2.009
True CATE at X1 = 0.2: 1.40
True CATE at X1 = 0.8: 2.60

Treatment is related to the observed covariate \(X_2\), which also affects the outcome — a backdoor confounder. Because \(X_2\) is observed and in the conditioning set, unconfoundedness holds and the estimators below are consistent for the true CATE/ATE.
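To make the confounding concrete, here is a minimal sketch using only the standard library. It redraws a simplified version of the DGP above with a larger, hypothetical sample size (`n_demo`) and compares a naive difference in means with an OLS regression that adjusts for \(X_2\). The variable names and the use of OLS instead of a random forest are assumptions for illustration, not part of the chapter's pipeline.

```julia
using Random, Statistics, LinearAlgebra

Random.seed!(1)
n_demo = 200_000                              # hypothetical size, larger than the chapter's n
x1 = rand(n_demo)
x2 = rand(n_demo)
ps_demo  = 1 ./ (1 .+ exp.(-(-0.3 .+ 1.5 .* x2)))   # same propensity form as above
d        = Float64.(rand(n_demo) .< ps_demo)
tau_demo = 1 .+ 2 .* x1
y        = 0.5 .* x2 .+ randn(n_demo) .+ d .* tau_demo

# Naive contrast: ignores that treated units have higher X2 on average
naive = mean(y[d .== 1]) - mean(y[d .== 0])

# Adjusted contrast: OLS of y on (1, d, x2); the coefficient on d is the estimate
design   = hcat(ones(n_demo), d, x2)
beta     = design \ y
adjusted = beta[2]

println("true ATE = ", mean(tau_demo), ", naive = ", naive, ", adjusted = ", adjusted)
```

The naive contrast is pulled upward because treated units have higher \(X_2\) on average; adding \(X_2\) to the regression removes that gap. OLS recovers the ATE in this sketch only because \(\tau\) depends on \(X_1\), which is independent of both treatment and \(X_2\).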

9.2 A wrapper for fitting random forests

To keep the code short, define a helper that fits a random forest and returns predictions.

const RFR = @load RandomForestRegressor pkg=DecisionTree verbosity=0

"""
    rf_fit_predict(Xtrain, ytrain, Xpredict; n_trees=500, weights=nothing)
Fit a random forest on (Xtrain, ytrain) and return predictions at Xpredict.
DecisionTree's `RandomForestRegressor` does not accept per-sample weights, so
when `weights` are supplied we approximate a weighted fit by resampling the
training rows with probability proportional to the weights (weighted bootstrap).
"""
function rf_fit_predict(Xtrain, ytrain, Xpredict; n_trees::Int=500,
                        weights=nothing)
    if weights !== nothing
        w = Float64.(weights)
        m = size(Xtrain, 1)
        cdf = cumsum(w)
        cdf ./= cdf[end]                     # weighted-bootstrap CDF; last entry is exactly 1
        idx = [searchsortedfirst(cdf, rand()) for _ in 1:m]
        Xtrain = Xtrain[idx, :]
        ytrain = ytrain[idx]
    end
    learner = RFR(n_trees=n_trees, max_depth=-1)
    Xtrain_t = MLJ.table(Xtrain)
    Xpred_t  = MLJ.table(Xpredict)
    mach = machine(learner, Xtrain_t, Float64.(ytrain))
    fit!(mach, verbosity=0)
    return MLJ.predict(mach, Xpred_t)
end
nothing
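The weighted-bootstrap step can be checked in isolation. The sketch below uses a hypothetical helper, `weighted_indices`, that mirrors the resampling logic (it is not part of MLJ or DecisionTree.jl) and confirms that rows are drawn roughly in proportion to their weights. It normalises by the last cumulative sum so that the final CDF entry is exactly 1 and the search never runs past the end of the array.

```julia
using Random, Statistics

# Hypothetical helper: draw `m` row indices with probability proportional to `w`.
function weighted_indices(w::AbstractVector{<:Real}, m::Int; rng = Random.default_rng())
    cdf = cumsum(Float64.(w))
    cdf ./= cdf[end]                          # final entry is exactly 1.0
    return [searchsortedfirst(cdf, rand(rng)) for _ in 1:m]
end

rng  = MersenneTwister(123)
w    = [1.0, 2.0, 3.0, 4.0]
idx  = weighted_indices(w, 100_000; rng = rng)
freq = [mean(idx .== j) for j in eachindex(w)]   # empirical draw frequencies

println("target  = ", w ./ sum(w))
println("sampled = ", freq)
```

Resampling is only an approximation to a weighted fit: rows with small weights may not appear at all in a given draw, which adds Monte Carlo noise on top of the forest's own randomness.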

9.3 S-learner (“single”)

Fit one regression of \(Y\) on \((D, X)\), then predict the difference between \(D = 1\) and \(D = 0\) for each \(x\):

X_with_D = hcat(D, X)
X1_test  = hcat(ones(n), X)
X0_test  = hcat(zeros(n), X)

mu1_S = rf_fit_predict(X_with_D, Y, X1_test)
mu0_S = rf_fit_predict(X_with_D, Y, X0_test)
tau_S = mu1_S .- mu0_S

@printf("S-learner CATE correlation with truth: %.3f\n", cor(tau_S, tau))

S-learner CATE correlation with truth: 0.723

The S-learner is simple. Its weakness is that the model may treat \(D\) as a minor predictor and shrink treatment effects toward zero.
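In symbols, writing \(\hat\mu(d, x)\) for the single fitted regression of \(Y\) on \((D, X)\) (notation introduced here for illustration), the estimator computed above is

\[ \hat\tau_S(x) = \hat\mu(1, x) - \hat\mu(0, x). \]

If the fitted \(\hat\mu\) barely depends on \(d\) in some region of the covariate space, this difference is close to zero there, whatever the true effect is.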

9.4 T-learner (“two”)

Fit two separate regressions, one on treated units and one on controls:

idx_T = D .== 1
idx_C = D .== 0

mu1_T = rf_fit_predict(X[idx_T, :], Y[idx_T], X)
mu0_T = rf_fit_predict(X[idx_C, :], Y[idx_C], X)
tau_T = mu1_T .- mu0_T

@printf("T-learner CATE correlation with truth: %.3f\n", cor(tau_T, tau))

T-learner CATE correlation with truth: 0.727

The T-learner gives treatment and control separate outcome models. It can work well, but it can extrapolate badly when treated and control covariate distributions do not overlap.
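In symbols, with \(\hat\mu_d(x)\) denoting the regression fitted on the arm with \(D = d\):

\[ \hat\tau_T(x) = \hat\mu_1(x) - \hat\mu_0(x), \qquad \hat\mu_d(x) \approx \mathbb{E}[Y \mid X = x, D = d]. \]

Under unconfoundedness, \(\mathbb{E}[Y \mid X = x, D = d] = \mathbb{E}[Y(d) \mid X = x]\), so the difference targets \(\text{CATE}(x)\). Each \(\hat\mu_d\) is learned only where arm \(d\) has data, which is why poor overlap forces extrapolation.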

9.5 X-learner (Künzel et al. 2019)

The X-learner starts from the T-learner. It imputes missing potential outcomes, creates pseudo-treatment effects, smooths them over \(X\), and then combines the two arms with propensity-score weights.

# Step 1: pseudo-outcomes
D1_pseudo = Y[idx_T] .- mu0_T[idx_T]      # observed - imputed Y(0)
D0_pseudo = mu1_T[idx_C] .- Y[idx_C]      # imputed Y(1) - observed

# Step 2: regress pseudo-outcomes on X in each arm
tau_X1 = rf_fit_predict(X[idx_T, :], D1_pseudo, X)
tau_X0 = rf_fit_predict(X[idx_C, :], D0_pseudo, X)

# Step 3: weight by propensity score
e_hat = rf_fit_predict(X, D, X)
e_hat = clamp.(e_hat, 0.02, 0.98)

tau_X = e_hat .* tau_X0 .+ (1 .- e_hat) .* tau_X1
@printf("X-learner CATE correlation with truth: %.3f\n", cor(tau_X, tau))

X-learner CATE correlation with truth: 0.901

The X-learner is useful when one treatment arm is much smaller than the other.
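The three steps in the code correspond to the following formulas, where \(\hat\tau_1\) and \(\hat\tau_0\) are the smoothed pseudo-effects from the treated and control arms (`tau_X1` and `tau_X0` in the code):

\[ \tilde\tau_i^{(1)} = Y_i - \hat\mu_0(X_i) \ \ (D_i = 1), \qquad \tilde\tau_i^{(0)} = \hat\mu_1(X_i) - Y_i \ \ (D_i = 0), \]

\[ \hat\tau_X(x) = \hat e(x)\,\hat\tau_0(x) + (1 - \hat e(x))\,\hat\tau_1(x). \]

The weighting explains the small-arm advantage. Where treated units are rare (small \(\hat e(x)\)), \(\hat\mu_0\) is learned from many controls, so the treated-arm estimate \(\hat\tau_1\), which relies on \(\hat\mu_0\), receives most of the weight.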

9.6 R-learner (Nie and Wager 2021)

The R-learner partials out the main effects of \(X\) from both \(Y\) and \(D\):

\[ \tilde Y_i = \frac{Y_i - \hat m(X_i)}{D_i - \hat e(X_i)}, \qquad \text{weight}_i = (D_i - \hat e(X_i))^2, \]

where \(\hat m(x) = \mathbb{E}[Y \mid X = x]\) and \(\hat e(x) = \mathbb{E}[D \mid X = x]\) are cross-fitted nuisance estimates.
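A short derivation shows where the pseudo-outcome and the weights come from (regularisation omitted for brevity). Under unconfoundedness, \(m(x) = \mu_0(x) + e(x)\tau(x)\), which gives Robinson's decomposition

\[ Y_i - m(X_i) = (D_i - e(X_i))\,\tau(X_i) + \varepsilon_i, \qquad \mathbb{E}[\varepsilon_i \mid X_i, D_i] = 0. \]

So \(\tau\) can be estimated by least squares on the residualised data. Factoring \((D_i - \hat e(X_i))^2\) out of each squared term turns that loss into a weighted regression of \(\tilde Y_i\) on \(X_i\):

\[ \sum_i \big[(Y_i - \hat m(X_i)) - (D_i - \hat e(X_i))\,\tau(X_i)\big]^2 = \sum_i (D_i - \hat e(X_i))^2\,\big[\tilde Y_i - \tau(X_i)\big]^2. \]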

# Cross-fitting: 2 folds
Random.seed!(7)
folds  = rand(1:2, n)
m_hat  = zeros(n)
e_hat_cv = zeros(n)

for k in 1:2
    train = folds .!= k
    test  = folds .== k
    m_hat[test]    = rf_fit_predict(X[train, :], Y[train], X[test, :])
    e_hat_cv[test] = rf_fit_predict(X[train, :], D[train], X[test, :])
end
e_hat_cv = clamp.(e_hat_cv, 0.02, 0.98)

pseudo_R   = (Y .- m_hat) ./ (D .- e_hat_cv)
weights_R  = (D .- e_hat_cv) .^ 2

# The R-learner is a *weighted* regression of pseudo_R on X with weights
# (D - e_hat)^2. DecisionTree's RF does not accept per-sample weights, so we
# pass them through rf_fit_predict's weighted-bootstrap path. This downweights
# observations with D close to e_hat, whose pseudo_R blows up (division by a
# near-zero denominator) and would otherwise dominate the split criterion.
tau_R = rf_fit_predict(X, pseudo_R, X; weights = weights_R)
@printf("R-learner CATE correlation with truth: %.3f\n", cor(tau_R, tau))

R-learner CATE correlation with truth: 0.467

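The R-learner is often stable when overlap is reasonable because its loss is Neyman-orthogonal: errors in \(\hat m\) and \(\hat e\) affect the CATE estimate only to second order. That property is asymptotic and does not guarantee good finite-sample performance: in this run, with untuned forests and a resampling approximation to the weights, its correlation with the truth (0.467) is well below the X-learner's (0.901).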

9.7 DR-learner (Kennedy 2020)

The DR-learner uses an AIPW-style pseudo-outcome and then regresses it on \(X\):

\[ \tilde Y_i^{DR} = \hat\mu_1(X_i) - \hat\mu_0(X_i) + \frac{D_i (Y_i - \hat\mu_1(X_i))}{\hat e(X_i)} - \frac{(1 - D_i) (Y_i - \hat\mu_0(X_i))}{1 - \hat e(X_i)}. \]

mu1_cf = zeros(n)
mu0_cf = zeros(n)
e_cf   = zeros(n)

for k in 1:2
    train = folds .!= k
    test  = folds .== k
    mu1_cf[test] = rf_fit_predict(X[train .& (D .== 1), :], Y[train .& (D .== 1)],
                                  X[test, :])
    mu0_cf[test] = rf_fit_predict(X[train .& (D .== 0), :], Y[train .& (D .== 0)],
                                  X[test, :])
    e_cf[test]   = rf_fit_predict(X[train, :], D[train], X[test, :])
end
e_cf = clamp.(e_cf, 0.02, 0.98)

pseudo_DR = @. (mu1_cf - mu0_cf) +
               D * (Y - mu1_cf) / e_cf -
               (1 - D) * (Y - mu0_cf) / (1 - e_cf)

tau_DR = rf_fit_predict(X, pseudo_DR, X)
@printf("DR-learner CATE correlation with truth: %.3f\n", cor(tau_DR, tau))

DR-learner CATE correlation with truth: 0.433
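The DR pseudo-outcome is the AIPW score, so its sample mean is the AIPW estimate of the ATE, and that mean remains consistent if either the outcome models or the propensity score is correct. The sketch below illustrates the second case on a small synthetic DGP of its own (not the chapter's): the propensity `e_true` is assumed known, while the outcome models are deliberately wrong constants. All names are hypothetical.

```julia
using Random, Statistics

Random.seed!(11)
n_demo   = 100_000
x        = rand(n_demo)
e_true   = 0.3 .+ 0.4 .* x                    # known propensity (an assumption of this demo)
d        = Float64.(rand(n_demo) .< e_true)
tau_true = 1 .+ 2 .* x
y        = 0.5 .* x .+ d .* tau_true .+ randn(n_demo)

# Deliberately misspecified outcome models: both ignore x entirely
mu1_bad = fill(mean(y[d .== 1]), n_demo)
mu0_bad = fill(mean(y[d .== 0]), n_demo)

# Same pseudo-outcome formula as the DR-learner
pseudo = @. (mu1_bad - mu0_bad) +
            d * (y - mu1_bad) / e_true -
            (1 - d) * (y - mu0_bad) / (1 - e_true)

plug_in = mean(mu1_bad .- mu0_bad)            # biased: equals the naive difference in means
aipw    = mean(pseudo)                        # propensity-weighted residuals correct the bias

println("true ATE = ", mean(tau_true), ", plug-in = ", plug_in, ", AIPW = ", aipw)
```

The price of this robustness is variance: the residual terms are divided by \(\hat e\) or \(1 - \hat e\), so individual pseudo-outcomes can be far noisier than the CATE they target, which is why the DR-learner smooths them over \(X\) in a final regression.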

9.8 Comparing all five estimators

fig = Figure(size = (1000, 700))
labs = ["S-learner", "T-learner", "X-learner", "R-learner", "DR-learner"]
preds = [tau_S, tau_T, tau_X, tau_R, tau_DR]

for (i, (lab, pred)) in enumerate(zip(labs, preds))
    row, col = divrem(i - 1, 3) .+ (1, 1)
    ax = Axis(fig[row, col],
              xlabel = "X1", ylabel = "Estimated CATE",
              title = lab)
    scatter!(ax, X[:, 1], pred, color = (:steelblue, 0.2), markersize = 4)
    lines!(ax, 0:0.01:1, x -> 1 + 2x, color = :firebrick, linestyle = :dash,
           linewidth = 2)
end
fig

Estimated CATE vs true τ(x) = 1 + 2 X₁ for each meta-learner. Red dashed line = ground truth.

9.9 Best Linear Projection (BLP)

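Even if CATE is nonlinear, we often want a regression-style summary: which covariates are associated with larger effects? The best linear projection is the population least-squares approximation of \(\tau(X)\) by a linear function of the covariates. Because \(\tau(X)\) is unobserved, it is estimated by an OLS regression of pseudo-outcomes on covariates: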

\[ (\beta_0^*, \beta^*) = \arg\min_{\beta_0, \beta} \mathbb{E}\left[(\tau(X) - \beta_0 - X'\beta)^2\right], \qquad \text{BLP}(X) = \beta_0^* + X'\beta^*. \]

The \(\arg\min\) returns the coefficient vector \((\beta_0^*, \beta^*)\); the best linear projection itself is the fitted function \(\beta_0^* + X'\beta^*\).
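Why can a pseudo-outcome stand in for the unobserved \(\tau(X)\)? A short derivation, assuming the nuisance estimates in the DR pseudo-outcome are accurate enough that \(\mathbb{E}[\tilde Y^{DR} \mid X] = \tau(X)\). The projection coefficients have the closed form

\[ \beta^* = \operatorname{Var}(X)^{-1}\operatorname{Cov}(X, \tau(X)), \qquad \beta_0^* = \mathbb{E}[\tau(X)] - \mathbb{E}[X]'\beta^*, \]

and by the law of iterated expectations

\[ \operatorname{Cov}(X, \tilde Y^{DR}) = \operatorname{Cov}\big(X, \mathbb{E}[\tilde Y^{DR} \mid X]\big) = \operatorname{Cov}(X, \tau(X)), \qquad \mathbb{E}[\tilde Y^{DR}] = \mathbb{E}[\tau(X)]. \]

So OLS of \(\tilde Y^{DR}\) on \(X\) estimates the same \((\beta_0^*, \beta^*)\). In this chapter's DGP, \(\tau(x) = 1 + 2x_1\) is itself linear, so the BLP coincides with the CATE: \(\beta_0^* = 1\), the coefficient on \(X_1\) is 2, and the remaining coefficients are 0.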

A doubly-robust BLP uses the DR-learner pseudo-outcomes:

blp_df = DataFrame(hcat(pseudo_DR, X), [:tau_pseudo, :X1, :X2, :X3, :X4, :X5])
blp_fit = lm(@formula(tau_pseudo ~ X1 + X2 + X3 + X4 + X5), blp_df)

println(coeftable(blp_fit))

               Coef.   Std. Error       t   Pr(>|t|)   Lower 95%   Upper 95%
(Intercept)   0.9974       0.1401    7.12     <1e-11      0.7227      1.2721
X1            2.1105       0.1207   17.49     <1e-65      1.8740      2.3470
X2           -0.0483       0.1212   -0.40     0.6902     -0.2860      0.1894
X3            0.0070       0.1225    0.06     0.9544     -0.2331      0.2471
X4           -0.0301       0.1227   -0.25     0.8063     -0.2707      0.2105
X5           -0.0003       0.1227   -0.00     0.9983     -0.2408      0.2403

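The estimates match the DGP: the intercept is 1.00 and the coefficient on \(X_1\) is 2.11 (95% interval 1.87 to 2.35), close to the true values of 1 and 2, while the coefficients on \(X_2\) through \(X_5\) are statistically indistinguishable from 0.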

9.10 GATES: sorted group ATEs

GATES (group average treatment effects) is a simple way to report heterogeneity (Chernozhukov et al. 2018). Sort observations by predicted CATE, split them into bins, and estimate the ATE in each bin. (The same paper’s CLAN — classification analysis — is the natural companion: compare average covariates between the most- and least-affected bins; here that would show high \(X_1\) in the top quintile.)

nq = 5
quintile_edges = quantile(tau_DR, range(0, 1, length = nq + 1))
quintiles      = searchsortedfirst.(Ref(quintile_edges), tau_DR) .- 1
quintiles      = clamp.(quintiles, 1, nq)

# AIPW-style ATE within each quintile using the DR pseudo-outcomes
function quintile_ate(tau_pseudo, idx)
    n_q = sum(idx)
    est = mean(tau_pseudo[idx])
    se  = std(tau_pseudo[idx]) / sqrt(n_q)   # standard error of the group mean
    return (est = est, se = se)
end

# These are GATES (group ATEs), not CLAN, so the table is named accordingly
gates = [quintile_ate(pseudo_DR, quintiles .== q) for q in 1:nq]
gates_df = DataFrame(quintile = 1:nq,
                     ATE = [g.est for g in gates],
                     SE  = [g.se  for g in gates])
gates_df.lo = gates_df.ATE .- 1.96 .* gates_df.SE
gates_df.hi = gates_df.ATE .+ 1.96 .* gates_df.SE

@printf("%-10s %8s %8s %8s %8s\n", "Quintile", "ATE", "SE", "95% LB", "95% UB")
for row in eachrow(gates_df)
    @printf("%-10d %8.3f %8.3f %8.3f %8.3f\n",
            row.quintile, row.ATE, row.SE, row.lo, row.hi)
end

Quintile        ATE       SE   95% LB   95% UB
1            -1.204    0.062   -1.326   -1.082
2             1.008    0.024    0.962    1.054
3             2.032    0.023    1.987    2.077
4             3.025    0.023    2.979    3.071
5             5.271    0.068    5.138    5.404

Quintile 5 has a much larger estimated ATE than quintile 1, as expected, but the spread is exaggerated: the true CATE lies in \([1, 3]\), yet the bottom and top bins report -1.20 and 5.27. The reason is that the quintiles are formed from \(\hat\tau_{DR}\), which was fit in-sample to the same pseudo-outcomes used to estimate the group ATEs, so the top bin collects positive noise and the bottom bin negative noise. For inference, form the groups and estimate the ATEs on separate folds.
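The outward bias is easy to reproduce in a stylised setting. The sketch below is not the chapter's pipeline: it uses two synthetic noisy scores, `score_a` and `score_b`, as stand-ins for pseudo-outcomes from two independent folds, and all names are hypothetical. Groups are formed from `score_a`; averaging `score_a` within those groups exaggerates the spread, while averaging the independent `score_b` does not.

```julia
using Random, Statistics

Random.seed!(3)
n_demo   = 50_000
tau_true = 1 .+ 2 .* rand(n_demo)             # true effects in [1, 3]
score_a  = tau_true .+ 2 .* randn(n_demo)     # noisy score used to form the groups
score_b  = tau_true .+ 2 .* randn(n_demo)     # independent noisy score for evaluation

# Assign each unit to one of `nbins` equal-sized groups by the rank of `score`
function rank_bins(score, nbins)
    ranks = invperm(sortperm(score))          # rank 1 = smallest score
    return cld.(ranks .* nbins, length(score))
end

bins = rank_bins(score_a, 5)
gates_same  = [mean(score_a[bins .== q]) for q in 1:5]   # grouping and averaging share noise
gates_indep = [mean(score_b[bins .== q]) for q in 1:5]   # averaging uses fresh noise
gates_true  = [mean(tau_true[bins .== q]) for q in 1:5]  # simulation-only benchmark

for q in 1:5
    println(q, ": same = ", gates_same[q], ", independent = ", gates_indep[q],
            ", truth = ", gates_true[q])
end
```

With real data there is no second score for the same units, so the analogue is to build the groups from a CATE model fitted on one fold and average pseudo-outcomes from the other fold.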

9.11 Variable importance — which covariate drives heterogeneity?

A simple diagnostic is to regress the DR pseudo-outcome on each covariate separately and compare \(R^2\).

function single_var_r2(target, x)
    df_one = DataFrame(t = target, x = x)
    fit = lm(@formula(t ~ x), df_one)
    1 - sum(abs2.(residuals(fit))) / sum(abs2.(target .- mean(target)))
end

vi = [single_var_r2(pseudo_DR, X[:, j]) for j in 1:p]
vi_df = DataFrame(variable = ["X$j" for j in 1:p], R2 = vi)
sort!(vi_df, :R2, rev = true)
println(vi_df)

5×2 DataFrame
 Row │ variable  R2
     │ String    Float64
─────┼──────────────────────
   1 │ X1        0.0577564
   2 │ X4        3.29404e-5
   3 │ X5        2.40635e-5
   4 │ X2        1.23678e-5
   5 │ X3        5.64835e-6

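\(X_1\) is the only covariate with non-negligible explanatory power: its \(R^2\) is about 0.058, against less than \(10^{-4}\) for each of the others. The absolute \(R^2\) is small because the DR pseudo-outcomes are very noisy, so compare covariates by rank rather than by level.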

9.12 Policy learning: who should we treat?

A CATE estimate is not yet a policy. If treatment has a cost, the policy question is who should be treated. Define a treatment cost \(c\) in outcome units:

\[ \pi^*(x) = \mathbb{1}\{\tau(x) > c\}. \]

For interpretability, we can restrict the policy to a single threshold rule.

# cost = 2 makes the problem non-degenerate: tau(x) = 1 + 2 x1 is in [1, 3],
# so the optimal rule is "treat iff x1 > 0.5" (about half the population).
# A cost below 1 would make treat-everyone optimal and there would be
# nothing for a policy to learn.
cost = 2.0

# Welfare of a rule = average gain over treating nobody, evaluated with the
# DR scores: mean over units of treat(x) * (pseudo_DR - cost)
welfare(treat) = mean(treat .* (pseudo_DR .- cost))

welfare_all  = welfare(trues(n))            # treat everyone
treat_est    = tau_DR .> cost               # rule from the ESTIMATED CATE
treat_true   = tau .> cost                  # oracle rule (simulation only)

# Simple threshold-rule family on X1
threshes = 0:0.05:1
welfares = [welfare(X[:, 1] .> t) for t in threshes]
best_t   = threshes[argmax(welfares)]

@printf("Treatment rates: estimated rule %.2f, oracle rule %.2f\n",
        mean(treat_est), mean(treat_true))
@printf("Welfare (treat everyone):                    %.3f\n", welfare_all)
@printf("Welfare (oracle rule τ(x) > %.0f):            %.3f\n", cost, welfare(treat_true))
@printf("Welfare (best X1-threshold rule, X1 > %.2f): %.3f\n",
        best_t, maximum(welfares))
@printf("Welfare (τ̂-rule, SAME scores, biased):       %.3f\n", welfare(treat_est))
@printf("Welfare (τ̂-rule, evaluated on true τ):       %.3f\n",
        mean(treat_est .* (tau .- cost)))

Treatment rates: estimated rule 0.51, oracle rule 0.51
Welfare (treat everyone):                    0.026
Welfare (oracle rule τ(x) > 2):            0.302
Welfare (best X1-threshold rule, X1 > 0.45): 0.303
Welfare (τ̂-rule, SAME scores, biased):       0.887
Welfare (τ̂-rule, evaluated on true τ):       0.133

The threshold rule gives up flexibility, but it is easy to explain, and the selected threshold of 0.45 is within one grid step of the oracle threshold of \(0.5\).
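The population welfare of an \(X_1\)-threshold rule can be worked out by hand for this DGP, which helps interpret the grid search. Taking \(X_1 \sim \text{Uniform}(0, 1)\), \(\tau(x) = 1 + 2x_1\), and \(c = 2\) as in the code above (the notation \(W(t)\) is introduced here):

\[ W(t) = \mathbb{E}\big[\mathbb{1}\{X_1 > t\}\,(\tau(X) - c)\big] = \int_t^1 (2x_1 - 1)\,dx_1 = t - t^2. \]

This is maximised at \(t = 0.5\), where \(W(0.5) = 0.25\), and treating everyone gives \(W(0) = 0\). The curve is flat near its peak, with \(W(0.45) = 0.2475\), so a neighbouring grid point can win in a finite sample at almost no welfare cost. The printed values near 0.30 are sample estimates built from noisy DR scores, so they need not equal the population value of 0.25.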

One number above deserves a warning. The \(\hat\tau\)-rule “evaluated” with the same DR scores that selected it appears to beat even the oracle rule — which is impossible in expectation. Selecting units whose noisy score is high and then averaging those same scores is optimistic by construction. In a simulation we can evaluate the rule against the true \(\tau(x)\) instead (the last line): the honest value is positive — better than treating everyone — but well below the oracle, because the noisy CATE estimates misclassify many units near the threshold. Both lessons matter: never evaluate a rule on the scores that chose it (in real data, estimate the rule and evaluate its welfare on separate folds), and do not expect an estimated rule to attain oracle welfare. Here the simple \(X_1\)-threshold rule, which searches a small one-dimensional family, gets essentially the oracle value — restricting the policy class is a form of regularisation.
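The optimism from reusing scores can be isolated in a few lines. This is a stylised sketch, not the chapter's estimator: `score_select` and `score_eval` are synthetic noisy scores standing in for DR scores from two independent folds, and every name is hypothetical. The rule is chosen from `score_select` and then evaluated three ways.

```julia
using Random, Statistics

Random.seed!(5)
n_demo    = 50_000
cost_demo = 2.0
tau_true  = 1 .+ 2 .* rand(n_demo)

# Two independent noisy scores, as if computed on separate folds
score_select = tau_true .+ 2 .* randn(n_demo)
score_eval   = tau_true .+ 2 .* randn(n_demo)

treat = score_select .> cost_demo             # rule chosen from the first score

w_same   = mean(treat .* (score_select .- cost_demo))   # optimistic: reuses the selection noise
w_honest = mean(treat .* (score_eval .- cost_demo))     # fresh noise: unbiased for this rule
w_true   = mean(treat .* (tau_true .- cost_demo))       # simulation-only benchmark
w_oracle = mean((tau_true .> cost_demo) .* (tau_true .- cost_demo))

println("same scores = ", w_same, ", honest = ", w_honest,
        ", truth = ", w_true, ", oracle = ", w_oracle)
```

The same-score value exceeds even the oracle welfare, while the honest evaluation tracks the rule's true welfare. Honest evaluation does not make the rule better; it only reports its value without the selection bias.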

For more on policy trees with IPW and AIPW losses (using R’s policytree package), see the companion blog chapter on policytree. For a cross-software comparison of CATE estimators (including Stata 19’s new cate command), see the Stata CATE blog chapter.

9.13 Causal forests with panel data

With panel data, unit effects can be correlated with treatment and covariates. A cross-sectional meta-learner can then be biased. The fixed effect adjustment is to demean by unit before applying the learner.

Random.seed!(2024)
n_firms = 200
n_t     = 5
N       = n_firms * n_t

firm_id  = repeat(1:n_firms, inner = n_t)
unit_fe  = randn(n_firms) .* 1.5
V1_firm  = randn(n_firms) .* 1.0
V1_panel = V1_firm[firm_id] .+ 0.3 .* randn(N)
# The unit effect drives BOTH treatment and the outcome (classic
# fixed-effect confounding); unit_fe is unobserved to the learner.
W_panel  = Float64.(rand(N) .<
           @. 1 / (1 + exp(-(0.3 * V1_firm[firm_id] + 0.5 * unit_fe[firm_id]))))
tau_panel = @. 0.5 + 1.0 * V1_panel
Y_panel  = unit_fe[firm_id] .+ V1_panel .+ tau_panel .* W_panel .+ randn(N)

df_panel = DataFrame(firm = firm_id, V1 = V1_panel, W = W_panel, Y = Y_panel)
# Y(1) - Y(0) = tau_panel exactly, so the true ATE is mean(tau_panel).
true_panel_ate = mean(tau_panel)
@printf("True panel ATE: %.3f\n", true_panel_ate)

True panel ATE: 0.655

A naive T-learner ignores the firm fixed effects. Because unit_fe raises both the treatment probability and the outcome, and is not in the learner’s covariates, the naive estimate is badly biased upward:

X_panel_naive = reshape(df_panel.V1, N, 1)
idx_T_p = df_panel.W .== 1
idx_C_p = df_panel.W .== 0

mu1_naive = rf_fit_predict(X_panel_naive[idx_T_p, :], df_panel.Y[idx_T_p],
                            X_panel_naive)
mu0_naive = rf_fit_predict(X_panel_naive[idx_C_p, :], df_panel.Y[idx_C_p],
                            X_panel_naive)
tau_naive = mu1_naive .- mu0_naive

@printf("Naive panel T-learner ATE: %.3f  (true = %.3f)\n",
        mean(tau_naive), true_panel_ate)

Naive panel T-learner ATE: 2.014  (true = 0.655)

The within transformation removes firm-level variation and leaves the within-firm variation:

# Within transformation: subtract firm means
df_dm = combine(groupby(df_panel, :firm),
                :Y  => (y -> y .- mean(y)) => :Y_dm,
                :W  => (w -> w .- mean(w)) => :W_dm,
                :V1 => (v -> v .- mean(v)) => :V1_dm)

X_dm = reshape(df_dm.V1_dm, N, 1)
idx_T_dm = df_dm.W_dm .> 0
idx_C_dm = df_dm.W_dm .<= 0

mu1_dm = rf_fit_predict(X_dm[idx_T_dm, :], df_dm.Y_dm[idx_T_dm], X_dm)
mu0_dm = rf_fit_predict(X_dm[idx_C_dm, :], df_dm.Y_dm[idx_C_dm], X_dm)
tau_dm = mu1_dm .- mu0_dm

# The two demeaned groups differ in W_dm by less than a full 0 -> 1 switch
# (treated-above-average vs below-average within firm), so the raw contrast
# is attenuated; rescale by the W_dm gap, as in a Wald estimator.
w_gap = mean(df_dm.W_dm[idx_T_dm]) - mean(df_dm.W_dm[idx_C_dm])
@printf("Within-transform T-learner ATE: %.3f  (true = %.3f)\n",
        mean(tau_dm) / w_gap, true_panel_ate)

Within-transform T-learner ATE: 0.689  (true = 0.655)

The within transformation removes the fixed effects, and the rescaled contrast lands near the truth in this DGP. The cost is variance, because identification now comes from within-firm variation.
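The mechanics of the within transformation are easiest to see with a constant treatment effect. The sketch below is a simplified, standard-library-only illustration with its own synthetic panel (a constant effect of 1.0, hypothetical names throughout, and a bivariate OLS slope instead of a forest). It compares the pooled slope with the slope after demeaning by unit.

```julia
using Random, Statistics

Random.seed!(9)
n_units, n_periods = 4_000, 5
unit  = repeat(1:n_units, inner = n_periods)
alpha = 1.5 .* randn(n_units)                 # unobserved unit effect
a     = alpha[unit]                           # unit effect expanded to panel rows
w     = Float64.(rand(length(unit)) .< 1 ./ (1 .+ exp.(-0.5 .* a)))
y     = a .+ 1.0 .* w .+ randn(length(unit))  # constant true effect of 1.0

# Subtract each unit's own mean from the vector `v` (group labels in `g`)
function demean_by(v, g, n_groups)
    sums = zeros(n_groups)
    counts = zeros(Int, n_groups)
    for (vi, gi) in zip(v, g)
        sums[gi] += vi
        counts[gi] += 1
    end
    return v .- (sums ./ counts)[g]
end

slope(x, z) = cov(x, z) / var(x)              # bivariate OLS slope of z on x

pooled = slope(w, y)                          # contaminated by the unit effect
within = slope(demean_by(w, unit, n_units), demean_by(y, unit, n_units))

println("pooled = ", pooled, ", within = ", within, " (true effect = 1.0)")
```

Units that are always treated or never treated have zero demeaned treatment and contribute nothing to the within estimate, which is one concrete source of the variance cost.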

A caveat on what this estimator targets. Splitting on \(W_{dm} > 0\) vs \(W_{dm} \le 0\) and dividing the T-learner contrast by the average \(W_{dm}\) gap is not a general fixed-effect CATE estimator. It is a heuristic within-firm contrast whose target depends on the distribution of demeaned treatment values and on the grouping rule, and the rescaling is a Wald-style approximation rather than an identified within estimand. It works here because the treatment effect is linear in \(V_1\) and the DGP is benign. Identifying heterogeneous panel effects in general requires an orthogonal score or a dedicated panel causal-forest / DML construction; treat this section as intuition, not a turnkey method.

For R’s grf::causal_forest, pass clusters = firm_id so the honest sample split keeps each firm’s observations together. Julia does not currently have an equivalent; see the companion blog chapter on causal forests in panel data and the R companion to this chapter for the GRF-based workflow.

9.14 Summary

Meta-learners turn regression tools into CATE estimators.
In Julia they can be built with MLJ random forests, but there is no full grf equivalent yet.
BLP and GATES summarize heterogeneity in regression-style output.
Policy learning turns CATE estimates into treatment rules.
In applied work, compare at least two CATE estimators and report simple heterogeneity summaries, not only a CATE scatterplot.