7  Matching Estimators

using DataFrames
using GLM
using Statistics
using Random
using LinearAlgebra
using Printf
using CairoMakie

Matching is the oldest and most-taught approach to observational causal inference. For each treated unit, find a “twin” control unit with similar covariates, then compare outcomes within matched pairs. With enough good matches, the comparison approximates a randomised experiment.

The mature implementation in R (MatchIt, cobalt, WeightIt) has no direct Julia counterpart. This chapter implements the core ideas inline — propensity score matching, balance diagnostics, weighted estimation, entropy balancing — using GLM.jl, Optim.jl, and base Julia. For optimal matching, full matching, coarsened exact matching, and additional weighting variants (CBPS, energy balancing) with mature tooling, see the R companion chapter.

Related reading: The Matching and Weighting Part 1 and Treatment effects and matching chapters of Topics on Econometrics and Causal Inference walk through matching with MatchIt and Stata’s teffects commands respectively.

7.1 Assumptions

Matching identifies a causal effect under three assumptions:

  1. SUTVA — no interference, no hidden treatment versions.
  2. Ignorability — conditional on \(X\), \(D \perp (Y(0), Y(1))\).
  3. Overlap — every \(x\) has positive probability of both treatments.

7.2 A simulated example

Random.seed!(42)
n  = 2000
X1 = randn(n)
X2 = randn(n)

# Treatment depends on covariates (confounding)
ps_true = @. 1 / (1 + exp(-(-0.5 + 0.6 * X1 + 0.4 * X2)))
D       = Float64.(rand(n) .< ps_true)

# Outcome: true treatment effect = 1, plus confounding through X1, X2
Y = @. 1.0 * D + 0.8 * X1 + 0.6 * X2 + 0.5 * randn()

df = DataFrame(Y = Y, D = D, X1 = X1, X2 = X2)
@printf("Naive difference in means: %.3f\n", mean(Y[D .== 1]) - mean(Y[D .== 0]))
@printf("True ATE: 1.000\n")

7.3 Estimating the propensity score

ps_fit = glm(@formula(D ~ X1 + X2), df, Binomial(), LogitLink())
df.ps  = predict(ps_fit)

@printf("PS range: [%.3f, %.3f]\n", minimum(df.ps), maximum(df.ps))

7.4 Nearest-neighbour 1:1 matching

"""
    nearest_neighbor_match(ps_treated, ps_control)
Return a vector of indices into the control group, matching each treated
unit to its nearest-neighbour control on PS. Sampling is without replacement.
"""
function nearest_neighbor_match(ps_treated, ps_control)
    n_t = length(ps_treated)
    avail = trues(length(ps_control))
    matches = zeros(Int, n_t)
    for (i, p) in enumerate(ps_treated)
        candidates = findall(avail)
        if isempty(candidates)
            break
        end
        diffs = abs.(ps_control[candidates] .- p)
        best  = candidates[argmin(diffs)]
        matches[i] = best
        avail[best] = false
    end
    matches
end

idx_T = findall(df.D .== 1)
idx_C = findall(df.D .== 0)

matched_C = nearest_neighbor_match(df.ps[idx_T], df.ps[idx_C])
matched_idx_C = idx_C[matched_C[matched_C .> 0]]
matched_idx_T = idx_T[1:length(matched_idx_C)]

@printf("Matched %d treated to %d controls\n",
        length(matched_idx_T), length(matched_idx_C))

7.5 Balance diagnostics

The standardised mean difference (SMD) before and after matching shows whether the matched groups are balanced:

function smd(x_t, x_c)
    (mean(x_t) - mean(x_c)) / sqrt((var(x_t) + var(x_c)) / 2)
end

@printf("%-15s %12s %12s\n", "Covariate", "SMD before", "SMD after")
for v in [:X1, :X2]
    smd_before = smd(df[!, v][df.D .== 1], df[!, v][df.D .== 0])
    smd_after  = smd(df[!, v][matched_idx_T], df[!, v][matched_idx_C])
    @printf("%-15s %12.3f %12.3f\n", String(v), smd_before, smd_after)
end

Standardised mean differences of magnitude < 0.1 after matching indicate adequate balance. Larger differences suggest the matching is insufficient and a more flexible method (e.g. Mahalanobis distance, or matching with replacement) is needed.

7.6 Estimating the ATT on matched data

After matching, the ATT is the simple difference in means:

att = mean(df.Y[matched_idx_T]) - mean(df.Y[matched_idx_C])
@printf("Matched ATT: %.3f  (true ATE = 1.000)\n", att)

# Standard error: paired-difference SE (treats matched pairs as paired)
diffs = df.Y[matched_idx_T] .- df.Y[matched_idx_C]
se    = std(diffs) / sqrt(length(diffs))
@printf("Paired-diff SE: %.3f\n", se)
@printf("95%% CI: [%.3f, %.3f]\n", att - 1.96 * se, att + 1.96 * se)

The matched estimate is close to the truth, and the paired-difference standard error is the right inferential approach for matched pairs.

7.7 Matching with replacement

When the treated group is larger or the propensity distributions diverge, allowing each control to be matched more than once improves balance at the cost of effective sample size:

function nn_match_replace(ps_treated, ps_control)
    [argmin(abs.(ps_control .- p)) for p in ps_treated]
end

matched_repl = nn_match_replace(df.ps[idx_T], df.ps[idx_C])
@printf("Average reuse: %.2f (1.0 = no replacement)\n",
        length(matched_repl) / length(unique(matched_repl)))

att_repl = mean(df.Y[idx_T]) - mean(df.Y[idx_C[matched_repl]])
@printf("Matched-with-replacement ATT: %.3f\n", att_repl)

7.8 Matching vs weighting

A close cousin of matching is inverse propensity weighting (IPW): weight treated units by \(1 / \hat e(X)\) and control units by \(1 / (1 - \hat e(X))\), then take the weighted difference in means:

w = ifelse.(df.D .== 1, 1 ./ df.ps, 1 ./ (1 .- df.ps))
w_trim = min.(w, quantile(w, 0.99))

att_ipw = sum(df.Y[df.D .== 1] .* w_trim[df.D .== 1]) / sum(w_trim[df.D .== 1]) -
          sum(df.Y[df.D .== 0] .* w_trim[df.D .== 0]) / sum(w_trim[df.D .== 0])
@printf("IPW ATT (≈ ATE):       %.3f\n", att_ipw)
@printf("Matched ATT (NN):      %.3f\n", att)
@printf("Matched ATT (replace): %.3f\n", att_repl)
@printf("True ATE:              1.000\n")

IPW and matching often agree closely. When they disagree, the divergence usually reflects either a misspecified propensity score (check with balance diagnostics) or an overlap problem.

7.9 Entropy balancing

Beyond IPW, the modern toolkit of weighting estimators includes entropy balancing (Hainmueller 2012), which yields weights with exact mean balance on every covariate by solving a small convex optimisation problem — no propensity-score model required. R’s WeightIt package implements this and several related methods (CBPS, energy balancing, SuperLearner-based weights); the Julia ecosystem currently lacks an equivalent package, but the core math is short enough to implement inline.

The entropy-balancing problem for the ATT: find weights \(w_i\) on the control units that solve

\[ \min_{w_i \geq 0} \sum_i w_i \log w_i \quad \text{subject to} \quad \sum_i w_i x_{ik} = \bar x_{k}^{\text{treated}} \;\; \forall k, \quad \sum_i w_i = N_{\text{treated}}. \]

Lagrangian duality reduces this to a low-dimensional convex problem in the Lagrange multipliers \(\lambda\) (one per balance constraint). The dual is:

\[ \min_{\lambda} \;\; \log\left(\sum_i \exp(-x_i' \lambda)\right) + \lambda' \bar x^{\text{treated}}. \]

The dual problem is small (one Lagrange multiplier per covariate) and strictly convex, so it can be solved with a few iterations of Newton’s method:

"""
    entropy_balance(X_control, X_treated; tol=1e-8, maxiter=50)
Compute entropy-balancing weights on the control units so that the
weighted control mean of every covariate equals the treated mean.
Returns a length-`size(X_control, 1)` vector of normalised weights that
sum to 1.
"""
function entropy_balance(X_control::AbstractMatrix, X_treated::AbstractMatrix;
                          tol::Float64 = 1e-8, maxiter::Int = 50)
    target = vec(mean(X_treated, dims=1))   # length p target means
    Xc = X_control                          # n_c × p
    p  = size(Xc, 2)
    λ  = zeros(p)

    for _ in 1:maxiter
        z = -Xc * λ
        m = maximum(z)
        w = exp.(z .- m)
        w ./= sum(w)                                    # normalised weights
        mean_w = vec(w' * Xc)                           # weighted control mean
        grad   = target .- mean_w                       # negative gradient
        if maximum(abs.(grad)) < tol
            break
        end
        # Hessian: Σ w_i (x_i - mean_w)(x_i - mean_w)'  (with tiny ridge for stability)
        Xc_centered = Xc .- reshape(mean_w, 1, :)
        H = Matrix((Xc_centered .* w)' * Xc_centered) + 1e-6 .* Matrix{Float64}(I, p, p)
        # Newton step: λ ← λ + H⁻¹ grad
        step = H \ grad
        λ .+= step
    end

    z = -Xc * λ
    m = maximum(z)
    w = exp.(z .- m)
    w ./ sum(w)
end

# Apply to the simulated example from earlier sections
Xc_mat = Matrix(df[df.D .== 0, [:X1, :X2]])
Xt_mat = Matrix(df[df.D .== 1, [:X1, :X2]])

w_ebal = entropy_balance(Xc_mat, Xt_mat)
@printf("Sum of weights: %.6f  (should be 1.0)\n", sum(w_ebal))
@printf("Max weight: %.4f, min weight: %.6f\n",
        maximum(w_ebal), minimum(w_ebal))

# Verify the mean-balance constraints
control_means_unweighted = mean(Xc_mat, dims=1)
control_means_balanced   = w_ebal' * Xc_mat
treated_means            = mean(Xt_mat, dims=1)

@printf("\n%-15s %12s %12s %12s\n",
        "Covariate", "Treated", "Control (raw)", "Control (ebal)")
for (j, name) in enumerate([:X1, :X2])
    @printf("%-15s %12.4f %12.4f %12.4f\n",
            String(name),
            treated_means[j], control_means_unweighted[j],
            control_means_balanced[j])
end

Notice that the entropy-balanced control means exactly equal the treated means — this is guaranteed by the construction. The cost is that the weights can be quite variable when the covariate distributions differ sharply (poor overlap).

The ATT estimate uses these weights on the control outcomes:

Yt = df.Y[df.D .== 1]
Yc = df.Y[df.D .== 0]

# ATT = mean(Y_treated) - weighted mean(Y_control)
att_ebal = mean(Yt) - dot(w_ebal, Yc)
@printf("Entropy-balanced ATT: %.3f  (true ATE = 1.000)\n", att_ebal)

Entropy balancing is particularly effective when balance on covariate means is the analyst’s primary diagnostic. For balance on full covariate distributions, R’s WeightIt package also implements energy balancing (Huling & Mak 2024); no Julia port currently exists.

7.10 When to use which method

Method When to use Limitation
NN matching (no replacement) Simple, intuitive; ATT estimand Drops unmatched units
NN matching (with replacement) Few treated, many controls Wasteful of controls
IPW Good propensity model Sensitive to extreme weights
Entropy balancing Want exact mean balance; small covariate set Balances means only, not distributions
Doubly robust (AIPW) Combine matching/IPW with regression More complex

For applied work in Julia, IPW + AIPW is usually the more practical default than 1:1 matching, because Julia’s regression and weighting infrastructure (GLM, Panelest) is mature whereas matching algorithms are not. For matched analyses, the R ecosystem (MatchIt, cobalt) is the workhorse — the companion R chapter walks through full matching, CEM, and balance diagnostics in detail.

7.11 Summary

  • Matching restricts comparisons to similar treated–control pairs, approximating a randomised experiment.
  • Propensity score nearest-neighbour matching is the simplest form and implementable in a few lines of Julia.
  • Balance diagnostics (standardised mean differences) verify the matching’s success.
  • Estimation: paired-difference SE on the matched data; regression adjustment on the matched sample (an even better default).
  • For richer matching (optimal, full, CEM) and mature balance tooling, use R’s MatchIt/cobalt ecosystem.