7 Matching Estimators

using DataFrames
using GLM
using Statistics
using Random
using LinearAlgebra
using Printf
using CairoMakie

Matching is a way to make treated and control observations more comparable. For each treated observation, we look for controls with similar covariates. Then we compare outcomes in this matched sample.

Matching does not solve endogeneity. It only addresses selection on observables: if treatment is selected on unobserved variables, matching will not fix the problem. Its practical value is that it makes overlap, or the lack of it, visible.

Julia does not yet have the same matching ecosystem as R’s MatchIt, cobalt, and WeightIt. Here I implement the basic pieces directly: propensity-score matching, balance diagnostics, IPW, and entropy balancing. For full matching, CEM, CBPS, energy balancing, and better balance plots, see the R companion chapter.

Related reading: the “Matching and Weighting Part 1” and “Treatment effects and matching” chapters of Topics on Econometrics and Causal Inference walk through matching with MatchIt and Stata’s teffects commands, respectively.

7.1 Assumptions

Matching needs the same assumptions as other selection-on-observables methods:

SUTVA — no interference, no hidden treatment versions.
Ignorability — conditional on \(X\), \(D \perp (Y(0), Y(1))\).
Overlap — every \(x\) has positive probability of both treatments.
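
Taken together, these assumptions are what allow matched controls to stand in for the missing counterfactual of the treated. A short derivation in the notation above (a sketch of the standard argument, not specific to any one estimator):

\[ E[Y(0) \mid D = 1] = E\big[\, E[Y(0) \mid D = 1, X] \;\big|\; D = 1 \big] = E\big[\, E[Y(0) \mid D = 0, X] \;\big|\; D = 1 \big] = E\big[\, E[Y \mid D = 0, X] \;\big|\; D = 1 \big]. \]

The first equality is the law of iterated expectations, the second uses ignorability, and the third uses SUTVA (for controls, the observed \(Y\) equals \(Y(0)\)). Overlap guarantees that \(E[Y \mid D = 0, X = x]\) is defined at every \(x\) where treated units occur. The ATT is therefore \(E[Y \mid D = 1] - E\big[\, E[Y \mid D = 0, X] \mid D = 1 \big]\), which is what matching approximates by pairing each treated unit with controls at similar \(X\).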

7.2 A simulated example

Random.seed!(42)
n  = 2000
X1 = randn(n)
X2 = randn(n)

# Treatment depends on covariates (confounding)
ps_true = @. 1 / (1 + exp(-(-0.5 + 0.6 * X1 + 0.4 * X2)))
D       = Float64.(rand(n) .< ps_true)

# Outcome: true treatment effect = 1, plus confounding through X1, X2
Y = @. 1.0 * D + 0.8 * X1 + 0.6 * X2 + 0.5 * randn()

df = DataFrame(Y = Y, D = D, X1 = X1, X2 = X2)
@printf("Naive difference in means: %.3f\n", mean(Y[D .== 1]) - mean(Y[D .== 0]))
@printf("True ATE: 1.000\n")

Naive difference in means: 1.713
True ATE: 1.000

7.3 Estimating the propensity score

ps_fit = glm(@formula(D ~ X1 + X2), df, Binomial(), LogitLink())
df.ps  = predict(ps_fit)

@printf("PS range: [%.3f, %.3f]\n", minimum(df.ps), maximum(df.ps))

PS range: [0.060, 0.927]
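
The overall range of the score does not show whether the two groups share it. A minimal sketch of a common-support check follows; `overlap_summary` is a hypothetical helper (not part of the chapter's code), and the scores are made-up toy values.

```julia
using Statistics

# Hypothetical helper: summarise how far the propensity-score
# distributions of treated and control units overlap.
function overlap_summary(ps::AbstractVector, d::AbstractVector)
    ps_t = ps[d .== 1]
    ps_c = ps[d .== 0]
    # Common support: scores that lie inside the range of both groups
    lo = max(minimum(ps_t), minimum(ps_c))
    hi = min(maximum(ps_t), maximum(ps_c))
    off_support = (ps .< lo) .| (ps .> hi)
    (lo = lo, hi = hi,
     share_off = mean(off_support),                      # share of all units off support
     n_treated_off = count(off_support .& (d .== 1)))    # treated units off support
end

# Toy scores, made up for illustration
ps_demo = [0.10, 0.20, 0.35, 0.50, 0.30, 0.55, 0.70, 0.95]
d_demo  = [0, 0, 0, 0, 1, 1, 1, 1]
s = overlap_summary(ps_demo, d_demo)
println(s)
```

With the chapter's objects the call would be `overlap_summary(df.ps, df.D)`. Treated units outside the common support have no comparable control, so any match they receive is an extrapolation.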

7.4 Nearest-neighbour 1:1 matching

"""
    nearest_neighbor_match(ps_treated, ps_control)
Return a vector of indices into the control group, matching each treated
unit to its nearest-neighbour control on PS. Sampling is without replacement.
"""
function nearest_neighbor_match(ps_treated, ps_control)
    n_t = length(ps_treated)
    avail = trues(length(ps_control))
    matches = zeros(Int, n_t)
    for (i, p) in enumerate(ps_treated)
        candidates = findall(avail)
        if isempty(candidates)
            break
        end
        diffs = abs.(ps_control[candidates] .- p)
        best  = candidates[argmin(diffs)]
        matches[i] = best
        avail[best] = false
    end
    matches
end

idx_T = findall(df.D .== 1)
idx_C = findall(df.D .== 0)

matched_C = nearest_neighbor_match(df.ps[idx_T], df.ps[idx_C])
keep          = matched_C .> 0              # treated units that received a match
matched_idx_C = idx_C[matched_C[keep]]
matched_idx_T = idx_T[keep]

@printf("Matched %d treated to %d controls\n",
        length(matched_idx_T), length(matched_idx_C))

Matched 818 treated to 818 controls
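
Greedy matching without replacement depends on the order in which treated units are processed, because an early unit can take the control that a later unit needed more. The sketch below illustrates this with made-up scores; `greedy_match` is a hypothetical compact rewrite of the same greedy logic, included only so the example is self-contained.

```julia
# Hypothetical compact version of greedy 1:1 matching without replacement,
# written only to show that the result depends on the order of treated units.
function greedy_match(ps_treated, ps_control)
    avail = trues(length(ps_control))
    matches = zeros(Int, length(ps_treated))
    for (i, p) in enumerate(ps_treated)
        any(avail) || break                       # no controls left
        candidates = findall(avail)
        best = candidates[argmin(abs.(ps_control[candidates] .- p))]
        matches[i] = best
        avail[best] = false                       # each control is used once
    end
    matches
end

ps_c = [0.42, 0.90]                       # toy control scores (made up)
m1 = greedy_match([0.50, 0.40], ps_c)     # 0.50 goes first and takes the 0.42 control
m2 = greedy_match([0.40, 0.50], ps_c)     # 0.40 goes first and takes the 0.42 control

# Total propensity-score distance across matched pairs
total_dist(ps_t, m) = sum(abs.(ps_t .- ps_c[m]))
println(total_dist([0.50, 0.40], m1))
println(total_dist([0.40, 0.50], m2))
```

The same two treated units end up with a different total distance depending on who goes first. A common convention is to process treated units from the largest propensity score down, since those units typically have the fewest close controls.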

7.5 Balance diagnostics

The standardized mean difference before and after matching tells us whether the matched groups are balanced:

function smd(x_t, x_c)
    (mean(x_t) - mean(x_c)) / sqrt((var(x_t) + var(x_c)) / 2)
end

@printf("%-15s %12s %12s\n", "Covariate", "SMD before", "SMD after")
for v in [:X1, :X2]
    smd_before = smd(df[!, v][df.D .== 1], df[!, v][df.D .== 0])
    smd_after  = smd(df[!, v][matched_idx_T], df[!, v][matched_idx_C])
    @printf("%-15s %12.3f %12.3f\n", String(v), smd_before, smd_after)
end

Covariate         SMD before    SMD after
X1                     0.614        0.270
X2                     0.394        0.152

Absolute standardized mean differences below 0.1 are usually treated as acceptable. Here both covariates remain above that threshold after matching, so the matching is not doing enough.

7.6 Estimating the ATT on matched data

On the matched sample, the ATT estimate is the simple difference in mean outcomes:

att = mean(df.Y[matched_idx_T]) - mean(df.Y[matched_idx_C])
@printf("Matched ATT: %.3f  (true ATE = 1.000)\n", att)

# Standard error from the within-pair outcome differences
# (treats the pairs as fixed; ignores that the propensity score was estimated)
diffs = df.Y[matched_idx_T] .- df.Y[matched_idx_C]
se    = std(diffs) / sqrt(length(diffs))
@printf("Paired-diff SE: %.3f\n", se)
@printf("95%% CI: [%.3f, %.3f]\n", att - 1.96 * se, att + 1.96 * se)

Matched ATT: 1.287  (true ATE = 1.000)
Paired-diff SE: 0.032
95% CI: [1.225, 1.350]

The matched estimate is still visibly biased relative to the true effect of 1.0. The balance table above shows why: with 818 treated units and only 1,182 controls, 1:1 matching without replacement exhausts the good controls, and the post-matching standardized mean differences (0.27 and 0.15) remain above the 0.1 threshold. Residual imbalance means residual confounding. The with-replacement, IPW, and entropy-balancing estimates below, which are not constrained to use each control at most once, all land much closer to the truth. The paired-difference standard error is the natural first check on precision for matched pairs, but it says nothing about bias: the 95% interval above excludes the true value.

7.7 Matching with replacement

With replacement, the same control can be used several times. This can improve balance when good controls are scarce, but reusing controls reduces the effective sample size and therefore precision.

function nn_match_replace(ps_treated, ps_control)
    [argmin(abs.(ps_control .- p)) for p in ps_treated]
end

matched_repl = nn_match_replace(df.ps[idx_T], df.ps[idx_C])
@printf("Average reuse: %.2f (1.0 = no replacement)\n",
        length(matched_repl) / length(unique(matched_repl)))

att_repl = mean(df.Y[idx_T]) - mean(df.Y[idx_C[matched_repl]])
@printf("Matched-with-replacement ATT: %.3f\n", att_repl)

Average reuse: 1.76 (1.0 = no replacement)
Matched-with-replacement ATT: 0.988
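
The cost of reusing controls can be put on a sample-size scale. The sketch below assumes Kish's effective sample size, \((\sum w)^2 / \sum w^2\), applied to the number of times each control is used; `kish_ess` and `match_counts` are hypothetical helper names, and the match vector is a made-up toy example.

```julia
# Kish effective sample size: (Σw)² / Σw². Equals n when all n weights are equal.
kish_ess(w) = sum(w)^2 / sum(abs2, w)

# Hypothetical helper: how many times each control index appears among the matches.
function match_counts(matches::AbstractVector{<:Integer}, n_control::Integer)
    counts = zeros(Int, n_control)
    for j in matches
        counts[j] += 1
    end
    counts
end

# Toy example (made up): 6 treated units matched into a pool of 5 controls
matches_demo = [1, 1, 1, 2, 3, 3]
counts = match_counts(matches_demo, 5)
println("uses per control: ", counts)
println("distinct controls used: ", count(>(0), counts))
println("effective control sample size: ", kish_ess(counts))
```

With the chapter's objects, `match_counts(matched_repl, length(idx_C))` would give the implied weights on the controls. The same `kish_ess` function applies to any weight vector, including IPW or entropy-balancing weights.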

7.8 Matching vs weighting

Inverse probability weighting (IPW) is very close to matching in spirit. Instead of dropping or pairing observations, we reweight each observation by the inverse probability of receiving its observed treatment.

# ATE weights: treated units get 1/ps, controls get 1/(1-ps). The estimator
# below normalises the weights within each arm (the Hajek form of
# Horvitz-Thompson). It targets the ATE, not the ATT; with a homogeneous +1
# effect here the two coincide numerically.
w = ifelse.(df.D .== 1, 1 ./ df.ps, 1 ./ (1 .- df.ps))
# Cap weights at their 99th percentile to limit the influence of extreme scores
w_trim = min.(w, quantile(w, 0.99))

ate_ipw = sum(df.Y[df.D .== 1] .* w_trim[df.D .== 1]) / sum(w_trim[df.D .== 1]) -
          sum(df.Y[df.D .== 0] .* w_trim[df.D .== 0]) / sum(w_trim[df.D .== 0])
@printf("IPW ATE:               %.3f\n", ate_ipw)
@printf("Matched ATT (NN):      %.3f\n", att)
@printf("Matched ATT (replace): %.3f\n", att_repl)
@printf("True ATE:              1.000\n")

IPW ATE:               1.064
Matched ATT (NN):      1.287
Matched ATT (replace): 0.988
True ATE:              1.000

When IPW and matching disagree sharply, it is usually a warning about the propensity-score model or lack of overlap.
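
One way to investigate such a disagreement is to check balance in the weighted sample, just as for the matched sample. The sketch below is a minimal weighted version of the standardized mean difference; `wmean` and `weighted_smd` are hypothetical helpers, the data are made up, and the choice to standardize by the unweighted pooled standard deviation is an assumption (a convention some balance tools follow so that the scale does not move with the weights).

```julia
using Statistics

# Hypothetical helpers for checking covariate balance after weighting.
wmean(x, w) = sum(w .* x) / sum(w)

function weighted_smd(x, d, w)
    t = d .== 1
    c = d .== 0
    # Denominator: unweighted pooled SD (assumed convention, see lead-in)
    pooled_sd = sqrt((var(x[t]) + var(x[c])) / 2)
    (wmean(x[t], w[t]) - wmean(x[c], w[c])) / pooled_sd
end

# Toy data, made up for illustration
x_demo = [2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
d_demo = [1, 1, 1, 0, 0, 0]
w_unit = ones(6)                              # no weighting
w_demo = [1.0, 1.0, 1.0, 1.0, 2.0, 5.0]       # upweight controls that resemble the treated

println(weighted_smd(x_demo, d_demo, w_unit))
println(weighted_smd(x_demo, d_demo, w_demo))
```

With the chapter's objects the call would be `weighted_smd(df.X1, df.D, w_trim)`. If the weighted differences stay well above 0.1, the propensity-score model is a likely suspect.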

7.9 Entropy balancing

Entropy balancing chooses weights so that the weighted control group has the same covariate means as the treated group. It does not require a propensity-score model. R’s WeightIt package implements this and related methods. In Julia, the basic optimization is short enough to write directly.

The entropy-balancing problem for the ATT: find weights \(w_i\) on the control units that solve

\[ \min_{w_i \geq 0} \sum_i w_i \log w_i \quad \text{subject to} \quad \sum_i w_i x_{ik} = \bar x_{k}^{\text{treated}} \;\; \forall k, \quad \sum_i w_i = 1. \]

The dual problem has one parameter per balance constraint:

\[ \min_{\lambda} \;\; \log\left(\sum_i \exp(-x_i' \lambda)\right) + \lambda' \bar x^{\text{treated}}. \]

This is a small convex problem, so Newton’s method works well here.
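
For reference, the quantities that Newton’s method needs follow directly from the dual objective \(L(\lambda)\) above. Given \(\lambda\), the implied weights are

\[ w_i(\lambda) = \frac{\exp(-x_i' \lambda)}{\sum_j \exp(-x_j' \lambda)}, \]

and, writing \(\bar x_w = \sum_i w_i(\lambda)\, x_i\) for the weighted control mean, the gradient and Hessian are

\[ \nabla L(\lambda) = \bar x^{\text{treated}} - \bar x_w, \qquad \nabla^2 L(\lambda) = \sum_i w_i(\lambda)\, (x_i - \bar x_w)(x_i - \bar x_w)'. \]

The gradient is zero exactly when the balance constraints hold, and the Hessian is a weighted covariance matrix, hence positive semidefinite. Each Newton iteration is \(\lambda \leftarrow \lambda - [\nabla^2 L(\lambda)]^{-1} \nabla L(\lambda)\).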

"""
    entropy_balance(X_control, X_treated; tol=1e-8, maxiter=50)
Compute entropy-balancing weights on the control units so that the
weighted control mean of every covariate equals the treated mean.
Returns a length-`size(X_control, 1)` vector of normalised weights that
sum to 1.
"""
function entropy_balance(X_control::AbstractMatrix, X_treated::AbstractMatrix;
                          tol::Float64 = 1e-8, maxiter::Int = 50)
    target = vec(mean(X_treated, dims=1))   # length p target means
    Xc = X_control                          # n_c × p
    p  = size(Xc, 2)
    λ  = zeros(p)

    for _ in 1:maxiter
        z = -Xc * λ
        m = maximum(z)
        w = exp.(z .- m)
        w ./= sum(w)                                    # normalised weights
        mean_w = vec(w' * Xc)                           # weighted control mean
        grad   = target .- mean_w                       # gradient of the dual at λ
        if maximum(abs.(grad)) < tol
            break
        end
        # Hessian: Σ w_i (x_i - mean_w)(x_i - mean_w)'  (with tiny ridge for stability)
        Xc_centered = Xc .- reshape(mean_w, 1, :)
        H = (Xc_centered .* w)' * Xc_centered + 1e-6 * I
        # Newton step on the convex dual L(λ)=log Σexp(-xᵢλ)+λ·target,
        # whose gradient is `grad`=target-mean_w and Hessian is H: λ ← λ - H⁻¹ grad
        step = H \ grad
        λ .-= step
    end

    z = -Xc * λ
    m = maximum(z)
    w = exp.(z .- m)
    w ./ sum(w)
end

# Apply to the simulated example from earlier sections
Xc_mat = Matrix(df[df.D .== 0, [:X1, :X2]])
Xt_mat = Matrix(df[df.D .== 1, [:X1, :X2]])

w_ebal = entropy_balance(Xc_mat, Xt_mat)
@printf("Sum of weights: %.6f  (should be 1.0)\n", sum(w_ebal))
@printf("Max weight: %.4f, min weight: %.6f\n",
        maximum(w_ebal), minimum(w_ebal))

# Verify the mean-balance constraints
control_means_unweighted = mean(Xc_mat, dims=1)
control_means_balanced   = w_ebal' * Xc_mat
treated_means            = mean(Xt_mat, dims=1)

@printf("\n%-15s %12s %12s %12s\n",
        "Covariate", "Treated", "Control (raw)", "Control (ebal)")
for (j, name) in enumerate([:X1, :X2])
    @printf("%-15s %12.4f %12.4f %12.4f\n",
            String(name),
            treated_means[j], control_means_unweighted[j],
            control_means_balanced[j])
end

Sum of weights: 1.000000  (should be 1.0)
Max weight: 0.0095, min weight: 0.000079

Covariate            Treated Control (raw) Control (ebal)
X1                    0.3117      -0.2801       0.3117
X2                    0.2395      -0.1541       0.2395

The weighted control means equal the treated means because the optimization imposes that constraint. If overlap is poor, the weights can become very uneven.

The ATT estimate uses these weights on the control outcomes:

Yt = df.Y[df.D .== 1]
Yc = df.Y[df.D .== 0]

# ATT = mean(Y_treated) - weighted mean(Y_control)
att_ebal = mean(Yt) - dot(w_ebal, Yc)
@printf("Entropy-balanced ATT: %.3f  (true ATE = 1.000)\n", att_ebal)

Entropy-balanced ATT: 1.014  (true ATE = 1.000)

Entropy balancing is useful when mean balance is the main diagnostic. If we need balance over whole covariate distributions, energy balancing in R’s WeightIt is a better tool. I am not aware of a Julia equivalent yet.

7.10 When to use which method

Method	When to use	Limitation
NN matching (no replacement)	Simple, intuitive; ATT estimand	Drops unmatched units
NN matching (with replacement)	Good controls are scarce relative to treated units	Reuses controls, so the effective sample size shrinks
IPW	Good propensity model	Sensitive to extreme weights
Entropy balancing	Want exact mean balance; small covariate set	Balances means only, not distributions
Doubly robust (AIPW)	Combine matching/IPW with regression	More complex

For applied work in Julia, IPW or AIPW is usually more practical than 1:1 matching. Julia has good regression tools, but not yet a full matching toolkit. For a matched analysis with many diagnostics, I would still use R’s MatchIt and cobalt.
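
Since AIPW appears in the table but not in the code above, here is a minimal sketch of the estimator. It is an illustration under stated assumptions rather than a full implementation: `aipw_ate` is a hypothetical helper, the data are a fresh simulation in the spirit of the example (so the numbers will not match the earlier output), the true propensity score is used in place of an estimated one for brevity, and the outcome models are plain least-squares fits within each arm.

```julia
using Random, Statistics, LinearAlgebra

# AIPW (doubly robust) estimate of the ATE from fitted values:
#   ps  = P(D = 1 | X), mu1 / mu0 = predicted outcome under treatment / control
function aipw_ate(y, d, ps, mu1, mu0)
    # Regression prediction plus an inverse-probability-weighted residual correction
    psi = @. mu1 - mu0 + d * (y - mu1) / ps - (1 - d) * (y - mu0) / (1 - ps)
    (ate = mean(psi), se = std(psi) / sqrt(length(psi)))
end

# Simulated data: true effect = 1, confounding through x1 and x2
rng = MersenneTwister(1)
n   = 5000
x1  = randn(rng, n)
x2  = randn(rng, n)
ps  = @. 1 / (1 + exp(-(-0.5 + 0.6 * x1 + 0.4 * x2)))   # true score (assumption: known)
d   = Float64.(rand(rng, n) .< ps)
y   = 1.0 .* d .+ 0.8 .* x1 .+ 0.6 .* x2 .+ 0.5 .* randn(rng, n)

# Outcome regressions by least squares, fitted separately in each arm
X  = hcat(ones(n), x1, x2)
b1 = X[d .== 1, :] \ y[d .== 1]
b0 = X[d .== 0, :] \ y[d .== 0]

res = aipw_ate(y, d, ps, X * b1, X * b0)
println(res)
```

In the chapter's workflow, `predict(ps_fit)` would replace the true score, and the outcome fits could come from `lm` in GLM. The estimator stays consistent if either the propensity model or the outcome model is correctly specified, which is the sense in which it is doubly robust.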

7.11 Summary

Matching is for selection on observables. It does not fix unobserved confounding.
The main diagnostic is balance, not the treatment-effect coefficient.
Propensity-score nearest-neighbour matching is easy to implement in Julia, but full matching and CEM are better handled in R.
IPW and entropy balancing are often more practical Julia workflows.