library(tidyverse)
library(MatchIt)
library(cobalt)
library(WeightIt)
library(marginaleffects)
7 Matching Estimators
Matching is the oldest and most-taught approach to observational causal inference. The idea is simple: for each treated unit, find a “twin” control unit with similar covariates, then compare outcomes within matched pairs. With enough good matches, the comparison approximates what we would have learned from a randomised experiment.
The implementation has many variants — nearest-neighbour matching, optimal matching, full matching, coarsened exact matching (CEM) — and the choice has real consequences. This chapter walks through the modern matching workflow using R’s MatchIt and companion cobalt packages, which together represent the de facto standard in applied econometrics and epidemiology.
Related reading: This chapter is based on the longer treatment in Matching and Weighting Part 1 of Topics on Econometrics and Causal Inference, which credits Noah Greifer and coauthors (the MatchIt/WeightIt package authors).
7.1 Assumptions
Matching identifies a causal effect under three assumptions:
- SUTVA — no interference, no hidden treatment versions.
- Ignorability (unconfoundedness) — conditional on \(X\), the treatment \(D\) is independent of the potential outcomes \((Y(0), Y(1))\).
- Overlap (positivity) — every value of \(X\) has positive probability of being both treated and untreated.
The motivating problem: even under ignorability, the observed distribution of \(X\) usually differs between treated and control. Regression adjustment implicitly extrapolates a parametric model across that imbalance; matching restricts the comparison to the region of overlap, which is more robust to model misspecification.
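Overlap can be probed before any matching is attempted. The following is a minimal sketch on simulated data, not part of the chapter's Lalonde analysis: the variables `x`, `d`, `ps` and the logistic propensity model are illustrative assumptions. It estimates a propensity score, finds the range of scores shared by both groups, and reports the share of units outside that range.

```r
# Simulated data: one confounder x drives treatment assignment (assumption)
set.seed(123)
n <- 500
x <- rnorm(n)
d <- rbinom(n, 1, plogis(-0.5 + 1.5 * x))

# Estimated propensity score from a logistic regression
ps <- fitted(glm(d ~ x, family = binomial()))

# Range of scores within each treatment group
ps_by_group <- split(ps, d)
range_control <- range(ps_by_group[["0"]])
range_treated <- range(ps_by_group[["1"]])

# Region of common support: scores observed in both groups
common <- c(max(range_control[1], range_treated[1]),
            min(range_control[2], range_treated[2]))

# Share of units whose score falls outside the common region
outside <- ps < common[1] | ps > common[2]
share_outside <- mean(outside)
share_outside
```

A large share outside the common range suggests that any estimator, matching or regression, would be leaning on extrapolation for those units.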
7.2 The Lalonde example
The canonical test case is the Lalonde (1986) job-training experiment. We use the observational version included in MatchIt, where the goal is to recover the experimental estimate of the effect of a training programme on 1978 earnings:
data("lalonde", package = "MatchIt")
glimpse(lalonde)
Rows: 614
Columns: 9
$ treat <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ age <int> 37, 22, 30, 27, 33, 22, 23, 32, 22, 33, 19, 21, 18, 27, 17, 1…
$ educ <int> 11, 9, 12, 11, 8, 9, 12, 11, 16, 12, 9, 13, 8, 10, 7, 10, 13,…
$ race <fct> black, hispan, black, black, black, black, black, black, blac…
$ married <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
$ nodegree <int> 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1…
$ re74 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ re75 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ re78 <dbl> 9930.0460, 3595.8940, 24909.4500, 7506.1460, 289.7899, 4056.4…
The treatment is treat; the outcome is re78 (1978 earnings); the covariates are age, educ, race, married, nodegree, re74, re75 (pre-treatment earnings).
Before matching, treated and control groups are badly imbalanced:
# Construct a pre-match MatchIt object for balance reporting
m.pre <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
data = lalonde, method = NULL,
distance = "glm")
bal.tab(m.pre, thresholds = c(m = 0.1))
Balance Measures
Type Diff.Un M.Threshold.Un
distance Distance 1.7941
age Contin. -0.3094 Not Balanced, >0.1
educ Contin. 0.0550 Balanced, <0.1
race_black Binary 0.6404 Not Balanced, >0.1
race_hispan Binary -0.0827 Balanced, <0.1
race_white Binary -0.5577 Not Balanced, >0.1
married Binary -0.3236 Not Balanced, >0.1
nodegree Binary 0.1114 Not Balanced, >0.1
re74 Contin. -0.7211 Not Balanced, >0.1
re75 Contin. -0.2903 Not Balanced, >0.1
Balance tally for mean differences
count
Balanced, <0.1 2
Not Balanced, >0.1 7
Variable with the greatest mean difference
Variable Diff.Un M.Threshold.Un
re74 -0.7211 Not Balanced, >0.1
Sample sizes
Control Treated
All 429 185
The unadjusted standardised mean differences (the “Diff.Un” column) exceed 0.1, the conventional threshold for adequate balance, on seven of the nine covariate rows.
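To see where these numbers come from, a standardised mean difference for a continuous covariate can be computed by hand. The sketch below assumes the ATT convention of dividing by the treated-group standard deviation, which is my understanding of what cobalt does for an ATT target. The helper name `smd_att` is hypothetical. For binary covariates the table reports raw differences in proportions instead.

```r
library(MatchIt)  # used here only for the lalonde data
data("lalonde", package = "MatchIt")

# Hypothetical helper: difference in means scaled by the treated-group SD
smd_att <- function(x, treat) {
  (mean(x[treat == 1]) - mean(x[treat == 0])) / sd(x[treat == 1])
}

smd_age  <- smd_att(lalonde$age,  lalonde$treat)
smd_re74 <- smd_att(lalonde$re74, lalonde$treat)
round(c(age = smd_age, re74 = smd_re74), 4)
```

Under that convention the two values should line up with the age and re74 rows of the balance table above.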
7.3 Distance measures
Matching needs a way to define “close.” Three common choices:
- Propensity score: estimate \(\hat e(x) = P(D = 1 \mid X = x)\) and use \(|\hat e(x_i) - \hat e(x_j)|\) as the distance. Reduces a multidimensional matching problem to a one-dimensional one.
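- Mahalanobis distance: \((x_i - x_j)' \Sigma^{-1} (x_i - x_j)\), computed on the raw covariates. Invariant to linear rescaling of the covariates, but it degrades as the number of covariates grows or when they are far from normally distributed; works well in low dimensions.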
- Hybrid: use Mahalanobis distance within propensity score calipers — the Rubin-Stuart “best of both” approach.
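The Mahalanobis formula above can be evaluated with base R's `mahalanobis()`, which returns the squared distance. This sketch uses simulated covariates (the names `age` and `income` are illustrative assumptions) and also checks a property that follows from the \(\Sigma^{-1}\) term: changing the units of a covariate leaves the distance unchanged.

```r
set.seed(1)
# Two simulated covariates on very different scales (assumption)
X <- cbind(age = rnorm(50, 30, 8), income = rnorm(50, 5000, 2000))

# Squared Mahalanobis distance between unit 2 and unit 1
d2 <- mahalanobis(X[2, ], center = X[1, ], cov = cov(X))

# Re-express income in thousands and recompute with the new covariance
X_rescaled <- X
X_rescaled[, "income"] <- X_rescaled[, "income"] / 1000
d2_rescaled <- mahalanobis(X_rescaled[2, ], center = X_rescaled[1, ],
                           cov = cov(X_rescaled))

# The two distances agree up to numerical error
same_distance <- isTRUE(all.equal(as.numeric(d2), as.numeric(d2_rescaled)))
same_distance
```

A plain Euclidean distance on the same two units would be dominated by income in the first scaling and by age in the second, which is the motivation for the covariance adjustment.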
7.4 Matching methods
7.4.1 Nearest-neighbour matching on a propensity score
m.nn <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
data = lalonde,
method = "nearest",
distance = "glm",
link = "linear.logit",
ratio = 1)
m.nn
A `matchit` object
- method: 1:1 nearest neighbor matching without replacement
- distance: Propensity score
- estimated with logistic regression and linearized
- number of obs.: 614 (original), 370 (matched)
- target estimand: ATT
- covariates: age, educ, race, married, nodegree, re74, re75
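To make the mechanics concrete, greedy 1:1 nearest-neighbour matching without replacement can be sketched in a few lines of base R. This is an illustration on simulated propensity scores, not MatchIt's implementation: the function `greedy_match` is hypothetical, and processing treated units from the largest score downward is one possible ordering choice.

```r
set.seed(42)
ps_treated <- runif(5, 0.4, 0.9)    # simulated scores for 5 treated units
ps_control <- runif(12, 0.1, 0.7)   # simulated scores for 12 controls

# Hypothetical helper: returns, for each treated unit, the index of its control
greedy_match <- function(ps_t, ps_c) {
  available <- rep(TRUE, length(ps_c))
  match_idx <- rep(NA_integer_, length(ps_t))
  # Treated units with the largest scores are usually hardest to match,
  # so they choose first
  for (i in order(ps_t, decreasing = TRUE)) {
    if (!any(available)) break
    d <- abs(ps_t[i] - ps_c)
    d[!available] <- Inf            # controls already used cannot be reused
    j <- which.min(d)
    match_idx[i] <- j
    available[j] <- FALSE
  }
  match_idx
}

pairs <- greedy_match(ps_treated, ps_control)
data.frame(treated = seq_along(ps_treated),
           control = pairs,
           distance = abs(ps_treated - ps_control[pairs]))
```

Because each choice is made without regard to later treated units, a greedy pass can leave some pairs far apart; optimal matching instead minimises the total distance over all pairs at once.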
After matching, the treated and control groups should be balanced:
bal.tab(m.nn, thresholds = c(m = 0.1))
Balance Measures
Type Diff.Adj M.Threshold
distance Distance 0.9192
age Contin. 0.0718 Balanced, <0.1
educ Contin. -0.1290 Not Balanced, >0.1
race_black Binary 0.3730 Not Balanced, >0.1
race_hispan Binary -0.1568 Not Balanced, >0.1
race_white Binary -0.2162 Not Balanced, >0.1
married Binary -0.0216 Balanced, <0.1
nodegree Binary 0.0703 Balanced, <0.1
re74 Contin. -0.0505 Balanced, <0.1
re75 Contin. -0.0257 Balanced, <0.1
Balance tally for mean differences
count
Balanced, <0.1 5
Not Balanced, >0.1 4
Variable with the greatest mean difference
Variable Diff.Adj M.Threshold
race_black 0.373 Not Balanced, >0.1
Sample sizes
Control Treated
All 429 185
Matched 185 185
Unmatched 244 0
Several covariates still exceed the 0.1 threshold: 1:1 nearest-neighbour matching on a logistic propensity score isn’t always enough. A more flexible matching method can help.
7.4.2 Full matching
Full matching places every unit in a subclass containing either one treated unit with one or more controls, or one control with one or more treated units, and chooses the subclasses to minimise the total within-subclass distance. It uses all observations and typically achieves better balance:
m.full <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
data = lalonde,
method = "full",
distance = "glm",
link = "probit")
bal.tab(m.full, thresholds = c(m = 0.1))
Balance Measures
Type Diff.Adj M.Threshold
distance Distance 0.0045 Balanced, <0.1
age Contin. 0.0393 Balanced, <0.1
educ Contin. -0.0956 Balanced, <0.1
race_black Binary 0.0043 Balanced, <0.1
race_hispan Binary 0.0103 Balanced, <0.1
race_white Binary -0.0146 Balanced, <0.1
married Binary 0.0259 Balanced, <0.1
nodegree Binary 0.0504 Balanced, <0.1
re74 Contin. -0.0009 Balanced, <0.1
re75 Contin. -0.0091 Balanced, <0.1
Balance tally for mean differences
count
Balanced, <0.1 10
Not Balanced, >0.1 0
Variable with the greatest mean difference
Variable Diff.Adj M.Threshold
educ -0.0956 Balanced, <0.1
Sample sizes
Control Treated
All 429. 185
Matched (ESS) 50.76 185
Matched (Unweighted) 429. 185
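Full matching usually achieves much better balance than 1:1 nearest-neighbour matching. The price here is a control-group effective sample size (ESS) of about 51 out of 429 controls, which reflects highly unequal control weights.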
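The ESS row in the table can be understood through a toy example. The sketch below builds ATT-style subclass weights by hand (treated units get weight 1, controls get the ratio of treated to control units in their subclass) and applies the usual formula \((\sum w)^2 / \sum w^2\). The data frame `toy` and the helper `ess` are hypothetical, and MatchIt additionally rescales the control weights, which leaves the ESS unchanged.

```r
# Toy subclasses: 1 treated + 3 controls, 2 treated + 1 control, 1 + 1
toy <- data.frame(
  subclass = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  treat    = c(1, 0, 0, 0, 1, 1, 0, 1, 0)
)

# Number of treated and control units in each unit's subclass
n_treated <- ave(toy$treat, toy$subclass, FUN = sum)
n_control <- ave(1 - toy$treat, toy$subclass, FUN = sum)

# ATT-style weights: treated = 1, control = treated/control ratio in subclass
toy$w <- ifelse(toy$treat == 1, 1, n_treated / n_control)

# Hypothetical helper: effective sample size of a set of weights
ess <- function(w) sum(w)^2 / sum(w^2)

ess_control <- ess(toy$w[toy$treat == 0])
ess_control
```

The five controls here behave like roughly three equally weighted observations, because the single control in the second subclass carries twice the weight of a 1:1 match while the three controls in the first subclass share one treated unit.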
7.4.3 Mahalanobis matching
For low-dimensional matching problems, Mahalanobis distance on the raw covariates is a good alternative:
m.maha <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
data = lalonde,
distance = "mahalanobis")
bal.tab(m.maha, thresholds = c(m = 0.1))
Balance Measures
Type Diff.Adj M.Threshold
age Contin. 0.1269 Not Balanced, >0.1
educ Contin. -0.0430 Balanced, <0.1
race_black Binary 0.3784 Not Balanced, >0.1
race_hispan Binary 0.0000 Balanced, <0.1
race_white Binary -0.3784 Not Balanced, >0.1
married Binary -0.0595 Balanced, <0.1
nodegree Binary 0.0486 Balanced, <0.1
re74 Contin. -0.2476 Not Balanced, >0.1
re75 Contin. -0.1322 Not Balanced, >0.1
Balance tally for mean differences
count
Balanced, <0.1 4
Not Balanced, >0.1 5
Variable with the greatest mean difference
Variable Diff.Adj M.Threshold
race_black 0.3784 Not Balanced, >0.1
Sample sizes
Control Treated
All 429 185
Matched 185 185
Unmatched 244 0
7.4.4 Coarsened Exact Matching (CEM)
CEM (Iacus, King, Porro 2012) coarsens each covariate into bins, then matches exactly on the coarsened values. The bin widths are typically chosen automatically. CEM has the advantage that the resulting matched sample has guaranteed covariate balance up to the coarsening level.
m.cem <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
data = lalonde,
method = "cem")
bal.tab(m.cem, thresholds = c(m = 0.1))
Balance Measures
Type Diff.Adj M.Threshold
age Contin. 0.0493 Balanced, <0.1
educ Contin. 0.0446 Balanced, <0.1
race_black Binary 0.0000 Balanced, <0.1
race_hispan Binary 0.0000 Balanced, <0.1
race_white Binary 0.0000 Balanced, <0.1
married Binary 0.0000 Balanced, <0.1
nodegree Binary 0.0000 Balanced, <0.1
re74 Contin. -0.0427 Balanced, <0.1
re75 Contin. -0.0492 Balanced, <0.1
Balance tally for mean differences
count
Balanced, <0.1 9
Not Balanced, >0.1 0
Variable with the greatest mean difference
Variable Diff.Adj M.Threshold
age 0.0493 Balanced, <0.1
Sample sizes
Control Treated
All 429. 185
Matched (ESS) 41.29 65
Matched (Unweighted) 75. 65
Unmatched 354. 120
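CEM often drops treated units that have no exact match in the coarsened strata (here 120 of 185). That is a feature, not a bug: it is transparent about lack of overlap. The cost is that the estimate then applies only to the matched treated units rather than to the full treated group.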
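The coarsen-then-match-exactly idea can be sketched in base R. This is a simplified illustration on simulated data with hand-picked age bins (both are assumptions), not the automatic binning that `matchit(method = "cem")` performs.

```r
set.seed(7)
toy <- data.frame(
  treat   = rbinom(200, 1, 0.3),
  age     = round(runif(200, 18, 55)),
  married = rbinom(200, 1, 0.4)
)

# Step 1: coarsen the continuous covariate into bins (cut points are arbitrary)
toy$age_bin <- cut(toy$age, breaks = c(17, 25, 35, 45, 55))

# Step 2: a stratum is one combination of coarsened covariate values
toy$stratum <- interaction(toy$age_bin, toy$married, drop = TRUE)

# Step 3: keep only strata that contain both treated and control units
counts <- table(toy$stratum, toy$treat)
keep_strata <- rownames(counts)[counts[, "0"] > 0 & counts[, "1"] > 0]
matched <- toy[toy$stratum %in% keep_strata, ]

c(original = nrow(toy), matched = nrow(matched))
```

Within each retained stratum the coarsened covariates are identical across groups by construction, which is why the binary covariates in the CEM balance table show differences of exactly zero.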
7.5 Estimation after matching
Once a matched dataset is in hand, the treatment effect is estimated as a weighted regression on the matched data, using the weights from MatchIt. Standard errors should be clustered by the subclass:
m.data <- match_data(m.full)
fit <- lm(re78 ~ treat * (age + educ + race + married + nodegree + re74 + re75),
data = m.data,
weights = weights)
# Use marginaleffects for the average treatment effect with cluster-by-subclass SEs
avg_comparisons(fit,
variables = "treat",
vcov = ~subclass,
newdata = subset(m.data, treat == 1)) # ATT
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
1977 704 2.81 0.00501 7.6 596 3357
Term: treat
Type: response
Comparison: 1 - 0
The full-matching estimate of about $1,977 is much closer to the experimental benchmark of about $1,800 (Dehejia & Wahba 1999) than a naive comparison of means.
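For reference, the naive comparison mentioned here is just the raw difference in mean 1978 earnings between the two groups, with no covariate adjustment. A minimal sketch, assuming the lalonde data from MatchIt:

```r
library(MatchIt)  # used here only for the lalonde data
data("lalonde", package = "MatchIt")

# Unadjusted difference in mean 1978 earnings, treated minus control
mean_treated <- mean(lalonde$re78[lalonde$treat == 1])
mean_control <- mean(lalonde$re78[lalonde$treat == 0])
naive_diff <- mean_treated - mean_control
naive_diff
```

In this dataset the raw difference is negative, so the unadjusted comparison gets even the sign of the experimental benchmark wrong.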
7.6 Balance plots
cobalt::love.plot produces the standard “love plot” diagnostic showing absolute standardised mean differences before and after matching:
love.plot(m.full,
stats = "mean.diffs",
thresholds = c(m = 0.1),
abs = TRUE,
var.order = "unadjusted",
title = "Balance plot — Full matching")
A successful matching produces balance below 0.1 (or, more conservatively, 0.05) for every covariate. The love plot makes this immediate.
7.7 Matching with replacement vs without
By default, matchit(..., replace = FALSE) uses each control unit at most once. When the treated group is much larger or the propensity distributions diverge sharply, matching with replacement (replace = TRUE) can improve balance at the cost of effective sample size:
m.repl <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
data = lalonde,
method = "nearest",
distance = "glm",
replace = TRUE)
bal.tab(m.repl, thresholds = c(m = 0.1))
Balance Measures
Type Diff.Adj M.Threshold
distance Distance 0.0044 Balanced, <0.1
age Contin. 0.2395 Not Balanced, >0.1
educ Contin. -0.0161 Balanced, <0.1
race_black Binary 0.0054 Balanced, <0.1
race_hispan Binary -0.0054 Balanced, <0.1
race_white Binary 0.0000 Balanced, <0.1
married Binary 0.0595 Balanced, <0.1
nodegree Binary 0.0054 Balanced, <0.1
re74 Contin. -0.0493 Balanced, <0.1
re75 Contin. 0.0087 Balanced, <0.1
Balance tally for mean differences
count
Balanced, <0.1 9
Not Balanced, >0.1 1
Variable with the greatest mean difference
Variable Diff.Adj M.Threshold
age 0.2395 Not Balanced, >0.1
Sample sizes
Control Treated
All 429. 185
Matched (ESS) 46.31 185
Matched (Unweighted) 82. 185
Unmatched 347. 0
Matching with replacement is recommended when overlap is poor and balance is otherwise unachievable.
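The loss of effective sample size comes from reusing the same controls. One way to inspect this, sketched below under the assumption that the `match.matrix` component of the matchit object lists the matched control for each treated unit, is to tabulate how often each control appears.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

m.repl <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                  data = lalonde,
                  method = "nearest",
                  distance = "glm",
                  replace = TRUE)

# Each row of match.matrix is a treated unit; entries are matched control names
reuse <- table(m.repl$match.matrix)

# Distribution of reuse counts: how many controls are used once, twice, ...
table(reuse)

# The most heavily reused controls
head(sort(reuse, decreasing = TRUE))
```

If a handful of controls account for a large share of the matches, the estimate depends heavily on those few observations, which is what the reduced ESS in the balance table is signalling.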
7.8 ATE, ATT, ATC — choose carefully
The default MatchIt estimand is ATT (effect on the treated). To target the ATE, use estimand = "ATE" and a matching method that doesn’t drop units (full matching or weighting). To target ATC (effect on the controls), use estimand = "ATC". The three estimands answer different policy questions — see the Five Estimands chapter for the framing.
7.9 When matching fails
Matching has known limitations:
- Curse of dimensionality: high-dimensional matching is hard. Use propensity scores or coarsened exact matching to reduce dimensions.
- No overlap: if certain regions of \(X\) have no treated (or no control) units, no matching method can rescue them. CEM exposes this honestly by dropping such units; nearest-neighbour matching hides the problem in imbalanced pairs.
- Sensitivity to specification: the propensity score model is itself a regression that can be misspecified. The Sensitivity Analysis chapter covers Rosenbaum bounds, which specifically address matched-sample hidden bias.
7.10 Matching vs weighting
Inverse propensity weighting (IPW) and matching are close cousins. Matching subsets to comparable units; weighting reweights to make the treated and control distributions match. Modern theory increasingly favours weighting for two reasons:
- No information loss: IPW uses every observation; matching may drop unmatched units.
- Smoothness: IPW weights vary smoothly with the estimated propensity score, while matching makes discrete in-or-out choices that add noise from the matching algorithm.
In practice, the difference is small when overlap is good. When overlap is poor, both methods break down, and the right response is either to restrict the population (CEM-style) or to use a doubly-robust estimator (AIPW, TMLE) that combines a weighting model with an outcome model and is robust to misspecification of either.
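The doubly-robust idea can be illustrated with a hand-written AIPW estimator of the ATE. This is a minimal sketch on simulated data where the true effect is 2 by construction. The data-generating process, the variable names, and the simple linear and logistic working models are all assumptions made for illustration.

```r
set.seed(2024)
n <- 2000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(0.5 * x))
y <- 1 + 2 * d + x + rnorm(n)          # true ATE = 2 by construction
dat <- data.frame(y = y, d = d, x = x)

# Weighting model: estimated propensity score
e_hat <- fitted(glm(d ~ x, family = binomial(), data = dat))

# Outcome models fitted separately in each arm, predicted for everyone
fit1 <- lm(y ~ x, data = dat, subset = d == 1)
fit0 <- lm(y ~ x, data = dat, subset = d == 0)
m1 <- predict(fit1, newdata = dat)
m0 <- predict(fit0, newdata = dat)

# AIPW: outcome-model contrast plus inverse-probability-weighted residuals
aipw_terms <- m1 - m0 +
  dat$d * (dat$y - m1) / e_hat -
  (1 - dat$d) * (dat$y - m0) / (1 - e_hat)

aipw_ate <- mean(aipw_terms)
aipw_se  <- sd(aipw_terms) / sqrt(n)   # simple plug-in standard error
c(estimate = aipw_ate, se = aipw_se)
```

The residual correction terms are what give the estimator its robustness: if the outcome models are right the corrections average to zero, and if only the propensity model is right they remove the bias of the outcome models.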
7.11 Modern weighting alternatives
The WeightIt package implements modern weighting methods that often outperform plain IPW, particularly when overlap is marginal. The companion blog chapter on weighting covers these in more detail; here we summarise the four most useful variants and apply them to the Lalonde data already in this chapter.
7.11.1 Inverse probability of treatment weighting (IPW)
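The baseline: estimate the propensity score by logistic regression. For the ATE, treated units are weighted by \(1/\hat e(X)\) and controls by \(1/(1 - \hat e(X))\); for the ATT targeted in the code below, treated units keep weight 1 and controls are weighted by \(\hat e(X)/(1 - \hat e(X))\):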
w_ipw <- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
data = lalonde, estimand = "ATT", method = "glm")
bal.tab(w_ipw, stats = c("m"), thresholds = c(m = 0.05))
Balance Measures
Type Diff.Adj M.Threshold
prop.score Distance -0.0205 Balanced, <0.05
age Contin. 0.1188 Not Balanced, >0.05
educ Contin. -0.0284 Balanced, <0.05
race_black Binary -0.0022 Balanced, <0.05
race_hispan Binary 0.0002 Balanced, <0.05
race_white Binary 0.0021 Balanced, <0.05
married Binary 0.0186 Balanced, <0.05
nodegree Binary 0.0184 Balanced, <0.05
re74 Contin. -0.0021 Balanced, <0.05
re75 Contin. 0.0110 Balanced, <0.05
Balance tally for mean differences
count
Balanced, <0.05 9
Not Balanced, >0.05 1
Variable with the greatest mean difference
Variable Diff.Adj M.Threshold
age 0.1188 Not Balanced, >0.05
Effective sample sizes
Control Treated
Unadjusted 429. 185
Adjusted 99.82 185
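Which weighting formula applies depends on the target estimand. The sketch below writes out the standard propensity score weights for the ATE, ATT and ATC on a tiny hand-made example. The function `ipw_weights` is a hypothetical helper for illustration and is not part of WeightIt.

```r
# Hypothetical helper: inverse probability weights for a given estimand
# d = treatment indicator (0/1), e = propensity score P(D = 1 | X)
ipw_weights <- function(d, e, estimand = c("ATE", "ATT", "ATC")) {
  estimand <- match.arg(estimand)
  switch(estimand,
    ATE = d / e + (1 - d) / (1 - e),        # both groups reweighted to everyone
    ATT = d + (1 - d) * e / (1 - e),        # controls reweighted to the treated
    ATC = d * (1 - e) / e + (1 - d)         # treated reweighted to the controls
  )
}

d <- c(1, 1, 0, 0)
e <- c(0.8, 0.5, 0.5, 0.2)

w_ate <- ipw_weights(d, e, "ATE")
w_att <- ipw_weights(d, e, "ATT")
w_atc <- ipw_weights(d, e, "ATC")
rbind(ATE = w_ate, ATT = w_att, ATC = w_atc)
```

Under the ATT weights, the control with a propensity score of 0.2 is down-weighted to 0.25 because it looks unlike the treated group, which is also why the control ESS in the table above is far below 429.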
7.11.2 Covariate balancing propensity score (CBPS)
CBPS (Imai-Ratkovic 2014) estimates the propensity score under balance constraints, ensuring that the weighted covariate means are balanced even when the logistic model is misspecified:
w_cbps <- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
data = lalonde, estimand = "ATT", method = "cbps")
bal.tab(w_cbps, stats = c("m"), thresholds = c(m = 0.05))
Balance Measures
Type Diff.Adj M.Threshold
prop.score Distance -0.0181 Balanced, <0.05
age Contin. -0.0000 Balanced, <0.05
educ Contin. -0.0001 Balanced, <0.05
race_black Binary -0.0000 Balanced, <0.05
race_hispan Binary -0.0000 Balanced, <0.05
race_white Binary 0.0000 Balanced, <0.05
married Binary -0.0000 Balanced, <0.05
nodegree Binary -0.0000 Balanced, <0.05
re74 Contin. -0.0000 Balanced, <0.05
re75 Contin. -0.0000 Balanced, <0.05
Balance tally for mean differences
count
Balanced, <0.05 10
Not Balanced, >0.05 0
Variable with the greatest mean difference
Variable Diff.Adj M.Threshold
educ -0.0001 Balanced, <0.05
Effective sample sizes
Control Treated
Unadjusted 429. 185
Adjusted 98.45 185
7.11.3 Entropy balancing
Entropy balancing (Hainmueller 2012) solves a convex optimisation problem that yields weights with exact mean balance on the included covariates. It is doubly robust for the ATT and attains the semiparametric efficiency bound when both nuisance models are correct.
w_ebal <- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
data = lalonde, estimand = "ATT", method = "ebal")
bal.tab(w_ebal, stats = c("m"), thresholds = c(m = 0.05))
Balance Measures
Type Diff.Adj M.Threshold
age Contin. -0 Balanced, <0.05
educ Contin. -0 Balanced, <0.05
race_black Binary 0 Balanced, <0.05
race_hispan Binary -0 Balanced, <0.05
race_white Binary -0 Balanced, <0.05
married Binary 0 Balanced, <0.05
nodegree Binary 0 Balanced, <0.05
re74 Contin. -0 Balanced, <0.05
re75 Contin. -0 Balanced, <0.05
Balance tally for mean differences
count
Balanced, <0.05 9
Not Balanced, >0.05 0
Variable with the greatest mean difference
Variable Diff.Adj M.Threshold
race_black 0 Balanced, <0.05
Effective sample sizes
Control Treated
Unadjusted 429. 185
Adjusted 98.46 185
Notice that every standardised mean difference is now essentially zero — entropy balancing guarantees this by construction.
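The mechanism behind that guarantee can be sketched directly. For the ATT, entropy balancing weights on the controls take the form \(w_i \propto \exp(x_i'\lambda)\), and \(\lambda\) can be found by minimising a convex log-sum-exp objective whose gradient is exactly the weighted imbalance. The code below is a bare-bones illustration on simulated data using `optim()`. It is not WeightIt's implementation, and the names `X_c`, `X_t`, `dual` and `dual_grad` are hypothetical.

```r
set.seed(99)
X_c <- matrix(rnorm(300 * 2), ncol = 2)              # 300 simulated controls
X_t <- matrix(rnorm(100 * 2, mean = 0.5), ncol = 2)  # 100 simulated treated
target <- colMeans(X_t)                              # means to be reproduced

# Centre the control covariates at the treated means
Z <- sweep(X_c, 2, target)

# Dual objective: log-sum-exp of Z %*% lambda (computed stably)
dual <- function(lambda) {
  s <- drop(Z %*% lambda)
  max(s) + log(sum(exp(s - max(s))))
}

# Its gradient is the weighted mean of Z, i.e. the remaining imbalance
dual_grad <- function(lambda) {
  s <- drop(Z %*% lambda)
  w <- exp(s - max(s))
  w <- w / sum(w)
  drop(crossprod(Z, w))
}

opt <- optim(c(0, 0), dual, dual_grad, method = "BFGS",
             control = list(reltol = 1e-12))

# Recover normalised weights and compare weighted control means to the target
s <- drop(Z %*% opt$par)
w <- exp(s - max(s))
w <- w / sum(w)
weighted_means <- drop(crossprod(X_c, w))
rbind(target = target, weighted_controls = weighted_means)
```

At the optimum the gradient is zero, so the weighted control means coincide with the treated means up to solver tolerance. The approach only works when the treated means lie inside the range spanned by the controls, which is another face of the overlap assumption.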
7.11.4 Energy balancing
Energy balancing (Huling & Mak 2024) optimises a weighted energy distance between treated and control covariate distributions. Unlike entropy balancing — which balances means — energy balancing balances the entire distribution of every covariate.
w_energy <- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
data = lalonde, estimand = "ATT", method = "energy")
bal.tab(w_energy, stats = c("m"), thresholds = c(m = 0.05))
Balance Measures
Type Diff.Adj M.Threshold
age Contin. -0.0016 Balanced, <0.05
educ Contin. 0.0106 Balanced, <0.05
race_black Binary 0.0060 Balanced, <0.05
race_hispan Binary -0.0008 Balanced, <0.05
race_white Binary -0.0053 Balanced, <0.05
married Binary -0.0011 Balanced, <0.05
nodegree Binary 0.0050 Balanced, <0.05
re74 Contin. -0.0021 Balanced, <0.05
re75 Contin. 0.0226 Balanced, <0.05
Balance tally for mean differences
count
Balanced, <0.05 9
Not Balanced, >0.05 0
Variable with the greatest mean difference
Variable Diff.Adj M.Threshold
re75 0.0226 Balanced, <0.05
Effective sample sizes
Control Treated
Unadjusted 429. 185
Adjusted 41.82 185
7.11.5 Estimation with weighting-aware regression
Once weights are computed, use lm_weightit() (rather than plain lm with a weights argument) so that the weighting uncertainty is propagated into the treatment-effect standard error:
fit_ebal <- lm_weightit(
re78 ~ treat * (age + educ + race + married + nodegree + re74 + re75),
data = lalonde,
weightit = w_ebal
)
avg_comparisons(fit_ebal, variables = "treat",
newdata = subset(lalonde, treat == 1)) # ATT
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
1273 770 1.65 0.0983 3.3 -236 2783
Term: treat
Type: probs
Comparison: 1 - 0
The result is the entropy-balanced ATT with valid sandwich-style standard errors that account for the estimated weights.
7.11.6 When to use each weighting method
| Method | Strengths | Weaknesses |
|---|---|---|
| IPW (glm) | Familiar; well-studied | Sensitive to PS misspecification; extreme weights |
| CBPS | Balance-constrained PS; robust to model misspecification | Slower; can fail with many covariates |
| Entropy balancing | Exact mean balance; doubly robust for ATT | Balances means only, not distributions |
| Energy balancing | Balances entire distribution | Computationally heavier; less mature theory |
For applied work, entropy balancing is the practical default — it typically achieves better balance than IPW with fewer extreme weights and faster runtime than CBPS. Energy balancing is the modern state-of-the-art when full distributional balance matters.
7.11.7 Comparing matching to weighting
The matching examples earlier in this chapter and the weighting examples here both target the ATT on the Lalonde data. In practice, an applied researcher should try several methods, report all balance diagnostics and ATT estimates, and choose based on which method achieves the best covariate balance. When matching, plain IPW, and the modern weighting methods all converge to similar estimates, the identification strategy is on firm ground. When they diverge sharply, the overlap is poor and the right response is to restrict the population or use a doubly-robust estimator.
For an extended treatment of matching with MatchIt (including a comparison with Stata’s teffects commands), see the companion blog chapters on matching and treatment-effects in Stata.
7.12 Summary
- Matching estimates causal effects by pairing treated units to similar controls, restricting the comparison to the region of overlap in covariates.
- Choose a distance: propensity score for high-dimensional problems; Mahalanobis for low-dimensional; CEM when exact coarsened balance is desired.
- Choose a method: nearest-neighbour is simple but often imbalanced; full matching uses all observations and usually balances best.
- Estimate with weighted regression on the matched dataset, clustering SEs by subclass.
- Diagnose with balance plots (love.plot) before reporting the effect.
- For more advanced treatment of MatchIt and WeightIt, see the companion blog chapter.