7  Matching Estimators

library(tidyverse)
library(MatchIt)
library(cobalt)
library(WeightIt)
library(marginaleffects)

Matching is one of the oldest and most widely taught approaches to observational causal inference. The idea is simple: for each treated unit, find a “twin” control unit with similar covariates, then compare outcomes within matched pairs. With enough good matches, and provided the matched covariates capture all confounding, the comparison approximates what we would have learned from a randomised experiment.

The implementation has many variants — nearest-neighbour matching, optimal matching, full matching, coarsened exact matching (CEM) — and the choice has real consequences. This chapter walks through the modern matching workflow using R’s MatchIt and companion cobalt packages, which together represent the de facto standard in applied econometrics and epidemiology.

Related reading: This chapter is based on the longer treatment in Matching and Weighting Part 1 of Topics on Econometrics and Causal Inference, which credits Noah Greifer and coauthors (the MatchIt/WeightIt package authors).

7.1 Assumptions

Matching identifies a causal effect under three assumptions:

  1. SUTVA — no interference, no hidden treatment versions.
  2. Ignorability (unconfoundedness) — conditional on \(X\), the treatment \(D\) is independent of the potential outcomes \((Y(0), Y(1))\).
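  3. Overlap (positivity): for every value \(x\) of \(X\), the probability of treatment is strictly between 0 and 1, \(0 < P(D = 1 \mid X = x) < 1\), so both treated and untreated units can be observed at every covariate value.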
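
Under these three assumptions the ATT can be written in terms of observed quantities only. A short derivation, writing \(Y\) for the observed outcome (a symbol introduced here for illustration):

```latex
\begin{aligned}
\text{ATT} &= E[\,Y(1) - Y(0) \mid D = 1\,] \\
  &= E[\,Y \mid D = 1\,] - E\big[\, E[\,Y(0) \mid X, D = 1\,] \;\big|\; D = 1 \,\big] \\
  &= E[\,Y \mid D = 1\,] - E\big[\, E[\,Y \mid X, D = 0\,] \;\big|\; D = 1 \,\big]
\end{aligned}
```

The second line uses SUTVA (so that \(Y = Y(1)\) for treated units) and the law of iterated expectations. The third uses ignorability to swap \(D = 1\) for \(D = 0\) in the inner expectation, and overlap to guarantee that controls exist at every covariate value found among the treated. Matching estimates that inner expectation with the outcomes of matched controls.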

The motivating problem: even under ignorability, the observed distribution of \(X\) usually differs between treated and control. Regression adjustment implicitly extrapolates a parametric model across that imbalance; matching restricts the comparison to the region of overlap, which is more robust to model misspecification.

7.2 The Lalonde example

The canonical test case is the Lalonde (1986) job-training experiment. We use the observational version included in MatchIt, where the goal is to recover the experimental estimate of the effect of a training programme on 1978 earnings:

data("lalonde", package = "MatchIt")
glimpse(lalonde)
Rows: 614
Columns: 9
$ treat    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ age      <int> 37, 22, 30, 27, 33, 22, 23, 32, 22, 33, 19, 21, 18, 27, 17, 1…
$ educ     <int> 11, 9, 12, 11, 8, 9, 12, 11, 16, 12, 9, 13, 8, 10, 7, 10, 13,…
$ race     <fct> black, hispan, black, black, black, black, black, black, blac…
$ married  <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
$ nodegree <int> 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1…
$ re74     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ re75     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ re78     <dbl> 9930.0460, 3595.8940, 24909.4500, 7506.1460, 289.7899, 4056.4…

The treatment is treat; the outcome is re78 (1978 earnings); the covariates are age, educ, race, married, nodegree, re74, re75 (pre-treatment earnings).

Before matching, treated and control groups are badly imbalanced:

# Construct a pre-match MatchIt object for balance reporting
m.pre <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                 data = lalonde, method = NULL,
                 distance = "glm")
bal.tab(m.pre, thresholds = c(m = 0.1))
Balance Measures
                Type Diff.Un     M.Threshold.Un
distance    Distance  1.7941                   
age          Contin. -0.3094 Not Balanced, >0.1
educ         Contin.  0.0550     Balanced, <0.1
race_black    Binary  0.6404 Not Balanced, >0.1
race_hispan   Binary -0.0827     Balanced, <0.1
race_white    Binary -0.5577 Not Balanced, >0.1
married       Binary -0.3236 Not Balanced, >0.1
nodegree      Binary  0.1114 Not Balanced, >0.1
re74         Contin. -0.7211 Not Balanced, >0.1
re75         Contin. -0.2903 Not Balanced, >0.1

Balance tally for mean differences
                   count
Balanced, <0.1         2
Not Balanced, >0.1     7

Variable with the greatest mean difference
 Variable Diff.Un     M.Threshold.Un
     re74 -0.7211 Not Balanced, >0.1

Sample sizes
    Control Treated
All     429     185

The standardised mean differences before matching (the Diff.Un column) exceed 0.1, the conventional threshold for adequate balance, on seven of the nine covariate rows; only educ and race_hispan pass.
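
To make the diagnostic concrete, here is a minimal sketch of a standardised mean difference for a single continuous covariate. It assumes the difference is scaled by the treated-group standard deviation, a common convention when the target is the ATT; cobalt's own rules (including how it reports binary covariates) are described in its documentation and may differ in detail. The function name smd_att and the toy data are hypothetical.

```r
# Standardised mean difference for one covariate.
# Assumption: scale by the treated-group standard deviation (ATT convention).
smd_att <- function(x, treat) {
  x_t <- x[treat == 1]
  x_c <- x[treat == 0]
  (mean(x_t) - mean(x_c)) / sd(x_t)
}

# Toy data (hypothetical): three treated units followed by four controls
x     <- c(1, 2, 3, 0, 1, 2, 1)
treat <- c(1, 1, 1, 0, 0, 0, 0)

# Treated mean 2, control mean 1, treated SD 1
smd_toy <- smd_att(x, treat)
smd_toy
```

A value of 1 means the treated mean sits one treated-group standard deviation above the control mean, ten times the 0.1 threshold.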

7.3 Distance measures

Matching needs a way to define “close.” Three common choices:

  1. Propensity score: estimate \(\hat e(x) = P(D = 1 \mid X = x)\) and use \(|\hat e(x_i) - \hat e(x_j)|\) as the distance. Reduces a multidimensional matching problem to a one-dimensional one.
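  2. Mahalanobis distance: \(\sqrt{(x_i - x_j)' \Sigma^{-1} (x_i - x_j)}\), computed on the raw covariates, where \(\Sigma\) is the covariate covariance matrix. Invariant to the scale of the covariates, because \(\Sigma^{-1}\) standardises them; works well in low dimensions but degrades as the number of covariates grows.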
  3. Hybrid: use Mahalanobis distance within propensity score calipers — the Rubin-Stuart “best of both” approach.
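
A small base-R sketch of the Mahalanobis distance between two units, on simulated covariates (the variable names and values are hypothetical). It also checks one property worth knowing: because the covariance matrix enters through its inverse, changing the units of a covariate leaves the distance unchanged. Note that stats::mahalanobis() returns the squared distance.

```r
set.seed(42)
# Hypothetical covariates on very different scales
X <- cbind(age    = rnorm(50, mean = 30, sd = 8),
           income = rnorm(50, mean = 20000, sd = 6000))

# Distance between units i and j; stats::mahalanobis() returns the squared
# distance, so take the square root
maha <- function(X, i, j) {
  sqrt(mahalanobis(X[i, ], center = X[j, ], cov = cov(X)))
}
d_original <- maha(X, 1, 2)

# Re-express income in thousands. A Euclidean distance would change a lot;
# the Mahalanobis distance should not.
X_rescaled <- X
X_rescaled[, "income"] <- X_rescaled[, "income"] / 1000
d_rescaled <- maha(X_rescaled, 1, 2)

c(original = d_original, rescaled = d_rescaled)
```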

7.4 Matching methods

7.4.1 Nearest-neighbour matching on a propensity score

m.nn <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                data = lalonde,
                method = "nearest",
                distance = "glm",
                link = "linear.logit",
                ratio = 1)

m.nn
A `matchit` object
 - method: 1:1 nearest neighbor matching without replacement
 - distance: Propensity score
             - estimated with logistic regression and linearized
 - number of obs.: 614 (original), 370 (matched)
 - target estimand: ATT
 - covariates: age, educ, race, married, nodegree, re74, re75
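
To show what 1:1 nearest-neighbour matching without replacement does mechanically, here is a base-R sketch of a greedy matcher on simulated data. It is an illustration under stated assumptions, not MatchIt's implementation: the data are hypothetical, and treated units are processed from the highest to the lowest linearised propensity score, which is one possible ordering.

```r
set.seed(7)
n   <- 300
x   <- rnorm(n)
d   <- rbinom(n, 1, plogis(-1 + x))             # hypothetical treatment assignment
ps  <- fitted(glm(d ~ x, family = binomial))    # estimated propensity score
lps <- qlogis(ps)                               # linearised (logit) score

treated  <- which(d == 1)
controls <- which(d == 0)
# Assumption: match the hardest cases (highest scores) first
treated  <- treated[order(lps[treated], decreasing = TRUE)]

match_of <- integer(0)
for (i in treated) {
  if (length(controls) == 0) break
  # Closest remaining control on the linearised score
  j <- controls[which.min(abs(lps[controls] - lps[i]))]
  match_of[as.character(i)] <- j
  controls <- setdiff(controls, j)              # each control is used at most once
}

pairs <- data.frame(treated = as.integer(names(match_of)),
                    control = unname(match_of))
head(pairs)
```

Because the algorithm is greedy, an early match can take a control that would have suited a later treated unit better; optimal and full matching avoid this by minimising a global criterion.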

After matching, the treated and control groups should be balanced:

bal.tab(m.nn, thresholds = c(m = 0.1))
Balance Measures
                Type Diff.Adj        M.Threshold
distance    Distance   0.9192                   
age          Contin.   0.0718     Balanced, <0.1
educ         Contin.  -0.1290 Not Balanced, >0.1
race_black    Binary   0.3730 Not Balanced, >0.1
race_hispan   Binary  -0.1568 Not Balanced, >0.1
race_white    Binary  -0.2162 Not Balanced, >0.1
married       Binary  -0.0216     Balanced, <0.1
nodegree      Binary   0.0703     Balanced, <0.1
re74         Contin.  -0.0505     Balanced, <0.1
re75         Contin.  -0.0257     Balanced, <0.1

Balance tally for mean differences
                   count
Balanced, <0.1         5
Not Balanced, >0.1     4

Variable with the greatest mean difference
   Variable Diff.Adj        M.Threshold
 race_black    0.373 Not Balanced, >0.1

Sample sizes
          Control Treated
All           429     185
Matched       185     185
Unmatched     244       0

Four of the nine covariate rows still exceed the 0.1 threshold: 1:1 nearest-neighbour matching on a logistic propensity score isn’t always enough. A different matching method can do better.

7.4.2 Full matching

Full matching partitions the sample into subclasses, each containing either one treated unit with one or more controls or one control with one or more treated units, chosen to minimise the total within-subclass distance. It uses all observations and typically achieves better balance:

m.full <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                  data = lalonde,
                  method = "full",
                  distance = "glm",
                  link = "probit")

bal.tab(m.full, thresholds = c(m = 0.1))
Balance Measures
                Type Diff.Adj    M.Threshold
distance    Distance   0.0045 Balanced, <0.1
age          Contin.   0.0393 Balanced, <0.1
educ         Contin.  -0.0956 Balanced, <0.1
race_black    Binary   0.0043 Balanced, <0.1
race_hispan   Binary   0.0103 Balanced, <0.1
race_white    Binary  -0.0146 Balanced, <0.1
married       Binary   0.0259 Balanced, <0.1
nodegree      Binary   0.0504 Balanced, <0.1
re74         Contin.  -0.0009 Balanced, <0.1
re75         Contin.  -0.0091 Balanced, <0.1

Balance tally for mean differences
                   count
Balanced, <0.1        10
Not Balanced, >0.1     0

Variable with the greatest mean difference
 Variable Diff.Adj    M.Threshold
     educ  -0.0956 Balanced, <0.1

Sample sizes
                     Control Treated
All                   429.       185
Matched (ESS)          50.76     185
Matched (Unweighted)  429.       185

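Full matching usually achieves much better balance than 1:1 nearest-neighbour matching. Here every row is below 0.1, but the matching weights reduce the effective sample size of the 429 controls to about 51.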

7.4.3 Mahalanobis matching

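For low-dimensional matching problems, Mahalanobis distance on the raw covariates is an alternative to the propensity score. With the seven covariates used here it balances noticeably worse than full matching: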

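# method = "nearest" is the matchit() default; it is spelled out here for clarity
m.maha <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                  data = lalonde,
                  method = "nearest",
                  distance = "mahalanobis")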
bal.tab(m.maha, thresholds = c(m = 0.1))
Balance Measures
               Type Diff.Adj        M.Threshold
age         Contin.   0.1269 Not Balanced, >0.1
educ        Contin.  -0.0430     Balanced, <0.1
race_black   Binary   0.3784 Not Balanced, >0.1
race_hispan  Binary   0.0000     Balanced, <0.1
race_white   Binary  -0.3784 Not Balanced, >0.1
married      Binary  -0.0595     Balanced, <0.1
nodegree     Binary   0.0486     Balanced, <0.1
re74        Contin.  -0.2476 Not Balanced, >0.1
re75        Contin.  -0.1322 Not Balanced, >0.1

Balance tally for mean differences
                   count
Balanced, <0.1         4
Not Balanced, >0.1     5

Variable with the greatest mean difference
   Variable Diff.Adj        M.Threshold
 race_black   0.3784 Not Balanced, >0.1

Sample sizes
          Control Treated
All           429     185
Matched       185     185
Unmatched     244       0

7.4.4 Coarsened Exact Matching (CEM)

CEM (Iacus, King, Porro 2012) coarsens each covariate into bins, then matches exactly on the coarsened values. The bin widths are typically chosen automatically. CEM has the advantage that the resulting matched sample has guaranteed covariate balance up to the coarsening level.

m.cem <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                 data = lalonde,
                 method = "cem")
bal.tab(m.cem, thresholds = c(m = 0.1))
Balance Measures
               Type Diff.Adj    M.Threshold
age         Contin.   0.0493 Balanced, <0.1
educ        Contin.   0.0446 Balanced, <0.1
race_black   Binary   0.0000 Balanced, <0.1
race_hispan  Binary   0.0000 Balanced, <0.1
race_white   Binary   0.0000 Balanced, <0.1
married      Binary   0.0000 Balanced, <0.1
nodegree     Binary   0.0000 Balanced, <0.1
re74        Contin.  -0.0427 Balanced, <0.1
re75        Contin.  -0.0492 Balanced, <0.1

Balance tally for mean differences
                   count
Balanced, <0.1         9
Not Balanced, >0.1     0

Variable with the greatest mean difference
 Variable Diff.Adj    M.Threshold
      age   0.0493 Balanced, <0.1

Sample sizes
                     Control Treated
All                   429.       185
Matched (ESS)          41.29      65
Matched (Unweighted)   75.        65
Unmatched             354.       120

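CEM often drops treated units that have no exact match in the coarsened strata, which is a feature, not a bug: it is transparent about lack of overlap. The price is a change of estimand. Here 120 of the 185 treated units are discarded, so the result is an effect for the 65 matched treated units rather than the ATT for the full treated group.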
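
The coarsen-then-match-exactly logic can be sketched in a few lines of base R. The data, the 5-year age bins, and the variable names below are hypothetical, and the sketch stops at identifying the matched sample; it does not reproduce MatchIt's default binning or its stratum weights.

```r
set.seed(3)
n   <- 200
dat <- data.frame(
  age     = round(runif(n, 18, 55)),
  married = rbinom(n, 1, 0.4)
)
dat$treat <- rbinom(n, 1, plogis(1.5 - 0.06 * dat$age))

# Coarsen age into hypothetical 5-year bins; keep the binary covariate as-is
dat$age_bin <- cut(dat$age, breaks = seq(15, 55, by = 5), include.lowest = TRUE)
dat$stratum <- interaction(dat$age_bin, dat$married, drop = TRUE)

# Count treated and control units per stratum
n_t <- tapply(dat$treat == 1, dat$stratum, sum)
n_c <- tapply(dat$treat == 0, dat$stratum, sum)

# Keep only strata that contain at least one treated and one control unit
keep    <- names(n_t)[n_t > 0 & n_c > 0]
matched <- dat[dat$stratum %in% keep, ]

table(treat = matched$treat)
```

Units in strata with only treated or only control members are discarded, which is how CEM surfaces a lack of overlap.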

7.5 Estimation after matching

Once a matched dataset is in hand, the treatment effect is estimated by fitting a weighted regression on the matched data, using the weights from MatchIt, and averaging the predicted treated-minus-control contrasts over the target population. Standard errors should be clustered by subclass:

m.data <- match_data(m.full)

fit <- lm(re78 ~ treat * (age + educ + race + married + nodegree + re74 + re75),
          data    = m.data,
          weights = weights)

# Use marginaleffects for the average treatment effect with cluster-by-subclass SEs
avg_comparisons(fit,
                variables = "treat",
                vcov = ~subclass,
                newdata = subset(m.data, treat == 1))   # ATT

 Estimate Std. Error    z Pr(>|z|)   S 2.5 % 97.5 %
     1977        704 2.81  0.00501 7.6   596   3357

Term: treat
Type: response
Comparison: 1 - 0

The full-matching estimate of about $1,977 is close to the experimental benchmark of about $1,800 (Dehejia & Wahba 1999), though the 95% interval from $596 to $3,357 is wide. A naive comparison of means on these data lands much further from the benchmark.
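
The point estimate reported by avg_comparisons() above is a g-computation: predict each treated unit's outcome under treatment and under control, then average the difference. A base-R sketch of that step on simulated, unweighted data follows (the data-generating process and names are hypothetical, and the cluster-robust standard error is not reproduced).

```r
set.seed(11)
n <- 2000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(0.5 * x))
y <- 1 + 2 * d + x + 0.5 * d * x + rnorm(n)     # hypothetical outcome model
dat <- data.frame(y, d, x)

# Outcome model with a treatment-by-covariate interaction
fit_toy <- lm(y ~ d * x, data = dat)

# Predict both potential outcomes for the treated units only, then average
treated <- subset(dat, d == 1)
y1_hat  <- predict(fit_toy, newdata = transform(treated, d = 1))
y0_hat  <- predict(fit_toy, newdata = transform(treated, d = 0))
att_hat <- mean(y1_hat - y0_hat)
att_hat
```

With a linear model this average equals the treatment coefficient plus the interaction coefficient times the treated-group covariate mean, which is why restricting newdata to the treated units targets the ATT rather than the ATE.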

7.6 Balance plots

cobalt::love.plot produces the standard “love plot” diagnostic showing absolute standardised mean differences before and after matching:

love.plot(m.full,
          stats = "mean.diffs",
          thresholds = c(m = 0.1),
          abs = TRUE,
          var.order = "unadjusted",
          title = "Balance plot — Full matching")

Figure: love plot of absolute standardised mean differences before and after full matching.

A successful matching produces balance below 0.1 (or, more conservatively, 0.05) for every covariate. The love plot makes this immediate.

7.7 Matching with replacement vs without

By default, matchit(..., replace = FALSE) uses each control unit at most once. When few controls resemble the treated units, because the treated group is large relative to the control pool or the propensity score distributions diverge sharply, matching with replacement (replace = TRUE) can improve balance at the cost of effective sample size:

m.repl <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                  data = lalonde,
                  method = "nearest",
                  distance = "glm",
                  replace = TRUE)
bal.tab(m.repl, thresholds = c(m = 0.1))
Balance Measures
                Type Diff.Adj        M.Threshold
distance    Distance   0.0044     Balanced, <0.1
age          Contin.   0.2395 Not Balanced, >0.1
educ         Contin.  -0.0161     Balanced, <0.1
race_black    Binary   0.0054     Balanced, <0.1
race_hispan   Binary  -0.0054     Balanced, <0.1
race_white    Binary   0.0000     Balanced, <0.1
married       Binary   0.0595     Balanced, <0.1
nodegree      Binary   0.0054     Balanced, <0.1
re74         Contin.  -0.0493     Balanced, <0.1
re75         Contin.   0.0087     Balanced, <0.1

Balance tally for mean differences
                   count
Balanced, <0.1         9
Not Balanced, >0.1     1

Variable with the greatest mean difference
 Variable Diff.Adj        M.Threshold
      age   0.2395 Not Balanced, >0.1

Sample sizes
                     Control Treated
All                   429.       185
Matched (ESS)          46.31     185
Matched (Unweighted)   82.       185
Unmatched             347.         0

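Matching with replacement is recommended when overlap is poor and balance is otherwise unachievable. Here it brings every row except age below 0.1, but only 82 distinct controls are used and their effective sample size falls to about 46.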
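
The effective sample size (ESS) in these tables summarises how much information survives unequal weighting. A minimal sketch using Kish's approximation, which is the usual formula for this quantity (the function name ess is hypothetical):

```r
# Kish's approximate effective sample size for a vector of weights
ess <- function(w) sum(w)^2 / sum(w^2)

ess(c(1, 1, 1, 1))   # four equally weighted controls: 4
ess(c(3, 1))         # one control reused three times, one used once: 1.6
```

Reusing a control three times does not create three controls' worth of information, which is why the ESS of 1.6 falls below the two distinct units and far below the four matches they supply.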

7.8 ATE, ATT, ATC — choose carefully

The default MatchIt estimand is the ATT (effect on the treated). To target the ATE, use estimand = "ATE" with a method that doesn’t drop units, such as full matching, or switch to weighting. To target the ATC (effect on the controls), use estimand = "ATC". The three estimands answer different policy questions; see the Five Estimands chapter for the framing.

7.9 When matching fails

Matching has known limitations:

  • Curse of dimensionality: high-dimensional matching is hard. Use propensity scores or coarsened exact matching to reduce dimensions.
  • No overlap: if certain regions of \(X\) have no treated (or no control) units, no matching method can rescue them. CEM exposes this honestly by dropping such units; nearest-neighbour matching hides the problem in imbalanced pairs.
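  • Sensitivity to specification: the propensity score model is itself a regression that can be misspecified, which typically shows up as residual imbalance, so always check balance after matching. Hidden bias from unobserved confounders is a separate problem that no balance table can reveal; the Sensitivity Analysis chapter covers Rosenbaum bounds, which address it specifically for matched samples.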

7.10 Matching vs weighting

Inverse propensity weighting (IPW) and matching are close cousins. Matching subsets to comparable units; weighting reweights to make the treated and control distributions match. Modern theory increasingly favours weighting for two reasons:

  1. No information loss — IPW uses every observation; matching may drop unmatched units.
  2. Smoothness: IPW weights are a smooth function of the estimated propensity score, while matching makes discrete pairing decisions that add noise and complicate inference.

In practice, the difference is small when overlap is good. When overlap is poor, both methods break down, and the right response is either to restrict the population (CEM-style) or to use a doubly-robust estimator (AIPW, TMLE), which combines a weighting model with an outcome model and remains consistent if at least one of the two is correctly specified.

7.11 Modern weighting alternatives

The WeightIt package implements modern weighting methods that often outperform plain IPW, particularly when overlap is marginal. The companion blog chapter on weighting covers these in more detail; here we summarise the four most useful variants and apply them to the Lalonde data already in this chapter.

7.11.1 Inverse probability of treatment weighting (IPW)

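The baseline: estimate the propensity score by logistic regression and weight by its inverse. For the ATE, treated units get weight \(1/\hat e(X)\) and controls \(1/(1 - \hat e(X))\). For the ATT, which is what the code below targets, treated units get weight 1 and controls get the odds \(\hat e(X)/(1 - \hat e(X))\):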

w_ipw <- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                  data = lalonde, estimand = "ATT", method = "glm")
bal.tab(w_ipw, stats = c("m"), thresholds = c(m = 0.05))
Balance Measures
                Type Diff.Adj         M.Threshold
prop.score  Distance  -0.0205     Balanced, <0.05
age          Contin.   0.1188 Not Balanced, >0.05
educ         Contin.  -0.0284     Balanced, <0.05
race_black    Binary  -0.0022     Balanced, <0.05
race_hispan   Binary   0.0002     Balanced, <0.05
race_white    Binary   0.0021     Balanced, <0.05
married       Binary   0.0186     Balanced, <0.05
nodegree      Binary   0.0184     Balanced, <0.05
re74         Contin.  -0.0021     Balanced, <0.05
re75         Contin.   0.0110     Balanced, <0.05

Balance tally for mean differences
                    count
Balanced, <0.05         9
Not Balanced, >0.05     1

Variable with the greatest mean difference
 Variable Diff.Adj         M.Threshold
      age   0.1188 Not Balanced, >0.05

Effective sample sizes
           Control Treated
Unadjusted  429.       185
Adjusted     99.82     185

7.11.2 Covariate balancing propensity score (CBPS)

CBPS (Imai-Ratkovic 2014) estimates the propensity score under balance constraints, ensuring that the weighted covariate means are balanced even when the logistic model is misspecified:

w_cbps <- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                   data = lalonde, estimand = "ATT", method = "cbps")
bal.tab(w_cbps, stats = c("m"), thresholds = c(m = 0.05))
Balance Measures
                Type Diff.Adj     M.Threshold
prop.score  Distance  -0.0181 Balanced, <0.05
age          Contin.  -0.0000 Balanced, <0.05
educ         Contin.  -0.0001 Balanced, <0.05
race_black    Binary  -0.0000 Balanced, <0.05
race_hispan   Binary  -0.0000 Balanced, <0.05
race_white    Binary   0.0000 Balanced, <0.05
married       Binary  -0.0000 Balanced, <0.05
nodegree      Binary  -0.0000 Balanced, <0.05
re74         Contin.  -0.0000 Balanced, <0.05
re75         Contin.  -0.0000 Balanced, <0.05

Balance tally for mean differences
                    count
Balanced, <0.05        10
Not Balanced, >0.05     0

Variable with the greatest mean difference
 Variable Diff.Adj     M.Threshold
     educ  -0.0001 Balanced, <0.05

Effective sample sizes
           Control Treated
Unadjusted  429.       185
Adjusted     98.45     185

7.11.3 Entropy balancing

Entropy balancing (Hainmueller 2012) solves a convex optimisation problem that yields weights with exact mean balance on the included covariates. For the ATT it is doubly robust in the sense that it is consistent if either the control outcome is linear in the covariates or the log-odds of treatment are, and it attains the semiparametric efficiency bound when both hold.

w_ebal <- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                   data = lalonde, estimand = "ATT", method = "ebal")
bal.tab(w_ebal, stats = c("m"), thresholds = c(m = 0.05))
Balance Measures
               Type Diff.Adj     M.Threshold
age         Contin.       -0 Balanced, <0.05
educ        Contin.       -0 Balanced, <0.05
race_black   Binary        0 Balanced, <0.05
race_hispan  Binary       -0 Balanced, <0.05
race_white   Binary       -0 Balanced, <0.05
married      Binary        0 Balanced, <0.05
nodegree     Binary        0 Balanced, <0.05
re74        Contin.       -0 Balanced, <0.05
re75        Contin.       -0 Balanced, <0.05

Balance tally for mean differences
                    count
Balanced, <0.05         9
Not Balanced, >0.05     0

Variable with the greatest mean difference
   Variable Diff.Adj     M.Threshold
 race_black        0 Balanced, <0.05

Effective sample sizes
           Control Treated
Unadjusted  429.       185
Adjusted     98.46     185

Notice that every standardised mean difference is now essentially zero — entropy balancing guarantees this by construction.
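
The "by construction" claim can be checked with a small base-R sketch of entropy balancing for the ATT. It minimises the dual objective, a log-sum-exp of the control covariates centred at the treated means, whose gradient is exactly the weighted mean imbalance. The data and variable names are hypothetical, and this is a bare-bones illustration rather than the algorithm WeightIt uses internally.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.4)
d  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 + 0.5 * x2))  # hypothetical assignment
X  <- cbind(x1, x2)

# Target moments: covariate means in the treated group
target <- colMeans(X[d == 1, , drop = FALSE])
# Control covariates centred at the target
Xc <- sweep(X[d == 0, , drop = FALSE], 2, target)

# Dual objective and its gradient (the weighted mean of the centred controls)
dual <- function(lambda) log(sum(exp(Xc %*% lambda)))
grad <- function(lambda) {
  w <- exp(Xc %*% lambda)
  drop(crossprod(Xc, w) / sum(w))
}
opt <- optim(c(0, 0), dual, grad, method = "BFGS",
             control = list(reltol = 1e-12))

# Control weights, normalised to sum to one
w <- drop(exp(Xc %*% opt$par))
w <- w / sum(w)

# Weighted control means minus treated means: zero up to optimiser tolerance
balance_gap <- colSums(w * X[d == 0, ]) - target
balance_gap
```

At the optimum the gradient vanishes, which is the same statement as "weighted control means equal treated means". The sketch balances means only; adding squares or interactions to X would balance those moments too.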

7.11.4 Energy balancing

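Energy balancing (Huling & Mak 2024) chooses weights that minimise a weighted energy distance between the treated and control covariate distributions. Unlike entropy balancing, which balances only the specified moments (here, means), energy balancing targets the full joint distribution of the covariates. The price in this example is a smaller effective control sample (about 42, against about 98 for entropy balancing).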

w_energy <- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                     data = lalonde, estimand = "ATT", method = "energy")
bal.tab(w_energy, stats = c("m"), thresholds = c(m = 0.05))
Balance Measures
               Type Diff.Adj     M.Threshold
age         Contin.  -0.0016 Balanced, <0.05
educ        Contin.   0.0106 Balanced, <0.05
race_black   Binary   0.0060 Balanced, <0.05
race_hispan  Binary  -0.0008 Balanced, <0.05
race_white   Binary  -0.0053 Balanced, <0.05
married      Binary  -0.0011 Balanced, <0.05
nodegree     Binary   0.0050 Balanced, <0.05
re74        Contin.  -0.0021 Balanced, <0.05
re75        Contin.   0.0226 Balanced, <0.05

Balance tally for mean differences
                    count
Balanced, <0.05         9
Not Balanced, >0.05     0

Variable with the greatest mean difference
 Variable Diff.Adj     M.Threshold
     re75   0.0226 Balanced, <0.05

Effective sample sizes
           Control Treated
Unadjusted  429.       185
Adjusted     41.82     185
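
To see why a distributional criterion can matter even when means agree, here is a base-R sketch of the plain (unweighted) energy distance between two samples. The function name and toy data are hypothetical; the weighted version optimised by energy balancing is more involved and is not reproduced here.

```r
# Energy distance between two samples: 2 E|A - B| - E|A - A'| - E|B - B'|
# (averages taken over all pairs, including each point with itself)
energy_distance <- function(a, b) {
  a <- as.matrix(a)
  b <- as.matrix(b)
  n <- nrow(a)
  m <- nrow(b)
  D <- as.matrix(dist(rbind(a, b)))             # all pairwise Euclidean distances
  d_ab <- mean(D[1:n, (n + 1):(n + m)])
  d_aa <- mean(D[1:n, 1:n])
  d_bb <- mean(D[(n + 1):(n + m), (n + 1):(n + m)])
  2 * d_ab - d_aa - d_bb
}

# Two toy samples with identical means but different spread
x <- c(-1, -0.5, 0, 0.5, 1)
y <- c(-3, -1.5, 0, 1.5, 3)

mean(x) - mean(y)        # means are perfectly balanced
energy_distance(x, y)    # but the distributions still differ
energy_distance(x, x)    # identical samples
```

A mean-based diagnostic would declare x and y balanced, while the energy distance is strictly positive and is zero only when the two samples coincide.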

7.11.5 Estimation with weighting-aware regression

Once weights are computed, use lm_weightit() (rather than plain lm with a weights argument) so that the weighting uncertainty is propagated into the treatment-effect standard error:

fit_ebal <- lm_weightit(
  re78 ~ treat * (age + educ + race + married + nodegree + re74 + re75),
  data     = lalonde,
  weightit = w_ebal
)

avg_comparisons(fit_ebal, variables = "treat",
                newdata = subset(lalonde, treat == 1))   # ATT

 Estimate Std. Error    z Pr(>|z|)   S 2.5 % 97.5 %
     1273        770 1.65   0.0983 3.3  -236   2783

Term: treat
Type: probs
Comparison: 1 - 0

The result is the entropy-balanced ATT with valid sandwich-style standard errors that account for the estimated weights.

7.11.6 When to use each weighting method

Method            | Strengths                                                 | Weaknesses
----------------- | --------------------------------------------------------- | -------------------------------------------------
IPW (glm)         | Familiar; well-studied                                    | Sensitive to PS misspecification; extreme weights
CBPS              | Balance-constrained PS; robust to model misspecification  | Slower; can fail with many covariates
Entropy balancing | Exact mean balance; doubly robust for ATT                 | Balances means only, not distributions
Energy balancing  | Balances entire distribution                              | Computationally heavier; less mature theory

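For applied work, entropy balancing is a practical default: it achieves exact mean balance, which IPW does not guarantee, and it is typically faster than CBPS. On the Lalonde data its effective control sample size (about 98) is on par with IPW and CBPS. Energy balancing is the more ambitious choice when full distributional balance matters, at the cost of a smaller effective sample size (about 42 here).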

7.11.7 Comparing matching to weighting

The matching examples earlier in this chapter and the weighting examples here both target the ATT on the Lalonde data. In practice, an applied researcher should try several methods, report all balance diagnostics, and choose the specification that achieves the best covariate balance before looking at the outcome estimates. When matching, IPW, and the balancing-weight methods all give similar estimates, the analysis is not being driven by the choice of adjustment method. When they diverge sharply, the usual culprits are poor overlap or model dependence, and the right response is to restrict the population or use a doubly-robust estimator.

For an extended treatment of matching with MatchIt (including a comparison with Stata’s teffects commands), see the companion blog chapters on matching and treatment-effects in Stata.

7.12 Summary

  • Matching estimates causal effects by pairing treated units to similar controls, restricting the comparison to the region of overlap in covariates.
  • Choose a distance: propensity score for high-dimensional problems; Mahalanobis for low-dimensional ones. CEM skips the distance altogether and is the choice when exact balance on coarsened covariates is desired.
  • Choose a method: nearest-neighbour is simple but often imbalanced; full matching uses all observations and usually balances best.
  • Estimate with weighted regression on the matched dataset, clustering SEs by subclass.
  • Diagnose with balance plots (love.plot) before reporting the effect.
  • For more advanced treatment of MatchIt and WeightIt, see the companion blog chapter.