17 Shift-Share Instrumental Variables

using DataFrames
using LinearAlgebra
using Statistics
using Random
using CairoMakie
using Panelest
using StatsModels
using CSV
using Printf   # provides the @printf macro used throughout

using ShiftShareIV

Shift-share, or Bartik, instruments combine local industry shares with industry-level shocks. The idea is simple: a location with a large baseline share in an industry is more exposed to shocks in that industry. The difficult part is not constructing the instrument. The difficult part is stating what is assumed exogenous: the shares, the shocks, or both.

17.1 The construction

For each location \(\ell\) and industry \(k\), let \(s_{\ell k}\) be the share of local employment in industry \(k\) at baseline, and let \(g_k\) be a national growth rate (or trade shock) for that industry. The shift-share instrument is

\[ B_\ell \;=\; \sum_{k=1}^{K} s_{\ell k}\, g_k. \]

This is “shift-share” because it combines a local share with an industry-level shift. Bartik (1991) used local industry mix to predict local employment growth. Autor, Dorn, and Hanson (2013) used local industry shares interacted with industry-level Chinese import growth in other high-income countries.
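As a minimal sketch of the formula above, the instrument is a matrix-vector product of the \(L \times K\) share matrix and the \(K\)-vector of shocks. The numbers are made up for illustration, and `bartik_by_hand` is a hypothetical helper, not the ShiftShareIV.jl implementation.

```julia
# Toy example: 3 locations, 2 industries (all numbers are illustrative).
shares_toy = [0.8 0.2;
              0.5 0.5;
              0.1 0.9]          # each row sums to one
shocks_toy = [0.10, -0.05]      # industry-level shifts g_k

# B_l = sum_k s_lk * g_k is a matrix-vector product.
bartik_by_hand(s::AbstractMatrix, g::AbstractVector) = s * g

B_toy = bartik_by_hand(shares_toy, shocks_toy)
println(B_toy)                  # approximately [0.07, 0.025, -0.035]
```

The first location is mostly exposed to the industry with the positive shock, so its predicted shock is the largest; the third location is the mirror image.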

The first-stage regression is then

\[ \Delta L_\ell \;=\; \pi_0 + \pi_1 B_\ell + X_\ell'\gamma + \epsilon_\ell, \]

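where \(\Delta L_\ell\) is the endogenous local variable (for example, observed local employment growth) and \(X_\ell\) collects controls. \(B_\ell\) then serves as the instrument for \(\Delta L_\ell\) in a 2SLS regression for the outcome of interest.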

17.2 Two identification views

There are two ways to justify a shift-share instrument. They put the exogeneity assumption on different objects.

17.2.1 Share view (GPSS 2020)

Treat the baseline shares \(s_{\ell k}\) as the source of identifying variation. Under this view (Goldsmith-Pinkham, Sorkin, and Swift 2020), the shares should be uncorrelated with the second-stage unobservables, conditional on controls, and the shocks only determine how the \(K\) share instruments are combined. GPSS show that the shift-share estimator is a weighted average of the \(K\) just-identified share IVs, with Rotemberg weights measuring each industry's influence.

17.2.2 Shock view (BHJ 2022)

Treat the shocks \(g_k\) as the source of identifying variation; the shares are exposure weights. Under this view, the shocks should be uncorrelated with the second-stage unobservables, conditional on industry-level controls. BHJ (Borusyak, Hull, and Jaravel) show how to rewrite the regional regression as an equivalent regression at the shock level.

The two views are not mutually exclusive. In an application, we should be clear which one is being defended and report diagnostics for both.

17.3 Inference

Conventional heteroskedasticity-robust or region-clustered standard errors on the regional regression tend to understate uncertainty: the same industry shocks enter many regions, so regions with similar industry mixes have correlated errors even when they are far apart, and region-level clustering does not capture this. Two corrections are now standard:

Cluster on shocks (BHJ): equivalent to running the regression at the industry-shock level. Implemented via the bhj_collapse function.
AKM SE (Adão, Kolesár, Morales 2019): explicit formula accounting for shock-level correlation; requires custom implementation or a dedicated package.
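
To see where the cross-region correlation comes from, suppose (as an illustrative assumption in the spirit of AKM, not a result stated in this chapter) that the regional error has its own shift-share component, \(\epsilon_\ell = \sum_k s_{\ell k}\,\nu_k + \eta_\ell\), where the industry-level unobservables \(\nu_k\) are mutually uncorrelated and the \(\eta_\ell\) are independent across regions and of the \(\nu_k\). Holding the shares fixed, for two regions \(\ell \neq m\):

\[ \operatorname{Cov}(\epsilon_\ell, \epsilon_m) \;=\; \sum_{k=1}^{K} s_{\ell k}\, s_{m k}\, \operatorname{Var}(\nu_k). \]

This is nonzero whenever the two regions share an industry, regardless of geographic distance, which is why clustering by region or state cannot absorb it.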

17.4 Simulation: build intuition

Use a small DGP with regions, industries, and a known causal effect.

# The CSVs below are 500 locations x 20 industries; beta_true is the DGP coefficient.
beta_true  = 0.5

df     = CSV.read("data/shift_share_sim.csv", DataFrame)
shares = Matrix(CSV.read("data/shift_share_shares.csv", DataFrame))
shocks = CSV.read("data/shift_share_shocks.csv", DataFrame).shock
X = df.X; Y = df.Y; u = df.u
first(df[:, [:region, :X, :Y, :u]], 6)

6×4 DataFrame

Row	region	X	Y	u
	Int64	Float64	Float64	Float64
1	1	-0.518194	0.11729	-0.308258
2	2	0.212703	-1.06765	-0.538806
3	3	0.652136	0.892118	1.37164
4	4	0.285676	0.861261	1.09974
5	5	-0.334493	-0.329605	-0.111544
6	6	0.37613	-1.72747	-1.75462

Naive OLS is contaminated by the confounder u, so we compare it with the shift-share IV:

ols  = feols(df, @formula(Y ~ X))
ivss = feiv(df, @formula(Y ~ 1), endo = :X, inst = :B)

@printf("%-25s %8s %8s\n", "Estimator", "Coef", "SE")
@printf("%-25s %8.3f %8.3f\n", "OLS (biased)",
        coef(ols)[end], stderror(ols)[end])
@printf("%-25s %8.3f %8.3f\n", "Shift-share IV",
        coef(ivss)[1], stderror(ivss)[1])
@printf("True beta: %.3f\n", beta_true)

Estimator                     Coef       SE
OLS (biased)                 1.451    0.079
Shift-share IV               0.531    0.120
True beta: 0.500

OLS (1.451) is biased upward by the confounder, at almost three times the true effect of 0.5; the shift-share IV estimate (0.531) lies within one standard error of the truth.
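
The comparison can be reproduced without the CSV files. The sketch below uses a hypothetical DGP (it is not the process that generated the files above): the confounder enters both X and Y but is independent of the instrument. With one regressor and an intercept, OLS and just-identified IV are both covariance ratios, so only the standard library is needed.

```julia
using Random, Statistics

# Hypothetical DGP for illustration; not the process behind the CSV files.
rng = MersenneTwister(1)
n_loc, n_ind, beta_sim = 2_000, 20, 0.5

raw   = rand(rng, n_loc, n_ind) .^ 4        # skewed draws give dispersed shares
s_sim = raw ./ sum(raw, dims = 2)           # each row sums to one
g_sim = randn(rng, n_ind)                   # industry-level shocks
B_sim = s_sim * g_sim                       # shift-share instrument
u_sim = randn(rng, n_loc)                   # unobserved confounder
X_sim = B_sim .+ 0.3 .* u_sim .+ 0.2 .* randn(rng, n_loc)
Y_sim = beta_sim .* X_sim .+ u_sim .+ 0.2 .* randn(rng, n_loc)

# With one regressor plus an intercept, both estimators are covariance ratios.
beta_ols = cov(X_sim, Y_sim) / var(X_sim)
beta_iv  = cov(B_sim, Y_sim) / cov(B_sim, X_sim)

println("OLS: ", round(beta_ols, digits = 3), "   IV: ", round(beta_iv, digits = 3))
```

Because u raises both X and Y, the OLS ratio picks up cov(X, u) in the numerator; the IV ratio does not, since the instrument is built only from shares and shocks.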

17.4.1 Rotemberg weights

The GPSS decomposition says the shift-share IV is a weighted average of \(K\) just-identified IVs (one per industry share). Compute the weights:

rw = rotemberg_weights(shares, shocks, X, Y)

@printf("%-10s %8s %8s %8s\n", "Industry", "Shock", "α_k", "β_k")
for row in eachrow(rw)
    @printf("%-10d %8.3f %8.3f %8.3f\n",
            row.industry, row.shock, row.alpha, row.beta_k)
end
@printf("\nGPSS identity: Σ α_k β_k = %.4f\n", sum(rw.alpha_beta))
@printf("Shift-share IV estimate: %.4f\n", coef(ivss)[1])

Industry      Shock      α_k      β_k
1             1.250    0.054    0.604
2            -0.959    0.048    0.244
3            -0.212    0.005    2.576
4             0.745    0.002  -10.225
5             2.663    0.387    0.469
6            -1.068    0.082    1.025
7             0.100   -0.001   -1.627
8            -0.718    0.025   -0.689
9             1.181    0.058    0.853
10           -0.197    0.004    1.188
11            0.741    0.023   -0.112
12            0.051   -0.001   -0.055
13           -0.442    0.014    0.611
14            0.270    0.003   -0.013
15            0.631    0.003   -3.240
16            2.271    0.204    0.746
17           -0.212    0.001    0.174
18           -0.115    0.000  -39.532
19            0.020   -0.000    1.234
20           -1.183    0.089    0.554

GPSS identity: Σ α_k β_k = 0.5309
Shift-share IV estimate: 0.5309

Two diagnostics matter:

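Weight concentration: if one or two industries carry most of the weight, the identifying variation is essentially coming from those industries’ shares. Their exogeneity is the assumption you are really leaning on. In the table above, industries 5 and 16 together carry about 59% of the weight.
Heterogeneous \(\hat\beta_k\): in this benign simulation, the two highest-weight industries (5 and 16) give estimates of 0.469 and 0.746, in the neighborhood of the true effect. The extreme values (for example \(-39.532\) for industry 18 and \(-10.225\) for industry 4) belong to industries with weights near zero, whose shares are weak instruments on their own, so they barely affect the total. In real data, if industries with substantial weight give very different estimates, the shift-share IV is averaging over heterogeneous effects, and you should report that explicitly.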
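
The decomposition behind both diagnostics can be sketched by hand for the intercept-only case. Writing \(Z_k\) for the \(k\)-th share column and using demeaned \(X\) and \(Y\), the weights are \(\alpha_k = g_k Z_k'X / \sum_j g_j Z_j'X\) and the just-identified estimates are \(\hat\beta_k = Z_k'Y / Z_k'X\). The helper `rotemberg_by_hand` and the DGP below are hypothetical illustrations, not the package's `rotemberg_weights`.

```julia
using Random, Statistics

# Hypothetical helper for the intercept-only case, where residualizing on the
# controls reduces to demeaning. Not the ShiftShareIV.jl implementation.
function rotemberg_by_hand(s, g, x, y)
    xd = x .- mean(x)
    yd = y .- mean(y)
    zx = s' * xd                        # Z_k' x for each industry k
    zy = s' * yd                        # Z_k' y for each industry k
    alpha  = (g .* zx) ./ sum(g .* zx)  # Rotemberg weights; they sum to one
    beta_k = zy ./ zx                   # just-identified IV using share k alone
    return alpha, beta_k
end

# Illustrative data (hypothetical DGP with true effect 0.5).
rng  = MersenneTwister(2)
raw  = rand(rng, 500, 20) .^ 4
s_rw = raw ./ sum(raw, dims = 2)
g_rw = randn(rng, 20)
u_rw = randn(rng, 500)
x_rw = s_rw * g_rw .+ 0.3 .* u_rw .+ 0.2 .* randn(rng, 500)
y_rw = 0.5 .* x_rw .+ u_rw .+ 0.2 .* randn(rng, 500)

alpha_rw, beta_k_rw = rotemberg_by_hand(s_rw, g_rw, x_rw, y_rw)
B_rw      = s_rw * g_rw
beta_full = cov(B_rw, y_rw) / cov(B_rw, x_rw)

println("sum of weights:       ", sum(alpha_rw))
println("weighted average:     ", sum(alpha_rw .* beta_k_rw))
println("shift-share estimate: ", beta_full)
```

The formulas also explain the pattern in the table: an industry whose \(Z_k'X\) is close to zero gets both a tiny weight and an unstable \(\hat\beta_k\), because the same quantity sits in the numerator of \(\alpha_k\) and the denominator of \(\hat\beta_k\).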

fig = Figure(size = (640, 380))
ax  = Axis(fig[1, 1],
           xlabel = "Industry",
           ylabel = "Just-identified IV estimate (β_k)",
           title  = "Industry-level IV estimates and Rotemberg weights")

# Bubble plot: position = beta_k, size = |alpha_k|
scatter!(ax, rw.industry, rw.beta_k,
         markersize = 30 .* sqrt.(abs.(rw.alpha)),
         color = (:steelblue, 0.7))
hlines!(ax, [beta_true], color = :firebrick, linestyle = :dash,
        linewidth = 2, label = "True β = $(beta_true)")
axislegend(ax, position = :rb, framevisible = false)
fig

17.4.2 What goes wrong when shares are endogenous

Now break the share-exogeneity assumption: make industry 1’s share correlated with the second-stage error.

v = CSV.read("data/shift_share_bad_v.csv", DataFrame).v
shares_bad = copy(shares)
shares_bad[:, 1] .= max.(0.01, shares[:, 1] .+ 0.15 .* v)
shares_bad = shares_bad ./ sum(shares_bad, dims=2)   # renormalise

bad_noise = CSV.read("data/shift_share_bad_noise.csv", DataFrame)
B_bad = bartik_iv(shares_bad, shocks)
X_bad = B_bad .+ 0.3 .* u .+ bad_noise.noise_x
Y_bad = beta_true .* X_bad .+ u .+ 0.6 .* v .+ bad_noise.noise_y

df_bad  = DataFrame(X = X_bad, Y = Y_bad, B = B_bad)
iv_bad  = feiv(df_bad, @formula(Y ~ 1), endo = :X, inst = :B)

@printf("True beta: %.3f\n", beta_true)
@printf("Shift-share IV with bad share 1: %.3f\n", coef(iv_bad)[1])

# Would the Rotemberg weights have flagged industry 1?
rw_bad = rotemberg_weights(shares_bad, shocks, X_bad, Y_bad)
ord    = sortperm(abs.(rw_bad.alpha), rev = true)
rank1  = findfirst(==(1), rw_bad.industry[ord])
@printf("Rotemberg weight of industry 1: %.3f (rank %d of %d by |α|)\n",
        rw_bad.alpha[rw_bad.industry .== 1][1], rank1, nrow(rw_bad))

# Leave-one-industry-out: rebuild the instrument without industry 1
B_loo  = bartik_iv(shares_bad[:, 2:end], shocks[2:end])
df_loo = DataFrame(X = X_bad, Y = Y_bad, B = B_loo)
iv_loo = feiv(df_loo, @formula(Y ~ 1), endo = :X, inst = :B)
@printf("Leave-industry-1-out IV: %.3f\n", coef(iv_loo)[1])

True beta: 0.500
Shift-share IV with bad share 1: 0.798
Rotemberg weight of industry 1: 0.110 (rank 3 of 20 by |α|)
Leave-industry-1-out IV: 0.480

The estimate is biased, and the bias comes from one industry’s share being related to the error. Note what the diagnostics do and do not catch here. The endogenous industry’s Rotemberg weight is unremarkable — it ranks third of twenty — because the weights measure influence (whose exogeneity the estimate leans on most), not endogeneity. An industry with a modest weight can still move the estimate substantially when its share is correlated with the error. What isolates the problem is the leave-one-industry-out check: dropping industry 1 from the instrument restores an estimate close to the truth.
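
In practice the problematic industry is not known in advance, so the leave-one-out check would be run for every industry. The sketch below is one way to do that with the standard library only. `leave_one_out_iv` is a hypothetical helper, and the DGP is an assumption made for illustration: unlike the code above, the shares are not renormalised, so only industry 1 is contaminated, and industry 1's shock is set to 1.5 by hand.

```julia
using Random, Statistics

# Hypothetical helper: rebuild the instrument without industry k and re-estimate
# the just-identified IV (intercept only), for every k.
function leave_one_out_iv(s, g, x, y)
    n_ind = length(g)
    return map(1:n_ind) do k
        keep = filter(!=(k), 1:n_ind)
        b_k  = s[:, keep] * g[keep]
        cov(b_k, y) / cov(b_k, x)
    end
end

# Illustrative DGP (assumed): only industry 1's share loads on the error term v.
rng   = MersenneTwister(3)
raw   = rand(rng, 2_000, 20) .^ 4
s_loo = raw ./ sum(raw, dims = 2)
v_loo = randn(rng, 2_000)
s_loo[:, 1] .= max.(0.01, s_loo[:, 1] .+ 0.15 .* v_loo)   # not renormalised
g_loo = randn(rng, 20)
g_loo[1] = 1.5                                            # sizable shock for industry 1
u_loo = randn(rng, 2_000)
b_loo = s_loo * g_loo
x_loo = b_loo .+ 0.3 .* u_loo .+ 0.2 .* randn(rng, 2_000)
y_loo = 0.5 .* x_loo .+ u_loo .+ 0.6 .* v_loo .+ 0.2 .* randn(rng, 2_000)

full = cov(b_loo, y_loo) / cov(b_loo, x_loo)
loo  = leave_one_out_iv(s_loo, g_loo, x_loo, y_loo)

println("full-instrument estimate: ", round(full, digits = 3))
println("dropping industry 1:      ", round(loo[1], digits = 3))
println("largest shift comes from dropping industry ", argmax(abs.(loo .- full)))
```

A large shift from dropping a single industry does not prove that its share is endogenous, but it shows where the estimate is fragile and which exogeneity argument needs the most attention.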

17.5 BHJ shock-level inference

Collapse the location-level data to the shock level and run a weighted IV at the shock (industry) level. This reframes the identifying assumption: shocks must be uncorrelated with the shock-level aggregated outcome residual.

collapsed = bhj_collapse(shares, shocks, Y, X)

# BHJ shock-level 2SLS: regress Y_agg ~ X_agg | shock, weighted by weight
df_collapsed = collapsed
iv_bhj = feiv(df_collapsed, @formula(Y_agg ~ 1),
              endo = :X_agg, inst = :shock, weights = :weight)

@printf("Location-level shift-share IV: %.3f\n", coef(ivss)[1])
@printf("BHJ shock-level IV:             %.3f\n", coef(iv_bhj)[1])
@printf("(Should be close; equivalent in the no-controls case)\n")

Location-level shift-share IV: 0.531
BHJ shock-level IV:             0.537
(Should be close; equivalent in the no-controls case)

The BHJ shock-level regression coincides with the location-level IV when there are no controls — up to implementation details of the weighting, so expect close rather than identical values in practice. Its value is interpretive: validity is now a statement about shocks (20 observations, one per industry) rather than locations (500 observations), and testing shock-level exogeneity is more transparent.
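
The equivalence can be checked by hand in the simplest case. The sketch below assumes equal location weights and an intercept as the only control; `collapse_by_hand` is a hypothetical helper, not the package's `bhj_collapse`, whose weighting may differ in detail. Each industry gets a weight equal to its average share and exposure-weighted averages of the demeaned outcome and treatment.

```julia
using Random, Statistics

# Hypothetical helper (equal location weights, intercept-only controls);
# not the ShiftShareIV.jl `bhj_collapse`.
function collapse_by_hand(s, x, y)
    xd = x .- mean(x)
    yd = y .- mean(y)
    exposure = vec(sum(s, dims = 1))        # total exposure to each industry
    weight   = exposure ./ size(s, 1)       # s_k: average share of industry k
    x_agg    = (s' * xd) ./ exposure        # exposure-weighted mean of x
    y_agg    = (s' * yd) ./ exposure        # exposure-weighted mean of y
    return weight, x_agg, y_agg
end

# Illustrative data (hypothetical DGP with true effect 0.5).
rng = MersenneTwister(4)
raw = rand(rng, 500, 20) .^ 4
s_c = raw ./ sum(raw, dims = 2)
g_c = randn(rng, 20)
u_c = randn(rng, 500)
x_c = s_c * g_c .+ 0.3 .* u_c .+ 0.2 .* randn(rng, 500)
y_c = 0.5 .* x_c .+ u_c .+ 0.2 .* randn(rng, 500)

w_c, x_agg, y_agg = collapse_by_hand(s_c, x_c, y_c)

# Shock-level IV: weighted by w_c, with the shock as the instrument.
beta_shock    = sum(w_c .* g_c .* y_agg) / sum(w_c .* g_c .* x_agg)
b_c           = s_c * g_c
beta_location = cov(b_c, y_c) / cov(b_c, x_c)

println("shock-level: ", beta_shock, "   location-level: ", beta_location)
```

The two numbers agree up to floating-point error because \(\sum_k s_k g_k \bar{y}_k\) is just \(\sum_\ell B_\ell y_\ell\) divided by the number of locations, with \(y_\ell\) demeaned. When the shares in each location sum to one, adding an intercept to the shock-level regression changes nothing, since the weighted mean of the aggregated outcome is then zero.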

17.6 Empirical: the China shock (Autor, Dorn, Hanson 2013)

The standard shift-share application is ADH (2013). Commuting zones are exposed differently to Chinese import competition because their baseline industry mixes differ. The instrument is

\[ \text{IV}_\ell \;=\; \sum_k \frac{L_{\ell k,\,1990}}{L_{\ell,\,1990}}\, \frac{\Delta M^{\text{other}}_{k}}{L_{k,\,1990}}, \]

where the shocks \(\Delta M^{\text{other}}_k\) are growth in Chinese imports into other high-income countries. This leave-one-out construction removes US-specific demand from the shock.
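
A toy version of this formula, starting from employment levels, might look as follows. All numbers are made up, the variable names are hypothetical, and two assumptions are made for the sketch: \(L_{\ell,1990}\) is total local employment (including sectors outside the \(K\) industries), and national industry employment \(L_{k,1990}\) is the column sum over the three toy zones.

```julia
# Toy inputs (all numbers are made up): 3 commuting zones, 2 industries.
emp_1990   = [100.0 50.0;
               20.0 80.0;
               60.0 60.0]              # L_{lk,1990}: local employment by industry
total_1990 = [400.0, 250.0, 300.0]     # L_{l,1990}: total local employment (assumed)
dM_other   = [30.0, 12.0]              # ΔM^other_k: import growth in other countries

shares_adh = emp_1990 ./ total_1990                     # divide each row by L_l
shocks_adh = dM_other ./ vec(sum(emp_1990, dims = 1))   # ΔM_k / L_{k,1990}
iv_adh_toy = shares_adh * shocks_adh                    # one value per commuting zone

println(iv_adh_toy)
println(vec(sum(shares_adh, dims = 2)))   # row sums are below one in this toy setup
```

Under these assumptions the shares in each zone sum to less than one, because part of local employment lies outside the listed industries.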

The snippet below is schematic — the full ADH specification additionally needs period effects, the ADH control set, and population weights (a validated replication against the GPSS benchmark lives in the ShiftShareIV.jl repository).

# David Dorn distributes the replication data at ddorn.net
# Files needed:
#   workfile_china.csv  (commuting zone data, 1990–2007)
#   industry_shares.csv (czone × industry shares)

using CSV
cz     = DataFrame(CSV.File("data/workfile_china.csv"))
shares = Matrix(select(cz, r"share_"))   # L × K share matrix
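# Caution: the shocks must be a K-vector (one entry per industry, ordered like the
# share columns). A column of `cz` has one entry per commuting zone, so treat this
# line as a placeholder for an industry-level series loaded separately.
shocks = cz.import_shock_other            # K-vector of "other-countries" shocks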

B   = bartik_iv(shares, shocks)               # Bartik instrument
df_adh = hcat(cz, DataFrame(B = B))

# First-stage + second-stage
iv_adh = feiv(df_adh, @formula(d_sh_empl_mfg ~ 1),
              endo = :d_tradeusch_pw, inst = :B,
              vcov_type = Vcov.cluster(:statefip))

# Rotemberg decomposition
rw_adh = rotemberg_weights(shares, shocks, df_adh.d_tradeusch_pw,
                           df_adh.d_sh_empl_mfg)
# Top 5 industries by |alpha| (sort on absolute value so large negative weights are included):
first(sort(rw_adh, :alpha, by = abs, rev = true), 5)

# BHJ shock-level regression
collapsed_adh = bhj_collapse(shares, shocks, df_adh.d_sh_empl_mfg,
                             df_adh.d_tradeusch_pw)
iv_bhj_adh = feiv(collapsed_adh, @formula(Y_agg ~ 1),
                  endo = :X_agg, inst = :shock, weights = :weight)

For a new shift-share paper, I would expect:

Rotemberg weights (via rotemberg_weights): identify which industries drive the estimate; flag weight concentration and heterogeneous \(\hat\beta_k\).
BHJ shock-level regression (via bhj_collapse): reframe validity as a statement about the industry shocks rather than the commuting zones (20 shocks versus 500 locations in the simulation above).
AKM standard errors (Adão, Kolesár, Morales 2019): account for shock-level correlation in inference. Requires custom implementation; see the accompanying R chapter for an R-based version.

17.7 Summary

Shift-share IV combines local shares and industry shocks.
The GPSS view puts exogeneity on shares; the BHJ view puts it on shocks.
Rotemberg weights show which industry shares drive the 2SLS estimate.
Region-level clustering is usually too optimistic. Use shock-level inference and AKM standard errors when possible.
ShiftShareIV.jl provides bartik_iv, rotemberg_weights, and bhj_collapse; Panelest.jl provides feols and feiv.