Introduction to Causal Econometrics with Observational Data
Preface
Most of what applied researchers want to know from data is causal. We want to know whether a program raised earnings, whether a drug reduced mortality, whether a policy changed behavior. With experimental data the answer can sometimes be read off a difference in means; with observational data it almost never can. The bulk of modern causal econometrics is the careful business of stating what would have to be true about the data-generating process for a number we can compute to mean what we want it to mean — and then estimating that number well.
This book is a working guide to that business, written in R. It is aimed at applied researchers who already think in terms of regression and probability, and who want to see the methods carried out end to end rather than described in the abstract. Each chapter pairs a short account of the identifying assumptions with code that runs on a real or simulated dataset. The packages used are the ones a researcher would actually reach for in a paper: fixest, did, etwfe, synthdid and tasc for the various flavours of difference-in-differences; MatchIt and WeightIt for matching and balancing; tmle, DoubleML and medoutcon for doubly robust and nonparametric estimation; grf, lmtp and policytree for heterogeneous and continuous treatments; dagitty, causaleffect and pcalg for graph-based identification and causal discovery; lavaan for structural equation modelling. The book does not try to teach each package; it shows when and how an applied researcher would use one.
The book is organized in the order in which an applied project usually runs. Part I, Identification, develops the potential-outcomes framework, the main causal estimands, and the use of graphs and sensitivity analysis to argue that a target effect is identified from the data at hand. Part II, Estimation, works through the main strategies — regression adjustment, matching and balancing weights, doubly robust estimators, methods for continuous and heterogeneous treatments, and Bayesian extensions. Part III, Designs, covers difference-in-differences (including staggered timing), instrumental variables, regression discontinuity, and shift-share designs. The remaining parts treat longitudinal and survival settings, mediation, and causal discovery. An appendix surveys the R packages used in the text.
A companion volume, Causal Econometrics with Julia, covers the same ground in Julia. Chapters are cross-linked where the two books diverge, usually because a particular estimator is easier or faster in one language than the other. Source files, datasets, and renv-pinned package versions are available in the book's repository; the “Edit this page” link at the foot of each chapter goes directly to the corresponding .qmd file. Corrections and suggestions are welcome through the issues tracker.