using CausalGraphs
using CausalEstimate
import CausalGraphs: identify, make_graph, to_mermaid, markov_pillow
import CausalEstimate: estimate, confint, pvalue
using DataFrames
using DelimitedFiles
using Downloads
using Logging
using Random4 Applied Graph Workflow: Smoking Cessation
The previous chapter introduced the graph-identification-estimation workflow with small examples. This chapter applies the same workflow to a real data set: the complete-case National Health and Nutrition Examination Survey I Epidemiologic Follow-up Study (NHEFS) data distributed through causaldata and mirrored by RDatasets.
The question is:
What is the effect of quitting smoking on weight change from 1971 to 1982?
The treatment is qsmk, an indicator for quitting smoking between baseline and follow-up. The outcome is wt82_71, weight change over the follow-up period. This is observational data, so identification depends on the adjustment assumptions encoded in the graph.
4.1 Data
For a focused example, keep the treatment, outcome, and baseline covariates commonly used in the NHEFS smoking cessation examples.
function read_rdatasets_csv(url, cols)
local_file = Downloads.download(url; timeout = 120)
x, header = readdlm(local_file, ',', Any, '\n'; header = true)
raw = DataFrame(x, Symbol.(vec(header)))
DataFrame((c => Float64.(raw[!, c]) for c in cols)...)
end
nhefs_url = "https://vincentarelbundock.github.io/Rdatasets/csv/causaldata/nhefs_complete.csv"
nhefs_cols = [
:qsmk, :wt82_71,
:sex, :race, :age, :school,
:smokeintensity, :smokeyrs,
:exercise, :active, :wt71,
]
nhefs = read_rdatasets_csv(nhefs_url, nhefs_cols);
(n = nrow(nhefs), variables = ncol(nhefs))(n = 1566, variables = 11)
preview_table(nhefs)| Row | qsmk | wt82_71 | sex | race | age | school | smokeintensity | smokeyrs | exercise | active | wt71 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | |
| 1 | 0.0 | -10.09 | 0.0 | 1.0 | 42.0 | 7.0 | 30.0 | 29.0 | 2.0 | 0.0 | 79.04 |
| 2 | 0.0 | 2.605 | 0.0 | 0.0 | 36.0 | 9.0 | 20.0 | 24.0 | 0.0 | 0.0 | 58.63 |
| 3 | 0.0 | 9.414 | 1.0 | 1.0 | 56.0 | 11.0 | 20.0 | 26.0 | 2.0 | 0.0 | 56.81 |
| 4 | 0.0 | 4.99 | 0.0 | 1.0 | 68.0 | 5.0 | 3.0 | 53.0 | 2.0 | 1.0 | 59.42 |
| 5 | 0.0 | 4.989 | 0.0 | 0.0 | 40.0 | 11.0 | 20.0 | 19.0 | 1.0 | 1.0 | 87.09 |
4.2 Adjustment DAG
Start with the grouped graph
X -> A -> Y
X -----> Y
where X is the baseline covariate set, A is quitting smoking, and Y is later weight change. This graph says the baseline variables may affect both quitting and later weight change, and that there is no unmeasured common cause of quitting and later weight change after conditioning on those baseline variables.
conceptual_graph = make_graph(
vertices = [:X, :A, :Y],
di_edges = [(:X, :A), (:X, :Y), (:A, :Y)],
);println("```{mermaid}")
println(to_mermaid(conceptual_graph; direction = "LR"))
println("```")flowchart LR n1_X["X"] n2_A["A"] n3_Y["Y"] n1_X --> n2_A n1_X --> n3_Y n2_A --> n3_Y classDef fixed stroke:#111827,stroke-width:2px linkStyle 0 stroke:#2563eb,stroke-width:1.8px linkStyle 1 stroke:#2563eb,stroke-width:1.8px linkStyle 2 stroke:#2563eb,stroke-width:1.8px
For estimation, expand X into the observed NHEFS columns. The expanded graph is still an adjustment DAG: each baseline variable is allowed to point into quitting and into later weight change.
nhefs_vertices = [
:sex, :race, :age, :school,
:smokeintensity, :smokeyrs,
:exercise, :active, :wt71,
:qsmk, :wt82_71,
]
baseline = setdiff(nhefs_vertices, [:qsmk, :wt82_71])
nhefs_edges = vcat(
[(w, :qsmk) for w in baseline],
[(w, :wt82_71) for w in baseline],
[(:qsmk, :wt82_71)],
)
nhefs_graph = make_graph(vertices = nhefs_vertices, di_edges = nhefs_edges);
(vertices = length(nhefs_vertices), directed_edges = length(nhefs_edges))(vertices = 11, directed_edges = 19)
4.3 Identification
CausalGraphs.identify checks the graph before estimation. In this adjustment DAG, the quitting node is a-fixable. The Markov pillow is the sufficient adjustment set:
nhefs_id = identify(nhefs_graph, :qsmk, :wt82_71)
DataFrame(
strategy = [string(nhefs_id.strategy)],
markov_pillow = [join(string.(markov_pillow(nhefs_graph, :qsmk; treatment = :qsmk)), ", ")],
)| Row | strategy | markov_pillow |
|---|---|---|
| String | String | |
| 1 | a_fixable | sex, race, age, school, smokeintensity, smokeyrs, exercise, active, wt71 |
The graph has therefore identified the causal mean by the adjustment formula
\[ E[Y(a)] = E_X\{E(Y \mid A=a, X)\}. \]
This formula is not a modeling claim by itself. It is the observed-data functional implied by the graph. Estimation still requires nuisance models for the outcome regression and treatment mechanism.
4.4 Estimation with CausalEstimate
The book keeps identification and estimation as separate steps. Here CausalGraphs supplies the graph result and CausalEstimate estimates the identified functional. Because the graph is a-fixable, GraphID routes to the AIPW average treatment effect machinery.
Random.seed!(2026)
nhefs_result = estimate(
ATE(outcome = :wt82_71, treatment = :qsmk),
GraphID(graph = nhefs_graph),
AIPW(crossfit = 2),
nhefs,
)
effect_table("Quit smoking vs continued smoking", nhefs_result)| Row | estimand | estimate | lower_95 | upper_95 | se |
|---|---|---|---|---|---|
| String | Float64 | Float64 | Float64 | Float64 | |
| 1 | Quit smoking vs continued smoking | 3.866 | 1.257 | 6.475 | 1.331 |
Interpreted under the DAG, the estimate is the average effect of quitting smoking versus continuing smoking on weight change in the complete-case NHEFS population. The positive estimate is consistent with weight gain after quitting. The causal interpretation depends on the graph, especially the no unmeasured confounding assumption after adjustment for the baseline variables.
4.5 Sensitivity Graph
Now encode a different assumption: quitting and later weight change share an unobserved common cause even after conditioning on the measured baseline variables. In an ADMG this is a bidirected edge, qsmk <-> wt82_71.
sensitivity_graph = make_graph(
vertices = nhefs_vertices,
di_edges = nhefs_edges,
bi_edges = [(:qsmk, :wt82_71)],
)
sensitivity_id = identify(sensitivity_graph, :qsmk, :wt82_71);println("```{mermaid}")
println(to_mermaid(sensitivity_graph; direction = "LR"))
println("```")flowchart LR n1_sex["sex"] n2_race["race"] n3_age["age"] n4_school["school"] n5_smokeintensity["smokeintensity"] n6_smokeyrs["smokeyrs"] n7_exercise["exercise"] n8_active["active"] n9_wt71["wt71"] n10_qsmk["qsmk"] n11_wt82_71["wt82_71"] n1_sex --> n10_qsmk n2_race --> n10_qsmk n3_age --> n10_qsmk n4_school --> n10_qsmk n5_smokeintensity --> n10_qsmk n6_smokeyrs --> n10_qsmk n7_exercise --> n10_qsmk n8_active --> n10_qsmk n9_wt71 --> n10_qsmk n1_sex --> n11_wt82_71 n2_race --> n11_wt82_71 n3_age --> n11_wt82_71 n4_school --> n11_wt82_71 n5_smokeintensity --> n11_wt82_71 n6_smokeyrs --> n11_wt82_71 n7_exercise --> n11_wt82_71 n8_active --> n11_wt82_71 n9_wt71 --> n11_wt82_71 n10_qsmk --> n11_wt82_71 n10_qsmk <--> n11_wt82_71 classDef fixed stroke:#111827,stroke-width:2px linkStyle 0 stroke:#2563eb,stroke-width:1.8px linkStyle 1 stroke:#2563eb,stroke-width:1.8px linkStyle 2 stroke:#2563eb,stroke-width:1.8px linkStyle 3 stroke:#2563eb,stroke-width:1.8px linkStyle 4 stroke:#2563eb,stroke-width:1.8px linkStyle 5 stroke:#2563eb,stroke-width:1.8px linkStyle 6 stroke:#2563eb,stroke-width:1.8px linkStyle 7 stroke:#2563eb,stroke-width:1.8px linkStyle 8 stroke:#2563eb,stroke-width:1.8px linkStyle 9 stroke:#2563eb,stroke-width:1.8px linkStyle 10 stroke:#2563eb,stroke-width:1.8px linkStyle 11 stroke:#2563eb,stroke-width:1.8px linkStyle 12 stroke:#2563eb,stroke-width:1.8px linkStyle 13 stroke:#2563eb,stroke-width:1.8px linkStyle 14 stroke:#2563eb,stroke-width:1.8px linkStyle 15 stroke:#2563eb,stroke-width:1.8px linkStyle 16 stroke:#2563eb,stroke-width:1.8px linkStyle 17 stroke:#2563eb,stroke-width:1.8px linkStyle 18 stroke:#2563eb,stroke-width:1.8px linkStyle 19 stroke:#dc2626,stroke-width:1.8px
DataFrame(strategy = [string(sensitivity_id.strategy)])| Row | strategy |
|---|---|
| String | |
| 1 | not_identified |
The identifying conclusion changes. Under this graph, the observed NHEFS variables are not enough to identify the effect. This is the practical value of drawing the graph before estimating: the same data can support different causal answers under different assumptions.
4.6 Structured Baseline Graph
The simple adjustment graph is often the clearest starting point. A richer DAG can record more assumptions about baseline temporal ordering and relationships: demographics may precede education, smoking history, activity, and baseline weight; those variables may then affect quitting and later weight change.
structured_edges = [
(:age, :school), (:age, :smokeyrs), (:age, :wt71),
(:age, :active), (:age, :exercise), (:age, :qsmk), (:age, :wt82_71),
(:sex, :smokeintensity), (:sex, :smokeyrs), (:sex, :wt71),
(:sex, :qsmk), (:sex, :wt82_71),
(:race, :school), (:race, :smokeintensity),
(:race, :qsmk), (:race, :wt82_71),
(:school, :smokeintensity), (:school, :qsmk), (:school, :wt82_71),
(:smokeyrs, :smokeintensity), (:smokeyrs, :qsmk), (:smokeyrs, :wt82_71),
(:smokeintensity, :qsmk), (:smokeintensity, :wt82_71),
(:exercise, :wt71), (:exercise, :qsmk), (:exercise, :wt82_71),
(:active, :wt71), (:active, :qsmk), (:active, :wt82_71),
(:wt71, :qsmk), (:wt71, :wt82_71),
(:qsmk, :wt82_71),
]
structured_graph = make_graph(vertices = nhefs_vertices, di_edges = structured_edges);
structured_id = identify(structured_graph, :qsmk, :wt82_71)
DataFrame(
strategy = [string(structured_id.strategy)],
same_adjustment_set = [
Set(markov_pillow(structured_graph, :qsmk; treatment = :qsmk)) == Set(baseline)
],
)| Row | strategy | same_adjustment_set |
|---|---|---|
| String | Bool | |
| 1 | a_fixable | true |
For this treatment effect, the richer baseline DAG is adjustment-equivalent to the simpler expanded DAG: it gives the same measured parents for qsmk and therefore the same identifying functional. The extra arrows among baseline variables may be useful for documentation, but they are additional causal assumptions. If those extra assumptions are not needed for the estimand, the grouped adjustment graph is usually easier to defend.