4 Applied Graph Workflow: Smoking Cessation

using CausalGraphs
using CausalEstimate
import CausalGraphs: identify, make_graph, to_mermaid, markov_pillow
import CausalEstimate: estimate, confint, pvalue
using DataFrames
using DelimitedFiles
using Downloads
using Logging
using Random

The previous chapter introduced the graph-identification-estimation workflow with small examples. This chapter applies the same workflow to a real data set: the complete-case National Health and Nutrition Examination Survey I Epidemiologic Follow-up Study (NHEFS) data distributed through causaldata and mirrored by RDatasets.

The question is:

What is the effect of quitting smoking on weight change from 1971 to 1982?

The treatment is qsmk, an indicator for quitting smoking between baseline and follow-up. The outcome is wt82_71, weight change over the follow-up period. This is observational data, so identification depends on the adjustment assumptions encoded in the graph.

4.1 Data

For a focused example, keep the treatment, outcome, and baseline covariates commonly used in the NHEFS smoking cessation examples.

function read_rdatasets_csv(url, cols)
    local_file = Downloads.download(url; timeout = 120)
    x, header = readdlm(local_file, ',', Any, '\n'; header = true)
    raw = DataFrame(x, Symbol.(vec(header)))
    DataFrame((c => Float64.(raw[!, c]) for c in cols)...)
end

nhefs_url = "https://vincentarelbundock.github.io/Rdatasets/csv/causaldata/nhefs_complete.csv"
nhefs_cols = [
    :qsmk, :wt82_71,
    :sex, :race, :age, :school,
    :smokeintensity, :smokeyrs,
    :exercise, :active, :wt71,
]

nhefs = read_rdatasets_csv(nhefs_url, nhefs_cols);

(n = nrow(nhefs), variables = ncol(nhefs))

(n = 1566, variables = 11)

preview_table(nhefs)

5×11 DataFrame

Row	qsmk	wt82_71	sex	race	age	school	smokeintensity	smokeyrs	exercise	active	wt71
	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64	Float64
1	0.0	-10.09	0.0	1.0	42.0	7.0	30.0	29.0	2.0	0.0	79.04
2	0.0	2.605	0.0	0.0	36.0	9.0	20.0	24.0	0.0	0.0	58.63
3	0.0	9.414	1.0	1.0	56.0	11.0	20.0	26.0	2.0	0.0	56.81
4	0.0	4.99	0.0	1.0	68.0	5.0	3.0	53.0	2.0	1.0	59.42
5	0.0	4.989	0.0	0.0	40.0	11.0	20.0	19.0	1.0	1.0	87.09

4.2 Adjustment DAG

Start with the grouped graph

X -> A -> Y
X -----> Y

where X is the baseline covariate set, A is quitting smoking, and Y is later weight change. This graph says the baseline variables may affect both quitting and later weight change, and that there is no unmeasured common cause of quitting and later weight change after conditioning on those baseline variables.

conceptual_graph = make_graph(
    vertices = [:X, :A, :Y],
    di_edges = [(:X, :A), (:X, :Y), (:A, :Y)],
);

println("```{mermaid}")
println(to_mermaid(conceptual_graph; direction = "LR"))
println("```")

flowchart LR
  n1_X["X"]
  n2_A["A"]
  n3_Y["Y"]
  n1_X --> n2_A
  n1_X --> n3_Y
  n2_A --> n3_Y
  classDef fixed stroke:#111827,stroke-width:2px
  linkStyle 0 stroke:#2563eb,stroke-width:1.8px
  linkStyle 1 stroke:#2563eb,stroke-width:1.8px
  linkStyle 2 stroke:#2563eb,stroke-width:1.8px

For estimation, expand X into the observed NHEFS columns. The expanded graph is still an adjustment DAG: each baseline variable is allowed to point into quitting and into later weight change.

nhefs_vertices = [
    :sex, :race, :age, :school,
    :smokeintensity, :smokeyrs,
    :exercise, :active, :wt71,
    :qsmk, :wt82_71,
]

baseline = setdiff(nhefs_vertices, [:qsmk, :wt82_71])

nhefs_edges = vcat(
    [(w, :qsmk) for w in baseline],
    [(w, :wt82_71) for w in baseline],
    [(:qsmk, :wt82_71)],
)

nhefs_graph = make_graph(vertices = nhefs_vertices, di_edges = nhefs_edges);

(vertices = length(nhefs_vertices), directed_edges = length(nhefs_edges))

(vertices = 11, directed_edges = 19)

4.3 Identification

CausalGraphs.identify checks the graph before estimation. In this adjustment DAG, the quitting node is a-fixable. The Markov pillow is the sufficient adjustment set:

nhefs_id = identify(nhefs_graph, :qsmk, :wt82_71)

DataFrame(
    strategy = [string(nhefs_id.strategy)],
    markov_pillow = [join(string.(markov_pillow(nhefs_graph, :qsmk; treatment = :qsmk)), ", ")],
)

1×2 DataFrame

Row	strategy	markov_pillow
	String	String
1	a_fixable	sex, race, age, school, smokeintensity, smokeyrs, exercise, active, wt71

The graph has therefore identified the causal mean by the adjustment formula

\[ E[Y(a)] = E_X\{E(Y \mid A=a, X)\}. \]

This formula is not a modeling claim by itself. It is the observed-data functional implied by the graph. Estimation still requires nuisance models for the outcome regression and treatment mechanism.

4.4 Estimation with CausalEstimate

The book keeps identification and estimation as separate steps. Here CausalGraphs supplies the graph result and CausalEstimate estimates the identified functional. Because the graph is a-fixable, GraphID routes to the AIPW average treatment effect machinery.

Random.seed!(2026)

nhefs_result = estimate(
    ATE(outcome = :wt82_71, treatment = :qsmk),
    GraphID(graph = nhefs_graph),
    AIPW(crossfit = 2),
    nhefs,
)

effect_table("Quit smoking vs continued smoking", nhefs_result)

1×5 DataFrame

Row	estimand	estimate	lower_95	upper_95	se
	String	Float64	Float64	Float64	Float64
1	Quit smoking vs continued smoking	3.866	1.257	6.475	1.331

Interpreted under the DAG, the estimate is the average effect of quitting smoking versus continuing smoking on weight change in the complete-case NHEFS population. The positive estimate is consistent with weight gain after quitting. The causal interpretation depends on the graph, especially the no unmeasured confounding assumption after adjustment for the baseline variables.

4.5 Sensitivity Graph

Now encode a different assumption: quitting and later weight change share an unobserved common cause even after conditioning on the measured baseline variables. In an ADMG this is a bidirected edge, qsmk <-> wt82_71.

sensitivity_graph = make_graph(
    vertices = nhefs_vertices,
    di_edges = nhefs_edges,
    bi_edges = [(:qsmk, :wt82_71)],
)

sensitivity_id = identify(sensitivity_graph, :qsmk, :wt82_71);

println("```{mermaid}")
println(to_mermaid(sensitivity_graph; direction = "LR"))
println("```")

flowchart LR
  n1_sex["sex"]
  n2_race["race"]
  n3_age["age"]
  n4_school["school"]
  n5_smokeintensity["smokeintensity"]
  n6_smokeyrs["smokeyrs"]
  n7_exercise["exercise"]
  n8_active["active"]
  n9_wt71["wt71"]
  n10_qsmk["qsmk"]
  n11_wt82_71["wt82_71"]
  n1_sex --> n10_qsmk
  n2_race --> n10_qsmk
  n3_age --> n10_qsmk
  n4_school --> n10_qsmk
  n5_smokeintensity --> n10_qsmk
  n6_smokeyrs --> n10_qsmk
  n7_exercise --> n10_qsmk
  n8_active --> n10_qsmk
  n9_wt71 --> n10_qsmk
  n1_sex --> n11_wt82_71
  n2_race --> n11_wt82_71
  n3_age --> n11_wt82_71
  n4_school --> n11_wt82_71
  n5_smokeintensity --> n11_wt82_71
  n6_smokeyrs --> n11_wt82_71
  n7_exercise --> n11_wt82_71
  n8_active --> n11_wt82_71
  n9_wt71 --> n11_wt82_71
  n10_qsmk --> n11_wt82_71
  n10_qsmk <--> n11_wt82_71
  classDef fixed stroke:#111827,stroke-width:2px
  linkStyle 0 stroke:#2563eb,stroke-width:1.8px
  linkStyle 1 stroke:#2563eb,stroke-width:1.8px
  linkStyle 2 stroke:#2563eb,stroke-width:1.8px
  linkStyle 3 stroke:#2563eb,stroke-width:1.8px
  linkStyle 4 stroke:#2563eb,stroke-width:1.8px
  linkStyle 5 stroke:#2563eb,stroke-width:1.8px
  linkStyle 6 stroke:#2563eb,stroke-width:1.8px
  linkStyle 7 stroke:#2563eb,stroke-width:1.8px
  linkStyle 8 stroke:#2563eb,stroke-width:1.8px
  linkStyle 9 stroke:#2563eb,stroke-width:1.8px
  linkStyle 10 stroke:#2563eb,stroke-width:1.8px
  linkStyle 11 stroke:#2563eb,stroke-width:1.8px
  linkStyle 12 stroke:#2563eb,stroke-width:1.8px
  linkStyle 13 stroke:#2563eb,stroke-width:1.8px
  linkStyle 14 stroke:#2563eb,stroke-width:1.8px
  linkStyle 15 stroke:#2563eb,stroke-width:1.8px
  linkStyle 16 stroke:#2563eb,stroke-width:1.8px
  linkStyle 17 stroke:#2563eb,stroke-width:1.8px
  linkStyle 18 stroke:#2563eb,stroke-width:1.8px
  linkStyle 19 stroke:#dc2626,stroke-width:1.8px

DataFrame(strategy = [string(sensitivity_id.strategy)])

1×1 DataFrame

Row	strategy
	String
1	not_identified

The identifying conclusion changes. Under this graph, the observed NHEFS variables are not enough to identify the effect. This is the practical value of drawing the graph before estimating: the same data can support different causal answers under different assumptions.

4.6 Structured Baseline Graph

The simple adjustment graph is often the clearest starting point. A richer DAG can record more assumptions about baseline temporal ordering and relationships: demographics may precede education, smoking history, activity, and baseline weight; those variables may then affect quitting and later weight change.

structured_edges = [
    (:age, :school), (:age, :smokeyrs), (:age, :wt71),
    (:age, :active), (:age, :exercise), (:age, :qsmk), (:age, :wt82_71),

    (:sex, :smokeintensity), (:sex, :smokeyrs), (:sex, :wt71),
    (:sex, :qsmk), (:sex, :wt82_71),

    (:race, :school), (:race, :smokeintensity),
    (:race, :qsmk), (:race, :wt82_71),

    (:school, :smokeintensity), (:school, :qsmk), (:school, :wt82_71),

    (:smokeyrs, :smokeintensity), (:smokeyrs, :qsmk), (:smokeyrs, :wt82_71),
    (:smokeintensity, :qsmk), (:smokeintensity, :wt82_71),

    (:exercise, :wt71), (:exercise, :qsmk), (:exercise, :wt82_71),
    (:active, :wt71), (:active, :qsmk), (:active, :wt82_71),
    (:wt71, :qsmk), (:wt71, :wt82_71),

    (:qsmk, :wt82_71),
]

structured_graph = make_graph(vertices = nhefs_vertices, di_edges = structured_edges);
structured_id = identify(structured_graph, :qsmk, :wt82_71)

DataFrame(
    strategy = [string(structured_id.strategy)],
    same_adjustment_set = [
        Set(markov_pillow(structured_graph, :qsmk; treatment = :qsmk)) == Set(baseline)
    ],
)

1×2 DataFrame

Row	strategy	same_adjustment_set
	String	Bool
1	a_fixable	true

For this treatment effect, the richer baseline DAG is adjustment-equivalent to the simpler expanded DAG: it gives the same measured parents for qsmk and therefore the same identifying functional. The extra arrows among baseline variables may be useful for documentation, but they are additional causal assumptions. If those extra assumptions are not needed for the estimand, the grouped adjustment graph is usually easier to defend.