4  Applied Graph Workflow: Smoking Cessation

using CausalGraphs
using CausalEstimate
import CausalGraphs: identify, make_graph, to_mermaid, markov_pillow
import CausalEstimate: estimate, confint, pvalue
using DataFrames
using DelimitedFiles
using Downloads
using Logging
using Random

The previous chapter introduced the graph-identification-estimation workflow with small examples. This chapter applies the same workflow to a real data set: the complete-case National Health and Nutrition Examination Survey I Epidemiologic Follow-up Study (NHEFS) data distributed through causaldata and mirrored by RDatasets.

The question is:

What is the effect of quitting smoking on weight change from 1971 to 1982?

The treatment is qsmk, an indicator for quitting smoking between baseline and follow-up. The outcome is wt82_71, weight change over the follow-up period. This is observational data, so identification depends on the adjustment assumptions encoded in the graph.

4.1 Data

For a focused example, keep the treatment, outcome, and baseline covariates commonly used in the NHEFS smoking cessation examples.

function read_rdatasets_csv(url, cols)
    local_file = Downloads.download(url; timeout = 120)
    x, header = readdlm(local_file, ',', Any, '\n'; header = true)
    raw = DataFrame(x, Symbol.(vec(header)))
    DataFrame((c => Float64.(raw[!, c]) for c in cols)...)
end

nhefs_url = "https://vincentarelbundock.github.io/Rdatasets/csv/causaldata/nhefs_complete.csv"
nhefs_cols = [
    :qsmk, :wt82_71,
    :sex, :race, :age, :school,
    :smokeintensity, :smokeyrs,
    :exercise, :active, :wt71,
]

nhefs = read_rdatasets_csv(nhefs_url, nhefs_cols);

(n = nrow(nhefs), variables = ncol(nhefs))
(n = 1566, variables = 11)
preview_table(nhefs)
5×11 DataFrame
Row qsmk wt82_71 sex race age school smokeintensity smokeyrs exercise active wt71
Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
1 0.0 -10.09 0.0 1.0 42.0 7.0 30.0 29.0 2.0 0.0 79.04
2 0.0 2.605 0.0 0.0 36.0 9.0 20.0 24.0 0.0 0.0 58.63
3 0.0 9.414 1.0 1.0 56.0 11.0 20.0 26.0 2.0 0.0 56.81
4 0.0 4.99 0.0 1.0 68.0 5.0 3.0 53.0 2.0 1.0 59.42
5 0.0 4.989 0.0 0.0 40.0 11.0 20.0 19.0 1.0 1.0 87.09

4.2 Adjustment DAG

Start with the grouped graph

X -> A -> Y
X -----> Y

where X is the baseline covariate set, A is quitting smoking, and Y is later weight change. This graph says the baseline variables may affect both quitting and later weight change, and that there is no unmeasured common cause of quitting and later weight change after conditioning on those baseline variables.

conceptual_graph = make_graph(
    vertices = [:X, :A, :Y],
    di_edges = [(:X, :A), (:X, :Y), (:A, :Y)],
);
println("```{mermaid}")
println(to_mermaid(conceptual_graph; direction = "LR"))
println("```")

flowchart LR
  n1_X["X"]
  n2_A["A"]
  n3_Y["Y"]
  n1_X --> n2_A
  n1_X --> n3_Y
  n2_A --> n3_Y
  classDef fixed stroke:#111827,stroke-width:2px
  linkStyle 0 stroke:#2563eb,stroke-width:1.8px
  linkStyle 1 stroke:#2563eb,stroke-width:1.8px
  linkStyle 2 stroke:#2563eb,stroke-width:1.8px

For estimation, expand X into the observed NHEFS columns. The expanded graph is still an adjustment DAG: each baseline variable is allowed to point into quitting and into later weight change.

nhefs_vertices = [
    :sex, :race, :age, :school,
    :smokeintensity, :smokeyrs,
    :exercise, :active, :wt71,
    :qsmk, :wt82_71,
]

baseline = setdiff(nhefs_vertices, [:qsmk, :wt82_71])

nhefs_edges = vcat(
    [(w, :qsmk) for w in baseline],
    [(w, :wt82_71) for w in baseline],
    [(:qsmk, :wt82_71)],
)

nhefs_graph = make_graph(vertices = nhefs_vertices, di_edges = nhefs_edges);

(vertices = length(nhefs_vertices), directed_edges = length(nhefs_edges))
(vertices = 11, directed_edges = 19)

4.3 Identification

CausalGraphs.identify checks the graph before estimation. In this adjustment DAG, the quitting node is a-fixable. The Markov pillow is the sufficient adjustment set:

nhefs_id = identify(nhefs_graph, :qsmk, :wt82_71)

DataFrame(
    strategy = [string(nhefs_id.strategy)],
    markov_pillow = [join(string.(markov_pillow(nhefs_graph, :qsmk; treatment = :qsmk)), ", ")],
)
1×2 DataFrame
Row strategy markov_pillow
String String
1 a_fixable sex, race, age, school, smokeintensity, smokeyrs, exercise, active, wt71

The graph has therefore identified the causal mean by the adjustment formula

\[ E[Y(a)] = E_X\{E(Y \mid A=a, X)\}. \]

This formula is not a modeling claim by itself. It is the observed-data functional implied by the graph. Estimation still requires nuisance models for the outcome regression and treatment mechanism.

4.4 Estimation with CausalEstimate

The book keeps identification and estimation as separate steps. Here CausalGraphs supplies the graph result and CausalEstimate estimates the identified functional. Because the graph is a-fixable, GraphID routes to the AIPW average treatment effect machinery.

Random.seed!(2026)

nhefs_result = estimate(
    ATE(outcome = :wt82_71, treatment = :qsmk),
    GraphID(graph = nhefs_graph),
    AIPW(crossfit = 2),
    nhefs,
)

effect_table("Quit smoking vs continued smoking", nhefs_result)
1×5 DataFrame
Row estimand estimate lower_95 upper_95 se
String Float64 Float64 Float64 Float64
1 Quit smoking vs continued smoking 3.866 1.257 6.475 1.331

Interpreted under the DAG, the estimate is the average effect of quitting smoking versus continuing smoking on weight change in the complete-case NHEFS population. The positive estimate is consistent with weight gain after quitting. The causal interpretation depends on the graph, especially the no unmeasured confounding assumption after adjustment for the baseline variables.

4.5 Sensitivity Graph

Now encode a different assumption: quitting and later weight change share an unobserved common cause even after conditioning on the measured baseline variables. In an ADMG this is a bidirected edge, qsmk <-> wt82_71.

sensitivity_graph = make_graph(
    vertices = nhefs_vertices,
    di_edges = nhefs_edges,
    bi_edges = [(:qsmk, :wt82_71)],
)

sensitivity_id = identify(sensitivity_graph, :qsmk, :wt82_71);
println("```{mermaid}")
println(to_mermaid(sensitivity_graph; direction = "LR"))
println("```")

flowchart LR
  n1_sex["sex"]
  n2_race["race"]
  n3_age["age"]
  n4_school["school"]
  n5_smokeintensity["smokeintensity"]
  n6_smokeyrs["smokeyrs"]
  n7_exercise["exercise"]
  n8_active["active"]
  n9_wt71["wt71"]
  n10_qsmk["qsmk"]
  n11_wt82_71["wt82_71"]
  n1_sex --> n10_qsmk
  n2_race --> n10_qsmk
  n3_age --> n10_qsmk
  n4_school --> n10_qsmk
  n5_smokeintensity --> n10_qsmk
  n6_smokeyrs --> n10_qsmk
  n7_exercise --> n10_qsmk
  n8_active --> n10_qsmk
  n9_wt71 --> n10_qsmk
  n1_sex --> n11_wt82_71
  n2_race --> n11_wt82_71
  n3_age --> n11_wt82_71
  n4_school --> n11_wt82_71
  n5_smokeintensity --> n11_wt82_71
  n6_smokeyrs --> n11_wt82_71
  n7_exercise --> n11_wt82_71
  n8_active --> n11_wt82_71
  n9_wt71 --> n11_wt82_71
  n10_qsmk --> n11_wt82_71
  n10_qsmk <--> n11_wt82_71
  classDef fixed stroke:#111827,stroke-width:2px
  linkStyle 0 stroke:#2563eb,stroke-width:1.8px
  linkStyle 1 stroke:#2563eb,stroke-width:1.8px
  linkStyle 2 stroke:#2563eb,stroke-width:1.8px
  linkStyle 3 stroke:#2563eb,stroke-width:1.8px
  linkStyle 4 stroke:#2563eb,stroke-width:1.8px
  linkStyle 5 stroke:#2563eb,stroke-width:1.8px
  linkStyle 6 stroke:#2563eb,stroke-width:1.8px
  linkStyle 7 stroke:#2563eb,stroke-width:1.8px
  linkStyle 8 stroke:#2563eb,stroke-width:1.8px
  linkStyle 9 stroke:#2563eb,stroke-width:1.8px
  linkStyle 10 stroke:#2563eb,stroke-width:1.8px
  linkStyle 11 stroke:#2563eb,stroke-width:1.8px
  linkStyle 12 stroke:#2563eb,stroke-width:1.8px
  linkStyle 13 stroke:#2563eb,stroke-width:1.8px
  linkStyle 14 stroke:#2563eb,stroke-width:1.8px
  linkStyle 15 stroke:#2563eb,stroke-width:1.8px
  linkStyle 16 stroke:#2563eb,stroke-width:1.8px
  linkStyle 17 stroke:#2563eb,stroke-width:1.8px
  linkStyle 18 stroke:#2563eb,stroke-width:1.8px
  linkStyle 19 stroke:#dc2626,stroke-width:1.8px

DataFrame(strategy = [string(sensitivity_id.strategy)])
1×1 DataFrame
Row strategy
String
1 not_identified

The identifying conclusion changes. Under this graph, the observed NHEFS variables are not enough to identify the effect. This is the practical value of drawing the graph before estimating: the same data can support different causal answers under different assumptions.

4.6 Structured Baseline Graph

The simple adjustment graph is often the clearest starting point. A richer DAG can record more assumptions about baseline temporal ordering and relationships: demographics may precede education, smoking history, activity, and baseline weight; those variables may then affect quitting and later weight change.

structured_edges = [
    (:age, :school), (:age, :smokeyrs), (:age, :wt71),
    (:age, :active), (:age, :exercise), (:age, :qsmk), (:age, :wt82_71),

    (:sex, :smokeintensity), (:sex, :smokeyrs), (:sex, :wt71),
    (:sex, :qsmk), (:sex, :wt82_71),

    (:race, :school), (:race, :smokeintensity),
    (:race, :qsmk), (:race, :wt82_71),

    (:school, :smokeintensity), (:school, :qsmk), (:school, :wt82_71),

    (:smokeyrs, :smokeintensity), (:smokeyrs, :qsmk), (:smokeyrs, :wt82_71),
    (:smokeintensity, :qsmk), (:smokeintensity, :wt82_71),

    (:exercise, :wt71), (:exercise, :qsmk), (:exercise, :wt82_71),
    (:active, :wt71), (:active, :qsmk), (:active, :wt82_71),
    (:wt71, :qsmk), (:wt71, :wt82_71),

    (:qsmk, :wt82_71),
]

structured_graph = make_graph(vertices = nhefs_vertices, di_edges = structured_edges);
structured_id = identify(structured_graph, :qsmk, :wt82_71)

DataFrame(
    strategy = [string(structured_id.strategy)],
    same_adjustment_set = [
        Set(markov_pillow(structured_graph, :qsmk; treatment = :qsmk)) == Set(baseline)
    ],
)
1×2 DataFrame
Row strategy same_adjustment_set
String Bool
1 a_fixable true

For this treatment effect, the richer baseline DAG is adjustment-equivalent to the simpler expanded DAG: it gives the same measured parents for qsmk and therefore the same identifying functional. The extra arrows among baseline variables may be useful for documentation, but they are additional causal assumptions. If those extra assumptions are not needed for the estimand, the grouped adjustment graph is usually easier to defend.