using CausalInference
using RecursiveCausalDiscovery
using CausalGraphs
using Graphs
using CairoMakie
using Statistics
using LinearAlgebra
using Random
using Printf
using DataFrames
using Distributions: Normal, quantile22 Causal Discovery: Observed Variables
In earlier chapters we assumed the causal graph was known, or at least that a defensible DAG could be drawn from domain knowledge. But what if we want to learn the causal structure from observational data? This is the causal discovery problem.
22.1 The Problem
Given \(n\) i.i.d. observations of \(p\) variables \((X_1, \ldots, X_p)\), we want to recover the directed acyclic graph (DAG) \(G\) that generated the data.
There is a fundamental limit: observational data alone cannot distinguish between DAGs that encode the same conditional independence (CI) structure. The set of DAGs that are observationally equivalent is called a Markov equivalence class (MEC). The best we can do from data is to identify the MEC, represented as a Completed Partially Directed Acyclic Graph (CPDAG).
A CPDAG has the same skeleton (undirected edges) as all DAGs in its MEC. Some edges can be oriented (they have the same direction in every member of the MEC); others remain undirected because both directions are consistent with the CI structure.
22.1.1 Why some edges cannot be oriented
Three DAGs are observationally equivalent — they produce the same set of conditional independences — and therefore indistinguishable from data:
\[X \to Y \to Z \qquad X \leftarrow Y \leftarrow Z \qquad X \leftarrow Y \to Z\]
All three imply \(X \perp\!\!\!\perp Z \mid Y\) and no other CI. Their CPDAG shows \(X - Y - Z\) with no arrowheads.
The structure \(X \to Y \leftarrow Z\) (a collider at \(Y\)) is not equivalent: here \(X \perp\!\!\!\perp Z\) but \(X \not\!\perp\!\!\!\perp Z \mid Y\). Because it has a unique CI signature, the \(\to Y \leftarrow\) orientation is identifiable from data. Algorithms orient these colliders; non-collider edges often remain undirected.
22.1.2 What a CPDAG tells you
A directed edge \(X \to Y\) in the CPDAG means: in every DAG consistent with the data, \(X\) causes \(Y\). An undirected edge \(X - Y\) means some consistent DAGs have \(X \to Y\) and others have \(X \leftarrow Y\) — data alone cannot decide.
For causal estimation: once you have a CPDAG, you can use background knowledge (temporal ordering, experimental context) to orient the remaining undirected edges, and then apply identification strategies from earlier chapters.
22.2 Simulation
We generate a random Gaussian linear DAG using an Erdős–Rényi random graph with 8 nodes and edge probability 0.35. Each edge \(j \to i\) generates a structural equation
\[X_i = \sum_{j \in \text{pa}(i)} \beta_{ji} X_j + \varepsilon_i, \quad \varepsilon_i \sim N(0,1)\]
where coefficients \(\beta_{ji}\) are drawn uniformly from \([-1, -0.25] \cup [0.25, 1]\).
Random.seed!(2025)
n_vars = 8
n_samples = 2000
edge_prob = 0.35
labels = ["X$i" for i in 1:n_vars]
adj_mat = gen_er_dag_adj_mat(n_vars, edge_prob)
data_mat = gen_gaussian_data(adj_mat, n_samples)
true_dag = SimpleDiGraph(adj_mat)
true_skel = SimpleGraph(true_dag)
@printf "Variables: %d Samples: %d True edges: %d\n" n_vars n_samples ne(true_skel)Variables: 8 Samples: 2000 True edges: 9
draw_graph(adjmat_to_admg(adj_mat, labels); direction="TB")22.3 CI Test Infrastructure
Both algorithms use a Fisher-Z conditional independence test appropriate for Gaussian data. Given the partial correlation \(\hat\rho_{XY \mid Z}\), the test statistic is
\[T = \sqrt{n - |Z| - 3} \cdot \operatorname{atanh}(\hat\rho_{XY \mid Z}) \;\sim\; N(0,1) \quad \text{under } H_0: X \perp\!\!\!\perp Y \mid Z\]
Choosing the significance level \(\alpha\) involves a bias-variance tradeoff on the skeleton:
- Low \(\alpha\) (e.g., 0.001): conservative, keeps more edges. Fewer false removals (false negatives) but may retain spurious edges (false positives). Produces a denser skeleton.
- High \(\alpha\) (e.g., 0.05): liberal, removes more edges. Fewer spurious edges but may discard real ones. Produces a sparser skeleton.
With \(n = 2000\) and \(p = 8\) we use \(\alpha = 0.01\), a common default. With smaller \(n\) or larger \(p\), consider lowering \(\alpha\) to avoid over-pruning.
We wrap fisher_z in a counter to track exactly how many CI tests each algorithm calls:
function make_counted_ci(data, sig)
count = Ref(0)
function ci(x, y, S, d)
count[] += 1
fisher_z(x, y, collect(Int, S), Matrix{Float64}(d), sig)
end
ci, count
endmake_counted_ci (generic function with 1 method)
22.4 PC Algorithm
The PC algorithm (Spirtes & Glymour, 1991) starts from a complete undirected graph and removes edges whenever a conditional independence is found. It searches for separating sets \(Z\) of increasing size, conditioning on subsets of the current neighbors. After skeleton discovery, Meek’s rules orient as many edges as possible.
How it works, step by step:
- Start with a complete undirected graph (every pair connected).
- For each pair \((X, Y)\), test \(X \perp\!\!\!\perp Y \mid Z\) for all \(Z \subseteq \text{neighbors}(X) \setminus \{Y\}\), starting with \(|Z|=0\). Remove the edge if any such test passes.
- Repeat with conditioning sets of increasing size until no new edges are removed.
- Identify colliders: if \(X - Z - Y\) and \(X, Y\) non-adjacent, orient as \(X \to Z \leftarrow Y\) only if \(Z\) was not in the separating set of \(X\) and \(Y\).
- Apply Meek’s orientation rules to propagate orientations without creating new colliders or cycles.
sig_level = 0.01
ci_pc, ctr_pc = make_counted_ci(data_mat, sig_level)
cpdag_pc = pcalg(n_vars, ci_pc, data_mat)
skel_pc = SimpleGraph(cpdag_pc)
f1_skel_pc = f1_score(true_skel, skel_pc)
f1_cpdag_pc = f1_score(true_dag, cpdag_pc)
@printf "PC: skeleton F1 = %.3f CPDAG F1 = %.3f CI tests = %d\n" f1_skel_pc f1_cpdag_pc ctr_pc[]PC: skeleton F1 = 0.750 CPDAG F1 = 0.500 CI tests = 205
Reading the output: F1 scores
The F1 score is the harmonic mean of precision and recall, ranging from 0 (worst) to 1 (perfect).
- Skeleton F1 measures whether the algorithm found the right edges regardless of direction. An edge is a true positive if it exists in both the estimated skeleton and the true one.
- CPDAG F1 is stricter: a directed edge \(X \to Y\) is only a true positive if the true DAG also has \(X \to Y\); an undirected edge \(X - Y\) counts as correct if the true DAG has either direction.
A high skeleton F1 with lower CPDAG F1 means the algorithm found the right adjacencies but left some edges unoriented — the typical case when the MEC contains many equivalent DAGs.
Computational complexity: In the worst case, PC tests all subsets of neighbors, giving exponential growth with the maximum degree. In sparse graphs this is manageable; in dense graphs it becomes prohibitive.
22.5 RSL-D Algorithm
RSL-D (Recursive Structure Learning — Diamond-free; Mokhtarian et al., JMLR 2025) works by recursively removing variables that satisfy a removability criterion.
Two sets are central to the criterion:
- Markov boundary \(\text{Mb}(X)\): the minimal set \(S\) such that \(X \perp\!\!\!\perp (\text{all other variables}) \mid S\). Conditioning on \(\text{Mb}(X)\) screens \(X\) off from everything else in the graph. In a Gaussian DAG with no hidden variables, \(\text{Mb}(X)\) consists of \(X\)’s parents, children, and co-parents (other parents of \(X\)’s children) — equivalently, all variables directly adjacent to \(X\) in the skeleton.
- Neighbourhood \(\text{Ne}(X)\): the set of variables directly adjacent to \(X\) in the current skeleton (connected by an undirected edge at this stage of the algorithm).
Under causal sufficiency and faithfulness, \(\text{Mb}(X) = \text{Ne}(X)\); the distinction arises during the algorithm’s recursive steps as variables are removed one by one.
Variable \(X\) is removable if, for every pair \(Y \in \text{Mb}(X)\), \(Z \in \text{Ne}(X)\):
\[\exists\, W \subseteq \text{Mb}(X) \setminus \{Y, Z\} \text{ such that } Y \perp\!\!\!\perp Z \mid W\]
(Lemma 3 of the paper). Removing variables one by one and recording the local neighbourhood structure recovers the full CPDAG.
The intuition behind removability: a variable \(X\) is “removable” if its local neighbourhood structure can be fully characterized without creating ambiguity in the remaining graph. Think of it as peeling off leaf-like variables one at a time — each removal simplifies the problem rather than complicating it.
How it works, step by step:
- Estimate the Markov boundary \(\text{Mb}(X)\) of each variable (all variables that, conditioned on, make \(X\) independent of everything else).
- Find a removable variable \(X\) — one whose neighbourhood satisfies the criterion above.
- Record the local skeleton structure around \(X\), then remove \(X\) from the graph.
- Repeat on the reduced variable set until all variables have been processed.
- Reconstruct the full CPDAG from the recorded local structures.
The key efficiency advantage: RSL-D uses at most \(O(p \cdot d \cdot |\text{MB}|^2)\) CI tests (where \(d\) is the max neighbourhood size). PC can require exponentially many in the worst case as the separator-set search grows.
ci_rsl, ctr_rsl = make_counted_ci(data_mat, sig_level)
cpdag_rsl = rsld(data_mat, ci_rsl)
skel_rsl = SimpleGraph(cpdag_rsl)
f1_skel_rsl = f1_score(true_skel, skel_rsl)
f1_cpdag_rsl = f1_score(true_dag, cpdag_rsl)
@printf "RSL-D: skeleton F1 = %.3f CPDAG F1 = %.3f CI tests = %d\n" f1_skel_rsl f1_cpdag_rsl ctr_rsl[]RSL-D: skeleton F1 = 1.000 CPDAG F1 = 0.947 CI tests = 43
22.6 Comparison
Reading the CPDAG figure
In the side-by-side graph display:
- A directed edge \(X_i \to X_j\) means this orientation is invariant across all equivalent DAGs — the algorithm is confident about the direction.
- A bidirected (red) edge \(X_i \leftrightarrow X_j\) in the CPDAG display represents an unoriented edge: both \(X_i \to X_j\) and \(X_i \leftarrow X_j\) are statistically equivalent. This is not a hidden common cause — it means the data cannot determine direction. (Hidden common causes are a latent-variable concept, covered in the next chapter.)
- A missing edge means the algorithm found a conditional independence separating those two variables.
Compare the estimated CPDAG to the true DAG: extra edges are false positives (the algorithm wrongly retained a spurious association); missing edges are false negatives (a true relationship was incorrectly removed by a CI test).
DataFrame(
Algorithm = ["PC", "RSL-D"],
Skeleton_F1 = round.([f1_skel_pc, f1_skel_rsl], digits=3),
CPDAG_F1 = round.([f1_cpdag_pc, f1_cpdag_rsl], digits=3),
CI_Tests = [ctr_pc[], ctr_rsl[]],
)| Row | Algorithm | Skeleton_F1 | CPDAG_F1 | CI_Tests |
|---|---|---|---|---|
| String | Float64 | Float64 | Int64 | |
| 1 | PC | 0.75 | 0.5 | 205 |
| 2 | RSL-D | 1.0 | 0.947 | 43 |
side_by_side([
(adjmat_to_admg(adj_mat, labels), "True DAG"),
(cpdag_to_admg(cpdag_pc, labels), "PC — estimated CPDAG"),
(cpdag_to_admg(cpdag_rsl, labels), "RSL-D — estimated CPDAG"),
]; note="Red ↔ edges in the CPDAGs are unoriented (not confounded) — both directions are statistically equivalent.")True DAG
PC — estimated CPDAG
RSL-D — estimated CPDAG
Red ↔ edges in the CPDAGs are unoriented (not confounded) — both directions are statistically equivalent.
Interpreting the comparison table
- Skeleton F1 is the primary metric for comparing algorithms: it measures edge recovery independent of orientation, since orientation accuracy is bounded by the MEC.
- CPDAG F1 rewards correctly oriented edges and penalizes leaving orientable edges undirected.
- CI tests is the computational cost metric. Fewer tests means faster runtime, especially important when \(n\) is large and each test involves inverting covariance matrices.
On this dataset RSL-D typically matches or exceeds PC’s accuracy while using substantially fewer CI tests — the advantage grows with \(p\) and graph density.
22.7 Monte Carlo Evaluation
A single dataset can be lucky. We average over 100 replications to get stable estimates.
What to look for in the MC results:
- The mean skeleton F1 is the headline: how often does each algorithm recover the true adjacencies?
- The distribution of CI test counts (shown in Figure 22.3) shows whether the efficiency advantage is consistent or only appears on average.
- High variance in F1 across replications suggests the algorithms are sensitive to the particular random graph — denser or sparser graphs drawn from the same Erdős–Rényi model can be very different problems.
function evaluate_once(seed; n_vars=8, n_samples=2000, edge_prob=0.35, sig=0.01)
Random.seed!(seed)
adj = gen_er_dag_adj_mat(n_vars, edge_prob)
data = gen_gaussian_data(adj, n_samples)
skel = SimpleGraph(SimpleDiGraph(adj))
ci_pc, ctr_pc = make_counted_ci(data, sig)
ci_rsl, ctr_rsl = make_counted_ci(data, sig)
cp_pc = pcalg(n_vars, ci_pc, data)
cp_rsl = rsld(data, ci_rsl)
(
f1_pc = f1_score(skel, SimpleGraph(cp_pc)),
f1_rsl = f1_score(skel, SimpleGraph(cp_rsl)),
ci_pc = ctr_pc[],
ci_rsl = ctr_rsl[],
)
end
mc = [evaluate_once(s) for s in 1:100]
@printf "\nMonte Carlo averages (100 replications, n=%d, p=%d)\n" n_samples n_vars
@printf "%-8s Skeleton F1 CI tests\n" "Method"
@printf "%-8s %10.3f %8.1f\n" "PC" mean(r.f1_pc for r in mc) mean(r.ci_pc for r in mc)
@printf "%-8s %10.3f %8.1f\n" "RSL-D" mean(r.f1_rsl for r in mc) mean(r.ci_rsl for r in mc)
Monte Carlo averages (100 replications, n=2000, p=8)
Method Skeleton F1 CI tests
PC 0.894 237.2
RSL-D 0.946 57.2
let
fig = Figure(size=(650, 380))
ax = Axis(fig[1,1];
xlabel = "CI tests",
ylabel = "Frequency",
title = "CI Test Counts: PC vs RSL-D (100 replications)")
hist!(ax, [r.ci_pc for r in mc]; bins=20, color=(:steelblue, 0.65), label="PC")
hist!(ax, [r.ci_rsl for r in mc]; bins=20, color=(:tomato, 0.65), label="RSL-D")
axislegend(ax)
fig
end┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions. └ @ Makie ~/.julia/packages/Makie/UjJJY/src/scenes.jl:238
22.8 Summary
Both PC and RSL-D recover the Markov equivalence class from Gaussian observational data. The key practical difference is CI-test efficiency: RSL-D’s recursive removal approach uses far fewer tests than PC’s growing-set search, especially as \(p\) and graph density increase.
From CPDAG to causal estimation: Once a CPDAG is in hand, the workflow continues as follows:
- Use background knowledge to orient undirected edges. Temporal ordering is the most common source: if \(X\) is measured before \(Y\), the edge must be \(X \to Y\). Domain expertise, experimental context, or exclusion restrictions can orient others.
- Check identifiability. With a fully oriented DAG, apply the backdoor, front-door, or ID algorithm (Chapter 3) to determine whether the effect of interest is identifiable.
- Estimate. Use the identified functional with the estimators from Chapter 4 (RA, IPW, AIPW).
If some edges remain undirected after exhausting background knowledge, you can either report the CPDAG and bound the estimand over the equivalence class, or collect additional interventional data to break the remaining equivalences.
What cannot be recovered from observational data alone
Both algorithms identify the CPDAG — the equivalence class — not the true DAG. Certain edge orientations are fundamentally unidentifiable without additional assumptions such as non-Gaussian errors, temporal ordering, or interventional data.
The next chapter addresses a harder problem: recovering the causal skeleton when some variables are unobserved.