19  Causal Discovery: Observed Variables

In earlier chapters we assumed the causal graph was known, or at least that a defensible DAG could be drawn from domain knowledge. But what if we want to learn the causal structure from observational data? This is the causal discovery problem.

19.1 The Problem

Given \(n\) i.i.d. observations of \(p\) variables \((X_1, \ldots, X_p)\), we want to recover the directed acyclic graph (DAG) \(G\) that generated the data.

There is a fundamental limit: observational data alone cannot distinguish between DAGs that encode the same conditional independence (CI) structure. The set of DAGs that are observationally equivalent is called a Markov equivalence class (MEC). The best we can do from data is to identify the MEC, represented as a Completed Partially Directed Acyclic Graph (CPDAG).

A CPDAG has the same skeleton (undirected edges) as all DAGs in its MEC. Some edges can be oriented (they have the same direction in every member of the MEC); others remain undirected because both directions are consistent with the CI structure.

19.1.1 Why some edges cannot be oriented

Three DAGs are observationally equivalent — they produce the same set of conditional independences — and therefore indistinguishable from data:

\[X \to Y \to Z \qquad X \leftarrow Y \leftarrow Z \qquad X \leftarrow Y \to Z\]

All three imply \(X \perp\!\!\!\perp Z \mid Y\) and no other CI. Their CPDAG shows \(X - Y - Z\) with no arrowheads.

The structure \(X \to Y \leftarrow Z\) (a collider at \(Y\)) is not equivalent: here \(X \perp\!\!\!\perp Z\) but \(X \not\!\perp\!\!\!\perp Z \mid Y\). Because it has a unique CI signature, the \(\to Y \leftarrow\) orientation is identifiable from data. Algorithms orient these colliders; non-collider edges often remain undirected.

19.1.2 What a CPDAG tells you

A directed edge \(X \to Y\) in the CPDAG means: in every DAG consistent with the data, \(X\) causes \(Y\). An undirected edge \(X - Y\) means some consistent DAGs have \(X \to Y\) and others have \(X \leftarrow Y\) — data alone cannot decide.

For causal estimation: once you have a CPDAG, you can use background knowledge (temporal ordering, experimental context) to orient the remaining undirected edges, and then apply identification strategies from earlier chapters.

19.2 Two Discovery Paradigms in R

R’s canonical causal-discovery toolkit is pcalg. It implements two complementary algorithm families:

Paradigm Algorithm Decision rule
Constraint-based PC (Spirtes & Glymour, 1991) Remove an edge when a CI test reports independence
Score-based GES (Chickering, 2002) Add/remove/reverse the edge that most improves a model score (BIC)

Both target the same CPDAG, but they err in different ways. PC inherits the variance of finite-sample CI tests; GES inherits the bias of a parametric score. Comparing them on the same dataset is the standard way to gauge robustness.

19.3 Simulation

We generate a random Gaussian linear DAG using pcalg::randomDAG with 8 nodes and edge probability 0.35. Each edge \(j \to i\) generates a structural equation \[X_i = \sum_{j \in \text{pa}(i)} \beta_{ji} X_j + \varepsilon_i, \quad \varepsilon_i \sim N(0,1),\] with \(|\beta_{ji}|\) drawn from \([0.25, 1]\).

set.seed(2025)

n_vars    <- 8
n_samples <- 2000
edge_prob <- 0.35
labels    <- paste0("X", 1:n_vars)

true_dag  <- randomDAG(n_vars, prob = edge_prob, lB = 0.25, uB = 1)
graph::nodes(true_dag) <- labels          # rename from "1".."8" to "X1".."X8"
data_mat  <- rmvDAG(n_samples, true_dag, errDist = "normal")
colnames(data_mat) <- labels

# Adjacency matrix from the true DAG (entry [i,j] = 1 means i -> j)
adj_true  <- t(as(true_dag, "matrix") != 0) * 1

# igraph representation + shared layout for all comparison panels
ig_true <- graphnel_to_ig(true_dag)
dag_layout <- igraph::layout_with_sugiyama(ig_true)$layout
sprintf("Variables: %d   Samples: %d   True edges: %d",
        n_vars, n_samples, sum(adj_true))
[1] "Variables: 8   Samples: 2000   True edges: 11"
plot_ig(ig_true, layout = dag_layout)
Figure 19.1: True DAG used for simulation.

19.4 CI-Test and Score Infrastructure

PC uses a Fisher-Z conditional independence test appropriate for Gaussian data. Given the partial correlation \(\hat\rho_{XY \mid Z}\), the test statistic is \[T = \sqrt{n - |Z| - 3} \cdot \operatorname{atanh}(\hat\rho_{XY \mid Z}) \;\sim\; N(0,1) \quad \text{under } H_0: X \perp\!\!\!\perp Y \mid Z.\]

Choosing the significance level \(\alpha\) involves a bias–variance tradeoff on the skeleton:

  • Low \(\alpha\) (e.g., 0.001): conservative, keeps more edges. Fewer false removals but may retain spurious edges. Denser skeleton.
  • High \(\alpha\) (e.g., 0.05): liberal, removes more edges. Fewer spurious edges but may discard real ones. Sparser skeleton.

With \(n = 2000\) and \(p = 8\) we use \(\alpha = 0.01\), a common default.

GES does not use a significance level. It maximizes the Gaussian BIC score \[\text{BIC}(G) = -2 \,\ell(\hat\theta; G) + |G| \log n,\] which penalizes graph complexity directly. Implemented in pcalg as GaussL0penObsScore.

We wrap gaussCItest in a counter to track exactly how many CI tests PC calls:

make_counted_ci <- function(suff_stat) {
  count <- 0L
  ci <- function(x, y, S, suffStat) {
    count <<- count + 1L
    gaussCItest(x, y, S, suffStat)
  }
  list(ci = ci, count = function() count)
}

suff_stat <- list(C = cor(data_mat), n = n_samples)

19.5 PC Algorithm

The PC algorithm (Spirtes & Glymour, 1991) starts from a complete undirected graph and removes edges whenever a conditional independence is found. It searches for separating sets \(Z\) of increasing size, conditioning on subsets of the current neighbors. After skeleton discovery, Meek’s rules orient as many edges as possible.

How it works, step by step:

  1. Start with a complete undirected graph (every pair connected).
  2. For each pair \((X, Y)\), test \(X \perp\!\!\!\perp Y \mid Z\) for all \(Z \subseteq \text{neighbors}(X) \setminus \{Y\}\), starting with \(|Z| = 0\). Remove the edge if any such test passes.
  3. Repeat with conditioning sets of increasing size until no new edges are removed.
  4. Identify colliders: if \(X - Z - Y\) and \(X, Y\) non-adjacent, orient as \(X \to Z \leftarrow Y\) only if \(Z\) was not in the separating set of \(X\) and \(Y\).
  5. Apply Meek’s orientation rules to propagate orientations without creating new colliders or cycles.
sig_level <- 0.01

ci_pc  <- make_counted_ci(suff_stat)
pc_fit <- pc(suffStat = suff_stat,
             indepTest = ci_pc$ci,
             labels = labels,
             alpha = sig_level,
             verbose = FALSE)

cpdag_pc <- pc_fit@graph
ci_tests_pc <- ci_pc$count()

Reading the output: F1 scores

The F1 score is the harmonic mean of precision and recall, ranging from 0 (worst) to 1 (perfect).

  • Skeleton F1 measures whether the algorithm found the right edges regardless of direction. An edge is a true positive if it exists in both the estimated skeleton and the true one.
  • CPDAG F1 is stricter: a directed edge \(X \to Y\) is only a true positive if the true DAG also has \(X \to Y\); an undirected edge \(X - Y\) counts as correct if the true DAG has either direction.

A high skeleton F1 with lower CPDAG F1 means the algorithm found the right adjacencies but left some edges unoriented — the typical case when the MEC contains many equivalent DAGs.

# Helper: extract directed adjacency matrix from a graphNEL CPDAG.
# pcalg convention: amat[i,j] = 1 means an edge that ENDS at i,
# i.e. a directed edge j -> i (or one endpoint of an undirected edge).
amat_from_graph <- function(g) {
  m <- as(g, "matrix") != 0
  storage.mode(m) <- "integer"
  m
}

# Skeleton: symmetrize (edge exists either way).
amat_to_skel <- function(amat) (amat | t(amat)) * 1L

# F1 scores: skeleton F1 (ignore direction) and CPDAG F1
# (directed edge in estimate must match a directed edge in the truth;
#  undirected in estimate matches either direction in truth).
f1_scores <- function(amat_true, amat_est) {
  skel_t <- amat_to_skel(amat_true)
  skel_e <- amat_to_skel(amat_est)
  diag(skel_t) <- diag(skel_e) <- 0
  tp <- sum(skel_t & skel_e) / 2
  fp <- sum(!skel_t & skel_e) / 2
  fn <- sum(skel_t & !skel_e) / 2
  f1_skel <- if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn)

  # CPDAG F1: walk ordered pairs (i,j) with i < j
  tp <- fp <- fn <- 0
  p <- nrow(amat_true)
  for (i in seq_len(p - 1)) for (j in seq.int(i + 1, p)) {
    t_ij <- amat_true[j, i] == 1   # truth has i -> j?
    t_ji <- amat_true[i, j] == 1   # truth has j -> i?
    e_ij <- amat_est[j, i] == 1
    e_ji <- amat_est[i, j] == 1
    if (!t_ij && !t_ji && !e_ij && !e_ji) next
    truth_undirected <- t_ij && t_ji
    est_undirected   <- e_ij && e_ji
    has_t <- t_ij || t_ji
    has_e <- e_ij || e_ji
    if (has_t && has_e) {
      if (est_undirected || truth_undirected ||
          (t_ij == e_ij && t_ji == e_ji)) {
        tp <- tp + 1
      } else {
        fp <- fp + 1; fn <- fn + 1
      }
    } else if (has_e) {
      fp <- fp + 1
    } else {
      fn <- fn + 1
    }
  }
  f1_cpdag <- if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn)

  c(skeleton = f1_skel, cpdag = f1_cpdag)
}

amat_pc  <- amat_from_graph(cpdag_pc)
f1_pc    <- f1_scores(adj_true, amat_pc)
ig_cpdag_pc <- graphnel_to_ig(cpdag_pc)

sprintf("PC:    skeleton F1 = %.3f   CPDAG F1 = %.3f   CI tests = %d",
        f1_pc["skeleton"], f1_pc["cpdag"], ci_tests_pc)
[1] "PC:    skeleton F1 = 0.952   CPDAG F1 = 0.190   CI tests = 293"

Computational complexity: In the worst case PC tests all subsets of neighbors, giving exponential growth with the maximum degree. In sparse graphs this is manageable; in dense graphs it becomes prohibitive.

19.6 GES Algorithm

GES (Greedy Equivalence Search; Chickering, 2002) takes a fundamentally different approach. Instead of testing independences and removing edges, it searches the space of CPDAGs by greedily adding, then removing edges that improve a model score. Under faithfulness and a consistent score (BIC qualifies), GES is asymptotically consistent — it recovers the true CPDAG as \(n \to \infty\).

How it works, step by step:

  1. Forward phase. Start from the empty graph. At each step, add the single edge whose addition most increases the BIC score, restricted to additions that stay within a CPDAG. Stop when no addition improves the score.
  2. Backward phase. From the forward-phase output, remove the single edge whose removal most increases the score. Stop when no removal improves the score.

The whole search operates on CPDAGs (equivalence classes), not individual DAGs, so it explores a much smaller space than naïve DAG search.

score_ges <- new("GaussL0penObsScore", data_mat)
ges_fit   <- ges(score_ges, verbose = FALSE)
cpdag_ges <- as(ges_fit$essgraph, "graphNEL")

amat_ges  <- amat_from_graph(cpdag_ges)
f1_ges    <- f1_scores(adj_true, amat_ges)
ig_cpdag_ges <- graphnel_to_ig(cpdag_ges)

sprintf("GES:   skeleton F1 = %.3f   CPDAG F1 = %.3f   (score-based, no CI tests)",
        f1_ges["skeleton"], f1_ges["cpdag"])
[1] "GES:   skeleton F1 = 0.800   CPDAG F1 = 0.240   (score-based, no CI tests)"

Why GES vs. PC matters in practice: PC errors propagate — a single false-positive CI test can cascade through the orientation phase. GES errors are local — a wrong edge addition or removal costs at most one edge. With moderate \(n\) and well-behaved noise, GES is often more stable; with very large \(n\) and possible non-Gaussian errors, PC’s nonparametric flavor of independence can be more robust.

19.7 Comparison

Reading the CPDAG figure

In the side-by-side graph display:

  • A directed edge \(X_i \to X_j\) means this orientation is invariant across all equivalent DAGs.
  • An undirected edge \(X_i - X_j\) in the CPDAG display represents an unoriented edge: both \(X_i \to X_j\) and \(X_i \leftarrow X_j\) are statistically equivalent. This is not a hidden common cause — it means the data cannot determine direction.
  • A missing edge means the algorithm found a conditional independence separating those two variables.

Compare the estimated CPDAGs to the true DAG: extra edges are false positives; missing edges are false negatives.

kable(data.frame(
  Algorithm   = c("PC", "GES"),
  Paradigm    = c("Constraint-based", "Score-based"),
  Skeleton_F1 = round(c(f1_pc["skeleton"], f1_ges["skeleton"]), 3),
  CPDAG_F1    = round(c(f1_pc["cpdag"],    f1_ges["cpdag"]),    3),
  CI_Tests    = c(ci_tests_pc, NA)
), row.names = FALSE)
Table 19.1: Algorithm comparison on the simulated 8-node DAG (n = 2000, alpha = 0.01)
Algorithm Paradigm Skeleton_F1 CPDAG_F1 CI_Tests
PC Constraint-based 0.952 0.19 293
GES Score-based 0.800 0.24 NA
op <- par(mfrow = c(1, 3), mar = c(1, 1, 3, 1))
plot_ig(ig_true,     layout = dag_layout, main = "True DAG")
plot_ig(ig_cpdag_pc, layout = dag_layout, main = "PC — estimated CPDAG")
plot_ig(ig_cpdag_ges,layout = dag_layout, main = "GES — estimated CPDAG")
par(op)
Figure 19.2: True DAG vs. estimated CPDAGs. Arrows denote oriented edges; lines without arrowheads are unoriented (both directions are CI-equivalent).

Interpreting the comparison table

  • Skeleton F1 is the primary metric for comparing algorithms: it measures edge recovery independent of orientation, since orientation accuracy is bounded by the MEC.
  • CPDAG F1 rewards correctly oriented edges and penalizes leaving orientable edges undirected.
  • CI tests is reported only for PC — GES is score-based and does not perform CI tests.

On this dataset PC and GES typically achieve similar skeleton F1. GES often pulls ahead on CPDAG F1 because the BIC score gives it a directional signal that PC’s CI tests do not — score-based methods can break some ties that constraint-based methods leave as undirected edges.

19.8 Monte Carlo Evaluation

A single dataset can be lucky. We average over 100 replications to get stable estimates.

What to look for in the MC results:

  • The mean skeleton F1 is the headline: how often does each algorithm recover the true adjacencies?
  • The distribution of CI test counts (shown in Figure 19.3) tracks PC’s computational variability; GES has no analogous quantity, so we plot PC alone.
  • High variance in F1 across replications suggests the algorithms are sensitive to the particular random graph.
evaluate_once <- function(seed, n_vars = 8, n_samples = 2000,
                          edge_prob = 0.35, sig = 0.01) {
  set.seed(seed)
  dag      <- randomDAG(n_vars, prob = edge_prob, lB = 0.25, uB = 1)
  labs     <- paste0("X", 1:n_vars)
  graph::nodes(dag) <- labs
  d        <- rmvDAG(n_samples, dag, errDist = "normal")
  colnames(d) <- labs
  adj      <- t(as(dag, "matrix") != 0) * 1L
  ss       <- list(C = cor(d), n = n_samples)

  ci       <- make_counted_ci(ss)
  pc_f     <- pc(ss, ci$ci, labels = labs, alpha = sig, verbose = FALSE)
  amat_p   <- amat_from_graph(pc_f@graph)

  ges_f    <- ges(new("GaussL0penObsScore", d), verbose = FALSE)
  amat_g   <- amat_from_graph(as(ges_f$essgraph, "graphNEL"))

  f_p <- f1_scores(adj, amat_p); f_g <- f1_scores(adj, amat_g)
  c(f1_skel_pc  = f_p["skeleton"], f1_skel_ges = f_g["skeleton"],
    f1_cpdag_pc = f_p["cpdag"],    f1_cpdag_ges = f_g["cpdag"],
    ci_pc       = ci$count())
}

mc <- as.data.frame(t(sapply(1:100, evaluate_once)))
names(mc) <- c("f1_skel_pc","f1_skel_ges","f1_cpdag_pc","f1_cpdag_ges","ci_pc")

mc_summary <- data.frame(
  Method      = c("PC", "GES"),
  Skeleton_F1 = round(c(mean(mc$f1_skel_pc),  mean(mc$f1_skel_ges)),  3),
  CPDAG_F1    = round(c(mean(mc$f1_cpdag_pc), mean(mc$f1_cpdag_ges)), 3),
  CI_Tests    = c(round(mean(mc$ci_pc), 1), NA)
)
kable(mc_summary,
      caption = "Monte Carlo averages (100 replications, n = 2000, p = 8)",
      row.names = FALSE)
Monte Carlo averages (100 replications, n = 2000, p = 8)
Method Skeleton_F1 CPDAG_F1 CI_Tests
PC 0.947 0.342 230.6
GES 0.933 0.359 NA
ggplot(mc, aes(x = ci_pc)) +
  geom_histogram(bins = 20, fill = "steelblue", alpha = 0.75, color = "white") +
  labs(x = "CI tests (PC)", y = "Frequency",
       title = "PC CI-test counts across 100 replications") +
  theme_minimal()
Figure 19.3: Distribution of PC’s CI-test counts across 100 replications. The spread reflects how PC’s cost depends on the realized graph density and the order of CI decisions.

19.9 Summary

PC and GES recover the same target — the Markov equivalence class of the data-generating DAG — but they reach it from opposite directions. PC reasons about conditional independences; GES reasons about model fit. Reporting both is the standard sanity check in R-based applied work: if they agree on the skeleton, the recovered structure is robust to the choice of paradigm; if they disagree, that is a signal to dig in (sample size, faithfulness violations, non-Gaussian errors).

From CPDAG to causal estimation: Once a CPDAG is in hand, the workflow continues as follows:

  1. Use background knowledge to orient undirected edges. Temporal ordering is the most common source: if \(X\) is measured before \(Y\), the edge must be \(X \to Y\). Domain expertise, experimental context, or exclusion restrictions can orient others.
  2. Check identifiability. With a fully oriented DAG, apply the backdoor, front-door, or ID algorithm to determine whether the effect of interest is identifiable.
  3. Estimate. Use the identified functional with the estimators from earlier chapters (RA, IPW, AIPW, TMLE).

If some edges remain undirected after exhausting background knowledge, you can either report the CPDAG and bound the estimand over the equivalence class, or collect additional interventional data to break the remaining equivalences.

What cannot be recovered from observational data alone

Both algorithms identify the CPDAG — the equivalence class — not the true DAG. Certain edge orientations are fundamentally unidentifiable without additional assumptions such as non-Gaussian errors, temporal ordering, or interventional data.

The next chapter addresses a harder problem: recovering the causal skeleton when some variables are unobserved.