23  Causal Discovery: Latent Variables

using CausalInference
using RecursiveCausalDiscovery
using CausalGraphs
using Graphs
using CairoMakie
using Statistics
using LinearAlgebra
using Random
using Printf
using DataFrames
using Combinatorics: powerset
using Distributions: Normal, quantile

In the previous chapter we assumed all causally relevant variables were observed. This assumption — called causal sufficiency — rarely holds in economics: unobserved ability confounds wage regressions, unobserved demand shocks confound price–quantity relationships, unobserved peer effects confound social network studies.

When latent confounders exist, constraint-based discovery algorithms like PC and RSL-D return incorrect results. This chapter introduces two algorithms that handle the latent variable case.

23.0.1 Why PC fails with latent variables

Suppose the true graph is \(X \leftarrow U \rightarrow Y\) where \(U\) is unobserved. In the observed data, \(X\) and \(Y\) are marginally correlated (through \(U\)) and no observed conditioning set separates them. PC will incorrectly draw an edge \(X - Y\) and may orient it as \(X \to Y\) or \(X \leftarrow Y\), neither of which exists in the truth.

More subtly, conditioning on a collider that is a descendant of \(U\) can open a path between \(X\) and \(Y\), further distorting the CI structure that PC observes. With even a single hidden common cause, the d-separation statements in the observed data no longer correspond to those of any DAG over just the observed variables — they require the richer MAG representation.

23.1 Maximal Ancestral Graphs and PAGs

With latent variables, the appropriate representation is a Maximal Ancestral Graph (MAG). Over a set of observed variables \(\mathbf{O}\), a MAG encodes:

  • \(X \to Y\): \(X\) is an ancestor of \(Y\) in the full DAG
  • \(X \leftrightarrow Y\): \(X\) and \(Y\) have a common hidden ancestor (hidden confounder)
  • No edge: \(X \perp\!\!\!\perp Y \mid Z\) for some \(Z \subseteq \mathbf{O} \setminus \{X, Y\}\)

The MAG over observed variables is the “shadow” of the full DAG: it summarizes all causal and confounding relationships that are visible in the observed data, without requiring us to know or name the latent variables.

Just as DAGs have Markov equivalence classes represented by CPDAGs, MAGs have equivalence classes represented by Partial Ancestral Graphs (PAGs). In a PAG, the mark \(\circ\) on an edge endpoint means “could be arrowhead or tail in some member of the equivalence class.”

PAG mark Meaning Economic interpretation
\(X \to Y\) \(X\) causes \(Y\) in every equivalent MAG Robust causal direction
\(X \leftarrow\!\!\!\!\circ\; Y\) Some MAGs have \(X \to Y\), others \(X \leftarrow Y\) Direction uncertain
\(X \leftrightarrow Y\) Hidden common cause in every equivalent MAG Definite unmeasured confounder
\(X \;\circ\!\!\!\!-\!\!\!\!\circ\; Y\) Could be \(\to\), \(\leftarrow\), or \(\leftrightarrow\) Maximally uncertain

Reading a PAG for policy purposes:

  • A definite \(X \to Y\) edge is the most useful finding: it means any intervention on \(X\) propagates to \(Y\) regardless of which specific MAG the data came from.
  • A definite \(X \leftrightarrow Y\) edge is a warning: before using regression of \(Y\) on \(X\) to estimate a causal effect, you must address the hidden confounder — through an instrument, a proxy, or a natural experiment.
  • \(\circ\) marks indicate what you don’t know. Additional data, temporal ordering, or experimental variation can resolve them.

23.2 Simulation with Hidden Variables

We reuse the same 8-node Gaussian linear DAG from the previous chapter and hide 2 nodes, treating them as unobserved confounders.

Random.seed!(2025)

n_vars    = 8
n_samples = 2000
edge_prob = 0.35
latent    = [3, 6]
obs_idx   = setdiff(1:n_vars, latent)
all_labels = ["X$i" for i in 1:n_vars]
obs_labels = ["X$(obs_idx[i])" for i in 1:length(obs_idx)]

adj_mat   = gen_er_dag_adj_mat(n_vars, edge_prob)
data_full = gen_gaussian_data(adj_mat, n_samples)
data_obs  = data_full[:, obs_idx]
n_obs     = length(obs_idx)

@printf "Total variables: %d   Hidden: %s   Observed: %d\n" n_vars latent n_obs
Total variables: 8   Hidden: [3, 6]   Observed: 6

23.2.1 True structure over observed variables

Two observed variables are adjacent in the MAG iff no subset of the observed variables d-separates them in the full DAG.

true_dag = SimpleDiGraph(adj_mat)

function true_mag_skeleton(dag, observed)
    p = length(observed)
    skel = SimpleGraph(p)
    for i in 1:(p-1), j in (i+1):p
        u, v = observed[i], observed[j]
        other_obs = setdiff(observed, [u, v])
        separated = any(dsep(dag, u, v, S) for S in powerset(other_obs))
        !separated && add_edge!(skel, i, j)
    end
    return skel
end

true_obs_skel = true_mag_skeleton(true_dag, obs_idx)
@printf "True MAG skeleton edges (over %d observed nodes): %d\n" n_obs ne(true_obs_skel)
True MAG skeleton edges (over 6 observed nodes): 5
side_by_side([
    (admg_with_latent(adj_mat, all_labels, latent), "Full DAG (square = hidden)"),
    (skeleton_to_admg(true_obs_skel, obs_labels),   "True MAG skeleton (observed)"),
]; note="Square nodes are unobserved (latent). Plaintext nodes are observed.")

Full DAG (square = hidden)

%3 X1 X1 X2 X2 X2->X1 X3 X3 X2->X3 X5 X5 X2->X5 X6 X6 X3->X6 X4 X4 X7 X7 X5->X7 X7->X1 X8 X8 X8->X3 X8->X6 X8->X7

True MAG skeleton (observed)

%3 X1 X1 X2 X2 X1->X2 X7 X7 X1->X7 X5 X5 X2->X5 X4 X4 X5->X7 X8 X8 X7->X8

Square nodes are unobserved (latent). Plaintext nodes are observed.

Figure 23.1: Full DAG (square nodes are hidden) and true MAG skeleton over observed variables. Skeleton edges include both direct paths and paths through hidden variables.

23.3 FCI Algorithm

The FCI algorithm (Fast Causal Inference; Spirtes, Meek & Richardson, 1995) is the standard extension of PC to the latent variable case. Like PC, it starts from a complete graph and removes edges via CI tests. After skeleton discovery, it applies a richer set of orientation rules that can produce bidirected edges \(X \leftrightarrow Y\) indicating hidden common causes.

How FCI extends PC:

  1. Phase 1 (skeleton): Same as PC — remove edges via CI tests with growing conditioning sets.
  2. Phase 2 (initial orientation): Mark all edge endpoints as \(\circ\) (uncertain).
  3. Phase 3 (collider detection): For each unshielded triple \(X \circ\!-\!\circ Z \circ\!-\!\circ Y\) where \(X, Y\) non-adjacent: orient as \(X \circ\!\to Z \leftarrow\!\circ Y\) if \(Z\) is not in the separating set of \(X\) and \(Y\).
  4. Phase 4 (rule propagation): Apply 10 orientation rules that propagate known marks without creating contradictions, including rules that can produce definite \(\to\) and \(\leftrightarrow\) edges.

The richer mark set (\(\to\), \(\leftarrow\), \(\circ\), \(\leftrightarrow\)) allows FCI to express what is known vs. unknown — at the cost of more complex output to interpret.

sig_level = 0.01

count_fci = Ref(0)
ci_fci = (x, y, S, d) -> begin
    count_fci[] += 1
    fisher_z(x, y, collect(Int, S), Matrix{Float64}(d), sig_level)
end

pag_fci  = fcialg(n_obs, ci_fci, data_obs)
skel_fci = SimpleGraph(pag_fci)

f1_fci = f1_score(true_obs_skel, skel_fci)
@printf "FCI:     skeleton F1 = %.3f   CI tests = %d\n" f1_fci count_fci[]
FCI:     skeleton F1 = 0.889   CI tests = 82
println(plot_fci_graph_text(pag_fci, ["X$(obs_idx[i])" for i in 1:n_obs]))
X1<-X2   X1<-X7   X2o-oX5  X5o-oX7                                      nothing
Figure 23.2
Note

Reading the FCI text output

The text representation uses ASCII edge marks:

  • X1 --> X2: definite direct cause (\(X_1 \to X_2\) in every equivalent MAG)
  • X1 <-> X2: definite hidden common cause (\(X_1 \leftrightarrow X_2\))
  • X1 o-> X2: the \(X_1\) endpoint is uncertain; \(X_2\) endpoint is definitely an arrowhead
  • X1 o-o X2: both endpoints uncertain; could be any of \(\to\), \(\leftarrow\), \(\leftrightarrow\)

In practice for economic research: focus first on <-> edges — these flag definite confounding and tell you where IV or proxy strategies are needed. Then examine --> edges — these are causal claims that survive across all statistically equivalent structures.

When FCI gives wrong answers: FCI assumes the CI tests are perfectly accurate (no finite-sample error). In practice, with small \(n\) or many variables, some false CI decisions propagate through the orientation rules. The skeleton quality (F1) is generally more reliable than the orientation quality.

23.4 L-MARVEL Algorithm

L-MARVEL (Latent Markov Boundary Variable Elimination; Mokhtarian et al., JMLR 2025) extends the recursive MARVEL approach to handle latent confounders. Instead of exhaustive CI-test-based Markov boundary discovery, L-MARVEL uses a Gaussian precision matrix estimator:

\[\widehat{\text{Mb}}(X) = \left\{Y : \frac{|\hat P_{XY}|}{\sqrt{\hat P_{XX} \hat P_{YY}}} > \tau\right\}\]

where \(\hat P = \hat\Sigma^{-1}\) is the precision matrix and \(\tau\) is a threshold from the Fisher-Z distribution.

Why the precision matrix? In a Gaussian linear model, the \((i,j)\) entry of the precision matrix \(P = \Sigma^{-1}\) is zero if and only if \(X_i \perp\!\!\!\perp X_j \mid \text{all other variables}\). The partial correlation \(P_{ij}/\sqrt{P_{ii}P_{jj}}\) is therefore a one-shot test for Markov boundary membership, replacing the many conditioning-set CI tests that PC or FCI require to find the same information. This is the key computational saving.

The removability criterion changes for the latent case (Theorem 32). Recall: \(\text{Mb}(X)\) is the Markov boundary of \(X\) — the minimal conditioning set that screens \(X\) off from all other variables (estimated here via the precision matrix above rather than by exhaustive CI tests) — and \(\text{Ne}(X)\) is the set of variables directly adjacent to \(X\) in the current skeleton. Variable \(X\) is removable iff for every \(Y \in \text{Mb}(X)\) and \(Z \in \text{Ne}(X)\):

\[\underbrace{\exists\, W \subseteq \text{Mb}(X) \setminus \{Y,Z\}: Y \perp\!\!\!\perp Z \mid W}_{\text{Condition 1}} \quad\text{or}\quad \underbrace{\forall\, W \subseteq \text{Mb}(X) \setminus \{Y,Z\}: Y \not\!\perp\!\!\!\perp Z \mid W \cup \{X\}}_{\text{Condition 2}}\]

Intuition for the two conditions:

  • Condition 1 is the same as the observed-variable (RSL-D) removability criterion: \(Y\) and \(Z\) can be separated by some subset of the Markov boundary.
  • Condition 2 is new and handles latent variables: if \(Y\) and \(Z\) are connected through a hidden common cause, conditioning on any subset of Mb\((X)\) won’t separate them — but conditioning on \(X\) itself would (because \(X\) may block the latent path). Condition 2 catches this case.

Together, the two conditions ensure that removing \(X\) does not destroy information about the latent structure connecting its neighbours.

count_lm = Ref(0)
ci_lm = (x, y, S, d) -> begin
    count_lm[] += 1
    fisher_z(x, y, collect(Int, S), Matrix{Float64}(d), sig_level)
end

lm_skel = l_marvel(data_obs, ci_lm)

f1_lm = f1_score(true_obs_skel, lm_skel)
@printf "L-MARVEL: skeleton F1 = %.3f   CI tests = %d\n" f1_lm count_lm[]
L-MARVEL: skeleton F1 = 1.000   CI tests = 14
Note

L-MARVEL output: skeleton only

Unlike FCI, L-MARVEL returns only the skeleton — undirected edges with no arrowhead or bidirected marks. This is a deliberate design choice: the recursive removal procedure identifies adjacencies efficiently but does not run the full PAG orientation phase.

This is still valuable. Knowing which pairs of variables are adjacent tells you which potential causal relationships to investigate further (with instruments, experiments, or domain knowledge). Variables with no edge are conditionally independent given some observed set — no causal relationship need be posited between them.

If you need orientations as well as skeleton, run FCI. If you only need to screen out non-adjacent pairs (the most common use case in large-\(p\) problems), L-MARVEL’s precision matrix approach is substantially faster.

23.5 Comparison

DataFrame(
    Algorithm   = ["FCI", "L-MARVEL"],
    Output      = ["PAG (orientations + skeleton)", "Skeleton only"],
    Skeleton_F1 = round.([f1_fci, f1_lm], digits=3),
    CI_Tests    = [count_fci[], count_lm[]],
)
Table 23.1: Algorithm comparison on the latent-variable scenario (2 hidden nodes, n = 2000)
2×4 DataFrame
Row Algorithm Output Skeleton_F1 CI_Tests
String String Float64 Int64
1 FCI PAG (orientations + skeleton) 0.889 82
2 L-MARVEL Skeleton only 1.0 14
side_by_side([
    (skeleton_to_admg(true_obs_skel, obs_labels), "True MAG skeleton"),
    (skeleton_to_admg(skel_fci,      obs_labels), "FCI — estimated skeleton"),
    (skeleton_to_admg(lm_skel,       obs_labels), "L-MARVEL — estimated skeleton"),
]; note="All edges shown as undirected (skeleton comparison only).")

True MAG skeleton

%3 X1 X1 X2 X2 X1->X2 X7 X7 X1->X7 X5 X5 X2->X5 X4 X4 X5->X7 X8 X8 X7->X8

FCI — estimated skeleton

%3 X1 X1 X2 X2 X1->X2 X7 X7 X1->X7 X5 X5 X2->X5 X4 X4 X5->X7 X8 X8

L-MARVEL — estimated skeleton

%3 X1 X1 X2 X2 X1->X2 X7 X7 X1->X7 X5 X5 X2->X5 X4 X4 X5->X7 X8 X8 X7->X8

All edges shown as undirected (skeleton comparison only).

Figure 23.3: True MAG skeleton vs. skeletons recovered by FCI and L-MARVEL over the 6 observed nodes.

23.6 Monte Carlo Evaluation

function evaluate_latent(seed; n_vars=8, n_samples=2000, edge_prob=0.35, n_latent=2, sig=0.01)
    Random.seed!(seed)
    adj       = gen_er_dag_adj_mat(n_vars, edge_prob)
    data_full = gen_gaussian_data(adj, n_samples)
    dag       = SimpleDiGraph(adj)

    latent_vars = sort(randperm(n_vars)[1:n_latent])
    obs_vars    = setdiff(1:n_vars, latent_vars)
    data_obs    = data_full[:, obs_vars]
    n_obs       = length(obs_vars)

    true_skel = true_mag_skeleton(dag, obs_vars)

    cnt_fci = Ref(0)
    ci_fci  = (x, y, S, d) -> begin cnt_fci[] += 1; fisher_z(x, y, collect(Int, S), Matrix{Float64}(d), sig) end
    pag     = fcialg(n_obs, ci_fci, data_obs)
    f1f     = f1_score(true_skel, SimpleGraph(pag))

    cnt_lm  = Ref(0)
    ci_lm   = (x, y, S, d) -> begin cnt_lm[] += 1; fisher_z(x, y, collect(Int, S), Matrix{Float64}(d), sig) end
    skel_lm = l_marvel(data_obs, ci_lm)
    f1l     = f1_score(true_skel, skel_lm)

    (f1_fci=f1f, f1_lm=f1l, ci_fci=cnt_fci[], ci_lm=cnt_lm[])
end

mc = [evaluate_latent(s) for s in 1:100]

@printf "\nMonte Carlo averages (100 replications, n=%d, p=%d, %d latent)\n" n_samples n_vars 2
@printf "%-10s  Skeleton F1  CI tests\n" "Method"
@printf "%-10s  %10.3f  %8.1f\n" "FCI"      mean(r.f1_fci for r in mc) mean(r.ci_fci for r in mc)
@printf "%-10s  %10.3f  %8.1f\n" "L-MARVEL" mean(r.f1_lm  for r in mc) mean(r.ci_lm  for r in mc)

Monte Carlo averages (100 replications, n=2000, p=8, 2 latent)
Method      Skeleton F1  CI tests
FCI              0.912     171.8
L-MARVEL         0.934      80.9
let
    fig = Figure(size=(650, 380))
    ax  = Axis(fig[1,1];
        xlabel = "CI tests",
        ylabel = "Frequency",
        title  = "CI Test Counts: FCI vs L-MARVEL (100 replications)")
    hist!(ax, [r.ci_fci for r in mc]; bins=20, color=(:steelblue, 0.65), label="FCI")
    hist!(ax, [r.ci_lm  for r in mc]; bins=20, color=(:tomato, 0.65),    label="L-MARVEL")
    axislegend(ax)
    fig
end
Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions.

@ Makie ~/.julia/packages/Makie/UjJJY/src/scenes.jl:238
Figure 23.4: CI test count distributions across 100 replications.

23.7 Summary

Property PC / RSL-D FCI L-MARVEL
Latent confounders ✗ assumes none ✓ handles ✓ handles
Output CPDAG PAG Skeleton
Orientations
MB estimation CI tests CI tests Precision matrix
CI complexity PC: \(O(p^2 2^p)\) worst case; RSL-D: \(O(p \cdot d \cdot |\text{MB}|^2)\) Same as PC Fewer (Gaussian MB pre-filters)

FCI and L-MARVEL address complementary needs. If you need to orient edges, use FCI. If you only need the skeleton and efficiency matters, L-MARVEL’s precision-matrix Markov boundary estimator reduces the CI test burden.

23.7.1 Practical decision guide

Use PC or RSL-D when: - You are confident in causal sufficiency (all relevant variables are measured) - You want a CPDAG that you can orient further with background knowledge - RSL-D is preferred over PC when \(p > 10\) or the graph may be moderately dense

Use FCI when: - You suspect hidden confounders but don’t know where - You need edge orientations and \(\leftrightarrow\) marks to guide IV or proxy strategies - Sample size is large enough that CI-test errors don’t cascade badly through orientation rules

Use L-MARVEL when: - You suspect latent confounders and have many variables (\(p > 20\)) - The goal is to prune the adjacency graph before applying domain knowledge or estimation - Gaussian data is appropriate (precision matrix estimator requires this)

23.7.2 From discovery to estimation

Causal discovery is a first step, not a final answer. The typical research workflow is:

  1. Discover — run FCI or L-MARVEL to get candidate adjacencies and, if using FCI, some edge marks
  2. Refine — apply temporal ordering, institutional knowledge, and exclusion restrictions to orient remaining edges
  3. Identify — check whether the effect of interest is identified given the refined graph (backdoor, front-door, ID algorithm)
  4. Estimate — use TMLE, AIPW, or the estimators in earlier chapters

Causal discovery narrows the space of possible structures; domain knowledge finishes the job.

Warning

Sample size requirements

L-MARVEL’s Gaussian MB estimator requires \(n > p + 1\) samples. With many variables and limited data, fall back to CI-test-based MB estimation by passing a custom mb_estimator to l_marvel.

Note

Going further: orientation in the latent case

FCI’s PAG output distinguishes definite causes (\(\to\)), definite hidden common causes (\(\leftrightarrow\)), and uncertain cases (\(\circ\)). The PAG can be combined with background knowledge such as temporal ordering to further orient edges before estimation.

23.8 How Much Does Causal Discovery Help in Economics?

An honest assessment: recovering even the true MAG skeleton is far from recovering the full DAG. Several layers of information remain missing:

What is recovered What is still missing
L-MARVEL skeleton Edge directions; direct cause vs. hidden common cause; latent nodes
FCI PAG Unique MAG; latent nodes; most edges still carry \(\circ\) marks
True MAG Latent nodes and their connections
Full DAG Nothing — this is the goal

The core econometric question is almost always “what is the causal effect of \(X\) on \(Y\)?” Causal discovery with latent variables does not answer this directly. A definite \(X \leftrightarrow Y\) in a PAG tells you confounding exists — but not how to remove it. Without oriented edges you cannot even check whether the effect is identified, let alone estimate it.

Where these methods do add value in economics:

  1. Falsifying structural models. If your model implies \(X \perp\!\!\!\perp Z \mid W\), test it. A rejection means the model’s independence assumptions are inconsistent with the data — discovery tells you which assumptions fail before you commit to a full structural estimation.

  2. Flagging where instruments are needed. A definite \(X \leftrightarrow Y\) in the PAG is a data-driven diagnostic: OLS of \(Y\) on \(X\) is biased, and you need an instrument, proxy, or natural experiment. Discovery does not supply the instrument but tells you where to look for one.

  3. High-dimensional variable selection. With many candidate controls, knowing which pairs are conditionally independent prunes the problem before applying identification strategies. This is most useful in settings with \(p > 20\) variables where theory does not specify the full graph.

  4. Hypothesis generation when theory is silent. For new economic phenomena — platform markets, fintech, peer effects in novel settings — where theory does not give a strong causal ordering, discovery algorithms generate hypotheses worth investigating with better-powered research designs.

The bottom line: causal discovery is a complement to the standard econometric toolkit — IV, RD, DiD, synthetic control — not a substitute. It is most useful as a diagnostic and hypothesis-generating tool early in a research project, before the hard work of finding credible identifying variation begins.