using CausalInference
using RecursiveCausalDiscovery
using CausalGraphs
using Graphs
using CairoMakie
using Statistics
using LinearAlgebra
using Random
using Printf
using DataFrames
using Combinatorics: powerset
using Distributions: Normal, quantile23 Causal Discovery: Latent Variables
In the previous chapter we assumed all causally relevant variables were observed. This assumption — called causal sufficiency — rarely holds in economics: unobserved ability confounds wage regressions, unobserved demand shocks confound price–quantity relationships, unobserved peer effects confound social network studies.
When latent confounders exist, constraint-based discovery algorithms like PC and RSL-D return incorrect results. This chapter introduces two algorithms that handle the latent variable case.
23.0.1 Why PC fails with latent variables
Suppose the true graph is \(X \leftarrow U \rightarrow Y\) where \(U\) is unobserved. In the observed data, \(X\) and \(Y\) are marginally correlated (through \(U\)) and no observed conditioning set separates them. PC will incorrectly draw an edge \(X - Y\) and may orient it as \(X \to Y\) or \(X \leftarrow Y\), neither of which exists in the truth.
More subtly, conditioning on a collider that is a descendant of \(U\) can open a path between \(X\) and \(Y\), further distorting the CI structure that PC observes. With even a single hidden common cause, the d-separation statements in the observed data no longer correspond to those of any DAG over just the observed variables — they require the richer MAG representation.
23.1 Maximal Ancestral Graphs and PAGs
With latent variables, the appropriate representation is a Maximal Ancestral Graph (MAG). Over a set of observed variables \(\mathbf{O}\), a MAG encodes:
- \(X \to Y\): \(X\) is an ancestor of \(Y\) in the full DAG
- \(X \leftrightarrow Y\): \(X\) and \(Y\) have a common hidden ancestor (hidden confounder)
- No edge: \(X \perp\!\!\!\perp Y \mid Z\) for some \(Z \subseteq \mathbf{O} \setminus \{X, Y\}\)
The MAG over observed variables is the “shadow” of the full DAG: it summarizes all causal and confounding relationships that are visible in the observed data, without requiring us to know or name the latent variables.
Just as DAGs have Markov equivalence classes represented by CPDAGs, MAGs have equivalence classes represented by Partial Ancestral Graphs (PAGs). In a PAG, the mark \(\circ\) on an edge endpoint means “could be arrowhead or tail in some member of the equivalence class.”
| PAG mark | Meaning | Economic interpretation |
|---|---|---|
| \(X \to Y\) | \(X\) causes \(Y\) in every equivalent MAG | Robust causal direction |
| \(X \leftarrow\!\!\!\!\circ\; Y\) | Some MAGs have \(X \to Y\), others \(X \leftarrow Y\) | Direction uncertain |
| \(X \leftrightarrow Y\) | Hidden common cause in every equivalent MAG | Definite unmeasured confounder |
| \(X \;\circ\!\!\!\!-\!\!\!\!\circ\; Y\) | Could be \(\to\), \(\leftarrow\), or \(\leftrightarrow\) | Maximally uncertain |
Reading a PAG for policy purposes:
- A definite \(X \to Y\) edge is the most useful finding: it means any intervention on \(X\) propagates to \(Y\) regardless of which specific MAG the data came from.
- A definite \(X \leftrightarrow Y\) edge is a warning: before using regression of \(Y\) on \(X\) to estimate a causal effect, you must address the hidden confounder — through an instrument, a proxy, or a natural experiment.
- \(\circ\) marks indicate what you don’t know. Additional data, temporal ordering, or experimental variation can resolve them.
23.3 FCI Algorithm
The FCI algorithm (Fast Causal Inference; Spirtes, Meek & Richardson, 1995) is the standard extension of PC to the latent variable case. Like PC, it starts from a complete graph and removes edges via CI tests. After skeleton discovery, it applies a richer set of orientation rules that can produce bidirected edges \(X \leftrightarrow Y\) indicating hidden common causes.
How FCI extends PC:
- Phase 1 (skeleton): Same as PC — remove edges via CI tests with growing conditioning sets.
- Phase 2 (initial orientation): Mark all edge endpoints as \(\circ\) (uncertain).
- Phase 3 (collider detection): For each unshielded triple \(X \circ\!-\!\circ Z \circ\!-\!\circ Y\) where \(X, Y\) non-adjacent: orient as \(X \circ\!\to Z \leftarrow\!\circ Y\) if \(Z\) is not in the separating set of \(X\) and \(Y\).
- Phase 4 (rule propagation): Apply 10 orientation rules that propagate known marks without creating contradictions, including rules that can produce definite \(\to\) and \(\leftrightarrow\) edges.
The richer mark set (\(\to\), \(\leftarrow\), \(\circ\), \(\leftrightarrow\)) allows FCI to express what is known vs. unknown — at the cost of more complex output to interpret.
sig_level = 0.01
count_fci = Ref(0)
ci_fci = (x, y, S, d) -> begin
count_fci[] += 1
fisher_z(x, y, collect(Int, S), Matrix{Float64}(d), sig_level)
end
pag_fci = fcialg(n_obs, ci_fci, data_obs)
skel_fci = SimpleGraph(pag_fci)
f1_fci = f1_score(true_obs_skel, skel_fci)
@printf "FCI: skeleton F1 = %.3f CI tests = %d\n" f1_fci count_fci[]FCI: skeleton F1 = 0.889 CI tests = 82
println(plot_fci_graph_text(pag_fci, ["X$(obs_idx[i])" for i in 1:n_obs]))X1<-X2 X1<-X7 X2o-oX5 X5o-oX7 nothing
Reading the FCI text output
The text representation uses ASCII edge marks:
X1 --> X2: definite direct cause (\(X_1 \to X_2\) in every equivalent MAG)X1 <-> X2: definite hidden common cause (\(X_1 \leftrightarrow X_2\))X1 o-> X2: the \(X_1\) endpoint is uncertain; \(X_2\) endpoint is definitely an arrowheadX1 o-o X2: both endpoints uncertain; could be any of \(\to\), \(\leftarrow\), \(\leftrightarrow\)
In practice for economic research: focus first on <-> edges — these flag definite confounding and tell you where IV or proxy strategies are needed. Then examine --> edges — these are causal claims that survive across all statistically equivalent structures.
When FCI gives wrong answers: FCI assumes the CI tests are perfectly accurate (no finite-sample error). In practice, with small \(n\) or many variables, some false CI decisions propagate through the orientation rules. The skeleton quality (F1) is generally more reliable than the orientation quality.
23.4 L-MARVEL Algorithm
L-MARVEL (Latent Markov Boundary Variable Elimination; Mokhtarian et al., JMLR 2025) extends the recursive MARVEL approach to handle latent confounders. Instead of exhaustive CI-test-based Markov boundary discovery, L-MARVEL uses a Gaussian precision matrix estimator:
\[\widehat{\text{Mb}}(X) = \left\{Y : \frac{|\hat P_{XY}|}{\sqrt{\hat P_{XX} \hat P_{YY}}} > \tau\right\}\]
where \(\hat P = \hat\Sigma^{-1}\) is the precision matrix and \(\tau\) is a threshold from the Fisher-Z distribution.
Why the precision matrix? In a Gaussian linear model, the \((i,j)\) entry of the precision matrix \(P = \Sigma^{-1}\) is zero if and only if \(X_i \perp\!\!\!\perp X_j \mid \text{all other variables}\). The partial correlation \(P_{ij}/\sqrt{P_{ii}P_{jj}}\) is therefore a one-shot test for Markov boundary membership, replacing the many conditioning-set CI tests that PC or FCI require to find the same information. This is the key computational saving.
The removability criterion changes for the latent case (Theorem 32). Recall: \(\text{Mb}(X)\) is the Markov boundary of \(X\) — the minimal conditioning set that screens \(X\) off from all other variables (estimated here via the precision matrix above rather than by exhaustive CI tests) — and \(\text{Ne}(X)\) is the set of variables directly adjacent to \(X\) in the current skeleton. Variable \(X\) is removable iff for every \(Y \in \text{Mb}(X)\) and \(Z \in \text{Ne}(X)\):
\[\underbrace{\exists\, W \subseteq \text{Mb}(X) \setminus \{Y,Z\}: Y \perp\!\!\!\perp Z \mid W}_{\text{Condition 1}} \quad\text{or}\quad \underbrace{\forall\, W \subseteq \text{Mb}(X) \setminus \{Y,Z\}: Y \not\!\perp\!\!\!\perp Z \mid W \cup \{X\}}_{\text{Condition 2}}\]
Intuition for the two conditions:
- Condition 1 is the same as the observed-variable (RSL-D) removability criterion: \(Y\) and \(Z\) can be separated by some subset of the Markov boundary.
- Condition 2 is new and handles latent variables: if \(Y\) and \(Z\) are connected through a hidden common cause, conditioning on any subset of Mb\((X)\) won’t separate them — but conditioning on \(X\) itself would (because \(X\) may block the latent path). Condition 2 catches this case.
Together, the two conditions ensure that removing \(X\) does not destroy information about the latent structure connecting its neighbours.
count_lm = Ref(0)
ci_lm = (x, y, S, d) -> begin
count_lm[] += 1
fisher_z(x, y, collect(Int, S), Matrix{Float64}(d), sig_level)
end
lm_skel = l_marvel(data_obs, ci_lm)
f1_lm = f1_score(true_obs_skel, lm_skel)
@printf "L-MARVEL: skeleton F1 = %.3f CI tests = %d\n" f1_lm count_lm[]L-MARVEL: skeleton F1 = 1.000 CI tests = 14
L-MARVEL output: skeleton only
Unlike FCI, L-MARVEL returns only the skeleton — undirected edges with no arrowhead or bidirected marks. This is a deliberate design choice: the recursive removal procedure identifies adjacencies efficiently but does not run the full PAG orientation phase.
This is still valuable. Knowing which pairs of variables are adjacent tells you which potential causal relationships to investigate further (with instruments, experiments, or domain knowledge). Variables with no edge are conditionally independent given some observed set — no causal relationship need be posited between them.
If you need orientations as well as skeleton, run FCI. If you only need to screen out non-adjacent pairs (the most common use case in large-\(p\) problems), L-MARVEL’s precision matrix approach is substantially faster.
23.5 Comparison
DataFrame(
Algorithm = ["FCI", "L-MARVEL"],
Output = ["PAG (orientations + skeleton)", "Skeleton only"],
Skeleton_F1 = round.([f1_fci, f1_lm], digits=3),
CI_Tests = [count_fci[], count_lm[]],
)| Row | Algorithm | Output | Skeleton_F1 | CI_Tests |
|---|---|---|---|---|
| String | String | Float64 | Int64 | |
| 1 | FCI | PAG (orientations + skeleton) | 0.889 | 82 |
| 2 | L-MARVEL | Skeleton only | 1.0 | 14 |
side_by_side([
(skeleton_to_admg(true_obs_skel, obs_labels), "True MAG skeleton"),
(skeleton_to_admg(skel_fci, obs_labels), "FCI — estimated skeleton"),
(skeleton_to_admg(lm_skel, obs_labels), "L-MARVEL — estimated skeleton"),
]; note="All edges shown as undirected (skeleton comparison only).")True MAG skeleton
FCI — estimated skeleton
L-MARVEL — estimated skeleton
All edges shown as undirected (skeleton comparison only).
23.6 Monte Carlo Evaluation
function evaluate_latent(seed; n_vars=8, n_samples=2000, edge_prob=0.35, n_latent=2, sig=0.01)
Random.seed!(seed)
adj = gen_er_dag_adj_mat(n_vars, edge_prob)
data_full = gen_gaussian_data(adj, n_samples)
dag = SimpleDiGraph(adj)
latent_vars = sort(randperm(n_vars)[1:n_latent])
obs_vars = setdiff(1:n_vars, latent_vars)
data_obs = data_full[:, obs_vars]
n_obs = length(obs_vars)
true_skel = true_mag_skeleton(dag, obs_vars)
cnt_fci = Ref(0)
ci_fci = (x, y, S, d) -> begin cnt_fci[] += 1; fisher_z(x, y, collect(Int, S), Matrix{Float64}(d), sig) end
pag = fcialg(n_obs, ci_fci, data_obs)
f1f = f1_score(true_skel, SimpleGraph(pag))
cnt_lm = Ref(0)
ci_lm = (x, y, S, d) -> begin cnt_lm[] += 1; fisher_z(x, y, collect(Int, S), Matrix{Float64}(d), sig) end
skel_lm = l_marvel(data_obs, ci_lm)
f1l = f1_score(true_skel, skel_lm)
(f1_fci=f1f, f1_lm=f1l, ci_fci=cnt_fci[], ci_lm=cnt_lm[])
end
mc = [evaluate_latent(s) for s in 1:100]
@printf "\nMonte Carlo averages (100 replications, n=%d, p=%d, %d latent)\n" n_samples n_vars 2
@printf "%-10s Skeleton F1 CI tests\n" "Method"
@printf "%-10s %10.3f %8.1f\n" "FCI" mean(r.f1_fci for r in mc) mean(r.ci_fci for r in mc)
@printf "%-10s %10.3f %8.1f\n" "L-MARVEL" mean(r.f1_lm for r in mc) mean(r.ci_lm for r in mc)
Monte Carlo averages (100 replications, n=2000, p=8, 2 latent)
Method Skeleton F1 CI tests
FCI 0.912 171.8
L-MARVEL 0.934 80.9
let
fig = Figure(size=(650, 380))
ax = Axis(fig[1,1];
xlabel = "CI tests",
ylabel = "Frequency",
title = "CI Test Counts: FCI vs L-MARVEL (100 replications)")
hist!(ax, [r.ci_fci for r in mc]; bins=20, color=(:steelblue, 0.65), label="FCI")
hist!(ax, [r.ci_lm for r in mc]; bins=20, color=(:tomato, 0.65), label="L-MARVEL")
axislegend(ax)
fig
end┌ Warning: Found `resolution` in the theme when creating a `Scene`. The `resolution` keyword for `Scene`s and `Figure`s has been deprecated. Use `Figure(; size = ...` or `Scene(; size = ...)` instead, which better reflects that this is a unitless size and not a pixel resolution. The key could also come from `set_theme!` calls or related theming functions. └ @ Makie ~/.julia/packages/Makie/UjJJY/src/scenes.jl:238
23.7 Summary
| Property | PC / RSL-D | FCI | L-MARVEL |
|---|---|---|---|
| Latent confounders | ✗ assumes none | ✓ handles | ✓ handles |
| Output | CPDAG | PAG | Skeleton |
| Orientations | ✓ | ✓ | ✗ |
| MB estimation | CI tests | CI tests | Precision matrix |
| CI complexity | PC: \(O(p^2 2^p)\) worst case; RSL-D: \(O(p \cdot d \cdot |\text{MB}|^2)\) | Same as PC | Fewer (Gaussian MB pre-filters) |
FCI and L-MARVEL address complementary needs. If you need to orient edges, use FCI. If you only need the skeleton and efficiency matters, L-MARVEL’s precision-matrix Markov boundary estimator reduces the CI test burden.
23.7.1 Practical decision guide
Use PC or RSL-D when: - You are confident in causal sufficiency (all relevant variables are measured) - You want a CPDAG that you can orient further with background knowledge - RSL-D is preferred over PC when \(p > 10\) or the graph may be moderately dense
Use FCI when: - You suspect hidden confounders but don’t know where - You need edge orientations and \(\leftrightarrow\) marks to guide IV or proxy strategies - Sample size is large enough that CI-test errors don’t cascade badly through orientation rules
Use L-MARVEL when: - You suspect latent confounders and have many variables (\(p > 20\)) - The goal is to prune the adjacency graph before applying domain knowledge or estimation - Gaussian data is appropriate (precision matrix estimator requires this)
23.7.2 From discovery to estimation
Causal discovery is a first step, not a final answer. The typical research workflow is:
- Discover — run FCI or L-MARVEL to get candidate adjacencies and, if using FCI, some edge marks
- Refine — apply temporal ordering, institutional knowledge, and exclusion restrictions to orient remaining edges
- Identify — check whether the effect of interest is identified given the refined graph (backdoor, front-door, ID algorithm)
- Estimate — use TMLE, AIPW, or the estimators in earlier chapters
Causal discovery narrows the space of possible structures; domain knowledge finishes the job.
Sample size requirements
L-MARVEL’s Gaussian MB estimator requires \(n > p + 1\) samples. With many variables and limited data, fall back to CI-test-based MB estimation by passing a custom mb_estimator to l_marvel.
Going further: orientation in the latent case
FCI’s PAG output distinguishes definite causes (\(\to\)), definite hidden common causes (\(\leftrightarrow\)), and uncertain cases (\(\circ\)). The PAG can be combined with background knowledge such as temporal ordering to further orient edges before estimation.
23.8 How Much Does Causal Discovery Help in Economics?
An honest assessment: recovering even the true MAG skeleton is far from recovering the full DAG. Several layers of information remain missing:
| What is recovered | What is still missing |
|---|---|
| L-MARVEL skeleton | Edge directions; direct cause vs. hidden common cause; latent nodes |
| FCI PAG | Unique MAG; latent nodes; most edges still carry \(\circ\) marks |
| True MAG | Latent nodes and their connections |
| Full DAG | Nothing — this is the goal |
The core econometric question is almost always “what is the causal effect of \(X\) on \(Y\)?” Causal discovery with latent variables does not answer this directly. A definite \(X \leftrightarrow Y\) in a PAG tells you confounding exists — but not how to remove it. Without oriented edges you cannot even check whether the effect is identified, let alone estimate it.
Where these methods do add value in economics:
Falsifying structural models. If your model implies \(X \perp\!\!\!\perp Z \mid W\), test it. A rejection means the model’s independence assumptions are inconsistent with the data — discovery tells you which assumptions fail before you commit to a full structural estimation.
Flagging where instruments are needed. A definite \(X \leftrightarrow Y\) in the PAG is a data-driven diagnostic: OLS of \(Y\) on \(X\) is biased, and you need an instrument, proxy, or natural experiment. Discovery does not supply the instrument but tells you where to look for one.
High-dimensional variable selection. With many candidate controls, knowing which pairs are conditionally independent prunes the problem before applying identification strategies. This is most useful in settings with \(p > 20\) variables where theory does not specify the full graph.
Hypothesis generation when theory is silent. For new economic phenomena — platform markets, fintech, peer effects in novel settings — where theory does not give a strong causal ordering, discovery algorithms generate hypotheses worth investigating with better-powered research designs.
The bottom line: causal discovery is a complement to the standard econometric toolkit — IV, RD, DiD, synthetic control — not a substitute. It is most useful as a diagnostic and hypothesis-generating tool early in a research project, before the hard work of finding credible identifying variation begins.