11 Bayesian Causal Inference

using DataFrames
using Distributions
using GLM
using Statistics
using Random
using LinearAlgebra
using Printf
using CairoMakie

Bayesian methods compute a full posterior distribution over the unknown quantities, providing credible intervals and natural uncertainty propagation. For causal inference, this means uncertainty about treatment effects can be reported at the individual level — not just aggregate ATEs — and downstream decisions (treatment recommendation, policy evaluation) can use the posterior directly.

The two workhorses in this area are:

BART for causal inference (Hill 2011) — Bayesian Additive Regression Trees as a flexible outcome model.
Bayesian Causal Forests (BCF; Hahn-Murray-Carvalho 2020) — BART extended to separate prognostic and treatment-effect components.

Both have mature R packages (BART, bcf). The Julia ecosystem currently lacks a comparable BART implementation; this chapter sketches the concepts with a simpler Bayesian linear-regression-based g-computation that captures the essential idea, and points to R for production work.

Related reading: For a hands-on Bayesian model with Numpyro, see the Using Numpyro chapter in Topics on Econometrics and Causal Inference. The R companion to this chapter walks through BART and BCF in detail.

11.1 Setup

Random.seed!(42)
n = 1500
p = 3

X = randn(n, p)
ps = @. 1 / (1 + exp(-(-0.3 + 0.5 * X[:, 1] + 0.3 * X[:, 2])))
D  = Float64.(rand(n) .< ps)
tau = @. 1 + 2 * X[:, 1]   # true CATE
Y0  = @. 0.5 * X[:, 2] + 0.3 * X[:, 3] + randn()
Y1  = Y0 .+ tau
Y   = ifelse.(D .== 1, Y1, Y0)

df = DataFrame(Y = Y, D = D, X1 = X[:, 1], X2 = X[:, 2], X3 = X[:, 3])
@printf("True ATE = %.3f\n", mean(tau))

11.2 Bayesian g-computation

The simplest Bayesian causal inference recipe is Bayesian g-computation:

Specify a Bayesian model for \(\mathbb{E}[Y \mid D, X]\).
Sample from the posterior of the regression parameters.
For each posterior draw, compute counterfactual outcomes \(\hat Y(1), \hat Y(0)\).
The posterior of the ATE is the distribution of \(\mathbb{E}[\hat Y(1) - \hat Y(0)]\) across draws.

A Bayesian linear regression with a flat prior implements this cleanly:

"""
    bayes_linreg(X, y; n_samples=2000)
Bayesian linear regression with a Normal-Inverse-Gamma conjugate prior
(flat → posterior matches frequentist OLS). Returns `n_samples` posterior
draws of (β, σ²).
"""
function bayes_linreg(X::Matrix, y::Vector; n_samples::Int = 2000)
    n, p = size(X)
    XtX_inv = inv(X' * X)
    beta_hat = XtX_inv * X' * y
    sse = sum(abs2.(y .- X * beta_hat))
    # σ² ~ InverseGamma((n-p)/2, sse/2)
    σ²_post = [rand(InverseGamma((n - p) / 2, sse / 2)) for _ in 1:n_samples]
    # β | σ² ~ MultivariateNormal(beta_hat, σ² * XtX_inv)
    betas = zeros(n_samples, p)
    for i in 1:n_samples
        cov_mat = σ²_post[i] .* XtX_inv
        betas[i, :] = rand(MvNormal(beta_hat, Symmetric(cov_mat)))
    end
    return betas, σ²_post
end

# Fit on (D, X1, X2, X3, D*X1, D*X2, D*X3) — allow treatment effect heterogeneity
DesignMat = hcat(ones(n), df.D, df.X1, df.X2, df.X3,
                 df.D .* df.X1, df.D .* df.X2, df.D .* df.X3)
betas, σ²_draws = bayes_linreg(DesignMat, df.Y; n_samples = 2000)

# Counterfactual prediction matrices
X_treat   = hcat(ones(n), ones(n),  df.X1, df.X2, df.X3,
                 df.X1, df.X2, df.X3)           # D = 1, so D*X = X
X_control = hcat(ones(n), zeros(n), df.X1, df.X2, df.X3,
                 zeros(n), zeros(n), zeros(n))  # D = 0

# Posterior ITEs: n_samples x n
ite_post = (X_treat * betas')' .- (X_control * betas')'
# ATE posterior: average over n at each draw
ate_post = vec(mean(ite_post, dims = 2))

@printf("Bayesian g-computation:\n")
@printf("  Posterior mean ATE:  %.3f\n", mean(ate_post))
@printf("  Posterior SD:        %.3f\n", std(ate_post))
@printf("  95%% CrI:             [%.3f, %.3f]\n",
        quantile(ate_post, 0.025), quantile(ate_post, 0.975))
@printf("  True ATE:            %.3f\n", mean(tau))

The posterior 95% credible interval reports uncertainty about the ATE directly. With a flat prior the credible interval matches the frequentist confidence interval; with informative priors they diverge, and the Bayesian interval reflects prior + data.

11.2.1 Individual-level credible intervals

ite_mean = vec(mean(ite_post, dims = 1))
ite_lo   = [quantile(ite_post[:, i], 0.025) for i in 1:n]
ite_hi   = [quantile(ite_post[:, i], 0.975) for i in 1:n]

# Sort by X1 for plotting
ord = sortperm(df.X1)

fig = Figure(size = (700, 400))
ax = Axis(fig[1, 1], xlabel = "X1", ylabel = "ITE",
          title = "Posterior ITEs vs truth")
band!(ax, df.X1[ord], ite_lo[ord], ite_hi[ord],
      color = (:steelblue, 0.2), label = "95% CrI")
scatter!(ax, df.X1, ite_mean, color = (:steelblue, 0.4), markersize = 3,
         label = "Posterior mean ITE")
lines!(ax, [-3, 3], [1 - 6, 1 + 6], color = :firebrick,
       linestyle = :dash, linewidth = 2, label = "True τ(x)")
axislegend(ax, position = :rb, framevisible = false)
fig

The blue ribbon shows pointwise 95% credible intervals; the red dashed line is the truth. Where the ribbon covers the truth, the posterior is well-calibrated.

11.3 Why use BART/BCF in R instead?

The Bayesian linear g-computation above suffices when:

The outcome model is well-approximated by a small number of interactions.
The effect heterogeneity is along covariates the analyst has pre-specified.

For genuinely flexible nonparametric Bayesian causal inference, R’s BART and bcf are the workhorses. They use sum-of-trees ensembles with weakly informative priors on tree structure, providing:

Automatic flexibility: no need to specify interactions.
Posterior over individual effects: native uncertainty at the unit level, including for ITEs not detectable by linear models.
BCF’s separation prior: regularises treatment-effect heterogeneity separately from prognostic variation, reducing spurious effects.

Compare the rough sketch:

Capability	Julia (this chapter)	R (BART/BCF)
Bayesian point estimate of ATE	✓	✓
Posterior credible intervals	✓	✓
Flexible nonparametric outcome model	(limited to linear + interactions)	✓
Native ITE posterior with regularisation	(linear-model heterogeneity only)	✓
Separation of prognostic and treatment effects	–	✓ (BCF)

For now, Julia users wanting the full flexibility of BART/BCF should either call out to R or write a domain-specific package using Turing.jl or Mamba.jl (the latter is a substantial project — a future contribution to the Julia ecosystem).

11.4 When to use Bayesian methods

Reason to use Bayesian	Reason to use Frequentist
Prior information matters	Asymptotic theory is well-understood
Need full posterior for downstream decisions	Computational cost matters at scale
Small sample, weak identification	Effects are well-identified
Want native individual-level CrIs	Existing frequentist toolkit suffices

The Bayesian approach offers outputs (posterior over individual effects, principled propagation of uncertainty into derived quantities) that are not natural for frequentist methods. For decision-theoretic applications — treatment-recommendation under cost-benefit trade-offs — the Bayesian posterior is the natural input.

11.5 Connections

The Heterogeneous Treatment Effects chapter covers frequentist meta-learners. BCF is the Bayesian counterpart to causal forests.
The Sensitivity Analysis chapter’s Cinelli-Hazlett bounds are frequentist; the Bayesian analog puts a prior on the unmeasured-confounder strength (McCandless et al. 2007).
For a hands-on Numpyro example of Bayesian modelling, see Using Numpyro.

11.6 Summary

Bayesian causal inference computes full posteriors over treatment effects, providing credible intervals for both aggregate and individual-level effects.
Bayesian g-computation with a Normal-Inverse-Gamma conjugate prior is implementable in Julia in ~30 lines and gives valid credible intervals for the ATE.
For nonparametric flexibility (BART, BCF), R’s BART and bcf packages remain the workhorses — see the R companion chapter.
For Numpyro-based Bayesian models in Python, see the Using Numpyro blog chapter.