This vignette demonstrates advanced usage of the
causaldef package, focusing on diagnostics for unobserved
confounding and sensitivity analysis. We utilize real data incorporated
in the package to showcase these features.
library(causaldef)
library(ggplot2)
# DAG Helper
plot_dag <- function(coords, edges, title = NULL) {
edges_df <- merge(edges, coords, by.x = "from", by.y = "name")
colnames(edges_df)[c(3,4)] <- c("x_start", "y_start")
edges_df <- merge(edges_df, coords, by.x = "to", by.y = "name")
colnames(edges_df)[c(5,6)] <- c("x_end", "y_end")
ggplot2::ggplot(coords, ggplot2::aes(x = x, y = y)) +
ggplot2::geom_segment(data = edges_df, ggplot2::aes(x = x_start, y = y_start, xend = x_end, yend = y_end),
arrow = ggplot2::arrow(length = ggplot2::unit(0.3, "cm"), type = "closed"),
color = "gray40", size = 1, alpha = 0.8) +
ggplot2::geom_point(size = 14, color = "white", fill = "#E7B800", shape = 21, stroke = 1.5) + # different color
ggplot2::geom_text(ggplot2::aes(label = name), fontface = "bold", size = 3, color = "black") +
ggplot2::ggtitle(title) +
ggplot2::theme_void(base_size = 14) +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, face = "bold", margin = ggplot2::margin(b = 10))) +
ggplot2::coord_fixed()
}Unobserved confounding is a major threat to causal inference. The
causaldef package implements “Negative Control Outcomes” to
detect residual confounding. A negative control outcome is a variable
that is affected by unobserved confounders but is known not to be
causally affected by the treatment.
We will use the gene_perturbation dataset provided in
the package. This dataset comes from a CRISPR knockout experiment.
knockout_status): Whether a
gene was knocked out (Treatment) or not (Control).target_expression):
Expression level of the target gene.library_size and batch.housekeeping_gene):
A housekeeping gene that should not be affected by the specific
knockout but shares technical confounders.data("gene_perturbation")
head(gene_perturbation)
#> sample_id batch library_size knockout_status target_expression
#> 1 S1 3 948877 Knockout 8.643
#> 2 S2 4 984900 Control 11.545
#> 3 S3 4 868332 Control 12.325
#> 4 S4 2 1052105 Control 12.941
#> 5 S5 2 880271 Control 8.289
#> 6 S6 2 1001246 Knockout 9.400
#> housekeeping_gene
#> 1 6.594
#> 2 9.424
#> 3 8.568
#> 4 9.224
#> 5 6.380
#> 6 6.706We first specify the causal model, explicitly identifying the negative control outcome.
spec_nc <- causal_spec(
data = gene_perturbation,
treatment = "knockout_status",
outcome = "target_expression",
covariates = c("library_size", "batch"),
negative_control = "housekeeping_gene"
)
#> ✔ Created causal specification: n=500, 2 covariate(s)
# Visualize the Structural Assumption
coords <- data.frame(
name = c("Unobserved", "Treatment", "Target", "Housekp"),
x = c(0, -1.5, 1.5, 0),
y = c(1, 0, 0, -1)
)
edges <- data.frame(
from = c("Unobserved", "Unobserved", "Unobserved", "Treatment"),
to = c("Treatment", "Target", "Housekp", "Target")
)
plot_dag(coords, edges, title = "Why Negative Controls Work:\nShared Unobserved Confounding")
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` instead.
#> This warning is displayed once per session.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.Now we run the diagnostic using nc_diagnostic(). We test
if the Inverse Probability of Treatment Weighting (IPTW) method
successfully removes confounding bias.
set.seed(123)
# Check if IPTW approach effectively removes confounding using the negative control
nc_res <- nc_diagnostic(spec_nc, method = "iptw", n_boot = 100)
#> ℹ Using kappa = 1 (conservative). Consider domain-specific estimation or sensitivity analysis via kappa_range.
#> ✔ No evidence against causal assumptions (p = 0.14851 )
print(nc_res)
#>
#> -- Negative Control Diagnostic ----------------------------------------
#>
#> * screening statistic (weighted corr): 0.0821
#> * delta_NC (association proxy): 0.0821
#> * delta bound (under kappa alignment): 0.0821 (kappa = 1 )
#> * screening p-value: 0.14851
#> * screening method: weighted_permutation_correlation
#>
#> RESULT: NOT REJECTED. This is a screening result, not proof that confounding is absent.
#> NOTE: Your effect estimate must exceed the Noise Floor (delta_bound) to be meaningful.Interpreting Negative Control Diagnostics:
This diagnostic tests a crucial assumption: “Did my adjustment remove ALL confounding?”
Key Results to Check:
0.05: [PASS] – No evidence that treatment affects the negative control after adjustment
Scientific Implications:
The negative control is powerful because:
If falsified:
If the diagnostic fails (as it might with unobserved batch effects not fully captured), it indicates that your current adjustment set is insufficient to block all back-door paths to the negative control (and likely the outcome).
Recommended Actions:
confounding_frontier() to
determine how strong this confounding would need to be to invalidate
your results.When we suspect unobserved confounding, we can quantify how sensitive
our causal conclusions are to potential unmeasured confounders \(U\). The
confounding_frontier() function maps the “deficiency”
(information loss) as a function of the strength of confounding paths
\(U \to A\) (\(\alpha\)) and \(U
\to Y\) (\(\gamma\)).
We will demonstrate this using the hct_outcomes dataset
(Hematopoietic Cell Transplantation).
data("hct_outcomes")
# Convert treatment to numeric for the Gaussian model approximation
hct_outcomes$conditioning_numeric <- as.numeric(hct_outcomes$conditioning_intensity) - 1
# Simplified specification focusing on conditioning intensity
spec_sens <- causal_spec(
data = hct_outcomes,
treatment = "conditioning_numeric",
outcome = "time_to_event", # Simplified for this example
covariates = c("age", "disease_status")
)
#> ✔ Created causal specification: n=800, 2 covariate(s)We compute the confounding frontier. This does not require observing \(U\), but rather exploring the space of hypothetical confounding strengths.
frontier <- confounding_frontier(
spec_sens,
alpha_range = c(-1, 1),
gamma_range = c(-1, 1),
grid_size = 40
)
#> ℹ Computing benchmarks for observed covariates...
#> ✔ Computed confounding frontier: 40x40 grid
# The result contains a grid we can visualize
head(frontier$grid)
#> alpha gamma delta
#> 1 -1.0000000 -1 0.01464513
#> 2 -0.9487179 -1 0.01529749
#> 3 -0.8974359 -1 0.01600208
#> 4 -0.8461538 -1 0.01676353
#> 5 -0.7948718 -1 0.01758630
#> 6 -0.7435897 -1 0.01847429We can visualize the regions where deficiency is high (problematic) vs low.
ggplot(frontier$grid, aes(x = alpha, y = gamma, fill = delta)) +
geom_tile() +
scale_fill_viridis_c(option = "magma", direction = -1) +
labs(
title = "Confounding Frontier",
x = "Confounding Strength U -> A (alpha)",
y = "Confounding Strength U -> Y (gamma)",
fill = "Deficiency\n(Delta)"
) +
theme_minimal() +
geom_contour(aes(z = delta), color = "white", breaks = c(0.01, 0.05, 0.1))
#> Warning: The following aesthetics were dropped during statistical transformation: fill.
#> ℹ This can happen when ggplot fails to infer the correct grouping structure in
#> the data.
#> ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
#> variable into a factor?Interpreting the Confounding Frontier:
This visualization maps the “danger zones” of unobserved confounding:
The Axes:
Reading the Plot:
Dark regions (low \(\delta\) \(\approx\) 0):
Bright regions (high \(\delta\) > 0.15):
Contour lines (white lines at \(\delta\) = 0.01, 0.05, 0.1):
Scientific Decision-Making:
Use domain knowledge to assess plausibility:
Example Interpretation:
“Our sensitivity analysis shows that if unobserved confounding U
-> A and U -> Y are both weak (|\(\alpha\)|, |\(\gamma\)| < 0.2), the deficiency remains
below 0.05, suggesting robust identification. However, if either
confounding strength exceeds 0.5, identification breaks down (\(\delta\) > 0.1). Given the biological
plausibility of weak confounding in this system, we conclude our
estimates are reasonably robust.”
The confounding frontier shows the boundary: if we are outside the “safe” zone defined by the frontier, we cannot claim causal identification without further assumptions (see Akdemir 2026, DOI: 10.5281/zenodo.18367347 for theoretical details).
The causaldef package provides tools not just for
estimation, but for falsification and
sensitivity analysis.