M-estimation, influence functions, and semiparametric efficiency theory for causal inference
Derives asymptotic properties of estimators including influence functions and semiparametric efficiency bounds.
Rigorous framework for statistical inference and efficiency in modern methodology
Use this skill when working on: asymptotic properties of estimators, influence functions, semiparametric efficiency, double robustness, variance estimation, confidence intervals, hypothesis testing, M-estimation, or deriving limiting distributions.
Cramér-Rao Lower Bound: For any unbiased estimator, $$\text{Var}(\hat{\theta}) \geq \frac{1}{nI(\theta)}$$
where $I(\theta)$ is the Fisher information.
Semiparametric Efficiency Bound: The variance of the efficient influence function: $$V_{eff} = E[\phi^*(\theta_0)^2]$$
where $\phi^*$ is the efficient influence function (EIF).
Influence Function Notation: $IF(O; \theta, P)$ represents the influence of observation $O$ on parameter $\theta$ under distribution $P$: $$IF(O; \theta, P) = \lim_{\epsilon \to 0} \frac{T((1-\epsilon)P + \epsilon \delta_O) - T(P)}{\epsilon}$$
Semiparametric Variance: For regular asymptotically linear (RAL) estimators, $$\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, E[IF(O)^2])$$
Estimating Equations: M-estimators solve $\sum_{i=1}^n \psi(O_i; \theta) = 0$, with asymptotic variance: $$V = \left(\frac{\partial}{\partial \theta} E[\psi(O; \theta)]\right)^{-1} E[\psi(O; \theta)\psi(O; \theta)^T] \left(\frac{\partial}{\partial \theta} E[\psi(O; \theta)]\right)^{-T}$$
| Estimand | Efficient Influence Function | Efficiency Bound |
|---|---|---|
| ATE | $\phi_{ATE} = \frac{A}{\pi}(Y-\mu_1) - \frac{1-A}{1-\pi}(Y-\mu_0) + \mu_1 - \mu_0 - \psi$ | $V_{ATE} = E[\phi_{ATE}^2]$ |
| NDE | Complex (VanderWeele & Tchetgen, 2014) | Higher than ATE |
| NIE | Complex (VanderWeele & Tchetgen, 2014) | Higher than ATE |
```r
# Compute semiparametric efficiency bound
compute_efficiency_bound <- function(data, estimand = "ATE") {
  n <- nrow(data)
  if (estimand == "ATE") {
    # Estimate nuisance functions
    ps_model <- glm(A ~ X, data = data, family = binomial)
    pi_hat <- predict(ps_model, type = "response")
    mu1_model <- lm(Y ~ X, data = subset(data, A == 1))
    mu0_model <- lm(Y ~ X, data = subset(data, A == 0))
    mu1_hat <- predict(mu1_model, newdata = data)
    mu0_hat <- predict(mu0_model, newdata = data)
    # Plug-in estimate and efficient influence function
    psi_hat <- mean(mu1_hat - mu0_hat)
    phi <- with(data, {
      A / pi_hat * (Y - mu1_hat) -
        (1 - A) / (1 - pi_hat) * (Y - mu0_hat) +
        mu1_hat - mu0_hat - psi_hat
    })
    # Efficiency bound = variance of the EIF
    list(
      efficiency_bound = var(phi),
      standard_error = sqrt(var(phi) / n),
      eif_values = phi
    )
  }
}
```
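A minimal usage sketch on simulated data (the data-generating process, seed, and true ATE of 2 are purely illustrative; the function assumes columns named `X`, `A`, `Y`):

```r
# Simulate a simple observational dataset and apply the function
set.seed(42)
n <- 2000
X <- rnorm(n)
A <- rbinom(n, 1, plogis(0.5 * X))   # treatment depends on X
Y <- 1 + 2 * A + X + rnorm(n)        # true ATE = 2
dat <- data.frame(X = X, A = A, Y = Y)

res <- compute_efficiency_bound(dat, estimand = "ATE")
res$efficiency_bound   # estimated variance of the EIF
res$standard_error     # plug-in standard error for the ATE estimate
```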
Empirical Process: $\mathbb{G}_n(f) = \sqrt{n}(\mathbb{P}_n - P)f = \frac{1}{\sqrt{n}}\sum_{i=1}^n (f(O_i) - Pf)$
Uniform Convergence: For function class $\mathcal{F}$, $$\sup_{f \in \mathcal{F}} |\mathbb{G}_n(f)| \xrightarrow{d} \sup_{f \in \mathcal{F}} |\mathbb{G}(f)|$$
where $\mathbb{G}$ is a Gaussian process.
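As a concrete illustration, for the indicator class $\mathcal{F} = \{\mathbf{1}(\cdot \leq t)\}$ the supremum $\sup_f |\mathbb{G}_n(f)|$ is the Kolmogorov–Smirnov statistic, whose limit is the supremum of a Brownian bridge. A quick Monte Carlo sketch (the sample size and grid resolution are arbitrary choices):

```r
# sup over the indicator class {1(x <= t)} equals the KS statistic sqrt(n)*D_n
set.seed(1)
n <- 500
sup_Gn <- replicate(2000, {
  u <- runif(n)
  t_grid <- seq(0, 1, length.out = 200)
  emp <- sapply(t_grid, function(t) mean(u <= t))
  sqrt(n) * max(abs(emp - t_grid))   # |G_n(f)| maximized over the grid
})
# Compare to the Kolmogorov limit (95% critical value approx. 1.358)
quantile(sup_Gn, 0.95)
```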
| Measure | Definition | Use |
|---|---|---|
| VC dimension | Max shattered set size | Classification |
| Covering number | $N(\epsilon, \mathcal{F}, \Vert\cdot\Vert)$ | General classes |
| Bracketing number | $N_{[\,]}(\epsilon, \mathcal{F}, L_2)$ | Entropy bounds |
| Rademacher complexity | $\mathcal{R}_n(\mathcal{F}) = E[\sup_{f \in \mathcal{F}} \lvert \frac{1}{n}\sum_i \epsilon_i f(X_i) \rvert]$ | Generalization bounds |
```r
# Estimate Rademacher complexity via Monte Carlo
estimate_rademacher <- function(f_class, data, n_reps = 1000) {
  n <- nrow(data)
  sup_values <- replicate(n_reps, {
    # Random Rademacher signs
    epsilon <- sample(c(-1, 1), n, replace = TRUE)
    # Supremum of the symmetrized empirical average over the (finite) class
    max(sapply(f_class, function(f) abs(mean(epsilon * f(data)))))
  })
  mean(sup_values)
}
```
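For example, with a small class of threshold indicators (the thresholds, simulated data, and column name `X` are assumptions of this sketch):

```r
# A finite class of threshold indicators f_t(x) = 1(x <= t)
thresholds <- seq(-2, 2, by = 0.5)
f_class <- lapply(thresholds, function(t) function(d) as.numeric(d$X <= t))

dat <- data.frame(X = rnorm(200))
estimate_rademacher(f_class, dat, n_reps = 500)  # O(1/sqrt(n)) for a finite class
```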
A function class $\mathcal{F}$ is Donsker if $\mathbb{G}_n \rightsquigarrow \mathbb{G}$ in $\ell^\infty(\mathcal{F})$, where $\mathbb{G}$ is a tight Gaussian process.
| Class | Description | Application |
|---|---|---|
| VC classes | Finite VC dimension | Classification functions |
| Smooth functions | Bounded derivatives | Regression estimators |
| Monotone functions | Uniformly bounded, monotone | Distribution functions |
| Lipschitz functions | Bounded Lipschitz constant | M-estimators |
For M-estimation: If $\psi(O, \theta)$ belongs to a Donsker class, then $$\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, V)$$
where $V = (\partial_\theta E[\psi])^{-1} \text{Var}(\psi) (\partial_\theta E[\psi])^{-T}$
```r
# Heuristic numerical check of the Donsker entropy condition
# NOTE: estimate_bracketing_number() is a user-supplied helper (not defined here)
check_donsker_conditions <- function(psi_class, data) {
  # Estimate the bracketing entropy over a grid of epsilon values
  epsilon_grid <- seq(0.01, 1, by = 0.01)
  bracket_numbers <- sapply(epsilon_grid, function(eps) {
    estimate_bracketing_number(psi_class, data, eps)  # N_[](eps, F, L_2)
  })
  # Interpolate sqrt(log N_[]) and integrate over the grid; start at the
  # smallest grid point, since the entropy typically diverges as eps -> 0
  entropy_fun <- approxfun(epsilon_grid, sqrt(log(pmax(bracket_numbers, 1))), rule = 2)
  entropy_integral <- integrate(entropy_fun, lower = min(epsilon_grid), upper = 1)
  # Donsker (sufficient condition) if the entropy integral is finite
  list(
    is_donsker = is.finite(entropy_integral$value),
    entropy_integral = entropy_integral$value,
    bracket_numbers = data.frame(epsilon = epsilon_grid, N = bracket_numbers)
  )
}
```
```
Estimator θ̂ₙ → Consistency → Asymptotic Normality → Efficiency → Inference
                    ↓                   ↓                 ↓           ↓
               θ̂ₙ →ᵖ θ₀    √n(θ̂ₙ−θ₀) →ᵈ N(0,V)      V = V_eff   CIs, tests
```
$X_n \xrightarrow{p} X$ if $\forall \epsilon > 0$: $P(|X_n - X| > \epsilon) \to 0$
Consistency: $\hat{\theta}_n \xrightarrow{p} \theta_0$
$X_n \xrightarrow{d} X$ if $F_{X_n}(x) \to F_X(x)$ at all continuity points
Asymptotic normality: $\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, V)$
$X_n \xrightarrow{a.s.} X$ if $P(\lim_{n\to\infty} X_n = X) = 1$
Relationship: $\xrightarrow{a.s.} \Rightarrow \xrightarrow{p} \Rightarrow \xrightarrow{d}$
| Notation | Meaning | Example |
|---|---|---|
| $O_p(1)$ | Bounded in probability | $\hat{\theta}_n = O_p(1)$ |
| $o_p(1)$ | Converges to 0 in probability | $\hat{\theta}_n - \theta_0 = o_p(1)$ |
| $O_p(a_n)$ | $X_n/a_n = O_p(1)$ | $\hat{\theta}_n - \theta_0 = O_p(n^{-1/2})$ |
| $o_p(a_n)$ | $X_n/a_n = o_p(1)$ | Remainder terms |
Weak LLN: If $X_1, \ldots, X_n$ iid with $E|X| < \infty$: $$\bar{X}_n \xrightarrow{p} E[X]$$
Strong LLN: If $X_1, \ldots, X_n$ iid with $E|X| < \infty$: $$\bar{X}_n \xrightarrow{a.s.} E[X]$$
Uniform LLN: For $\sup_{\theta \in \Theta}$ convergence, need additional conditions (compactness, envelope).
Classical CLT: If $X_1, \ldots, X_n$ iid with $E[X] = \mu$, $Var(X) = \sigma^2 < \infty$: $$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$$
Lindeberg-Feller CLT: For a triangular array $\{X_{ni}\}$ of independent, mean-zero variables with $\sum_{i=1}^n \text{Var}(X_{ni}) \to \sigma^2$, if for every $\epsilon > 0$ $$\sum_{i=1}^n E[X_{ni}^2 \mathbf{1}(|X_{ni}| > \epsilon)] \to 0,$$ then $\sum_{i=1}^n X_{ni} \xrightarrow{d} N(0, \sigma^2)$.
Multivariate CLT: $$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \Sigma)$$
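A quick Monte Carlo check of the classical CLT; the exponential distribution is an arbitrary skewed choice, showing the standardized mean is approximately normal despite non-normal data:

```r
# sqrt(n)*(mean - mu) should be approximately N(0, sigma^2)
set.seed(7)
n <- 400
z <- replicate(5000, sqrt(n) * (mean(rexp(n, rate = 1)) - 1))  # mu = sigma^2 = 1
c(mean = mean(z), var = var(z))   # both should be near 0 and 1
qqnorm(z); qqline(z)              # approximate normality despite skewness
```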
Slutsky's theorem: If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$ (constant), then $X_n + Y_n \xrightarrow{d} X + c$ and $Y_n X_n \xrightarrow{d} cX$.
Continuous mapping: If $X_n \xrightarrow{d} X$ and $g$ is continuous: $$g(X_n) \xrightarrow{d} g(X)$$
Delta method: If $\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, V)$ and $g$ is differentiable at $\theta_0$: $$\sqrt{n}(g(\hat{\theta}_n) - g(\theta_0)) \xrightarrow{d} N(0, g'(\theta_0)^\top V g'(\theta_0))$$
Multivariate: Replace $g'(\theta_0)$ with Jacobian matrix.
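A sketch verifying the delta method for $g(\theta) = \log\theta$ with $\hat{\theta} = \bar{X}_n$ (all parameter values are illustrative):

```r
# Delta method: Var(log(Xbar)) ~ g'(mu)^2 * sigma^2 / n with g'(mu) = 1/mu
set.seed(3)
n <- 500; mu <- 2; sigma <- 1
g_hat <- replicate(5000, log(mean(rnorm(n, mu, sigma))))
var(g_hat) * n          # simulated asymptotic variance of g(theta_hat)
(1 / mu)^2 * sigma^2    # delta-method prediction: should match closely
```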
Estimator $\hat{\theta}_n$ solves: $$\hat{\theta}_n = \arg\max_{\theta \in \Theta} M_n(\theta)$$
where $M_n(\theta) = n^{-1} \sum_{i=1}^n m(O_i; \theta)$
Result (consistency): If $M(\theta) = E[m(O; \theta)]$ is uniquely maximized at $\theta_0$ and $M_n$ converges to $M$ uniformly over $\Theta$, then $\hat{\theta}_n \xrightarrow{p} \theta_0$.
Result (asymptotic normality): $$\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, [-\ddot{M}(\theta_0)]^{-1} V [-\ddot{M}(\theta_0)]^{-1})$$ where $\ddot{M}$ is the second derivative of $M$ and $V = \text{Var}(\partial_\theta m(O; \theta_0))$.
Sandwich estimator: $$\hat{V} = \hat{A}^{-1} \hat{B} \hat{A}^{-\top}$$
where $\hat{A} = -\mathbb{P}_n[\partial_\theta \psi(O; \hat{\theta})]$ and $\hat{B} = \mathbb{P}_n[\psi(O; \hat{\theta})\psi(O; \hat{\theta})^\top]$.
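A minimal hand-rolled sketch for logistic regression, where the estimating function is the score $\psi_i = x_i(y_i - p_i)$; the simulated data and coefficients are illustrative:

```r
# Hand-rolled sandwich variance A^{-1} B A^{-T} for a logistic GLM
set.seed(9)
n <- 1000
x <- cbind(1, rnorm(n))
y <- rbinom(n, 1, plogis(x %*% c(-0.5, 1)))
fit <- glm(y ~ x - 1, family = binomial)
p <- fitted(fit)

psi <- x * (y - p)                             # n x 2 matrix of scores
A <- crossprod(x, x * (p * (1 - p))) / n       # minus the mean score derivative
B <- crossprod(psi) / n                        # variance of the score
V_sandwich <- solve(A) %*% B %*% solve(A) / n  # Var(theta_hat)
sqrt(diag(V_sandwich))                         # robust SEs; compare summary(fit)
```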
The influence function of a functional $T(P)$ at distribution $P$ is: $$\phi(o) = \lim_{\epsilon \to 0} \frac{T((1-\epsilon)P + \epsilon \delta_o) - T(P)}{\epsilon}$$
where $\delta_o$ is point mass at $o$.
| Functional | Influence Function |
|---|---|
| Mean $E[Y]$ | $\phi(y) = y - E[Y]$ |
| Variance $Var(Y)$ | $\phi(y) = (y - \mu)^2 - \sigma^2$ |
| Quantile $Q_p$ | $\phi(y) = \frac{p - \mathbf{1}(y \leq Q_p)}{f(Q_p)}$ |
| Regression coefficient | $\phi = E[XX^\top]^{-1} X(Y - X^\top\beta)$ |
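These entries can be checked numerically. A sketch for the variance functional, comparing the Monte Carlo variance of $\hat{\sigma}^2$ with the IF-based estimate $\text{Var}(\hat{\phi})/n$ (the Gamma data-generating choice is illustrative):

```r
# Check Var(sigma2_hat) against Var(phi)/n, where phi = (y - mu)^2 - sigma^2
set.seed(11)
n <- 500
one_draw <- function() {
  y <- rgamma(n, shape = 2)            # skewed data, so this is a real test
  s2 <- mean((y - mean(y))^2)          # plug-in variance estimate
  phi <- (y - mean(y))^2 - s2          # estimated IF values
  c(est = s2, if_var = var(phi) / n)
}
sims <- replicate(4000, one_draw())
var(sims["est", ])        # Monte Carlo variance of the estimator
mean(sims["if_var", ])    # average IF-based variance estimate (should agree)
```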
Method 1: Gateaux derivative (definition)
Method 2: Estimating equation approach. If $\hat{\theta}$ solves $\mathbb{P}_n[\psi(O; \theta)] = 0$, then: $$\phi(O) = -E[\partial_\theta \psi]^{-1} \psi(O; \theta_0)$$
Method 3: Functional delta method For $\psi = g(T_1, T_2, \ldots)$: $$\phi_\psi = \sum_j \frac{\partial g}{\partial T_j} \phi_{T_j}$$
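As a worked instance of Method 1, for the mean $T(P) = E_P[Y]$ the Gateaux derivative can be computed directly, recovering the table entry above:

$$T((1-\epsilon)P + \epsilon\delta_o) = (1-\epsilon)E_P[Y] + \epsilon\, o \quad\Longrightarrow\quad \phi(o) = \frac{\partial}{\partial \epsilon}\Big[(1-\epsilon)E_P[Y] + \epsilon\, o\Big]\Big|_{\epsilon=0} = o - E_P[Y]$$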
Model $\mathcal{P}$ contains distributions satisfying: $$\theta = \Psi(P), \quad P \in \mathcal{P}$$
The "nuisance" is infinite-dimensional (e.g., unknown baseline distribution).
Parametric submodels: One-dimensional smooth paths $\{P_t : t \in \mathbb{R}\}$ through $P_0$.
Score: $S = \partial_t \log p_t \big|_{t=0}$
Tangent space $\mathcal{T}$: Closed linear span of all such scores.
The efficient influence function (EIF) is the projection of any influence function onto the tangent space.
Semiparametric efficiency bound: $$V_{eff} = E[\phi_{eff}(O)^2]$$
No regular estimator can have asymptotic variance smaller than $V_{eff}$.
An estimator is semiparametrically efficient if its influence function equals the EIF: $$\phi_{\hat{\theta}} = \phi_{eff}$$
Strategies: one-step correction (add the empirical mean of the estimated EIF to a plug-in estimator), targeted maximum likelihood estimation (TMLE), and solving the EIF estimating equation directly.
An estimator is doubly robust if it is consistent when either the outcome regression model or the propensity score model is correctly specified (not necessarily both).
For ATE $\psi = E[Y(1) - Y(0)]$:
$$\hat{\psi}_{DR} = \mathbb{P}_n\left[\frac{A(Y - \hat{\mu}_1(X))}{\hat{\pi}(X)} + \hat{\mu}_1(X)\right] - \mathbb{P}_n\left[\frac{(1-A)(Y - \hat{\mu}_0(X))}{1-\hat{\pi}(X)} + \hat{\mu}_0(X)\right]$$
where $\hat{\pi}(X)$ is the estimated propensity score $P(A = 1 \mid X)$ and $\hat{\mu}_a(X)$ is the estimated outcome regression $E[Y \mid A = a, X]$.
Bias decomposition: $$\hat{\psi}_{DR} - \psi = \text{(outcome error)} \times \text{(propensity error)} + o_p(n^{-1/2})$$
If either error is zero, bias is zero.
When both models are correct: the estimator is consistent and attains the semiparametric efficiency bound (its influence function is the EIF).
When one model is wrong: the estimator remains consistent, but it is generally no longer efficient and the usual IF-based variance estimator may be incorrect.
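A minimal AIPW sketch implementing the display above with parametric nuisance models (purely illustrative; in practice flexible learners with cross-fitting are common), returning the estimate and its IF-based standard error:

```r
# Doubly robust (AIPW) estimate of the ATE for a data.frame with columns X, A, Y
aipw_ate <- function(data) {
  pi_hat  <- predict(glm(A ~ X, data = data, family = binomial), type = "response")
  mu1_hat <- predict(lm(Y ~ X, data = subset(data, A == 1)), newdata = data)
  mu0_hat <- predict(lm(Y ~ X, data = subset(data, A == 0)), newdata = data)
  # Uncentered EIF: its mean is the AIPW estimate, its SD gives the SE
  phi <- with(data,
    A / pi_hat * (Y - mu1_hat) - (1 - A) / (1 - pi_hat) * (Y - mu0_hat) +
      mu1_hat - mu0_hat)
  c(estimate = mean(phi), se = sd(phi) / sqrt(nrow(data)))
}
```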
$$\hat{V} = \frac{1}{n} \sum_{i=1}^n \hat{\phi}(O_i)^2$$
where $\hat{\phi}$ is the estimated influence function.
Nonparametric bootstrap: resample $n$ observations with replacement, recompute $\hat{\theta}$ on each resample, repeat $B$ times, and use the empirical distribution of the bootstrap estimates for inference.
Bootstrap validity: Requires $\sqrt{n}$-consistent, regular estimators.
Influence-function (multiplier) bootstrap: more stable than full recomputation: $$\hat{\theta}^{*b} = \hat{\theta} + n^{-1} \sum_{i=1}^n (W_i^* - 1) \hat{\phi}(O_i)$$
where $W_i^*$ are bootstrap weights.
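A sketch of this multiplier bootstrap using multinomial weights (equivalent in distribution to resampling indices); the mean functional at the end is just a simple test case:

```r
# Influence-function bootstrap: perturb theta_hat by reweighted IF values
if_bootstrap <- function(theta_hat, phi, B = 2000) {
  n <- length(phi)
  replicate(B, {
    w <- as.vector(rmultinom(1, size = n, prob = rep(1 / n, n)))  # bootstrap weights
    theta_hat + mean((w - 1) * phi)    # n^{-1} * sum (W_i - 1) * phi_i
  })
}
# Example: bootstrap CI for a mean, where phi_i = y_i - ybar
set.seed(5)
y <- rexp(300); phi <- y - mean(y)
draws <- if_bootstrap(mean(y), phi)
quantile(draws, c(0.025, 0.975))   # percentile interval
```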
Wald interval: $$\hat{\theta} \pm z_{1-\alpha/2} \cdot \hat{SE}$$
Percentile bootstrap: $$[\hat{\theta}^*_{(\alpha/2)}, \hat{\theta}^*_{(1-\alpha/2)}]$$
BCa bootstrap (bias-corrected accelerated): Corrects for bias and skewness.
Wald test: $W = (\hat{\theta} - \theta_0)^2 / \widehat{\text{Var}}(\hat{\theta}) \xrightarrow{d} \chi^2_1$ under $H_0$
Score test: Based on the score evaluated at the null value $\theta_0$.
Likelihood ratio test: $2(\ell(\hat{\theta}) - \ell(\theta_0)) \xrightarrow{d} \chi^2_k$ under $H_0$, where $k$ is the number of restrictions
Mediation effect = $\alpha \beta$ (or $\alpha_1 \beta_1 \gamma_2$ for sequential)
Not normal: Product of normals is NOT normal.
Exact distribution: Complex (involves Bessel functions for two normals).
Approximations: Sobel's delta-method standard error, the distribution-of-the-product method, Monte Carlo simulation, and the bootstrap.
For $\psi = \alpha\beta$: $$Var(\hat{\alpha}\hat{\beta}) \approx \beta^2 Var(\hat{\alpha}) + \alpha^2 Var(\hat{\beta}) + Var(\hat{\alpha})Var(\hat{\beta})$$
The last term is often omitted (Sobel's formula) but matters when the effects are small.
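A quick numerical check of these variance formulas, treating $\hat{\alpha}$ and $\hat{\beta}$ as independent normals (all parameter values illustrative):

```r
# Compare the delta-method (Sobel-type) variance of alpha*beta to simulation
alpha <- 0.3; beta <- 0.2
se_a <- 0.10; se_b <- 0.15
ab <- rnorm(1e5, alpha, se_a) * rnorm(1e5, beta, se_b)
var(ab)                                               # Monte Carlo variance
beta^2 * se_a^2 + alpha^2 * se_b^2                    # Sobel (first-order) terms
beta^2 * se_a^2 + alpha^2 * se_b^2 + se_a^2 * se_b^2  # exact for independent normals
```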
For sequential mediation $\psi = \alpha_1 \beta_1 \gamma_2$, the first-order delta method (with independent estimates) gives: $$Var(\hat{\alpha}_1\hat{\beta}_1\hat{\gamma}_2) \approx (\beta_1\gamma_2)^2 Var(\hat{\alpha}_1) + (\alpha_1\gamma_2)^2 Var(\hat{\beta}_1) + (\alpha_1\beta_1)^2 Var(\hat{\gamma}_2)$$
Wrong: Treat estimated nuisance parameters $\hat{\eta}$ (e.g., fitted propensity or outcome models) as known when computing the variance. Right: Account for the uncertainty in $\hat{\eta}$, or use cross-fitting with an orthogonal (EIF-based) estimator.
For doubly robust estimators, need: $$|\hat{\mu} - \mu_0| \cdot |\hat{\pi} - \pi_0| = o_p(n^{-1/2})$$
If both nuisance estimates converge faster than $n^{-1/4}$, the product of their errors is $o_p(n^{-1/2})$.
Bootstrap can fail for: non-regular or non-smooth estimators (e.g., matching estimators), parameters on the boundary of the parameter space, and non-differentiable functionals.
The IF-based sandwich estimator assumes the influence function is correctly specified; under model misspecification the resulting variance can be wrong.
```latex
\begin{theorem}[Asymptotic Distribution]
Under Assumptions \ref{A1}--\ref{An}:
\begin{enumerate}
  \item (Consistency) $\hat{\theta}_n \xrightarrow{p} \theta_0$
  \item (Asymptotic normality) $\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{d} N(0, V)$
  \item (Variance) $V = E[\phi(O)^2]$ where $\phi$ is the influence function
  \item (Variance estimation) $\hat{V} \xrightarrow{p} V$
\end{enumerate}
\end{theorem}

\begin{proof}
\textbf{Step 1 (Consistency):}
[Apply M-estimation or direct argument]

\textbf{Step 2 (Expansion):}
Taylor expand around $\theta_0$:
\[
0 = \mathbb{P}_n[\psi(O; \hat{\theta})] = \mathbb{P}_n[\psi(O; \theta_0)]
  + \mathbb{P}_n[\dot{\psi}(\tilde{\theta})](\hat{\theta} - \theta_0)
\]

\textbf{Step 3 (Rearrangement):}
\[
\sqrt{n}(\hat{\theta} - \theta_0) = -[\mathbb{P}_n[\dot{\psi}]]^{-1} \sqrt{n}\,\mathbb{P}_n[\psi(O; \theta_0)]
\]

\textbf{Step 4 (CLT):}
$\sqrt{n}\mathbb{P}_n[\psi(O; \theta_0)] \xrightarrow{d} N(0, E[\psi\psi^\top])$ by the CLT.

\textbf{Step 5 (Slutsky):}
$\mathbb{P}_n[\dot{\psi}] \xrightarrow{p} E[\dot{\psi}]$ by the WLLN; apply Slutsky.

\textbf{Step 6 (Identify $V$):}
$V = E[\dot{\psi}]^{-1} E[\psi\psi^\top] E[\dot{\psi}]^{-\top}$.
\end{proof}
```
This skill works with:
Bickel, P.J., Klaassen, C.A.J., Ritov, Y. & Wellner, J.A. (1993). Efficient and Adaptive Estimation for Semiparametric Models
Newey, W.K. (1990). Semiparametric Efficiency Bounds
Robins, J.M., Rotnitzky, A. & Zhao, L.P. (1994). Estimation of Regression Coefficients When Some Regressors Are Not Always Observed
van der Vaart, A.W. (1998). Asymptotic Statistics
Tsiatis, A.A. (2006). Semiparametric Theory and Missing Data
Kennedy, E.H. (2016). Semiparametric Theory and Empirical Processes
van der Laan, M.J. & Rose, S. (2011). Targeted Learning
Version: 1.0 Created: 2025-12-08 Domain: Asymptotic Statistics, Semiparametric Inference