From wber-skills
Organizes robustness checks for World Bank Economic Review manuscripts by identification threat and developing-country data-quality risks, helping structure a logical appendix rather than a mechanical checklist.
Slash command: /wber-skills:wber-robustness
- The headline result moves under reasonable alternative specifications
WBER referees are sophisticated about both econometric threats and the realities of developing-country data — surveys with recall and measurement error, administrative records with coverage gaps, sampling frames that miss the informal sector, attrition in panels. So robustness here has two axes: the standard identification-threat axis (does the estimate survive plausible violations of the design's key assumption?) and a data-quality axis (does the result survive how the data were actually constructed and measured?). Organize the section by threat, not by a checklist; each check should answer "if a skeptic believed X, would my conclusion change?"
| Threat the referee has in mind | The check that answers it |
|---|---|
| "Your design assumption is violated" | Design-specific sensitivity: honest-DiD bounds (parallel trends), bandwidth/donut (RD), Anderson–Rubin (weak IV), Oster δ / coefficient stability (selection on unobservables) |
| "It's driven by a few units/regions/years" | Leave-one-out (drop each cluster/region/wave); influential-observation checks |
| "Your key variable is mismeasured" | Alternative survey waves/sources; reconcile admin vs. survey; bound classical and non-classical measurement error |
| "The sample is selected / undercovers" | Reweight to a known population; bound for non-coverage of the informal/rural sector; differential-attrition bounds |
| "Inference is too optimistic" | Wild-cluster bootstrap (few clusters); spatial-HAC (Conley) for geographic correlation; multiple-hypothesis adjustment (Romano–Wolf / sharpened q-values) |
| "Results are p-hacked across specs" | Specification curve / multiverse showing the headline is modal, not cherry-picked |
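As a rough illustration of the "driven by a few units/regions/years" row, the sketch below re-estimates a simple treated-minus-control difference after dropping each cluster in turn. The function names (`diff_in_means`, `leave_one_cluster_out`) and the toy data are hypothetical; a real paper would pass its own estimator in place of the difference in means.

```python
from statistics import mean

def diff_in_means(rows):
    """Treated-minus-control mean outcome; rows are (cluster, treated, y)."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)

def leave_one_cluster_out(rows, estimator=diff_in_means):
    """Re-run the estimator once per cluster, each time dropping that cluster."""
    clusters = sorted({c for c, _, _ in rows})
    return {c: estimator([r for r in rows if r[0] != c]) for c in clusters}

# Hypothetical toy data: (region, treated, log consumption).
rows = [
    ("north", 1, 1.2), ("north", 0, 1.0),
    ("south", 1, 1.3), ("south", 0, 1.1),
    ("east", 1, 2.5), ("east", 0, 1.0),  # one region with an outsized gap
]

full = diff_in_means(rows)
loo = leave_one_cluster_out(rows)
for region, est in loo.items():
    print(f"drop {region}: {est:.3f} (full sample {full:.3f})")
```

In this toy case the estimate falls sharply only when "east" is dropped, which is the pattern the check is meant to surface and the kind of unit the text suggests flagging explicitly.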
Order matters for how a WBER referee reads the section: state in the main text which one or two checks are load-bearing, and relegate the mechanical remainder to the appendix (which still counts against the 40-page cap).
A poverty-targeting paper finds a transfer raises consumption by 11%. A referee suspects the result is an artifact of consumption being measured with a 7-day recall in treated rounds and a 30-day recall in control rounds. Rather than add a generic robustness row, the authors re-estimate within rounds that share a recall window, show the effect holds (10%, illustrative), and bound the recall-induced bias. They then run leave-one-region-out (effect stable except in one district they flag), wild-cluster bootstrap for the 14 clusters, and a specification curve showing the 11% is modal across deflator and outlier-trim choices. Each check is tied to a named skeptic.
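The wild-cluster bootstrap step in the example above could look roughly like the following. This is a minimal sketch, not the authors' procedure: it assumes a single regressor with an intercept, imposes the null of a zero slope, and uses Rademacher (plus or minus one) weights drawn once per cluster. The simulated data, the 0.11 effect size, and the function names are all hypothetical.

```python
import random

def slope_and_cluster_t(x, y, cluster):
    """OLS slope of y on x (with intercept) and its cluster-robust t-statistic."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    xd = [xi - xbar for xi in x]                      # demeaned regressor
    sxx = sum(v * v for v in xd)
    beta = sum(v * (yi - ybar) for v, yi in zip(xd, y)) / sxx
    resid = [(yi - ybar) - beta * v for v, yi in zip(xd, y)]
    scores = {}                                       # sum of x*residual within each cluster
    for v, e, g in zip(xd, resid, cluster):
        scores[g] = scores.get(g, 0.0) + v * e
    var = sum(s * s for s in scores.values()) / sxx ** 2
    return beta, beta / var ** 0.5

def wild_cluster_bootstrap_p(x, y, cluster, reps=999, seed=0):
    """Bootstrap-t p-value for slope = 0, flipping residual signs by cluster."""
    rng = random.Random(seed)
    _, t_obs = slope_and_cluster_t(x, y, cluster)
    ybar = sum(y) / len(y)
    u0 = [yi - ybar for yi in y]                      # residuals with slope = 0 imposed
    groups = sorted(set(cluster))
    exceed = 0
    for _ in range(reps):
        w = {g: rng.choice((-1.0, 1.0)) for g in groups}
        y_star = [ybar + w[g] * u for u, g in zip(u0, cluster)]
        _, t_star = slope_and_cluster_t(x, y_star, cluster)
        exceed += abs(t_star) >= abs(t_obs)
    return exceed / reps

# Hypothetical simulated data: 14 clusters, treatment assigned by cluster.
rng = random.Random(1)
x, y, cluster = [], [], []
for g in range(14):
    treated = 1.0 if g < 7 else 0.0
    shock = rng.gauss(0, 0.1)                         # cluster-level shock
    for _ in range(20):
        x.append(treated)
        y.append(0.11 * treated + shock + rng.gauss(0, 0.3))
        cluster.append(g)

beta, t_stat = slope_and_cluster_t(x, y, cluster)
p_boot = wild_cluster_bootstrap_p(x, y, cluster)
print(f"slope={beta:.3f}, cluster t={t_stat:.2f}, bootstrap p={p_boot:.3f}")
```

Comparing the bootstrap p-value with the one implied by the conventional cluster-robust t-statistic gives a sense of how much the small number of clusters matters. With covariates or fixed effects, an established implementation is likely a safer choice than extending this sketch.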
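WBER referees separate two things the appendix often conflates: stability of the point estimate across alternative specifications, and sensitivity of the conclusion to violations of the identifying assumption.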
Both belong in a WBER paper, but they answer different referee worries; label them as such. A long list of point-estimate-stable specifications does not address an identification-violation worry, and a single sensitivity bound does not show the result is not specification-mined.
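To make concrete what a sensitivity bound reports, as opposed to a list of stable specifications, here is a back-of-envelope sketch of the Oster-style coefficient-stability calculation. It uses the simplified approximation β* ≈ β_long − δ(β_short − β_long)(R²_max − R²_long)/(R²_long − R²_short), not the full estimator, so dedicated software may give somewhat different numbers. The function names and all input values are hypothetical.

```python
def oster_beta_star(beta_short, r2_short, beta_long, r2_long, r2_max, delta=1.0):
    """Approximate bias-adjusted coefficient under proportional selection delta."""
    scale = (r2_max - r2_long) / (r2_long - r2_short)
    return beta_long - delta * (beta_short - beta_long) * scale

def oster_delta_for_zero(beta_short, r2_short, beta_long, r2_long, r2_max):
    """delta at which the approximate adjusted coefficient would reach zero."""
    scale = (r2_max - r2_long) / (r2_long - r2_short)
    return beta_long / ((beta_short - beta_long) * scale)

# Hypothetical inputs: regression without controls (short) and with controls (long).
beta_short, r2_short = 0.14, 0.10
beta_long, r2_long = 0.11, 0.30
r2_max = min(1.0, 1.3 * r2_long)   # a commonly used rule of thumb, assumed here

beta_star = oster_beta_star(beta_short, r2_short, beta_long, r2_long, r2_max)
delta_zero = oster_delta_for_zero(beta_short, r2_short, beta_long, r2_long, r2_max)
print(f"adjusted beta at delta=1: {beta_star:.4f}")
print(f"delta needed to zero out the effect: {delta_zero:.2f}")
```

The output answers a different question from a specification table: how strong selection on unobservables would have to be, relative to selection on observables, to overturn the result.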
【Headline result】point estimate + inference
【Threats addressed】design-violation / few-units / measurement / coverage / inference / p-hacking
【Design sensitivity】honest-DiD / RD bandwidth / weak-IV / Oster δ
【Data-quality checks】recall/source/coverage/PPP/seasonality results
【Inference hardening】wild bootstrap / Conley / multiple-testing
【Load-bearing checks】the 1–2 that matter most
【Next step】wber-tables-figures
npx claudepluginhub brycewang-stanford/awesome-journal-skills --plugin wber-skills
Organizes robustness checks by threat for World Development manuscripts. Guides quantitative sensitivity analysis (specification, sample, inference, attrition) and qualitative trustworthiness (triangulation, negative-case analysis).
Organizes robustness checks for IER papers by threat to load-bearing assumption, without running regressions. Helps structure responses to referee concerns.
Builds a robustness suite for REStat manuscripts: tests whether headline estimates survive specification, sample, measurement, identification, and inference alternatives.