ALWAYS LOAD THIS SKILL before writing any policyengine.py microsimulation code. Contains correct import paths, environment setup, dataset loading, and analysis patterns. Triggers: "write a script", "policyengine.py", "microsimulation script", "run a simulation", "load the dataset", "FRS", "EFRS", "enhanced FRS", "CPS", "enhanced CPS", "by income decile", "by tenure", "by region", "energy spending", "domestic energy", "household net income", "output_dataset", "ensure_datasets", "uk_datasets", "us_datasets", "import datasets", "from policyengine", "Simulation(dataset=", "uk_latest", "us_latest", "plotly", "analysis script", "decile breakdown", "percentile", "groupby", "weighted", "mean", "median", "p25", "p75", "tenure type", "income band", "policy reform script".
This skill covers advanced patterns for working with policyengine.py simulations, including caching, result access, and entity mapping.
Before writing any code, check the environment. The policyengine.py package must be installed in the project's .venv.
# Always run from the policyengine.py repo root:
cd /path/to/policyengine.py
uv run python script.py
# Or activate first:
source .venv/bin/activate
python script.py
# NEVER use bare `pip install` — always:
uv pip install -e ".[uk]" # for UK work
uv pip install -e ".[us]" # for US work
If `from policyengine.core import Simulation` fails:
cd /path/to/policyengine.py
uv pip install -e ".[uk]"
# Then re-run with: uv run python script.py
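Once installed, a quick smoke test (a minimal sketch) confirms the imports resolve before you write a longer script:

# Minimal import smoke test — run with: uv run python smoke_test.py
from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import uk_latest

print("Imports OK:", Simulation.__name__, type(uk_latest).__name__)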
Only these imports exist — do not guess others:
# Core simulation
from policyengine.core import Simulation
# UK model
from policyengine.tax_benefit_models.uk import (
uk_latest, # The model version (pass as tax_benefit_model_version=)
uk_model, # The model itself
PolicyEngineUKDataset,
UKYearData,
create_datasets, # Create & cache datasets from HF source
load_datasets, # Load cached datasets from disk
ensure_datasets, # Create if missing, load if present (recommended)
)
# US model
from policyengine.tax_benefit_models.us import (
us_latest,
PolicyEngineUSDataset,
ensure_datasets,
)
# Outputs
from policyengine.outputs.aggregate import Aggregate, AggregateType
from policyengine.outputs.change_aggregate import ChangeAggregate, ChangeAggregateType
# Plotting
from policyengine.utils.plotting import COLORS, format_fig
There is NO:

- policyengine.core.dataset_registry
- policyengine.datasets
- policyengine.core.dataset_version.DatasetVersion.list()

Use ensure_datasets() — it returns a dict[str, PolicyEngineUKDataset], building files in ./data/ on first run and loading from disk on subsequent runs.
WARNING: from policyengine.tax_benefit_models.uk import datasets gives you the Python submodule, not a dict. Never index it like a dict.
from policyengine.tax_benefit_models.uk import ensure_datasets
uk = ensure_datasets(
datasets=[
"hf://policyengine/policyengine-uk-data/frs_2023_24.h5",
"hf://policyengine/policyengine-uk-data/enhanced_frs_2023_24.h5",
],
years=[2026],
data_folder="./data",
)
efrs = uk["enhanced_frs_2023_24_2026"]
frs = uk["frs_2023_24_2026"]
Dict key format: "{stem}_{year}" e.g. "enhanced_frs_2023_24_2026"
To force regeneration: delete ./data/ and call ensure_datasets() again.
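A sketch combining both points — building the dict key from the documented "{stem}_{year}" format, and forcing a rebuild by removing the cache folder (the shutil.rmtree call is just one way to delete ./data/):

import shutil
from pathlib import Path

from policyengine.tax_benefit_models.uk import ensure_datasets

stem, year = "enhanced_frs_2023_24", 2026
uk = ensure_datasets(years=[year], data_folder="./data")
efrs = uk[f"{stem}_{year}"]  # key format: "{stem}_{year}"

# Force regeneration: delete the cache folder, then rebuild from source
if Path("./data").exists():
    shutil.rmtree("./data")
uk = ensure_datasets(years=[year], data_folder="./data")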
from policyengine.tax_benefit_models.us import ensure_datasets
us = ensure_datasets(
datasets=["hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"],
years=[2026],
data_folder="./data",
)
ecps = us["enhanced_cps_2024_2026"]
Default US dataset: enhanced_cps_2024.h5 (Enhanced CPS), years 2024–2028.
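For multi-year work, passing several years should produce one dict entry per year under the same stem (a sketch, assuming the "{stem}_{year}" key format described above):

from policyengine.tax_benefit_models.us import ensure_datasets

us = ensure_datasets(
    datasets=["hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"],
    years=[2026, 2027],
    data_folder="./data",
)
ecps_2026 = us["enhanced_cps_2024_2026"]
ecps_2027 = us["enhanced_cps_2024_2027"]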
Always inspect the dataset to find available variable names — never guess:
from policyengine.tax_benefit_models.uk import ensure_datasets
uk = ensure_datasets(years=[2026], data_folder="./data")
d = uk["enhanced_frs_2023_24_2026"]
# Input variables (present in raw data)
print("household:", list(d.data.household.columns))
print("person: ", list(d.data.person.columns))
print("benunit: ", list(d.data.benunit.columns))
Input variables are what's in the raw survey data — demographics, reported incomes, consumption, wealth, flags.
Computed variables (household_net_income, income_tax, universal_credit, etc.) are not in the raw dataset — they are calculated by the simulation. To see what's available after running:
from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import uk_latest
sim = Simulation(dataset=d, tax_benefit_model_version=uk_latest)
sim.run()
print("household (post-sim):", list(sim.output_dataset.data.household.columns))
print("person (post-sim): ", list(sim.output_dataset.data.person.columns))
The computed variables available are defined by uk_latest.entity_variables — inspect this to see the full list without running a simulation:
from policyengine.tax_benefit_models.uk import uk_latest
print(uk_latest.entity_variables) # dict: entity → [variable names]
When running simulations with policyengine.py (the microsimulation package, not the API client), you work with three key components:
- Simulation.ensure() - Smart caching to avoid redundant computation
- simulation.output_dataset.data - Accessing calculated results
- map_to_entity() - Converting data between entity levels (person ↔ household)

Note: This is for microsimulation with policyengine.py, not the policyengine Python API client (which uses Simulation(situation=...)).
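A condensed sketch of how the three fit together (each is covered in detail below):

from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import ensure_datasets, uk_latest

uk = ensure_datasets(years=[2026], data_folder="./data")
sim = Simulation(
    dataset=uk["enhanced_frs_2023_24_2026"],
    tax_benefit_model_version=uk_latest,
)
sim.ensure()                          # 1. smart-cached run
output = sim.output_dataset.data      # 2. calculated results
hh_income = output.map_to_entity(     # 3. person → household mapping
    source_entity="person",
    target_entity="household",
    columns=["employment_income"],
    how="sum",
)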
from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import uk_latest
simulation = Simulation(
dataset=dataset,
tax_benefit_model_version=uk_latest,
)
# Method 1: Always run (no caching)
simulation.run()
# Method 2: Run only if needed (recommended)
simulation.ensure()
# Method 3: Save results to disk
simulation.save()
# Method 4: Load results from disk
simulation.load()
- run(): Use when you need fresh results or parameters changed
- ensure(): Use for iterative development (checks cache → disk → run)
- save(): Use to persist large simulation results
- load(): Use to resume from previous session

How ensure() works internally:

def ensure(self):
# 1. Check in-memory LRU cache (100 simulations)
cached = _cache.get(self.id)
if cached:
self.output_dataset = cached.output_dataset
return
# 2. Try loading from disk
try:
self.tax_benefit_model_version.load(self)
except Exception:
# 3. Only run if both cache and disk fail
self.run()
self.save()
# 4. Add to cache for next ensure() call
_cache.add(self.id, self)
Performance impact:
# Run baseline once
baseline = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
baseline.ensure() # First call: runs simulation
baseline.save() # Persist to disk
# Test multiple reforms
for reform in [reform1, reform2, reform3]:
baseline.ensure() # Instant from cache!
reform_sim = Simulation(
dataset=dataset,
tax_benefit_model_version=uk_latest,
policy=reform
)
reform_sim.run() # Only reform needs to run
# Compare results...
After running a simulation, all calculated variables are in simulation.output_dataset.data.
simulation.run()
# Access output container
output = simulation.output_dataset.data
# Entity-level MicroDataFrames
output.person # Person-level results
output.benunit # Benefit unit results
output.household # Household-level results
# US has more entities
output.person
output.tax_unit # Federal tax filing unit
output.spm_unit # Supplemental Poverty Measure unit
output.family # Census family definition
output.marital_unit # Married couple or single
output.household
Each dataframe contains input variables + calculated variables:
# Person-level (UK)
print(output.person.columns)
# ['person_id', 'person_household_id', 'age', 'employment_income',
# 'income_tax', 'national_insurance', 'net_income', ...]
# Household-level (UK)
print(output.household.columns)
# ['household_id', 'region', 'rent', 'household_net_income',
# 'household_benefits', 'household_tax', ...]
# Benunit-level (UK)
print(output.benunit.columns)
# ['benunit_id', 'universal_credit', 'child_benefit',
# 'working_tax_credit', 'child_tax_credit', ...]
# Get specific columns
incomes = output.household[["household_id", "household_net_income"]]
# Filter data
high_earners = output.person[output.person["employment_income"] > 100000]
# Calculate statistics (automatically weighted!)
mean_income = output.household["household_net_income"].mean()
total_tax = output.household["household_tax"].sum()
# Access individual values
first_hh_income = output.household["household_net_income"].iloc[0]
All operations respect survey weights automatically:
# These are all weighted calculations
total_population = output.person["person_weight"].sum()
mean_income = output.household["household_net_income"].mean()
poverty_rate = output.household["in_absolute_poverty_bhc"].mean()
# Groupby operations are weighted
by_region = output.household.groupby("region")["household_net_income"].mean()
Convert data between entity levels (e.g., sum person income to household, or broadcast household rent to persons).
output.map_to_entity(
source_entity: str, # Entity to map from
target_entity: str, # Entity to map to
columns: list[str] = None, # Columns to map (None = all)
values: np.ndarray = None, # Custom values instead
how: str = "sum" # Aggregation method
)
Person → Group (aggregation):
how="sum" (default): Sum values within each grouphow="first": Take first value in each grouphow="mean": Average valueshow="max": Maximum valuehow="min": Minimum valueGroup → Person (expansion):
how="project" (default): Broadcast group value to all membershow="divide": Split group value equally among members# Sum employment income across all people in each household
household_employment = output.map_to_entity(
source_entity="person",
target_entity="household",
columns=["employment_income"],
how="sum"
)
# Result is MicroDataFrame at household level
print(household_employment.columns)
# ['household_id', 'employment_income'] # Now household total
# Give each person their household's rent value
person_rent = output.map_to_entity(
source_entity="household",
target_entity="person",
columns=["rent"],
how="project"
)
# Each person now has their household's rent
print(person_rent.columns)
# ['person_id', 'rent']
# Split household savings equally among members
person_savings_share = output.map_to_entity(
source_entity="household",
target_entity="person",
columns=["total_savings"],
how="divide"
)
# If household has £12,000 savings and 3 people, each gets £4,000
import numpy as np
# Calculate custom person-level values
custom_tax = np.where(
output.person["employment_income"] > 50000,
output.person["income_tax"] * 1.1, # 10% increase for high earners
output.person["income_tax"]
)
# Aggregate to household level
household_custom_tax = output.map_to_entity(
source_entity="person",
target_entity="household",
values=custom_tax,
how="sum"
)
# Map multiple income sources to household level
household_incomes = output.map_to_entity(
source_entity="person",
target_entity="household",
columns=[
"employment_income",
"self_employment_income",
"pension_income",
"savings_interest_income"
],
how="sum"
)
# Result has all columns at household level
# UK: Map benunit benefits to household level
# (Multiple benunits can exist in one household)
household_uc = output.map_to_entity(
source_entity="benunit",
target_entity="household",
columns=["universal_credit", "child_benefit"],
how="sum"
)
The Aggregate and ChangeAggregate classes automatically handle entity mapping when the variable and target entity don't match:
from policyengine.outputs.aggregate import Aggregate, AggregateType
# income_tax is person-level, but we want household-level sum
total_tax = Aggregate(
simulation=simulation,
variable="income_tax", # Person-level
entity="household", # Household-level aggregation
aggregate_type=AggregateType.SUM,
)
total_tax.run()
# Automatically maps income_tax from person to household using sum()
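Reading the computed value afterwards uses the same .result attribute as ChangeAggregate (shown in the batch example later in this section):

# After total_tax.run(), the value is available on .result
print(f"Total income tax (household sum): £{total_tax.result:,.0f}")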
# Run both simulations
baseline = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
baseline.ensure()
reform = Simulation(
dataset=dataset,
tax_benefit_model_version=uk_latest,
policy=reform_policy
)
reform.ensure()
# Get outputs
baseline_out = baseline.output_dataset.data
reform_out = reform.output_dataset.data
# Calculate differences
baseline_income = baseline_out.household["household_net_income"]
reform_income = reform_out.household["household_net_income"]
difference = reform_income - baseline_income
# Count winners/losers (weighted)
winners = (difference > 0).sum()
losers = (difference < 0).sum()
unchanged = (difference == 0).sum()
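Because difference is a weighted MicroSeries, boolean means give weighted population shares directly (a short sketch extending the block above):

# Weighted shares of households gaining/losing
share_winners = (difference > 0).mean()
share_losers = (difference < 0).mean()
print(f"Winners: {share_winners:.1%}, losers: {share_losers:.1%}")

# Average change among winners only
avg_gain = difference[difference > 0].mean()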
# Calculate marginal tax rate at person level
person_data = output.person.copy()
person_data["mtr"] = (
(person_data["income_tax"] + person_data["national_insurance"])
/ person_data["employment_income"].clip(lower=1)
) * 100
# Map to household level (max MTR in household)
household_mtr = output.map_to_entity(
source_entity="person",
target_entity="household",
values=person_data["mtr"].values,
how="max"
)
# Get London households with children
london_hh = output.household[output.household["region"] == "LONDON"]
households_with_children = output.person.groupby("person_household_id")["age"].apply(
lambda ages: (ages < 18).any()
)
# Combine filters
london_ids = set(london_hh["household_id"])
hh_with_kids_ids = set(households_with_children[households_with_children].index)
target_ids = london_ids & hh_with_kids_ids
# Extract subset
subset_hh = output.household[output.household["household_id"].isin(target_ids)]
subset_persons = output.person[output.person["person_household_id"].isin(target_ids)]
# Run baseline once
baseline = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
baseline.ensure()
baseline.save()
# Test multiple reforms efficiently
reforms = [reform1, reform2, reform3]
results = {}
for reform in reforms:
baseline.ensure() # Instant from cache
reform_sim = Simulation(
dataset=dataset,
tax_benefit_model_version=uk_latest,
policy=reform
)
reform_sim.run()
# Calculate impact
from policyengine.outputs.change_aggregate import ChangeAggregate, ChangeAggregateType
revenue = ChangeAggregate(
baseline_simulation=baseline,
reform_simulation=reform_sim,
variable="household_tax",
aggregate_type=ChangeAggregateType.SUM,
)
revenue.run()
results[reform.name] = revenue.result
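A small follow-up to tabulate the collected results (assumes each reform object exposes a name attribute, as the loop above does):

import pandas as pd

summary = pd.DataFrame(
    [{"reform": name, "revenue_change": value} for name, value in results.items()]
)
print(summary.sort_values("revenue_change", ascending=False))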
For custom analyses (decile breakdowns, percentiles, groupby), work directly with output_dataset.data after running the simulation. This is often simpler than using Aggregate.
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from policyengine.core import Simulation
from policyengine.tax_benefit_models.uk import PolicyEngineUKDataset, uk_latest
# Load dataset
dataset = PolicyEngineUKDataset(
name="Enhanced FRS 2026",
description="EFRS 2026",
filepath="./data/enhanced_frs_2023_24_year_2026.h5",
year=2026,
)
dataset.load()
# Run simulation to compute income variables
simulation = Simulation(dataset=dataset, tax_benefit_model_version=uk_latest)
simulation.ensure() # caches to disk after first run
# Access results as DataFrames
hh = simulation.output_dataset.data.household
# Assign income decile (note: pd.qcut ranks households unweighted)
hh["income_decile"] = pd.qcut(
hh["household_net_income"],
q=10,
labels=[f"D{i}" for i in range(1, 11)],
)
# Group and calculate stats
stats = (
hh.groupby(["income_decile", "tenure_type"])["domestic_energy_consumption"]
.agg(
mean="mean",
p25=lambda x: np.percentile(x, 25),
p75=lambda x: np.percentile(x, 75),
)
.reset_index()
)
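The plotly imports at the top of the script can then turn stats into a grouped bar chart (a sketch; format_fig is assumed to return the styled figure — check policyengine.utils.plotting for its exact behaviour):

import plotly.graph_objects as go
from policyengine.utils.plotting import format_fig

fig = go.Figure()
for tenure in stats["tenure_type"].unique():
    sub = stats[stats["tenure_type"] == tenure]
    fig.add_trace(go.Bar(x=sub["income_decile"], y=sub["mean"], name=str(tenure)))
fig.update_layout(
    barmode="group",
    title="Mean domestic energy spending by income decile and tenure",
    xaxis_title="Income decile",
    yaxis_title="Mean annual spending (£)",
)
fig = format_fig(fig)  # assumption: format_fig returns the formatted figure
fig.show()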
Key points:
- simulation.output_dataset.data.household is a MicroDataFrame with weights
- domestic_energy_consumption is household-level (annual £)
- tenure_type values: OWNED_OUTRIGHT, OWNED_WITH_MORTGAGE, RENT_FROM_COUNCIL, RENT_PRIVATELY, RENT_FROM_HA
- ensure() for iterative work: Can save minutes when re-running analyses
- Aggregate classes: Optimised implementations for common operations
# Good: Mask then map (values must align with the full person table)
import numpy as np
high_earner_values = np.where(
    output.person["employment_income"] > 100000,
    output.person["employment_income"],
    0,
)
high_earner_hh_income = output.map_to_entity(
    source_entity="person",
    target_entity="household",
    values=high_earner_values,
    how="sum"
)
# Less efficient: Map then filter
all_hh_income = output.map_to_entity(
source_entity="person",
target_entity="household",
columns=["employment_income"],
how="sum"
)
high_earner_hh = all_hh_income[all_hh_income["employment_income"] > 100000]
Current implementation:
# Simulation lifecycle
cat policyengine.py/src/policyengine/core/simulation.py
# Entity mapping logic
cat policyengine.py/src/policyengine/core/dataset.py
# Cache implementation
cat policyengine.py/src/policyengine/core/cache.py
Related skills:

- policyengine-core-skill - Understanding simulation engine architecture
- microdf-skill - Working with weighted DataFrames
- policyengine-python-client-skill - Basic simulation usage

Sanity checks after a run:

assert simulation.output_dataset is not None, "Simulation hasn't run"
# Check for expected variables
expected = ["household_net_income", "household_tax"]
actual = simulation.output_dataset.data.household.columns
assert all(v in actual for v in expected), "Missing variables"
# Verify person-household mapping is valid
person_hh_ids = set(output.person["person_household_id"])
household_ids = set(output.household["household_id"])
assert person_hh_ids.issubset(household_ids), "Invalid linkage"
# Check weights sum correctly
total_persons = output.person["person_weight"].sum()
print(f"Weighted population: {total_persons:,.0f}")
# Check for missing weights
assert not output.person["person_weight"].isna().any(), "Missing weights"
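These checks bundle naturally into a helper you call after every run (a sketch; validate_output is a hypothetical name, not part of the package):

def validate_output(simulation):
    """Illustrative sanity checks for a completed simulation."""
    assert simulation.output_dataset is not None, "Simulation hasn't run"
    output = simulation.output_dataset.data

    # Every person must point at an existing household
    person_hh_ids = set(output.person["person_household_id"])
    household_ids = set(output.household["household_id"])
    assert person_hh_ids.issubset(household_ids), "Invalid linkage"

    # Weights must be present and plausible
    assert not output.person["person_weight"].isna().any(), "Missing weights"
    print(f"Weighted population: {output.person['person_weight'].sum():,.0f}")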
In the policyengine.py repo:

- .claude/policyengine-guide.md - High-level patterns
- .claude/quick-reference.md - Syntax cheat sheet
- .claude/working-with-simulations.md - Detailed simulation guide
- examples/ - Full working examples
- docs/core-concepts.md - Architecture documentation