US survey data enhancement - CPS with PUF imputation patterns and cross-repo variable workflows. Triggers: "CPS", "Current Population Survey", "PUF", "Public Use File", "US data", "US microdata", "enhanced CPS", "policyengine-us-data", "cross-repo", "FINANCIAL_SUBSET"
PolicyEngine US Data provides enhanced Current Population Survey (CPS) datasets with imputed variables from the IRS Public Use File (PUF).
PolicyEngine US uses the CPS ASEC as its primary microdata source. The CPS contains household demographics, income, and benefits but lacks detailed tax information. The IRS PUF provides comprehensive tax data, but access to it is restricted. This package imputes tax-related variables from the PUF onto the CPS.
Key datasets:
Location: PolicyEngine/policyengine-us-data
Clone:
git clone https://github.com/PolicyEngine/policyengine-us-data
cd policyengine-us-data
policyengine_us_data/
├── datasets/
│ ├── cps/ # CPS ASEC processing
│ │ ├── census_cps.py # Raw CPS loader
│ │ └── cps.py # CPS enhancement
│ └── puf/ # PUF imputation
│ ├── irs_puf.py # Raw PUF loader
│ └── puf.py # PUF-to-CPS imputation
└── storage/ # Data storage utilities
From PyPI:
uv pip install policyengine-us-data
Development:
uv pip install -e .
This is the #1 source of CI failures when adding new data-backed variables.
When you add a new variable that:
You MUST follow this workflow:
The puf.py file filters FINANCIAL_SUBSET to only include variables that exist in policyengine-us:
# In puf.py
self.available_financial_vars = [
v for v in FINANCIAL_SUBSET if v in self.variable_to_entity
]
If policyengine-us doesn't have the variable yet, it gets silently skipped during data generation.
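The silent skip can be made visible with a small check. The following is a minimal sketch, not the package's own code: the list contents and the `variable_to_entity` mapping are hypothetical stand-ins for what puf.py would see at build time, and `split_available` is an illustrative helper name.

```python
# Hypothetical stand-ins: in the real package these come from puf.py and
# from the variable registry of the installed policyengine-us.
FINANCIAL_SUBSET = ["employment_income", "farm_income", "partnership_se_income"]
variable_to_entity = {"employment_income": "person", "farm_income": "person"}


def split_available(subset, known_variables):
    """Return (available, skipped), preserving the order of `subset`."""
    available = [v for v in subset if v in known_variables]
    skipped = [v for v in subset if v not in known_variables]
    return available, skipped


available, skipped = split_available(FINANCIAL_SUBSET, variable_to_entity)
if skipped:
    # Same filter as in puf.py, but the dropped names are reported.
    print(f"Not defined in installed policyengine-us, will be skipped: {skipped}")
```

Running a check like this locally before pushing shows whether the installed policyengine-us already knows the new variable.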
Step 1: Create and merge the policyengine-us PR first
# In policyengine-us
1. Add variable definition (e.g., partnership_se_income.py)
2. Add to relevant formulas
3. Merge PR
4. Wait for PyPI release (automatic, check pypi.org/project/policyengine-us)
Step 2: Note the released version number
# Check latest version
curl -s https://pypi.org/pypi/policyengine-us/json | jq '.info.version'
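The same lookup can be done from Python with only the standard library. This is a sketch under the assumption that the PyPI JSON endpoint used in the curl command above is reachable; the function names are illustrative and not part of either package.

```python
import json
from urllib.request import urlopen


def latest_version_from_payload(payload):
    """Extract info.version from a PyPI JSON payload (str or bytes)."""
    return json.loads(payload)["info"]["version"]


def fetch_latest_version(project, timeout=10):
    """Query the PyPI JSON endpoint for `project` and return its latest version."""
    url = f"https://pypi.org/pypi/{project}/json"
    with urlopen(url, timeout=timeout) as response:
        return latest_version_from_payload(response.read())


# Example call (needs network access):
# print(fetch_latest_version("policyengine-us"))
```

Splitting parsing from fetching keeps the parsing step testable without a network connection.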
Step 3: Create the policyengine-us-data PR with version bump
# In policyengine-us-data
1. Add data extraction in puf.py (e.g., puf["partnership_se_income"] = ...)
2. Add to FINANCIAL_SUBSET list in puf.py
3. CRITICAL: Add to IMPUTED_VARIABLES in extended_cps.py
- This is a SEPARATE list that controls what gets imputed into Enhanced CPS!
4. CRITICAL: Bump minimum version in pyproject.toml:
- "policyengine-us>=1.516.0" # Version with new variable
5. Run `uv lock` to update lockfile
6. Merge PR
IMPORTANT: There are TWO variable lists!
- FINANCIAL_SUBSET in puf.py: controls what data is extracted from the PUF
- IMPUTED_VARIABLES in extended_cps.py: controls what gets imputed into the Enhanced CPS
If you only add the variable to FINANCIAL_SUBSET, it will be extracted but not imputed!
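A quick consistency check across FINANCIAL_SUBSET and IMPUTED_VARIABLES might look like the sketch below. It takes the lists as plain arguments rather than importing them, because the exact import paths are not given here; `missing_from_lists` and the sample list contents are hypothetical.

```python
def missing_from_lists(variable, financial_subset, imputed_variables):
    """Return the names of the lists that do not contain `variable`."""
    lists = {
        "FINANCIAL_SUBSET": financial_subset,
        "IMPUTED_VARIABLES": imputed_variables,
    }
    return [name for name, members in lists.items() if variable not in members]


# Made-up list contents: the variable was added to one list only.
gaps = missing_from_lists(
    "partnership_se_income",
    financial_subset=["farm_income", "partnership_se_income"],
    imputed_variables=["farm_income"],
)
print(gaps)
```

An empty result means the variable is registered in both places.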
The CI uses whatever policyengine-us version satisfies the pyproject.toml constraint:
# If pyproject.toml says:
"policyengine-us>=1.353.0"
# CI might install 1.499.0 (satisfies constraint but lacks new variable)
# Your variable gets silently skipped!
# Fix: bump to version with your variable
"policyengine-us>=1.516.0" # Now CI installs 1.516.0+ with your variable
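Why the old floor lets a stale release through can be illustrated with a plain version comparison. This is a simplified sketch that handles only numeric `X.Y.Z` versions and does not reproduce the resolver's full behaviour; the helper names are illustrative.

```python
def parse_version(version):
    """Turn 'X.Y.Z' into a tuple of ints (plain numeric releases only)."""
    return tuple(int(part) for part in version.split("."))


def satisfies_floor(candidate, floor):
    """True if `candidate` meets a '>=floor' constraint."""
    return parse_version(candidate) >= parse_version(floor)


# Version numbers taken from the example above.
for floor in ("1.353.0", "1.516.0"):
    allowed = satisfies_floor("1.499.0", floor)
    print(f">={floor} allows 1.499.0: {allowed}")
```

With the floor at 1.353.0, a release without the new variable still satisfies the constraint; raising the floor to 1.516.0 rules it out.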
Correct workflow that was followed:
policyengine-us PR #7239 - Added partnership_se_income variable
policyengine-us-data PR #481 - Added data extraction
- puf["partnership_se_income"] = k1bx14p + k1bx14s
- Added to FINANCIAL_SUBSET
Fix commit - Bumped minimum version
- "policyengine-us>=1.353.0" to "policyengine-us>=1.516.0"
- Ran uv lock
Mistake 1: Merging us-data before us releases
❌ Merge us-data PR while us PR still pending
→ Variable doesn't exist → Gets skipped → Data missing variable
Mistake 2: Not bumping the minimum version
❌ Add variable to FINANCIAL_SUBSET but keep old version constraint
→ CI installs old policyengine-us → Variable doesn't exist → Gets skipped
Mistake 3: Checking data before rebuild completes
❌ Run microsim right after merging
→ Still using old cached data → Variable shows $0
→ Need to wait for CI or `uv pip install --upgrade policyengine-us-data`
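Before trusting a result, it can help to confirm which package versions are actually installed. The sketch below uses only `importlib.metadata` from the standard library; `installed_version` is an illustrative helper. Note the limitation: it reports package versions only and says nothing about whether a cached dataset file has been rebuilt.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


for name in ("policyengine-us", "policyengine-us-data"):
    print(name, installed_version(name) or "not installed")
```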
1. Identify PUF columns:
# Check PUF documentation for column names
# e.g., k1bx14p = taxpayer's K-1 Box 14 partnership SE income
2. Add extraction in puf.py:
# In _create_financial_variables method or similar
puf["my_new_variable"] = puf["puf_column"]
# Or derive from multiple columns:
puf["my_new_variable"] = puf["col1"] + puf["col2"]
3. Add to FINANCIAL_SUBSET:
FINANCIAL_SUBSET = [
# ... existing variables ...
"my_new_variable", # Add at end
]
4. Bump policyengine-us version (if new variable):
# pyproject.toml
dependencies = [
"policyengine-us>=X.Y.Z", # Version with my_new_variable
]
Local test (requires PUF access):
make test
CI test: The GitHub Actions CI has PUF access via secrets. Push to a branch and check the workflow.
| PUF Column | Description | Target Variable |
|---|---|---|
| e00200 | Wages and salaries | employment_income |
| e00300 | Taxable interest | taxable_interest_income |
| e00600 | Ordinary dividends | dividend_income |
| e00900 | Business income (Schedule C) | self_employment_income |
| e02100 | Farm income (Schedule F) | farm_income |
| k1bx14p | K-1 Box 14 (taxpayer) | partnership_se_income |
| k1bx14s | K-1 Box 14 (spouse) | partnership_se_income |
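The table can also be read as a column-to-variable mapping in which several PUF columns may feed one target. The sketch below encodes the rows above as a dictionary and sums columns that share a target, as the partnership_se_income example does. The record values are made up, and treating every row as a straight sum is an assumption for illustration, not a description of how puf.py processes each column.

```python
from collections import defaultdict

# Rows of the table above: PUF column -> target variable.
PUF_TO_TARGET = {
    "e00200": "employment_income",
    "e00300": "taxable_interest_income",
    "e00600": "dividend_income",
    "e00900": "self_employment_income",
    "e02100": "farm_income",
    "k1bx14p": "partnership_se_income",
    "k1bx14s": "partnership_se_income",
}


def build_targets(record):
    """Sum the PUF columns of one record into their target variables."""
    totals = defaultdict(float)
    for column, target in PUF_TO_TARGET.items():
        totals[target] += record.get(column, 0.0)
    return dict(totals)


# Made-up record: taxpayer and spouse K-1 amounts combine into one variable.
record = {"e00200": 50_000.0, "k1bx14p": 1_200.0, "k1bx14s": 300.0}
targets = build_targets(record)
print(targets["partnership_se_income"])
```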
Usage flow:
1. Load raw CPS ASEC
↓
2. Load raw PUF
↓
3. Impute PUF variables to CPS using QRF
↓
4. Calibrate weights to administrative benchmarks
↓
5. Package as enhanced_cps_YYYY.h5
↓
6. Upload to HuggingFace
↓
7. Use in policyengine-us simulations
In policyengine-us:
from policyengine_us import Microsimulation
from policyengine_us_data import EnhancedCPS_2024
# Uses enhanced CPS with PUF imputations
sim = Microsimulation(dataset=EnhancedCPS_2024)
sim.calculate('self_employment_tax', period=2026)
# Uses imputed self_employment_income, farm_income, etc.
Repository: https://github.com/PolicyEngine/policyengine-us-data
Dependencies: policyengine-us, policyengine-core, microdf, microimpute
Data sources: