ML-based variable imputation for survey data - used in policyengine-us-data to fill missing values. Triggers: "impute", "imputation", "missing values", "donor", "recipient", "quantile forest", "statistical matching", "PUF", "microimpute", "fill missing"
MicroImpute enables ML-based variable imputation through different statistical methods, with comparison and benchmarking capabilities.
When PolicyEngine calculates population impacts, the underlying survey data is missing some information. MicroImpute uses machine learning to fill those gaps by predicting the missing variables from a dataset that does record them.
What imputation does:
Example:
You benefit from imputation when:
uv pip install microimpute
# With image export (for plots)
uv pip install "microimpute[images]"
Imputation problem:
Methods available:
from microimpute import Imputer
import pandas as pd
# Donor data (complete)
donor = pd.DataFrame({
'income': [50000, 60000, 70000],
'age': [30, 40, 50],
'capital_gains': [5000, 8000, 12000] # Variable to impute
})
# Recipient data (missing capital_gains)
recipient = pd.DataFrame({
'income': [55000, 65000],
'age': [35, 45],
# capital_gains is missing
})
# Impute using quantile forest
imputer = Imputer(method='quantile_forest')
imputer.fit(
donor=donor,
donor_target='capital_gains',
common_vars=['income', 'age']
)
recipient_imputed = imputer.predict(recipient)
# Now recipient has predicted capital_gains
from microimpute import compare_methods
# Compare different imputation methods
results = compare_methods(
donor=donor,
recipient=recipient,
target_var='capital_gains',
common_vars=['income', 'age'],
methods=['linear', 'random_forest', 'quantile_forest']
)
# Shows quantile loss for each method
print(results)
Why quantile loss:
Interpretation:
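# Lower quantile loss = better fit
# The loss is in the units of the target variable, so thresholds such as
# 0.1 (good) or 0.5 (poor) only apply to targets on a comparable scale
# Compare across methods on the same data to choose best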
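To make the metric concrete, quantile (pinball) loss can be computed by hand. The sketch below uses only the standard library; the function name `quantile_loss` and the sample numbers are illustrative assumptions, not microimpute's own implementation.

```python
from statistics import mean


def quantile_loss(actual, predicted, q):
    """Average pinball loss of `predicted` as an estimate of the q-th quantile."""
    losses = []
    for y, y_hat in zip(actual, predicted):
        error = y - y_hat
        # Under-prediction (error > 0) costs q per unit,
        # over-prediction (error < 0) costs (1 - q) per unit.
        losses.append(max(q * error, (q - 1) * error))
    return mean(losses)


# Made-up values for illustration (same scale as the capital_gains example).
actual = [5000, 8000, 12000]
predicted = [6000, 8000, 10000]

loss_median = quantile_loss(actual, predicted, q=0.5)
loss_p90 = quantile_loss(actual, predicted, q=0.9)
print(loss_median, loss_p90)
```

At `q=0.5` the loss is half the mean absolute error. At `q=0.9` under-predictions are penalised nine times as heavily as over-predictions, which is why the same predictions score worse here as an estimate of the 90th percentile.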
Location: PolicyEngine/microimpute
Clone:
git clone https://github.com/PolicyEngine/microimpute
cd microimpute
To see structure:
tree microimpute/
# Key modules:
ls microimpute/
# - imputer.py - Main Imputer class
# - methods/ - Different imputation methods
# - comparison.py - Method benchmarking
# - utils/ - Utilities
To see specific methods:
# Quantile forest implementation
cat microimpute/methods/quantile_forest.py
# Random forest
cat microimpute/methods/random_forest.py
# Linear regression
cat microimpute/methods/linear.py
Required:
To see all dependencies:
cat pyproject.toml
Pattern:
# microimpute/methods/my_method.py
class MyMethodImputer:
    def fit(self, X_train, y_train):
        """Train on donor data."""
        # Fit your model
        pass

    def predict(self, X_test):
        """Impute on recipient data."""
        # Return predictions
        pass

    def get_quantile_loss(self, X_val, y_val):
        """Compute validation loss."""
        # Evaluate quality
        pass
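As an illustration of the pattern above, here is a minimal sketch of a method that fills in all three hooks. The class `NearestDonorImputer` is hypothetical and not part of microimpute: it does simple nearest-neighbour statistical matching, copying the target value from the closest donor row. It assumes rows arrive as plain tuples of numbers and adds an optional `q` argument to `get_quantile_loss`, both of which are assumptions made for this example.

```python
import math
from statistics import mean, pstdev


class NearestDonorImputer:
    """Hypothetical example method: copy the target from the closest donor row."""

    def fit(self, X_train, y_train):
        """Train on donor data: store the rows and per-column scaling."""
        columns = list(zip(*X_train))
        self.means_ = [mean(col) for col in columns]
        # Fall back to 1.0 so constant columns do not cause division by zero.
        self.scales_ = [pstdev(col) or 1.0 for col in columns]
        self.X_ = [self._scale(row) for row in X_train]
        self.y_ = list(y_train)
        return self

    def _scale(self, row):
        # Standardise so that income does not dominate age in the distance.
        return [(v - m) / s for v, m, s in zip(row, self.means_, self.scales_)]

    def predict(self, X_test):
        """Impute on recipient data: return the nearest donor's target value."""
        predictions = []
        for row in X_test:
            scaled = self._scale(row)
            nearest = min(
                range(len(self.X_)),
                key=lambda i: math.dist(scaled, self.X_[i]),
            )
            predictions.append(self.y_[nearest])
        return predictions

    def get_quantile_loss(self, X_val, y_val, q=0.5):
        """Compute validation loss: mean pinball loss at quantile q."""
        errors = [y - p for y, p in zip(y_val, self.predict(X_val))]
        return mean(max(q * e, (q - 1) * e) for e in errors)


# Usage with made-up (income, age) rows.
X_donor = [(50000, 30), (60000, 40), (70000, 50)]
y_donor = [5000, 8000, 12000]

imputer = NearestDonorImputer().fit(X_donor, y_donor)
imputed = imputer.predict([(52000, 31), (68000, 47)])
print(imputed)
```

Returning `self` from `fit` is a convenience borrowed from scikit-learn style estimators. Whether microimpute expects it is not stated in this document.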
To see how data pipeline uses microimpute:
cd ../policyengine-us-data
# Find usage
grep -r "microimpute" policyengine_us_data/
grep -r "Imputer" policyengine_us_data/
Typical workflow:
Run tests:
make test
# Or
pytest tests/ -v --cov=microimpute
To see test patterns:
cat tests/test_imputer.py
cat tests/test_methods.py
from microimpute import Imputer
# Create imputer
imputer = Imputer(method='quantile_forest')
# Fit on donor (complete data)
imputer.fit(
donor=donor_df,
donor_target='target_variable',
common_vars=['age', 'income', 'state']
)
# Predict on recipient (missing target_variable)
recipient_imputed = imputer.predict(recipient_df)
from microimpute import compare_methods
# Test multiple methods
methods = ['linear', 'random_forest', 'quantile_forest', 'xgboost']
results = compare_methods(
donor=donor,
recipient=recipient,
target_var='target',
common_vars=common_vars,
methods=methods
)
# Use method with lowest quantile loss
best_method = results.sort_values('quantile_loss').iloc[0]['method']
# Impute several variables
variables_to_impute = [
'short_term_capital_gains',
'long_term_capital_gains',
'qualified_dividends'
]
for var in variables_to_impute:
    imputer = Imputer(method='quantile_forest')
    imputer.fit(donor=irs_puf, donor_target=var, common_vars=common_vars)
    cps[var] = imputer.predict(cps)
Built-in Optuna integration:
from microimpute import tune_hyperparameters
# Automatically find best hyperparameters
best_params, study = tune_hyperparameters(
donor=donor,
target_var='target',
common_vars=common_vars,
method='quantile_forest',
n_trials=100
)
# Use tuned parameters
imputer = Imputer(method='quantile_forest', **best_params)
Validate imputation quality:
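from sklearn.model_selection import train_test_split
# Split donor for validation (hold out 20% with known target values)
train, holdout = train_test_split(donor, test_size=0.2, random_state=42)
# Impute on validation set, fitting on the training part only
imputer = Imputer(method='quantile_forest')
imputer.fit(donor=train, donor_target='target', common_vars=common_vars)
imputed_values = imputer.predict(holdout.drop(columns='target'))
actual_values = holdout['target']  # Measure accuracy against these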
Plot imputation results:
import plotly.express as px
# Compare imputed vs actual (on donor validation set)
fig = px.scatter(
x=actual_values,
y=imputed_values,
labels={'x': 'Actual', 'y': 'Imputed'}
)
lo, hi = min(actual_values), max(actual_values)
fig.add_shape(type='line', x0=lo, y0=lo, x1=hi, y1=hi)  # 45-degree line
Imputation preserves:
Trade-offs:
Full pipeline (policyengine-us-data):
1. Load CPS survey data
↓
2. microimpute: Fill missing variables from IRS PUF
↓
3. microcalibrate: Adjust weights to match benchmarks
↓
4. Validation: Check against administrative totals
↓
5. Package: Distribute enhanced dataset
↓
6. PolicyEngine: Use for population simulations
MicroImpute vs traditional imputation:
Traditional (mean imputation):
MicroImpute (ML methods):
Quantile forest advantage:
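A well-known drawback of mean imputation is that every recipient receives the same value, so the imputed variable has no spread at all. The sketch below contrasts it with drawing a value from donors that have similar predictors. This is a simple hot-deck stand-in chosen to keep the example dependency-free; it is not microimpute's quantile forest, and all data and names here are synthetic assumptions.

```python
import random
from statistics import mean, pstdev

random.seed(0)

# Synthetic donor data (made up for illustration): gains rise with income.
donor_income = [random.uniform(20_000, 200_000) for _ in range(500)]
donor_gains = [0.05 * inc + random.gauss(0, 2_000) for inc in donor_income]
recipient_income = [random.uniform(20_000, 200_000) for _ in range(500)]

# Traditional approach: every recipient gets the donor mean.
mean_imputed = [mean(donor_gains)] * len(recipient_income)


def draw_from_similar_donors(income, k=10):
    """Pick the target value of a random donor among the k closest in income."""
    nearest = sorted(
        range(len(donor_income)),
        key=lambda i: abs(donor_income[i] - income),
    )[:k]
    return donor_gains[random.choice(nearest)]


# Conditional approach: the imputed value depends on the recipient's income.
conditional_imputed = [draw_from_similar_donors(inc) for inc in recipient_income]

spread_actual = pstdev(donor_gains)
spread_mean = pstdev(mean_imputed)
spread_conditional = pstdev(conditional_imputed)
print(spread_actual, spread_mean, spread_conditional)
```

The mean-imputed column has zero standard deviation, while the conditional draws keep a spread close to that of the donor data. Methods that model the conditional distribution, rather than a single central value, aim to preserve this kind of variation.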
For large datasets:
# Use random forest (faster than quantile forest)
imputer = Imputer(method='random_forest')
# Or subsample donor
donor_sample = donor.sample(n=10000, random_state=42)
imputer.fit(donor=donor_sample, ...)
For high accuracy:
# Use quantile forest with tuning
best_params, _ = tune_hyperparameters(...)
imputer = Imputer(method='quantile_forest', **best_params)
Repository: https://github.com/PolicyEngine/microimpute
PyPI: https://pypi.org/project/microimpute/
Documentation: See README and docstrings in source