@baogorek baogorek commented Feb 2, 2026

Summary

  • Fixes state and congressional-district (CD) calibration, which used stale 2022-2023 targets instead of the correct 2024 values
  • Removes hardcoded CBO_YEAR and TREASURY_YEAR constants from etl_national_targets.py
  • Adds --dataset CLI argument to specify the source dataset
  • Derives time_period from sim.default_calculation_period, so the dataset itself is now the single source of truth for the calibration year

Root Cause

The ETL had hardcoded year constants:

CBO_YEAR = 2023  # was pulling 2023 CBO values
TREASURY_YEAR = 2023  # was pulling 2023 Treasury values

But the calibration runs at time_period=2024, so the targets and the simulation were a year apart. For income tax alone, the stale target ($2,051B) falls about 18% short of the 2024 value ($2,426B).
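For reference, the 18% figure can be reproduced from the two dollar amounts quoted above. This is only a quick arithmetic check; the variable names are made up for illustration, and the gap is measured relative to the stale value (relative to the 2024 value it is closer to 15.5%).

```python
# Income tax targets in $ billions, as quoted in this PR.
stale_target_b = 2051    # value the ETL was pulling
correct_target_b = 2426  # value expected for time_period=2024

shortfall_b = correct_target_b - stale_target_b

# Gap relative to the stale target (the ~18% quoted above).
gap_vs_stale = shortfall_b / stale_target_b
# Gap relative to the correct target, for comparison.
gap_vs_correct = shortfall_b / correct_target_b

print(f"shortfall: ${shortfall_b}B")             # shortfall: $375B
print(f"vs stale target: {gap_vs_stale:.1%}")    # vs stale target: 18.3%
print(f"vs 2024 target: {gap_vs_correct:.1%}")   # vs 2024 target: 15.5%
```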

The Fix

Instead of hardcoding years, we now derive the time period from the dataset:

sim = Microsimulation(dataset=args.dataset)
time_period = int(sim.default_calculation_period)  # e.g., 2024

This ensures CBO/Treasury targets always match the dataset's year, preventing future drift when updating to new base years annually.
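To illustrate why this prevents drift, here is a minimal sketch of a year-keyed target lookup driven by the dataset's period. It is not the code in etl_national_targets.py: the `CBO_INCOME_TAX_B` table, the `target_for` helper, and the `SimpleNamespace` stand-in for a `Microsimulation` are all hypothetical, and the mapping of $2,051B to 2023 and $2,426B to 2024 is inferred from the figures above.

```python
from types import SimpleNamespace

# Hypothetical year-keyed table ($ billions); the year assignment is an
# assumption based on the numbers quoted in this PR.
CBO_INCOME_TAX_B = {2023: 2051, 2024: 2426}


def target_for(sim, table):
    """Return (time_period, target) for the year the dataset itself reports."""
    time_period = int(sim.default_calculation_period)
    if time_period not in table:
        # Fail loudly rather than silently falling back to another year.
        raise KeyError(f"no target available for {time_period}")
    return time_period, table[time_period]


# Stand-in for Microsimulation(dataset=...); only the attribute used here.
fake_sim = SimpleNamespace(default_calculation_period="2024")

year, target = target_for(fake_sim, CBO_INCOME_TAX_B)
print(year, target)  # 2024 2426
```

With no separate year constant, there is nothing to forget to bump when the base year changes; a missing year surfaces as an error instead of a stale target.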

Usage

# Default: uses HuggingFace production dataset
python policyengine_us_data/db/etl_national_targets.py

# Or specify a local dataset
python policyengine_us_data/db/etl_national_targets.py \
  --dataset /path/to/stratified_extended_cps.h5
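The argument handling behind these two invocations might look roughly like the sketch below. Only the `--dataset` flag name comes from this PR; the `parse_args` helper and the `DEFAULT_DATASET` placeholder are hypothetical, and the real default (the HuggingFace production dataset) is deliberately not spelled out here.

```python
import argparse

# Placeholder only: the actual default URI lives in the ETL script.
DEFAULT_DATASET = "PRODUCTION_DATASET_URI_PLACEHOLDER"


def parse_args(argv=None):
    """Parse the ETL command line; argv=None reads from sys.argv."""
    parser = argparse.ArgumentParser(
        description="Load national calibration targets for a dataset's year."
    )
    parser.add_argument(
        "--dataset",
        default=DEFAULT_DATASET,
        help="Source dataset (URI or local .h5 path); its default "
        "calculation period determines the target year.",
    )
    return parser.parse_args(argv)


# No flag: falls back to the default dataset.
print(parse_args([]).dataset)
# Explicit local file, as in the second invocation above.
print(parse_args(["--dataset", "/path/to/stratified_extended_cps.h5"]).dataset)
```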

Test plan

  • Run make database to regenerate policy_data.db
  • Verify CBO/Treasury targets now show 2024 values
  • Verify income_tax target is ~$2,426B (not $2,051B)

Closes #503

🤖 Generated with Claude Code

baogorek and others added 2 commits February 2, 2026 10:36
- Remove hardcoded CBO_YEAR and TREASURY_YEAR constants
- Add --dataset CLI argument to etl_national_targets.py
- Derive time_period from sim.default_calculation_period
- Default to HuggingFace production dataset

The dataset itself is now the single source of truth for the
calibration year, preventing future drift when updating to new
base years.

Closes #503

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The CBO income_tax parameter represents positive-only receipts (refundable
credit payments in excess of liability are classified as outlays, not
negative receipts). Using income_tax_positive matches this definition.
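
As a rough illustration of the difference, the sketch below compares a positive-only total with a net total. It assumes income_tax_positive behaves like max(income_tax, 0) per filing unit; the per-filer amounts and both helper functions are hypothetical.

```python
# Hypothetical net income tax per filer, in dollars. Negative values are
# filers whose refundable credits exceed their liability.
net_income_tax = [5000.0, -1200.0, 300.0, -800.0]


def positive_only_total(values):
    """Sum only positive liabilities (assumed receipts definition)."""
    return sum(max(v, 0.0) for v in values)


def net_total(values):
    """Sum liabilities with refundable-credit payments netted out."""
    return sum(values)


receipts = positive_only_total(net_income_tax)
net = net_total(net_income_tax)

print(receipts)        # 5300.0
print(net)             # 3300.0
print(receipts - net)  # 2000.0, the portion treated as outlays
```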

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@baogorek baogorek force-pushed the fix-stale-calibration-targets-503 branch from ee54587 to 69406d6 Compare February 2, 2026 18:04
baogorek and others added 3 commits February 2, 2026 13:29
All ETL scripts now derive their target year from the dataset's
default_calculation_period instead of hardcoding years. This ensures
all calibration targets stay synchronized when updating to a new
base year annually.

Updated scripts:
- create_initial_strata.py
- etl_age.py
- etl_irs_soi.py (with configurable --lag for IRS data delay)
- etl_medicaid.py
- etl_snap.py
- etl_state_income_tax.py
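
One plausible reading of the --lag option is sketched below: the IRS SOI year is the dataset year minus a configurable number of years. The subtraction semantics, the `soi_year` helper, and the example lag of 2 are assumptions for illustration, not the script's confirmed behavior or default.

```python
def soi_year(dataset_year: int, lag: int) -> int:
    """Return the IRS SOI data year for a dataset year and publication lag.

    Assumes the lag is a whole number of years subtracted from the
    dataset's default calculation period.
    """
    if lag < 0:
        raise ValueError("lag must be non-negative")
    return dataset_year - lag


# Hypothetical example: a 2024 dataset with a two-year IRS delay.
print(soi_year(2024, 2))  # 2022
```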

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update parse_ucgid to recognize both 5001800US (118th Congress) and 5001900US (119th Congress)
- Expand Puerto Rico and territory filters to handle both Congress code formats
- Update TERRITORY_UCGIDS and NON_VOTING_GEO_IDS with 119th Congress codes

This ensures consistent redistricting alignment: 2024 ACS data uses 119th Congress
codes natively, and IRS SOI data is converted via the 116th→119th mapping matrix.
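
A minimal sketch of prefix handling for both Congress code formats follows. It is not the repository's parse_ucgid: the `parse_cd_ucgid` name, the returned dict shape, and the assumption that the prefix is followed by a 2-digit state FIPS code and a 2-digit district code are illustrative only.

```python
# Congressional-district UCGID prefixes named in this commit.
CD_PREFIXES = {
    "5001800US": 118,  # 118th Congress
    "5001900US": 119,  # 119th Congress
}


def parse_cd_ucgid(ucgid: str) -> dict:
    """Split a congressional-district UCGID into its parts.

    Assumes the prefix is followed by exactly four digits:
    state FIPS (2) then district (2).
    """
    for prefix, congress in CD_PREFIXES.items():
        if ucgid.startswith(prefix):
            suffix = ucgid[len(prefix):]
            if len(suffix) != 4 or not suffix.isdigit():
                raise ValueError(f"unexpected district suffix in {ucgid!r}")
            return {
                "congress": congress,
                "state_fips": suffix[:2],
                "district": suffix[2:],
            }
    raise ValueError(f"not a recognized congressional-district UCGID: {ucgid!r}")


print(parse_cd_ucgid("5001800US0601"))
print(parse_cd_ucgid("5001900US0601"))
```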

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

State calibration (policy_data.db) uses stale 2022-2023 targets for 2024 sim