Add long-run calibration contracts by MaxGhenis · Pull Request #669 · PolicyEngine/policyengine-us-data

MaxGhenis · 2026-03-31T02:23:53Z

Summary

add explicit long-run calibration profiles, quality tiers, and audit metadata
record named target-source provenance in year sidecars and dataset manifests
add nonnegative feasibility/frontier tooling plus LP-backed fallbacks for entropy calibration

What changed

adds CalibrationProfile contracts for long-run age/SS/payroll/TOB calibration, including year-bounded approximate windows
stamps each generated artifact with calibration_quality, max_constraint_pct_error, and target-source metadata
adds assess_calibration_frontier.py for checking where exact nonnegative calibration remains feasible
adds rebuild_calibration_manifest.py to backfill manifests/sidecars with the new contract data
introduces an explicit trustees_2025_current_law long-run target-source package instead of relying on an implicit legacy file path
updates the long-run README and storage docs to describe the contract-driven flow

Why

The old long-run workflow depended on implicit flag combinations, silent fallback behavior, and ambiguous target-source provenance. This PR makes the calibration contract explicit and inspectable so downstream consumers can reject mismatched artifacts instead of trusting them implicitly.
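As a rough illustration of what "reject mismatched artifacts" could look like on the consumer side, here is a minimal sketch. It assumes a year sidecar has already been loaded into a dict; `calibration_quality` is the field named above, while the `profile` and `target_source` keys, the `check_sidecar` helper, and the exception class are hypothetical names used only for this example.

```python
from typing import Mapping, Sequence


class ArtifactContractError(ValueError):
    """Raised when a sidecar does not match what the consumer expects."""


def check_sidecar(
    sidecar: Mapping[str, object],
    expected_profile: str,
    expected_target_source: str,
    accepted_qualities: Sequence[str] = ("exact",),
) -> None:
    """Raise ArtifactContractError listing every mismatch found."""
    problems = []
    # "profile" and "target_source" are hypothetical key names for this sketch.
    if sidecar.get("profile") != expected_profile:
        problems.append(
            f"profile is {sidecar.get('profile')!r}, expected {expected_profile!r}"
        )
    if sidecar.get("target_source") != expected_target_source:
        problems.append(
            f"target_source is {sidecar.get('target_source')!r}, "
            f"expected {expected_target_source!r}"
        )
    # calibration_quality is the field stamped on each generated artifact.
    if sidecar.get("calibration_quality") not in accepted_qualities:
        problems.append(
            f"calibration_quality is {sidecar.get('calibration_quality')!r}, "
            f"accepted: {list(accepted_qualities)}"
        )
    if problems:
        raise ArtifactContractError("; ".join(problems))
```

A consumer would call `check_sidecar(...)` before loading the year file and treat the exception as a hard stop rather than a warning.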

Validation

uv run pytest policyengine_us_data/tests/test_long_term_calibration_contract.py -q
python3 -m py_compile policyengine_us_data/datasets/cps/long_term/calibration.py policyengine_us_data/datasets/cps/long_term/calibration_profiles.py policyengine_us_data/datasets/cps/long_term/calibration_artifacts.py policyengine_us_data/datasets/cps/long_term/run_household_projection.py policyengine_us_data/datasets/cps/long_term/ssa_data.py policyengine_us_data/datasets/cps/long_term/rebuild_calibration_manifest.py policyengine_us_data/datasets/cps/long_term/assess_calibration_frontier.py

Follow-up

A stacked follow-up PR will add the provisional OACT target-source package and builder script on top of this contract work.

MaxGhenis · 2026-03-31T02:28:48Z

Split this work into two draft PRs so the general calibration-contract changes can be reviewed independently from the provisional OACT source package. The stacked follow-up is #670.

MaxGhenis · 2026-03-31T11:37:59Z

Follow-up from the late-tail investigation:

I pushed 6bc34e02 onto this PR with two stable follow-ups:
- support-quality metrics in the calibration audit (positive_weight_count, positive_weight_pct, effective_sample_size, top_10_weight_share_pct, top_100_weight_share_pct)
- metadata normalization for historical LP fallback labels plus the widened 2079-2085 approximate window (10% instead of 5%)

Focused verification still passes: uv run pytest policyengine_us_data/tests/test_long_term_calibration_contract.py -q
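For readers unfamiliar with these diagnostics, the following is a minimal sketch of how the five support-quality metrics could be computed from a household weight vector. The metric names come from the list above; the formulas are assumptions, in particular the use of Kish's effective sample size, (sum w)^2 / sum(w^2), and the helper name `support_metrics` is hypothetical.

```python
def support_metrics(weights):
    """Summarize how concentrated a nonnegative weight vector is.

    Assumes at least one strictly positive weight and no negative weights.
    """
    total = sum(weights)
    positive_count = sum(1 for w in weights if w > 0)
    ranked = sorted(weights, reverse=True)

    def top_share_pct(k):
        # Share of total weight held by the k largest weights.
        return 100.0 * sum(ranked[:k]) / total

    return {
        "positive_weight_count": positive_count,
        "positive_weight_pct": 100.0 * positive_count / len(weights),
        # Kish effective sample size (an assumed definition).
        "effective_sample_size": total ** 2 / sum(w * w for w in weights),
        "top_10_weight_share_pct": top_share_pct(10),
        "top_100_weight_share_pct": top_share_pct(100),
    }
```

Under this definition, equal weights give an effective sample size equal to the number of households, and the value falls as weight concentrates in a few records.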

Substantively, the new diagnostics clarify the late-year problem:

2091 in the validated Trustees build has 88 positive households, ESS 41.4, top 10 households holding 30.9% of total weight.
A tiny linear blend back toward baseline weights immediately restores thousands of positive-weight households, so the 88-household count is partly an LP extreme-point artifact.
But ESS barely improves under those blends, which means the deeper issue is not just zeros; it is true late-year concentration under the current target bundle.

So the current read is:

the tail pathology is not evidence that Trustees necessarily imply many more very old workers
the LP fallback is exaggerating the support collapse
but the repeated-cross-section support is still genuinely too concentrated by the early 2090s under age + SS + payroll + OASDI TOB

I have not pushed the experimental dense approximate entropy fallback yet. The first prototype failed numerically on 2091, so I kept that local until it actually outperforms the LP fallback. Next step is still microsim-only: prototype a denser late-year calibrator and/or support expansion without falling back to an aggregate tail.

MaxGhenis · 2026-03-31T12:15:05Z

Late-tail update from the microsim-only investigation:

Pushed 4dfa5397 (Add late-year age aggregation for calibration) to this branch.
The current late-tail cliff is still primarily payroll-driven, but one-year age constraints were making the nonnegative frontier worse than necessary.
At 2091, the nonnegative best-case error for ss-payroll drops from 18.29% with single-year age bins to 16.89% with 5-year bins.
I wired that into the approximate calibration windows so late years can aggregate age targets / age matrix into 5-year buckets while preserving the open-ended 85+ bin.
Focused verification still passes: uv run pytest policyengine_us_data/tests/test_long_term_calibration_contract.py -q
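The age aggregation can be pictured with a small sketch. It assumes single-year age bins indexed 0 through 84 plus one open-ended 85+ bin at index 85, and an age matrix laid out as `matrix[age_bin][household]`; both the layout and the helper names are assumptions for illustration, not the code in 4dfa5397.

```python
def five_year_buckets(open_ended_age=85, width=5):
    """Return (start, stop) index ranges; the last range is the open-ended bin."""
    buckets = [
        (start, min(start + width, open_ended_age))
        for start in range(0, open_ended_age, width)
    ]
    buckets.append((open_ended_age, open_ended_age + 1))  # keep 85+ as its own bin
    return buckets


def aggregate_age_targets(targets, buckets):
    """Sum single-year age targets within each bucket."""
    return [sum(targets[start:stop]) for start, stop in buckets]


def aggregate_age_matrix(matrix, buckets):
    """Sum the rows of matrix[age_bin][household] within each bucket."""
    n_households = len(matrix[0])
    return [
        [sum(matrix[age][h] for age in range(start, stop)) for h in range(n_households)]
        for start, stop in buckets
    ]
```

Aggregating targets and matrix rows with the same bucket list keeps the constraint system consistent while reducing 86 age constraints to 18.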

Important caveat: this is not the whole tail fix by itself. The LP approximate fallback is still overly sparse, and the deeper ESS/concentration problem remains. But this change improves the late-year feasible set with a defensible repeated-cross-section adjustment rather than another hidden tolerance bump.

I have not included the standalone support-profiling script in this commit yet; it is still local-only while I decide whether it belongs in the repo.

MaxGhenis · 2026-03-31T12:30:54Z

Follow-up pushed as 047545b0 (Add support concentration gates to calibration).

This extends the calibration contract beyond target error / negative weights and adds explicit late-tail support-quality gates:

min_positive_household_count = 1000
min_effective_sample_size = 75
max_top_10_weight_share_pct = 25
max_top_100_weight_share_pct = 95

Those thresholds are applied in both classification and validation. On the sampled years:

existing validated run stays healthy through 2073
old 2074+ concentration now gets flagged
the new age-binned 2075/2076 outputs would also be rejected despite exact target matching, because support is still too concentrated

That is intentional: the runner should now fail fast on a microsim support collapse instead of saving a misleading late-year artifact.
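A minimal sketch of how such gates might be evaluated is shown here. The four thresholds are the ones listed above and the metric keys follow the audit fields added in 6bc34e02; the function name, the strict comparison operators, and the message format are assumptions.

```python
SUPPORT_GATES = {
    "min_positive_household_count": 1000,
    "min_effective_sample_size": 75,
    "max_top_10_weight_share_pct": 25,
    "max_top_100_weight_share_pct": 95,
}

# (metric key, gate key, whether the gate is a lower or an upper bound)
_CHECKS = [
    ("positive_weight_count", "min_positive_household_count", "min"),
    ("effective_sample_size", "min_effective_sample_size", "min"),
    ("top_10_weight_share_pct", "max_top_10_weight_share_pct", "max"),
    ("top_100_weight_share_pct", "max_top_100_weight_share_pct", "max"),
]


def support_gate_issues(metrics, gates=SUPPORT_GATES):
    """Return a list of human-readable gate violations (empty if all pass)."""
    issues = []
    for metric_name, gate_name, kind in _CHECKS:
        value, limit = metrics[metric_name], gates[gate_name]
        failed = value < limit if kind == "min" else value > limit
        if failed:
            issues.append(f"{metric_name}={value} violates {gate_name}={limit}")
    return issues


# 2091 figures from the earlier Trustees-build comment. With only 88
# positive households, the top-100 share is necessarily 100%.
year_2091 = {
    "positive_weight_count": 88,
    "effective_sample_size": 41.4,
    "top_10_weight_share_pct": 30.9,
    "top_100_weight_share_pct": 100.0,
}
issues_2091 = support_gate_issues(year_2091)
```

With those inputs all four gates are violated, which is the kind of result the runner is now meant to surface instead of writing the artifact silently.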

Focused verification still passes:
uv run pytest policyengine_us_data/tests/test_long_term_calibration_contract.py -q

I also started a one-year 2075 smoke rerun after this change so we can confirm the new validation trips where expected.

MaxGhenis · 2026-03-31T14:53:15Z

Added a support-augmentation diagnostic pass in 5525b3d2.

What landed:

- support_augmentation.py with two experimental clone profiles:
  - late-clone-v1: older SS-only, older SS+pay, and payroll-only donor clones
  - late-clone-v2: a more aggressive version that also pushes payroll-only donors much further up the age distribution
- evaluate_support_augmentation.py to compare the nonnegative feasibility frontier before and after augmentation for a single year/profile
- focused tests for donor selection and clone ID remapping in test_long_term_calibration_contract.py

Key result at 2091:

ss: exact already, augmentation has no effect
ss-payroll: base best-case max error 16.88648481756073%; late-clone-v1 and late-clone-v2 both leave it unchanged up to numerical noise
ss-payroll-tob: same story; no material improvement from either clone profile

Interpretation:

the late-tail infeasibility is not solved by adding more whole-household age-shifted copies of existing support
age + SS is already feasible, so the hard tradeoff is on the payroll side (and TOB does not appear to be the distinctive blocker at 2091)
if we want a microsim tail, the next support-expansion step needs to change payroll-per-older-household composition, not just create more older versions of existing households

MaxGhenis · 2026-03-31T14:59:43Z

Follow-up support-expansion result in e3d99121:

I added a composite-household diagnostic path in support_augmentation.py and tested two profiles:

late-composite-v1: clone older beneficiary households, then graft payroll from younger payroll-only donors
late-composite-v2: same idea, but with much more aggressive payroll transfer scales to test whether the frontier is simply missing older payroll intensity

I also added a focused composite-augmentation test in test_long_term_calibration_contract.py.

What the 2091 diagnostics show:

ss is still exact with or without augmentation
ss-payroll base best-case max error is 16.88648481756073%
late-composite-v1 changes that to 16.88648477158211%
late-composite-v2 lands at the same value to numerical precision

Interpretation:

simple age-shift clones were already insufficient
composite older-beneficiary-plus-payroll synthetic households are also insufficient
even forcing substantially higher payroll into older synthetic households does not materially expand the nonnegative feasible set at 2091

So the next support-expansion step has to be more structural than whole-household cloning or payroll grafting. The late frontier does not appear to be missing only “older payroll intensity” in a way that can be fixed by splicing current-household components together.

MaxGhenis · 2026-03-31T16:47:39Z

Added appended synthetic-sample diagnostics in 5b91f1e5.

What changed:

support_augmentation.py now has explicit appended synthetic single-person older-household grid profiles:
- late-synthetic-grid-v1
- late-synthetic-grid-v2 (same idea but with much higher payroll levels)
these profiles preserve the base CPS support untouched and append tagged synthetic older households on an age/SS/payroll grid
focused tests now cover the synthetic-grid path in test_long_term_calibration_contract.py

Key result at 2091 for ss-payroll:

base best-case max error: 16.88648481756073%
late-synthetic-grid-v1: 16.88648475766269%
late-synthetic-grid-v2: same to numerical precision

Interpretation:

even appended synthetic older-worker support, including a deliberately extreme payroll grid, does not materially improve the late-year age + SS + payroll nonnegative frontier
so the problem is not just “we need more older households with higher payroll”
the next step has to be more structural than cloning, grafting, or appended payroll-heavy older-household grids

MaxGhenis · 2026-03-31T18:10:57Z

Pushed c99ccbac to this PR.

This change moves long-run TOB out of the hard calibration target bundle and into post-calibration benchmarking:

ss-payroll-tob and ss-payroll-tob-h6 now calibrate on age + OASDI benefits + taxable payroll only.
Those profiles still compute OASDI/HI TOB, but they write it under calibration_audit.benchmarks instead of constraints, so TOB no longer affects the solver or quality classification.
Added ASSUMPTION_COMPARISON.md documenting how our calibration assumptions differ from Trustees/OACT, especially on TOB.
Switched frontier / support-augmentation tooling defaults to ss-payroll so late-tail diagnostics stop conflating support feasibility with TOB.
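The split between hard constraints and benchmarks can be sketched as follows. The `constraints` and `benchmarks` sections and `max_constraint_pct_error` are named in this PR; the `build_audit` helper, the entry layout, and the example numbers are hypothetical.

```python
def pct_error(achieved, target):
    """Signed percentage error of an achieved total against its target."""
    return 100.0 * (achieved - target) / target


def build_audit(achieved, targets, hard_targets):
    """Split target comparisons into solver constraints and benchmarks.

    Only entries under "constraints" feed max_constraint_pct_error, so a
    benchmark such as TOB cannot change the quality classification.
    """
    audit = {"constraints": {}, "benchmarks": {}}
    for name, target in targets.items():
        entry = {
            "target": target,
            "achieved": achieved[name],
            "pct_error": pct_error(achieved[name], target),
        }
        section = "constraints" if name in hard_targets else "benchmarks"
        audit[section][name] = entry
    audit["max_constraint_pct_error"] = max(
        abs(entry["pct_error"]) for entry in audit["constraints"].values()
    )
    return audit


# Made-up totals, chosen only to show the mechanics.
example_audit = build_audit(
    achieved={"ss_total": 100.0, "payroll_total": 198.0, "oasdi_tob": 80.0},
    targets={"ss_total": 100.0, "payroll_total": 200.0, "oasdi_tob": 50.0},
    hard_targets={"ss_total", "payroll_total"},
)
```

In this example the TOB gap is large, yet `max_constraint_pct_error` reflects only the two hard targets.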

Validation:

uv run pytest policyengine_us_data/tests/test_long_term_calibration_contract.py -q
python3 -m py_compile on the touched long-term modules

MaxGhenis · 2026-03-31T19:12:46Z

Pushed 642752cb with two late-tail fallback changes:

added a bounded-entropy approximate fallback before raw LP minimax
if bounded entropy still fails, densify the LP solution by blending back toward baseline inside the allowed error band (lp_blend) instead of saving the raw basic-feasible-point weights

Focused verification still passes:

uv run pytest policyengine_us_data/tests/test_long_term_calibration_contract.py -q
python3 -m py_compile policyengine_us_data/datasets/cps/long_term/calibration.py

Empirical result: this is better than raw LP, but still not enough for publishable late-tail microsim support.

Current diagnostics under the no-TOB-hard-target profile:

2083: lp_blend, max error 10.000%, ESS 12.25, positive households 6859, top-10 share 81.66%, top-100 share 94.14%
2091: lp_blend, max error 20.000%, ESS 13.08, positive households 6856, top-10 share 76.73%, top-100 share 97.56%

So this removes the pathological 19-household raw-LP support collapse, but it still fails the ESS / concentration gates by a wide margin. This now looks like a good checkpoint for an external code review, because we have a real alternative implementation and a concrete before/after result.

MaxGhenis · 2026-03-31T19:58:06Z

Pushed e00f6e59 addressing the most actionable review findings from the external Claude review.

What changed:

fixed approximate_window_for_year(profile, None) to prefer the open-ended tail window instead of the 2086-2095 window
made the legacy flag builder auto-upgrade use_tob=True into a GREG-derived profile instead of creating an impossible IPF+TOB contract
removed the bare except: cases in projection_utils.py; uprating failures now warn explicitly and no longer swallow KeyboardInterrupt / SystemExit
removed the duplicate bounded-entropy objective evaluation in the L-BFGS-B path (jac=True with a shared callable)
changed negative_weight_pct to measure negative weight mass, and added negative_weight_household_pct separately
added nonfatal validation support in run_household_projection.py via --allow-validation-failures / PEUD_ALLOW_INVALID_ARTIFACTS=1; validation issues are now recorded in metadata instead of necessarily crashing the run
manifest entries now include validation status / issue count and the household-count negative-weight metric
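To make the distinction between the two negative-weight metrics concrete, here is a minimal sketch. The two field names come from the list above; the choice of denominator for the mass metric (total absolute weight) is an assumption, as is the helper name.

```python
def negative_weight_metrics(weights):
    """Report negative weights by mass and by household count.

    Assumes at least one nonzero weight. The mass denominator (total
    absolute weight) is an assumption made for this sketch.
    """
    negative = [w for w in weights if w < 0]
    total_abs = sum(abs(w) for w in weights)
    return {
        # Share of absolute weight mass carried by negative weights.
        "negative_weight_pct": 100.0 * sum(-w for w in negative) / total_abs,
        # Share of households whose weight is negative.
        "negative_weight_household_pct": 100.0 * len(negative) / len(weights),
    }
```

With weights `[-1, 1, 1, 1, 6]`, one in five households is negative (20%) but it carries only 10% of the absolute mass, so the two metrics can diverge substantially.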

Verification:

uv run pytest policyengine_us_data/tests/test_long_term_calibration_contract.py -q
python3 -m py_compile policyengine_us_data/datasets/cps/long_term/calibration.py policyengine_us_data/datasets/cps/long_term/calibration_profiles.py policyengine_us_data/datasets/cps/long_term/projection_utils.py policyengine_us_data/datasets/cps/long_term/run_household_projection.py policyengine_us_data/datasets/cps/long_term/calibration_artifacts.py policyengine_us_data/tests/test_long_term_calibration_contract.py

I also started a one-year late-tail smoke run with --allow-validation-failures to confirm artifact writing on a failing late year; that was still in compute when I pushed this comment, so I’m not claiming a completed runner smoke yet.

MaxGhenis · 2026-03-31T21:26:53Z

Pushed 90361837 with the follow-up fixes from the second-pass Claude review.

Included in this commit:

cached objective_gradient_hessian() inside solve_with_root() so fun(z) / jac(z) at the same point do not recompute the expensive state
moved objective_with_gradient() outside the bounded-entropy start loop
normalize_metadata() now backfills validation_passed / validation_issues for older sidecars by re-running validation against the named profile
densify_lp_solution() now reports whether densification actually changed the LP point, and the audit uses lp_minimax instead of lp_blend when lambda stayed at zero
manifest now carries a top-level contains_invalid_artifacts flag
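The caching fix is a common pattern when a solver calls separate value and derivative callbacks at the same point. A generic sketch, not the repo's `solve_with_root()` internals, might look like this; the class name and the `(value, derivative)` return convention are assumptions.

```python
class CachedEvaluation:
    """Memoize the most recent expensive evaluation, keyed on the point z."""

    def __init__(self, evaluate):
        self._evaluate = evaluate  # z -> (value, derivative)
        self._last_key = None
        self._last_result = None
        self.calls = 0  # number of real (uncached) evaluations

    def _state(self, z):
        key = tuple(z)
        if key != self._last_key:
            self._last_result = self._evaluate(z)
            self._last_key = key
            self.calls += 1
        return self._last_result

    def fun(self, z):
        return self._state(z)[0]

    def jac(self, z):
        return self._state(z)[1]


def sum_of_squares(z):
    """Stand-in for an expensive objective: value and gradient together."""
    return sum(x * x for x in z), [2.0 * x for x in z]


cached = CachedEvaluation(sum_of_squares)
value = cached.fun([1.0, 2.0])
gradient = cached.jac([1.0, 2.0])  # served from the cache, no recomputation
```

Only the last point is cached, which is usually enough because solvers tend to request the value and derivative back to back at the same `z`.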

Verification:

uv run pytest policyengine_us_data/tests/test_long_term_calibration_contract.py -q
python3 -m py_compile policyengine_us_data/datasets/cps/long_term/calibration.py policyengine_us_data/datasets/cps/long_term/calibration_artifacts.py policyengine_us_data/tests/test_long_term_calibration_contract.py

The branch is now in a good state for another external review if we want one; the remaining risk looks primarily methodological rather than hidden harness bugs.

MaxGhenis · 2026-03-31T23:02:19Z

Pushed 79858d6c adding policyengine_us_data/datasets/cps/long_term/assess_publishable_horizon.py, a one-off diagnostic that runs the current calibration contract on selected milestone years and emits the same quality/support metrics the runner uses.

I used it to check the publishable cutoff under the current ss-payroll-tob profile and trustees_2025_current_law source.

Boundary result:

2073: exact, validation passes, ESS 90.35, top-10 share 22.48%
2074: exact, validation passes, ESS 80.62, top-10 share 23.86%
2075: first failing year; non-TOB targets are still essentially exact, but support gates fail (aggregate, ESS 33.23, top-10 share 47.69%)

Milestone diagnostics from the same tool:

2080: aggregate, lp_minimax_exact, ESS 13.49, top-10 76.41%
2085: aggregate, lp_blend, max constraint error 10.00%, ESS 13.65
2090: aggregate, lp_blend, max constraint error 20.00%, ESS 13.80
2095: hard failure under current window (23.72% > 20.00%)
2100: aggregate, lp_blend, max constraint error 35.00%, ESS 11.41

So the current evidence points to a publishable microsim horizon through 2074, with 2075+ diagnostic-only under the current fixed-support repeated-cross-section methodology.

MaxGhenis · 2026-03-31T23:58:19Z

Pushed ff099fd5 adding a more structural support-augmentation diagnostic: late-mixed-household-v1 in support_augmentation.py.

This profile appends synthetic mixed-age households by taking an older beneficiary household and adding a younger payroll-rich donor person as a separate subunit in the same household. That changes the household age/payroll direction, unlike the earlier age-shift and payroll-graft rules.

I ran:

uv run python policyengine_us_data/datasets/cps/long_term/evaluate_support_augmentation.py 2091 --profile ss-payroll --target-source trustees_2025_current_law --support-augmentation late-mixed-household-v1

Result at 2091:

base best-case nonnegative max error: 16.88648481756073%
mixed-household augmented best-case nonnegative max error: 16.886484695406168%
delta: -0.000000122 percentage points

So even a genuinely mixed-age household augmentation barely moves the frontier. That makes the current conclusion stronger: the late-tail issue is not just “missing older workers” or “missing older + younger co-resident households” in a simple sense. If 2100 microsim is a hard requirement, we likely need a much more radical synthetic-support generation path than support grafting onto the 2024 CPS donor geometry.

MaxGhenis · 2026-04-01T02:40:10Z

Pushed 7a03b8b1 adding prototype_synthetic_2100_support.py, a standalone diagnostic that:

builds a coarse actual 2024 tax-unit summary (head age, spouse age, dependents, payroll, SS, pension, dividends),
generates a fully synthetic minimal-support candidate set from archetypes,
scales the income grids into 2100 nominal space using the macro SS/payroll growth factors, and
solves for the best nonnegative 2100 composition against age + SS + payroll.

Key result at 2100 with trustees_2025_current_law:

best-case max error on the minimal synthetic support: 7.66%
so this minimal archetype support is dramatically more feasible than the fixed CPS support, though still not exact

The synthetic composition it wants is informative:

prime_worker_single: 25.7%
prime_worker_family: 25.0%
mixed_retiree_worker_couple: 24.0%
older_worker_couple: 10.2%
older_worker_single: 7.8%
prime_worker_couple: 5.9%

Compared with the actual 2024 support count mix, the largest gaps are:

mixed_retiree_worker_couple: 24.0% synthetic vs 2.37% actual support count
older_worker_couple: 10.2% vs 2.54%
older_worker_single: 7.8% vs 1.93%

Notably, once we drop TOB from the hard target set and just target age + SS + payroll, this minimal synthetic solution uses zero pension/dividend income and still wants an average taxable-benefits proxy share of 85%. That reinforces the current view that the hard late-tail support need is concentrated in older-worker / mixed retiree-worker composition, not generic older households or asset-income-heavy retirees.

So this doesn’t solve 2100 microsim, but it does sharpen what the synthetic support generator would need to add.

MaxGhenis · 2026-04-01T11:26:09Z

Pushed c14bbb88 with a wider synthetic-2100 support prototype in prototype_synthetic_2100_support.py.

Main result: the richer archetype menu now makes 2100 exact-feasible for the minimal age + SS + payroll problem. The prototype found a 0.0% best-case max error at 2100.

Important caveat: the exact fit is still very sparse. The current LP solution uses 20 positive synthetic candidates out of about 1.61M, with an effective sample size of about 11.0, and the top 10 candidates hold about 87.8% of total weight.

So this changes the diagnosis in a useful way:

the minimal synthetic support can now span the 2100 targets;
the remaining problem is no longer support span, but finding a dense/usable weighting solution on that richer support.

The dominant exact-fit archetypes are now mixed-age payroll-bearing households rather than pure retiree units:

older_plus_prime_worker_family: 30.3%
prime_worker_couple: 25.7%
prime_worker_family: 17.8%
mixed_retiree_worker_couple: 15.1%
older_worker_couple: 7.2%

That is a stronger signal than the earlier failed support-augmentation attempts: the fixed CPS donor geometry was too narrow, but once the synthetic menu includes richer mixed-age/mixed-payroll structures, the late-year targets themselves are no longer infeasible.

MaxGhenis · 2026-04-01T12:37:21Z

Pushed 2d4ab08f with a donor-backed synthetic support probe in prototype_synthetic_2100_support.py.

This tests the idea of grounding the dominant 2100 exact-fit synthetic targets in real 2024 tax-unit donors rather than treating them as free-floating synthetic records.

Main result:

Among the top 20 exact-fit synthetic targets, the median best donor distance is 0.96.
10/20 have a best donor distance <= 1.0.
13/20 have a best donor distance <= 2.0.
Only 1/20 is a true outlier with best distance > 3.0.

I also built a donor-backed clone probe: for each top synthetic target, fan it out across its nearest real donors and split the original exact target weight across those donor-backed variants.

That materially improves support concentration while preserving the same target geometry:

baseline exact synthetic solution: 20 positive candidates, ESS ~11.0, top-10 share ~87.8%
donor-backed clone probe: 85 positive candidates, ESS ~42.0, top-10 share ~32.8%, top-20 share ~56.6%
only one target stayed purely synthetic because no donor was close enough (an older_plus_prime_worker_family with head/spouse/dependent structure 82/67/2, weight share only 0.21%)
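The fan-out step can be sketched roughly as below. It assumes an equal split of each target's weight across its nearest donors, a distance cutoff of 3.0, and a cap on donors per target; none of these choices is confirmed by the commit, and the function names are hypothetical.

```python
def kish_ess(weights):
    """Kish effective sample size of a weight vector."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


def fan_out(target_weight, donor_distances, max_distance=3.0, max_donors=5):
    """Split one synthetic target's weight across its nearest real donors.

    donor_distances maps donor id -> distance to the synthetic target.
    Returns (donor_id, weight) pairs; donor_id is None when no donor is
    close enough and the target stays purely synthetic.
    """
    close = sorted(
        (distance, donor_id)
        for donor_id, distance in donor_distances.items()
        if distance <= max_distance
    )[:max_donors]
    if not close:
        return [(None, target_weight)]
    share = target_weight / len(close)  # equal split is an assumption
    return [(donor_id, share) for _, donor_id in close]
```

Total weight per target is preserved, so aggregate targets are unchanged while the weight is spread over more records, which is what raises the effective sample size.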

This is the strongest evidence so far that the late-year support problem may be solvable with a curated donor-backed supplement rather than a fully unconstrained synthetic population. The remaining gap is now much narrower: improve donor-backed coverage for the small number of structural outliers, then rerun the real late-year pipeline on that supplemented support.

MaxGhenis · 2026-04-01T13:17:57Z

Pushed de6403db.

This integrates the donor-backed late-year support experiment into the real runner and keeps the artifacts/docs auditable:

run_household_projection.py now supports --support-augmentation-profile donor-backed-synthetic-v1 and stamps augmentation provenance into year sidecars/manifests.
projection_utils.py now accepts in-memory Dataset inputs when writing year H5s.
README.md documents the new runner mode and its current experimental status.
added a contract test covering support_augmentation metadata persistence.

Verification:

uv run pytest policyengine_us_data/tests/test_long_term_calibration_contract.py -q
python3 -m py_compile ...
direct smoke: donor-backed augmented dataset builds and loads into Microsimulation
end-to-end runner smokes for 2075 and 2100

Current result: the integration works, but donor-backed v1 is still diagnostic rather than a fix.

2075 runs cleanly but support metrics are essentially unchanged (ESS 33.235, top-10 share 47.686%).
2100 also runs end to end, but still lands at the same late-tail frontier (SS -33.760%, payroll -35.000%, ESS 11.412, top-10 share 84.895%).

So the current supplement is not adding the missing directions the calibrator needs. The next step should be a more structural donor-backed generator, not just more clones of the top synthetic targets.

MaxGhenis · 2026-04-01T14:54:43Z

Pushed d93eed87.

This adds the structural donor-composite late-tail prototype and keeps the docs current.

What changed

prototype_synthetic_2100_support.py
- added role-based donor matching (older + worker roles)
- added role-composite candidate construction with non-uniform priors
- added actual-row composite support builder (build_role_composite_augmented_*)
- made the prototype importable both as a module and as a script
run_household_projection.py
- new runner mode: --support-augmentation-profile donor-backed-composite-v1
README.md
- documented the new structural-composite path and current status
test_long_term_calibration_contract.py
- added a focused unit test for role-based donor composite construction

Verification

uv run pytest policyengine_us_data/tests/test_long_term_calibration_contract.py -q
python3 -m py_compile ...

Main result

Synthetic support lab: strong improvement.

2100, age(5y)+SS+payroll, role-based donor composites:
- exact entropy fit (best_case_max_pct_error = 0.0)
- 360 positive candidates
- ESS = 95.82
- top-10 share = 25.56%
- top-20 share = 37.21%

This is materially better than the earlier exact synthetic LP corner solution (20 positive, ESS 10.97, top-10 87.76%).

Real microdata augmentation: promising but still not enough.

Full current-code 2100 runner pass with donor-backed-composite-v1:
- support build: 41,314 -> 41,584 households, 270 structural clones, 13 successful target groups, 7 skipped
- calibrated result still hits the same late-tail bound on the hard margins:
  - SS = -33.760%
  - payroll = -35.000%
- but support quality does improve versus the prior donor-clone run:
  - ESS = 13.279 (up from 11.412)
  - top-10 share = 79.352% (down from 84.895%)
  - OASDI TOB benchmark gap = +67.309% (down from +81.365%)
  - HI TOB benchmark gap = +39.595% (down from +51.261%)

So the structural composites clearly help, but the actual-row augmentation still does not yet reproduce the candidate-space improvement. The remaining problem is now the translation from good donor-composite candidate support into sufficiently expressive real microdata rows.

MaxGhenis · 2026-04-01T15:55:01Z

Update from the late-tail support work:

Added a shared support_augmentation_report.json artifact plus a new translation diagnostic script: policyengine_us_data/datasets/cps/long_term/diagnose_support_augmentation_translation.py.
The composite augmentation report now records per-clone provenance (target synthetic candidate, donor tax units, created household/tax-unit ids, target totals).
Fixed the actual-row translation bug by switching the clone builder from global macro scaling to donor-specific inverse uprating, with worker-side payroll factors and older-side SS factors.
README.md is updated to document the new report/diagnostic flow.
Focused suite still passes: uv run pytest policyengine_us_data/tests/test_long_term_calibration_contract.py -q.
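One reading of the translation fix, sketched under stated assumptions: a clone row is stored in base-year terms and later uprated by donor-specific factors, so dividing the intended target-year value by a global macro factor misses the target whenever the donor's own factor differs. The helper names and all numbers below are hypothetical.

```python
def to_base_year(target_value, uprating_factor):
    """Value to store in the base-year row so uprating lands on target_value."""
    if uprating_factor <= 0:
        raise ValueError("uprating factor must be positive")
    return target_value / uprating_factor


def uprate(base_value, donor_factor):
    """What the runner would realize in the target year for this row."""
    return base_value * donor_factor


# Hypothetical numbers: intended 2100 payroll for one clone, the donor's own
# payroll uprating factor, and a single global macro growth factor.
target_payroll = 300_000.0
donor_payroll_factor = 6.0
global_macro_factor = 5.0

# Donor-specific inverse uprating reproduces the intended target.
realized_donor_specific = uprate(
    to_base_year(target_payroll, donor_payroll_factor), donor_payroll_factor
)
# Global macro scaling overshoots when the donor's factor is larger.
realized_global = uprate(
    to_base_year(target_payroll, global_macro_factor), donor_payroll_factor
)
```

In this toy case the donor-specific path realizes 300,000 while the global path realizes 360,000, a 20% miss of the kind a translation diagnostic would flag.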

Most important result:

On the new 2100 probe, clone translation is now basically correct.
Realized clone households match the intended synthetic targets with 0.0 age-bucket loss, aggregate SS error ~0.000001%, and aggregate payroll error ~+1.67%.
That means the remaining blocker is no longer row materialization. The next lever is support coverage / how many synthetic target directions we inject, not how we translate a chosen synthetic target into actual microdata rows.

Temporary probe outputs are in:

/tmp/us_data_aug_translation_probe3/2100.h5
/tmp/us_data_aug_translation_probe3/support_augmentation_report.json
/tmp/us_data_aug_translation_probe3/2100.clone_translation.json

The full 2100 runner still does not clear the late-tail frontier (SS -33.15%, payroll -35.00%), but we now know that is a support-coverage problem rather than a translation bug.

MaxGhenis · 2026-04-02T12:47:37Z

Switched the long-run baseline over to a named core-threshold tax assumption and re-enabled hard TOB targeting in the ss-payroll-tob profiles.

What changed:

added tax_assumptions.py with trustees-core-thresholds-v1
runner now applies that tax assumption by default, stamps tax_assumption into sidecars/manifests, and supports --tax-assumption current-law-literal to opt out
ss-payroll-tob / ss-payroll-tob-h6 now hard-target TOB again
docs now record the primary-source Trustees distinction and the benchmark results that motivated the core-threshold bundle

Focused checks still pass:

uv run pytest policyengine_us_data/tests/test_long_term_calibration_contract.py -q
python3 -m py_compile ... on the touched long-term modules

2100 smoke run on the corrected stack (PYTHONPATH pointed at local policyengine-us with the wage-base fix, donor-composite augmentation enabled) now comes back:

validation: pass
quality: approximate
ss_total: essentially exact
oasdi_tob: essentially exact
hi_tob: essentially exact
payroll_total: -2.95%
ESS: 173.1
top-10 share: 17.7%
top-100 share: 54.9%

So this gives us a recorded tax-side baseline, a reproducible note/script for the TOB alignment question, and a long-run 2100 artifact that now hits TOB while keeping support quality in range.

Commit: 9f46122e

MaxGhenis · 2026-04-03T01:06:53Z

Synced the actual PR head branch to the current long-run work. The PR branch now includes the later long-run calibration, TOB-baseline, donor-composite, and post-OBBBA/OACT commits that earlier status comments referred to. This should also allow GitHub to start evaluating checks against the real branch state instead of the stale 6bc34e0 snapshot.

daphnehanse11 and others added 5 commits March 18, 2026 15:26

Add 2025 post-calibration ACA takeup override

6499145

Fix lint in ACA takeup tests

04acc8b

Format ACA takeup helper

e994f01

Move ACA override to Enhanced CPS path

1b0bd68

Add long-run calibration contracts

24efc95

MaxGhenis mentioned this pull request Mar 31, 2026

Add provisional OACT long-run target source #670

Draft

Add support diagnostics to long-run calibration audit

6bc34e0

Add late-year age aggregation for calibration

4dfa539

MaxGhenis added 2 commits March 31, 2026 08:21

Add long-run calibration comparison tools

2172962

Add support concentration gates to calibration

047545b

Add long-run support augmentation diagnostics

5525b3d

Probe composite long-run support augmentation

e3d9912

Test appended synthetic late-year support

5b91f1e

Benchmark long-run TOB outside calibration

c99ccba

Try denser late-tail approximate calibration

642752c

Fix review issues in long-run calibration harness

e00f6e5

Refine late-tail calibration metadata and caching

9036183

Add publishable horizon assessment tool

79858d6

Add mixed-age household support diagnostic

ff099fd

Prototype minimal synthetic support for 2100

7a03b8b

Expand synthetic 2100 support prototype

c14bbb8

Add donor-backed synthetic support probe

2d4ab08

Add donor-backed late-year support mode

de6403d

Add structural donor composite late-tail prototype

d93eed8

Diagnose and fix support-augmentation translation

5643fdd

MaxGhenis added 5 commits April 2, 2026 07:17

Add long-run TOB comparison note

0b7dfce

Add Trustees bracket-indexing TOB benchmark

40f8e45

Extend TOB tax-side benchmark scenarios

f35107d

Benchmark full IRS uprating for TOB

7ffc8ed

Adopt core-threshold TOB baseline

9f46122

Add post-OBBBA OACT target source

d287237

MaxGhenis added 3 commits April 2, 2026 21:13

Merge main into long-run calibration branch

d85bc0f

Format long-run calibration files

f618250

Merge remote-tracking branch 'upstream/main' into codex/tmp-pr669-merge

6095df6

# Conflicts: # policyengine_us_data/datasets/cps/enhanced_cps.py

MaxGhenis marked this pull request as ready for review April 4, 2026 01:50

MaxGhenis enabled auto-merge April 4, 2026 01:50

Merge branch 'main' into codex/us-data-calibration-contract

77766c0

MaxGhenis merged commit b17e083 into main Apr 4, 2026
9 checks passed
