Replace ad-hoc targets with structured registry and source modules#269
Replace ad-hoc targets with structured registry and source modules#269nikhilwoodruff wants to merge 6 commits intomainfrom
Conversation
The old system had ~700 lines of hardcoded targets scattered across loss.py with no provenance. This replaces it with a targets registry (sources.yaml + per-source Python scrapers) that produces 570 national targets with source URLs, and factors local area data loading into dedicated source modules. National: pydantic Target schema, YAML registry, build_loss_matrix.py computes household columns from registry. Local: 4 source modules (local_age, local_income, local_uc, local_la_extras) replace inline file reads in constituencies/loss.py and local_authorities/loss.py. Both calibrate.py files updated to import create_national_target_matrix from the new location. Full pipeline runs end-to-end with zero skipped targets. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
There was a problem hiding this comment.
Review
Critical
-
No caching on
get_all_targets()— triggers HTTP downloads in 6+ modules on every call, andbuild_loss_matrix.pycalls it three times. If any download fails, targets silently disappear. The old local-CSV approach was deterministic. -
Savings interest income conflict —
hmrc_spi.pynow emits savings targets that the old code deliberately excluded (SPI underestimates ~3bn vs ~98bn actual). These conflict with the correctons_savings.pytargets. -
SPI band name format change — old names included row index (
_band_3_50_000_to_70_000), new ones don't. Breaking change for downstream consumers. -
Tests require live network — no mocks, so CI will be flaky.
Duplication
_HEADERS, _STORAGE, _SOURCES_YAML/_load_config(), and _to_float() are each copy-pasted across 2–7 modules. Should be shared constants/utilities.
build_loss_matrix.py (827 lines)
The 200+ line if/elif dispatch chain on target name strings is the same problem the PR claims to solve, just relocated. The custom_compute field was designed for this but most targets don't use it. Skipped targets are logged at DEBUG — should be WARNING.
Undocumented behavioural changes
hmrc/salary_sacrifice_contributionstarget droppedobr/esano longer combinesesa_income + esa_contrib- VOA council tax population uprating removed
scottish_government.pyextrapolation anchors on 2025, not 2026
What's done well
- Pydantic schema, registry auto-discovery,
_SimContextlazy caching - Local area refactoring is a clear improvement
- Source URLs on most targets
|
Thanks! pls disapply no 3, and yes undocumented behavioural changes should all be scrapped from the pr. |
- Restore hmrc/salary_sacrifice_contributions target (24bn base, 3%/yr) - Fix obr/esa to combine esa_income + esa_contrib - Restore VOA council tax population uprating for non-base years - Extract shared HEADERS/STORAGE/load_config/to_float into _common.py - Decompose build_loss_matrix.py (828→402 lines) into targets/compute/ subpackage with domain modules: demographics, households, income, benefits, council_tax, other Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflict in utils/loss.py (keep registry delegation). Incorporate salary sacrifice headcount targets from PR #268 into the structured registry (obr.py, compute/income.py, build_loss_matrix.py). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
The old calibration system had ~700 lines of hardcoded targets in
utils/loss.pywith no source URLs or provenance, and local area loss functions loaded 8+ opaque xlsx/csv files inline. This replaces the whole thing with a structured targets registry.National targets: pydantic
Targetschema, YAML registry (sources.yaml), per-source Python scrapers (OBR, ONS, HMRC, DWP, VOA, etc.), andbuild_loss_matrix.pythat computes household-level columns from the registry. 570 targets total, zero skipped.Local area targets: 4 source modules (
local_age,local_income,local_uc,local_la_extras) replace inline file loading inconstituencies/loss.pyandlocal_authorities/loss.py. Each module has reference URLs and docstrings for provenance.Both
calibrate.pyfiles updated to importcreate_national_target_matrixfrom the new location. Full pipeline runs end-to-end: constituency calibration loss 0.056, LA loss 0.077. All 68 tests pass.Test plan
TESTING=1pipeline runs end-to-end with zero warnings/skipped targets