
Update pipeline documentation, both public facing and internal #644

Open
juaristi22 wants to merge 5 commits into main from maria/methodology-docs


@juaristi22 juaristi22 commented Mar 27, 2026

Summary

Comprehensive update to the pipeline documentation, covering both the public-facing docs and the internal developer reference. Fixes #643.

Internal developer reference (docs/internals/)

Three new notebooks and a README explain the calibration pipeline in depth for developers:

  • data_build_internals.ipynb — Stage 1: PUF cloning, geography assignment (including AGI-conditional routing and the no-collision constraint), and source imputation. Corrected pipeline ordering to match implementation (PUF clone → geography → source imputation). Documents that geography is rederived per-run, not persisted.

  • calibration_package_internals.ipynb — Stage 2: Matrix construction internals including per-state simulation, clone loop, domain constraints (corrected: constraints come from stratum_constraints in policy_data.db, not target_config.yaml), takeup re-randomization (state precomputation + clone-loop draws), county-dependent variables, COO assembly, target config filtering (clarified: applied post-matrix-build, not during construction), hierarchical uprating, and calibration package serialization with initial weight computation.

  • optimization_and_local_dataset_assembly_internals.ipynb — Stages 3–4: L0 optimization (fixed sparsity demo from 20→200 records so lambda effect is visible), H5 assembly pipeline (expanded from 11→16 steps matching actual implementation), SPM threshold recalculation, takeup consistency invariant, and diagnostics including validation_results.csv.

  • README.md — Pipeline orchestration reference with run ID format, step dependency graph, Modal volumes, HuggingFace artifact paths, resume logic. Added file reference tables for calibration/ and modal_app/ with per-file descriptions and notes on legacy/standalone status.
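
To make the L0 optimization item above concrete, here is a minimal sketch of a Hard Concrete gate and the penalty that lambda scales. It is not taken from the notebooks: the function names are hypothetical, and the stretch limits and temperature (`GAMMA`, `ZETA`, `BETA`) are the defaults commonly used in the Hard Concrete literature, assumed here rather than read from the pipeline.

```python
import math

# Assumed defaults (not taken from the pipeline): stretch interval and temperature.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def deterministic_gate(log_alpha):
    """Inference-time gate: a stretched sigmoid clipped to [0, 1]."""
    stretched = sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA
    return min(1.0, max(0.0, stretched))


def prob_nonzero(log_alpha):
    """Probability that the gate is non-zero (one record's share of the expected L0 norm)."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def l0_penalty(log_alphas, lam):
    """Lambda times the expected number of records that keep a non-zero weight."""
    return lam * sum(prob_nonzero(a) for a in log_alphas)


# Strongly negative log_alpha closes a gate exactly; strongly positive opens it fully.
closed = deterministic_gate(-5.0)
opened = deterministic_gate(5.0)
```

Because the penalty is linear in `lam`, a larger lambda puts more pressure on every `log_alpha` to fall, which closes more gates. With only a handful of records that pressure can be hard to see, which is presumably why the demo was enlarged.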

Public-facing documentation

  • docs/methodology.md — Minor updates to reflect current implementation.
  • docs/data.md — Updated data source descriptions.

Dead code removed

  • save_geography() and load_geography() from clone_and_assign.py — defined but never called by any pipeline code. Geography is rederived each run via deterministic seeding, making serialization unnecessary.
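
The reasoning behind the removal can be illustrated with a small sketch: if the assignment draws from a generator seeded explicitly on each run, rerunning it with the same seed reproduces the same geography, so nothing needs to be saved. `assign_geography`, the area codes, and the seed value below are hypothetical stand-ins, not the pipeline's actual function or inputs.

```python
import random


def assign_geography(n_records, areas, seed):
    """Hypothetical stand-in: assign one area to each record from a seeded generator."""
    rng = random.Random(seed)  # fresh generator, so no global random state is involved
    return [rng.choice(areas) for _ in range(n_records)]


areas = ["CA", "NY", "TX"]  # illustrative area codes only

first_run = assign_geography(20, areas, seed=42)
second_run = assign_geography(20, areas, seed=42)

# The same seed and inputs give the same assignment, so it can be rederived on demand.
assert first_run == second_run
```

This only holds if every input to the assignment (record order, area list, seed) is itself stable between runs.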

Test plan

🤖 Generated with Claude Code

@juaristi22 juaristi22 marked this pull request as draft March 27, 2026 16:47
@juaristi22 juaristi22 force-pushed the maria/methodology-docs branch from 0ced478 to 6493e3a Compare April 1, 2026 08:17
juaristi22 and others added 4 commits April 1, 2026 19:18
…ll diagnostics to HF

- docs/methodology.md and docs/data.md updated to match current pipeline
- pipeline.py now uploads validation diagnostics after H5 builds complete,
  in addition to the existing calibration diagnostics upload

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Move docs/calibration_internals.ipynb → docs/internals/calibration_package_internals.ipynb
- Add docs/internals/data_build_internals.ipynb: Stage 1 coverage — clone creation with real assign_random_geography() on 20 records, source imputation concept demo, PUF cloning toy walkthrough
- Add docs/internals/local_dataset_assembly_internals.ipynb: Stages 3–4 — Hard Concrete L0 math, λ preset comparison, weight expansion reference, diagnostics column guide
- Add docs/internals/README.md: navigation index + §9 pipeline orchestration (run ID format, Modal volumes, step dependency graph, resume logic, HuggingFace artifact paths, meta.json structure)
- Extend calibration_package_internals with Part 4 (matrix assembly per-state, domain constraints) and Part 5 (takeup randomization cross-stage demo)
- All notebooks execute with zero errors under --allow-errors; toy inputs complete in <30s
- Add changelog fragment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@juaristi22 juaristi22 force-pushed the maria/methodology-docs branch from 6493e3a to 4479b78 Compare April 1, 2026 13:49
@juaristi22 juaristi22 marked this pull request as ready for review April 1, 2026 13:49
@juaristi22 juaristi22 requested review from anth-volk and baogorek April 1, 2026 13:49

Development

Successfully merging this pull request may close these issues.

Complete calibration_internals.ipynb — document remaining pipeline stages
