
Scale H5 pipeline to 50 workers at 1 CPU each#684

Open
anth-volk wants to merge 2 commits into main from fix/h5-pipeline-improvements

Conversation

@anth-volk
Collaborator

Fixes #683

Summary

  • Increase default H5 workers from 8 to 50 (~10 items per worker instead of ~60)
  • Drop worker CPU from 4 to 1 (single-threaded numpy work doesn't use extra cores)
  • Add max_containers=50 safety cap on the Modal function decorator
  • Update defaults in coordinate_publish, run_pipeline, CLI entrypoints, and pipeline.yaml workflow
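
A minimal sketch of what the decorator change might look like. The app name and function name are assumptions (the real ones live in the repo); `cpu` and `max_containers` are real parameters on Modal's `@app.function` decorator:

```
import modal

app = modal.App("h5-pipeline")  # app name is hypothetical

# Each worker now handles ~10 items; the numpy work is single-threaded,
# so 1 CPU per container is enough. max_containers caps total scale-out.
@app.function(cpu=1.0, max_containers=50)
def build_h5_shard(items: list[str]) -> None:
    ...  # actual H5 build logic lives in the repo
```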

Test plan

  • Dispatch pipeline after merge and verify 50 workers spin up
  • Verify H5 builds complete successfully at 1 CPU
  • Compare wall-clock time and cost against previous 8-worker runs

🤖 Generated with Claude Code

The H5 build work is single-threaded numpy, so 4 CPUs per worker were
wasted. With 8 workers, each processed ~60 items serially.

- Increase default workers from 8 to 50 (~10 items per worker)
- Drop worker CPU from 4 to 1 (saves 75% CPU cost)
- Add max_containers=50 as safety cap
- Wall-clock time drops from ~60min to ~12min
- Total CPU cost drops: 8×60min×4CPU → 50×12min×1CPU
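
The cost arithmetic above checks out as a quick back-of-the-envelope (CPU-minutes = workers × minutes × CPUs per worker), roughly a two-thirds reduction:

```python
# Old configuration: 8 workers x ~60 min x 4 CPUs each
old_cpu_minutes = 8 * 60 * 4    # 1920 CPU-minutes
# New configuration: 50 workers x ~12 min x 1 CPU each
new_cpu_minutes = 50 * 12 * 1   # 600 CPU-minutes

reduction = 1 - new_cpu_minutes / old_cpu_minutes
print(old_cpu_minutes, new_cpu_minutes, reduction)
```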

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@anth-volk anth-volk marked this pull request as ready for review April 2, 2026 19:35
@anth-volk anth-volk requested a review from juaristi22 April 2, 2026 19:35
@baogorek
Collaborator

baogorek commented Apr 3, 2026

Hey Anthony, nice change — the 4 CPU → 1 CPU drop makes sense for single-threaded numpy work, and parallelizing across 50 workers should cut wall-clock time significantly.

Two things to flag:

1. Minor: max_containers vs num_workers mismatch

max_containers=50 is hardcoded on the function decorator, but num_workers is a CLI parameter that someone could set higher. If someone passes --num-workers 60, Modal silently caps at 50 containers and queues the rest. Not a big deal, but worth a comment or a runtime clamp so it's not surprising.
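
The runtime clamp could be as small as the following sketch (names are assumptions; `MAX_CONTAINERS` would mirror the cap on the Modal decorator):

```python
import warnings

MAX_CONTAINERS = 50  # must match max_containers on the Modal decorator

def clamp_workers(num_workers: int) -> int:
    """Clamp the CLI-provided worker count to Modal's container cap,
    warning loudly instead of silently queueing the excess workers."""
    if num_workers > MAX_CONTAINERS:
        warnings.warn(
            f"num_workers={num_workers} exceeds max_containers="
            f"{MAX_CONTAINERS}; clamping to {MAX_CONTAINERS}"
        )
        return MAX_CONTAINERS
    return num_workers
```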

2. Major: pipeline.yaml triggers twice per merge → double pipeline run

pipeline.yaml fires on every push to main with no commit-message filter:

on:
  push:
    branches: [main]

Every PR merge creates two pushes: the merge commit, then the version-bump commit from versioning.yaml. This means two full pipeline runs spawn on Modal simultaneously. We just caught this live — two build_datasets calls running at the same time (screenshots shared in Slack).

With 8 workers this was wasteful but tolerable. With 50 workers, it's 100 containers competing for the same volume. This should be fixed before or alongside this PR.

Suggested fix — add the same guard that push.yaml uses:

jobs:
  pipeline:
    runs-on: ubuntu-latest
    if: github.event.head_commit.message == 'Update package version'
    steps:
      ...

This way only the version-bump commit (which is the final, definitive push) triggers the pipeline. The merge commit is redundant since versioning always follows immediately.

@anth-volk
Collaborator Author

@baogorek Figures, that double-pipeline issue is the same one we hit live yesterday during a meeting with @juaristi22, when the docs build ran twice. Meeting with you soon; we'll catch up and decide how to handle it.

@baogorek
Collaborator

baogorek commented Apr 3, 2026

Update: the pipeline just failed on the post-merge run. Two issues to address in this PR or a companion:

1. ModuleNotFoundError: No module named 'policyengine_core' in coordinate_publish

At local_area.py:764:

from policyengine_us_data.calibration.calibration_utils import STATE_CODES

This triggers the full package __init__.py (from .datasets import *), which chains into policyengine_core. The Modal image for coordinate_publish doesn't have policyengine_core (or policyengine_us, which calibration_utils.py also imports).

STATE_CODES is just a FIPS→abbreviation dict. The simplest fix is to either:

  • Move STATE_CODES to a lightweight module with no heavy imports (e.g., policyengine_us_data/utils/state_codes.py)
  • Or inline the dict / import it from the `us` package, which is already available
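
The first option might look like this (a partial sketch; the module path is the one suggested above, and only a few of the 50+ entries are shown):

```python
# policyengine_us_data/utils/state_codes.py (hypothetical lightweight module)
# FIPS code -> USPS state abbreviation. No package-level imports, so
# importing this never pulls in policyengine_core or policyengine_us.
STATE_CODES = {
    "01": "AL",
    "06": "CA",
    "36": "NY",
    "48": "TX",
    # ... remaining states and DC
}
```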

2. Version mismatch: Run ID says 1.74.2 but publishes 1.74.1

The pipeline picked up a stale package version. Might be a Modal deploy caching issue, or the checkout not reflecting the version-bump commit.

The policyengine_core import is the immediate blocker — the pipeline can't get past coordinate_publish without it.
