
Scale H5 pipeline to 50 workers at 1 CPU each#684

Open
anth-volk wants to merge 2 commits into main from fix/h5-pipeline-improvements

Conversation

@anth-volk
Collaborator

Fixes #683

Summary

  • Increase default H5 workers from 8 to 50 (~10 items per worker instead of ~60)
  • Drop worker CPU from 4 to 1 (single-threaded numpy work doesn't use extra cores)
  • Add max_containers=50 safety cap on the Modal function decorator
  • Update defaults in coordinate_publish, run_pipeline, CLI entrypoints, and pipeline.yaml workflow
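
A minimal sketch of what the decorator change might look like. The app name and function name are assumptions (the real ones live in the repo); `cpu` and `max_containers` are real parameters on Modal's `@app.function` decorator:

```
import modal

app = modal.App("h5-pipeline")  # app name is hypothetical

# Each worker now handles ~10 items; the numpy work is single-threaded,
# so 1 CPU per container is enough. max_containers caps total scale-out.
@app.function(cpu=1.0, max_containers=50)
def build_h5_shard(items: list[str]) -> None:
    ...  # actual H5 build logic lives in the repo
```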

Test plan

  • Dispatch pipeline after merge and verify 50 workers spin up
  • Verify H5 builds complete successfully at 1 CPU
  • Compare wall-clock time and cost against previous 8-worker runs

🤖 Generated with Claude Code

The H5 build work is single-threaded numpy, so 4 CPUs per worker were
wasted. With 8 workers, each processed ~60 items serially.

- Increase default workers from 8 to 50 (~10 items per worker)
- Drop worker CPU from 4 to 1 (saves 75% CPU cost)
- Add max_containers=50 as safety cap
- Wall-clock time drops from ~60min to ~12min
- Total CPU cost drops: 8×60min×4CPU → 50×12min×1CPU
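
The cost arithmetic above checks out as a quick back-of-the-envelope (CPU-minutes = workers × minutes × CPUs per worker), roughly a two-thirds reduction:

```python
# Old configuration: 8 workers x ~60 min x 4 CPUs each
old_cpu_minutes = 8 * 60 * 4    # 1920 CPU-minutes
# New configuration: 50 workers x ~12 min x 1 CPU each
new_cpu_minutes = 50 * 12 * 1   # 600 CPU-minutes

reduction = 1 - new_cpu_minutes / old_cpu_minutes
print(old_cpu_minutes, new_cpu_minutes, reduction)
```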

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@anth-volk anth-volk marked this pull request as ready for review April 2, 2026 19:35
@anth-volk anth-volk requested a review from juaristi22 April 2, 2026 19:35
@baogorek
Collaborator

baogorek commented Apr 3, 2026

Hey Anthony, nice change — the 4 CPU → 1 CPU drop makes sense for single-threaded numpy work, and parallelizing across 50 workers should cut wall-clock time significantly.

Two things to flag:

1. Minor: max_containers vs num_workers mismatch

max_containers=50 is hardcoded on the function decorator, but num_workers is a CLI parameter that someone could set higher. If someone passes --num-workers 60, Modal silently caps at 50 containers and queues the rest. Not a big deal, but worth a comment or a runtime clamp so it's not surprising.
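
The runtime clamp could be as small as the following sketch (names are assumptions; `MAX_CONTAINERS` would mirror the cap on the Modal decorator):

```python
import warnings

MAX_CONTAINERS = 50  # must match max_containers on the Modal decorator

def clamp_workers(num_workers: int) -> int:
    """Clamp the CLI-provided worker count to Modal's container cap,
    warning loudly instead of silently queueing the excess workers."""
    if num_workers > MAX_CONTAINERS:
        warnings.warn(
            f"num_workers={num_workers} exceeds max_containers="
            f"{MAX_CONTAINERS}; clamping to {MAX_CONTAINERS}"
        )
        return MAX_CONTAINERS
    return num_workers
```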

2. Major: pipeline.yaml triggers twice per merge → double pipeline run

pipeline.yaml fires on every push to main with no commit-message filter:

on:
  push:
    branches: [main]

Every PR merge creates two pushes: the merge commit, then the version-bump commit from versioning.yaml. This means two full pipeline runs spawn on Modal simultaneously. We just caught this live — two build_datasets calls running at the same time (screenshots shared in Slack).

With 8 workers this was wasteful but tolerable. With 50 workers, it's 100 containers competing for the same volume. This should be fixed before or alongside this PR.

Suggested fix — add the same guard that push.yaml uses:

jobs:
  pipeline:
    runs-on: ubuntu-latest
    if: github.event.head_commit.message == 'Update package version'
    steps:
      ...

This way only the version-bump commit (which is the final, definitive push) triggers the pipeline. The merge commit is redundant since versioning always follows immediately.

@anth-volk
Collaborator Author

@baogorek Figures, that double-pipeline issue is the same one we hit live yesterday during a meeting with @juaristi22, when the docs build ran twice. Meeting with you soon; we'll catch up and decide how to handle it.

@baogorek
Collaborator

baogorek commented Apr 3, 2026

Update: the pipeline just failed on the post-merge run. Two issues to address in this PR or a companion:

1. ModuleNotFoundError: No module named 'policyengine_core' in coordinate_publish

At local_area.py:764:

from policyengine_us_data.calibration.calibration_utils import STATE_CODES

This triggers the full package __init__.py (from .datasets import *), which chains into policyengine_core. The Modal image for coordinate_publish doesn't have policyengine_core (or policyengine_us, which calibration_utils.py also imports).

STATE_CODES is just a FIPS→abbreviation dict. The simplest fix is to either:

  • Move STATE_CODES to a lightweight module with no heavy imports (e.g., policyengine_us_data/utils/state_codes.py)
  • Or inline the dict / import it from the `us` package, which is already available
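
The first option might look like this (a partial sketch; the module path is the one suggested above, and only a few of the 50+ entries are shown):

```python
# policyengine_us_data/utils/state_codes.py (hypothetical lightweight module)
# FIPS code -> USPS state abbreviation. No package-level imports, so
# importing this never pulls in policyengine_core or policyengine_us.
STATE_CODES = {
    "01": "AL",
    "06": "CA",
    "36": "NY",
    "48": "TX",
    # ... remaining states and DC
}
```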

2. Version mismatch: Run ID says 1.74.2 but publishes 1.74.1

The pipeline picked up a stale package version. Might be a Modal deploy caching issue, or the checkout not reflecting the version-bump commit.

The policyengine_core import is the immediate blocker — the pipeline can't get past coordinate_publish without it.
