Reduce MPI oversubscription: default BLAS/OpenMP threads to 1 per rank #74

gthyagi wants to merge 9 commits into underworldcode:development from
Conversation
Review: overlap with PR #75

Thanks @gthyagi — this is good defensive work. A few notes on how it fits with the parallel scaling investigation:

Root cause is MPICH, not just thread oversubscription

The thread capping helps (it was burning 200%+ CPU on single-proc runs due to OpenBLAS defaults), but the catastrophic parallel scaling regression you reported in #68 is actually caused by MPICH 4.x collective operations on macOS. We confirmed this with pure C MPI benchmarks — no Python, no PETSc, no OpenBLAS involved. See the detailed diagnosis in #68. PR #75 (…)

[activation.env]
OMP_NUM_THREADS = "1"
OPENBLAS_NUM_THREADS = "1"

What to keep from this PR

- The …
- The …
- The docs update to …

Suggested path forward
Also — please try the OpenMPI build on your machine to confirm the scaling fix. Instructions are in the #68 comment and in PR #75.

Underworld development team with AI support from Claude Code
@gthyagi - I didn't have much luck fixing the scaling issue when I tried knocking back the thread over-subscription, but I was not as thorough with the number of settings. Does this sort out the scaling for you? We can combine the PRs to grab all the effective fixes. The main thing for you is to rebase onto development.
@lmoresi I’ll implement the proposed path-forward steps. I agree that the changes in the uw launch script are redundant. Also, the thread over-subscription fix alone did not resolve the negative MPI scaling—it only partially addressed the CPU spike issue. The CPU spikes appear to result from a combination of MPICH communication behaviour and multiple thread subscriptions, which is why I referenced this PR in the context of the negative MPI scaling.
Wrap PETSc's DMPlexComputeBdIntegral to provide standalone boundary/surface
integral capability. In 2D these are line integrals over boundary edges;
in 3D they are surface integrals over boundary faces (both codimension-1).
The integrand may reference the outward unit normal via mesh.Gamma / mesh.Gamma_N.
Internal boundaries (e.g. AnnulusInternalBoundary, BoxInternalBoundary) work
out of the box through the same DMLabel infrastructure.
New API:
bd_int = uw.maths.BdIntegral(mesh, fn=1.0, boundary="Top")
length = bd_int.evaluate()
Tests cover: constant/coordinate/sympy/mesh-variable integrands, normal-vector
integrands, perimeter checks, BoxInternalBoundary, and AnnulusInternalBoundary
circumference.
Underworld development team with AI support from Claude Code
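The kind of quantity this wrapper computes can be shown with a tiny standalone sketch (no underworld3 or PETSc; the discretization is purely illustrative): a codimension-1 line integral of fn = 1 over a discretized circular boundary recovers the circumference, analogous to the perimeter and circumference checks in the tests above.

```python
# Miniature analogue of a 2D boundary integral: sum fn(midpoint) * edge_length
# over the edges of a polygonal approximation to a unit circle.
# The segment count n is arbitrary, chosen only for accuracy.
import math

n = 100_000
pts = [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
       for k in range(n)]

def bd_integral(fn):
    """Line integral of fn over the closed polygonal boundary."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:] + pts[:1]):
        length = math.hypot(x1 - x0, y1 - y0)
        mid = ((x0 + x1) / 2, (y0 + y1) / 2)
        total += fn(*mid) * length
    return total

# Integrating fn = 1 over the boundary of the unit disc gives 2*pi.
circumference = bd_integral(lambda x, y: 1.0)
assert math.isclose(circumference, 2 * math.pi, rel_tol=1e-6)
```

The real wrapper does the same thing over DMPlex boundary facets, with the integrand supplied symbolically and the normal available via mesh.Gamma.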
Use the same team attribution wording for both commits and PR descriptions. No emoji in PR bodies. Underworld development team with AI support from Claude Code
DMPlexComputeBdIntegral (unlike DMPlexComputeIntegralFEM) does not perform an MPI reduction — it returns only the local process contribution. Add MPIU_Allreduce in UW_DMPlexComputeBdIntegral to sum across ranks. Found by testing with mpirun -np 5..8 on AnnulusInternalBoundary. Note: a pre-existing PETSc internal-facet orientation issue can cause small errors (~1.5%) on internal boundaries at certain partition counts (e.g. np=6 with the default annulus mesh). External boundaries are unaffected. This is upstream of our wrapper. Underworld development team with AI support from Claude Code
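The missing-reduction bug can be illustrated without PETSc or MPI at all: each rank holds only its local contribution, and the global integral is the sum across ranks, which is what the added MPIU_Allreduce provides. A minimal pure-Python sketch, with made-up per-rank values that total the circumference 2*pi of a hypothetical circular boundary:

```python
import math

# Hypothetical local boundary-integral contributions on 4 ranks.
local_contributions = [1.2, 2.1, 1.5, 2.0 * math.pi - 4.8]

def allreduce_sum(values):
    """Stand-in for MPIU_Allreduce(..., MPIU_SUM, ...): every rank gets the sum."""
    total = sum(values)
    return [total] * len(values)

# Without the reduction, each rank reports only its own piece (the bug):
assert not any(math.isclose(v, 2.0 * math.pi) for v in local_contributions)

# With the reduction, every rank sees the full circumference (the fix):
reduced = allreduce_sum(local_contributions)
assert all(math.isclose(v, 2.0 * math.pi) for v in reduced)
```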
DMPlexComputeBdIntegral iterates over ALL label stratum points including ghost facets, so shared internal boundary facets get integrated on both the owning rank and the ghost rank. External boundaries are unaffected because external facets have only one supporting cell (no ghosting). The fix creates a temporary DMLabel containing only owned (non-ghost) boundary points by checking against the PetscSF leaf set, then passes this filtered label to DMPlexComputeBdIntegral. Also restructures tests to use lazy initialization for BoxInternalBoundary (which has a pre-existing MPI bug) so it doesn't block other tests under mpirun. Verified at np=1,5,6,7,8 — all produce identical results. Underworld development team with AI support from Claude Code
DMPlexComputeBdResidual_Internal and DMPlexComputeBdIntegral do not exclude ghost facets (SF leaves) from the facet IS, unlike the volume residual code, which checks the ghost label. For internal boundaries at partition junctions, this causes shared facets to be integrated on multiple ranks, producing O(1) errors after MPI summation. The patch filters ghost facets using ISDifference, following the same pattern used by DMPlexComputeBdFaceGeomFVM for FVM face flux. It applies cleanly to PETSc v3.18.0 through v3.24.5.

Includes:
- petsc-custom/patches/plexfem-ghost-facet-fix.patch
- build-petsc.sh integration (auto-applies after clone)
- MR description for upstream PETSc submission

Upstream MR branch pushed to gitlab.com/lmoresi/petsc (fix/plexfem-ghost-facet-boundary-residual)

Underworld development team with AI support from Claude Code
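The double-counting and its fix can be mimicked with plain Python sets: treat the SF leaves as the ghost facets on each rank and subtract them before integrating, the same set-difference pattern the ISDifference patch uses. All facet IDs, lengths, and the two-rank partition below are hypothetical.

```python
# facet -> length; an internal boundary of total length 4.0
facet_length = {10: 1.0, 11: 1.0, 12: 1.0, 13: 1.0}

# Rank-local label strata: facets 11 and 12 sit on the partition junction
# and appear on both ranks (once owned, once as a ghost copy).
rank_label_points = [{10, 11, 12}, {11, 12, 13}]
# SF leaves = points each rank holds as ghosts (owned by the other rank).
rank_sf_leaves = [{12}, {11}]

def local_integral(points):
    return sum(facet_length[p] for p in points)

# Buggy behaviour: integrate every label point, then sum across ranks.
buggy = sum(local_integral(pts) for pts in rank_label_points)
assert buggy == 6.0  # facets 11 and 12 double-counted: an O(1) error

# Fixed behaviour: remove ghosts per rank first (ISDifference analogue).
fixed = sum(local_integral(pts - ghosts)
            for pts, ghosts in zip(rank_label_points, rank_sf_leaves))
assert fixed == 4.0  # each shared facet counted exactly once
```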
PETSc changed DMPlexComputeBdIntegral from void (*func)(...) to void (**funcs)(...) in v3.22.0. CI uses PETSc 3.21.5 (pinned in environment.yaml), so the wrapper must handle both signatures. Underworld development team with AI support from Claude Code
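One way to handle both calling conventions is to branch on the PETSc version when marshalling the integrand argument. A schematic Python sketch (the helper name and version tuples are illustrative, not the actual wrapper code):

```python
def pack_integrand(petsc_version, integrand):
    """Return the argument shape the underlying C call expects:
    a single function pointer before v3.22.0, an array of them after."""
    if petsc_version >= (3, 22, 0):
        return [integrand]   # void (**funcs)(...)
    return integrand         # void (*func)(...)

# CI pins PETSc 3.21.5, so both branches must keep working.
assert pack_integrand((3, 21, 5), "f") == "f"
assert pack_integrand((3, 24, 5), "f") == ["f"]
```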
…or bug) Underworld development team with AI support from Claude Code
Apply MPI-safe default thread caps in the uw launcher and add a runtime oversubscription warning/policy in underworld3 import logic; document behavior and override controls in parallel-computing docs. Underworld development team with AI support from Claude Code
Force-pushed from 2c1da1f to d42f6b2
@lmoresi I have rebased onto the development branch and tested the OpenMPI branch on Mac. Both PRs appear ready to merge.

This PR also includes the boundary-integral branch, which I am still testing. I created a new PR (#76) that cherry-picks only the BLAS/OpenMP thread changes. Therefore, this PR is no longer needed and will be closed.
Problem
On many systems (especially with OpenBLAS defaults), each MPI rank can spawn multiple BLAS/OpenMP threads. This causes oversubscription (e.g. np=8 with 10 BLAS threads => up to 80 runnable threads), resulting in poor scaling and large walltime regressions.

What this PR changes
- The ./uw launcher now sets safe defaults for common thread-pool env vars only if they are not already set: OMP_NUM_THREADS=1, OPENBLAS_NUM_THREADS=1, MKL_NUM_THREADS=1, VECLIB_MAXIMUM_THREADS=1, NUMEXPR_NUM_THREADS=1.
- underworld3/__init__.py applies the same MPI-safe default policy at import time for MPI runs (size > 1) when vars are unset, and warns when they are set >1.
- Documents the behaviour in docs/advanced/parallel-computing.md.

Why this approach
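The only-if-unset behaviour is what `os.environ.setdefault` provides. A minimal sketch of the launcher-side policy (the variable list is from this PR; the function name is illustrative):

```python
import os

# Thread-pool env vars capped by the launcher, per this PR.
_THREAD_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)

def apply_thread_caps(env=os.environ):
    """Set each var to "1" only if the user has not already set it."""
    for var in _THREAD_VARS:
        env.setdefault(var, "1")

# User-provided values are respected; unset vars are capped.
env = {"OMP_NUM_THREADS": "4"}
apply_thread_caps(env)
assert env["OMP_NUM_THREADS"] == "4"
assert env["OPENBLAS_NUM_THREADS"] == "1"
```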
Override controls
- Disable the default thread caps: UW_DISABLE_THREAD_CAPS=1
- Suppress the warning (for deliberate >1 settings): UW_SUPPRESS_THREAD_WARNING=1

Scope and compatibility
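Put together, the import-time policy (cap when unset, warn on explicit >1, honour both opt-outs) could look roughly like this. Only the env-var names come from the PR; the helper name, the trimmed variable list, and the message wording are illustrative:

```python
import warnings

def thread_policy(env, mpi_size):
    """Sketch: cap unset thread vars for MPI runs (size > 1), warn on
    explicit >1 values, and honour the two override switches."""
    if mpi_size <= 1 or env.get("UW_DISABLE_THREAD_CAPS") == "1":
        return
    for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        value = env.get(var)
        if value is None:
            env[var] = "1"  # MPI-safe default
        elif int(value) > 1 and env.get("UW_SUPPRESS_THREAD_WARNING") != "1":
            warnings.warn(
                f"{var}={value} with {mpi_size} MPI ranks may oversubscribe cores"
            )

# Explicit >1 setting triggers the warning; the unset var is capped.
env = {"OMP_NUM_THREADS": "8"}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    thread_policy(env, mpi_size=4)
assert env["OPENBLAS_NUM_THREADS"] == "1"
assert len(caught) == 1
```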
Underworld development team with AI support from Claude Code