
Reduce MPI oversubscription: default BLAS/OpenMP threads to 1 per rank #74

Closed

gthyagi wants to merge 9 commits into underworldcode:development from gthyagi:feature/boundary-integrals

Conversation

gthyagi (Contributor) commented Mar 7, 2026

Problem

On many systems (especially with OpenBLAS defaults), each MPI rank can spawn multiple BLAS/OpenMP threads. This causes oversubscription (e.g. np=8 with 10 BLAS threads => up to 80 runnable threads), resulting in poor scaling and large walltime regressions.

What this PR changes

  1. ./uw launcher now sets safe defaults for common thread-pool env vars only if they are not already set:
    • OMP_NUM_THREADS=1
    • OPENBLAS_NUM_THREADS=1
    • MKL_NUM_THREADS=1
    • VECLIB_MAXIMUM_THREADS=1
    • NUMEXPR_NUM_THREADS=1
  2. underworld3/__init__.py applies the same MPI-safe default policy at import time for MPI runs (size > 1) when vars are unset.
  3. Adds a rank-0 runtime warning when MPI is used and any thread-pool variable is explicitly set to a value greater than 1.
  4. Documents the policy and override controls in docs/advanced/parallel-computing.md.
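As a rough sketch of items 2 and 3 above, the import-time policy could look like the following; the function name and warning text are illustrative, not the actual UW3 code:

```python
import os
import warnings

_THREAD_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)

def apply_thread_caps(comm_size, comm_rank):
    """Cap BLAS/OpenMP thread pools for MPI runs unless the user opted out."""
    if os.environ.get("UW_DISABLE_THREAD_CAPS") == "1" or comm_size <= 1:
        return
    for var in _THREAD_VARS:
        # setdefault only fills in unset variables; explicit settings win
        os.environ.setdefault(var, "1")
    if comm_rank == 0 and os.environ.get("UW_SUPPRESS_THREAD_WARNING") != "1":
        for var in _THREAD_VARS:
            value = os.environ.get(var, "1")
            if value.isdigit() and int(value) > 1:
                warnings.warn(
                    f"{var}={value}: more than 1 thread per MPI rank can "
                    "oversubscribe cores and hurt scaling"
                )
```

In the real `__init__.py` the communicator size and rank would come from mpi4py; the key point is that the caps must be applied before NumPy or PETSc initialise their BLAS thread pools, which is why the policy runs at import time.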

Why this approach

  • Fixes the performance pitfall for users by default.
  • Preserves user control: explicit env settings are respected.
  • Keeps behavior predictable across launcher and direct Python execution paths.

Override controls

  • Disable automatic caps:
    • UW_DISABLE_THREAD_CAPS=1
  • Suppress warning (while keeping explicit >1 settings):
    • UW_SUPPRESS_THREAD_WARNING=1

Scope and compatibility

  • No solver API changes.
  • No change to single-rank behavior.
  • MPI behavior is safer by default; advanced users can opt out.

Underworld development team with AI support from Claude Code

gthyagi requested a review from lmoresi as a code owner on March 7, 2026.
gthyagi changed the title from "perf(mpi): cap thread pools for MPI and document policy" to "Reduce MPI oversubscription: default BLAS/OpenMP threads to 1 per rank" on Mar 7, 2026.
lmoresi (Member) commented Mar 9, 2026

Review: overlap with PR #75

Thanks @gthyagi — this is good defensive work. A few notes on how it fits with the parallel scaling investigation:

Root cause is MPICH, not just thread oversubscription

The thread capping helps (it was burning 200%+ CPU on single-proc runs due to OpenBLAS defaults), but the catastrophic parallel scaling regression you reported in #68 is actually caused by MPICH 4.x collective operations on macOS. We confirmed this with pure C MPI benchmarks — no Python, no PETSc, no OpenBLAS involved. See the detailed diagnosis in #68.

PR #75 (feature/openmpi-platform-default) switches the default MPI to OpenMPI on macOS, which fixes the parallel scaling. It also includes the thread capping via pixi.toml activation environment:

[activation.env]
OMP_NUM_THREADS = "1"
OPENBLAS_NUM_THREADS = "1"

What to keep from this PR

The __init__.py safety net is genuinely useful — it protects users who run UW3 outside of pixi (e.g. pip install, or on clusters with system MPI). The pixi activation only covers pixi-managed environments.

The uw launcher changes will conflict with #75 since both modify the same file. The pixi activation approach in #75 makes the launcher-level capping redundant (pixi sets the env vars before any process launches), so the uw script portion here can be dropped.

The docs update to parallel-computing.md is worth keeping.

Suggested path forward

  1. Rebase this onto development (not feature/boundary-integrals)
  2. Drop the uw script changes (covered by pixi activation in #75)
  3. Keep the __init__.py thread policy and the docs update
  4. Merge #75 (Switch default MPI to OpenMPI on macOS, fixes #68) first, since it fixes the actual scaling regression; then this PR adds the Python-level safety net on top

Also — please try the OpenMPI build on your machine to confirm the scaling fix. Instructions are in the #68 comment and in PR #75.

Underworld development team with AI support from Claude Code

lmoresi (Member) commented Mar 9, 2026

@gthyagi - I didn't have much luck fixing the scaling issue when I tried knocking back the thread oversubscription, but I was not as thorough with the number of settings. Does this sort out the scaling for you?

We can combine the PRs to grab all the effective fixes. The main thing for you is to rebase onto development.

gthyagi (Contributor, Author) commented Mar 9, 2026

@lmoresi I’ll implement the proposed path-forward steps. I agree that the changes in the uw launch script are redundant.

Also, the thread over-subscription fix alone did not resolve the negative MPI scaling—it only partially addressed the CPU spike issue. The CPU spikes appear to result from a combination of MPICH communication behaviour and multiple thread subscriptions, which is why I referenced this PR in the context of the negative MPI scaling.

lmoresi and others added 9 commits March 10, 2026 14:21
Wrap PETSc's DMPlexComputeBdIntegral to provide standalone boundary/surface
integral capability. In 2D these are line integrals over boundary edges;
in 3D they are surface integrals over boundary faces (both codimension-1).

The integrand may reference the outward unit normal via mesh.Gamma / mesh.Gamma_N.
Internal boundaries (e.g. AnnulusInternalBoundary, BoxInternalBoundary) work
out of the box through the same DMLabel infrastructure.

New API:
    bd_int = uw.maths.BdIntegral(mesh, fn=1.0, boundary="Top")
    length = bd_int.evaluate()

Tests cover: constant/coordinate/sympy/mesh-variable integrands, normal-vector
integrands, perimeter checks, BoxInternalBoundary, and AnnulusInternalBoundary
circumference.

Underworld development team with AI support from Claude Code
Use the same team attribution wording for both commits and PR descriptions.
No emoji in PR bodies.

Underworld development team with AI support from Claude Code
DMPlexComputeBdIntegral (unlike DMPlexComputeIntegralFEM) does not
perform an MPI reduction — it returns only the local process contribution.
Add MPIU_Allreduce in UW_DMPlexComputeBdIntegral to sum across ranks.

Found by testing with mpirun -np 5..8 on AnnulusInternalBoundary.

Note: a pre-existing PETSc internal-facet orientation issue can cause
small errors (~1.5%) on internal boundaries at certain partition counts
(e.g. np=6 with the default annulus mesh). External boundaries are
unaffected. This is upstream of our wrapper.

Underworld development team with AI support from Claude Code
DMPlexComputeBdIntegral iterates over ALL label stratum points including
ghost facets, so shared internal boundary facets get integrated on both
the owning rank and the ghost rank. External boundaries are unaffected
because external facets have only one supporting cell (no ghosting).

The fix creates a temporary DMLabel containing only owned (non-ghost)
boundary points by checking against the PetscSF leaf set, then passes
this filtered label to DMPlexComputeBdIntegral.

Also restructures tests to use lazy initialization for BoxInternalBoundary
(which has a pre-existing MPI bug) so it doesn't block other tests under
mpirun.

Verified at np=1,5,6,7,8 — all produce identical results.

Underworld development team with AI support from Claude Code
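The ownership filter described in this commit reduces to a set difference against the SF leaf (ghost) points. A simplified sketch with hypothetical point numbers (the real code operates on a DMLabel and the PetscSF, not Python lists):

```python
def filter_owned_points(boundary_points, sf_leaf_points):
    """Keep only boundary points this rank owns (i.e. not ghost copies).

    sf_leaf_points are the PetscSF leaves: points whose owner is another
    rank, so integrating them here would double-count shared facets.
    """
    ghosts = set(sf_leaf_points)
    return [p for p in boundary_points if p not in ghosts]

# Example: facets 10-13 lie on the internal boundary, but 12 and 13 are
# ghosts mirrored from a neighbouring rank.
owned = filter_owned_points([10, 11, 12, 13], sf_leaf_points=[12, 13])
# owned == [10, 11]
```

External boundaries never hit this path because their facets have a single supporting cell and are therefore never ghosted.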
DMPlexComputeBdResidual_Internal and DMPlexComputeBdIntegral do not
exclude ghost facets (SF leaves) from the facet IS, unlike the volume
residual code which checks the ghost label. For internal boundaries
at partition junctions, this causes shared facets to be integrated on
multiple ranks, producing O(1) errors after MPI summation.

The patch filters ghost facets using ISDifference, following the same
pattern used by DMPlexComputeBdFaceGeomFVM for FVM face flux. It
applies cleanly to PETSc v3.18.0 through v3.24.5.

Includes:
- petsc-custom/patches/plexfem-ghost-facet-fix.patch
- build-petsc.sh integration (auto-applies after clone)
- MR description for upstream PETSc submission

Upstream MR branch pushed to gitlab.com/lmoresi/petsc
(fix/plexfem-ghost-facet-boundary-residual)

Underworld development team with AI support from Claude Code
PETSc changed DMPlexComputeBdIntegral from void (*func)(...) to
void (**funcs)(...) in v3.22.0. CI uses PETSc 3.21.5 (pinned in
environment.yaml), so the wrapper must handle both signatures.

Underworld development team with AI support from Claude Code
…or bug)

Underworld development team with AI support from Claude Code
Apply MPI-safe default thread caps in the uw launcher and add a runtime oversubscription warning/policy in underworld3 import logic; document behavior and override controls in parallel-computing docs.

Underworld development team with AI support from Claude Code
gthyagi force-pushed the feature/boundary-integrals branch from 2c1da1f to d42f6b2 on March 10, 2026 03:21.
gthyagi changed the base branch from feature/boundary-integrals to development on March 10, 2026 03:25.
gthyagi (Contributor, Author) commented Mar 10, 2026

@lmoresi I have rebased onto the development branch and tested the OpenMPI branch on Mac. Both PRs appear ready to merge.

gthyagi (Contributor, Author) commented Mar 10, 2026

This PR also includes the boundary-integral branch, which I am still testing. I created a new PR (#76) that cherry-picks only the BLAS/OpenMP thread changes. Therefore, this PR is no longer needed and will be closed.

@gthyagi gthyagi closed this Mar 10, 2026