Reduce MPI oversubscription: default BLAS/OpenMP threads to 1 per rank #74

gthyagi wants to merge 9 commits into underworldcode:development from
Conversation
Review: overlap with PR #75

Thanks @gthyagi — this is good defensive work. A few notes on how it fits with the parallel scaling investigation:

Root cause is MPICH, not just thread oversubscription

The thread capping helps (it was burning 200%+ CPU on single-proc runs due to OpenBLAS defaults), but the catastrophic parallel scaling regression you reported in #68 is actually caused by MPICH 4.x collective operations on macOS. We confirmed this with pure C MPI benchmarks — no Python, no PETSc, no OpenBLAS involved. See the detailed diagnosis in #68. PR #75 (…)

[activation.env]
OMP_NUM_THREADS = "1"
OPENBLAS_NUM_THREADS = "1"

What to keep from this PR

- The …
- The …
- The docs update to …

Suggested path forward
Also — please try the OpenMPI build on your machine to confirm the scaling fix. Instructions are in the #68 comment and in PR #75.

Underworld development team with AI support from Claude Code
@gthyagi - I didn't have much luck fixing the scaling issue when I tried knocking back the thread over-subscription, but I was not as thorough with the number of settings. Does this sort out the scaling for you? We can combine the PRs to grab all the effective fixes. The main thing for you is to rebase onto development.
@lmoresi I’ll implement the proposed path-forward steps. I agree that the changes in the uw launch script are redundant. Also, the thread over-subscription fix alone did not resolve the negative MPI scaling—it only partially addressed the CPU spike issue. The CPU spikes appear to result from a combination of MPICH communication behaviour and multiple thread subscriptions, which is why I referenced this PR in the context of the negative MPI scaling.
Wrap PETSc's DMPlexComputeBdIntegral to provide standalone boundary/surface
integral capability. In 2D these are line integrals over boundary edges;
in 3D they are surface integrals over boundary faces (both codimension-1).
The integrand may reference the outward unit normal via mesh.Gamma / mesh.Gamma_N.
Internal boundaries (e.g. AnnulusInternalBoundary, BoxInternalBoundary) work
out of the box through the same DMLabel infrastructure.
New API:
bd_int = uw.maths.BdIntegral(mesh, fn=1.0, boundary="Top")
length = bd_int.evaluate()
Tests cover: constant/coordinate/sympy/mesh-variable integrands, normal-vector
integrands, perimeter checks, BoxInternalBoundary, and AnnulusInternalBoundary
circumference.
Underworld development team with AI support from Claude Code
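The kind of quantity this wrapper computes can be shown with a tiny standalone sketch (no underworld3 or PETSc; the discretization is purely illustrative): a codimension-1 line integral of fn = 1 over a discretized circular boundary recovers the circumference, analogous to the perimeter and circumference checks in the tests above.

```python
# Miniature analogue of a 2D boundary integral: sum fn(midpoint) * edge_length
# over the edges of a polygonal approximation to a unit circle.
# The segment count n is arbitrary, chosen only for accuracy.
import math

n = 100_000
pts = [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
       for k in range(n)]

def bd_integral(fn):
    """Line integral of fn over the closed polygonal boundary."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:] + pts[:1]):
        length = math.hypot(x1 - x0, y1 - y0)
        mid = ((x0 + x1) / 2, (y0 + y1) / 2)
        total += fn(*mid) * length
    return total

# Integrating fn = 1 over the boundary of the unit disc gives 2*pi.
circumference = bd_integral(lambda x, y: 1.0)
assert math.isclose(circumference, 2 * math.pi, rel_tol=1e-6)
```

The real wrapper does the same thing over DMPlex boundary facets, with the integrand supplied symbolically and the normal available via mesh.Gamma.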
Use the same team attribution wording for both commits and PR descriptions. No emoji in PR bodies. Underworld development team with AI support from Claude Code
DMPlexComputeBdIntegral (unlike DMPlexComputeIntegralFEM) does not perform an MPI reduction — it returns only the local process contribution. Add MPIU_Allreduce in UW_DMPlexComputeBdIntegral to sum across ranks. Found by testing with mpirun -np 5..8 on AnnulusInternalBoundary. Note: a pre-existing PETSc internal-facet orientation issue can cause small errors (~1.5%) on internal boundaries at certain partition counts (e.g. np=6 with the default annulus mesh). External boundaries are unaffected. This is upstream of our wrapper. Underworld development team with AI support from Claude Code
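The missing-reduction bug can be illustrated without PETSc or MPI at all: each rank holds only its local contribution, and the global integral is the sum across ranks, which is what the added MPIU_Allreduce provides. A minimal pure-Python sketch, with made-up per-rank values that total the circumference 2*pi of a hypothetical circular boundary:

```python
import math

# Hypothetical local boundary-integral contributions on 4 ranks.
local_contributions = [1.2, 2.1, 1.5, 2.0 * math.pi - 4.8]

def allreduce_sum(values):
    """Stand-in for MPIU_Allreduce(..., MPIU_SUM, ...): every rank gets the sum."""
    total = sum(values)
    return [total] * len(values)

# Without the reduction, each rank reports only its own piece (the bug):
assert not any(math.isclose(v, 2.0 * math.pi) for v in local_contributions)

# With the reduction, every rank sees the full circumference (the fix):
reduced = allreduce_sum(local_contributions)
assert all(math.isclose(v, 2.0 * math.pi) for v in reduced)
```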
DMPlexComputeBdIntegral iterates over ALL label stratum points including ghost facets, so shared internal boundary facets get integrated on both the owning rank and the ghost rank. External boundaries are unaffected because external facets have only one supporting cell (no ghosting). The fix creates a temporary DMLabel containing only owned (non-ghost) boundary points by checking against the PetscSF leaf set, then passes this filtered label to DMPlexComputeBdIntegral. Also restructures tests to use lazy initialization for BoxInternalBoundary (which has a pre-existing MPI bug) so it doesn't block other tests under mpirun. Verified at np=1,5,6,7,8 — all produce identical results. Underworld development team with AI support from Claude Code
DMPlexComputeBdResidual_Internal and DMPlexComputeBdIntegral do not exclude ghost facets (SF leaves) from the facet IS, unlike the volume residual code, which checks the ghost label. For internal boundaries at partition junctions, this causes shared facets to be integrated on multiple ranks, producing O(1) errors after MPI summation. The patch filters ghost facets using ISDifference, following the same pattern used by DMPlexComputeBdFaceGeomFVM for FVM face flux. It applies cleanly to PETSc v3.18.0 through v3.24.5.

Includes:
- petsc-custom/patches/plexfem-ghost-facet-fix.patch
- build-petsc.sh integration (auto-applies after clone)
- MR description for upstream PETSc submission

Upstream MR branch pushed to gitlab.com/lmoresi/petsc (fix/plexfem-ghost-facet-boundary-residual)

Underworld development team with AI support from Claude Code
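The double-counting and its fix can be mimicked with plain Python sets: treat the SF leaves as the ghost facets on each rank and subtract them before integrating, the same set-difference pattern the ISDifference patch uses. All facet IDs, lengths, and the two-rank partition below are hypothetical.

```python
# facet -> length; an internal boundary of total length 4.0
facet_length = {10: 1.0, 11: 1.0, 12: 1.0, 13: 1.0}

# Rank-local label strata: facets 11 and 12 sit on the partition junction
# and appear on both ranks (once owned, once as a ghost copy).
rank_label_points = [{10, 11, 12}, {11, 12, 13}]
# SF leaves = points each rank holds as ghosts (owned by the other rank).
rank_sf_leaves = [{12}, {11}]

def local_integral(points):
    return sum(facet_length[p] for p in points)

# Buggy behaviour: integrate every label point, then sum across ranks.
buggy = sum(local_integral(pts) for pts in rank_label_points)
assert buggy == 6.0  # facets 11 and 12 double-counted: an O(1) error

# Fixed behaviour: remove ghosts per rank first (ISDifference analogue).
fixed = sum(local_integral(pts - ghosts)
            for pts, ghosts in zip(rank_label_points, rank_sf_leaves))
assert fixed == 4.0  # each shared facet counted exactly once
```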
PETSc changed DMPlexComputeBdIntegral from void (*func)(...) to void (**funcs)(...) in v3.22.0. CI uses PETSc 3.21.5 (pinned in environment.yaml), so the wrapper must handle both signatures. Underworld development team with AI support from Claude Code
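One way to handle both calling conventions is to branch on the PETSc version when marshalling the integrand argument. A schematic Python sketch (the helper name and version tuples are illustrative, not the actual wrapper code):

```python
def pack_integrand(petsc_version, integrand):
    """Return the argument shape the underlying C call expects:
    a single function pointer before v3.22.0, an array of them after."""
    if petsc_version >= (3, 22, 0):
        return [integrand]   # void (**funcs)(...)
    return integrand         # void (*func)(...)

# CI pins PETSc 3.21.5, so both branches must keep working.
assert pack_integrand((3, 21, 5), "f") == "f"
assert pack_integrand((3, 24, 5), "f") == ["f"]
```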
…or bug) Underworld development team with AI support from Claude Code
Apply MPI-safe default thread caps in the uw launcher and add a runtime oversubscription warning/policy in underworld3 import logic; document behavior and override controls in parallel-computing docs. Underworld development team with AI support from Claude Code
Force-pushed from 2c1da1f to d42f6b2
@lmoresi I have rebased onto the development branch and tested the OpenMPI branch on Mac. Both PRs appear ready to merge.

This PR also includes the boundary-integral branch, which I am still testing. I created a new PR (#76) that cherry-picks only the BLAS/OpenMP thread changes. Therefore, this PR is no longer needed and will be closed.
Problem
On many systems (especially with OpenBLAS defaults), each MPI rank can spawn multiple BLAS/OpenMP threads. This causes oversubscription (e.g. np=8 with 10 BLAS threads => up to 80 runnable threads), resulting in poor scaling and large walltime regressions.

What this PR changes
- The ./uw launcher now sets safe defaults for common thread-pool env vars only if they are not already set: OMP_NUM_THREADS=1, OPENBLAS_NUM_THREADS=1, MKL_NUM_THREADS=1, VECLIB_MAXIMUM_THREADS=1, NUMEXPR_NUM_THREADS=1.
- underworld3/__init__.py applies the same MPI-safe default policy at import time for MPI runs (size > 1) when vars are unset, and warns when they are set >1.
- Documents the behaviour in docs/advanced/parallel-computing.md.

Why this approach
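The only-if-unset behaviour is what `os.environ.setdefault` provides. A minimal sketch of the launcher-side policy (the variable list is from this PR; the function name is illustrative):

```python
import os

# Thread-pool env vars capped by the launcher, per this PR.
_THREAD_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)

def apply_thread_caps(env=os.environ):
    """Set each var to "1" only if the user has not already set it."""
    for var in _THREAD_VARS:
        env.setdefault(var, "1")

# User-provided values are respected; unset vars are capped.
env = {"OMP_NUM_THREADS": "4"}
apply_thread_caps(env)
assert env["OMP_NUM_THREADS"] == "4"
assert env["OPENBLAS_NUM_THREADS"] == "1"
```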
Override controls
- Disable the default thread caps: UW_DISABLE_THREAD_CAPS=1
- Suppress the warning (for deliberate >1 settings): UW_SUPPRESS_THREAD_WARNING=1

Scope and compatibility
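Put together, the import-time policy (cap when unset, warn on explicit >1, honour both opt-outs) could look roughly like this. Only the env-var names come from the PR; the helper name, the trimmed variable list, and the message wording are illustrative:

```python
import warnings

def thread_policy(env, mpi_size):
    """Sketch: cap unset thread vars for MPI runs (size > 1), warn on
    explicit >1 values, and honour the two override switches."""
    if mpi_size <= 1 or env.get("UW_DISABLE_THREAD_CAPS") == "1":
        return
    for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        value = env.get(var)
        if value is None:
            env[var] = "1"  # MPI-safe default
        elif int(value) > 1 and env.get("UW_SUPPRESS_THREAD_WARNING") != "1":
            warnings.warn(
                f"{var}={value} with {mpi_size} MPI ranks may oversubscribe cores"
            )

# Explicit >1 setting triggers the warning; the unset var is capped.
env = {"OMP_NUM_THREADS": "8"}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    thread_policy(env, mpi_size=4)
assert env["OPENBLAS_NUM_THREADS"] == "1"
assert len(caught) == 1
```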
Underworld development team with AI support from Claude Code