10 changes: 7 additions & 3 deletions CLAUDE.md
@@ -198,16 +198,20 @@ worktree changes, the worktree edits will not be active. Always:
2. `./uw build`
3. Run your code / tests from there

### AI-Assisted Commit Attribution
When committing code developed with AI assistance, end the commit message with:
### AI-Assisted Attribution (Commits and PRs)
When committing code or creating pull requests with AI assistance, end the
message/body with:

```
Underworld development team with AI support from Claude Code
Underworld development team with AI support from [Claude Code](https://claude.com/claude-code)
```

(In commit messages, use the plain-text form without the markdown link.)

**Do NOT use**:
- `Co-Authored-By:` with a noreply email (useless for soliciting responses)
- Generic AI attribution without team context
- Emoji in PR descriptions

---

52 changes: 51 additions & 1 deletion docs/advanced/parallel-computing.md
@@ -18,6 +18,56 @@ Underworld3 uses PETSc for parallel operations, which means **you rarely need to

The main use of `uw.mpi.rank` is for conditional output/visualization.

## MPI + Thread Pools (Oversubscription)

When running with MPI, each rank can also spawn its own BLAS/OpenMP worker
threads. Left uncontrolled, the total number of runnable threads can far
exceed the available cores and performance can degrade severely.

Example: `mpirun -np 8` on a machine where OpenBLAS defaults to `10` threads
per process can create up to `80` compute threads, often far slower than one
thread per rank.
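The arithmetic behind that warning can be sketched in a few lines of shell
(the rank and thread counts are the illustrative ones above, not measured
values):

```bash
# 8 MPI ranks, each spawning a 10-thread OpenBLAS pool (illustrative numbers).
ranks=8
threads_per_rank=10
total=$((ranks * threads_per_rank))
echo "runnable compute threads: $total"   # prints: runnable compute threads: 80

# The cap policy avoids this by pinning each rank to one thread, e.g.
#   OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 mpirun -np 8 python model.py
```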

### Default Underworld3 Policy

Underworld3 now applies MPI-safe defaults (a thread pool size of `1`) unless
the user has explicitly set any of the following variables:

- `OMP_NUM_THREADS`
- `OPENBLAS_NUM_THREADS`
- `MKL_NUM_THREADS`
- `VECLIB_MAXIMUM_THREADS`
- `NUMEXPR_NUM_THREADS`

This happens in two places:

1. `./uw` launcher: sets defaults before Python starts.
2. `underworld3` import path: applies the same defaults for MPI runs if unset.
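The effect of that policy can be sketched in shell; the `cap` helper below is
hypothetical (not part of the launcher) and only mirrors the "set to `1` when
unset or empty, preserve explicit values" rule:

```bash
# Hypothetical helper mirroring the launcher's policy: cap a thread variable
# at 1 only when it is unset or empty, preserving explicit user settings.
cap() {
    var="$1"
    eval "val=\${$var:-}"
    if [ -z "$val" ]; then
        eval "export $var=1"
    fi
}

export OMP_NUM_THREADS=4   # explicitly set by the user: must be preserved
unset MKL_NUM_THREADS      # unset: should be capped

cap OMP_NUM_THREADS
cap MKL_NUM_THREADS
echo "$OMP_NUM_THREADS $MKL_NUM_THREADS"   # prints: 4 1
```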

### Runtime Warning

If running with MPI and any of the thread variables above are explicitly set
to values greater than `1`, Underworld3 prints a rank-0 warning about possible
oversubscription.

### User Controls

- Disable automatic thread caps:

```bash
export UW_DISABLE_THREAD_CAPS=1
```

- Suppress warning (keep your explicit thread settings):

```bash
export UW_SUPPRESS_THREAD_WARNING=1
```

### Recommended Practice

For most MPI benchmark and production jobs, keep `1` thread per rank unless
you are deliberately tuning a hybrid MPI+threads configuration.
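Concretely, a flat-MPI job script can export the caps itself, whether or not
the automatic defaults are active (the `run_model.py` name and `-np 8` rank
count are placeholders):

```bash
# Flat MPI launch: one BLAS/OpenMP thread per rank.
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
# mpirun -np 8 python run_model.py   # actual launch line, shown for illustration

echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"   # prints: OMP_NUM_THREADS=1
```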

## Parallel-Safe Output

### The Problem with Rank Conditionals
@@ -468,4 +518,4 @@ These operations require **ALL ranks** to participate:
4. **Collective operations must run on ALL ranks** - never inside rank conditionals
5. **Test with `mpirun -np N`** to catch issues early

The parallel safety system makes parallel programming in Underworld3 safer and more intuitive - collective operations are evaluated on all ranks automatically, preventing common deadlock scenarios!
The parallel safety system makes parallel programming in Underworld3 safer and more intuitive - collective operations are evaluated on all ranks automatically, preventing common deadlock scenarios!
26 changes: 26 additions & 0 deletions petsc-custom/build-petsc.sh
@@ -57,6 +57,27 @@ clone_petsc() {
    echo "Clone complete."
}

apply_patches() {
    echo "Applying UW3 patches to PETSc..."
    cd "$PETSC_DIR"

    # Fix ghost facet double-counting in boundary residual/integral assembly.
    # Without this, internal boundary natural BCs and BdIntegral produce
    # incorrect results in parallel (shared facets integrated on multiple ranks).
    # Submitted upstream: [TODO: add PETSc MR link when available]
    local patch="${SCRIPT_DIR}/patches/plexfem-ghost-facet-fix.patch"
    if [ -f "$patch" ]; then
        if git apply --check "$patch" 2>/dev/null; then
            git apply "$patch"
            echo "  Applied: plexfem-ghost-facet-fix.patch"
        else
            echo "  Skipped: plexfem-ghost-facet-fix.patch (already applied or conflict)"
        fi
    fi

    echo "Patches complete."
}

configure_petsc() {
    echo "Configuring PETSc with AMR tools..."
    cd "$PETSC_DIR"
@@ -147,6 +168,7 @@ show_help() {
    echo " build Build PETSc"
    echo " test Run PETSc tests"
    echo " petsc4py Build and install petsc4py"
    echo " patch Apply UW3 patches to PETSc source"
    echo " clean Remove PETSc directory"
    echo " help Show this help"
}
@@ -155,6 +177,7 @@ show_help() {
case "${1:-all}" in
    all)
        clone_petsc
        apply_patches
        configure_petsc
        build_petsc
        build_petsc4py
@@ -175,6 +198,9 @@ case "${1:-all}" in
    build)
        build_petsc
        ;;
    patch)
        apply_patches
        ;;
    test)
        test_petsc
        ;;
79 changes: 79 additions & 0 deletions petsc-custom/patches/PETSC_MR_DESCRIPTION.md
@@ -0,0 +1,79 @@
# PETSc Merge Request: Ghost Facet Fix

## Submission details

- **Fork**: https://gitlab.com/lmoresi/petsc
- **Branch**: `fix/plexfem-ghost-facet-boundary-residual`
- **Target**: `petsc/petsc` → `release`
- **Create MR**: https://gitlab.com/lmoresi/petsc/-/merge_requests/new?merge_request%5Bsource_branch%5D=fix%2Fplexfem-ghost-facet-boundary-residual

## Title

DMPlex: filter ghost facets from boundary residual and integral assembly

## Description

### Problem

`DMPlexComputeBdResidual_Internal` and `DMPlexComputeBdIntegral` construct
`facetIS` as all facets at depth `dim-1` but do not exclude ghost facets
(point SF leaves). The volume residual code in `DMPlexComputeResidualByKey`
checks the `"ghost"` label before calling `DMPlexVecSetClosure` (line 5355),
but the boundary residual path has no equivalent check.

For **external** boundaries this is benign: ghost facets on the domain
boundary still have `support[0]` pointing to a valid local cell, and
`DMLocalToGlobal(ADD_VALUES)` correctly resolves ownership. But for
**internal** boundaries — facets in the mesh interior that carry a
`DMLabel` value and a `DM_BC_NATURAL` condition — facets at partition
junctions are present on multiple ranks. Each rank independently
integrates the boundary flux via `PetscFEIntegrateBdResidual`, and
`DMLocalToGlobal(ADD_VALUES)` sums all contributions, causing
double-counting.

The same issue affects `DMPlexComputeBdIntegral`: ghost boundary facets
are included in the per-face integral sum, which is then `MPI_Allreduce`d,
again double-counting shared facets.

### Use case: internal boundaries in geodynamics

[Underworld3](https://github.com/underworldcode/underworld3) is a
Python/PETSc geodynamics framework that uses DMPlex FEM throughout.
Geophysical models commonly require traction (natural) boundary conditions
on **internal surfaces** — for example, a fault plane or a material
interface embedded within the mesh. These are set up by labeling interior
facets with a `DMLabel` and registering `DM_BC_NATURAL` via
`PetscDSAddBoundary` / `PetscWeakFormAddBdResidual`.

This works correctly in serial but produces O(1) errors whenever the
labeled internal boundary is split across partition boundaries, because
shared facets are integrated on multiple ranks.

### Fix

After obtaining `facetIS` from `DMLabelGetStratumIS(depthLabel, dim-1, ...)`,
filter out SF leaves using `ISDifference`. This ensures each boundary facet
is processed by exactly one rank (the owner).

This follows the same pattern already used in `DMPlexComputeBdFaceGeomFVM`
(plexfem.c, around line 1087) which calls `PetscFindInt(face, nleaves, leaves, &loc)`
to skip ghost faces during FVM face flux computation.

### Testing

Tested with Underworld3 Stokes solver using `DM_BC_NATURAL` on an internal
boundary (annular mesh with labeled interior circle). Results match serial
to 10+ significant figures across all processor counts tested (1, 2, 4, 8).
Before the fix, any partition that splits the internal boundary produces
O(1) errors from double-counted boundary contributions.

The patch applies cleanly to all PETSc releases from v3.18.0 through
v3.24.5 — the affected code has been structurally unchanged across these
versions.

## Reference examples

- Stokes Green's function with internal boundary natural BCs:
https://github.com/underworldcode/underworld3/blob/development/docs/examples/fluid_mechanics/advanced/Ex_Stokes_Annulus_Kernels.py
- Cartesian variant:
https://github.com/underworldcode/underworld3/blob/development/docs/examples/fluid_mechanics/advanced/Ex_Stokes_Cartesian_Kernels.py
59 changes: 59 additions & 0 deletions petsc-custom/patches/plexfem-ghost-facet-fix.patch
@@ -0,0 +1,59 @@
diff --git a/src/dm/impls/plex/plexfem.c b/src/dm/impls/plex/plexfem.c
index 0299f569070..26e6432585d 100644
--- a/src/dm/impls/plex/plexfem.c
+++ b/src/dm/impls/plex/plexfem.c
@@ -2874,6 +2874,26 @@ PetscErrorCode DMPlexComputeBdIntegral(DM dm, Vec X, DMLabel label, PetscInt num
PetscCall(DMPlexGetDepthLabel(dm, &depthLabel));
PetscCall(DMGetDimension(dm, &dim));
PetscCall(DMLabelGetStratumIS(depthLabel, dim - 1, &facetIS));
+ /* Filter out ghost facets (SF leaves) so that each boundary facet is only
+ counted on one rank. Without this, shared facets at partition boundaries
+ are integrated on multiple ranks, causing double-counting after MPI sum. */
+ if (facetIS) {
+ PetscSF sf;
+ PetscInt nleaves;
+ const PetscInt *leaves;
+
+ PetscCall(DMGetPointSF(dm, &sf));
+ PetscCall(PetscSFGetGraph(sf, NULL, &nleaves, &leaves, NULL));
+ if (nleaves > 0 && leaves) {
+ IS leafIS, ownedFacetIS;
+
+ PetscCall(ISCreateGeneral(PETSC_COMM_SELF, nleaves, leaves, PETSC_USE_POINTER, &leafIS));
+ PetscCall(ISDifference(facetIS, leafIS, &ownedFacetIS));
+ PetscCall(ISDestroy(&leafIS));
+ PetscCall(ISDestroy(&facetIS));
+ facetIS = ownedFacetIS;
+ }
+ }
PetscCall(DMGetLocalSection(dm, &section));
PetscCall(PetscSectionGetNumFields(section, &Nf));
/* Get local solution with boundary values */
@@ -5113,6 +5133,27 @@ static PetscErrorCode DMPlexComputeBdResidual_Internal(DM dm, Vec locX, Vec locX
PetscCall(DMPlexGetDepthLabel(dm, &depthLabel));
PetscCall(DMGetDimension(dm, &dim));
PetscCall(DMLabelGetStratumIS(depthLabel, dim - 1, &facetIS));
+ /* Filter out ghost facets (SF leaves) so that boundary residual contributions
+ from shared facets are only assembled on the owning rank. Without this,
+ internal boundary natural BCs at partition junctions get double-counted
+ because LocalToGlobal with ADD_VALUES sums contributions from all ranks. */
+ if (facetIS) {
+ PetscSF sf;
+ PetscInt nleaves;
+ const PetscInt *leaves;
+
+ PetscCall(DMGetPointSF(dm, &sf));
+ PetscCall(PetscSFGetGraph(sf, NULL, &nleaves, &leaves, NULL));
+ if (nleaves > 0 && leaves) {
+ IS leafIS, ownedFacetIS;
+
+ PetscCall(ISCreateGeneral(PETSC_COMM_SELF, nleaves, leaves, PETSC_USE_POINTER, &leafIS));
+ PetscCall(ISDifference(facetIS, leafIS, &ownedFacetIS));
+ PetscCall(ISDestroy(&leafIS));
+ PetscCall(ISDestroy(&facetIS));
+ facetIS = ownedFacetIS;
+ }
+ }
PetscCall(PetscDSGetNumBoundary(prob, &numBd));
for (bd = 0; bd < numBd; ++bd) {
PetscWeakForm wf;
61 changes: 61 additions & 0 deletions src/underworld3/__init__.py
@@ -71,6 +71,8 @@
from mpi4py import MPI # for initialising MPI
import petsc4py as _petsc4py
import sys
import os as _os
import warnings as _warnings

_petsc4py.init(sys.argv)

@@ -88,6 +90,65 @@
    # Handle Sphinx autodoc mocking - PETSc mock objects don't support these operations
    pass


def _parse_thread_env(name: str):
    """Return integer thread count from environment variable, or None."""
    raw = _os.environ.get(name)
    if raw is None or str(raw).strip() == "":
        return None
    try:
        return int(float(raw))
    except (TypeError, ValueError):
        return None


def _apply_mpi_thread_policy():
    """
    Default thread policy for MPI runs.

    Caps common thread pools to 1 unless explicitly set by the user. This
    prevents MPI+BLAS oversubscription (large performance regressions).
    """
    if MPI.COMM_WORLD.Get_size() <= 1:
        return

    if _os.environ.get("UW_DISABLE_THREAD_CAPS", "0").lower() in ("1", "true", "yes", "on"):
        return

    thread_vars = (
        "OMP_NUM_THREADS",
        "OPENBLAS_NUM_THREADS",
        "MKL_NUM_THREADS",
        "VECLIB_MAXIMUM_THREADS",
        "NUMEXPR_NUM_THREADS",
    )

    for var in thread_vars:
        if _os.environ.get(var, "").strip() == "":
            _os.environ[var] = "1"

    if MPI.COMM_WORLD.Get_rank() == 0 and _os.environ.get("UW_SUPPRESS_THREAD_WARNING", "0").lower() not in (
        "1",
        "true",
        "yes",
        "on",
    ):
        oversub = {var: _parse_thread_env(var) for var in thread_vars}
        oversub = {k: v for k, v in oversub.items() if v is not None and v > 1}
        if oversub:
            items = ", ".join(f"{k}={v}" for k, v in oversub.items())
            _warnings.warn(
                "MPI run with thread pools > 1 detected "
                f"({items}). This may cause severe oversubscription. "
                "Set thread counts to 1 per rank, or set UW_SUPPRESS_THREAD_WARNING=1 "
                "if this is intentional.",
                RuntimeWarning,
                stacklevel=2,
            )


_apply_mpi_thread_policy()

# Version is derived from git tags via setuptools_scm
# Priority: 1) _version.py (generated by setuptools_scm at build time)
# 2) importlib.metadata (for installed packages)