diff --git a/CHANGELOG.md b/CHANGELOG.md index f9dafed..b81cc41 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -23,8 +23,9 @@ under their entry. `endpoints add/list/remove`, `doctor`, `install-claude`. - Multi-endpoint config schema (`hpc.endpoints.` with `endpoint_id`, `path_prefixes`, `timeout_seconds`) and config - discovery order: `$UXARRAY_MCP_CONFIG` → `~/.config/uxarray-mcp/config.yaml` - → `./config.yaml`. + discovery order: `$UXARRAY_MCP_CONFIG` → `./config.yaml` in the current + working directory → `~/.config/uxarray-mcp/config.yaml` → editable-install + repo fallback. - YAC remote build script (`scripts/hpc_build_yac.py`) and runtime fallback that loads `yac.core` via `importlib.machinery` when the upstream `__init__.py` is unconditional. diff --git a/README.md b/README.md index b1fe950..07ffa21 100644 --- a/README.md +++ b/README.md @@ -82,7 +82,10 @@ The ``uxarray-mcp`` CLI exposes: | ``install-claude`` | print or merge the Claude Desktop ``mcpServers`` block | Config is discovered in this order: ``$UXARRAY_MCP_CONFIG`` → -``~/.config/uxarray-mcp/config.yaml`` → ``./config.yaml`` (repo root). +``./config.yaml`` in the current working directory → +``~/.config/uxarray-mcp/config.yaml`` → the editable-install repo config +fallback. The project-local file wins inside a checkout so development +endpoints are not shadowed by an empty user config. ## Most Users Should Read These in Order diff --git a/conda/recipe/meta.yaml b/conda/recipe/meta.yaml index 7ec12e9..f92a692 100644 --- a/conda/recipe/meta.yaml +++ b/conda/recipe/meta.yaml @@ -23,6 +23,9 @@ requirements: - pip - uv-build >=0.9.26,<0.10.0 run: + # Core package only for the initial feedstock. Add optional HPC support as + # a second output/variant after globus-compute-sdk and academy-py solver + # behavior is validated on conda-forge. 
- python >={{ python_min }} - fastmcp >=3.4.0 - holoviews >=1.19.0 diff --git a/docs/architecture.html b/docs/architecture.html index 09bed58..ffe5627 100644 --- a/docs/architecture.html +++ b/docs/architecture.html @@ -100,7 +100,7 @@

UXarray MCP Server — Architecture Diagram

- Mesh-aware assistant · provenance on every output · local or Argonne Improv HPC via Globus Compute · dynamic tool registration
+ Mesh-aware assistant · provenance on every output · local or named HPC endpoints via Globus Compute · unified tool surface

@@ -205,7 +205,7 @@

UXarray MCP Server — Architecture Diagram

 Receive tool call run_scientific_agent(path)
- HPC tools only if endpoint configured
+ Remote execution selected per call
@@ -261,13 +261,13 @@

UXarray MCP Server — Architecture Diagram

 MPAS · UGRID · SCRIP · HEALPix → n_face · n_node · n_edge · format
-
+
- inspect_mesh_hpc( )
+ inspect_mesh(use_remote)
 ✓ _endpoint_is_ready( ) pre-flight remote/compute_functions.py
- Globus Executor → Improv cluster
+ Globus Executor → named endpoint
 File stays on HPC · timeout 300 s
@@ -348,13 +348,13 @@

UXarray MCP Server — Architecture Diagram

 Runs entirely on local machine _provenance: venue=local
-
+
 HPC Execution ✓ _endpoint_is_ready( ) pre-flight
- calculate_area_hpc( )
- calculate_zonal_mean_hpc( )
+ calculate_area(use_remote)
+ calculate_zonal_mean(use_remote)
 ⚠ auto-fallback if endpoint unreachable
@@ -446,15 +446,15 @@

UXarray MCP Server — Architecture Diagram

 Flow runs top → bottom, arrows crossing lanes are handoffs.
 Orange diamonds = routing decisions made automatically by the agent.
 Green boxes = compute runs on your local machine.
- Orange boxes = dispatched to Argonne Improv via Globus Compute — the file never leaves the cluster.
- Teal dashed boxes = steps that only activate under certain conditions (data_path provided, endpoint configured).
+ Orange boxes = dispatched to a named HPC endpoint via Globus Compute — the file never leaves the cluster.
+ Teal dashed boxes = steps that only activate under certain conditions (data_path provided, remote mode requested).
 Red dashed inside HPC boxes = automatic local fallback when the endpoint is unreachable.
- Dynamic registration: HPC tools (rows 10–13) only appear in Claude's tool list when endpoint_id is set in config.yaml.
+ Unified registration: tools are registered once; remote execution is selected with use_remote=True and optional endpoint names.
- All MCP Tools — 9 always registered · 4 conditional on endpoint_id
+ Representative MCP Tools — unified local / remote surface
@@ -469,14 +469,14 @@

UXarray MCP Server — Architecture Diagram

  # | Tool Name | Source File | Runs On | Domain Module | What It Does
  7 | get_execution_mode | execution_control.py | LOCAL | | Returns current execution mode (local / hpc / auto) and whether an HPC endpoint is configured.
  8 | set_execution_mode | execution_control.py | LOCAL | | Switch execution mode from the Claude UI without editing config.yaml directly.
  9 | run_scientific_agent | scientific_agent.py | AUTO | All 4 modules | Autonomous 4-stage pipeline: Analyze → Plan → Execute → Verify. Validation-gated. Returns full reasoning trace + provenance + artifacts.
- 10 | inspect_mesh_hpc | remote_tools.py | HPC* | mesh.py on HPC | Mesh inspection via Globus Compute on Improv. Pre-flight health check. Auto-fallback to local.
- 11 | calculate_area_hpc | remote_tools.py | HPC* | area.py on HPC | Face area calculation via Globus Compute. Pre-flight health check. Auto-fallback to local.
- 12 | inspect_variable_hpc | remote_tools.py | HPC* | variable.py on HPC | Variable inspection via Globus Compute. Pre-flight health check. Auto-fallback to local.
- 13 | calculate_zonal_mean_hpc | remote_tools.py | HPC* | zonal.py on HPC | Zonal mean via Globus Compute. Pre-flight health check. Auto-fallback to local.
+ 10 | inspect_mesh | remote_tools.py | HPC* | mesh.py on HPC | Mesh inspection with use_remote=True. Pre-flight health check. Auto-fallback to local.
+ 11 | calculate_area | remote_tools.py | HPC* | area.py on HPC | Face area calculation with use_remote=True. Pre-flight health check. Auto-fallback to local.
+ 12 | inspect_variable | remote_tools.py | HPC* | variable.py on HPC | Variable inspection with use_remote=True. Pre-flight health check. Auto-fallback to local.
+ 13 | calculate_zonal_mean | remote_tools.py | HPC* | zonal.py on HPC | Zonal mean with use_remote=True. Pre-flight health check. Auto-fallback to local.
- * HPC tools are only registered when endpoint_id is set in config.yaml. Without an endpoint, Claude's tool list shows 9 tools.
+ * Remote execution is selected per call with use_remote=True. Endpoint readiness controls remote dispatch and fallback.
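
The unified tool surface that this diagram change describes can be sketched as a single dispatcher. This is a minimal illustration under stated assumptions, not the code in `remote_tools.py`: `run_tool`, its callable parameters, the `"hpc"` venue label, and the `fallback_reason` key are hypothetical. Only `use_remote`, the pre-flight readiness check, the automatic local fallback, and `_provenance: venue=local` come from the diagram text.

```python
from typing import Any, Callable, Dict


def run_tool(
    local_fn: Callable[..., Dict[str, Any]],
    remote_fn: Callable[..., Dict[str, Any]],
    endpoint_is_ready: Callable[[], bool],
    *,
    use_remote: bool = False,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Hypothetical dispatcher: one tool name, execution venue chosen per call."""
    fallback_reason = None
    if use_remote:
        # Pre-flight check before dispatching anything to the endpoint.
        if endpoint_is_ready():
            result = dict(remote_fn(**kwargs))
            result["_provenance"] = {"venue": "hpc"}  # label is an assumption
            return result
        fallback_reason = "endpoint not ready"

    # Local path: either requested directly or reached through fallback.
    result = dict(local_fn(**kwargs))
    result["_provenance"] = {"venue": "local"}
    if fallback_reason is not None:
        result["_provenance"]["fallback_reason"] = fallback_reason
    return result


if __name__ == "__main__":
    def local(path):
        return {"n_face": 48, "path": path}

    def remote(path):
        return {"n_face": 48, "path": path}

    # Remote requested, but the (stubbed) endpoint is not ready: falls back to local.
    print(run_tool(local, remote, lambda: False, use_remote=True, path="mesh.nc"))
```

Registering one tool and passing `use_remote` keeps the tool list the same whether or not an endpoint is configured, which is the behavior the updated footnote describes.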
diff --git a/docs/architecture.md b/docs/architecture.md index 3fba47c..c79cdb8 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -114,6 +114,21 @@ of exposing separate HPC-only tool names. **Validation gating** — The scientific agent runs dataset validation before zonal mean. If validation fails, zonal mean is skipped rather than producing unreliable results. +## Maintenance Notes + +The current implementation favors a small number of broad tool modules while +the MCP surface is still evolving. That keeps related behavior easy to audit +for the first release, but the largest files should be split once the public +contracts settle: + +- `remote/compute_functions.py` should be divided by remote capability family + (inspection, plotting, vector calculus, remapping, diagnostics). +- `tools/advanced.py` should be divided into spatial, comparison, remapping, + temporal/ensemble, and export modules. + +Keep those refactors behavior-preserving and test-backed; they are polish and +maintainability work, not blockers for the core conda package. + ## Interactive Diagram An interactive architecture diagram is available at `docs/architecture.html` in the repository. diff --git a/docs/chrysalis.md b/docs/chrysalis.md index bef356a..be98c55 100644 --- a/docs/chrysalis.md +++ b/docs/chrysalis.md @@ -19,69 +19,44 @@ which hosts the E3SM next-generation mesh library. are sent as source code via `AllCodeStrategies` and only need `uxarray` + deps in the worker environment. - Login nodes **kill compute processes** — always use the Slurm backend. -- `unset PYTHONPATH` before every endpoint start — the conda `uxarray-yac` env - injects broken `pydantic_core` paths that crash workers. +- YAC remapping needs the Python 3.12 `uxarray-yac` environment plus YAC, MKL, + MPICH, NetCDF, and local shim library paths. Use + `scripts/chrysalis_endpoint.sh` instead of hand-writing those paths. 
+- If a remote probe times out after the endpoint is `registered`, inspect the + endpoint logs on Chrysalis with `scripts/chrysalis_endpoint.sh logs`. ## Worker Environment | Item | Value | |---|---| -| Venv | `~/venvs/globus-compute-py313` (Python 3.13) | +| UXarray/YAC env | `~/.conda/envs/uxarray-yac` (Python 3.12) | +| Endpoint helper venv | `~/venvs/globus-compute-py313` | | Slurm partition | `debug` (4h walltime, 20 nodes) | | Compute nodes | 251 GB RAM, 128 CPUs | | Endpoint name | `uxarray-chrysalis` | ## First-Time Setup +The checked-in helper script writes the endpoint profile, YAC runtime library +paths, and small BLAS/LAPACK shims needed by the current YAC build: + ```bash -# 1. Create Python 3.13 conda env -/gpfs/fs1/soft/chrysalis/manual/miniforge3/25.3.1/bin/conda create \ - -n gc-py313 python=3.13 -y - -# 2. Build the globus-compute venv -~/.conda/envs/gc-py313/bin/python -m venv ~/venvs/globus-compute-py313 -~/venvs/globus-compute-py313/bin/pip install \ - "globus-compute-endpoint==4.12.0" \ - uxarray xarray netCDF4 h5netcdf numpy matplotlib holoviews cartopy - -# 3. 
Configure the endpoint (Slurm-backed) -unset PYTHONPATH -~/venvs/globus-compute-py313/bin/globus-compute-endpoint configure uxarray-chrysalis - -cat > ~/.globus_compute/uxarray-chrysalis/user_config_template.yaml.j2 << 'EOF' -endpoint_setup: "" -engine: - type: GlobusComputeEngine - max_workers_per_node: 4 - provider: - type: SlurmProvider - partition: debug - nodes_per_block: 1 - init_blocks: 0 - min_blocks: 0 - max_blocks: 2 - walltime: "04:00:00" - worker_init: | - unset PYTHONPATH - launcher: - type: SrunLauncher -idle_heartbeats_soft: 10 -idle_heartbeats_hard: 5760 -EOF - -cat > ~/.globus_compute/uxarray-chrysalis/user_environment.yaml << 'EOF' -PYTHONPATH: "" -PATH: "/home//venvs/globus-compute-py313/bin:/usr/bin:/bin" -EOF +git clone https://github.com/UXARRAY/uxarray-mcp-server.git +cd uxarray-mcp-server +bash scripts/chrysalis_endpoint.sh configure slurm-debug +bash scripts/chrysalis_endpoint.sh check-yac ``` +The `check-yac` command runs a tiny Slurm job that imports `yac.core`, imports +UXarray's YAC helper, and remaps HEALPix zoom 2 to zoom 3. It should report +`yac_core_ok: true` and `remap_ok: true` before the endpoint is used by MCP. + ## Starting the Endpoint Run this every time you log in: ```bash -unset PYTHONPATH -~/venvs/globus-compute-py313/bin/globus-compute-endpoint start uxarray-chrysalis +bash scripts/chrysalis_endpoint.sh start ``` The endpoint prints its UUID. 
Add it to your private local config on your @@ -98,6 +73,8 @@ From your laptop after the endpoint is running: ```bash uv run python scripts/hpc_doctor.py --endpoint chrysalis --timeout-seconds 120 +uv run --extra hpc python scripts/yac_smoke_test.py \ + --endpoint chrysalis --timeout-seconds 300 ``` Or manually: @@ -110,6 +87,14 @@ print(validate_hpc_setup(endpoint='chrysalis', run_remote_probe=True, probe_timeout_seconds=120)) ``` +If the manager reports `registered` but worker probes time out, inspect the +remote side on Chrysalis: + +```bash +bash scripts/chrysalis_endpoint.sh logs +squeue -u "$USER" +``` + ## E3SM Next-Generation Ocean Meshes Available at `/lcrc/group/e3sm/ac.xylar/polaris_1.0/chrysalis/test_20260520/unified-mesh-topo-cull2/`: @@ -123,8 +108,23 @@ Available at `/lcrc/group/e3sm/ac.xylar/polaris_1.0/chrysalis/test_20260520/unif ## Troubleshooting -**`ENDPOINT_NOT_ONLINE`** — the Slurm debug job timed out (4h limit). Restart with `unset PYTHONPATH && ~/venvs/globus-compute-py313/bin/globus-compute-endpoint start uxarray-chrysalis`. +**`ENDPOINT_NOT_ONLINE`** — the Slurm debug job timed out (4h limit). Restart +with `bash scripts/chrysalis_endpoint.sh restart`. + +**Worker probe timeout after `registered`** — the manager is connected, but a +Slurm worker did not return. Run `bash scripts/chrysalis_endpoint.sh logs` on +Chrysalis and inspect the latest submit script/log pair. + +**`pydantic_core` not found** — the worker is running from the wrong Python +environment. Re-run `bash scripts/chrysalis_endpoint.sh configure slurm-debug` +and restart the endpoint. -**`WorkerLost` or `SystemError`** — PYTHONPATH is set. Always `unset PYTHONPATH` before starting the endpoint. +**`libnetcdf.so.22`, `liblapack.so.3`, or `libblas.so.3` not found** — the YAC +runtime paths or local MKL shims are missing. Re-run +`bash scripts/chrysalis_endpoint.sh configure slurm-debug`, then +`bash scripts/chrysalis_endpoint.sh check-yac`. 
-**`pydantic_core` not found** — conda env leaked into the worker. Check `user_environment.yaml` has `PYTHONPATH: ""` and restart. +**`PMI_Init failed` or `WorkerLost` during YAC import** — YAC initializes MPI. +Inside a Globus Compute worker, run the YAC smoke/remap through the dedicated +smoke path, which launches the native YAC child process with +`srun --ntasks 1` when `SLURM_JOB_ID` is present. diff --git a/docs/release.md b/docs/release.md index d18bbd4..da69aa7 100644 --- a/docs/release.md +++ b/docs/release.md @@ -123,6 +123,13 @@ The conda package should install the core MCP server and CLI. HPC-specific Globus Compute dependencies can be added to the feedstock later if conda-forge availability and solver behavior are acceptable. +The seed recipe intentionally targets the **core** package only. Keep +`globus-compute-sdk` and `academy-py` out of the initial conda-forge recipe +until those dependencies and their transitive solver behavior are validated on +conda-forge. If conda-native HPC support becomes necessary, prefer a second +output such as `uxarray-mcp-hpc` or a feedstock variant rather than making every +local-only user solve the remote-execution stack. 
+ ## Privacy Check Before every release, verify endpoint UUIDs and local config did not re-enter diff --git a/scripts/chrysalis_endpoint.sh b/scripts/chrysalis_endpoint.sh index 5ff6590..6ee844c 100755 --- a/scripts/chrysalis_endpoint.sh +++ b/scripts/chrysalis_endpoint.sh @@ -8,11 +8,27 @@ set -euo pipefail USERNAME="jain" # your Chrysalis username # --------------------------------------------------------------------------- -ENDPOINT_NAME="${ENDPOINT_NAME:-chrysalis-uxarray}" +ENDPOINT_NAME="${ENDPOINT_NAME:-uxarray-chrysalis}" CONDA_ENV="$HOME/.conda/envs/uxarray-yac" -VENV_GC="$HOME/venvs/globus-compute" +VENV_GC="${VENV_GC:-$HOME/venvs/globus-compute-py313}" TMUX_SESSION="uxarray-endpoint" +YAC_SHIM_LIB="$HOME/local/yac-runtime-shims/lib" +YAC_SRC_PY="$HOME/src/yac/build/python" +YAC_SRC_CORE="$HOME/src/yac/build/src/core" +YAC_SRC_UTILS="$HOME/src/yac/build/src/utils" +YAC_LOCAL_PREFIX="$HOME/local/yac-3.17" +UXARRAY_YAC_SRC="/lcrc/group/e3sm/jain/uxarray-yac-src" +MKL_LIB="/gpfs/fs1/soft/chrysalis/spack-latest/opt/spack/linux-rhel8-x86_64/oneapi-2022.1.0/intel-oneapi-mkl-2022.1.0-iwhfz52/mkl/2022.1.0/lib/intel64" +MPICH_LIB="/gpfs/fs1/soft/chrysalis/spack-latest/opt/spack/linux-rhel8-x86_64/gcc-11.3.0/mpich-4.3.2-dp2ycaq/lib" +HWLOC_LIB="/gpfs/fs1/soft/chrysalis/spack-latest/opt/spack/linux-rhel8-x86_64/gcc-11.3.0/hwloc-2.12.2-5vqrpw7/lib" +YAKSA_LIB="/gpfs/fs1/soft/chrysalis/spack-latest/opt/spack/linux-rhel8-x86_64/gcc-11.3.0/yaksa-0.4-mejnfxw/lib" +LIBFABRIC_LIB="/gpfs/fs1/soft/chrysalis/spack-latest/opt/spack/linux-rhel8-x86_64/gcc-9.3.0/libfabric-1.16.1-vwdeh3y/lib" +GCC_LIB="/gpfs/fs1/soft/chrysalis/spack-latest/opt/spack/linux-rhel8-x86_64/gcc-8.5.0/gcc-11.3.0-jkpmtgq/lib64" +YAC_PYTHONPATH="$YAC_SRC_PY:$YAC_LOCAL_PREFIX/lib/python3.12/site-packages:$UXARRAY_YAC_SRC" +YAC_LD_LIBRARY_PATH="$YAC_SHIM_LIB:$MKL_LIB:$CONDA_ENV/lib:$YAC_SRC_CORE:$YAC_SRC_UTILS:$YAC_LOCAL_PREFIX/lib:$MPICH_LIB:$HWLOC_LIB:$YAKSA_LIB:$LIBFABRIC_LIB:$GCC_LIB" 
+YAC_WORKER_PATH="$CONDA_ENV/bin:$VENV_GC/bin:/usr/bin:/bin" + usage() { cat <<'EOF' Usage (run on a Chrysalis login node): @@ -20,6 +36,8 @@ Usage (run on a Chrysalis login node): chrysalis_endpoint.sh start Activate env + start endpoint in tmux chrysalis_endpoint.sh restart Stop running endpoint, then start fresh chrysalis_endpoint.sh status Show endpoint list + chrysalis_endpoint.sh check-yac Run a Slurm YAC import/remap smoke test + chrysalis_endpoint.sh logs Show endpoint and worker log hints Configure modes: single-host (default) LocalProvider — fine for quick probes, killed for real compute @@ -29,7 +47,7 @@ NOTE: Chrysalis login nodes kill processes that use significant CPU/memory. Use slurm-debug mode for any real UXarray analysis. Environment overrides: - ENDPOINT_NAME Globus Compute endpoint profile name (default: chrysalis-uxarray) + ENDPOINT_NAME Globus Compute endpoint profile name (default: uxarray-chrysalis) EOF } @@ -46,6 +64,19 @@ _activate_env() { export PATH="$VENV_GC/bin:$PATH" } +_configure_yac_runtime() { + mkdir -p "$YAC_SHIM_LIB" + ln -sfn "$MKL_LIB/libmkl_rt.so" "$YAC_SHIM_LIB/liblapack.so.3" + ln -sfn "$MKL_LIB/libmkl_rt.so" "$YAC_SHIM_LIB/libblas.so.3" +} + +_export_yac_runtime() { + _configure_yac_runtime + export PATH="$YAC_WORKER_PATH" + export PYTHONPATH="$YAC_PYTHONPATH" + export LD_LIBRARY_PATH="$YAC_LD_LIBRARY_PATH" +} + _check_endpoint_dir() { local ep_dir="$HOME/.globus_compute/$ENDPOINT_NAME" if [[ ! 
-d "$ep_dir" ]]; then @@ -69,8 +100,12 @@ _configure() { globus-compute-endpoint configure "$ENDPOINT_NAME" fi + _configure_yac_runtime + cat > "$ep_dir/user_environment.yaml" < Activating conda env: $CONDA_ENV" _activate_env + _export_yac_runtime echo "==> Python: $(python --version)" echo "==> uxarray: $(python -c 'import uxarray; print(uxarray.__version__)' 2>/dev/null || echo 'check import')" echo "==> Starting endpoint: $ENDPOINT_NAME" @@ -191,6 +229,86 @@ _status() { globus-compute-endpoint list } +_check_yac() { + _export_yac_runtime + local smoke + smoke="$(mktemp "${TMPDIR:-/tmp}/uxmcp-yac-smoke.XXXXXX.py")" + cat > "$smoke" <<'PY' +import json +import sys +import time +import traceback + +out = {"python": sys.version, "executable": sys.executable} +try: + import yac.core as yc + + out["yac_core_ok"] = True + out["yac_file"] = getattr(yc, "__file__", None) +except Exception as exc: + out["yac_core_ok"] = False + out["yac_core_error"] = f"{type(exc).__name__}: {exc}" + out["yac_traceback"] = traceback.format_exc() + +try: + import numpy as np + import uxarray as ux + import xarray as xr + from uxarray.remap.yac import _import_yac + + yc = _import_yac() + out["uxarray_helper_ok"] = True + out["uxarray_helper_file"] = getattr(yc, "__file__", None) + src = ux.Grid.from_healpix(zoom=2) + dst = ux.Grid.from_healpix(zoom=3) + rng = np.random.default_rng(0) + uxda = ux.UxDataArray( + xr.DataArray(rng.standard_normal(int(src.n_face)), dims=("n_face",), name="field"), + uxgrid=src, + ) + t0 = time.perf_counter() + remapped = uxda.remap.nearest_neighbor(destination_grid=dst, remap_to="face centers") + out["remap_ok"] = True + out["remap_seconds"] = round(time.perf_counter() - t0, 3) + out["remap_shape"] = list(remapped.shape) +except Exception as exc: + out["remap_ok"] = False + out["remap_error"] = f"{type(exc).__name__}: {exc}" + out["remap_traceback"] = traceback.format_exc() + +print(json.dumps(out, indent=2)) +raise SystemExit(0 if out.get("yac_core_ok") and 
out.get("remap_ok") else 1) +PY + echo "==> Running YAC smoke through Slurm" + srun --ntasks 1 bash -lc "PATH='$PATH' PYTHONPATH='$PYTHONPATH' LD_LIBRARY_PATH='$LD_LIBRARY_PATH' python '$smoke'" +} + +_logs() { + local ep_dir="$HOME/.globus_compute/$ENDPOINT_NAME" + echo "==> Endpoint profile: $ep_dir" + if [[ ! -d "$ep_dir" ]]; then + echo "Endpoint profile not found." + exit 1 + fi + echo + echo "==> Current endpoint config" + sed -n '1,220p' "$ep_dir/user_config_template.yaml.j2" 2>/dev/null || true + echo + echo "==> Current endpoint environment" + sed -n '1,120p' "$ep_dir/user_environment.yaml" 2>/dev/null || true + echo + echo "==> Running endpoint-related processes" + ps -fu "$USER" | grep -E 'globus|parsl|process_worker|interchange|uxarray-chrysalis' | grep -v grep || true + echo + echo "==> Latest submit scripts" + find "$HOME/.globus_compute" -path '*3cca8be6-55ec-4386-b7fd-f6c1e161d52b*/submit_scripts/*' \ + -type f -print 2>/dev/null | sort | tail -10 + echo + echo "==> Latest endpoint logs" + find "$HOME/.globus_compute" -path '*3cca8be6-55ec-4386-b7fd-f6c1e161d52b*' \ + \( -name '*.log' -o -name '*.err' -o -name '*.out' \) -type f -print 2>/dev/null | sort | tail -20 +} + # --------------------------------------------------------------------------- # Main # --------------------------------------------------------------------------- @@ -201,5 +319,7 @@ case "${1:-}" in _do_start) _do_start ;; # internal: invoked by tmux restart) _restart ;; status) _status ;; + check-yac) _check_yac ;; + logs) _logs ;; *) usage; exit 1 ;; esac diff --git a/src/uxarray_mcp/remote/compute_functions.py b/src/uxarray_mcp/remote/compute_functions.py index 76d0861..5570063 100644 --- a/src/uxarray_mcp/remote/compute_functions.py +++ b/src/uxarray_mcp/remote/compute_functions.py @@ -43,26 +43,12 @@ def remote_runtime_probe() -> Dict[str, Any]: yac_info["package_available"] = False yac_info["package_error"] = f"{type(exc).__name__}: {exc}" - try: - yac_core = 
importlib.import_module("yac.core") - yac_info["core_importable"] = True - yac_info["core_file"] = getattr(yac_core, "__file__", None) - yac_info["version"] = getattr(yac_core, "__version__", None) - except Exception as exc: - yac_info["core_importable"] = False - yac_info["core_import_error"] = f"{type(exc).__name__}: {exc}" - - try: - from uxarray.remap.yac import _import_yac - - yc = _import_yac() - yac_info["uxarray_helper_ok"] = True - yac_info["uxarray_helper_module"] = getattr(yc, "__name__", None) - yac_info["uxarray_helper_file"] = getattr(yc, "__file__", None) - yac_info["has_basicgrid"] = hasattr(yc, "BasicGrid") - except Exception as exc: - yac_info["uxarray_helper_ok"] = False - yac_info["uxarray_helper_error"] = f"{type(exc).__name__}: {exc}" + yac_info["core_importable"] = None + yac_info["uxarray_helper_ok"] = None + yac_info["native_import_check"] = ( + "skipped in remote_runtime_probe; use remote_yac_remap_smoke so " + "YAC/MPI imports run under srun and cannot kill the Globus worker" + ) modules["yac"] = yac_info return { @@ -830,141 +816,156 @@ def remote_subset_bbox_plot( def remote_yac_remap_smoke() -> Dict[str, Any]: """Smoke-test YAC's availability on the remote worker. - Loads YAC via uxarray's canonical helper (works around the broken upstream - ``yac/__init__.py`` that does ``from ._yac import *``), reports the YAC - surface, and runs a minimal nearest-neighbour remap from one HEALPix grid - to another using uxarray's YAC-backed remap path. Returns a dict that - includes both the static surface check and, if a remap was attempted, its - shape and timing. + Runs the native YAC import/remap in a worker-side subprocess. Some YAC/MPI + builds can terminate the importing process when their runtime library path + is incomplete; keeping the import in a child process lets the Globus worker + return structured diagnostics instead of disappearing as ``WorkerLost``. 
The function is self-contained and serialised via AllCodeStrategies so the Python 3.13/3.11 mismatch between local SDK and worker doesn't bite. """ - import importlib.metadata - import time - import traceback - - out: Dict[str, Any] = {} - - def _surface(mod): - return { - name: hasattr(mod, name) - for name in ( - "BasicGrid", - "InterpField", - "InterpolationStack", - "compute_weights", - "Reg2dGrid", - ) - } - - yc = None - try: - from uxarray.remap.yac import _import_yac - - yc = _import_yac() - out["yac_helper_ok"] = True - out["yac_loader"] = "uxarray.remap.yac._import_yac" - out["yac_module"] = getattr(yc, "__name__", None) - out["yac_file"] = getattr(yc, "__file__", None) - out["surface"] = _surface(yc) - except Exception as exc: - out["yac_helper_ok"] = False - out["yac_helper_error"] = f"{type(exc).__name__}: {exc}" + import json + import os + import re + import shutil + import subprocess + import sys + import textwrap + + code = r""" +import importlib.metadata +import json +import os +import sys +import time +import traceback + +out = { + "python": sys.version, + "executable": sys.executable, + "pythonpath_set": bool(os.environ.get("PYTHONPATH")), + "ld_library_path_set": bool(os.environ.get("LD_LIBRARY_PATH")), +} + +def _surface(mod): + return { + name: hasattr(mod, name) + for name in ( + "BasicGrid", + "InterpField", + "InterpolationStack", + "compute_weights", + "Reg2dGrid", + ) + } - # Fallback: load core.so directly via ExtensionFileLoader, bypassing - # YAC's broken __init__.py. Lets us report surface even when the - # site uxarray is unavailable or itself broken. 
- try: - import importlib.machinery as _m - import importlib.util as _u - import os as _os - import sys as _sys - import sysconfig as _sc - import types as _t - from pathlib import Path as _P - - search_roots = [_sc.get_paths()["purelib"]] - search_roots.extend(p for p in _sys.path if p) - for env_var in ("YAC_PREFIX", "YAC_ROOT"): - if _os.environ.get(env_var): - search_roots.append(_os.environ[env_var]) - home = _os.path.expanduser("~") - search_roots.append(_os.path.join(home, "yac")) - search_roots.append("/opt/yac") - - so_candidates: list[str] = [] - for root in search_roots: - rp = _P(root) - if not rp.exists(): - continue - so_candidates.extend(str(p) for p in rp.rglob("yac/core.cpython-*.so")) - so_candidates = list(dict.fromkeys(so_candidates)) # dedupe - if not so_candidates: - raise FileNotFoundError(f"No yac/core*.so found in: {search_roots[:6]}") - so = so_candidates[0] - out["yac_search_hits"] = so_candidates[:5] - - pkg = _t.ModuleType("yacshim") - pkg.__path__ = [] - _sys.modules["yacshim"] = pkg - loader = _m.ExtensionFileLoader("yacshim.core", so) - spec = _u.spec_from_loader("yacshim.core", loader) - assert spec is not None and spec.loader is not None - yc = _u.module_from_spec(spec) - spec.loader.exec_module(yc) - out["yac_loader"] = "direct_so" - out["yac_file"] = so - out["surface"] = _surface(yc) - except Exception as exc2: - out["yac_direct_load_ok"] = False - out["yac_direct_load_error"] = f"{type(exc2).__name__}: {exc2}" - out["yac_direct_load_traceback"] = traceback.format_exc() - return out +try: + import yac.core as yc + + out["yac_core_ok"] = True + out["yac_core_file"] = getattr(yc, "__file__", None) +except Exception as exc: + out["yac_core_ok"] = False + out["yac_core_error"] = f"{type(exc).__name__}: {exc}" + out["yac_core_traceback"] = traceback.format_exc() + +try: + from uxarray.remap.yac import _import_yac + + yc = _import_yac() + out["yac_helper_ok"] = True + out["yac_loader"] = "uxarray.remap.yac._import_yac" + 
out["yac_module"] = getattr(yc, "__name__", None) + out["yac_file"] = getattr(yc, "__file__", None) + out["surface"] = _surface(yc) +except Exception as exc: + out["yac_helper_ok"] = False + out["yac_helper_error"] = f"{type(exc).__name__}: {exc}" + out["yac_helper_traceback"] = traceback.format_exc() + +try: + out["uxarray_version"] = importlib.metadata.version("uxarray") +except Exception: + out["uxarray_version"] = "unknown" + +try: + import numpy as np + import uxarray as ux + import xarray as xr + + src = ux.Grid.from_healpix(zoom=2) + dst = ux.Grid.from_healpix(zoom=3) + out["src_n_face"] = int(src.n_face) + out["dst_n_face"] = int(dst.n_face) + + rng = np.random.default_rng(0) + face_data = rng.standard_normal(int(src.n_face)) + uxda = ux.UxDataArray( + xr.DataArray(face_data, dims=("n_face",), name="field"), + uxgrid=src, + ) - try: - out["uxarray_version"] = importlib.metadata.version("uxarray") - except Exception: - out["uxarray_version"] = "unknown" + t0 = time.perf_counter() + remapped = uxda.remap.nearest_neighbor( + destination_grid=dst, remap_to="face centers" + ) + out["remap_method"] = "nearest_neighbor" + out["remap_ok"] = True + out["remap_seconds"] = round(time.perf_counter() - t0, 3) + out["remap_dst_shape"] = list(remapped.shape) + out["remap_dst_mean"] = float(np.asarray(remapped).mean()) +except Exception as exc: + out["remap_ok"] = False + out["remap_error"] = f"{type(exc).__name__}: {exc}" + out["remap_traceback"] = traceback.format_exc() + +print(json.dumps(out)) +raise SystemExit(0 if out.get("yac_helper_ok") and out.get("remap_ok") else 1) +""" try: - import numpy as np - import uxarray as ux - import xarray as xr - - src = ux.Grid.from_healpix(zoom=2) - dst = ux.Grid.from_healpix(zoom=3) - out["src_n_face"] = int(src.n_face) - out["dst_n_face"] = int(dst.n_face) - - rng = np.random.default_rng(0) - face_data = rng.standard_normal(int(src.n_face)) - uxda = ux.UxDataArray( - xr.DataArray(face_data, dims=("n_face",), name="field"), - 
uxgrid=src, + command = [sys.executable, "-c", textwrap.dedent(code)] + launch_mode = "direct" + if os.environ.get("SLURM_JOB_ID") and shutil.which("srun"): + command = ["srun", "--ntasks", "1", *command] + launch_mode = "srun" + proc = subprocess.run( + command, + env=os.environ.copy(), + capture_output=True, + text=True, + timeout=240, + check=False, ) + except subprocess.TimeoutExpired as exc: + return { + "subprocess_ok": False, + "subprocess_timeout_seconds": exc.timeout, + "stdout": exc.stdout, + "stderr": exc.stderr, + } - t0 = time.perf_counter() + payload: Dict[str, Any] = { + "subprocess_ok": proc.returncode == 0, + "subprocess_returncode": proc.returncode, + "launch_mode": launch_mode, + } + stdout = proc.stdout.strip() + stderr = proc.stderr.strip() + if stdout: + payload["stdout_tail"] = stdout[-4000:] + if stderr: + payload["stderr_tail"] = stderr[-4000:] + for line in reversed(stdout.splitlines()): + line = re.sub(r"^\d+:\s*", "", line) try: - remapped = uxda.remap.nearest_neighbor( - destination_grid=dst, remap_to="face centers" - ) - out["remap_method"] = "nearest_neighbor" - out["remap_ok"] = True - out["remap_seconds"] = time.perf_counter() - t0 - out["remap_dst_shape"] = list(remapped.shape) - out["remap_dst_mean"] = float(np.asarray(remapped).mean()) - except Exception as exc: - out["remap_method"] = "nearest_neighbor" - out["remap_ok"] = False - out["remap_error"] = f"{type(exc).__name__}: {exc}" - out["remap_traceback"] = traceback.format_exc() - except Exception as exc: - out["remap_setup_ok"] = False - out["remap_setup_error"] = f"{type(exc).__name__}: {exc}" - out["remap_setup_traceback"] = traceback.format_exc() - - return out + payload.update(json.loads(line)) + break + except Exception: + continue + else: + payload["json_parse_error"] = "No JSON object found in subprocess stdout." 
+ return payload # --------------------------------------------------------------------------- diff --git a/src/uxarray_mcp/remote/health.py b/src/uxarray_mcp/remote/health.py index 01fe07c..57530e8 100644 --- a/src/uxarray_mcp/remote/health.py +++ b/src/uxarray_mcp/remote/health.py @@ -71,6 +71,16 @@ } +def _is_expected_yac_pythonpath(pythonpath: str) -> bool: + """Return True when PYTHONPATH only exposes endpoint-side YAC runtime paths.""" + parts = [part for part in pythonpath.split(":") if part] + if not parts: + return False + return all( + "yac" in part.lower() and "/.conda/envs/" not in part.lower() for part in parts + ) + + def _translate_globus_status(raw: str) -> str: """Map a raw Globus status string to our vocabulary.""" return _GLOBUS_TO_STATUS.get(raw.lower(), "unreachable") @@ -296,6 +306,8 @@ def _worker_probe() -> dict: result = fut.result(timeout=timeout_seconds) elapsed = round(time.monotonic() - t0, 1) + pythonpath = result.get("pythonpath") or "" + yac_pythonpath = _is_expected_yac_pythonpath(pythonpath) payload = { "status": "active", **_endpoint_public_fields(config), @@ -303,15 +315,17 @@ def _worker_probe() -> dict: "python": result.get("python", ""), "slurm_job_id": result.get("slurm_job_id") or None, "pbs_job_id": result.get("pbs_job_id") or None, - "pythonpath_set": bool(result.get("pythonpath")), + "pythonpath_set": bool(pythonpath), + "pythonpath_expected_yac_runtime": yac_pythonpath, "elapsed_seconds": elapsed, } - # Warn if PYTHONPATH is set — this is the root cause of most worker crashes - if result.get("pythonpath"): + # Warn on arbitrary PYTHONPATH leaks, but allow endpoint-side YAC paths. + if pythonpath and not yac_pythonpath: payload["warning"] = ( "PYTHONPATH is set on the worker. This can cause pydantic/dill " "conflicts. Add 'unset PYTHONPATH' to worker_init in the endpoint " - "config, and set PYTHONPATH: '' in user_environment.yaml." 
+ "config, and only set narrow runtime paths such as the YAC " + "Python bindings when they are required." ) return payload diff --git a/tests/test_hpc_safety.py b/tests/test_hpc_safety.py index 61f21e3..5a5f38f 100644 --- a/tests/test_hpc_safety.py +++ b/tests/test_hpc_safety.py @@ -6,11 +6,14 @@ """ import importlib.util +import json +import subprocess from unittest.mock import MagicMock, patch import pytest from uxarray_mcp.remote import health +from uxarray_mcp.remote.compute_functions import remote_yac_remap_smoke from uxarray_mcp.remote.config import HPCConfig from uxarray_mcp.remote.health import check_endpoint_health from uxarray_mcp.tools.inspection import validate_dataset @@ -90,6 +93,42 @@ def test_globus_offline_maps_to_offline(self): assert result["status"] == "offline" + def test_yac_pythonpath_is_expected_runtime_path(self): + """Endpoint-side YAC source/runtime paths are not a worker leak.""" + pythonpath = ( + "/home/jain/src/yac/build/python:" + "/home/jain/local/yac-3.17/lib/python3.12/site-packages:" + "/lcrc/group/e3sm/jain/uxarray-yac-src" + ) + + assert health._is_expected_yac_pythonpath(pythonpath) is True + + def test_conda_env_pythonpath_is_not_expected_yac_runtime_path(self): + """A broad conda env site-packages path can still leak pydantic/dill.""" + pythonpath = "/home/jain/.conda/envs/uxarray-yac/lib/python3.12/site-packages" + + assert health._is_expected_yac_pythonpath(pythonpath) is False + + def test_remote_yac_smoke_parses_subprocess_payload(self, monkeypatch): + """YAC smoke returns structured output from the worker-side subprocess.""" + payload = { + "yac_helper_ok": True, + "remap_ok": True, + "remap_dst_shape": [768], + } + + def fake_run(*args, **kwargs): + return subprocess.CompletedProcess(args, 0, f"0: {json.dumps(payload)}", "") + + monkeypatch.setattr(subprocess, "run", fake_run) + + result = remote_yac_remap_smoke() + + assert result["subprocess_ok"] is True + assert result["subprocess_returncode"] == 0 + assert 
result["yac_helper_ok"] is True + assert result["remap_dst_shape"] == [768] + # ----------------------------------------------------------------------------- # Unit Tests (Mocked) — _endpoint_is_ready