5 changes: 3 additions & 2 deletions CHANGELOG.md
@@ -23,8 +23,9 @@ under their entry.
`endpoints add/list/remove`, `doctor`, `install-claude`.
- Multi-endpoint config schema (`hpc.endpoints.<name>` with
`endpoint_id`, `path_prefixes`, `timeout_seconds`) and config
-discovery order: `$UXARRAY_MCP_CONFIG` → `~/.config/uxarray-mcp/config.yaml`
-→ `./config.yaml`.
+discovery order: `$UXARRAY_MCP_CONFIG` → `./config.yaml` in the current
+working directory → `~/.config/uxarray-mcp/config.yaml` → editable-install
+repo fallback.
- YAC remote build script (`scripts/hpc_build_yac.py`) and runtime
fallback that loads `yac.core` via `importlib.machinery` when the
upstream `__init__.py` is unconditional.
5 changes: 4 additions & 1 deletion README.md
@@ -82,7 +82,10 @@ The ``uxarray-mcp`` CLI exposes:
| ``install-claude`` | print or merge the Claude Desktop ``mcpServers`` block |

Config is discovered in this order: ``$UXARRAY_MCP_CONFIG`` →
-``~/.config/uxarray-mcp/config.yaml`` → ``./config.yaml`` (repo root).
+``./config.yaml`` in the current working directory →
+``~/.config/uxarray-mcp/config.yaml`` → the editable-install repo config
+fallback. The project-local file wins inside a checkout so development
+endpoints are not shadowed by an empty user config.
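
The updated discovery order can be illustrated with a minimal sketch. The function names, the `repo_fallback` parameter, and the choice to fall through when `$UXARRAY_MCP_CONFIG` points at a missing file are all assumptions made for illustration; the diff does not show the project's actual loader.

```python
from pathlib import Path
from typing import List, Mapping, Optional


def candidate_paths(
    env: Mapping[str, str],
    cwd: Path,
    home: Path,
    repo_fallback: Optional[Path] = None,
) -> List[Path]:
    """Return config candidates in the documented precedence order."""
    candidates: List[Path] = []
    explicit = env.get("UXARRAY_MCP_CONFIG")
    if explicit:
        # 1. Explicit override from the environment.
        candidates.append(Path(explicit))
    # 2. Project-local file in the current working directory.
    candidates.append(cwd / "config.yaml")
    # 3. Per-user config.
    candidates.append(home / ".config" / "uxarray-mcp" / "config.yaml")
    if repo_fallback is not None:
        # 4. Editable-install repo fallback (hypothetical parameter).
        candidates.append(repo_fallback)
    return candidates


def discover_config(
    env: Mapping[str, str],
    cwd: Path,
    home: Path,
    repo_fallback: Optional[Path] = None,
) -> Optional[Path]:
    """Return the first candidate that exists as a file, or None."""
    for path in candidate_paths(env, cwd, home, repo_fallback):
        if path.is_file():
            return path
    return None
```

Under these assumptions, a checkout that contains `./config.yaml` resolves to that file even when a user-level config also exists, which is the behavior the new README text describes.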

## Most Users Should Read These in Order

3 changes: 3 additions & 0 deletions conda/recipe/meta.yaml
@@ -23,6 +23,9 @@ requirements:
- pip
- uv-build >=0.9.26,<0.10.0
run:
+# Core package only for the initial feedstock. Add optional HPC support as
+# a second output/variant after globus-compute-sdk and academy-py solver
+# behavior is validated on conda-forge.
- python >={{ python_min }}
- fastmcp >=3.4.0
- holoviews >=1.19.0
34 changes: 17 additions & 17 deletions docs/architecture.html
@@ -100,7 +100,7 @@
<div class="logo-box">⬡</div>
<div class="hdr-text">
<h1>UXarray MCP Server — Architecture Diagram</h1>
-<p>Mesh-aware assistant · provenance on every output · local or Argonne Improv HPC via Globus Compute · dynamic tool registration</p>
+<p>Mesh-aware assistant · provenance on every output · local or named HPC endpoints via Globus Compute · unified tool surface</p>
</div>
<div class="branch">
<svg width="12" height="12" viewBox="0 0 16 16" fill="#58a6ff"><path d="M11.75 2.5a.75.75 0 1 0 0 1.5.75.75 0 0 0 0-1.5zm-2.25.75a2.25 2.25 0 1 1 3 2.122V6A2.5 2.5 0 0 1 10 8.5H6a1 1 0 0 0-1 1v1.128a2.251 2.251 0 1 1-1.5 0V5.372a2.25 2.25 0 1 1 1.5 0v1.836A2.492 2.492 0 0 1 6 7h4a1 1 0 0 0 1-1v-.628A2.25 2.25 0 0 1 9.5 3.25z"/></svg>
@@ -205,7 +205,7 @@ <h1>UXarray MCP Server — Architecture Diagram</h1>
<text x="540" y="96" text-anchor="middle" font-size="15" font-weight="700" fill="#58a6ff" font-family="Inter,sans-serif">Receive tool call</text>
<line x1="400" y1="103" x2="682" y2="103" stroke="#112238" stroke-width="1"/>
<text x="540" y="121" text-anchor="middle" font-size="13" fill="#8b949e" font-family="Inter,sans-serif">run_scientific_agent(path)</text>
-<text x="540" y="139" text-anchor="middle" font-size="11.5" fill="#484f58" font-family="Inter,sans-serif">HPC tools only if endpoint configured</text>
+<text x="540" y="139" text-anchor="middle" font-size="11.5" fill="#484f58" font-family="Inter,sans-serif">Remote execution selected per call</text>

<!-- Agent: Begin analysis -->
<rect x="747" y="76" width="306" height="74" rx="8" fill="#0d0a20" stroke="#6e40c9" stroke-width="2"/>
@@ -261,13 +261,13 @@ <h1>UXarray MCP Server — Architecture Diagram</h1>
<text x="1260" y="383" text-anchor="middle" font-size="11.5" fill="#8b949e" font-family="Inter,sans-serif">MPAS · UGRID · SCRIP · HEALPix</text>
<text x="1260" y="398" text-anchor="middle" font-size="11" fill="#2ea043" font-family="Inter,sans-serif" font-style="italic">→ n_face · n_node · n_edge · format</text>

-<!-- HPC: inspect_mesh_hpc() -->
+<!-- HPC: inspect_mesh(..., use_remote=True) -->
<rect x="1467" y="320" width="306" height="86" rx="8" fill="#180e00" stroke="#d29922" stroke-width="2"/>
-<text x="1620" y="342" text-anchor="middle" font-size="15" font-weight="700" fill="#e3b341" font-family="Inter,sans-serif">inspect_mesh_hpc( )</text>
+<text x="1620" y="342" text-anchor="middle" font-size="15" font-weight="700" fill="#e3b341" font-family="Inter,sans-serif">inspect_mesh(use_remote)</text>
<line x1="1480" y1="350" x2="1762" y2="350" stroke="#201400" stroke-width="1"/>
<text x="1620" y="362" text-anchor="middle" font-size="11.5" fill="#39d0d0" font-family="Inter,sans-serif" font-style="italic">✓ _endpoint_is_ready( ) pre-flight</text>
<text x="1620" y="377" text-anchor="middle" font-size="13" fill="#8b949e" font-family="Inter,sans-serif">remote/compute_functions.py</text>
-<text x="1620" y="392" text-anchor="middle" font-size="11.5" fill="#8b949e" font-family="Inter,sans-serif">Globus Executor → Improv cluster</text>
+<text x="1620" y="392" text-anchor="middle" font-size="11.5" fill="#8b949e" font-family="Inter,sans-serif">Globus Executor → named endpoint</text>
<text x="1620" y="406" text-anchor="middle" font-size="11" fill="#d29922" font-family="Inter,sans-serif" font-style="italic">File stays on HPC · timeout 300 s</text>

<!-- validate_dataset box -->
@@ -348,13 +348,13 @@ <h1>UXarray MCP Server — Architecture Diagram</h1>
<text x="1260" y="772" text-anchor="middle" font-size="11.5" fill="#2ea043" font-family="Inter,sans-serif" font-style="italic">Runs entirely on local machine</text>
<text x="1260" y="787" text-anchor="middle" font-size="11" fill="#484f58" font-family="Inter,sans-serif">_provenance: venue=local</text>

-<!-- HPC: calc_area_hpc + zonal_mean_hpc -->
+<!-- HPC: calculate_area/calculate_zonal_mean with use_remote=True -->
<rect x="1467" y="688" width="306" height="108" rx="8" fill="#180e00" stroke="#d29922" stroke-width="2"/>
<text x="1620" y="710" text-anchor="middle" font-size="15" font-weight="700" fill="#e3b341" font-family="Inter,sans-serif">HPC Execution</text>
<line x1="1480" y1="718" x2="1762" y2="718" stroke="#201400" stroke-width="1"/>
<text x="1620" y="733" text-anchor="middle" font-size="11.5" fill="#39d0d0" font-family="Inter,sans-serif" font-style="italic">✓ _endpoint_is_ready( ) pre-flight</text>
-<text x="1620" y="749" text-anchor="middle" font-size="13" fill="#8b949e" font-family="Inter,sans-serif">calculate_area_hpc( )</text>
-<text x="1620" y="765" text-anchor="middle" font-size="13" fill="#8b949e" font-family="Inter,sans-serif">calculate_zonal_mean_hpc( )</text>
+<text x="1620" y="749" text-anchor="middle" font-size="13" fill="#8b949e" font-family="Inter,sans-serif">calculate_area(use_remote)</text>
+<text x="1620" y="765" text-anchor="middle" font-size="13" fill="#8b949e" font-family="Inter,sans-serif">calculate_zonal_mean(use_remote)</text>
<line x1="1480" y1="776" x2="1762" y2="776" stroke="#3d1400" stroke-width="1" stroke-dasharray="4,3"/>
<text x="1620" y="790" text-anchor="middle" font-size="12" fill="#f85149" font-family="Inter,sans-serif" font-style="italic">⚠ auto-fallback if endpoint unreachable</text>

@@ -446,15 +446,15 @@ <h1>UXarray MCP Server — Architecture Diagram</h1>
Flow runs <strong>top → bottom</strong>, arrows crossing lanes are handoffs.
<strong style="color:#e3b341">Orange diamonds</strong> = routing decisions made automatically by the agent.
<strong style="color:#56d364">Green boxes</strong> = compute runs on your local machine.
-<strong style="color:#e3b341">Orange boxes</strong> = dispatched to Argonne Improv via Globus Compute — the file never leaves the cluster.
-<strong style="color:#39d0d0">Teal dashed boxes</strong> = steps that only activate under certain conditions (data_path provided, endpoint configured).
+<strong style="color:#e3b341">Orange boxes</strong> = dispatched to a named HPC endpoint via Globus Compute — the file never leaves the cluster.
+<strong style="color:#39d0d0">Teal dashed boxes</strong> = steps that only activate under certain conditions (data_path provided, remote mode requested).
<strong style="color:#f85149">Red dashed</strong> inside HPC boxes = automatic local fallback when the endpoint is unreachable.
-<strong>Dynamic registration:</strong> HPC tools (rows 10–13) only appear in Claude's tool list when <code>endpoint_id</code> is set in <code>config.yaml</code>.
+<strong>Unified registration:</strong> tools are registered once; remote execution is selected with <code>use_remote=True</code> and optional endpoint names.
</div>

<!-- Tool table -->
<div class="card">
-<div class="card-h">All MCP Tools &nbsp;<span style="font-weight:400;color:#6e7681;font-size:12px">— 9 always registered · 4 conditional on endpoint_id</span></div>
+<div class="card-h">Representative MCP Tools &nbsp;<span style="font-weight:400;color:#6e7681;font-size:12px">— unified local / remote surface</span></div>
<table>
<thead>
<tr><th>#</th><th>Tool Name</th><th>Source File</th><th>Runs On</th><th>Domain Module</th><th>What It Does</th></tr>
@@ -469,14 +469,14 @@ <h1>UXarray MCP Server — Architecture Diagram</h1>
<tr><td>7</td><td><code>get_execution_mode</code></td><td><code>execution_control.py</code></td><td><span class="tag tl">LOCAL</span></td><td>—</td><td>Returns current execution mode (local / hpc / auto) and whether an HPC endpoint is configured.</td></tr>
<tr><td>8</td><td><code>set_execution_mode</code></td><td><code>execution_control.py</code></td><td><span class="tag tl">LOCAL</span></td><td>—</td><td>Switch execution mode from the Claude UI without editing config.yaml directly.</td></tr>
<tr><td>9</td><td><code>run_scientific_agent</code></td><td><code>scientific_agent.py</code></td><td><span class="tag ta">AUTO</span></td><td>All 4 modules</td><td>Autonomous 4-stage pipeline: Analyze → Plan → Execute → Verify. Validation-gated. Returns full reasoning trace + provenance + artifacts.</td></tr>
-<tr><td>10</td><td><code>inspect_mesh_hpc</code></td><td><code>remote_tools.py</code></td><td><span class="tag th">HPC*</span></td><td><code>mesh.py</code> on HPC</td><td>Mesh inspection via Globus Compute on Improv. Pre-flight health check. Auto-fallback to local.</td></tr>
-<tr><td>11</td><td><code>calculate_area_hpc</code></td><td><code>remote_tools.py</code></td><td><span class="tag th">HPC*</span></td><td><code>area.py</code> on HPC</td><td>Face area calculation via Globus Compute. Pre-flight health check. Auto-fallback to local.</td></tr>
-<tr><td>12</td><td><code>inspect_variable_hpc</code></td><td><code>remote_tools.py</code></td><td><span class="tag th">HPC*</span></td><td><code>variable.py</code> on HPC</td><td>Variable inspection via Globus Compute. Pre-flight health check. Auto-fallback to local.</td></tr>
-<tr><td>13</td><td><code>calculate_zonal_mean_hpc</code></td><td><code>remote_tools.py</code></td><td><span class="tag th">HPC*</span></td><td><code>zonal.py</code> on HPC</td><td>Zonal mean via Globus Compute. Pre-flight health check. Auto-fallback to local.</td></tr>
+<tr><td>10</td><td><code>inspect_mesh</code></td><td><code>remote_tools.py</code></td><td><span class="tag th">HPC*</span></td><td><code>mesh.py</code> on HPC</td><td>Mesh inspection with <code>use_remote=True</code>. Pre-flight health check. Auto-fallback to local.</td></tr>
+<tr><td>11</td><td><code>calculate_area</code></td><td><code>remote_tools.py</code></td><td><span class="tag th">HPC*</span></td><td><code>area.py</code> on HPC</td><td>Face area calculation with <code>use_remote=True</code>. Pre-flight health check. Auto-fallback to local.</td></tr>
+<tr><td>12</td><td><code>inspect_variable</code></td><td><code>remote_tools.py</code></td><td><span class="tag th">HPC*</span></td><td><code>variable.py</code> on HPC</td><td>Variable inspection with <code>use_remote=True</code>. Pre-flight health check. Auto-fallback to local.</td></tr>
+<tr><td>13</td><td><code>calculate_zonal_mean</code></td><td><code>remote_tools.py</code></td><td><span class="tag th">HPC*</span></td><td><code>zonal.py</code> on HPC</td><td>Zonal mean with <code>use_remote=True</code>. Pre-flight health check. Auto-fallback to local.</td></tr>
</tbody>
</table>
<div style="padding:10px 15px;font-size:11.5px;color:#6e7681;border-top:1px solid #21262d;">
-* HPC tools are only registered when <code>endpoint_id</code> is set in <code>config.yaml</code>. Without an endpoint, Claude's tool list shows 9 tools.
+* Remote execution is selected per call with <code>use_remote=True</code>. Endpoint readiness controls remote dispatch and fallback.
</div>
</div>

15 changes: 15 additions & 0 deletions docs/architecture.md
@@ -114,6 +114,21 @@ of exposing separate HPC-only tool names.

**Validation gating** — The scientific agent runs dataset validation before zonal mean. If validation fails, zonal mean is skipped rather than producing unreliable results.
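
The gating rule above can be sketched as a small wrapper. This is an illustration only: `run_gated`, the `ok` key, and the shape of the returned dictionary are hypothetical and are not taken from `scientific_agent.py`.

```python
from typing import Any, Callable, Dict


def run_gated(
    validate: Callable[[], Dict[str, Any]],
    compute: Callable[[], Any],
) -> Dict[str, Any]:
    """Run `compute` only if `validate` reports success.

    `validate` is assumed to return a mapping with a boolean "ok" key.
    """
    report = validate()
    if not report.get("ok", False):
        # Skip the expensive step instead of producing unreliable output.
        return {"status": "skipped", "reason": "validation failed", "validation": report}
    return {"status": "ok", "validation": report, "result": compute()}
```

Returning the validation report in both branches keeps the reason for a skipped step visible to the caller, which fits the provenance-on-every-output goal stated in the diagram header.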

+## Maintenance Notes
+
+The current implementation favors a small number of broad tool modules while
+the MCP surface is still evolving. That keeps related behavior easy to audit
+for the first release, but the largest files should be split once the public
+contracts settle:
+
+- `remote/compute_functions.py` should be divided by remote capability family
+(inspection, plotting, vector calculus, remapping, diagnostics).
+- `tools/advanced.py` should be divided into spatial, comparison, remapping,
+temporal/ensemble, and export modules.
+
+Keep those refactors behavior-preserving and test-backed; they are polish and
+maintainability work, not blockers for the core conda package.
+
## Interactive Diagram

An interactive architecture diagram is available at `docs/architecture.html` in the repository.
94 changes: 47 additions & 47 deletions docs/chrysalis.md
@@ -19,69 +19,44 @@ which hosts the E3SM next-generation mesh library.
are sent as source code via `AllCodeStrategies` and only need `uxarray` + deps
in the worker environment.
- Login nodes **kill compute processes** — always use the Slurm backend.
-- `unset PYTHONPATH` before every endpoint start — the conda `uxarray-yac` env
-injects broken `pydantic_core` paths that crash workers.
+- YAC remapping needs the Python 3.12 `uxarray-yac` environment plus YAC, MKL,
+MPICH, NetCDF, and local shim library paths. Use
+`scripts/chrysalis_endpoint.sh` instead of hand-writing those paths.
+- If a remote probe times out after the endpoint is `registered`, inspect the
+endpoint logs on Chrysalis with `scripts/chrysalis_endpoint.sh logs`.

## Worker Environment

| Item | Value |
|---|---|
-| Venv | `~/venvs/globus-compute-py313` (Python 3.13) |
+| UXarray/YAC env | `~/.conda/envs/uxarray-yac` (Python 3.12) |
+| Endpoint helper venv | `~/venvs/globus-compute-py313` |
| Slurm partition | `debug` (4h walltime, 20 nodes) |
| Compute nodes | 251 GB RAM, 128 CPUs |
| Endpoint name | `uxarray-chrysalis` |

## First-Time Setup

+The checked-in helper script writes the endpoint profile, YAC runtime library
+paths, and small BLAS/LAPACK shims needed by the current YAC build:
+
```bash
-# 1. Create Python 3.13 conda env
-/gpfs/fs1/soft/chrysalis/manual/miniforge3/25.3.1/bin/conda create \
-  -n gc-py313 python=3.13 -y
-
-# 2. Build the globus-compute venv
-~/.conda/envs/gc-py313/bin/python -m venv ~/venvs/globus-compute-py313
-~/venvs/globus-compute-py313/bin/pip install \
-  "globus-compute-endpoint==4.12.0" \
-  uxarray xarray netCDF4 h5netcdf numpy matplotlib holoviews cartopy
-
-# 3. Configure the endpoint (Slurm-backed)
-unset PYTHONPATH
-~/venvs/globus-compute-py313/bin/globus-compute-endpoint configure uxarray-chrysalis
-
-cat > ~/.globus_compute/uxarray-chrysalis/user_config_template.yaml.j2 << 'EOF'
-endpoint_setup: ""
-engine:
-  type: GlobusComputeEngine
-  max_workers_per_node: 4
-  provider:
-    type: SlurmProvider
-    partition: debug
-    nodes_per_block: 1
-    init_blocks: 0
-    min_blocks: 0
-    max_blocks: 2
-    walltime: "04:00:00"
-    worker_init: |
-      unset PYTHONPATH
-    launcher:
-      type: SrunLauncher
-idle_heartbeats_soft: 10
-idle_heartbeats_hard: 5760
-EOF
-
-cat > ~/.globus_compute/uxarray-chrysalis/user_environment.yaml << 'EOF'
-PYTHONPATH: ""
-PATH: "/home/<username>/venvs/globus-compute-py313/bin:/usr/bin:/bin"
-EOF
+git clone https://github.com/UXARRAY/uxarray-mcp-server.git
+cd uxarray-mcp-server
+bash scripts/chrysalis_endpoint.sh configure slurm-debug
+bash scripts/chrysalis_endpoint.sh check-yac
```

+The `check-yac` command runs a tiny Slurm job that imports `yac.core`, imports
+UXarray's YAC helper, and remaps HEALPix zoom 2 to zoom 3. It should report
+`yac_core_ok: true` and `remap_ok: true` before the endpoint is used by MCP.
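
As a rough sketch of how a caller might gate on that report, assuming the two flags have already been parsed into a mapping (how `check-yac` actually emits its report is not shown in this diff, and `yac_ready` is a hypothetical helper):

```python
from typing import Any, Mapping

# Flag names taken from the documentation text above.
REQUIRED_FLAGS = ("yac_core_ok", "remap_ok")


def yac_ready(report: Mapping[str, Any]) -> bool:
    """Return True only when every required flag is exactly True."""
    return all(report.get(flag) is True for flag in REQUIRED_FLAGS)
```

Requiring an exact `True` means a missing key or a string such as `"true"` counts as not ready, which errs on the side of not dispatching remap work to an unverified endpoint.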

## Starting the Endpoint

Run this every time you log in:

```bash
-unset PYTHONPATH
-~/venvs/globus-compute-py313/bin/globus-compute-endpoint start uxarray-chrysalis
+bash scripts/chrysalis_endpoint.sh start
```

The endpoint prints its UUID. Add it to your private local config on your
@@ -98,6 +73,8 @@ From your laptop after the endpoint is running:

```bash
uv run python scripts/hpc_doctor.py --endpoint chrysalis --timeout-seconds 120
+uv run --extra hpc python scripts/yac_smoke_test.py \
+  --endpoint chrysalis --timeout-seconds 300
```

Or manually:
@@ -110,6 +87,14 @@ print(validate_hpc_setup(endpoint='chrysalis', run_remote_probe=True,
probe_timeout_seconds=120))
```

+If the manager reports `registered` but worker probes time out, inspect the
+remote side on Chrysalis:
+
+```bash
+bash scripts/chrysalis_endpoint.sh logs
+squeue -u "$USER"
+```
+
## E3SM Next-Generation Ocean Meshes

Available at `/lcrc/group/e3sm/ac.xylar/polaris_1.0/chrysalis/test_20260520/unified-mesh-topo-cull2/`:
@@ -123,8 +108,23 @@ Available at `/lcrc/group/e3sm/ac.xylar/polaris_1.0/chrysalis/test_20260520/unif

## Troubleshooting

-**`ENDPOINT_NOT_ONLINE`** — the Slurm debug job timed out (4h limit). Restart with `unset PYTHONPATH && ~/venvs/globus-compute-py313/bin/globus-compute-endpoint start uxarray-chrysalis`.
+**`ENDPOINT_NOT_ONLINE`** — the Slurm debug job timed out (4h limit). Restart
+with `bash scripts/chrysalis_endpoint.sh restart`.

+**Worker probe timeout after `registered`** — the manager is connected, but a
+Slurm worker did not return. Run `bash scripts/chrysalis_endpoint.sh logs` on
+Chrysalis and inspect the latest submit script/log pair.
+
+**`pydantic_core` not found** — the worker is running from the wrong Python
+environment. Re-run `bash scripts/chrysalis_endpoint.sh configure slurm-debug`
+and restart the endpoint.
+
-**`WorkerLost` or `SystemError`** — PYTHONPATH is set. Always `unset PYTHONPATH` before starting the endpoint.
+**`libnetcdf.so.22`, `liblapack.so.3`, or `libblas.so.3` not found** — the YAC
+runtime paths or local MKL shims are missing. Re-run
+`bash scripts/chrysalis_endpoint.sh configure slurm-debug`, then
+`bash scripts/chrysalis_endpoint.sh check-yac`.

-**`pydantic_core` not found** — conda env leaked into the worker. Check `user_environment.yaml` has `PYTHONPATH: ""` and restart.
+**`PMI_Init failed` or `WorkerLost` during YAC import** — YAC initializes MPI.
+Inside a Globus Compute worker, run the YAC smoke/remap through the dedicated
+smoke path, which launches the native YAC child process with
+`srun --ntasks 1` when `SLURM_JOB_ID` is present.
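
The launch rule in the last entry might look like the following sketch. `yac_child_command` is a hypothetical helper and the example child command is illustrative; the real smoke path lives in the repository scripts and is not shown in this diff.

```python
from typing import List, Mapping, Sequence


def yac_child_command(base_cmd: Sequence[str], env: Mapping[str, str]) -> List[str]:
    """Prefix the child command with `srun --ntasks 1` inside a Slurm job.

    Outside a Slurm allocation (no SLURM_JOB_ID) the command is returned
    unchanged.
    """
    if env.get("SLURM_JOB_ID"):
        return ["srun", "--ntasks", "1", *base_cmd]
    return list(base_cmd)
```

A caller could pass the result to `subprocess.run(...)` with `os.environ` as `env`, so the same code path works both on a laptop and inside a Globus Compute worker running under Slurm.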
7 changes: 7 additions & 0 deletions docs/release.md
@@ -123,6 +123,13 @@ The conda package should install the core MCP server and CLI. HPC-specific
Globus Compute dependencies can be added to the feedstock later if conda-forge
availability and solver behavior are acceptable.

+The seed recipe intentionally targets the **core** package only. Keep
+`globus-compute-sdk` and `academy-py` out of the initial conda-forge recipe
+until those dependencies and their transitive solver behavior are validated on
+conda-forge. If conda-native HPC support becomes necessary, prefer a second
+output such as `uxarray-mcp-hpc` or a feedstock variant rather than making every
+local-only user solve the remote-execution stack.
+
## Privacy Check

Before every release, verify endpoint UUIDs and local config did not re-enter