From 4bf825329b176d08a2b3f6e97ec0e36624aa068e Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sun, 12 Apr 2026 02:35:06 -0700 Subject: [PATCH 1/8] Polish eval skills Signed-off-by: Zhiyu Cheng --- .claude/skills/common/remote-execution.md | 11 +++++++++++ .claude/skills/common/slurm-setup.md | 14 ++++++++++++++ .claude/skills/evaluation/SKILL.md | 13 +++++++++++++ 3 files changed, 38 insertions(+) diff --git a/.claude/skills/common/remote-execution.md b/.claude/skills/common/remote-execution.md index 7c99a5c2a9..2e538fa466 100644 --- a/.claude/skills/common/remote-execution.md +++ b/.claude/skills/common/remote-execution.md @@ -28,6 +28,17 @@ clusters: default_cluster: my-cluster ``` +### Checkpoint and storage availability + +Cluster compute nodes may not share the same filesystem as login nodes or other clusters. Before running any workload that references a checkpoint path, verify the path is accessible from compute nodes: + +| Cluster type | Compute-node storage | NOT accessible from compute nodes | +|-------------|---------------------|----------------------------------| +| JET clusters (oci-hsg, cw, oci-nrt) | `/lustre/fsw/...` | Workstation NFS (`/home/scratch.*`), other cluster mounts | +| dlcluster | `/home/omniml_data_*`, `/home/scratch.*` | `/lustre/` paths | + +If a checkpoint was produced on a different cluster or workstation, copy it to the target cluster's accessible storage before submitting jobs. NEL and SLURM do NOT sync checkpoints automatically. + See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types. 
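The compute-node check above can be scripted rather than eyeballed; a minimal sketch, assuming a `check_path` helper, a placeholder partition name, and an illustrative checkpoint path (none of these are part of the skill files):

```shell
# Existence check to run where it matters: on a compute node, not the login node.
check_path() {
  if [ -e "$1" ]; then echo "OK: $1"; else echo "MISSING: $1"; fi
}

# On the login node this only proves the LOGIN node sees the path:
check_path /tmp

# The authoritative check goes through srun so it runs on a compute node:
# srun --partition=batch --time=00:02:00 \
#   bash -c "$(declare -f check_path); check_path /lustre/fsw/portfolios/coreai/users/$USER/checkpoints/my-model"
```

A `MISSING` result from the compute node means the checkpoint has to be copied to cluster-accessible storage before any job is submitted.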
--- diff --git a/.claude/skills/common/slurm-setup.md b/.claude/skills/common/slurm-setup.md index 37b9fbd56a..f26731d883 100644 --- a/.claude/skills/common/slurm-setup.md +++ b/.claude/skills/common/slurm-setup.md @@ -51,6 +51,20 @@ srun \ " ``` +### Container registry credentials (pyxis) + +If `srun --container-image` uses an image from a private registry (e.g., `nvcr.io/nvidia/...`), pyxis/enroot needs credentials on the cluster. Check for existing credentials and add if missing: + +```bash +cat ~/.config/enroot/.credentials 2>/dev/null || echo "No credentials" +# To add NGC credentials: +mkdir -p ~/.config/enroot +echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials +chmod 600 ~/.config/enroot/.credentials +``` + +Without this, `srun` will fail with `401 Unauthorized` when pulling from `nvcr.io`. + Submit and capture the job ID: ```bash diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md index f8eab5561b..714d9fa522 100644 --- a/.claude/skills/evaluation/SKILL.md +++ b/.claude/skills/evaluation/SKILL.md @@ -286,6 +286,19 @@ After job submission, you can monitor progress using: --- +### NEL CI and Cluster-Specific Notes + +For running evaluations on NVIDIA JET clusters (oci-hsg, cw, oci-nrt) or SLURM clusters like dlcluster, read `references/nel-ci-guide.md`.
It covers: +- NEL CI GitLab trigger pattern vs NEL SLURM executor +- Cluster-specific GPU counts and storage paths +- Checkpoint availability (compute nodes may not share login node filesystems) +- Environment variable prefixes (`host:`, `lit:`) for SLURM executor +- SGLang must bind `--host 0.0.0.0` for health checks +- Directory setup and `chmod 777` for JET service account access +- Common issues (NGC auth, gated datasets, walltime, `NEL_OTHER_OVERRIDES` space-splitting) + +--- + Direct users with issues to: - **GitHub Issues:** From 2e84f3ba52bb63361d7f8dacc8152887a3160b6a Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sun, 12 Apr 2026 02:37:18 -0700 Subject: [PATCH 2/8] Polish eval skills Signed-off-by: Zhiyu Cheng --- .../evaluation/references/nel-ci-guide.md | 189 ++++++++++++++++++ 1 file changed, 189 insertions(+) create mode 100644 .claude/skills/evaluation/references/nel-ci-guide.md diff --git a/.claude/skills/evaluation/references/nel-ci-guide.md b/.claude/skills/evaluation/references/nel-ci-guide.md new file mode 100644 index 0000000000..771558f6cd --- /dev/null +++ b/.claude/skills/evaluation/references/nel-ci-guide.md @@ -0,0 +1,189 @@ +# NEL CI Evaluation Guide + +NEL CI is the recommended entry point for running evaluations on NVIDIA JET infrastructure. This guide covers patterns for evaluating quantized checkpoints using both the NEL SLURM executor (direct) and the NEL CI GitLab pipeline. + +Reference repo: `gitlab-master.nvidia.com/dl/JoC/competitive_evaluation/nemo-evaluator-launcher-ci` + +--- + +## 1. 
Two Execution Paths + +| Path | When to use | How it works | +|------|-------------|--------------| +| **NEL SLURM executor** | You have SSH access to the cluster, checkpoint is on cluster storage | `nel run --config config.yaml` from your workstation; NEL SSHes to cluster and submits sbatch jobs | +| **NEL CI GitLab pipeline** | You want managed infrastructure, MLflow export, reproducible configs | Trigger via GitLab API or UI; JET orchestrates everything | + +### NEL SLURM executor + +Best for iterative development and debugging. Run from any machine with SSH access to the cluster: + +```bash +export DUMMY_API_KEY=dummy +export HF_TOKEN= + +nel run --config eval_config.yaml \ + -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10 # test first +``` + +### NEL CI trigger + +Best for production evaluations with MLflow tracking. See the trigger script pattern in section 4. + +--- + +## 2. Cluster Reference + +| Cluster | GPUs/Node | Architecture | Max Walltime | Storage | Notes | +|---------|-----------|-------------|--------------|---------|-------| +| oci-hsg | 4 | GB200 | 4 hours | `/lustre/` | Set `tensor_parallel_size=4` | +| cw | 8 | H100 | — | `/lustre/` | — | +| oci-nrt | 8 | H100 | — | `/lustre/` | Numerics configs | +| dlcluster | 4 (B100 partition) | B100 | 8 hours | `/home/omniml_data_*` | No `/lustre/`; use local NFS paths | + +**Important**: `deployment.tensor_parallel_size` determines how many GPUs are requested. If this exceeds the cluster's GPUs per node, the job fails with a memory allocation error. + +--- + +## 3. Checkpoint Availability + +The checkpoint must be on a filesystem accessible from the cluster's **compute nodes** (not just login nodes). 
+ +| Cluster type | Accessible storage | NOT accessible | +|-------------|-------------------|----------------| +| JET clusters (oci-hsg, cw, oci-nrt) | `/lustre/fsw/...` | Workstation paths (`/home/scratch.*`), NFS mounts from other clusters | +| dlcluster | `/home/omniml_data_*`, `/home/scratch.*` | `/lustre/` (not available) | + +If the checkpoint is on a workstation, **copy it to cluster storage first**: + +```bash +rsync -av /path/to/local/checkpoint \ + <cluster-login>:/lustre/fsw/portfolios/coreai/users/$USER/checkpoints/ +``` + +For dlcluster, the checkpoint paths are directly accessible since the NFS mounts are shared between login and compute nodes. + +--- + +## 4. NEL CI Trigger Pattern + +For JET clusters, trigger evaluations via the GitLab API. Use `NEL_DEPLOYMENT_COMMAND` (not `NEL_OTHER_OVERRIDES` with `deployment.extra_args`) because `NEL_OTHER_OVERRIDES` splits values on spaces, breaking multi-flag commands. + +```bash +export GITLAB_TOKEN= + +curl -k --request POST \ + --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \ + --header "Content-Type: application/json" \ + --data '{ + "ref": "main", + "variables": [ + {"key": "NEL_CONFIG_PATH", "value": "configs/AA/minimax_m2_5_lbd_lax.yaml"}, + {"key": "NEL_ACCOUNT", "value": "coreai_dlalgo_modelopt"}, + {"key": "NEL_CLUSTER", "value": "oci-hsg"}, + {"key": "NEL_CHECKPOINT_OR_ARTIFACT", "value": "/lustre/.../checkpoint"}, + {"key": "NEL_DEPLOYMENT_IMAGE", "value": "vllm/vllm-openai:v0.19.0"}, + {"key": "NEL_TASKS", "value": "simple_evals.gpqa_diamond_aa_v3"}, + {"key": "NEL_DEPLOYMENT_COMMAND", "value": "vllm serve /checkpoint --host 0.0.0.0 --port 8000 --tensor-parallel-size 4 --quantization modelopt_fp4 --trust-remote-code --served-model-name my-model"}, + {"key": "NEL_OTHER_OVERRIDES", "value": "deployment.tensor_parallel_size=4 execution.walltime=04:00:00"}, + {"key": "NEL_HF_HOME", "value": "/lustre/.../cache/huggingface"}, + {"key": "NEL_VLLM_CACHE", "value": "/lustre/.../cache/vllm"}, + {"key": 
"NEL_CLUSTER_OUTPUT_DIR", "value": "/lustre/.../nv-eval-rundirs"} + ] + }' \ + "https://gitlab-master.nvidia.com/api/v4/projects/221804/pipeline" +``` + +--- + +## 5. Environment Variables + +### SLURM executor format + +Env vars in NEL SLURM configs require explicit prefixes: + +| Prefix | Meaning | Example | +|--------|---------|---------| +| `host:VAR_NAME` | Read from the host environment where `nel run` is executed | `host:HF_TOKEN` | +| `lit:value` | Literal string value | `lit:dummy` | + +```yaml +evaluation: + env_vars: + DUMMY_API_KEY: host:DUMMY_API_KEY + HF_TOKEN: host:HF_TOKEN +``` + +### JET executor format + +JET configs reference JET secrets with `$SECRET_NAME`: + +```yaml +execution: + env_vars: + evaluation: + HF_TOKEN: $COMPEVAL_HF_TOKEN +``` + +### Gated datasets + +Tasks that download gated HuggingFace datasets (e.g., GPQA, HLE) need `HF_TOKEN` passed to the evaluation container. Set it at the evaluation level or per-task: + +```yaml +evaluation: + env_vars: + HF_TOKEN: host:HF_TOKEN # SLURM executor + tasks: + - name: simple_evals.gpqa_diamond + env_vars: + HF_TOKEN: host:HF_TOKEN +``` + +--- + +## 6. Serving Framework Notes + +### vLLM + +- Binds to `0.0.0.0` by default — health checks work out of the box +- For NVFP4: `--quantization modelopt_fp4` +- For unsupported models (e.g., ministral3): may need custom `deployment.command` to patch the framework before serving (see `deployment/references/unsupported-models.md`) + +### SGLang + +- **Must include `--host 0.0.0.0`** — SGLang defaults to `127.0.0.1` which blocks health checks from the eval client +- Must include `--port 8000` to match NEL's expected port +- For NVFP4: `--quantization modelopt_fp4` + +--- + +## 7. 
Common Issues + +| Issue | Cause | Fix | +|-------|-------|-----| +| `401 Unauthorized` pulling eval container | NGC credentials not set on cluster | Set up `~/.config/enroot/.credentials` with NGC API key | +| `PermissionError: /hf-cache/...` | HF cache dir not writable by svc-jet | Set `NEL_HF_HOME` to your own `chmod 777` directory | +| Health check stuck at `000` | Server binding to localhost | Add `--host 0.0.0.0` to deployment command (SGLang) | +| `Memory required by task is not available` | TP size exceeds GPUs/node | Set `tensor_parallel_size` to match cluster (4 for oci-hsg, dlcluster B100) | +| TIMEOUT after eval completes | Walltime too short for eval + MLflow export | Set `execution.walltime=04:00:00` | +| Gated dataset auth failure | `HF_TOKEN` not passed to eval container | Add `env_vars.HF_TOKEN` at evaluation or task level | +| `NEL_OTHER_OVERRIDES` splits `extra_args` | Space-separated parsing breaks multi-flag values | Use `NEL_DEPLOYMENT_COMMAND` instead | +| Checkpoint not found in container | Path not on cluster compute-node filesystem | Copy checkpoint to `/lustre/` (or cluster-accessible path) first | +| `trusted_eval` type mismatch in MLflow export | NEL writes boolean `true` instead of string `"true"` | Fix with `sed -i "s/trusted_eval: true/trusted_eval: 'true'/"` in export config | + +--- + +## 8. Directory Setup for JET Clusters + +Before running evaluations on a JET cluster, create writable directories: + +```bash +ssh <cluster-login> +mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface +mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/cache/vllm +mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs +chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface +chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/cache/vllm +chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs +``` + +`chmod 777` is required because `svc-jet` (JET service account) runs containers and needs write access.
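The setup above is easy to get wrong silently, so it is worth verifying before triggering a pipeline. A minimal sketch; the `world_writable` helper is illustrative, and the demo uses a temp directory instead of the real `/lustre` paths:

```shell
# Report whether the "others" permission bits allow writes, which is what
# svc-jet needs when it is not the owner of the directory.
world_writable() {
  perms=$(stat -c %a "$1" 2>/dev/null) || { echo "MISSING: $1"; return 1; }
  case "$perms" in
    *7) echo "OK: $1 ($perms)" ;;
    *)  echo "FIX: chmod 777 $1 (currently $perms)" ;;
  esac
}

# Demo on a scratch directory; on the cluster, run it over the three paths above.
demo=$(mktemp -d) && chmod 777 "$demo"
world_writable "$demo"
```

Any directory reported as `FIX:` will surface later as a `PermissionError` inside the JET job, which is much more expensive to debug.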
From e952bcdcefe6d45c928d3911dfa7a3e2a9517819 Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sun, 12 Apr 2026 02:46:14 -0700 Subject: [PATCH 3/8] update Signed-off-by: Zhiyu Cheng --- .claude/skills/evaluation/references/nel-ci-guide.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/.claude/skills/evaluation/references/nel-ci-guide.md b/.claude/skills/evaluation/references/nel-ci-guide.md index 771558f6cd..5c71ed7144 100644 --- a/.claude/skills/evaluation/references/nel-ci-guide.md +++ b/.claude/skills/evaluation/references/nel-ci-guide.md @@ -126,12 +126,16 @@ execution: ### Gated datasets -Tasks that download gated HuggingFace datasets (e.g., GPQA, HLE) need `HF_TOKEN` passed to the evaluation container. Set it at the evaluation level or per-task: +Tasks that download gated HuggingFace datasets (e.g., GPQA, HLE) need `HF_TOKEN` passed to the evaluation container. + +**NEL CI (JET)**: Handled automatically — the `COMPEVAL_HF_TOKEN` JET secret is pre-configured by the eval platform team. No user action needed; you don't even need personal access to the gated dataset. + +**NEL SLURM executor**: You must provide your own HF token, AND your HuggingFace account must have been granted access to the gated dataset (e.g., request access at https://huggingface.co/datasets/Idavidrein/gpqa for GPQA). 
```yaml evaluation: env_vars: - HF_TOKEN: host:HF_TOKEN # SLURM executor + HF_TOKEN: host:HF_TOKEN # SLURM executor — reads from your local env tasks: - name: simple_evals.gpqa_diamond env_vars: From 2cb3b39cc5ddc2c0a3fea11af33e7b526603c790 Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sun, 12 Apr 2026 13:55:38 -0700 Subject: [PATCH 4/8] Add end-to-end workflow doc and cross-skill references MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Add common/end-to-end-workflow.md documenting the PTQ → Deploy → Eval pipeline, workspace continuity, unsupported model handling, NEL deployment.command pattern, and NEL CI vs SLURM executor decision table - Add cross-skill workspace flow to workspace-management.md - Add "Next steps" to ptq/SKILL.md pointing to deployment/evaluation - Add pipeline integration note to evaluation/SKILL.md Depends on PR #1236 (deployment/references/unsupported-models.md). Signed-off-by: Zhiyu Cheng --- .claude/skills/common/end-to-end-workflow.md | 70 +++++++++++++++++++ .claude/skills/common/workspace-management.md | 19 +++++ .claude/skills/evaluation/SKILL.md | 4 +- .claude/skills/ptq/SKILL.md | 2 + 4 files changed, 94 insertions(+), 1 deletion(-) create mode 100644 .claude/skills/common/end-to-end-workflow.md diff --git a/.claude/skills/common/end-to-end-workflow.md b/.claude/skills/common/end-to-end-workflow.md new file mode 100644 index 0000000000..1dae03c2e5 --- /dev/null +++ b/.claude/skills/common/end-to-end-workflow.md @@ -0,0 +1,70 @@ +# End-to-End Workflow: PTQ → Deploy → Eval + +This document ties together the three domain skills (PTQ, Deployment, Evaluation) for the common workflow of quantizing a model, deploying it, and evaluating accuracy. 
+ +## Pipeline Overview + +```text +PTQ (quantize) → Deployment (serve) → Evaluation (benchmark) +───────────────── ────────────────── ──────────────────────── +hf_ptq.py vLLM / SGLang / TRT-LLM NEL (SLURM or JET) + ↓ ↓ ↓ +NVFP4/FP8 checkpoint OpenAI-compatible API MMLU, GSM8K, GPQA scores + (safetensors) (http://host:8000) (results.yml) +``` + +## Workspace Continuity + +All three stages share the same workspace directory. The PTQ output becomes the deployment input, and eval results land alongside: + +```text +workspaces/model-name-format/ + output/ ← PTQ checkpoint (safetensors + config.json) + eval_results/ ← NEL evaluation artifacts (results.yml per task) + eval_config.yaml ← NEL config for evaluation + scripts/ ← Custom run scripts (if needed) + logs/ ← SLURM job logs +``` + +When starting a deployment or evaluation step, always check for an existing workspace from a prior PTQ run: + +```bash +ls workspaces/ +``` + +## Unsupported Models + +Models not in the verified support matrices require extra work at each stage: + +| Stage | What can go wrong | Reference | +|-------|-------------------|-----------| +| **PTQ** | Unknown architecture, FP8 source checkpoint, VLM structure | `ptq/references/unsupported-models.md` | +| **Deployment** | Missing architecture mapping, weight key mismatches, quant/unquant layer confusion | `deployment/references/unsupported-models.md` | +| **Evaluation** | Framework patches needed in deployment container, gated datasets, cluster storage | `evaluation/references/nel-ci-guide.md` | + +Each stage has its own debug loop (run → read error → diagnose → patch → re-run). Fixes from one stage often inform the next — e.g., if PTQ required a transformers upgrade, deployment and evaluation will too. 
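One way to make that carry-over concrete is to keep stage-spanning fixes in a single script inside the shared workspace; a minimal sketch, where the workspace name, script path, and pip pin are illustrative:

```shell
# Record the fix once so the PTQ job, the serving container, and the NEL
# deployment.command override can all apply the identical patch step.
mkdir -p workspaces/demo-model/scripts
cat > workspaces/demo-model/scripts/common_patches.sh << 'EOF'
#!/bin/sh
set -e
# Fix discovered during PTQ that deployment and evaluation will need as well:
pip install "transformers>=5.0.0.dev0" --pre -q
EOF
chmod +x workspaces/demo-model/scripts/common_patches.sh
echo "patch script ready: workspaces/demo-model/scripts/common_patches.sh"
```

Later stages then source the same script instead of re-discovering the fix from a fresh error message.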
+ +## NEL Evaluation with Custom Deployments + +When the serving framework needs runtime patches (e.g., transformers upgrade, model handler fix), override `deployment.command` in the NEL config to inject fixes before serving: + +```yaml +deployment: + command: >- + pip install "transformers>=5.0.0.dev0" --pre -q && + sed -i 's/old_pattern/new_pattern/' /path/to/framework/file.py && + ${deployment.base_command} +``` + +This works with both NEL SLURM executor and NEL CI (via `NEL_DEPLOYMENT_COMMAND`). + +## Decision: NEL SLURM Executor vs NEL CI (JET) + +| Factor | NEL SLURM executor | NEL CI (JET) | +|--------|-------------------|--------------| +| **When to use** | Iterative debugging, checkpoint on non-JET cluster, custom patches needed | Production evals, MLflow tracking, reproducible configs | +| **Checkpoint location** | Any cluster you have SSH access to | Must be on JET cluster `/lustre/` storage | +| **Secrets (HF_TOKEN, NGC)** | Provide your own via `host:` env vars | Managed centrally via JET secrets | +| **Container patches** | Override `deployment.command` | Use `NEL_DEPLOYMENT_COMMAND` | +| **MLflow export** | Manual setup | Automatic | +| **Gated datasets** | Your HF account needs access | Handled by `COMPEVAL_HF_TOKEN` | diff --git a/.claude/skills/common/workspace-management.md b/.claude/skills/common/workspace-management.md index bd32916632..5d85e91186 100644 --- a/.claude/skills/common/workspace-management.md +++ b/.claude/skills/common/workspace-management.md @@ -92,6 +92,21 @@ rsync -a --quiet \ "$MODELOPT_REPO_DIR/" "$MODELOPT_WORKSPACE_ROOT//" ``` +## Cross-Skill Workspace Flow + +Workspaces carry over across the PTQ → Deploy → Eval pipeline. 
Each stage adds to the same directory: + +```text +workspaces/model-name-format/ + output/ ← PTQ: quantized checkpoint + eval_results/ ← Evaluation: NEL artifacts (results.yml per task) + eval_config.yaml ← Evaluation: NEL config + scripts/ ← Deployment/PTQ: custom run scripts + logs/ ← All: SLURM job logs +``` + +See `skills/common/end-to-end-workflow.md` for the full pipeline. + ## Example Flow ```text @@ -104,6 +119,10 @@ User: "deploy the model I just quantized" Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4" → reuse, find checkpoint at workspaces/qwen3-0.6b-nvfp4/output/ +User: "evaluate the quantized model on MMLU and GSM8K" +Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4" + → reuse, write eval_config.yaml, results to workspaces/qwen3-0.6b-nvfp4/eval_results/ + User: "now quantize Llama-3.1-8B with fp8" Agent: ls workspaces/ → no llama → mkdir workspaces/llama-3.1-8b-fp8 diff --git a/.claude/skills/evaluation/SKILL.md b/.claude/skills/evaluation/SKILL.md index 714d9fa522..5174e7befa 100644 --- a/.claude/skills/evaluation/SKILL.md +++ b/.claude/skills/evaluation/SKILL.md @@ -12,10 +12,12 @@ license: Apache-2.0 You're an expert in NeMo Evaluator Launcher! Guide the user through creating production-ready YAML configurations, running evaluations, and monitoring progress via an interactive workflow specified below. -### Workspace (multi-user / Slack bot) +### Workspace and Pipeline Integration If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Check for existing workspaces — especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications. +This skill is often the final stage of the PTQ → Deploy → Eval pipeline. If the model required runtime patches during deployment (transformers upgrade, framework source fixes), carry those patches into the NEL config via `deployment.command`. 
See `skills/common/end-to-end-workflow.md` for the full pipeline. + ### Workflow ```text diff --git a/.claude/skills/ptq/SKILL.md b/.claude/skills/ptq/SKILL.md index 932f62ec2c..79074dbd6e 100644 --- a/.claude/skills/ptq/SKILL.md +++ b/.claude/skills/ptq/SKILL.md @@ -113,6 +113,8 @@ ls -lh / Report the path and size to the user. +**Next steps**: If the user wants to deploy or evaluate the quantized checkpoint, use the **deployment** or **evaluation** skill. The checkpoint workspace carries over — see `skills/common/end-to-end-workflow.md` for the full PTQ → Deploy → Eval pipeline. If the model required patches during PTQ (e.g., transformers upgrade), the same fixes will likely be needed at deployment and evaluation time. + ## Key API Rules - `mtq.register()` classes **must** define `_setup()` and call it from `__init__` From 1b94fc98b13054193d388d69b4ca6079ba7f3e64 Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sun, 12 Apr 2026 17:55:40 -0700 Subject: [PATCH 5/8] fix format Signed-off-by: Zhiyu Cheng --- .claude/skills/evaluation/references/nel-ci-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.claude/skills/evaluation/references/nel-ci-guide.md b/.claude/skills/evaluation/references/nel-ci-guide.md index 5c71ed7144..208088ad2d 100644 --- a/.claude/skills/evaluation/references/nel-ci-guide.md +++ b/.claude/skills/evaluation/references/nel-ci-guide.md @@ -130,7 +130,7 @@ Tasks that download gated HuggingFace datasets (e.g., GPQA, HLE) need `HF_TOKEN` **NEL CI (JET)**: Handled automatically — the `COMPEVAL_HF_TOKEN` JET secret is pre-configured by the eval platform team. No user action needed; you don't even need personal access to the gated dataset. -**NEL SLURM executor**: You must provide your own HF token, AND your HuggingFace account must have been granted access to the gated dataset (e.g., request access at https://huggingface.co/datasets/Idavidrein/gpqa for GPQA). 
+**NEL SLURM executor**: You must provide your own HF token, AND your HuggingFace account must have been granted access to the gated dataset (e.g., request access at <https://huggingface.co/datasets/Idavidrein/gpqa> for GPQA). ```yaml evaluation: env_vars: HF_TOKEN: host:HF_TOKEN # SLURM executor — reads from your local env tasks: - name: simple_evals.gpqa_diamond env_vars: HF_TOKEN: host:HF_TOKEN From b1be817ac2130b0a8b9eaade6063c027adee208f Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sun, 12 Apr 2026 21:41:25 -0700 Subject: [PATCH 6/8] Add NEL CI learnings: wrapper script pattern, cross-cluster copy, Hydra escaping - Add wrapper script pattern for complex deployment commands that break Hydra's override parser (put serve.sh in checkpoint dir, reference as bash /checkpoint/serve.sh) - Add NEL_CONFIG_BASE64 + Python trigger pattern for custom configs - Add cross-cluster checkpoint copy via tar pipe - Add Hydra LexerNoViableAltException and Bad Request to common issues Learned from triggering full AA evaluation (MMLU-PRO, GPQA Diamond, LiveCodeBench, SCICODE, AIME 2025, Terminal-Bench Hard) for Devstral-Small-2-24B NVFP4 on oci-hsg via NEL CI. Signed-off-by: Zhiyu Cheng --- .../evaluation/references/nel-ci-guide.md | 86 ++++++++++++++++++- 1 file changed, 84 insertions(+), 2 deletions(-) diff --git a/.claude/skills/evaluation/references/nel-ci-guide.md b/.claude/skills/evaluation/references/nel-ci-guide.md index 208088ad2d..42cde09e23 100644 --- a/.claude/skills/evaluation/references/nel-ci-guide.md +++ b/.claude/skills/evaluation/references/nel-ci-guide.md @@ -60,13 +60,26 @@ rsync -av /path/to/local/checkpoint \ <cluster-login>:/lustre/fsw/portfolios/coreai/users/$USER/checkpoints/ ``` +**Cross-cluster copy** (e.g., dlcluster → oci-hsg): If the two clusters can't SSH to each other directly, pipe through your workstation without staging to disk: + +```bash +ssh user@source-cluster "tar czf - -C /path/to/checkpoint ."
| \ + ssh user@target-cluster "tar xzf - -C /lustre/.../checkpoints/model-name" +``` + +After copying, set permissions for svc-jet: `chmod -R 777 /lustre/.../checkpoints/model-name` + For dlcluster, the checkpoint paths are directly accessible since the NFS mounts are shared between login and compute nodes. --- ## 4. NEL CI Trigger Pattern -For JET clusters, trigger evaluations via the GitLab API. Use `NEL_DEPLOYMENT_COMMAND` (not `NEL_OTHER_OVERRIDES` with `deployment.extra_args`) because `NEL_OTHER_OVERRIDES` splits values on spaces, breaking multi-flag commands. +For JET clusters, trigger evaluations via the GitLab API. + +### Simple deployment (standard models) + +For models that work with stock vLLM/SGLang, use `NEL_DEPLOYMENT_COMMAND` directly: ```bash export GITLAB_TOKEN= @@ -82,7 +95,6 @@ curl -k --request POST \ {"key": "NEL_CLUSTER", "value": "oci-hsg"}, {"key": "NEL_CHECKPOINT_OR_ARTIFACT", "value": "/lustre/.../checkpoint"}, {"key": "NEL_DEPLOYMENT_IMAGE", "value": "vllm/vllm-openai:v0.19.0"}, - {"key": "NEL_TASKS", "value": "simple_evals.gpqa_diamond_aa_v3"}, {"key": "NEL_DEPLOYMENT_COMMAND", "value": "vllm serve /checkpoint --host 0.0.0.0 --port 8000 --tensor-parallel-size 4 --quantization modelopt_fp4 --trust-remote-code --served-model-name my-model"}, {"key": "NEL_OTHER_OVERRIDES", "value": "deployment.tensor_parallel_size=4 execution.walltime=04:00:00"}, {"key": "NEL_HF_HOME", "value": "/lustre/.../cache/huggingface"}, @@ -93,6 +105,74 @@ curl -k --request POST \ "https://gitlab-master.nvidia.com/api/v4/projects/221804/pipeline" ``` +### Complex deployment (unsupported models needing runtime patches) + +If the model needs runtime patches (e.g., transformers upgrade, framework source fixes), **do NOT put multi-step commands in `NEL_DEPLOYMENT_COMMAND`** — Hydra's override parser will break on nested quotes, `&&`, `$()`, etc. 
+ +Instead, use the **wrapper script pattern**: place a `serve.sh` in the checkpoint directory on the cluster, then point `NEL_DEPLOYMENT_COMMAND` to it. + +**Step 1** — Write wrapper script to the checkpoint directory on the cluster: + +```bash +ssh <cluster-login> 'cat > /lustre/.../checkpoint/serve.sh << '"'"'EOF'"'"' +#!/bin/bash +set -e +pip install "transformers>=5.0.0.dev0" "huggingface_hub>=0.32.0" --pre -q +# Patch vLLM for ministral3 support (example) +MISTRAL3_PY=$(find /usr/local/lib -path "*/vllm/model_executor/models/mistral3.py" 2>/dev/null | head -1) +sed -i "s/old_pattern/new_pattern/" "$MISTRAL3_PY" +exec vllm serve /checkpoint --host 0.0.0.0 --port 8000 \ + --tensor-parallel-size 4 --quantization modelopt_fp4 \ + --trust-remote-code --served-model-name my-model --gpu-memory-utilization 0.9 +EOF +chmod 777 /lustre/.../checkpoint/serve.sh' +``` + +**Step 2** — Set `NEL_DEPLOYMENT_COMMAND` to the wrapper: + +``` +{"key": "NEL_DEPLOYMENT_COMMAND", "value": "bash /checkpoint/serve.sh"} +``` + +This works because the checkpoint is mounted at `/checkpoint` inside the container. The script is Hydra-safe (no special characters in the override value). + +### Custom configs with `NEL_CONFIG_BASE64` + +When using a custom config (not from the repo), use `NEL_CONFIG_BASE64` instead of `NEL_CONFIG_PATH`.
This requires setting `NEL_UNTRUSTED_EVAL=true`: + +```python +import json, base64, subprocess, os + +with open("my_config.yaml") as f: + config_b64 = base64.b64encode(f.read().encode()).decode() + +payload = { + "ref": "main", + "variables": [ + {"key": "NEL_CONFIG_BASE64", "value": config_b64}, + {"key": "NEL_ACCOUNT", "value": "coreai_dlalgo_modelopt"}, + {"key": "NEL_CLUSTER", "value": "oci-hsg"}, + {"key": "NEL_CHECKPOINT_OR_ARTIFACT", "value": "/lustre/.../checkpoint"}, + {"key": "NEL_DEPLOYMENT_IMAGE", "value": "vllm/vllm-openai:v0.19.0"}, + {"key": "NEL_DEPLOYMENT_COMMAND", "value": "bash /checkpoint/serve.sh"}, + {"key": "NEL_UNTRUSTED_EVAL", "value": "true"}, + # ... other variables + ] +} + +# Use Python to construct JSON (avoids shell escaping issues with curl) +token = os.environ["GITLAB_TOKEN"] +subprocess.run( + ["curl", "-k", "--request", "POST", + "--header", f"PRIVATE-TOKEN: {token}", + "--header", "Content-Type: application/json", + "--data", json.dumps(payload), + "https://gitlab-master.nvidia.com/api/v4/projects/221804/pipeline"], +) +``` + +> **Tip**: Use Python (not bash) to construct the JSON payload for `curl`. Shell escaping of base64 strings and nested quotes is error-prone. + --- ## 5. 
Environment Variables @@ -173,6 +253,8 @@ evaluation: | `NEL_OTHER_OVERRIDES` splits `extra_args` | Space-separated parsing breaks multi-flag values | Use `NEL_DEPLOYMENT_COMMAND` instead | | Checkpoint not found in container | Path not on cluster compute-node filesystem | Copy checkpoint to `/lustre/` (or cluster-accessible path) first | | `trusted_eval` type mismatch in MLflow export | NEL writes boolean `true` instead of string `"true"` | Fix with `sed -i "s/trusted_eval: true/trusted_eval: 'true'/"` in export config | +| `LexerNoViableAltException` in Hydra | `NEL_DEPLOYMENT_COMMAND` contains quotes, `&&`, `$()` | Use wrapper script pattern (section 4): put script in checkpoint dir, set command to `bash /checkpoint/serve.sh` | +| `Bad Request` from GitLab API trigger | Shell escaping mangled the JSON payload | Use Python to construct JSON (section 4) instead of bash heredocs/string interpolation | --- From 7dcede44f3677c8689bba9750ba43a5933ba5d68 Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sun, 12 Apr 2026 21:44:14 -0700 Subject: [PATCH 7/8] fix format Signed-off-by: Zhiyu Cheng --- .claude/skills/evaluation/references/nel-ci-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.claude/skills/evaluation/references/nel-ci-guide.md b/.claude/skills/evaluation/references/nel-ci-guide.md index 42cde09e23..1caec64627 100644 --- a/.claude/skills/evaluation/references/nel-ci-guide.md +++ b/.claude/skills/evaluation/references/nel-ci-guide.md @@ -130,7 +130,7 @@ chmod 777 /lustre/.../checkpoint/serve.sh' **Step 2** — Set `NEL_DEPLOYMENT_COMMAND` to the wrapper: -``` +```json {"key": "NEL_DEPLOYMENT_COMMAND", "value": "bash /checkpoint/serve.sh"} ``` From 8176fc7089685f503eb2f32a5c686bc618de5362 Mon Sep 17 00:00:00 2001 From: Zhiyu Cheng Date: Sun, 12 Apr 2026 22:32:52 -0700 Subject: [PATCH 8/8] Add served_model_name mismatch to NEL CI common issues MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When 
using NEL_DEPLOYMENT_COMMAND with a custom --served-model-name, deployment.served_model_name must also be overridden via NEL_OTHER_OVERRIDES — NEL uses the config field (not the actual serve command) to set the eval client's model_id. Without this, the client sends the checkpoint path as model_id, causing 404 errors. Signed-off-by: Zhiyu Cheng --- .claude/skills/evaluation/references/nel-ci-guide.md | 1 + 1 file changed, 1 insertion(+) diff --git a/.claude/skills/evaluation/references/nel-ci-guide.md b/.claude/skills/evaluation/references/nel-ci-guide.md index 1caec64627..846d0236c8 100644 --- a/.claude/skills/evaluation/references/nel-ci-guide.md +++ b/.claude/skills/evaluation/references/nel-ci-guide.md @@ -255,6 +255,7 @@ evaluation: | `trusted_eval` type mismatch in MLflow export | NEL writes boolean `true` instead of string `"true"` | Fix with `sed -i "s/trusted_eval: true/trusted_eval: 'true'/"` in export config | | `LexerNoViableAltException` in Hydra | `NEL_DEPLOYMENT_COMMAND` contains quotes, `&&`, `$()` | Use wrapper script pattern (section 4): put script in checkpoint dir, set command to `bash /checkpoint/serve.sh` | | `Bad Request` from GitLab API trigger | Shell escaping mangled the JSON payload | Use Python to construct JSON (section 4) instead of bash heredocs/string interpolation | +| `The model does not exist` (404) | Eval client uses checkpoint path as model_id instead of served_model_name | Add `deployment.served_model_name=<name>` to `NEL_OTHER_OVERRIDES` to match `--served-model-name` in your serve command | ---
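A cheap way to catch this mismatch before a long eval run is to ask the running server which model ids it registered; `/v1/models` is part of the OpenAI-compatible API that both vLLM and SGLang expose, and the endpoint below assumes NEL's default port:

```shell
# List the model ids the server accepts; the eval client's model_id must match
# one of them. Prints a fallback message when no server is reachable.
endpoint="http://localhost:8000"
curl -sf "$endpoint/v1/models" 2>/dev/null || echo "server not reachable at $endpoint"
```

If the listed id is the checkpoint path rather than your `--served-model-name`, the `deployment.served_model_name` override did not take effect.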