70 changes: 70 additions & 0 deletions .claude/skills/common/end-to-end-workflow.md
@@ -0,0 +1,70 @@
# End-to-End Workflow: PTQ → Deploy → Eval

This document ties together the three domain skills (PTQ, Deployment, Evaluation) for the common workflow of quantizing a model, deploying it, and evaluating accuracy.

## Pipeline Overview

```text
PTQ (quantize) → Deployment (serve) → Evaluation (benchmark)
───────────────── ────────────────── ────────────────────────
hf_ptq.py vLLM / SGLang / TRT-LLM NEL (SLURM or JET)
↓ ↓ ↓
NVFP4/FP8 checkpoint OpenAI-compatible API MMLU, GSM8K, GPQA scores
(safetensors) (http://host:8000) (results.yml)
```

## Workspace Continuity

All three stages share the same workspace directory. The PTQ output becomes the deployment input, and evaluation results land alongside it:

```text
workspaces/model-name-format/
output/ ← PTQ checkpoint (safetensors + config.json)
eval_results/ ← NEL evaluation artifacts (results.yml per task)
eval_config.yaml ← NEL config for evaluation
scripts/ ← Custom run scripts (if needed)
logs/ ← SLURM job logs
```

When starting a deployment or evaluation step, always check for an existing workspace from a prior PTQ run:

```bash
ls workspaces/
```
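That check folds naturally into a reuse-or-create step. A minimal sketch, assuming the layout above (the workspace name `qwen3-0.6b-nvfp4` is a hypothetical example; `MODELOPT_WORKSPACE_ROOT` is the optional root override described in `workspace-management.md`):

```shell
#!/bin/sh
# Sketch: reuse an existing workspace when its PTQ checkpoint is present,
# otherwise create the standard layout. Workspace name is hypothetical.
WS_ROOT="${MODELOPT_WORKSPACE_ROOT:-workspaces}"
WS="$WS_ROOT/qwen3-0.6b-nvfp4"

if [ -f "$WS/output/config.json" ]; then
  echo "Reusing PTQ checkpoint in $WS/output"
else
  echo "No prior PTQ run found; creating $WS"
  mkdir -p "$WS/output" "$WS/eval_results" "$WS/scripts" "$WS/logs"
fi
```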

## Unsupported Models

Models not in the verified support matrices require extra work at each stage:

| Stage | What can go wrong | Reference |
|-------|-------------------|-----------|
| **PTQ** | Unknown architecture, FP8 source checkpoint, VLM structure | `ptq/references/unsupported-models.md` |
| **Deployment** | Missing architecture mapping, weight key mismatches, quant/unquant layer confusion | `deployment/references/unsupported-models.md` |
| **Evaluation** | Framework patches needed in deployment container, gated datasets, cluster storage | `evaluation/references/nel-ci-guide.md` |

Each stage has its own debug loop (run → read error → diagnose → patch → re-run). Fixes from one stage often inform the next — e.g., if PTQ required a transformers upgrade, deployment and evaluation will too.

## NEL Evaluation with Custom Deployments

When the serving framework needs runtime patches (e.g., transformers upgrade, model handler fix), override `deployment.command` in the NEL config to inject fixes before serving:

```yaml
deployment:
command: >-
pip install "transformers>=5.0.0.dev0" --pre -q &&
sed -i 's/old_pattern/new_pattern/' /path/to/framework/file.py &&
${deployment.base_command}
```

This works with the NEL SLURM executor. The NEL CI (JET) pipeline supports the same pattern via `NEL_DEPLOYMENT_COMMAND`, an internal extension beyond the officially documented NEL executors (local, SLURM, Lepton AI).
Comment on lines +47 to +59
Contributor

⚠️ Potential issue | 🟠 Major

Remove or verify the unsupported CI mode claim for the `deployment.command` override.

Line 59 claims the override pattern works with "both NEL SLURM executor and NEL CI (via `NEL_DEPLOYMENT_COMMAND`)". Official NEL documentation covers only the local, Slurm, and Lepton AI executors, not GitLab CI. Either drop the CI claim or state clearly that `NEL_DEPLOYMENT_COMMAND` is an extension beyond officially supported NEL features.


## Decision: NEL SLURM Executor vs NEL CI (JET)

| Factor | NEL SLURM executor | NEL CI (JET) |
|--------|-------------------|--------------|
| **When to use** | Iterative debugging, checkpoint on non-JET cluster, custom patches needed | Production evals, MLflow tracking, reproducible configs |
| **Checkpoint location** | Any cluster you have SSH access to | Must be on JET cluster `/lustre/` storage |
| **Secrets (HF_TOKEN, NGC)** | Provide your own via `host:` env vars | Managed centrally via JET secrets |
| **Container patches** | Override `deployment.command` | Use `NEL_DEPLOYMENT_COMMAND` |
| **MLflow export** | Manual setup | Automatic |
| **Gated datasets** | Your HF account needs access | Handled by `COMPEVAL_HF_TOKEN` |
11 changes: 11 additions & 0 deletions .claude/skills/common/remote-execution.md
@@ -28,6 +28,17 @@ clusters:
default_cluster: my-cluster
```

### Checkpoint and storage availability

Cluster compute nodes may not share the same filesystem as login nodes or other clusters. Before running any workload that references a checkpoint path, verify the path is accessible from compute nodes:

| Cluster type | Compute-node storage | NOT accessible from compute nodes |
|-------------|---------------------|----------------------------------|
| JET clusters (oci-hsg, cw, oci-nrt) | `/lustre/fsw/...` | Workstation NFS (`/home/scratch.*`), other cluster mounts |
| dlcluster | `/home/omniml_data_*`, `/home/scratch.*` | `/lustre/` paths |

If a checkpoint was produced on a different cluster or workstation, copy it to the target cluster's accessible storage before submitting jobs. NEL and SLURM do NOT sync checkpoints automatically.
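The table can be turned into a quick pre-flight check before job submission. A sketch, with cluster names and storage prefixes taken from the table above (the helper function itself is hypothetical):

```shell
#!/bin/sh
# Sketch: succeed only if a checkpoint path is on storage that the target
# cluster's compute nodes can actually see (prefixes from the table above).
path_visible_on() {
  cluster="$1"; ckpt="$2"
  case "$cluster" in
    oci-hsg|cw|oci-nrt)  # JET clusters: compute nodes see /lustre/fsw only
      case "$ckpt" in /lustre/fsw/*) return 0 ;; esac ;;
    dlcluster)           # dlcluster: no /lustre mounts on compute nodes
      case "$ckpt" in /home/omniml_data_*|/home/scratch.*) return 0 ;; esac ;;
  esac
  return 1
}

path_visible_on oci-hsg /lustre/fsw/models/ckpt && echo "ok to submit"
path_visible_on dlcluster /lustre/fsw/models/ckpt || echo "copy checkpoint first"
```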

See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.

---
14 changes: 14 additions & 0 deletions .claude/skills/common/slurm-setup.md
@@ -51,6 +51,20 @@ srun \
"
```

### Container registry credentials (pyxis)

If `srun --container-image` uses an image from a private registry (e.g., `nvcr.io/nvidia/...`), pyxis/enroot needs credentials on the cluster. Check for existing credentials and add if missing:

```bash
cat ~/.config/enroot/.credentials 2>/dev/null || echo "No credentials"
# To add NGC credentials:
mkdir -p ~/.config/enroot
echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
chmod 600 ~/.config/enroot/.credentials
```

Without this, `srun` will fail with `401 Unauthorized` when pulling from `nvcr.io`.

Submit and capture the job ID:

```bash
19 changes: 19 additions & 0 deletions .claude/skills/common/workspace-management.md
@@ -92,6 +92,21 @@ rsync -a --quiet \
"$MODELOPT_REPO_DIR/" "$MODELOPT_WORKSPACE_ROOT/<name>/"
```

## Cross-Skill Workspace Flow

Workspaces carry over across the PTQ → Deploy → Eval pipeline. Each stage adds to the same directory:

```text
workspaces/model-name-format/
output/ ← PTQ: quantized checkpoint
eval_results/ ← Evaluation: NEL artifacts (results.yml per task)
eval_config.yaml ← Evaluation: NEL config
scripts/ ← Deployment/PTQ: custom run scripts
logs/ ← All: SLURM job logs
```

See `skills/common/end-to-end-workflow.md` for the full pipeline.

## Example Flow

```text
@@ -104,6 +119,10 @@ User: "deploy the model I just quantized"
Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4"
→ reuse, find checkpoint at workspaces/qwen3-0.6b-nvfp4/output/

User: "evaluate the quantized model on MMLU and GSM8K"
Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4"
→ reuse, write eval_config.yaml, results to workspaces/qwen3-0.6b-nvfp4/eval_results/

User: "now quantize Llama-3.1-8B with fp8"
Agent: ls workspaces/ → no llama
→ mkdir workspaces/llama-3.1-8b-fp8
17 changes: 16 additions & 1 deletion .claude/skills/evaluation/SKILL.md
@@ -12,10 +12,12 @@ license: Apache-2.0

You're an expert in NeMo Evaluator Launcher! Guide the user through creating production-ready YAML configurations, running evaluations, and monitoring progress via the interactive workflow specified below.

### Workspace (multi-user / Slack bot)
### Workspace and Pipeline Integration

If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Check for existing workspaces — especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications.

This skill is often the final stage of the PTQ → Deploy → Eval pipeline. If the model required runtime patches during deployment (transformers upgrade, framework source fixes), carry those patches into the NEL config via `deployment.command`. See `skills/common/end-to-end-workflow.md` for the full pipeline.

### Workflow

```text
@@ -286,6 +288,19 @@ After job submission, you can monitor progress using:

---

### NEL CI and Cluster-Specific Notes

For running evaluations on NVIDIA JET clusters (oci-hsg, cw, oci-nrt) or SLURM clusters like dlcluster, read `references/nel-ci-guide.md`. It covers:
- NEL CI GitLab trigger pattern vs NEL SLURM executor
- Cluster-specific GPU counts and storage paths
- Checkpoint availability (compute nodes may not share login node filesystems)
- Environment variable prefixes (`host:`, `lit:`) for SLURM executor
- SGLang must bind `--host 0.0.0.0` for health checks
- Directory setup and `chmod 777` for JET service account access
- Common issues (NGC auth, gated datasets, walltime, `NEL_OTHER_OVERRIDES` space-splitting)
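The directory-setup item can be sketched concretely. A minimal example, assuming the shared workspace layout (the workspace path is hypothetical; `chmod 777` follows the guide's requirement that the JET service account, which runs as a different UNIX user, can write results):

```shell
#!/bin/sh
# Sketch: pre-create eval output dirs and open them up so the JET
# service account (a different user) can write results into them.
WS="workspaces/qwen3-0.6b-nvfp4"   # hypothetical workspace
mkdir -p "$WS/eval_results" "$WS/logs"
chmod 777 "$WS/eval_results" "$WS/logs"
```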

---

Direct users with issues to:

- **GitHub Issues:** <https://github.com/NVIDIA-NeMo/Evaluator/issues>