70 changes: 70 additions & 0 deletions .claude/skills/common/end-to-end-workflow.md
@@ -0,0 +1,70 @@
# End-to-End Workflow: PTQ → Deploy → Eval

This document ties together the three domain skills (PTQ, Deployment, Evaluation) for the common workflow of quantizing a model, deploying it, and evaluating accuracy.

## Pipeline Overview

```text
PTQ (quantize) → Deployment (serve) → Evaluation (benchmark)
───────────────── ────────────────── ────────────────────────
hf_ptq.py vLLM / SGLang / TRT-LLM NEL (SLURM or JET)
↓ ↓ ↓
NVFP4/FP8 checkpoint OpenAI-compatible API MMLU, GSM8K, GPQA scores
(safetensors) (http://host:8000) (results.yml)
```

## Workspace Continuity

All three stages share the same workspace directory. The PTQ output becomes the deployment input, and evaluation results land alongside it:

```text
workspaces/model-name-format/
output/ ← PTQ checkpoint (safetensors + config.json)
eval_results/ ← NEL evaluation artifacts (results.yml per task)
eval_config.yaml ← NEL config for evaluation
scripts/ ← Custom run scripts (if needed)
logs/ ← SLURM job logs
```

When starting a deployment or evaluation step, always check for an existing workspace from a prior PTQ run:

```bash
ls workspaces/
```
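That check folds naturally into a reuse-or-create step. A minimal sketch, assuming the layout above (the workspace name `qwen3-0.6b-nvfp4` is a hypothetical example; `MODELOPT_WORKSPACE_ROOT` is the optional root override described in `workspace-management.md`):

```shell
#!/bin/sh
# Sketch: reuse an existing workspace when its PTQ checkpoint is present,
# otherwise create the standard layout. Workspace name is hypothetical.
WS_ROOT="${MODELOPT_WORKSPACE_ROOT:-workspaces}"
WS="$WS_ROOT/qwen3-0.6b-nvfp4"

if [ -f "$WS/output/config.json" ]; then
  echo "Reusing PTQ checkpoint in $WS/output"
else
  echo "No prior PTQ run found; creating $WS"
  mkdir -p "$WS/output" "$WS/eval_results" "$WS/scripts" "$WS/logs"
fi
```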

## Unsupported Models

Models not in the verified support matrices require extra work at each stage:

| Stage | What can go wrong | Reference |
|-------|-------------------|-----------|
| **PTQ** | Unknown architecture, FP8 source checkpoint, VLM structure | `ptq/references/unsupported-models.md` |
| **Deployment** | Missing architecture mapping, weight key mismatches, quant/unquant layer confusion | `deployment/references/unsupported-models.md` |
| **Evaluation** | Framework patches needed in deployment container, gated datasets, cluster storage | `evaluation/references/nel-ci-guide.md` |

Each stage has its own debug loop (run → read error → diagnose → patch → re-run). Fixes from one stage often inform the next — e.g., if PTQ required a transformers upgrade, deployment and evaluation will too.

## NEL Evaluation with Custom Deployments

When the serving framework needs runtime patches (e.g., transformers upgrade, model handler fix), override `deployment.command` in the NEL config to inject fixes before serving:

```yaml
deployment:
command: >-
pip install "transformers>=5.0.0.dev0" --pre -q &&
sed -i 's/old_pattern/new_pattern/' /path/to/framework/file.py &&
${deployment.base_command}
```

This works with the NEL SLURM executor. The NEL CI (JET) pipeline supports the same pattern via `NEL_DEPLOYMENT_COMMAND`, an internal extension beyond the officially documented NEL executors (local, SLURM, Lepton AI).
Comment on lines +47 to +59
Contributor

⚠️ Potential issue | 🟠 Major

Remove or verify the unsupported CI mode claim for the `deployment.command` override.

Line 59 claims the override pattern works with "both NEL SLURM executor and NEL CI (via `NEL_DEPLOYMENT_COMMAND`)". Official NEL documentation covers only the local, Slurm, and Lepton AI executors, not GitLab CI. Either drop the CI claim or state clearly that `NEL_DEPLOYMENT_COMMAND` is an extension beyond officially supported NEL features.


## Decision: NEL SLURM Executor vs NEL CI (JET)

| Factor | NEL SLURM executor | NEL CI (JET) |
|--------|-------------------|--------------|
| **When to use** | Iterative debugging, checkpoint on non-JET cluster, custom patches needed | Production evals, MLflow tracking, reproducible configs |
| **Checkpoint location** | Any cluster you have SSH access to | Must be on JET cluster `/lustre/` storage |
| **Secrets (HF_TOKEN, NGC)** | Provide your own via `host:` env vars | Managed centrally via JET secrets |
| **Container patches** | Override `deployment.command` | Use `NEL_DEPLOYMENT_COMMAND` |
| **MLflow export** | Manual setup | Automatic |
| **Gated datasets** | Your HF account needs access | Handled by `COMPEVAL_HF_TOKEN` |
11 changes: 11 additions & 0 deletions .claude/skills/common/remote-execution.md
@@ -28,6 +28,17 @@ clusters:
default_cluster: my-cluster
```

### Checkpoint and storage availability

Cluster compute nodes may not share the same filesystem as login nodes or other clusters. Before running any workload that references a checkpoint path, verify the path is accessible from compute nodes:

| Cluster type | Compute-node storage | NOT accessible from compute nodes |
|-------------|---------------------|----------------------------------|
| JET clusters (oci-hsg, cw, oci-nrt) | `/lustre/fsw/...` | Workstation NFS (`/home/scratch.*`), other cluster mounts |
| dlcluster | `/home/omniml_data_*`, `/home/scratch.*` | `/lustre/` paths |

If a checkpoint was produced on a different cluster or workstation, copy it to the target cluster's accessible storage before submitting jobs. NEL and SLURM do NOT sync checkpoints automatically.
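The table can be turned into a quick pre-flight check before job submission. A sketch, with cluster names and storage prefixes taken from the table above (the helper function itself is hypothetical):

```shell
#!/bin/sh
# Sketch: succeed only if a checkpoint path is on storage that the target
# cluster's compute nodes can actually see (prefixes from the table above).
path_visible_on() {
  cluster="$1"; ckpt="$2"
  case "$cluster" in
    oci-hsg|cw|oci-nrt)  # JET clusters: compute nodes see /lustre/fsw only
      case "$ckpt" in /lustre/fsw/*) return 0 ;; esac ;;
    dlcluster)           # dlcluster: no /lustre mounts on compute nodes
      case "$ckpt" in /home/omniml_data_*|/home/scratch.*) return 0 ;; esac ;;
  esac
  return 1
}

path_visible_on oci-hsg /lustre/fsw/models/ckpt && echo "ok to submit"
path_visible_on dlcluster /lustre/fsw/models/ckpt || echo "copy checkpoint first"
```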

See `.claude/clusters.yaml.example` for a fully annotated example with multiple cluster types.

---
14 changes: 14 additions & 0 deletions .claude/skills/common/slurm-setup.md
@@ -51,6 +51,20 @@ srun \
"
```

### Container registry credentials (pyxis)

If `srun --container-image` uses an image from a private registry (e.g., `nvcr.io/nvidia/...`), pyxis/enroot needs credentials on the cluster. Check for existing credentials and add if missing:

```bash
cat ~/.config/enroot/.credentials 2>/dev/null || echo "No credentials"
# To add NGC credentials:
mkdir -p ~/.config/enroot
echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
chmod 600 ~/.config/enroot/.credentials
```

Without this, `srun` will fail with `401 Unauthorized` when pulling from `nvcr.io`.

Submit and capture the job ID:

```bash
19 changes: 19 additions & 0 deletions .claude/skills/common/workspace-management.md
@@ -92,6 +92,21 @@ rsync -a --quiet \
"$MODELOPT_REPO_DIR/" "$MODELOPT_WORKSPACE_ROOT/<name>/"
```

## Cross-Skill Workspace Flow

Workspaces carry over across the PTQ → Deploy → Eval pipeline. Each stage adds to the same directory:

```text
workspaces/model-name-format/
output/ ← PTQ: quantized checkpoint
eval_results/ ← Evaluation: NEL artifacts (results.yml per task)
eval_config.yaml ← Evaluation: NEL config
scripts/ ← Deployment/PTQ: custom run scripts
logs/ ← All: SLURM job logs
```

See `skills/common/end-to-end-workflow.md` for the full pipeline.

## Example Flow

```text
@@ -104,6 +119,10 @@ User: "deploy the model I just quantized"
Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4"
→ reuse, find checkpoint at workspaces/qwen3-0.6b-nvfp4/output/

User: "evaluate the quantized model on MMLU and GSM8K"
Agent: ls workspaces/ → sees "qwen3-0.6b-nvfp4"
→ reuse, write eval_config.yaml, results to workspaces/qwen3-0.6b-nvfp4/eval_results/

User: "now quantize Llama-3.1-8B with fp8"
Agent: ls workspaces/ → no llama
→ mkdir workspaces/llama-3.1-8b-fp8
17 changes: 16 additions & 1 deletion .claude/skills/evaluation/SKILL.md
@@ -12,10 +12,12 @@ license: Apache-2.0

You're an expert in NeMo Evaluator Launcher! Guide the user through creating production-ready YAML configurations, running evaluations, and monitoring progress via the interactive workflow specified below.

### Workspace (multi-user / Slack bot)
### Workspace and Pipeline Integration

If `MODELOPT_WORKSPACE_ROOT` is set, read `skills/common/workspace-management.md`. Check for existing workspaces — especially if evaluating a model from a prior PTQ or deployment step. Reuse the existing workspace so you have access to the quantized checkpoint and any code modifications.

This skill is often the final stage of the PTQ → Deploy → Eval pipeline. If the model required runtime patches during deployment (transformers upgrade, framework source fixes), carry those patches into the NEL config via `deployment.command`. See `skills/common/end-to-end-workflow.md` for the full pipeline.

### Workflow

```text
@@ -286,6 +288,19 @@ After job submission, you can monitor progress using:

---

### NEL CI and Cluster-Specific Notes

For running evaluations on NVIDIA JET clusters (oci-hsg, cw, oci-nrt) or SLURM clusters like dlcluster, read `references/nel-ci-guide.md`. It covers:
- NEL CI GitLab trigger pattern vs NEL SLURM executor
- Cluster-specific GPU counts and storage paths
- Checkpoint availability (compute nodes may not share login node filesystems)
- Environment variable prefixes (`host:`, `lit:`) for SLURM executor
- SGLang must bind `--host 0.0.0.0` for health checks
- Directory setup and `chmod 777` for JET service account access
- Common issues (NGC auth, gated datasets, walltime, `NEL_OTHER_OVERRIDES` space-splitting)
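The directory-setup item can be sketched concretely. A minimal example, assuming the shared workspace layout (the workspace path is hypothetical; `chmod 777` follows the guide's requirement that the JET service account, which runs as a different UNIX user, can write results):

```shell
#!/bin/sh
# Sketch: pre-create eval output dirs and open them up so the JET
# service account (a different user) can write results into them.
WS="workspaces/qwen3-0.6b-nvfp4"   # hypothetical workspace
mkdir -p "$WS/eval_results" "$WS/logs"
chmod 777 "$WS/eval_results" "$WS/logs"
```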

---

Direct users with issues to:

- **GitHub Issues:** <https://github.com/NVIDIA-NeMo/Evaluator/issues>