125 changes: 125 additions & 0 deletions .claude/skills/common/slurm-setup.md
@@ -192,3 +192,128 @@
```bash
chmod -R g+rwX /path/to/.hf_cache/
```

Scope `chmod` to only the directories the job needs — avoid world-writable paths on shared clusters.

---

## 6. Container Registry Authentication
> **Contributor:** This also applies for local container setup, as I suggested in @Edwardf0t1's PR; maybe we can put this in env-setup.md for setting HF_TOKEN, the Docker login token, and the NGC token.

> **Contributor (author):** This is a structural reorganization and I'm fine with either directory.

> **kaix-nv (author), Apr 16, 2026:** Check the issue again. We already have one environment-setup.md; adding another env-setup.md would be confusing. Besides, the auth section is specifically about container registry credentials for SLURM job submission: detecting runtimes, checking credentials per runtime, and fixing auth before `sbatch`. That fits SLURM setup, not general environment setup, so I prefer to keep it in slurm-setup.md. cc @Edwardf0t1


**Before submitting any SLURM job that pulls a container image**, check that the cluster has credentials for the image's registry. Missing auth causes jobs to fail after waiting in the queue — a costly mistake.

### Step 1: Detect the container runtime

Different clusters use different container runtimes. Detect which is available:

```bash
# On the cluster (or via ssh):
which enroot 2>/dev/null && echo "RUNTIME=enroot"
which docker 2>/dev/null && echo "RUNTIME=docker"
```

| Runtime | Typical clusters | SLURM integration |
| --- | --- | --- |
| **enroot/pyxis** | NVIDIA internal (DGX Cloud, EOS, Selene, GCP-NRT) | `srun --container-image` |
| **Docker** | Bare-metal / on-prem with GPU | `docker run` inside job script |
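
For example, on an enroot/pyxis cluster a containerized job step looks like this (a sketch; scheduler flags such as partition, GPU count, and time limit are omitted):

```bash
# pyxis adds --container-image to srun; enroot pulls the image on the compute node.
# Registry and image path are separated by '#', matching the enroot URI syntax used below.
srun --container-image=nvcr.io#nvidia/tensorrt-llm/release:<tag> --pty bash
```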

### Step 2: Check credentials for the image's registry

Determine the registry from the image URI:

> **Contributor:** Add SGLang docker: `lmsysorg/sglang:...`

| Image pattern | Registry |
| --- | --- |
| `nvcr.io/nvidia/...` | NGC |
| `vllm/vllm-openai:...`, `lmsysorg/sglang:...`, or no registry prefix | DockerHub |
| `ghcr.io/...` | GitHub Container Registry |
| `docker.io/...` | DockerHub (explicit) |
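
As a sketch, a shell helper (hypothetical, not part of the skill files) that applies this mapping:

```bash
# Hypothetical helper: infer the registry from an image URI per the table above.
registry_for() {
  case "$1" in
    nvcr.io/*)   echo "NGC" ;;
    ghcr.io/*)   echo "GitHub Container Registry" ;;
    docker.io/*) echo "DockerHub" ;;
    *)           echo "DockerHub" ;;  # vllm/..., lmsysorg/..., or bare names default to DockerHub
  esac
}

registry_for "nvcr.io/nvidia/tensorrt-llm/release:1.0"  # -> NGC
registry_for "vllm/vllm-openai:latest"                  # -> DockerHub
```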

Then check credentials based on the runtime:

#### enroot/pyxis

```bash
grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null
```

Look for `machine <registry>` lines:
- NGC → `machine nvcr.io`
- DockerHub → `machine auth.docker.io`
- GHCR → `machine ghcr.io`

#### Docker

```bash
[ -f ~/.docker/config.json ] && python3 -c "import json,sys; print('\n'.join(json.load(sys.stdin).get('auths', {}).keys()))" < ~/.docker/config.json
```

Look for registry keys (`https://index.docker.io/v1/`, `nvcr.io`, `ghcr.io`).
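
A combined sketch (hypothetical helper; assumes the default credential paths shown above):

```bash
# Hypothetical: succeed if credentials for a registry exist under either runtime.
has_creds() {
  local reg="$1"
  # enroot: look for a "machine <registry>" entry (auth.docker.io also matches docker.io)
  grep -qE "machine\s+\S*${reg}" ~/.config/enroot/.credentials 2>/dev/null && return 0
  # docker: look for the registry among the config.json "auths" keys
  python3 - "$reg" <<'PY' 2>/dev/null
import json, os, sys
cfg = os.path.expanduser("~/.docker/config.json")
auths = json.load(open(cfg)).get("auths", {})
sys.exit(0 if any(sys.argv[1] in k for k in auths) else 1)
PY
}

has_creds nvcr.io && echo "NGC auth present"
has_creds docker.io && echo "DockerHub auth present"
```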

### Step 3: If credentials are missing

**Do not submit the job.** Instead:

1. Tell the user which registry and runtime need authentication
2. Show the fix for their runtime:

**enroot/pyxis:**

```bash
mkdir -p ~/.config/enroot

# DockerHub (get token from https://hub.docker.com/settings/security)
cat >> ~/.config/enroot/.credentials << 'EOF'
machine auth.docker.io
login <dockerhub_username>
password <access_token>
EOF

# NGC (get API key from https://org.ngc.nvidia.com/setup/api-keys)
cat >> ~/.config/enroot/.credentials << 'EOF'
machine nvcr.io
login $oauthtoken
password <ngc_api_key>
EOF
```
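
Since `.credentials` stores tokens in plain text, tightening its permissions is a sensible precaution (general hardening advice, not an enroot requirement):

```bash
chmod 600 ~/.config/enroot/.credentials
```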

**Docker:**

```bash
# DockerHub (interactive prompt)
docker login

# NGC (use --password-stdin to avoid exposing secrets in process list)
echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin
```

3. **Suggest an alternative image** on an authenticated registry. NVIDIA clusters typically have NGC auth pre-configured, so prefer NGC-hosted images:

| Image | NGC alternative |
| --- | --- |
| `vllm/vllm-openai:latest` | `nvcr.io/nvidia/vllm:<YY.MM>-py3` (check [NGC catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm) for latest tag) |
| `nvcr.io/nvidia/tensorrt-llm/release:<tag>` | Already NGC |

> **Note:** NGC image tags follow `YY.MM-py3` format (e.g., `26.03-py3`). Not all DockerHub images have NGC equivalents. If no NGC alternative exists and DockerHub auth is missing, the user must add DockerHub credentials or pre-cache the image as a `.sqsh` file.
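
To pre-cache, import the image once from a machine that does have DockerHub auth, then point jobs at the resulting file (a sketch; the shared path is a placeholder):

```bash
# Import once to a shared filesystem...
enroot import --output /shared/images/vllm-openai.sqsh docker://vllm/vllm-openai:latest
# ...then reference the file instead of the registry in job scripts:
# srun --container-image=/shared/images/vllm-openai.sqsh <command>
```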

4. After the user fixes auth or switches images, verify the image is **actually pullable** before submitting (credentials alone don't guarantee the image exists):

```bash
# enroot — test pull (aborts after manifest fetch)
enroot import --output /dev/null docker://<registry>#<image> 2>&1 | head -10
# Success: shows "Fetching image manifest" + layer info
# Failure: shows "401 Unauthorized" or "404 Not Found"

# docker
docker manifest inspect <image> 2>&1 | head -5

# singularity
singularity pull --dry-run docker://<image> 2>&1 | head -5
```

> **Important**: Having credentials for a registry does NOT mean a specific image is accessible. The image may not exist, or the credentials may lack permissions for that repository. Always verify the specific image before submitting.

### Common failure modes

| Symptom | Runtime | Cause | Fix |
| --- | --- | --- | --- |
| `curl: (22) ... error: 401` | enroot | No credentials for registry | Add to `~/.config/enroot/.credentials` |
| `pyxis: failed to import docker image` | enroot | Auth failed or rate limit | Check credentials; DockerHub free: 100 pulls/6h per IP |
| `unauthorized: authentication required` | docker | No `docker login` | Run `docker login [registry]` |
| Image pulls on some nodes but not others | any | Cached on one node only | Pre-cache image or ensure auth on all nodes |
2 changes: 2 additions & 0 deletions .claude/skills/deployment/SKILL.md
@@ -174,6 +174,8 @@ All checks must pass before reporting success to the user.

If a cluster config exists (`~/.config/modelopt/clusters.yaml` or `.claude/clusters.yaml`), or the user mentions running on a remote machine:

0. **Check container registry auth** — before submitting any SLURM job with a container image, verify credentials exist on the cluster per `skills/common/slurm-setup.md` section 6. If credentials are missing for the image's registry, ask the user to fix auth or switch to an image on an authenticated registry (e.g., NGC). **Do not submit until auth is confirmed.**
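
   For example, on an enroot/pyxis cluster (`<cluster_host>` is a placeholder):

   ```bash
   # Confirm "machine <registry>" entries exist for the image's registry before sbatch:
   ssh <cluster_host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"
   ```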

1. **Source remote utilities:**

```bash
# …
```
36 changes: 34 additions & 2 deletions .claude/skills/evaluation/SKILL.md
@@ -28,6 +28,7 @@ Config Generation Progress:
- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Advanced - Multi-node (Data Parallel)
- [ ] Step 7: Advanced - Interceptors
- [ ] Step 7.5: Check container registry auth (SLURM only)
- [ ] Step 8: Run the evaluation
```

@@ -74,9 +74,9 @@ Prompt the user with "I'll ask you 5 questions to build the base config we'll ad
4. Safety & Security (like Garak and Safety Harness)
5. Multilingual (like MMATH, Global MMLU, MMLU-Prox)

DON'T ALLOW FOR ANY OTHER OPTIONS, only the ones listed above under each category (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.
Only accept options from the categories listed above (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.

> **Note:** These categories come from NEL's `build-config` CLI. If `nel skills build-config --help` shows different options than listed above, use the CLI's current options instead.
> **Note:** These categories come from NEL's `build-config` CLI. **Always run `nel skills build-config --help` first** to get the current options — they may differ from this list (e.g., `chat_reasoning` instead of separate `chat`/`reasoning`, `general_knowledge` instead of `standard`); when they differ, prefer the CLI's options.

When you have all the answers, run the script to build the base config:

@@ -181,6 +182,36 @@ If the user needs multi-node evaluation (model >120B, or more throughput), read

- The docs may show incorrect parameter names for logging. Use `max_logged_requests` and `max_logged_responses` (NOT `max_saved_*` or `max_*`).

**Step 7.5: Check container registry authentication (SLURM only)**

NEL's default deployment images by framework:

| Framework | Default image | Registry |
| --- | --- | --- |
| vLLM | `vllm/vllm-openai:latest` | DockerHub |
| SGLang | `lmsysorg/sglang:latest` | DockerHub |
| TRT-LLM | `nvcr.io/nvidia/tensorrt-llm/release:...` | NGC |
| Evaluation tasks | `nvcr.io/nvidia/eval-factory/*:26.03` | NGC |

Before submitting, verify the cluster has credentials for the deployment image. See `skills/common/slurm-setup.md` section 6 for the full procedure.

```bash
ssh <host> "grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null"
```

**Decision flow (check before submitting):**
1. Check if the cluster has credentials for the default DockerHub image (see command above)
2. If DockerHub credentials exist → use the default image and submit
3. If DockerHub credentials are missing but can be added → add them (see `slurm-setup.md` section 6), then submit
4. If DockerHub credentials cannot be added → override `deployment.image` to the NGC alternative and submit:

```yaml
deployment:
image: nvcr.io/nvidia/vllm:<YY.MM>-py3 # check https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm for latest tag
```

5. **Do not retry more than once** without fixing the auth issue

**Step 8: Run the evaluation**

Print the following commands to the user. Propose to execute them in order to confirm the config works as expected before the full run.
@@ -303,5 +334,6 @@ Config Generation Progress:
- [ ] Step 5: Confirm tasks (iterative)
- [ ] Step 6: Advanced - Multi-node (Data Parallel)
- [ ] Step 7: Advanced - Interceptors
- [ ] Step 7.5: Check container registry auth (SLURM only)
- [ ] Step 8: Run the evaluation
```
19 changes: 19 additions & 0 deletions .claude/skills/ptq/SKILL.md
@@ -24,6 +24,24 @@ Check the support table in `examples/llm_ptq/README.md` for verified HF models.
- **Listed** → supported, use `hf_ptq.py` (step 4A/4B)
- **Not listed** → read `references/unsupported-models.md` to determine if `hf_ptq.py` can still work or if a custom script is needed (step 4C)

## Step 2.5 — Check for model-specific dependencies

If the model uses `trust_remote_code` (check `config.json` for `auto_map`), inspect its custom Python files for imports not present in the container:

```bash
grep -h "^from \|^import " <model_path>/modeling_*.py | sort -u
```

**Known dependency patterns:**

| Import found | Packages to install |
| --- | --- |
| `from mamba_ssm` / `from causal_conv1d` | `mamba-ssm causal-conv1d` (Mamba/hybrid models: NemotronH, Jamba) |

If extra deps are needed:
- **Launcher (4B)**: set `EXTRA_PIP_DEPS` in the task's `environment` section — `ptq.sh` installs them automatically (see the sketch after this list)
- **Manual (4A)**: `unset PIP_CONSTRAINT && pip install <deps>` before running `hf_ptq.py`
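
For the launcher path, a config sketch (the `environment` shape follows the launcher guide; model and format are placeholders):

```yaml
pipeline:
  task_0:
    script: common/hf/ptq.sh
    environment:
      - HF_MODEL: <HuggingFace model ID>
      - QFORMAT: <format>
      - EXTRA_PIP_DEPS: mamba-ssm causal-conv1d  # space-separated; ptq.sh splits and pip-installs
```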

## Step 3 — Choose quantization format

**First**, check for a model-specific recipe:
@@ -128,6 +146,7 @@ Validate the exported checkpoint's quantization pattern matches the recipe. Quan

## Common Pitfalls

- **Model-specific dependencies**: Models with `trust_remote_code` may import packages not in the container (e.g., `mamba-ssm` for hybrid Mamba models). See Step 2.5. Use `EXTRA_PIP_DEPS` env var with the launcher, or install manually before running `hf_ptq.py`
- **Transformers version**: New models may need a newer version of transformers than what's installed. Check `config.json` for `transformers_version`. In containers, beware of `PIP_CONSTRAINT` blocking upgrades — see `references/slurm-setup-ptq.md` for workarounds
- **Gated datasets**: Some calibration datasets require HF authentication. Ensure `HF_TOKEN` is set in the job environment, or use `--dataset cnn_dailymail` as a non-gated alternative
- **NFS root_squash + Docker**: See `skills/common/slurm-setup.md` section 5
6 changes: 3 additions & 3 deletions .claude/skills/ptq/references/launcher-guide.md
@@ -12,13 +12,13 @@ uv run launch.py --yaml <config.yaml> hf_local=<cache> --yes # Local Docker

## HF Transformers PTQ Config

The launcher provides `common/hf_ptq/hf_ptq.sh` which wraps `hf_ptq.py`. Configure via environment variables:
The launcher provides `common/hf/ptq.sh` which wraps `hf_ptq.py`. Configure via environment variables:

```yaml
job_name: <Model>_<Format>
pipeline:
task_0:
script: common/hf_ptq/hf_ptq.sh
script: common/hf/ptq.sh
environment:
- HF_MODEL: <HuggingFace model ID, e.g. Qwen/Qwen3-0.6B>
- QFORMAT: <format, e.g. nvfp4, fp8, int4_awq>
@@ -75,7 +75,7 @@ The launcher SSHes to `SLURM_HOST` via `nemo_run.SSHTunnel`. If `identity` is om
## Known Issues

- **UID mapping in Docker**: May cause `getpwuid` failures. Add `USER=user` and `LOGNAME=user` to environment.
- **Megatron-LM submodule**: Only needed for `MegatronLMQuantizeTask` (Megatron models). HF PTQ via `common/hf_ptq/hf_ptq.sh` does not require it.
- **Megatron-LM submodule**: Only needed for `MegatronLMQuantizeTask` (Megatron models). HF PTQ via `common/hf/ptq.sh` does not require it.

## Dry Run

8 changes: 8 additions & 0 deletions tools/launcher/common/hf/ptq.sh
@@ -25,6 +25,14 @@

set -e

# Install extra pip dependencies if specified (e.g., mamba-ssm for hybrid Mamba models).
if [ -n "$EXTRA_PIP_DEPS" ]; then
  echo "Installing extra dependencies: $EXTRA_PIP_DEPS"
  unset PIP_CONSTRAINT
  # Split the space-separated list safely into an array before installing.
  read -r -a _deps <<< "$EXTRA_PIP_DEPS"
  pip install "${_deps[@]}"
fi

REPO=""
LOCAL_DIR=""
PTQ_ARGS=()