Add dep check for ptq and runtime check for evaluation/deployment #1240
Merged
3 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -192,3 +192,128 @@ chmod -R g+rwX /path/to/.hf_cache/ | |
| ``` | ||
|
|
||
| Scope `chmod` to only the directories the job needs — avoid world-writable paths on shared clusters. | ||
|
|
||
| --- | ||
|
|
||
| ## 6. Container Registry Authentication | ||
|
|
||
| **Before submitting any SLURM job that pulls a container image**, check that the cluster has credentials for the image's registry. Missing auth causes jobs to fail after waiting in the queue — a costly mistake. | ||
|
|
||
| ### Step 1: Detect the container runtime | ||
|
|
||
| Different clusters use different container runtimes. Detect which is available: | ||
|
|
||
| ```bash | ||
| # On the cluster (or via ssh): | ||
| which enroot 2>/dev/null && echo "RUNTIME=enroot" | ||
| which docker 2>/dev/null && echo "RUNTIME=docker" | ||
| ``` | ||
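The detection step can be wrapped in a small function so later checks can branch on its result. This is a minimal sketch assuming a POSIX shell on the cluster; the name `detect_runtime` is hypothetical and not part of this PR, and preferring enroot when both runtimes are installed is an assumption. It uses `command -v` instead of `which` because `command -v` is specified by POSIX.

```bash
# Hypothetical helper (not part of this PR): print the first container
# runtime found on PATH, checking enroot before docker.
detect_runtime() {
  for rt in enroot docker; do
    if command -v "$rt" >/dev/null 2>&1; then
      echo "$rt"
      return 0
    fi
  done
  # Nothing found: report it and return a non-zero status.
  echo "none"
  return 1
}
```

A caller might use it as `RUNTIME=$(detect_runtime) || echo "no supported container runtime found" >&2`.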
|
|
||
| | Runtime | Typical clusters | SLURM integration | | ||
| | --- | --- | --- | | ||
| | **enroot/pyxis** | NVIDIA internal (DGX Cloud, EOS, Selene, GCP-NRT) | `srun --container-image` | | ||
| | **Docker** | Bare-metal / on-prem with GPU | `docker run` inside job script | | ||
|
|
||
| ### Step 2: Check credentials for the image's registry | ||
|
|
||
| Determine the registry from the image URI: | ||
|
|
||
| | Image pattern | Registry | | ||
|
Contributor review comment: Add SGLang docker:
||
| | --- | --- | | ||
| | `nvcr.io/nvidia/...` | NGC | | ||
| | `vllm/vllm-openai:...`, `lmsysorg/sglang:...`, or no registry prefix | DockerHub | | ||
| | `ghcr.io/...` | GitHub Container Registry | | ||
| | `docker.io/...` | DockerHub (explicit) | | ||
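The mapping in the table above can be sketched as a shell function. It assumes the usual Docker image-reference convention: the first path component is treated as a registry host only if it contains a `.` or `:` or equals `localhost`; otherwise the image resolves to DockerHub. The function name `registry_of` is hypothetical, and it prints a hostname rather than the registry's display name.

```bash
# Hypothetical helper (not part of this PR): print the registry host
# implied by an image reference.
registry_of() {
  image="$1"
  case "$image" in
    */*) first="${image%%/*}" ;;          # text before the first slash
    *)   echo "docker.io"; return 0 ;;    # bare name such as "ubuntu:22.04"
  esac
  case "$first" in
    *.*|*:*|localhost) echo "$first" ;;   # looks like a host (dot, port, localhost)
    *) echo "docker.io" ;;                # e.g. "vllm/vllm-openai" is a DockerHub org/repo
  esac
}
```

For example, `registry_of lmsysorg/sglang:latest` prints `docker.io`, while `registry_of ghcr.io/org/image:tag` prints `ghcr.io`.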
|
|
||
| Then check credentials based on the runtime: | ||
|
|
||
| #### enroot/pyxis | ||
|
|
||
| ```bash | ||
| grep -E '^\s*machine\s+' ~/.config/enroot/.credentials 2>/dev/null | ||
| ``` | ||
|
|
||
| Look for `machine <registry>` lines: | ||
| - NGC → `machine nvcr.io` | ||
| - DockerHub → `machine auth.docker.io` | ||
| - GHCR → `machine ghcr.io` | ||
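The `grep` check above lists every `machine` entry; a scripted pre-flight check usually needs a yes/no answer for one host. The following is a minimal sketch under that assumption. The function name `has_enroot_cred` and its optional second argument (an alternate credentials path, useful for testing) are hypothetical.

```bash
# Hypothetical helper (not part of this PR): succeed only if the enroot
# credentials file has a "machine <host>" entry for the given host.
has_enroot_cred() {
  host="$1"
  file="${2:-$HOME/.config/enroot/.credentials}"
  [ -r "$file" ] || return 1
  # Match "machine <host>" as the first two fields of any line.
  awk -v h="$host" '$1 == "machine" && $2 == h { found = 1 } END { exit !found }' "$file"
}
```

Remember that DockerHub is listed as `auth.docker.io` in this file, so the DockerHub check is `has_enroot_cred auth.docker.io`.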
|
|
||
| #### Docker | ||
|
|
||
| ```bash | ||
| cat ~/.docker/config.json 2>/dev/null | python3 -c "import json,sys; print('\n'.join(json.load(sys.stdin).get('auths', {}).keys()))" | ||
| ``` | ||
|
|
||
| Look for registry keys (`https://index.docker.io/v1/`, `nvcr.io`, `ghcr.io`). | ||
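The same check can be made non-interactive for Docker. This sketch assumes `python3` is available (as the one-liner above already does) and that matching the host as a substring of an `auths` key is good enough, since DockerHub is stored under the full URL `https://index.docker.io/v1/`. The function name `docker_has_auth` and the optional config-path argument are hypothetical.

```bash
# Hypothetical helper (not part of this PR): succeed only if the Docker
# config has an "auths" key that mentions the given host.
docker_has_auth() {
  host="$1"
  config="${2:-$HOME/.docker/config.json}"
  [ -r "$config" ] || return 1
  python3 - "$host" "$config" <<'PY'
import json, sys

host, path = sys.argv[1], sys.argv[2]
with open(path) as f:
    auths = json.load(f).get("auths", {})
# Substring match so "index.docker.io" matches "https://index.docker.io/v1/".
sys.exit(0 if any(host in key for key in auths) else 1)
PY
}
```

For DockerHub, call it as `docker_has_auth index.docker.io`. An entry in `auths` only shows that a login was recorded; it does not prove the token is still valid.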
|
|
||
| ### Step 3: If credentials are missing | ||
|
|
||
| **Do not submit the job.** Instead: | ||
|
|
||
| 1. Tell the user which registry and runtime need authentication | ||
| 2. Show the fix for their runtime: | ||
|
|
||
| **enroot/pyxis:** | ||
|
|
||
| ```bash | ||
| mkdir -p ~/.config/enroot | ||
|
|
||
| # DockerHub (get token from https://hub.docker.com/settings/security) | ||
| cat >> ~/.config/enroot/.credentials << 'EOF' | ||
| machine auth.docker.io | ||
| login <dockerhub_username> | ||
| password <access_token> | ||
| EOF | ||
|
|
||
| # NGC (get API key from https://org.ngc.nvidia.com/setup/api-keys) | ||
| cat >> ~/.config/enroot/.credentials << 'EOF' | ||
| machine nvcr.io | ||
| login $oauthtoken | ||
| password <ngc_api_key> | ||
| EOF | ||
| ``` | ||
|
|
||
| **Docker:** | ||
|
|
||
| ```bash | ||
| # DockerHub (interactive prompt) | ||
| docker login | ||
|
|
||
| # NGC (use --password-stdin to avoid exposing secrets in process list) | ||
| echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin | ||
| ``` | ||
|
coderabbitai[bot] marked this conversation as resolved.
|
||
|
|
||
| 3. **Suggest an alternative image** on an authenticated registry. NVIDIA clusters typically have NGC auth pre-configured, so prefer NGC-hosted images: | ||
|
|
||
| | DockerHub image | NGC alternative | | ||
| | --- | --- | | ||
| | `vllm/vllm-openai:latest` | `nvcr.io/nvidia/vllm:<YY.MM>-py3` (check [NGC catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm) for latest tag) | | ||
| | `nvcr.io/nvidia/tensorrt-llm/release:<tag>` | Already NGC | | ||
|
|
||
| > **Note:** NGC image tags follow `YY.MM-py3` format (e.g., `26.03-py3`). Not all DockerHub images have NGC equivalents. If no NGC alternative exists and DockerHub auth is missing, the user must add DockerHub credentials or pre-cache the image as a `.sqsh` file. | ||
|
|
||
| 4. After the user fixes auth or switches images, verify the image is **actually pullable** before submitting (credentials alone don't guarantee the image exists): | ||
|
|
||
| ```bash | ||
| # enroot — test pull (aborts after manifest fetch) | ||
| enroot import --output /dev/null docker://<registry>#<image> 2>&1 | head -10 | ||
| # Success: shows "Fetching image manifest" + layer info | ||
| # Failure: shows "401 Unauthorized" or "404 Not Found" | ||
|
|
||
| # docker | ||
| docker manifest inspect <image> 2>&1 | head -5 | ||
|
|
||
| # singularity | ||
| singularity pull --dry-run docker://<image> 2>&1 | head -5 | ||
| ``` | ||
|
|
||
| > **Important**: Credentials existing for a registry does NOT mean a specific image is accessible. The image may not exist, or the credentials may lack permissions for that repository. Always verify the specific image before submitting. | ||
|
|
||
| ### Common failure modes | ||
|
|
||
| | Symptom | Runtime | Cause | Fix | | ||
| | --- | --- | --- | --- | | ||
| | `curl: (22) ... error: 401` | enroot | No credentials for registry | Add to `~/.config/enroot/.credentials` | | ||
| | `pyxis: failed to import docker image` | enroot | Auth failed or rate limit | Check credentials; DockerHub free: 100 pulls/6h per IP | | ||
| | `unauthorized: authentication required` | docker | No `docker login` | Run `docker login [registry]` | | ||
| | Image pulls on some nodes but not others | any | Cached on one node only | Pre-cache image or ensure auth on all nodes | | ||
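The symptom table can be approximated by a small classifier over the error text, which is handy when scanning job logs. This is a rough sketch: the function name `diagnose_pull_error` and the category labels are hypothetical, and the `429` / `toomanyrequests` patterns for rate limiting are an assumption based on DockerHub's usual error text rather than something stated in this PR.

```bash
# Hypothetical helper (not part of this PR): map an error message to a
# probable cause. Patterns are checked in order; the first match wins.
diagnose_pull_error() {
  case "$1" in
    *401*|*[Uu]nauthorized*)           echo "auth" ;;
    *404*|*"Not Found"*)               echo "missing-image" ;;
    *429*|*toomanyrequests*)           echo "rate-limit" ;;
    *"failed to import docker image"*) echo "auth-or-rate-limit" ;;
    *)                                 echo "unknown" ;;
  esac
}
```

Substring matching can misfire (for instance, on a digest that happens to contain `401`), so treat the result as a hint for which row of the table to read first.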
This also applies to local container setup. As I suggested in @Edwardf0t1's PR, maybe we can put this in `env-setup.md` for setting HF_TOKEN, the docker login token, and the NGC token.
This is a structural reorganization, and I'm fine with either directory.
Check the issue again. We already have one `environment-setup.md`, so adding another `env-setup.md` seems a bit confusing. Besides, the auth section is specifically about container registry credentials for SLURM job submission: detecting runtimes, checking credentials per runtime, and fixing auth before `sbatch`. That fits SLURM setup, not general environment setup, so I prefer to keep it in `slurm-setup.md`. cc @Edwardf0t1