Skip to content

[1/N] Polish evaluation skills and common skills based on an E2E workflow testing#1239

Open
Edwardf0t1 wants to merge 8 commits intomainfrom
zhiyu/polish-eval-skills
Open

[1/N] Polish evaluation skills and common skills based on an E2E workflow testing#1239
Edwardf0t1 wants to merge 8 commits intomainfrom
zhiyu/polish-eval-skills

Conversation

@Edwardf0t1
Copy link
Copy Markdown
Contributor

@Edwardf0t1 Edwardf0t1 commented Apr 12, 2026

What does this PR do?

Type of change: Documentation

Polish evaluation and common skills based on end-to-end experience quantizing + deploying + evaluating Devstral-Small-2-24B-Instruct (NVFP4 MLP-only) on dlcluster B100 GPUs.

Depends on PR #1236 (creates deployment/references/unsupported-models.md referenced here). Merge #1236 first.

Changes

New files:

  • common/end-to-end-workflow.md — Cross-skill pipeline doc covering PTQ → Deploy → Eval workflow, workspace continuity, unsupported model handling across stages, NEL deployment.command pattern for runtime patches, and NEL CI vs SLURM executor decision table

  • evaluation/references/nel-ci-guide.md — Comprehensive NEL CI evaluation guide covering:

    • NEL SLURM executor vs NEL CI (JET) execution paths
    • Cluster reference table (oci-hsg, cw, oci-nrt, dlcluster) with GPUs, storage, walltime
    • Checkpoint storage availability per cluster type (not all use /lustre/)
    • NEL CI GitLab trigger pattern with NEL_DEPLOYMENT_COMMAND
    • Env var prefixes (host:, lit:) for SLURM executor vs JET secrets
    • Gated dataset handling (JET manages secrets centrally; SLURM needs user's own HF_TOKEN + dataset access)
    • SGLang --host 0.0.0.0 requirement
    • Common issues table (NGC auth, permissions, walltime, TP mismatch, etc.)
    • JET cluster directory setup with chmod 777 for svc-jet

Updated files:

  • common/remote-execution.md — Added "Checkpoint and storage availability" section (compute nodes may not share login node filesystems; dlcluster uses /home/omniml_data_*, not /lustre/)
  • common/slurm-setup.md — Added "Container registry credentials (pyxis)" section for NGC enroot auth setup
  • common/workspace-management.md — Added "Cross-Skill Workspace Flow" showing directory layout across PTQ/deploy/eval stages
  • evaluation/SKILL.md — Added "NEL CI and Cluster-Specific Notes" pointer section; renamed workspace section to include pipeline integration note
  • ptq/SKILL.md — Added "Next steps" after Step 5 pointing to deployment/evaluation skills + end-to-end workflow

Motivation

Learnings from:

  1. Quantizing Devstral-Small-2-24B (FP8 VLM → NVFP4 MLP-only) on dlcluster
  2. Deploying on vLLM with ministral3 architecture patch
  3. Evaluating with NEL SLURM executor (MMLU, GSM8K, GPQA) — including NGC auth, gated dataset access, env var prefixes
  4. Previous NEL CI experience on oci-hsg cluster (from internal doc)

Testing

Validated end-to-end: PTQ (6 min) → vLLM deployment (3 debug iterations) → NEL evaluation (MMLU 77.4%, GSM8K 80%, GPQA 40% on limit_samples=10).

Before your PR is "Ready for review"

  • Is this change backward compatible?: N/A (documentation only)
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: N/A (skill documentation)
  • Did you update Changelog?: N/A

Summary by CodeRabbit

  • Documentation
    • Added an end-to-end PTQ → Deploy → Eval workflow with shared workspace layout and artifact locations
    • Added checkpoint/storage availability guidance with cluster-specific paths and staging commands
    • Documented container registry authentication for SLURM and an observed unauthorized failure mode
    • Added runtime deployment-fix guidance (overrideable command/wrapper pattern) and SLURM vs CI decision table
    • Expanded CI/cluster notes, troubleshooting, gated dataset handling, and PTQ next-step pointers for deployment/evaluation

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Apr 12, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 965a0781-79e4-454d-b5cc-1d278c6edab6

📥 Commits

Reviewing files that changed from the base of the PR and between 7dcede4 and 8176fc7.

📒 Files selected for processing (1)
  • .claude/skills/evaluation/references/nel-ci-guide.md
✅ Files skipped from review due to trivial changes (1)
  • .claude/skills/evaluation/references/nel-ci-guide.md

📝 Walkthrough

Walkthrough

Adds documentation tying PTQ → Deployment → Evaluation into a single pipeline: workspace conventions and artifact locations, cluster/CI execution guidance (SLURM vs NEL CI), registry/auth instructions, workspace persistence across stages, runtime deployment override patterns, and operational troubleshooting for JET and SLURM.

Changes

Cohort / File(s) Summary
End-to-end workflow
​.claude/skills/common/end-to-end-workflow.md
New doc defining the PTQ → Deployment → Evaluation pipeline, shared workspace layout (workspaces/<model>/), expected PTQ/deploy/eval artifact locations, unsupported-model failure modes, and deployment override mechanisms (deployment.command, NEL_DEPLOYMENT_COMMAND).
Cluster & remote execution
​.claude/skills/common/remote-execution.md, ​.claude/skills/common/slurm-setup.md, ​.claude/skills/evaluation/references/nel-ci-guide.md
Added checkpoint accessibility checks and cluster storage path mappings with copy examples; Pyxis/Enroot registry auth steps and noted srun 401 failure mode; comprehensive NEL CI guide covering SLURM vs GitLab flows, GPU/tensor-parallel sizing, env-var/secret formats, gated dataset behavior, checkpoint transfer commands, and JET-specific operational snippets.
Workspace & pipeline integration
​.claude/skills/common/workspace-management.md, ​.claude/skills/ptq/SKILL.md
Documented cross-skill workspace flow persisting PTQ outputs and adding eval_results/ + eval_config.yaml; updated example flow to reuse workspaces; PTQ SKILL now points users to deployment/evaluation next steps and notes carrying PTQ patches forward.
Evaluation guidance updates
​.claude/skills/evaluation/SKILL.md
Inserted guidance to carry deployment-time runtime patches into NEL config via deployment.command; added “NEL CI and Cluster-Specific Notes” checklist referencing nel-ci-guide.md and cluster/CI operational considerations.

Sequence Diagram(s)

sequenceDiagram
    participant User as User / Agent
    participant PTQ as PTQ Job
    participant WS as Workspace\n(workspaces/<model>)
    participant Deploy as Deployment Job\n(NEL/SLURM/CI)
    participant Serve as Model Server
    participant Eval as Evaluation Job
    participant Storage as Cluster Storage

    User->>PTQ: submit PTQ (produce quantized checkpoint -> WS/output/)
    PTQ->>WS: write checkpoint & artifacts
    User->>Deploy: trigger deployment (consumes WS/output/, may set deployment.command / NEL_DEPLOYMENT_COMMAND)
    Deploy->>Serve: stage model and start serving
    Serve->>Eval: run evaluation tasks (reads from Serve, writes WS/eval_results/)
    Eval->>WS: write eval artifacts and eval_config.yaml
    Eval->>Storage: persist final results
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly describes the main change: polishing evaluation and common skills based on end-to-end workflow testing, which aligns with the primary objective of documenting learnings from PTQ → Deploy → Eval pipeline validation.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Security Anti-Patterns ✅ Passed PR contains only documentation changes to markdown files in .claude/skills/ directory; no Python code or configuration files modified.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch zhiyu/polish-eval-skills

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 12, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1239/

Built to branch gh-pages at 2026-04-13 05:37 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.91%. Comparing base (0357cb9) to head (8176fc7).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1239   +/-   ##
=======================================
  Coverage   76.91%   76.91%           
=======================================
  Files         350      350           
  Lines       40480    40480           
=======================================
  Hits        31137    31137           
  Misses       9343     9343           
Flag Coverage Δ
unit 55.51% <ø> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

- Add common/end-to-end-workflow.md documenting the PTQ → Deploy → Eval
  pipeline, workspace continuity, unsupported model handling, NEL
  deployment.command pattern, and NEL CI vs SLURM executor decision table
- Add cross-skill workspace flow to workspace-management.md
- Add "Next steps" to ptq/SKILL.md pointing to deployment/evaluation
- Add pipeline integration note to evaluation/SKILL.md

Depends on PR #1236 (deployment/references/unsupported-models.md).

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@Edwardf0t1 Edwardf0t1 marked this pull request as ready for review April 12, 2026 20:58
@Edwardf0t1 Edwardf0t1 self-assigned this Apr 12, 2026
@Edwardf0t1 Edwardf0t1 changed the title [1/N] Polish evaluation skills [1/N] Polish evaluation skills and common skills based on an E2E workflow testing Apr 12, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds end-to-end documentation that connects the PTQ, deployment, and evaluation “skills”, with cluster-specific guidance for running NeMo Evaluator Launcher (NEL) on SLURM vs JET/NEL CI.

Changes:

  • Added a new PTQ → Deploy → Eval workflow doc and cross-skill workspace flow guidance.
  • Added a comprehensive NEL CI / cluster guide (JET vs SLURM executor, storage paths, env var formats, common issues).
  • Updated existing skill docs to point users to the new workflow and evaluation references.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
.claude/skills/ptq/SKILL.md Adds a “Next steps” pointer into the end-to-end pipeline doc.
.claude/skills/evaluation/SKILL.md Adds pipeline integration notes and a pointer to the new NEL CI guide.
.claude/skills/evaluation/references/nel-ci-guide.md New detailed guide for NEL CI/JET vs SLURM executor usage, clusters, storage, and troubleshooting.
.claude/skills/common/workspace-management.md Documents how a single workspace is reused across PTQ/deploy/eval stages.
.claude/skills/common/slurm-setup.md Adds enroot/pyxis private registry credential setup for nvcr.io.
.claude/skills/common/remote-execution.md Adds compute-node storage availability guidance to prevent “checkpoint not found” issues.
.claude/skills/common/end-to-end-workflow.md New end-to-end workflow doc tying the three skills together.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

| Stage | What can go wrong | Reference |
|-------|-------------------|-----------|
| **PTQ** | Unknown architecture, FP8 source checkpoint, VLM structure | `ptq/references/unsupported-models.md` |
| **Deployment** | Missing architecture mapping, weight key mismatches, quant/unquant layer confusion | `deployment/references/unsupported-models.md` |
Comment on lines +49 to +54
When the serving framework needs runtime patches (e.g., transformers upgrade, model handler fix), override `deployment.command` in the NEL config to inject fixes before serving:

```yaml
deployment:
command: >-
pip install "transformers>=5.0.0.dev0" --pre -q &&
```bash
export GITLAB_TOKEN=<your_gitlab_token>

curl -k --request POST \
Comment on lines +56 to +64
If the checkpoint is on a workstation, **copy it to cluster storage first**:

```bash
rsync -av /path/to/local/checkpoint \
<cluster-login>:/lustre/fsw/portfolios/coreai/users/$USER/checkpoints/
```

For dlcluster, the checkpoint paths are directly accessible since the NFS mounts are shared between login and compute nodes.

Comment on lines +185 to +193
mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface
mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/cache/vllm
mkdir -p /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs
chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface
chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/cache/vllm
chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs
```

`chmod 777` is required because `svc-jet` (JET service account) runs containers and needs write access.
Comment on lines +60 to +62
# To add NGC credentials:
mkdir -p ~/.config/enroot
echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials

- Binds to `0.0.0.0` by default — health checks work out of the box
- For NVFP4: `--quantization modelopt_fp4`
- For unsupported models (e.g., ministral3): may need custom `deployment.command` to patch the framework before serving (see `deployment/references/unsupported-models.md`)
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
.claude/skills/common/slurm-setup.md (1)

62-62: Clarify that $oauthtoken is a literal string.

Users might misinterpret $oauthtoken as a shell variable that needs to be set. Consider adding a brief note that $oauthtoken is NGC's required literal login string (not a variable).

📝 Suggested clarification
 # To add NGC credentials:
 mkdir -p ~/.config/enroot
-echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
+# Note: '$oauthtoken' is a literal string required by NGC (not a shell variable)
+echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>' > ~/.config/enroot/.credentials
 chmod 600 ~/.config/enroot/.credentials
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/common/slurm-setup.md at line 62, The line that writes
credentials uses the token string "$oauthtoken" which readers may think is a
shell variable; update the documentation line that currently shows echo 'machine
nvcr.io login $oauthtoken password <NGC_API_KEY>' >
~/.config/enroot/.credentials to explicitly state that $oauthtoken is the
literal NGC login string (not a shell variable) and add a short parenthetical
like "(use the literal string $oauthtoken as NGC requires, do not replace with a
shell variable)" or escape it in the example so readers understand it's not to
be substituted; reference the exact example command text "machine nvcr.io login
$oauthtoken password <NGC_API_KEY>" when making the clarification.
.claude/skills/evaluation/references/nel-ci-guide.md (1)

178-193: Consider noting the security implications of chmod 777.

While chmod 777 is necessary for the JET service account to access these directories (as explained), it makes them world-writable. Consider adding a brief note about the security trade-off, or suggesting users scope these permissions to only the directories actually needed for evaluation runs.

📝 Suggested addition
 chmod 777 /lustre/fsw/portfolios/coreai/users/$USER/nv-eval-rundirs
+
+# Note: chmod 777 is required for svc-jet service account access, but makes
+# directories world-writable. Limit to only the directories needed for your runs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/references/nel-ci-guide.md around lines 178 - 193,
Add a short note after the chmod 777 commands explaining the security trade-off
of making those paths world-writable and suggest safer alternatives: prefer
scoping permissions to the service account or group (e.g., chown svc-jet or set
group ownership and use chmod 770/770-like perms), or use filesystem ACLs to
grant only the svc-jet account write access to the listed directories
(/lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface, /.../cache/vllm,
/.../nv-eval-rundirs) instead of blanket chmod 777.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/common/end-to-end-workflow.md:
- Around line 47-59: The doc currently claims the deployment.command override
works "with both NEL SLURM executor and NEL CI (via NEL_DEPLOYMENT_COMMAND)";
remove or amend this to avoid asserting unsupported CI support: either delete
the "NEL CI (via NEL_DEPLOYMENT_COMMAND)" clause or add a clear note that using
NEL_DEPLOYMENT_COMMAND is a custom/unsupported extension (not part of official
NEL executors) and list only the officially supported executors (e.g., NEL
SLURM, local, Lepton AI); update the text around deployment.command and
NEL_DEPLOYMENT_COMMAND to reflect this change.

---

Nitpick comments:
In @.claude/skills/common/slurm-setup.md:
- Line 62: The line that writes credentials uses the token string "$oauthtoken"
which readers may think is a shell variable; update the documentation line that
currently shows echo 'machine nvcr.io login $oauthtoken password <NGC_API_KEY>'
> ~/.config/enroot/.credentials to explicitly state that $oauthtoken is the
literal NGC login string (not a shell variable) and add a short parenthetical
like "(use the literal string $oauthtoken as NGC requires, do not replace with a
shell variable)" or escape it in the example so readers understand it's not to
be substituted; reference the exact example command text "machine nvcr.io login
$oauthtoken password <NGC_API_KEY>" when making the clarification.

In @.claude/skills/evaluation/references/nel-ci-guide.md:
- Around line 178-193: Add a short note after the chmod 777 commands explaining
the security trade-off of making those paths world-writable and suggest safer
alternatives: prefer scoping permissions to the service account or group (e.g.,
chown svc-jet or set group ownership and use chmod 770/770-like perms), or use
filesystem ACLs to grant only the svc-jet account write access to the listed
directories (/lustre/fsw/portfolios/coreai/users/$USER/cache/huggingface,
/.../cache/vllm, /.../nv-eval-rundirs) instead of blanket chmod 777.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3c8b9ea9-5ab6-43f7-96a8-fb58f0a1804e

📥 Commits

Reviewing files that changed from the base of the PR and between 0357cb9 and 2cb3b39.

📒 Files selected for processing (7)
  • .claude/skills/common/end-to-end-workflow.md
  • .claude/skills/common/remote-execution.md
  • .claude/skills/common/slurm-setup.md
  • .claude/skills/common/workspace-management.md
  • .claude/skills/evaluation/SKILL.md
  • .claude/skills/evaluation/references/nel-ci-guide.md
  • .claude/skills/ptq/SKILL.md

Comment on lines +47 to +59
## NEL Evaluation with Custom Deployments

When the serving framework needs runtime patches (e.g., transformers upgrade, model handler fix), override `deployment.command` in the NEL config to inject fixes before serving:

```yaml
deployment:
command: >-
pip install "transformers>=5.0.0.dev0" --pre -q &&
sed -i 's/old_pattern/new_pattern/' /path/to/framework/file.py &&
${deployment.base_command}
```

This works with both NEL SLURM executor and NEL CI (via `NEL_DEPLOYMENT_COMMAND`).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Does NeMo Evaluator Launcher (NEL) support overriding deployment.command in both SLURM executor and GitLab CI modes?

💡 Result:

Yes, NeMo Evaluator Launcher (NEL) supports overriding deployment.command in SLURM executor mode. No evidence found for GitLab CI mode support; official docs only document local, Slurm, and Lepton AI executors.

Citations:


Remove or verify unsupported CI mode claim for deployment.command override.

Line 59 claims the deployment.command override pattern works with "both NEL SLURM executor and NEL CI (via NEL_DEPLOYMENT_COMMAND)". However, official NEL documentation only documents support for SLURM executor, local, and Lepton AI executors—not GitLab CI. Either remove the CI claim or add clear documentation that this is an extension beyond officially supported NEL features.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/common/end-to-end-workflow.md around lines 47 - 59, The doc
currently claims the deployment.command override works "with both NEL SLURM
executor and NEL CI (via NEL_DEPLOYMENT_COMMAND)"; remove or amend this to avoid
asserting unsupported CI support: either delete the "NEL CI (via
NEL_DEPLOYMENT_COMMAND)" clause or add a clear note that using
NEL_DEPLOYMENT_COMMAND is a custom/unsupported extension (not part of official
NEL executors) and list only the officially supported executors (e.g., NEL
SLURM, local, Lepton AI); update the text around deployment.command and
NEL_DEPLOYMENT_COMMAND to reflect this change.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
…ra escaping

- Add wrapper script pattern for complex deployment commands that break
  Hydra's override parser (put serve.sh in checkpoint dir, reference as
  bash /checkpoint/serve.sh)
- Add NEL_CONFIG_BASE64 + Python trigger pattern for custom configs
- Add cross-cluster checkpoint copy via tar pipe
- Add Hydra LexerNoViableAltException and Bad Request to common issues

Learned from triggering full AA evaluation (MMLU-PRO, GPQA Diamond,
LiveCodeBench, SCICODE, AIME 2025, Terminal-Bench Hard) for
Devstral-Small-2-24B NVFP4 on oci-hsg via NEL CI.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
.claude/skills/evaluation/references/nel-ci-guide.md (1)

70-70: Prefer least-privilege wording for chmod 777 guidance

Given this is shared infra guidance, consider documenting 777 as a fallback-only option and prefer group ACL/owner-group write where possible to reduce accidental overexposure.

Also applies to: 128-128, 270-275

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/references/nel-ci-guide.md at line 70, Replace the
blanket "chmod -R 777 /lustre/.../checkpoints/model-name" guidance with
least-privilege instructions: set the directory owner/group to svc-jet (chown
svc-jet:svc-jet) and grant group write/execute (chmod g+rwX) or use POSIX ACLs
(setfacl) to grant only svc-jet the required access, and note that 777 should be
documented as a last-resort fallback; apply the same change to the other
occurrences referenced (lines 128 and 270-275) and ensure examples mention using
setfacl or group-owner changes rather than recommending 777.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/skills/evaluation/references/nel-ci-guide.md:
- Line 87: Remove the insecure TLS bypass from the curl examples by deleting the
"-k" flag in the command that currently appears as "curl -k --request POST" (and
the other occurrence) and either leave the example using standard TLS
verification or replace it with an explicit documented option to set a CA bundle
(e.g., mention "--cacert /path/to/corporate-ca.pem" or a pinned CA path) so the
examples do not send PRIVATE-TOKEN over disabled verification.
- Around line 133-135: The fenced code block containing the JSON snippet with
the key "NEL_DEPLOYMENT_COMMAND" should include a language tag to satisfy
markdownlint MD040; update the block fence to use ```json so the snippet
({"key": "NEL_DEPLOYMENT_COMMAND", "value": "bash /checkpoint/serve.sh"}) is
marked as JSON, ensuring the markdown linter recognizes the language.

---

Nitpick comments:
In @.claude/skills/evaluation/references/nel-ci-guide.md:
- Line 70: Replace the blanket "chmod -R 777 /lustre/.../checkpoints/model-name"
guidance with least-privilege instructions: set the directory owner/group to
svc-jet (chown svc-jet:svc-jet) and grant group write/execute (chmod g+rwX) or
use POSIX ACLs (setfacl) to grant only svc-jet the required access, and note
that 777 should be documented as a last-resort fallback; apply the same change
to the other occurrences referenced (lines 128 and 270-275) and ensure examples
mention using setfacl or group-owner changes rather than recommending 777.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2e6188dc-54d7-48bd-8e55-d7d6b0fc9ed8

📥 Commits

Reviewing files that changed from the base of the PR and between 1b94fc9 and b1be817.

📒 Files selected for processing (1)
  • .claude/skills/evaluation/references/nel-ci-guide.md

```bash
export GITLAB_TOKEN=<your_gitlab_token>

curl -k --request POST \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid curl -k in token-bearing API examples

Both trigger examples disable TLS verification while transmitting PRIVATE-TOKEN, which normalizes an insecure pattern. Please remove -k (or document a pinned CA/corporate CA path explicitly) in the primary examples.

Proposed doc fix
-curl -k --request POST \
+curl --request POST \
-    ["curl", "-k", "--request", "POST",
+    ["curl", "--request", "POST",

Also applies to: 166-166

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/evaluation/references/nel-ci-guide.md at line 87, Remove the
insecure TLS bypass from the curl examples by deleting the "-k" flag in the
command that currently appears as "curl -k --request POST" (and the other
occurrence) and either leave the example using standard TLS verification or
replace it with an explicit documented option to set a CA bundle (e.g., mention
"--cacert /path/to/corporate-ca.pem" or a pinned CA path) so the examples do not
send PRIVATE-TOKEN over disabled verification.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
When using NEL_DEPLOYMENT_COMMAND with a custom --served-model-name,
deployment.served_model_name must also be overridden via
NEL_OTHER_OVERRIDES — NEL uses the config field (not the actual serve
command) to set the eval client's model_id. Without this, the client
sends the checkpoint path as model_id, causing 404 errors.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants