# End-to-End Workflow: PTQ → Deploy → Eval

This document ties together the three domain skills (PTQ, Deployment, Evaluation) for the common workflow of quantizing a model, deploying it, and evaluating accuracy.

## Pipeline Overview

```text
PTQ (quantize)        →    Deployment (serve)        →    Evaluation (benchmark)
──────────────             ──────────────────             ──────────────────────
hf_ptq.py                  vLLM / SGLang / TRT-LLM        NEL (SLURM or JET)
    ↓                          ↓                              ↓
NVFP4/FP8 checkpoint       OpenAI-compatible API          MMLU, GSM8K, GPQA scores
(safetensors)              (http://host:8000)             (results.yml)
```
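
As a minimal sketch of the three stages chained together (the model name, workspace path, and most flags here are illustrative, not the canonical interface; check each tool's `--help` and the per-stage skill docs):

```bash
# Stage 1: quantize. hf_ptq.py is the ModelOpt PTQ entry point; flag names
# vary across ModelOpt versions, so treat these as placeholders.
python hf_ptq.py \
    --pyt_ckpt_path meta-llama/Llama-3.1-8B-Instruct \
    --qformat nvfp4 \
    --export_path workspaces/llama-3.1-8b-nvfp4/output

# Stage 2: serve. Any OpenAI-compatible server works; vLLM shown here.
vllm serve workspaces/llama-3.1-8b-nvfp4/output --port 8000

# Stage 3: evaluate. The NEL invocation is intentionally left as a stub;
# see evaluation/references/nel-ci-guide.md for the real launcher usage.
# nel run --config workspaces/llama-3.1-8b-nvfp4/eval_config.yaml  # hypothetical
```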

## Workspace Continuity

All three stages share the same workspace directory. The PTQ output becomes the deployment input, and eval results land alongside:

```text
workspaces/model-name-format/
  output/           ← PTQ checkpoint (safetensors + config.json)
  eval_results/     ← NEL evaluation artifacts (results.yml per task)
  eval_config.yaml  ← NEL config for evaluation
  scripts/          ← Custom run scripts (if needed)
  logs/             ← SLURM job logs
```

When starting a deployment or evaluation step, always check for an existing workspace from a prior PTQ run:

```bash
ls workspaces/
```
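
A quick follow-up sanity check that the PTQ stage actually left a usable checkpoint behind before you serve it (the workspace name is a placeholder):

```bash
# Pick up an existing workspace and confirm the PTQ output is complete:
# the tree above says output/ should hold safetensors plus config.json.
WS=workspaces/model-name-format
ls "$WS"/output/*.safetensors "$WS"/output/config.json
```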

## Unsupported Models

Models not in the verified support matrices require extra work at each stage:

| Stage | What can go wrong | Reference |
|-------|-------------------|-----------|
| **PTQ** | Unknown architecture, FP8 source checkpoint, VLM structure | `ptq/references/unsupported-models.md` |
| **Deployment** | Missing architecture mapping, weight key mismatches, quant/unquant layer confusion | `deployment/references/unsupported-models.md` |
| **Evaluation** | Framework patches needed in deployment container, gated datasets, cluster storage | `evaluation/references/nel-ci-guide.md` |

Each stage has its own debug loop (run → read error → diagnose → patch → re-run). Fixes from one stage often inform the next: if PTQ required a transformers upgrade, deployment and evaluation likely will too.
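
For instance, a transformers pin discovered during PTQ usually has to be replayed inside the serving and eval containers; a sketch, reusing the pin from the NEL override example below:

```bash
# Replay the PTQ-stage dependency fix in the deployment/eval containers.
# The exact pin comes from whatever the PTQ debug loop settled on.
pip install "transformers>=5.0.0.dev0" --pre -q
```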

## NEL Evaluation with Custom Deployments

When the serving framework needs runtime patches (e.g., transformers upgrade, model handler fix), override `deployment.command` in the NEL config to inject fixes before serving:

```yaml
deployment:
  command: >-
    pip install "transformers>=5.0.0.dev0" --pre -q &&
    sed -i 's/old_pattern/new_pattern/' /path/to/framework/file.py &&
    ${deployment.base_command}
```

This works with both the NEL SLURM executor and NEL CI (via `NEL_DEPLOYMENT_COMMAND`).
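
For the CI path, one hedged way this could look (how the variable is consumed is CI-specific; see the NEL CI guide):

```bash
# NEL CI: the same patch-then-serve command, supplied as a pipeline variable.
# Single quotes keep ${deployment.base_command} literal for NEL to expand later.
export NEL_DEPLOYMENT_COMMAND='pip install "transformers>=5.0.0.dev0" --pre -q && ${deployment.base_command}'
```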

## Decision: NEL SLURM Executor vs NEL CI (JET)

| Factor | NEL SLURM executor | NEL CI (JET) |
|--------|--------------------|--------------|
| **When to use** | Iterative debugging, checkpoint on a non-JET cluster, custom patches needed | Production evals, MLflow tracking, reproducible configs |
| **Checkpoint location** | Any cluster you have SSH access to | Must be on JET cluster `/lustre/` storage |
| **Secrets (HF_TOKEN, NGC)** | Provide your own via `host:` env vars | Managed centrally via JET secrets |
| **Container patches** | Override `deployment.command` | Use `NEL_DEPLOYMENT_COMMAND` |
| **MLflow export** | Manual setup | Automatic |
| **Gated datasets** | Your HF account needs access | Handled by `COMPEVAL_HF_TOKEN` |
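
If you take the SLURM-executor path, the secrets and gated-dataset rows are on you. A sketch of supplying your own credentials before launch (`HF_TOKEN` comes from the table above; the NGC variable name is an assumption, and NEL's `host:` section is where these ultimately get wired in):

```bash
# Export credentials before launching the NEL SLURM executor. HF_TOKEN must
# belong to an account that has accepted the gated datasets' terms.
export HF_TOKEN=hf_...          # Hugging Face token for gated datasets/models
export NGC_API_KEY=nvapi-...    # assumed variable name for the NGC credential
```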