
test: add checkpoint robustness functional tests #1606

Merged
thomasdhc merged 43 commits into main from adil-a/checkpoint-robustness-test
Apr 6, 2026

Conversation

adil-a (Collaborator) commented Mar 25, 2026

Summary

Comprehensive checkpoint robustness testing for all supported models. Tests the full lifecycle: load → SFT/PEFT (few steps) → save → reload → verify correctness.

Tracks #1586.

Note: vLLM deployment tests were moved to a separate PR (#1656).

Test Infrastructure

test_checkpoint_robustness_llm.py — 6-phase test harness:

  • Phase 1: Train for 5 steps, checkpoint
  • Phase 2: Capture reference logits
  • Phase 3: Reload from consolidated checkpoint, assert KL = 0 (exact match)
  • Phase 4: Load into vanilla HF AutoModelForCausalLM, assert KL < threshold
  • Phase 5 (optional): Cross-TP reload (save at TP=1, reload at TP=2)
  • Phase 6 (optional): Training resumption — baseline vs resumed loss continuity
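The Phase 3/4 assertions boil down to a KL-divergence check between the reference logits and the reloaded model's logits. A minimal sketch of that check (helper names are illustrative, not the harness's actual API):

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's vocabulary logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(ref_logits, new_logits):
    # KL(ref || new) for a single position.
    p = softmax(ref_logits)
    q = softmax(new_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Phase 3: reloading the consolidated checkpoint must be an exact match.
ref = [1.0, 2.0, 0.5]
assert kl_divergence(ref, ref) == 0.0

# Phase 4: the vanilla HF load tolerates small kernel/attention drift,
# bounded by a per-model threshold from the recipe YAML.
HF_KL_THRESHOLD = 5e-3
drifted = [1.001, 1.999, 0.5]
assert kl_divergence(ref, drifted) < HF_KL_THRESHOLD
```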

test_checkpoint_robustness_biencoder.py — Biencoder variant using cosine similarity for embedding models (Embed-1B-v2).

CI Integration

Robustness tests run automatically after finetune in the same Slurm allocation, configured via the ci.checkpoint_robustness section in recipe YAMLs:

  • Common args (max_steps=5, dataset_limit=500, etc.) defined once in finetune_launcher.sh
  • Model-specific args (KL thresholds, TP overrides, tokenizer names) in each YAML's ci: section
  • 28 recipe YAMLs configured (14 SFT + 14 PEFT)
  • 20 new configs added to nightly_recipes.yml
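A recipe's ci: section might look roughly like the following; the nesting under ci.checkpoint_robustness matches the PR description, but the individual key names and values here are illustrative guesses, not the actual schema:

```yaml
ci:
  checkpoint_robustness:
    hf_kl_threshold: 5e-3            # Phase 4 tolerance (illustrative key name)
    tp_size: 2                       # TP override (illustrative)
    tokenizer_name: Qwen/Qwen2.5-7B  # hypothetical example value
```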

Features

  • --hf_device_map_auto: Spread Phase 4 HF model across all GPUs for large models (49B+)
  • --resume_loss_threshold: Configurable resume loss comparison threshold
  • --tokenizer_name: Dynamic tokenization for non-Llama models
  • --max_vram_gb / --max_cpu_gb: Peak memory regression assertions
  • --check_fused_qkv_keys: Verify PEFT adapter has split q/k/v projections
  • --check_phantom_keys: Scan for leaked mxFP4 keys in consolidated checkpoints
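The --max_vram_gb / --max_cpu_gb assertions amount to comparing an observed peak against a configured ceiling. A sketch, assuming the harness reads the peak from something like torch.cuda.max_memory_allocated() (VRAM) or process RSS (CPU); the function name here is hypothetical:

```python
def assert_peak_memory(peak_bytes, max_gb, label="VRAM"):
    # peak_bytes: observed peak, e.g. torch.cuda.max_memory_allocated().
    # max_gb: regression ceiling from --max_vram_gb / --max_cpu_gb.
    peak_gb = peak_bytes / 1024**3
    if peak_gb > max_gb:
        raise AssertionError(
            f"{label} regression: {peak_gb:.2f} GB > {max_gb} GB limit")
    return peak_gb

# e.g. --max_vram_gb 5 for Llama 3.2 3B SFT (observed peak ~3.91 GB)
peak = assert_peak_memory(int(3.91 * 1024**3), max_gb=5)
```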

Results

Passing Models (8 single-node + 3 multi-node)

| # | Model | SFT | PEFT | TP | HF KL (SFT) | HF KL (PEFT) | Resume |
|---|-------|-----|------|----|-------------|--------------|--------|
| 1 | Llama 3.2 3B | PASS | PASS | 1 | 5e-3 | 5e-3 | PASS |
| 2 | GPT-OSS 20B | PASS | PASS | 1 | 5e-2 | 5e-2 | Disabled (MoE) |
| 3 | Nemotron Nano V3 | PASS | PASS | 1 | 7e-2 | 1e-1 | Disabled (MoE) |
| 4 | Gemma 3 270m | PASS | PASS | 1 | 3.8e-3 | 7.5e-3 | PASS |
| 5 | Phi-4 | PASS | PASS | 2 | 7.6e-4 | 6.4e-4 | PASS (t=7e-3) |
| 6 | Qwen2.5 7B | PASS | PASS | 2 | 5.9e-3 | 5.5e-2 | PASS |
| 7 | Nemotron-Nano-8B-v1 | PASS | PASS | 2 | 4.2e-4 | 2.1e-3 | Disabled (Mamba) |
| 8 | Qwen3-MoE 30B | PASS | FAIL | 1 | 6.4e-5 | — | — |
| 14 | Embed-1B-v2 | PASS | N/A | 1 | cosine=1.0 | N/A | PASS (t=2e-2) |
| 15 | Super-120B | PASS | FAIL* | EP=32 (4N) | 8.5e-2 | — | Disabled (MoE) |
| 13 | Super-49B | Phase 1-3 | FAIL* | TP=4 (2N) | 10.6 | 10.5 | — |

*Phase 4 failures due to combined QKV projection keys in consolidated checkpoints — vanilla HF can't load them. Phases 1-3 (training + automodel reload) all pass.

Known Issues

  • Combined QKV Phase 4: Super-49B/120B PEFT produce combined projection keys that vanilla HF can't load. StateDictAdapter needed.
  • MoE resume non-determinism: DeepEP expert routing causes loss diff. --check_resume disabled for MoE.
  • Qwen3-MoE PEFT bug: Phase 3 KL=0.84 — real checkpoint reload bug in Qwen3MoeStateDictAdapter.
  • 5 failing models: Flash 1B (triton_attention.py missing), Nano V2 (FSDP wrap), Baichuan (meta tensor), Mistral3 (FP8 scalars).
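The phantom-key and fused-QKV checks both reduce to scanning the names of tensors in the consolidated checkpoint (in the harness these would come from something like safetensors' safe_open(...).keys()). A sketch over plain key lists; the exact suffix/substring patterns are illustrative:

```python
def find_phantom_keys(state_keys):
    # mxFP4 quantization leaves _blocks/_scales tensors behind if
    # dequantization didn't run during consolidation (GPT-OSS case).
    return [k for k in state_keys if k.endswith(("_blocks", "_scales"))]

def has_split_qkv(adapter_keys):
    # A vanilla-HF-loadable PEFT adapter needs separate q/k/v projections,
    # not a combined qkv key (substring names here are illustrative).
    combined = any("qkv_proj" in k for k in adapter_keys)
    split = any("q_proj" in k and "qkv" not in k for k in adapter_keys)
    return split and not combined

good = ["model.layers.0.self_attn.q_proj.lora_A.weight",
        "model.layers.0.self_attn.k_proj.lora_A.weight"]
bad = ["model.layers.0.self_attn.qkv_proj.lora_A.weight"]
assert has_split_qkv(good) and not has_split_qkv(bad)
assert find_phantom_keys(["mlp.experts.gate_up_proj_blocks", "lm_head.weight"]) \
    == ["mlp.experts.gate_up_proj_blocks"]
```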

Test plan

  • Validate all passing models (SFT + PEFT)
  • Multi-node: Super-120B SFT (4 nodes), Super-49B (2 nodes), Embed-1B-v2
  • CI integration via ci.checkpoint_robustness in recipe YAMLs
  • Add --hf_device_map_auto for large model Phase 4
  • Add --resume_loss_threshold flag
  • Add 20 missing models to nightly_recipes.yml
  • Move vLLM tests to a separate PR (test: add vLLM deployment tests for checkpoint robustness, #1656)
  • Investigate 5 failing models (transformers 5.3 compat, FP8)
  • Fix combined QKV Phase 4 for Super-49B/120B

🤖 Generated with Claude Code

Add end-to-end checkpoint robustness tests that verify checkpoint
save/load round-trips produce bitwise-identical logits. Tests cover
both SFT and PEFT workflows:

- Phase 1: Train for N steps and save checkpoint
- Phase 2: Capture reference logits
- Phase 3: Reload automodel from consolidated checkpoint (SFT) or
  auto-resume from checkpoint dir (PEFT), assert zero KL divergence
- Phase 4: Load into vanilla HF, assert KL within relaxed threshold
  (accounts for kernel/attention implementation differences)

Also adds a vLLM deployment smoke test that verifies greedy decoding
matches between HF and vLLM for consolidated checkpoints.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
copy-pr-bot (bot) commented Mar 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@adil-a adil-a linked an issue Mar 25, 2026 that may be closed by this pull request
85 tasks
adil-a and others added 19 commits March 25, 2026 09:11
Add Phase 5 that reloads consolidated checkpoint with a different TP
size (e.g., train at TP=1, reload at TP=2). Exercises FSDP2 DTensor
resharding and QKV interleaving under different sharding layouts.

Opt-in via --cross_tp_size <int> with separate --cross_tp_kl_threshold
(default 5e-3) since TP resharding introduces forward pass numerical
differences similar to the HF comparison.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
- Add GPT-OSS 20B SFT and PEFT checkpoint robustness shell scripts
  with hf_kl_threshold=5e-2 (higher for MoE due to expert routing
  numerical divergence from RoPE precision and attention kernel diffs)
- Add vLLM PEFT support via native LoRA (enable_lora + LoRARequest)
- Add --vllm_smoke_test mode for models where model_impl="transformers"
  is unavailable (e.g., MoE with transformers<5.0): loads model into
  vLLM native backend and verifies non-empty output without HF comparison
- Add vLLM step to Llama PEFT shell script
- Handle models returning raw tensors instead of CausalLMOutput in
  _get_logits

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Add LoRA support to vLLM smoke test path (enable_lora + LoRARequest).
Fix GPT-OSS model name to openai/gpt-oss-20b in PEFT script and add
vLLM deployment step. Update hf_kl_threshold to 5e-2 for MoE.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Update all checkpoint robustness shell scripts to use 8 GPUs
(CUDA_VISIBLE_DEVICES=0-7, nproc_per_node=8). Add cross-TP test
(--cross_tp_size 2) to Llama SFT script.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Combine separate SFT/PEFT scripts into one per model. Add dedicated
vLLM deployment scripts that reuse checkpoints from robustness runs.

Shell scripts:
- L2_Checkpoint_Robustness_Llama3_2_3B.sh (SFT + cross-TP + PEFT)
- L2_Checkpoint_Robustness_GPT_OSS_20B.sh (SFT + PEFT, ep_size=8)
- L2_vLLM_Deploy_Llama3_2_3B.sh (SFT greedy + PEFT LoRA)
- L2_vLLM_Deploy_GPT_OSS_20B.sh (SFT smoke + PEFT LoRA smoke)

All scripts use 8 GPUs, hardcoded /adasif/checkpoints/ paths, and
LATEST symlink for step dir resolution. vLLM scripts must run in
an environment with vllm installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Add SFT and PEFT checkpoint robustness tests for Nemotron Nano V3
(hybrid Mamba2+Attention+MoE, 30B/3B-active). Uses experts_implementation=
grouped_mm for HF comparison to match automodel's batched GEMM backend,
reducing KL divergence from bf16 numerical noise.

Also fixes transformers >= 5.2 compatibility where check_model_inputs
was split into merge_with_config_defaults + capture_outputs but the
deprecated import still exists with a different signature.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
- Dynamic tokenizer: --tokenizer_name flag for non-Llama models
- Memory tracking: --max_vram_gb / --max_cpu_gb with peak VRAM and RSS assertions
- Phantom key check: --check_phantom_keys scans consolidated safetensors for leaked _blocks/_scales keys (GPT-OSS mxFP4)
- Fused QKV check: --check_fused_qkv_keys verifies PEFT adapter has split q/k/v projections
- Resume loss continuity: --check_resume trains baseline + resumed run, compares per-step losses (disabled for MoE due to DeepEP non-determinism)
- vLLM token comparison: assert length equality before content comparison
- Audit fixes: no vacuous passes for phantom keys, resume, or vLLM checks
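The vLLM token comparison described above checks lengths before contents, so a truncated generation fails with a clear message instead of a confusing element-wise mismatch. A sketch (function name is illustrative):

```python
def compare_greedy_tokens(hf_tokens, vllm_tokens):
    # Length check first: a short/empty generation should fail loudly.
    if len(hf_tokens) != len(vllm_tokens):
        raise AssertionError(
            f"length mismatch: HF={len(hf_tokens)} vs vLLM={len(vllm_tokens)}")
    # Then token-for-token content comparison for greedy decoding.
    mismatches = [i for i, (a, b) in enumerate(zip(hf_tokens, vllm_tokens))
                  if a != b]
    if mismatches:
        raise AssertionError(f"token mismatch at positions {mismatches[:5]}")

compare_greedy_tokens([128000, 9906, 1917], [128000, 9906, 1917])  # passes
```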

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New models: Nemotron Flash 1B, Gemma 3 270m, Phi-4, Nemotron Nano V2 9B,
Baichuan 2 7B, Qwen2.5 7B, Qwen3-MoE 30B, Nemotron Super 120B,
Llama-3.3-Super-49B, Mistral3 3B, Nemotron-Nano-8B-v1, llama-nemotron-embed-1b-v2

- 12 robustness shell scripts (SFT + PEFT per model)
- 11 vLLM deploy shell scripts (no vLLM for biencoder)
- 5 new YAML configs (Mistral3 SFT/PEFT, Nano-8B-v1 SFT/PEFT, Qwen3-MoE SFT)
- Biencoder test (test_checkpoint_robustness_biencoder.py) with cosine similarity
- 12 test methods in TestCheckpointRobustness + 11 in TestVLLMDeploy

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --dataset.limit_dataset_samples 500 / --dataset.num_samples_limit 500
  to all robustness scripts (squad and hellaswag respectively) to cut
  dataset mapping time from ~60s to ~1s per run
- Add --max_vram_gb / --max_cpu_gb thresholds to Gemma 3 and Phi-4
  based on observed peak usage (~1.2x headroom)
- Fix Gemma 3 to TP=1 (1 KV head not divisible by TP=2)
- Fix Phi-4 to TP=1 (DTensor redistribution assertion with TP=2)
- Tighten HF KL thresholds based on observed values:
  Gemma 3 SFT: 6e-3, PEFT: 8e-3
  Phi-4 SFT: 1.2e-3, PEFT: 1e-3
- Register dataset.num_samples_limit in conftest.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Qwen2.5 7B: tighten SFT KL to 9e-3, PEFT to 8e-2, add cross-TP, memory limits
- Qwen3-MoE 30B: tighten SFT KL to 1e-4, add memory limits
- Nemotron-Nano-8B-v1: tighten SFT KL to 7e-4, add cross-TP, disable resume (Mamba hybrid non-determinism)
- Baichuan/Mistral3: add cross-TP to SFT step
- Add __main__ block to test_checkpoint_robustness_llm.py for direct execution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Super-49B confirmed multi-node only (OOM on 8 GPUs with TP=4 PP=2).
Updated all model results including vLLM pass/fail status.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Observed peak memory (1.2x headroom applied):
- Llama 3.2 3B: SFT 3.91→5 GB VRAM, PEFT 3.89→5 GB VRAM
- GPT-OSS 20B: SFT 19.24→24 GB VRAM, PEFT 9.49→12 GB VRAM
- Nemotron Nano V3: SFT 29.02→35 GB VRAM, PEFT 12.47→15 GB VRAM
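The thresholds above are consistent with padding the observed peak by 1.2x and rounding up to a whole number of GB; the ceil rounding is inferred from the numbers, not stated in the commit:

```python
import math

def headroom_threshold(peak_gb, factor=1.2):
    # Pad the observed peak and round up to an integer GB limit.
    return math.ceil(peak_gb * factor)

# Reproduces the commit's thresholds:
assert headroom_threshold(3.91) == 5    # Llama 3.2 3B SFT
assert headroom_threshold(19.24) == 24  # GPT-OSS 20B SFT
assert headroom_threshold(29.02) == 35  # Nemotron Nano V3 SFT
```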

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lerance

Allows MoE and Mamba hybrid models to use a looser threshold for
training resumption loss continuity checks (default: 5e-3 for dense SFT).
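The continuity check compares per-step losses of an uninterrupted baseline run against a run that stopped and resumed from the checkpoint. A sketch under the stated default (function name is illustrative):

```python
def check_resume_continuity(baseline_losses, resumed_losses, threshold=5e-3):
    # Per-step comparison: the worst absolute loss difference between the
    # baseline and the resumed run must stay within the tolerance.
    diffs = [abs(a - b) for a, b in zip(baseline_losses, resumed_losses)]
    worst = max(diffs)
    assert worst <= threshold, f"resume loss drift {worst:.2e} > {threshold:.0e}"
    return worst

# Dense SFT uses the tight 5e-3 default; MoE/Mamba hybrids would pass a
# looser threshold (or skip the check entirely).
check_resume_continuity([2.31, 2.18, 2.05], [2.311, 2.179, 2.051])
```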

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phi-4 DTensor bug at TP=2 fixed on main. Both SFT and PEFT pass.
Added configurable --resume_loss_threshold CLI arg (default 5e-3).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Enables device_map="auto" in Phase 4 to spread vanilla HF model across
all GPUs on rank 0's node. Required for 49B+ models that don't fit on
1 GPU (98GB at bf16 > 80GB H100). Validated under torchrun on 8 GPUs.
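The flag's effect can be sketched as a switch on the kwargs passed to AutoModelForCausalLM.from_pretrained; device_map="auto" (which requires accelerate) shards the model across visible GPUs, while the single-device fallback shown here is an assumption about the default path, not confirmed by the PR:

```python
def phase4_load_kwargs(hf_device_map_auto, dtype="bfloat16"):
    # With --hf_device_map_auto, accelerate spreads the vanilla HF model
    # across every visible GPU on rank 0's node; a 49B model at bf16
    # (~98 GB) does not fit on a single 80 GB H100.
    kwargs = {"torch_dtype": dtype}
    if hf_device_map_auto:
        kwargs["device_map"] = "auto"   # requires `accelerate`
    else:
        kwargs["device_map"] = {"": 0}  # everything on GPU 0 (assumed default)
    return kwargs

# AutoModelForCausalLM.from_pretrained(ckpt_dir, **phase4_load_kwargs(True))
assert phase4_load_kwargs(True)["device_map"] == "auto"
```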

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Results:
- Super-120B SFT: PASS (4 nodes, EP=32, device_map=auto for Phase 4)
- Super-49B SFT: Phase 1-3 PASS (2 nodes, TP=4), Phase 4 FAIL (combined QKV keys)
- Super-49B/120B PEFT: Phase 1-3 PASS, Phase 4 FAIL (combined QKV in adapter)
- Embed-1B-v2: PASS (cosine=1.0, resume with t=2e-2)

Changes:
- Add --hf_device_map_auto flag for Phase 4 large model HF loading
- Fix biencoder import (recipes.biencoder -> recipes.retrieval)
- Fix biencoder tokenizer compatibility (NeMoAutoTokenizer + return_tensors)
- Add --resume_loss_threshold to biencoder test
- Register new CLI flags in conftest.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
adil-a added a commit that referenced this pull request Apr 2, 2026
vLLM deployment verification tests that load consolidated checkpoints
and compare greedy output token-for-token against HuggingFace.
Supports both full comparison and smoke test mode.

Depends on checkpoint robustness PR #1606.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Remove vLLM deploy test module, 14 shell scripts, and TestVLLMDeploy
runner class. Remove vLLM-specific conftest entries and STATUS.md
sections. vLLM tests will land in a follow-up PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
adil-a and others added 4 commits April 2, 2026 18:15
Add ci.checkpoint_robustness section to 28 recipe YAMLs with
model-specific test args (KL thresholds, TP overrides, tokenizer names).
Common args (max_steps=5, dataset_limit=500, etc.) handled in launcher.

Append robustness test block to finetune_launcher.sh that runs after
finetune completes, gated by presence of ci.checkpoint_robustness.

Add 20 missing model configs to nightly_recipes.yml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
These scripts are superseded by ci.checkpoint_robustness sections in
recipe YAMLs. Kept locally for manual debugging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
TestCheckpointRobustness class called the removed .sh scripts via
run_test_script(). No longer needed — CI runs robustness tests
directly from finetune_launcher.sh.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Move YAML config parsing for model-specific robustness args from
finetune_launcher.sh into test_checkpoint_robustness_llm.py. The
launcher now only detects if ci.checkpoint_robustness exists and
passes common args. The test script reads model-specific values
(KL thresholds, TP overrides, tokenizer names, etc.) directly from
the YAML's ci.checkpoint_robustness section, with CLI args taking
precedence.
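The precedence rule described here — YAML supplies model-specific defaults, explicitly passed CLI flags win — can be sketched as a simple merge (function and key names are illustrative):

```python
def resolve_robustness_args(yaml_ci, cli_args):
    # Start from the YAML's ci.checkpoint_robustness section and let any
    # flag that was actually passed on the command line override it.
    resolved = dict(yaml_ci)
    for key, value in cli_args.items():
        if value is not None:  # None means "not passed on the CLI"
            resolved[key] = value
    return resolved

yaml_ci = {"hf_kl_threshold": 5e-3,
           "tokenizer_name": "meta-llama/Llama-3.2-3B"}  # hypothetical values
cli = {"hf_kl_threshold": 1e-2, "tokenizer_name": None}
merged = resolve_robustness_args(yaml_ci, cli)
assert merged == {"hf_kl_threshold": 1e-2,
                  "tokenizer_name": "meta-llama/Llama-3.2-3B"}
```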

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
adil-a (Collaborator, Author) commented Apr 2, 2026

/ok to test d77ea17

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
thomasdhc (Contributor) commented:

/ok to test 321ac05

thomasdhc previously approved these changes Apr 6, 2026
thomasdhc (Contributor) commented:

/ok to test 3bec651



Development

Successfully merging this pull request may close these issues.

E2E Checkpointing Robustness

3 participants