
test: add checkpoint robustness functional tests #1606

Merged
thomasdhc merged 43 commits into main from adil-a/checkpoint-robustness-test
Apr 6, 2026

Conversation

adil-a (Collaborator) commented Mar 25, 2026

Summary

Comprehensive checkpoint robustness testing for all supported models. Tests the full lifecycle: load → SFT/PEFT (few steps) → save → reload → verify correctness.

Tracks #1586.

Note: vLLM deployment tests were moved to a separate PR (#1656).

Test Infrastructure

test_checkpoint_robustness_llm.py — 6-phase test harness:

  • Phase 1: Train for 5 steps, checkpoint
  • Phase 2: Capture reference logits
  • Phase 3: Reload from consolidated checkpoint, assert KL = 0 (exact match)
  • Phase 4: Load into vanilla HF AutoModelForCausalLM, assert KL < threshold
  • Phase 5 (optional): Cross-TP reload (save at TP=1, reload at TP=2)
  • Phase 6 (optional): Training resumption — baseline vs resumed loss continuity
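The Phase 3/4 assertions boil down to a KL-divergence check between the reference logits and the reloaded model's logits. A minimal sketch of that check (helper names are illustrative, not the harness's actual API):

```python
import math

def softmax(logits):
    # Numerically stable softmax over one position's vocabulary logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(ref_logits, new_logits):
    # KL(ref || new) for a single position.
    p = softmax(ref_logits)
    q = softmax(new_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Phase 3: reloading the consolidated checkpoint must be an exact match.
ref = [1.0, 2.0, 0.5]
assert kl_divergence(ref, ref) == 0.0

# Phase 4: the vanilla HF load tolerates small kernel/attention drift,
# bounded by a per-model threshold from the recipe YAML.
HF_KL_THRESHOLD = 5e-3
drifted = [1.001, 1.999, 0.5]
assert kl_divergence(ref, drifted) < HF_KL_THRESHOLD
```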

test_checkpoint_robustness_biencoder.py — Biencoder variant using cosine similarity for embedding models (Embed-1B-v2).

CI Integration

Robustness tests run automatically after finetune in the same Slurm allocation, configured via the ci.checkpoint_robustness section in recipe YAMLs:

  • Common args (max_steps=5, dataset_limit=500, etc.) defined once in finetune_launcher.sh
  • Model-specific args (KL thresholds, TP overrides, tokenizer names) in each YAML's ci: section
  • 28 recipe YAMLs configured (14 SFT + 14 PEFT)
  • 20 new configs added to nightly_recipes.yml
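A recipe's ci: section might look roughly like the following; the nesting under ci.checkpoint_robustness matches the PR description, but the individual key names and values here are illustrative guesses, not the actual schema:

```yaml
ci:
  checkpoint_robustness:
    hf_kl_threshold: 5e-3            # Phase 4 tolerance (illustrative key name)
    tp_size: 2                       # TP override (illustrative)
    tokenizer_name: Qwen/Qwen2.5-7B  # hypothetical example value
```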

Features

  • --hf_device_map_auto: Spread Phase 4 HF model across all GPUs for large models (49B+)
  • --resume_loss_threshold: Configurable resume loss comparison threshold
  • --tokenizer_name: Dynamic tokenization for non-Llama models
  • --max_vram_gb / --max_cpu_gb: Peak memory regression assertions
  • --check_fused_qkv_keys: Verify PEFT adapter has split q/k/v projections
  • --check_phantom_keys: Scan for leaked mxFP4 keys in consolidated checkpoints
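The --max_vram_gb / --max_cpu_gb assertions amount to comparing an observed peak against a configured ceiling. A sketch, assuming the harness reads the peak from something like torch.cuda.max_memory_allocated() (VRAM) or process RSS (CPU); the function name here is hypothetical:

```python
def assert_peak_memory(peak_bytes, max_gb, label="VRAM"):
    # peak_bytes: observed peak, e.g. torch.cuda.max_memory_allocated().
    # max_gb: regression ceiling from --max_vram_gb / --max_cpu_gb.
    peak_gb = peak_bytes / 1024**3
    if peak_gb > max_gb:
        raise AssertionError(
            f"{label} regression: {peak_gb:.2f} GB > {max_gb} GB limit")
    return peak_gb

# e.g. --max_vram_gb 5 for Llama 3.2 3B SFT (observed peak ~3.91 GB)
peak = assert_peak_memory(int(3.91 * 1024**3), max_gb=5)
```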

Results

Passing Models (8 single-node + 3 multi-node)

| # | Model | SFT | PEFT | TP | HF KL (SFT) | HF KL (PEFT) | Resume |
|---|-------|-----|------|----|-------------|--------------|--------|
| 1 | Llama 3.2 3B | PASS | PASS | 1 | 5e-3 | 5e-3 | PASS |
| 2 | GPT-OSS 20B | PASS | PASS | 1 | 5e-2 | 5e-2 | Disabled (MoE) |
| 3 | Nemotron Nano V3 | PASS | PASS | 1 | 7e-2 | 1e-1 | Disabled (MoE) |
| 4 | Gemma 3 270m | PASS | PASS | 1 | 3.8e-3 | 7.5e-3 | PASS |
| 5 | Phi-4 | PASS | PASS | 2 | 7.6e-4 | 6.4e-4 | PASS (t=7e-3) |
| 6 | Qwen2.5 7B | PASS | PASS | 2 | 5.9e-3 | 5.5e-2 | PASS |
| 7 | Nemotron-Nano-8B-v1 | PASS | PASS | 2 | 4.2e-4 | 2.1e-3 | Disabled (Mamba) |
| 8 | Qwen3-MoE 30B | PASS | FAIL | 1 | 6.4e-5 | — | — |
| 14 | Embed-1B-v2 | PASS | N/A | 1 | cosine=1.0 | N/A | PASS (t=2e-2) |
| 15 | Super-120B | PASS | FAIL* | EP=32 (4N) | 8.5e-2 | — | Disabled (MoE) |
| 13 | Super-49B | Phase 1-3 | FAIL* | TP=4 (2N) | 10.6 | 10.5 | — |

*Phase 4 failures due to combined QKV projection keys in consolidated checkpoints — vanilla HF can't load them. Phases 1-3 (training + automodel reload) all pass.

Known Issues

  • Combined QKV Phase 4: Super-49B/120B PEFT produce combined projection keys that vanilla HF can't load. StateDictAdapter needed.
  • MoE resume non-determinism: DeepEP expert routing causes loss diff. --check_resume disabled for MoE.
  • Qwen3-MoE PEFT bug: Phase 3 KL=0.84 — real checkpoint reload bug in Qwen3MoeStateDictAdapter.
  • 5 failing models: Flash 1B (triton_attention.py missing), Nano V2 (FSDP wrap), Baichuan (meta tensor), Mistral3 (FP8 scalars).
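The phantom-key and fused-QKV checks both reduce to scanning the names of tensors in the consolidated checkpoint (in the harness these would come from something like safetensors' safe_open(...).keys()). A sketch over plain key lists; the exact suffix/substring patterns are illustrative:

```python
def find_phantom_keys(state_keys):
    # mxFP4 quantization leaves _blocks/_scales tensors behind if
    # dequantization didn't run during consolidation (GPT-OSS case).
    return [k for k in state_keys if k.endswith(("_blocks", "_scales"))]

def has_split_qkv(adapter_keys):
    # A vanilla-HF-loadable PEFT adapter needs separate q/k/v projections,
    # not a combined qkv key (substring names here are illustrative).
    combined = any("qkv_proj" in k for k in adapter_keys)
    split = any("q_proj" in k and "qkv" not in k for k in adapter_keys)
    return split and not combined

good = ["model.layers.0.self_attn.q_proj.lora_A.weight",
        "model.layers.0.self_attn.k_proj.lora_A.weight"]
bad = ["model.layers.0.self_attn.qkv_proj.lora_A.weight"]
assert has_split_qkv(good) and not has_split_qkv(bad)
assert find_phantom_keys(["mlp.experts.gate_up_proj_blocks", "lm_head.weight"]) \
    == ["mlp.experts.gate_up_proj_blocks"]
```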

Test plan

  • Validate all passing models (SFT + PEFT)
  • Multi-node: Super-120B SFT (4 nodes), Super-49B (2 nodes), Embed-1B-v2
  • CI integration via ci.checkpoint_robustness in recipe YAMLs
  • Add --hf_device_map_auto for large model Phase 4
  • Add --resume_loss_threshold flag
  • Add 20 missing models to nightly_recipes.yml
  • Move vLLM tests to a separate PR (test: add vLLM deployment tests for checkpoint robustness, #1656)
  • Investigate 5 failing models (transformers 5.3 compat, FP8)
  • Fix combined QKV Phase 4 for Super-49B/120B

🤖 Generated with Claude Code

Add end-to-end checkpoint robustness tests that verify checkpoint
save/load round-trips produce bitwise-identical logits. Tests cover
both SFT and PEFT workflows:

- Phase 1: Train for N steps and save checkpoint
- Phase 2: Capture reference logits
- Phase 3: Reload automodel from consolidated checkpoint (SFT) or
  auto-resume from checkpoint dir (PEFT), assert zero KL divergence
- Phase 4: Load into vanilla HF, assert KL within relaxed threshold
  (accounts for kernel/attention implementation differences)

Also adds a vLLM deployment smoke test that verifies greedy decoding
matches between HF and vLLM for consolidated checkpoints.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
copy-pr-bot (bot) commented Mar 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@adil-a adil-a linked an issue Mar 25, 2026 that may be closed by this pull request
85 tasks
adil-a and others added 19 commits March 25, 2026 09:11
Add Phase 5 that reloads consolidated checkpoint with a different TP
size (e.g., train at TP=1, reload at TP=2). Exercises FSDP2 DTensor
resharding and QKV interleaving under different sharding layouts.

Opt-in via --cross_tp_size <int> with separate --cross_tp_kl_threshold
(default 5e-3) since TP resharding introduces forward pass numerical
differences similar to the HF comparison.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
- Add GPT-OSS 20B SFT and PEFT checkpoint robustness shell scripts
  with hf_kl_threshold=5e-2 (higher for MoE due to expert routing
  numerical divergence from RoPE precision and attention kernel diffs)
- Add vLLM PEFT support via native LoRA (enable_lora + LoRARequest)
- Add --vllm_smoke_test mode for models where model_impl="transformers"
  is unavailable (e.g., MoE with transformers<5.0): loads model into
  vLLM native backend and verifies non-empty output without HF comparison
- Add vLLM step to Llama PEFT shell script
- Handle models returning raw tensors instead of CausalLMOutput in
  _get_logits

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Add LoRA support to vLLM smoke test path (enable_lora + LoRARequest).
Fix GPT-OSS model name to openai/gpt-oss-20b in PEFT script and add
vLLM deployment step. Update hf_kl_threshold to 5e-2 for MoE.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Update all checkpoint robustness shell scripts to use 8 GPUs
(CUDA_VISIBLE_DEVICES=0-7, nproc_per_node=8). Add cross-TP test
(--cross_tp_size 2) to Llama SFT script.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Combine separate SFT/PEFT scripts into one per model. Add dedicated
vLLM deployment scripts that reuse checkpoints from robustness runs.

Shell scripts:
- L2_Checkpoint_Robustness_Llama3_2_3B.sh (SFT + cross-TP + PEFT)
- L2_Checkpoint_Robustness_GPT_OSS_20B.sh (SFT + PEFT, ep_size=8)
- L2_vLLM_Deploy_Llama3_2_3B.sh (SFT greedy + PEFT LoRA)
- L2_vLLM_Deploy_GPT_OSS_20B.sh (SFT smoke + PEFT LoRA smoke)

All scripts use 8 GPUs, hardcoded /adasif/checkpoints/ paths, and
LATEST symlink for step dir resolution. vLLM scripts must run in
an environment with vllm installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Add SFT and PEFT checkpoint robustness tests for Nemotron Nano V3
(hybrid Mamba2+Attention+MoE, 30B/3B-active). Uses experts_implementation=
grouped_mm for HF comparison to match automodel's batched GEMM backend,
reducing KL divergence from bf16 numerical noise.

Also fixes transformers >= 5.2 compatibility where check_model_inputs
was split into merge_with_config_defaults + capture_outputs but the
deprecated import still exists with a different signature.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
- Dynamic tokenizer: --tokenizer_name flag for non-Llama models
- Memory tracking: --max_vram_gb / --max_cpu_gb with peak VRAM and RSS assertions
- Phantom key check: --check_phantom_keys scans consolidated safetensors for leaked _blocks/_scales keys (GPT-OSS mxFP4)
- Fused QKV check: --check_fused_qkv_keys verifies PEFT adapter has split q/k/v projections
- Resume loss continuity: --check_resume trains baseline + resumed run, compares per-step losses (disabled for MoE due to DeepEP non-determinism)
- vLLM token comparison: assert length equality before content comparison
- Audit fixes: no vacuous passes for phantom keys, resume, or vLLM checks
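The vLLM token comparison described above checks lengths before contents, so a truncated generation fails with a clear message instead of a confusing element-wise mismatch. A sketch (function name is illustrative):

```python
def compare_greedy_tokens(hf_tokens, vllm_tokens):
    # Length check first: a short/empty generation should fail loudly.
    if len(hf_tokens) != len(vllm_tokens):
        raise AssertionError(
            f"length mismatch: HF={len(hf_tokens)} vs vLLM={len(vllm_tokens)}")
    # Then token-for-token content comparison for greedy decoding.
    mismatches = [i for i, (a, b) in enumerate(zip(hf_tokens, vllm_tokens))
                  if a != b]
    if mismatches:
        raise AssertionError(f"token mismatch at positions {mismatches[:5]}")

compare_greedy_tokens([128000, 9906, 1917], [128000, 9906, 1917])  # passes
```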

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New models: Nemotron Flash 1B, Gemma 3 270m, Phi-4, Nemotron Nano V2 9B,
Baichuan 2 7B, Qwen2.5 7B, Qwen3-MoE 30B, Nemotron Super 120B,
Llama-3.3-Super-49B, Mistral3 3B, Nemotron-Nano-8B-v1, llama-nemotron-embed-1b-v2

- 12 robustness shell scripts (SFT + PEFT per model)
- 11 vLLM deploy shell scripts (no vLLM for biencoder)
- 5 new YAML configs (Mistral3 SFT/PEFT, Nano-8B-v1 SFT/PEFT, Qwen3-MoE SFT)
- Biencoder test (test_checkpoint_robustness_biencoder.py) with cosine similarity
- 12 test methods in TestCheckpointRobustness + 11 in TestVLLMDeploy

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --dataset.limit_dataset_samples 500 / --dataset.num_samples_limit 500
  to all robustness scripts (squad and hellaswag respectively) to cut
  dataset mapping time from ~60s to ~1s per run
- Add --max_vram_gb / --max_cpu_gb thresholds to Gemma 3 and Phi-4
  based on observed peak usage (~1.2x headroom)
- Fix Gemma 3 to TP=1 (1 KV head not divisible by TP=2)
- Fix Phi-4 to TP=1 (DTensor redistribution assertion with TP=2)
- Tighten HF KL thresholds based on observed values:
  Gemma 3 SFT: 6e-3, PEFT: 8e-3
  Phi-4 SFT: 1.2e-3, PEFT: 1e-3
- Register dataset.num_samples_limit in conftest.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Qwen2.5 7B: tighten SFT KL to 9e-3, PEFT to 8e-2, add cross-TP, memory limits
- Qwen3-MoE 30B: tighten SFT KL to 1e-4, add memory limits
- Nemotron-Nano-8B-v1: tighten SFT KL to 7e-4, add cross-TP, disable resume (Mamba hybrid non-determinism)
- Baichuan/Mistral3: add cross-TP to SFT step
- Add __main__ block to test_checkpoint_robustness_llm.py for direct execution

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Super-49B confirmed multi-node only (OOM on 8 GPUs with TP=4 PP=2).
Updated all model results including vLLM pass/fail status.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Observed peak memory (1.2x headroom applied):
- Llama 3.2 3B: SFT 3.91→5 GB VRAM, PEFT 3.89→5 GB VRAM
- GPT-OSS 20B: SFT 19.24→24 GB VRAM, PEFT 9.49→12 GB VRAM
- Nemotron Nano V3: SFT 29.02→35 GB VRAM, PEFT 12.47→15 GB VRAM
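The thresholds above are consistent with padding the observed peak by 1.2x and rounding up to a whole number of GB; the ceil rounding is inferred from the numbers, not stated in the commit:

```python
import math

def headroom_threshold(peak_gb, factor=1.2):
    # Pad the observed peak and round up to an integer GB limit.
    return math.ceil(peak_gb * factor)

# Reproduces the commit's thresholds:
assert headroom_threshold(3.91) == 5    # Llama 3.2 3B SFT
assert headroom_threshold(19.24) == 24  # GPT-OSS 20B SFT
assert headroom_threshold(29.02) == 35  # Nemotron Nano V3 SFT
```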

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lerance

Allows MoE and Mamba hybrid models to use a looser threshold for
training resumption loss continuity checks (default: 5e-3 for dense SFT).
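The continuity check compares per-step losses of an uninterrupted baseline run against a run that stopped and resumed from the checkpoint. A sketch under the stated default (function name is illustrative):

```python
def check_resume_continuity(baseline_losses, resumed_losses, threshold=5e-3):
    # Per-step comparison: the worst absolute loss difference between the
    # baseline and the resumed run must stay within the tolerance.
    diffs = [abs(a - b) for a, b in zip(baseline_losses, resumed_losses)]
    worst = max(diffs)
    assert worst <= threshold, f"resume loss drift {worst:.2e} > {threshold:.0e}"
    return worst

# Dense SFT uses the tight 5e-3 default; MoE/Mamba hybrids would pass a
# looser threshold (or skip the check entirely).
check_resume_continuity([2.31, 2.18, 2.05], [2.311, 2.179, 2.051])
```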

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phi-4 DTensor bug at TP=2 fixed on main. Both SFT and PEFT pass.
Added configurable --resume_loss_threshold CLI arg (default 5e-3).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Enables device_map="auto" in Phase 4 to spread vanilla HF model across
all GPUs on rank 0's node. Required for 49B+ models that don't fit on
1 GPU (98GB at bf16 > 80GB H100). Validated under torchrun on 8 GPUs.
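The flag's effect can be sketched as a switch on the kwargs passed to AutoModelForCausalLM.from_pretrained; device_map="auto" (which requires accelerate) shards the model across visible GPUs, while the single-device fallback shown here is an assumption about the default path, not confirmed by the PR:

```python
def phase4_load_kwargs(hf_device_map_auto, dtype="bfloat16"):
    # With --hf_device_map_auto, accelerate spreads the vanilla HF model
    # across every visible GPU on rank 0's node; a 49B model at bf16
    # (~98 GB) does not fit on a single 80 GB H100.
    kwargs = {"torch_dtype": dtype}
    if hf_device_map_auto:
        kwargs["device_map"] = "auto"   # requires `accelerate`
    else:
        kwargs["device_map"] = {"": 0}  # everything on GPU 0 (assumed default)
    return kwargs

# AutoModelForCausalLM.from_pretrained(ckpt_dir, **phase4_load_kwargs(True))
assert phase4_load_kwargs(True)["device_map"] == "auto"
```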

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Results:
- Super-120B SFT: PASS (4 nodes, EP=32, device_map=auto for Phase 4)
- Super-49B SFT: Phase 1-3 PASS (2 nodes, TP=4), Phase 4 FAIL (combined QKV keys)
- Super-49B/120B PEFT: Phase 1-3 PASS, Phase 4 FAIL (combined QKV in adapter)
- Embed-1B-v2: PASS (cosine=1.0, resume with t=2e-2)

Changes:
- Add --hf_device_map_auto flag for Phase 4 large model HF loading
- Fix biencoder import (recipes.biencoder -> recipes.retrieval)
- Fix biencoder tokenizer compatibility (NeMoAutoTokenizer + return_tensors)
- Add --resume_loss_threshold to biencoder test
- Register new CLI flags in conftest.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
adil-a added a commit that referenced this pull request Apr 2, 2026
vLLM deployment verification tests that load consolidated checkpoints
and compare greedy output token-for-token against HuggingFace.
Supports both full comparison and smoke test mode.

Depends on checkpoint robustness PR #1606.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Remove vLLM deploy test module, 14 shell scripts, and TestVLLMDeploy
runner class. Remove vLLM-specific conftest entries and STATUS.md
sections. vLLM tests will land in a follow-up PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
adil-a and others added 4 commits April 2, 2026 18:15
Add ci.checkpoint_robustness section to 28 recipe YAMLs with
model-specific test args (KL thresholds, TP overrides, tokenizer names).
Common args (max_steps=5, dataset_limit=500, etc.) handled in launcher.

Append robustness test block to finetune_launcher.sh that runs after
finetune completes, gated by presence of ci.checkpoint_robustness.

Add 20 missing model configs to nightly_recipes.yml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
These scripts are superseded by ci.checkpoint_robustness sections in
recipe YAMLs. Kept locally for manual debugging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
TestCheckpointRobustness class called the removed .sh scripts via
run_test_script(). No longer needed — CI runs robustness tests
directly from finetune_launcher.sh.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Move YAML config parsing for model-specific robustness args from
finetune_launcher.sh into test_checkpoint_robustness_llm.py. The
launcher now only detects if ci.checkpoint_robustness exists and
passes common args. The test script reads model-specific values
(KL thresholds, TP overrides, tokenizer names, etc.) directly from
the YAML's ci.checkpoint_robustness section, with CLI args taking
precedence.
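The precedence rule described here — YAML supplies model-specific defaults, explicitly passed CLI flags win — can be sketched as a simple merge (function and key names are illustrative):

```python
def resolve_robustness_args(yaml_ci, cli_args):
    # Start from the YAML's ci.checkpoint_robustness section and let any
    # flag that was actually passed on the command line override it.
    resolved = dict(yaml_ci)
    for key, value in cli_args.items():
        if value is not None:  # None means "not passed on the CLI"
            resolved[key] = value
    return resolved

yaml_ci = {"hf_kl_threshold": 5e-3,
           "tokenizer_name": "meta-llama/Llama-3.2-3B"}  # hypothetical values
cli = {"hf_kl_threshold": 1e-2, "tokenizer_name": None}
merged = resolve_robustness_args(yaml_ci, cli)
assert merged == {"hf_kl_threshold": 1e-2,
                  "tokenizer_name": "meta-llama/Llama-3.2-3B"}
```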

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
adil-a (Collaborator, Author) commented Apr 2, 2026

/ok to test d77ea17

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
thomasdhc (Contributor) commented:

/ok to test 321ac05

thomasdhc previously approved these changes Apr 6, 2026
thomasdhc (Contributor) commented:

/ok to test 3bec651



Development

Successfully merging this pull request may close these issues.

E2E Checkpointing Robustness

3 participants