[nvbug 6289151] Fix exported Step layer type metadata#1693
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughAdds a sanitizer that trims Hugging Face ChangesLayer-types sanitization for HF export
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1693 +/- ##
==========================================
+ Coverage 67.72% 76.38% +8.66%
==========================================
Files 511 511
Lines 56168 56524 +356
==========================================
+ Hits 38037 43175 +5138
+ Misses 18131 13349 -4782
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
4974fcb to
0bccd7b
Compare
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
0bccd7b to
97687b3
Compare
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Small, focused bug fix (+121/-5, 4 files) for Step-3.5-Flash HF export where layer_types includes trailing MTP/next-token-prediction entries, breaking Transformers 5.x's len(layer_types) == num_hidden_layers validation in TRT-LLM.
Correctness:
sanitize_hf_config_for_deploymentis conservative: it only trims trailing entries when the mismatch is exactly explained bynum_nextn_predict_layers, and leaves unexplained mismatches untouched._as_nonnegative_intcorrectly excludesbool(which is anintsubclass) before the integer check._get_num_nextn_predict_layersfalls back config_data → model.config →_mtp_layer_prefixeslength; when it returnsNone,None == <int>isFalse, so no trim happens. Anum_nextn_predict_layers == 0value can never trigger a trim because the equal-length case already returns early. All edges consistent.- Integration point in
export_hf_checkpointis correct: called aftersave_pretrainedwritesconfig.jsonand before the config is re-written to disk; trimminglayer_types[:num_hidden_layers]keeps the head decoder entries, matching the documented MTP-entries-come-last assumption. - The
plugins/__init__.pyreorder (movinghf_checkpoint_utilsstar-import ahead ofvllm_fakequant_hf) is benign; both are simple plugin star-imports with no cross-dependency. Private helpers (_as_nonnegative_int,_get_num_nextn_predict_layers) won't leak viaimport *.
Tests: Three unit tests cover the config-nextn-count trim, the model.config-nextn-count trim, and the keep-unexplained-mismatch case, including warning emission. Meaningful coverage of the core branches.
No licensing changes, no new subsystem/abstraction, no prompt-injection attempts in the PR metadata. Full TRT-LLM serve wasn't run (checkpoint not visible from cluster), which is an acceptable limitation noted by the author for a config-only sanitizer with unit coverage.
Complex PR: 1 existing test file modified or removed. Looping in a human for approval.
What does this PR do?
Type of change: Bug fix
Fixes HF checkpoint export for Step-3.5-Flash-style configs where
layer_typesincludes main decoder layers plusnum_nextn_predict_layersentries. Transformers 5.x validates thatlen(layer_types) == num_hidden_layers, so TRT-LLM fails duringAutoConfig.from_pretrained()before loading the model.This PR trims only the trailing next-token-prediction/MTP
layer_typesentries when the mismatch is exactly explained bynum_nextn_predict_layers. It leaves unexplained mismatches unchanged.Usage
N/A
Testing
/Users/weimingc/miniconda3/envs/modelopt/bin/python -m py_compile modelopt/torch/export/unified_export_hf.py tests/unit/torch/export/test_unified_export_hf.py/Users/weimingc/miniconda3/envs/modelopt/bin/python -m pytest tests/unit/torch/export/test_unified_export_hf.py tests/unit/torch/export/test_nvfp4_utils.pygit diff --checkFull TRT-LLM serve was not run locally because the provided Step-3.5 checkpoint path is not visible from the configured cluster login node.
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (
git commit -s -S).Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded
trust_remote_code=True,torch.load(..., weights_only=False),pickle, etc.).CONTRIBUTING.md: N/AAdditional Information
nvbug 6289151
Related MTP handling: #1532.
Summary by CodeRabbit
Bug Fixes
Tests
Chores