Fix Nemotron-H: add mlp layer type support #45300
w4nderlust wants to merge 2 commits into huggingface:main
Conversation
Nemotron-H models use standalone MLP layers in their `hybrid_override_pattern` (the `-` character), but the config parser, validators, and modeling code only handle mamba/attention/moe. This means every Nemotron-H model on the Hub fails to load: `KeyError: '-'` in `_pattern_to_list()`.

Changes:
- `_pattern_to_list`: add `"-"` -> `"mlp"` mapping
- `validate_layers_block_type`: add `"mlp"` to `valid_types`
- `MIXER_TYPES`: add `"mlp"` -> `NemotronHMLP`
- `block_type_to_mask`: add `"mlp"` -> `None`
- `NemotronHMLP.__init__`: accept `**kwargs` (`layer_idx` passed by `NemotronHBlock`)
- `ALLOWED_LAYER_TYPES`: add `"mlp"`
- `modular_nemotron_h.py`: same changes (source of truth for the modeling code)
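The failure mode described above can be sketched with a minimal stand-in for the pattern parser (illustrative names, not the actual transformers source): each pattern character maps to a layer type, and any character missing from the mapping raises `KeyError`.

```python
# Minimal sketch of the pattern parsing described in this PR. CHAR_TO_TYPE and
# pattern_to_list are illustrative stand-ins for Nemotron-H's internal mapping
# and _pattern_to_list, not the real transformers code.
CHAR_TO_TYPE = {
    "M": "mamba",
    "*": "attention",
    "E": "moe",
    "-": "mlp",  # the entry this PR adds; without it, '-' raises KeyError
}

def pattern_to_list(pattern):
    # Unknown characters raise KeyError -- exactly how loading failed for '-'
    # before the mapping entry existed.
    return [CHAR_TO_TYPE[c] for c in pattern]

print(pattern_to_list("M-M*"))  # ['mamba', 'mlp', 'mamba', 'attention']
```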
Interesting, and seems like a big gap in our tests. How did we miss this one, and is it possible to make a regression test for it that doesn't have to load a multi-billion parameter model?
- Add 4 regression tests that exercise the MLP layer type end-to-end: config parsing, forward pass, generation, and real Nemotron-H patterns
- Fix `_list_to_pattern` missing `"mlp": "-"` mapping (would crash on roundtrip)
- Fix `_check_past_key_values_for_generate` to handle the `"mlp"` layer type
- Extend `test_pattern_conversion_methods` with MLP roundtrip coverage

All tests use tiny models (hidden_size=32, ~5-8 layers): no downloads needed.
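The roundtrip bug fixed here can be illustrated with a self-contained sketch (hypothetical names, not the PR's actual test code): converting a pattern to a layer-type list and back must preserve it, which fails if the reverse mapping lacks `"mlp": "-"`.

```python
# Hedged sketch of the pattern roundtrip the new tests cover. TYPE_TO_CHAR /
# list_to_pattern are illustrative stand-ins for _list_to_pattern's mapping.
TYPE_TO_CHAR = {
    "mamba": "M",
    "attention": "*",
    "moe": "E",
    "mlp": "-",  # missing before the fix -> KeyError on roundtrip
}
CHAR_TO_TYPE = {char: layer_type for layer_type, char in TYPE_TO_CHAR.items()}

def list_to_pattern(layer_types):
    return "".join(TYPE_TO_CHAR[t] for t in layer_types)

# A small MLP-containing pattern must survive a full roundtrip.
pattern = "M-M*-E"
layer_types = [CHAR_TO_TYPE[c] for c in pattern]
assert list_to_pattern(layer_types) == pattern
```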
[For maintainers] Suggested jobs to run (before merge): run-slow: nemotron_h
I believe the gap happened because the existing test suite only exercised patterns with M (mamba), * (attention), and E (moe), but all the actual Nemotron-H models on the Hub use - (standalone MLP) in their hybrid_override_pattern. The NemotronHMLP class was already in the codebase; it just wasn't wired into the dispatcher or validation. NVIDIA's trust_remote_code path handled it, which likely delayed reports. I've pushed a follow-up commit with regression tests that don't need any model downloads: they instantiate tiny models (hidden_size=32, 5-8 layers) with MLP-containing patterns and exercise the full path.
Also found and fixed two secondary bugs while writing the tests:
Let me know if these tests sound good; happy to change them if not. By the way, some of the CI errors seem flaky/transient, not caused by my changes.
This is a duplicate of #44763, please check it out.
Happy to change it in whatever way you prefer, but right now the Nemotron models are unusable. I would either merge this or figure out a better approach and I'll implement it; please don't leave it hanging.
What does this PR do?
Nemotron-H models use standalone MLP layers in their `hybrid_override_pattern` (the `-` character), but the config parser, validators, and modeling code only know about mamba/attention/moe. This means every Nemotron-H model on the Hub (`nvidia/Nemotron-H-8B-Base-8K`, `nvidia/Nemotron-H-56B-Base-8K`, and all Nemotron-3-Nano variants) crashes on load: `KeyError: '-'` in `_pattern_to_list()`.

NVIDIA's own hub modeling code (the `trust_remote_code=True` path) handles `-` as `"mlp"` and dispatches to `NemotronHMLP` in its mixer, but the native transformers integration missed it. The `NemotronHMLP` class already exists in the codebase; it just wasn't wired up.

The fix is small, 4 files, all one-liners:
- `configuration_nemotron_h.py`: add `"-": "mlp"` to the `_pattern_to_list` mapping; add `"mlp"` to `valid_types` in `validate_layers_block_type`
- `modeling_nemotron_h.py`: add `"mlp": NemotronHMLP` to `MIXER_TYPES`; add `"mlp": None` to `block_type_to_mask`; accept `**kwargs` in `NemotronHMLP.__init__` (because `NemotronHBlock` passes `layer_idx=`)
- `configuration_utils.py`: add `"mlp"` to `ALLOWED_LAYER_TYPES`
- `modular_nemotron_h.py`: same changes (source of truth for the modeling code)
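The `**kwargs` change can be sketched in isolation. The classes below are tiny illustrative stand-ins (not the real `NemotronHMLP`/`NemotronHBlock`): the block looks up its mixer class by layer type and passes `layer_idx=` to every constructor, so a mixer that doesn't accept it must swallow it via `**kwargs`.

```python
# Illustrative sketch of the mixer dispatch described above; TinyMLP,
# TinyAttention, and build_block are hypothetical stand-ins.
class TinyMLP:
    def __init__(self, hidden_size, **kwargs):
        # **kwargs swallows layer_idx (and anything else the block passes),
        # mirroring the PR's change to NemotronHMLP.__init__.
        self.hidden_size = hidden_size

class TinyAttention:
    def __init__(self, hidden_size, layer_idx=None, **kwargs):
        self.hidden_size = hidden_size
        self.layer_idx = layer_idx

# Mirrors the MIXER_TYPES wiring: "mlp" must be present for '-' layers.
MIXER_TYPES = {"attention": TinyAttention, "mlp": TinyMLP}

def build_block(layer_type, hidden_size, layer_idx):
    # The block always passes layer_idx=, regardless of mixer type.
    return MIXER_TYPES[layer_type](hidden_size, layer_idx=layer_idx)

mlp = build_block("mlp", 32, layer_idx=3)   # no TypeError thanks to **kwargs
attn = build_block("attention", 32, layer_idx=4)
```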
Tested locally: loaded `nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16` (pattern `M-M-M-MM-M-M*-M-M*-M-M-M*-M-M-MM*-MMM-M-M-`), ran `model.generate()` with a chat prompt, and got coherent output. Before the fix it crashes on config parsing. Also ran `make fix-repo` checks (ruff, auto_mappings, copies, dummies) -- all pass.
Who can review?
@ArthurZucker @Cyrilvallez