
feat: MoE model benchmarks, LoRA configs, and flops calculators#1676

Merged
akoumpa merged 17 commits into main from hemil/moe-benchmarks on Apr 7, 2026
Conversation

@hemildesai
Contributor

Summary

  • Benchmark configs: Add pre-training benchmark configs for 8 new MoE models: GLM-4.7, GLM-4.7-Flash, GLM-5, MiniMax-M2.5, Mistral Small 4, Qwen3.5-MoE, Qwen3-VL-235B, Step-3.5-Flash — all with TE + DeepEP backends
  • LoRA benchmark configs: Add LoRA (PEFT) benchmark configs for 12 MoE models covering DeepSeek V3.2, GLM-4.7/4.7-Flash/5, Kimi-K2.5-VL, MiniMax-M2.5, Mistral Small 4, Nemotron Super V3, Qwen3.5-MoE, Qwen3-VL-235B, Step-3.5-Flash
  • FLOPs calculators: Add new flops_utils functions for MiniMax-M2, Qwen3.5 (hybrid GDN/full attention), Step-3.5-Flash, Mistral Small 4 (MLA), GLM-4/GLM-4-MoE families; extend DeepSeek V3 calculator with DSA sparse attention support; improve Mamba layer FLOPs formula clarity
  • VL model support: Fix _precompute_stage_shapes in pipeline parallelism to fall back to text_config for VL composite configs; add text_config fallback in qwen3_flops and qwen3_5_flops
  • Kimi K2.5 VL: Improve robustness by falling back to vt_* prefixed attributes and mm_hidden_size when standard vision config attributes are missing
  • Benchmark recipe: Extract _infer_vocab_size() helper to support custom config classes and VL composite configs
  • gptoss-120b config: Tune for hybridep dispatcher with TE attention, reduce local batch size, disable activation checkpointing
  • MoE parallelizer: Switch FSDP reduce_dtype from fp32 to bf16
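The FLOPs calculators above are analytic formulas derived from model configs. As a rough illustration of the general approach (not the repository's actual formulas), a minimal sketch of the standard 6·N·T training-FLOPs approximation for an MoE model, where only activated parameters count toward N; function and parameter names here are hypothetical, and real calculators also handle GQA, MLA, shared experts, and MTP:

```python
def moe_active_params(hidden_size, moe_ffn_hidden, num_layers,
                      num_experts_per_tok, vocab_size):
    """Very rough count of parameters activated per token for a
    simplified MoE transformer: attention projections, top-k expert
    MLPs (gate/up/down), and the embedding/output matrices."""
    attn = 4 * hidden_size * hidden_size                      # q, k, v, o projections
    experts = 3 * hidden_size * moe_ffn_hidden * num_experts_per_tok
    embed = 2 * vocab_size * hidden_size                      # embedding + lm_head
    return num_layers * (attn + experts) + embed

def moe_train_flops_per_step(active_params, tokens_per_step):
    """Standard 6 * N * T training-FLOPs approximation, with N
    restricted to activated parameters (forward 2NT + backward 4NT)."""
    return 6 * active_params * tokens_per_step
```

Validating such formulas against known reference values (as the test plan below does) catches both formula bugs and config-field mismatches.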

Test plan

  • Verify pre-training benchmark runs for representative models (e.g., Qwen3.5-MoE, GLM-4.7-Flash)
  • Verify LoRA benchmark runs for representative models
  • Validate FLOPs calculations against known reference values
  • Test VL composite config path (Kimi-K2.5-VL, Qwen3-VL) in pipeline parallelism
  • Confirm Kimi K2.5 VL model loads correctly with both old and new config formats

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot bot commented Apr 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@hemildesai hemildesai requested a review from a team as a code owner April 3, 2026 21:43
@hemildesai hemildesai force-pushed the hemil/moe-benchmarks branch 2 times, most recently from 9778991 to 37b8848 on April 3, 2026 21:50
@hemildesai
Contributor Author

/claude review

@hemildesai
Contributor Author

/ok to test 37b8848

@hemildesai
Contributor Author

/claude review

@hemildesai
Contributor Author

/ok to test 5599a9a

@akoumpa
Contributor

akoumpa commented Apr 6, 2026

/ok to test bde8305

@akoumpa
Contributor

akoumpa commented Apr 6, 2026

/ok to test 78be559

@akoumpa
Contributor

akoumpa commented Apr 6, 2026

@hemildesai do you know why docs & pip install fail?

hemildesai and others added 17 commits April 6, 2026 16:51
Add TE+DeepEP benchmark configs for GLM-4.7-Flash, GLM-4.7, GLM-5,
MiniMax-M2.5, Mistral Small 4, Qwen3.5 MoE, Qwen3 VL 235B, and
Step3.5-Flash.

Fix VL composite config handling in benchmark recipe, flops utils, and
pipeline sharding to fall back to text_config for models where
vocab_size, hidden_size, and num_hidden_layers are nested under
text_config (e.g. Qwen3.5 MoE, Qwen3 VL 235B).

Add support for custom config classes (e.g. DeepseekV32Config) in
benchmark vocab_size inference by respecting the _target_ field.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
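The text_config fallback described in this commit can be sketched as follows; the helper name and shape are illustrative, not the repository's exact `_infer_vocab_size`, but they show the pattern: VL composite configs nest fields like `vocab_size` under `config.text_config`, so fall back there only when the root config lacks the field.

```python
from types import SimpleNamespace

def infer_vocab_size(config):
    """Return vocab_size from the root config, falling back to
    config.text_config for VL composite configs (hypothetical sketch)."""
    vocab = getattr(config, "vocab_size", None)
    if vocab is None and hasattr(config, "text_config"):
        vocab = getattr(config.text_config, "vocab_size", None)
    return vocab

# Plain text config: field at the root.
text_only = SimpleNamespace(vocab_size=50000)
# VL composite config: field nested under text_config.
vl_composite = SimpleNamespace(text_config=SimpleNamespace(vocab_size=32000))
```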
Add TE+DeepEP LoRA benchmark configs with moe_rank_scaling enabled
for GLM-4.7-Flash (30B) and Qwen3.5 MoE (35B-A3B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Rename te_deepep_lora configs to _lora for consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Switch to TE attention backend, hybridep dispatcher with 64 SMs,
torch_mm experts, disable activation checkpointing & reshard_after_forward,
and reduce local batch size to 1 for better memory/perf profile.

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
- Fix Kimi K2.5 VL model to handle missing vision_config attributes
  by falling back to vt_* prefixed and mm_hidden_size attributes
- Enable reshard_after_forward for GLM-4.7-Flash LoRA config
- Use bf16 reduce_dtype in MoE parallelizer FSDP mixed precision policy

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
…itive

The HuggingFace model ID nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
triggers the Base64 High Entropy String detector as a false positive.

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
…faults"

This reverts commit a185e34.

Signed-off-by: hemildesai <hemild@nvidia.com>
…alues

Add 38 new tests covering:
- minimax_m2_flops (basic, gbs scaling, MTP, precomputed)
- qwen3_5_flops MoE and Dense variants (GDN/full attention hybrid)
- mla_moe_flops (Kimi K2, GLM-5, Mistral Small 4)
- step3_5_flash_flops (hybrid full/SWA + MoE)
- deepseekv3_flops DSA sparse attention extension
- _mamba_layer_flops refactored formula
- _hybrid_model_flops conditional accumulation
- VL composite config text_config fallback in qwen3_flops and qwen3_5_flops
- get_flops_formula_for_hf_config dispatch for new model types

Update nemotronh precomputed values to match refactored Mamba scan formula.

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
… to benchmark configs

Add Apache 2.0 copyright header, recipe: BenchmarkingRecipeForNextTokenPrediction
declaration, and automodel run instructions to all 21 new benchmark configs.

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
…ode, tests

- Fix _infer_vocab_size string _target_ import: use rsplit to correctly
  extract module_path and class_name instead of discarding class name
- Remove duplicate trust_remote_code under config: section in 5 YAML configs
  (step_3.5_flash, qwen3.5_moe, qwen3.5_moe_te_deepep, qwen3.5_moe_lora,
  step35flash_lora)
- Add unit tests for string _target_ and VL text_config fallback paths in
  _infer_vocab_size

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
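The `rsplit` fix above addresses a classic dotted-path bug: splitting from the left discards the class name, while `rsplit(".", 1)` keeps the full module path. A minimal sketch (the function name is illustrative; `DeepseekV32Config` is the example custom config class mentioned earlier in this PR):

```python
def split_target(target: str):
    """Split a dotted _target_ like 'pkg.module.ClassName' into
    (module_path, class_name). rsplit from the right keeps the full
    module path intact, unlike a naive split('.') that loses pieces."""
    module_path, class_name = target.rsplit(".", 1)
    return module_path, class_name
```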
- Use from_pretrained() instead of direct constructor call for callable
  config targets in _infer_vocab_size
- Guard mla_moe_flops VL text_config fallback with num_hidden_layers check,
  consistent with qwen3_flops and qwen3_5_flops patterns

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Verify that VL configs without hidden_size at root level correctly
fall back to text_config for pipeline stage shape precomputation.

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
When moe_layer_freq is an integer, use it as a frequency (every Nth
layer is MoE) rather than marking all post-dense layers as MoE.

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
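The integer `moe_layer_freq` semantics described above can be sketched as follows; parameter names are illustrative and the exact offset convention in the repository may differ, but the point is the frequency interpretation versus "all post-dense layers are MoE":

```python
def is_moe_layer(layer_idx, first_k_dense, moe_layer_freq):
    """After the leading dense layers, an integer moe_layer_freq means
    every Nth layer is MoE; a list gives explicit per-layer flags
    (hypothetical sketch of the behavior described above)."""
    if layer_idx < first_k_dense:
        return False
    if isinstance(moe_layer_freq, int):
        return (layer_idx - first_k_dense) % moe_layer_freq == 0
    return bool(moe_layer_freq[layer_idx])  # list form: per-layer flags
```

With `first_k_dense=1` and `moe_layer_freq=2`, layers 1, 3, 5, … are MoE; the buggy interpretation would have marked layers 1, 2, 3, … all as MoE.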
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
The previous commit accidentally broke _infer_vocab_size by removing the
callable config_target branch and introducing duplicate code with bad
indentation, causing a SyntaxError. Restore the correct if/elif structure.

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
@hemildesai
Contributor Author

/ok to test e3b3c1f

@akoumpa
Contributor

akoumpa commented Apr 7, 2026

An automation review was triggered because .github/workflows/config/.secrets.baseline changed. I'm FMing to save everyone time.


Labels

r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.


3 participants