
feat: MoE model benchmarks, LoRA configs, and flops calculators#1676

Merged
akoumpa merged 17 commits into main from hemil/moe-benchmarks on Apr 7, 2026
Conversation

@hemildesai
Contributor

Summary

  • Benchmark configs: Add pre-training benchmark configs for 8 new MoE models: GLM-4.7, GLM-4.7-Flash, GLM-5, MiniMax-M2.5, Mistral Small 4, Qwen3.5-MoE, Qwen3-VL-235B, Step-3.5-Flash — all with TE + DeepEP backends
  • LoRA benchmark configs: Add LoRA (PEFT) benchmark configs for 12 MoE models covering DeepSeek V3.2, GLM-4.7/4.7-Flash/5, Kimi-K2.5-VL, MiniMax-M2.5, Mistral Small 4, Nemotron Super V3, Qwen3.5-MoE, Qwen3-VL-235B, Step-3.5-Flash
  • FLOPs calculators: Add new flops_utils functions for MiniMax-M2, Qwen3.5 (hybrid GDN/full attention), Step-3.5-Flash, Mistral Small 4 (MLA), GLM-4/GLM-4-MoE families; extend DeepSeek V3 calculator with DSA sparse attention support; improve Mamba layer FLOPs formula clarity
  • VL model support: Fix _precompute_stage_shapes in pipeline parallelism to fall back to text_config for VL composite configs; add text_config fallback in qwen3_flops and qwen3_5_flops
  • Kimi K2.5 VL: Improve robustness by falling back to vt_* prefixed attributes and mm_hidden_size when standard vision config attributes are missing
  • Benchmark recipe: Extract _infer_vocab_size() helper to support custom config classes and VL composite configs
  • gptoss-120b config: Tune for hybridep dispatcher with TE attention, reduce local batch size, disable activation checkpointing
  • MoE parallelizer: Switch FSDP reduce_dtype from fp32 to bf16
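The FLOPs calculators above are analytic formulas derived from model configs. As a rough illustration of the general approach (not the repository's actual formulas), a minimal sketch of the standard 6·N·T training-FLOPs approximation for an MoE model, where only activated parameters count toward N; function and parameter names here are hypothetical, and real calculators also handle GQA, MLA, shared experts, and MTP:

```python
def moe_active_params(hidden_size, moe_ffn_hidden, num_layers,
                      num_experts_per_tok, vocab_size):
    """Very rough count of parameters activated per token for a
    simplified MoE transformer: attention projections, top-k expert
    MLPs (gate/up/down), and the embedding/output matrices."""
    attn = 4 * hidden_size * hidden_size                      # q, k, v, o projections
    experts = 3 * hidden_size * moe_ffn_hidden * num_experts_per_tok
    embed = 2 * vocab_size * hidden_size                      # embedding + lm_head
    return num_layers * (attn + experts) + embed

def moe_train_flops_per_step(active_params, tokens_per_step):
    """Standard 6 * N * T training-FLOPs approximation, with N
    restricted to activated parameters (forward 2NT + backward 4NT)."""
    return 6 * active_params * tokens_per_step
```

Validating such formulas against known reference values (as the test plan below does) catches both formula bugs and config-field mismatches.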

Test plan

  • Verify pre-training benchmark runs for representative models (e.g., Qwen3.5-MoE, GLM-4.7-Flash)
  • Verify LoRA benchmark runs for representative models
  • Validate FLOPs calculations against known reference values
  • Test VL composite config path (Kimi-K2.5-VL, Qwen3-VL) in pipeline parallelism
  • Confirm Kimi K2.5 VL model loads correctly with both old and new config formats

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot bot commented Apr 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@hemildesai hemildesai requested a review from a team as a code owner April 3, 2026 21:43
@hemildesai hemildesai force-pushed the hemil/moe-benchmarks branch 2 times, most recently from 9778991 to 37b8848 on April 3, 2026 21:50
@hemildesai
Contributor Author

/claude review

@hemildesai
Contributor Author

/ok to test 37b8848

@hemildesai
Contributor Author

/claude review

@hemildesai
Contributor Author

/ok to test 5599a9a

@akoumpa
Contributor

akoumpa commented Apr 6, 2026

/ok to test bde8305

@akoumpa
Contributor

akoumpa commented Apr 6, 2026

/ok to test 78be559

@akoumpa
Contributor

akoumpa commented Apr 6, 2026

@hemildesai do you know why docs & pip install fail?

hemildesai and others added 17 commits April 6, 2026 16:51
Add TE+DeepEP benchmark configs for GLM-4.7-Flash, GLM-4.7, GLM-5,
MiniMax-M2.5, Mistral Small 4, Qwen3.5 MoE, Qwen3 VL 235B, and
Step3.5-Flash.

Fix VL composite config handling in benchmark recipe, flops utils, and
pipeline sharding to fall back to text_config for models where
vocab_size, hidden_size, and num_hidden_layers are nested under
text_config (e.g. Qwen3.5 MoE, Qwen3 VL 235B).

Add support for custom config classes (e.g. DeepseekV32Config) in
benchmark vocab_size inference by respecting the _target_ field.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
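The text_config fallback described in this commit can be sketched as follows; the helper name and shape are illustrative, not the repository's exact `_infer_vocab_size`, but they show the pattern: VL composite configs nest fields like `vocab_size` under `config.text_config`, so fall back there only when the root config lacks the field.

```python
from types import SimpleNamespace

def infer_vocab_size(config):
    """Return vocab_size from the root config, falling back to
    config.text_config for VL composite configs (hypothetical sketch)."""
    vocab = getattr(config, "vocab_size", None)
    if vocab is None and hasattr(config, "text_config"):
        vocab = getattr(config.text_config, "vocab_size", None)
    return vocab

# Plain text config: field at the root.
text_only = SimpleNamespace(vocab_size=50000)
# VL composite config: field nested under text_config.
vl_composite = SimpleNamespace(text_config=SimpleNamespace(vocab_size=32000))
```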
Add TE+DeepEP LoRA benchmark configs with moe_rank_scaling enabled
for GLM-4.7-Flash (30B) and Qwen3.5 MoE (35B-A3B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Rename te_deepep_lora configs to _lora for consistency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Switch to TE attention backend, hybridep dispatcher with 64 SMs,
torch_mm experts, disable activation checkpointing & reshard_after_forward,
and reduce local batch size to 1 for better memory/perf profile.

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
- Fix Kimi K2.5 VL model to handle missing vision_config attributes
  by falling back to vt_* prefixed and mm_hidden_size attributes
- Enable reshard_after_forward for GLM-4.7-Flash LoRA config
- Use bf16 reduce_dtype in MoE parallelizer FSDP mixed precision policy

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
…itive

The HuggingFace model ID nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
triggers the Base64 High Entropy String detector as a false positive.

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
…faults"

This reverts commit a185e34.

Signed-off-by: hemildesai <hemild@nvidia.com>
…alues

Add 38 new tests covering:
- minimax_m2_flops (basic, gbs scaling, MTP, precomputed)
- qwen3_5_flops MoE and Dense variants (GDN/full attention hybrid)
- mla_moe_flops (Kimi K2, GLM-5, Mistral Small 4)
- step3_5_flash_flops (hybrid full/SWA + MoE)
- deepseekv3_flops DSA sparse attention extension
- _mamba_layer_flops refactored formula
- _hybrid_model_flops conditional accumulation
- VL composite config text_config fallback in qwen3_flops and qwen3_5_flops
- get_flops_formula_for_hf_config dispatch for new model types

Update nemotronh precomputed values to match refactored Mamba scan formula.

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
… to benchmark configs

Add Apache 2.0 copyright header, recipe: BenchmarkingRecipeForNextTokenPrediction
declaration, and automodel run instructions to all 21 new benchmark configs.

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
…ode, tests

- Fix _infer_vocab_size string _target_ import: use rsplit to correctly
  extract module_path and class_name instead of discarding class name
- Remove duplicate trust_remote_code under config: section in 5 YAML configs
  (step_3.5_flash, qwen3.5_moe, qwen3.5_moe_te_deepep, qwen3.5_moe_lora,
  step35flash_lora)
- Add unit tests for string _target_ and VL text_config fallback paths in
  _infer_vocab_size

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
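The `rsplit` fix above addresses a classic dotted-path bug: splitting from the left discards the class name, while `rsplit(".", 1)` keeps the full module path. A minimal sketch (the function name is illustrative; `DeepseekV32Config` is the example custom config class mentioned earlier in this PR):

```python
def split_target(target: str):
    """Split a dotted _target_ like 'pkg.module.ClassName' into
    (module_path, class_name). rsplit from the right keeps the full
    module path intact, unlike a naive split('.') that loses pieces."""
    module_path, class_name = target.rsplit(".", 1)
    return module_path, class_name
```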
- Use from_pretrained() instead of direct constructor call for callable
  config targets in _infer_vocab_size
- Guard mla_moe_flops VL text_config fallback with num_hidden_layers check,
  consistent with qwen3_flops and qwen3_5_flops patterns

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Verify that VL configs without hidden_size at root level correctly
fall back to text_config for pipeline stage shape precomputation.

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
When moe_layer_freq is an integer, use it as a frequency (every Nth
layer is MoE) rather than marking all post-dense layers as MoE.

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
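The integer `moe_layer_freq` semantics described above can be sketched as follows; parameter names are illustrative and the exact offset convention in the repository may differ, but the point is the frequency interpretation versus "all post-dense layers are MoE":

```python
def is_moe_layer(layer_idx, first_k_dense, moe_layer_freq):
    """After the leading dense layers, an integer moe_layer_freq means
    every Nth layer is MoE; a list gives explicit per-layer flags
    (hypothetical sketch of the behavior described above)."""
    if layer_idx < first_k_dense:
        return False
    if isinstance(moe_layer_freq, int):
        return (layer_idx - first_k_dense) % moe_layer_freq == 0
    return bool(moe_layer_freq[layer_idx])  # list form: per-layer flags
```

With `first_k_dense=1` and `moe_layer_freq=2`, layers 1, 3, 5, … are MoE; the buggy interpretation would have marked layers 1, 2, 3, … all as MoE.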
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
The previous commit accidentally broke _infer_vocab_size by removing the
callable config_target branch and introducing duplicate code with bad
indentation, causing a SyntaxError. Restore the correct if/elif structure.

Signed-off-by: hemildesai <hemild@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
@hemildesai
Contributor Author

/ok to test e3b3c1f

@akoumpa
Contributor

akoumpa commented Apr 7, 2026

An automation review was triggered because .github/workflows/config/.secrets.baseline changed. I'm FMing to save everyone time.


Labels

r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.


3 participants