feat: MoE model benchmarks, LoRA configs, and flops calculators (#1676)
Merged
Conversation
Force-pushed from 9778991 to 37b8848
Author: /claude review
Author: /ok to test 37b8848
Author: /claude review
Author: /ok to test 5599a9a
Contributor: /ok to test bde8305
Contributor: /ok to test 78be559
Contributor: @hemildesai do you know why dos & pip install fail?
Add TE+DeepEP benchmark configs for GLM-4.7-Flash, GLM-4.7, GLM-5, MiniMax-M2.5, Mistral Small 4, Qwen3.5 MoE, Qwen3 VL 235B, and Step3.5-Flash.

- Fix VL composite config handling in the benchmark recipe, flops utils, and pipeline sharding to fall back to text_config for models where vocab_size, hidden_size, and num_hidden_layers are nested under text_config (e.g. Qwen3.5 MoE, Qwen3 VL 235B).
- Add support for custom config classes (e.g. DeepseekV32Config) in benchmark vocab_size inference by respecting the _target_ field.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Add TE+DeepEP LoRA benchmark configs with moe_rank_scaling enabled for GLM-4.7-Flash (30B) and Qwen3.5 MoE (35B-A3B). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hemildesai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Rename te_deepep_lora configs to _lora for consistency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com> Signed-off-by: hemildesai <hemild@nvidia.com>
Switch to the TE attention backend, the hybridep dispatcher with 64 SMs, and torch_mm experts; disable activation checkpointing & reshard_after_forward; and reduce local batch size to 1 for a better memory/perf profile.

Signed-off-by: hemildesai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix Kimi K2.5 VL model to handle missing vision_config attributes by falling back to vt_*-prefixed and mm_hidden_size attributes
- Enable reshard_after_forward for the GLM-4.7-Flash LoRA config
- Use bf16 reduce_dtype in the MoE parallelizer FSDP mixed precision policy

Signed-off-by: hemildesai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…itive

The HuggingFace model ID nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 triggers the Base64 High Entropy String detector as a false positive.

Signed-off-by: hemildesai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…faults" This reverts commit a185e34. Signed-off-by: hemildesai <hemild@nvidia.com>
…alues

Add 38 new tests covering:
- minimax_m2_flops (basic, gbs scaling, MTP, precomputed)
- qwen3_5_flops MoE and Dense variants (GDN/full attention hybrid)
- mla_moe_flops (Kimi K2, GLM-5, Mistral Small 4)
- step3_5_flash_flops (hybrid full/SWA + MoE)
- deepseekv3_flops DSA sparse attention extension
- _mamba_layer_flops refactored formula
- _hybrid_model_flops conditional accumulation
- VL composite config text_config fallback in qwen3_flops and qwen3_5_flops
- get_flops_formula_for_hf_config dispatch for new model types

Update nemotronh precomputed values to match the refactored Mamba scan formula.

Signed-off-by: hemildesai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… to benchmark configs Add Apache 2.0 copyright header, recipe: BenchmarkingRecipeForNextTokenPrediction declaration, and automodel run instructions to all 21 new benchmark configs. Signed-off-by: hemildesai <hemild@nvidia.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hemildesai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ode, tests

- Fix the _infer_vocab_size string _target_ import: use rsplit to correctly extract module_path and class_name instead of discarding the class name
- Remove duplicate trust_remote_code under the config: section in 5 YAML configs (step_3.5_flash, qwen3.5_moe, qwen3.5_moe_te_deepep, qwen3.5_moe_lora, step35flash_lora)
- Add unit tests for the string _target_ and VL text_config fallback paths in _infer_vocab_size

Signed-off-by: hemildesai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
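The rsplit fix can be illustrated with a small sketch; import_config_class is a hypothetical stand-in for the PR's internal helper, demonstrated with a stdlib class so it runs on its own.

```python
import importlib


def import_config_class(target: str):
    """Resolve a dotted _target_ string such as
    'transformers.DeepseekV32Config' into the class it names.

    The bug described above: splitting from the left (split(".", 1))
    breaks the path at the wrong boundary and loses the class name;
    rsplit(".", 1) always separates the module path from the final
    class-name segment.
    """
    module_path, class_name = target.rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Demonstrated with a stdlib class so the sketch is self-contained:
print(import_config_class("collections.OrderedDict").__name__)  # OrderedDict
```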
- Use from_pretrained() instead of a direct constructor call for callable config targets in _infer_vocab_size
- Guard the mla_moe_flops VL text_config fallback with a num_hidden_layers check, consistent with the qwen3_flops and qwen3_5_flops patterns

Signed-off-by: hemildesai <hemild@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verify that VL configs without hidden_size at root level correctly fall back to text_config for pipeline stage shape precomputation. Signed-off-by: hemildesai <hemild@nvidia.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hemildesai <hemild@nvidia.com>
When moe_layer_freq is an integer, use it as a frequency (every Nth layer is MoE) rather than marking all post-dense layers as MoE. Signed-off-by: hemildesai <hemild@nvidia.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hemildesai <hemild@nvidia.com>
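The corrected semantics can be sketched as below, following the common DeepSeek-style convention (layers below a dense prefix stay dense; thereafter every Nth layer is MoE). Function and parameter names are illustrative, not the PR's actual code.

```python
# Hedged sketch of integer moe_layer_freq as a frequency rather than
# "all post-dense layers are MoE".
def moe_layer_mask(num_layers, first_k_dense, moe_layer_freq):
    """Return a per-layer list where True marks an MoE layer."""
    mask = []
    for layer_idx in range(num_layers):
        is_moe = (
            layer_idx >= first_k_dense          # dense prefix stays dense
            and layer_idx % moe_layer_freq == 0  # every Nth layer is MoE
        )
        mask.append(is_moe)
    return mask


# freq=2: every 2nd layer after the dense prefix is MoE
print(moe_layer_mask(6, 1, 2))  # [False, False, True, False, True, False]
# freq=1 reproduces the old behavior: all post-dense layers are MoE
print(moe_layer_mask(4, 1, 1))  # [False, True, True, True]
```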
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Signed-off-by: hemildesai <hemild@nvidia.com>
The previous commit accidentally broke _infer_vocab_size by removing the callable config_target branch and introducing duplicate code with bad indentation, causing a SyntaxError. Restore the correct if/elif structure. Signed-off-by: hemildesai <hemild@nvidia.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: hemildesai <hemild@nvidia.com>
Author: /ok to test e3b3c1f
akoumpa approved these changes on Apr 7, 2026
Contributor: automation review is triggered for
Summary

- Add flops_utils functions for MiniMax-M2, Qwen3.5 (hybrid GDN/full attention), Step-3.5-Flash, Mistral Small 4 (MLA), and the GLM-4/GLM-4-MoE families; extend the DeepSeek V3 calculator with DSA sparse attention support; improve Mamba layer FLOPs formula clarity
- Update _precompute_stage_shapes in pipeline parallelism to fall back to text_config for VL composite configs; add the text_config fallback in qwen3_flops and qwen3_5_flops
- Fall back to vt_*-prefixed attributes and mm_hidden_size when standard vision config attributes are missing
- Extend the _infer_vocab_size() helper to support custom config classes and VL composite configs
- Change reduce_dtype from fp32 to bf16

Test plan
🤖 Generated with Claude Code