
feat: add Gemma4 support #2224

Draft

sharonyu-115 wants to merge 19 commits into NVIDIA-NeMo:main from sharonyu-115:gemma4-support

Conversation


@sharonyu-115 sharonyu-115 commented Apr 7, 2026

What does this PR do?

Adds Gemma4 model support to NeMo-RL (addresses #2212).

Issues

Addresses #2212.

Usage

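A usage sketch for running one of the Gemma4 recipes added in this PR. The config filename comes from the recipe list below; the entrypoint script and flag are assumptions based on NeMo-RL's usual uv-based launch pattern and may differ.

```shell
# Hypothetical launch command -- entrypoint and flag names are assumptions;
# the recipe YAML name matches the one added in this PR.
uv run examples/run_grpo_math.py \
    --config examples/configs/recipes/llm/grpo-gemma4-e2b-it-1n8g-fsdp2-automodel.yaml
```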

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.


@copy-pr-bot

copy-pr-bot bot commented Apr 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@sharonyu-115
Contributor Author

/ok to test b3b4d3c

@sharonyu-115 sharonyu-115 added the CI:L1 Run doctests, unit tests, and functional tests label Apr 8, 2026
@zpqiu zpqiu changed the title Gemma4 support feat: add Gemma4 support Apr 8, 2026
@zpqiu zpqiu added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Apr 8, 2026
@sharonyu-115 sharonyu-115 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Apr 8, 2026
@zpqiu zpqiu marked this pull request as ready for review April 8, 2026 05:36
@zpqiu zpqiu requested review from a team as code owners April 8, 2026 05:36
@zpqiu zpqiu added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Apr 8, 2026
@zpqiu zpqiu marked this pull request as draft April 8, 2026 05:37
Contributor

zpqiu commented Apr 8, 2026

/ok to test 360cb8a

@sharonyu-115
Contributor Author

/ok to test 7353904

@sharonyu-115
Contributor Author

/ok to test e90e80c

@sharonyu-115
Contributor Author

/ok to test 04fc41c

@sharonyu-115
Contributor Author

/ok to test 9d9fd36

sharonyu-115 and others added 18 commits April 18, 2026 06:43
- Register gemma4 in AUTOMODEL_FACTORY (utils.py)
- Add KV sharing support: use_cache=True for models with num_kv_shared_layers > 0 (train.py)
- Freeze visual/audio encoders for text-only training to fix checkpoint resume (setup.py)
- Inject mm_token_type_ids for Gemma4 text-only inputs (train.py, dtensor_policy_worker.py)
- Extend skip_tokenizer_init workaround to Gemma4ForConditionalGeneration (vllm_worker.py)
- Bump transformers 5.3.0 -> 5.5.0, vllm 0.17.1 -> 0.19.0 (pyproject.toml)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
- grpo-gemma4-e2b-it-1n8g-fsdp2-automodel.yaml: E2B-it on 1 node
- dapo-gemma4-31b-it-4n8g-fsdp2.yaml: 31B-it DAPO with dynamic sampling

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
- Apply ruff-format line wrapping to setup.py, train.py, dtensor_policy_worker.py
- Minimize recipe YAMLs (remove redundant defaults matching base config)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Regenerated lockfile to match pyproject.toml dependency bumps.
Required for CI build container (uv sync --locked).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Regenerated with pinned Automodel submodule (92635e74) in
CI base image (cuda-dl-base:25.05) to match CI's uv sync --locked.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Update Automodel submodule from 92635e74 to 3a3f6858 (latest main).
Fixes CI test_automodel_types.py TypeError caused by check_model_inputs
API change in transformers 5.5.0. Regenerate uv.lock.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
vLLM 0.19.0 refactored chat preprocessing from OpenAIServingChat into a
new OpenAIServingRender service class. This broke the NeMo-RL HTTP server
in two ways: (1) OpenAIServingChat/Tokenization now require an
openai_serving_render constructor arg, and (2) the _preprocess_chat method
override was silently dead since it moved to OpenAIServingRender.

Move the prefix-token replacement logic into a NeMoRLOpenAIServingRender
subclass that overrides preprocess_chat, and pass it to both serving classes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
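The refactor described above can be illustrated structurally. These are stand-in stubs, not vLLM's real classes: the actual constructors take many more arguments, and only the wiring (override moves to the render subclass, which is then passed to the serving class) is shown.

```python
class OpenAIServingRender:
    """Stub for vLLM 0.19.0's render service, which now owns chat preprocessing."""
    def preprocess_chat(self, messages):
        return messages


class NeMoRLOpenAIServingRender(OpenAIServingRender):
    """Carries the prefix-token replacement that used to live in a
    _preprocess_chat override on OpenAIServingChat (now silently dead there)."""
    def preprocess_chat(self, messages):
        messages = super().preprocess_chat(messages)
        # prefix-token replacement logic would run here
        return messages


class OpenAIServingChat:
    """Stub: in 0.19.0 the serving class takes the render service as an arg."""
    def __init__(self, openai_serving_render):
        self.render = openai_serving_render
```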
Add missing test suite scripts for the two new Gemma 4 recipe configs
to fix the test_all_recipe_yamls_accounted_for_in_test_suites CI check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Temporarily remove --exitfirst (-x) from pytest addopts so CI runs
all tests instead of stopping at the first failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
…sion)

Pin Automodel submodule to sharonyu-115/Automodel gemma4-support branch
at 6d5971c3, which reverts the 2013a4dd FSDP2 prefetch commit that causes
Gemma4 26B-A4B MoE expert weights to be float32 (RuntimeError in
grouped_gemm). See automodel_gemma4_moe_dtype_bug.md for details.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Picks up 7804b703 which passes unnormalized residual to the MoE gate,
fixing incorrect routing that caused gen_kl_error=0.116 on 26B-A4B.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Align naming with other Gemma4 recipes (E2B, E4B, 26B-A4B) that use
the -automodel suffix. Renames DAPO config, test script, and updates
disabled.txt reference.

Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Use on-policy training (batch_size=512), activation checkpointing off,
and gpu_memory_utilization=0.5. This config showed clear convergence:
val accuracy 13.8% → 21.0% over 224 steps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
The Automodel submodule now tracks the fix/gemma4-moe-gate-double-norm
branch on the shuangy fork, which is rebased on upstream main
(bd942f20) and carries only the single MoE-gate double-norm fix plus
its regression tests. This drops the three transformers 5.5 compat
patches that have since landed upstream (NVIDIA-NeMo#1734, NVIDIA-NeMo#1769, NVIDIA-NeMo#1764) and
collapses our carry-stack from four patches down to one.

gemma4-support is preserved on the fork as an A/B fallback — flip
.gitmodules branch + re-checkout the submodule to swap.

Signed-off-by: Shuang Yu <shuangy@nvidia.com>
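The A/B fallback swap mentioned above could look roughly like this. The submodule path (`3rdparty/Automodel`) is an assumption; the git subcommands themselves are standard.

```shell
# Hypothetical fallback swap -- submodule path is an assumption.
git submodule set-branch --branch gemma4-support -- 3rdparty/Automodel
git submodule update --init --remote -- 3rdparty/Automodel
```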
The Gemma4 MoE gate double-norm fix (PR NVIDIA-NeMo#1895) and the transformers
5.5 compat patches have all landed on NVIDIA-NeMo/Automodel main. Drop
the sharonyu-115 fork / fix branch and track upstream main directly.

New submodule SHA: fb62eb48 (upstream/main tip).

Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Refresh uv.lock after bumping Automodel submodule to NVIDIA-NeMo main
(fb62eb48). Generated with the 2026-04-17 nemo-rl nightly container.
No pyproject.toml changes — transformers==5.5.0 and vllm==0.19.0 pins
are preserved.

Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Commit 684dc56 renamed the DAPO Gemma4 31B recipe to add the
-automodel suffix but left the original files behind. Remove them to
complete the rename.

Signed-off-by: Shuang Yu <shuangy@nvidia.com>
DAPO has converged on DAPOMath17K + AIME2024 eval for Gemma4 E2B-it on
1N8G, delivering better alignment and stability than GRPO on the same
setup. Retire the GRPO E2B recipe and enable DAPO E2B in the nightly
suite.

- Add tests/test_suites/llm/dapo-gemma4-e2b-it-1n8g-fsdp2-automodel.sh
  (1 node, 20 steps, ~90 min). Metric thresholds derived from wandb run
  kouzgkf3 (dapo-gemma4-e2b-it-1n8g-fsdp2-automodel-2k-onpolicy-noactckpt)
  under ys_fishcool-nvidia/nemorl-gemma4:
    median(train/token_mult_prob_error) < 1.1   (baseline 1.0093)
    train/token_mult_prob_error[20]     < 1.05  (baseline 1.0094)
    train/reward[20]                    > -1.15 (baseline -1.087)
    train/filtered_reward[20]           > -1.10 (baseline -1.028)
    train/gen_kl_error[20]              < 0.001 (baseline 5.4e-4)
- Register the new script in tests/test_suites/nightly.txt under a new
  DAPO section.
- Remove tests/test_suites/disabled.txt entry for the deleted GRPO E2B
  script; keep the 31B DAPO entry (no convergence baseline yet).
- Delete the GRPO E2B recipe and test script.

Signed-off-by: Shuang Yu <shuangy@nvidia.com>
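The convergence gate listed above can be sketched as a single predicate. The threshold values are taken straight from the commit message; the real test script's implementation (shell plus wandb metric export) is assumed to differ, and the function name is invented.

```python
from statistics import median

def passes_e2b_dapo_gates(m: dict) -> bool:
    """Check the five metric thresholds from the nightly DAPO E2B suite
    (indices [-1] stand in for the commit message's step-20 values)."""
    return (
        median(m["train/token_mult_prob_error"]) < 1.1
        and m["train/token_mult_prob_error"][-1] < 1.05
        and m["train/reward"][-1] > -1.15
        and m["train/filtered_reward"][-1] > -1.10
        and m["train/gen_kl_error"][-1] < 0.001
    )
```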
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: e0e686d (PR #2224 from gemma4-support)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@sharonyu-115
Contributor Author

/ok to test e0e686d

- Minimize DAPO Gemma4 E2B-it recipe to satisfy configs-minimize-check hook.
- Guard `_needs_kv_cache_for_shared_layers` against non-int
  `num_kv_shared_layers` (fixes `TypeError` when tests pass a bare
  MagicMock as the model). Noted in a TODO that this workaround can be
  removed once transformers>=5.5.2 (huggingface/transformers#45312) lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 16f3747 (PR #2224 from gemma4-support)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@sharonyu-115
Contributor Author

/ok to test 16f3747
