
fix: skip embedding[padding_idx] = 0 with TP#1675

Merged
akoumpa merged 3 commits into main from akoumparouli/fix_zeroing_padd_idx
Apr 10, 2026

Conversation

@akoumpa
Contributor

@akoumpa akoumpa commented Apr 3, 2026

What does this PR do ?

Context:
HF may initialize models and zero the embedding row at padding_idx. When the embedding layer is row-wise sharded (tensor parallel), this can cause the following error:

  File "/opt/Automodel/nemo_automodel/components/checkpoint/checkpointing.py", line 538, in initialize_model_weights
    model.initialize_weights()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 2410, in initialize_weights
    self.smart_apply(self._initialize_weights, self.is_remote_code())
  File "/opt/venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 2401, in smart_apply
    module.smart_apply(module._initialize_weights, is_remote_code)
  File "/opt/venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 2403, in smart_apply
    module.smart_apply(fn, is_remote_code)
  File "/opt/venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 2404, in smart_apply
    fn(self, is_remote_code)
  File "/opt/venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 2381, in _initialize_weights
    self._init_weights(module)
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 2327, in _init_weights
    init.zeros_(module.weight[module.padding_idx])
                ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_dispatch.py", line 261, in _dispatch_get_local_results_slow_path
    self.redistribute_local_args(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_dispatch.py", line 486, in redistribute_local_args
    resharded_local_tensor = redistribute_local_tensor(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_redistribute.py", line 864, in redistribute_local_tensor
    transform_infos = _gen_transform_infos(
                      ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_redistribute.py", line 826, in _gen_transform_infos
    return _gen_transform_infos_non_cached(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/tensor/_redistribute.py", line 790, in _gen_transform_infos_non_cached
    assert src_shard_order is not None and dst_shard_order is not None
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The `[]` indexing operator fails in this case: indexing a DTensor sharded along dim 0 triggers a redistribute, which hits the assertion above.
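The merged fix skips the zeroing step when the weight is a TP-sharded DTensor. A minimal sketch of that guard (the helper name is hypothetical; it is not the code merged in this PR):

```python
import torch
import torch.nn as nn


def zero_padding_row_if_safe(embedding: nn.Embedding) -> None:
    # Hypothetical illustration of the fix: skip the in-place zeroing when
    # the weight is a DTensor, since weight[padding_idx] on a dim-0-sharded
    # DTensor triggers a redistribute that can fail.
    if embedding.padding_idx is None:
        return
    if type(embedding.weight).__name__ == "DTensor":
        return  # skip; the sharded weight is handled elsewhere (e.g. checkpoint load)
    with torch.no_grad():
        embedding.weight[embedding.padding_idx].zero_()
```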

Alternatively, we could remap padding_idx into the local shard and zero the row there, as in the helper below, but I'm not sure that's better either.

import torch
import torch.nn as nn


def _zero_dtensor_embedding_padding_row(embedding: nn.Embedding) -> None:
    """Zero the ``padding_idx`` row of a TP-sharded embedding weight via local tensor ops.

    When the weight is a DTensor sharded along dim 0 (vocab-parallel), only the
    rank whose local shard contains the ``padding_idx`` row performs the zeroing.
    For replicated weights or weights sharded on other dims, every rank zeros
    the row in its local tensor.
    """
    padding_idx = embedding.padding_idx
    weight = embedding.weight
    if padding_idx is None:
        return
    # Name check avoids importing DTensor; plain tensors need no special handling.
    if type(weight).__name__ != "DTensor":
        return

    local = weight._local_tensor
    spec = weight._spec

    for mesh_dim, placement in enumerate(spec.placements):
        if placement.is_shard() and placement.dim == 0:
            mesh = spec.mesh
            tp_size = mesh.size(mesh_dim)
            rank = mesh.get_local_rank(mesh_dim)
            vocab_size = weight.shape[0]

            # Balanced split: the first `rem` ranks hold one extra row.
            chunk = vocab_size // tp_size
            rem = vocab_size % tp_size
            if rank < rem:
                local_off = rank * (chunk + 1)
                local_size = chunk + 1
            else:
                local_off = rem * (chunk + 1) + (rank - rem) * chunk
                local_size = chunk

            # Remap the global row index into this rank's local shard.
            local_idx = padding_idx - local_off
            if 0 <= local_idx < local_size:
                with torch.no_grad():
                    local[local_idx].zero_()
            return

    # Replicated (or sharded on a non-vocab dim): every rank zeros locally.
    with torch.no_grad():
        local[padding_idx].zero_()
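The offset arithmetic above can be exercised without a distributed setup. This hypothetical helper mirrors the balanced split used in the function (the first `rem` ranks hold one extra row) so the index remap can be sanity-checked on CPU:

```python
def local_shard_range(vocab_size: int, tp_size: int, rank: int) -> tuple[int, int]:
    # Mirrors the shard-offset arithmetic sketched above: returns the
    # (global_offset, local_size) of `rank`'s dim-0 shard under a balanced
    # split where the first `rem` ranks each hold one extra row.
    chunk, rem = divmod(vocab_size, tp_size)
    if rank < rem:
        return rank * (chunk + 1), chunk + 1
    return rem * (chunk + 1) + (rank - rem) * chunk, chunk
```

With `vocab_size=10, tp_size=4` the shards are rows 0-2, 3-5, 6-7, 8-9, so a padding_idx of 7 maps onto exactly one rank's local shard.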

Changelog

  • Skip the embedding[padding_idx] = 0 initialization step when the embedding weight is a TP-sharded DTensor.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Apr 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa
Contributor Author

akoumpa commented Apr 3, 2026

/ok to test de12f90

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa
Contributor Author

akoumpa commented Apr 3, 2026

/ok to test 5d20089

@akoumpa
Contributor Author

akoumpa commented Apr 5, 2026

/claude review

Comment thread nemo_automodel/components/checkpoint/checkpointing.py
@akoumpa akoumpa marked this pull request as ready for review April 10, 2026 03:27
@akoumpa akoumpa added the r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. label Apr 10, 2026
@akoumpa akoumpa merged commit 1389350 into main Apr 10, 2026
50 of 52 checks passed
@akoumpa akoumpa deleted the akoumparouli/fix_zeroing_padd_idx branch April 10, 2026 18:29
svcnvidia-nemo-ci pushed a commit that referenced this pull request Apr 10, 2026
* skip embedding[padding_idx] = 0

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove code

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
akoumpa added a commit that referenced this pull request Apr 10, 2026
…0` (#1771)

fix: skip embedding[padding_idx] = 0 with TP (#1675)

* skip embedding[padding_idx] = 0



* fix



* remove code



---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
vgauraha62 pushed a commit to vgauraha62/Automodel that referenced this pull request Apr 11, 2026
vgauraha62 pushed a commit to vgauraha62/Automodel that referenced this pull request Apr 11, 2026
edjson pushed a commit to edjson/Automodel that referenced this pull request Apr 17, 2026
edjson pushed a commit to edjson/Automodel that referenced this pull request Apr 18, 2026
