
Feature/unswizzle #2732

Open
int-smart wants to merge 10 commits into NVIDIA:main from int-smart:feature/unswizzle

Conversation


@int-smart int-smart commented Mar 4, 2026

Description

This PR adds unswizzle support for scaling factors and extends the swizzle module so scaling tensors can be converted from GEMM-swizzled layout back to compact layout, including multi-tensor paths. It also adds round-trip and standalone tests to validate unswizzle correctness.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Added unswizzle APIs and implementation in transformer_engine/common/swizzle/swizzle.cu and declarations in transformer_engine/common/include/transformer_engine/swizzle.h
  • Added multi-tensor unswizzle support with swizzle-like validation assumptions (homogeneous scaling mode/layout, swizzled input and compact output expectations)
  • Refactored the multi-tensor unswizzle launch/kernels to mirror the swizzle structure (separate row-wise and column-wise kernels) for better readability
  • Added/extended tests in tests/cpp/operator/test_swizzle.cu, including standalone unswizzle and swizzle→unswizzle round-trip coverage

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

int-smart and others added 6 commits March 3, 2026 20:40
- Introduced `nvte_unswizzle_scaling_factors` to convert swizzled scaling factors back to row-major format.
- Implemented `regs_unshuffle_with_bit_shifts` and `regs_unshuffle` for unshuffling operations in CUDA kernels.
- Added `unswizzle_row_scaling_kernel_impl` and `unswizzle_col_scaling_kernel_impl` for handling unswizzling in row and column scaling respectively.

These changes enhance the functionality of the swizzle module, enabling better handling of scaling factors in tensor operations.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
These tests exercise the changes introduced for unswizzling

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
- Introduced `compute_ref_unswizzle` to handle the conversion of swizzled scaling factors back to their original format.
- Added `performTestUnswizzle1D` to validate the unswizzling process with various scaling modes.
- Created `UnswizzleTestSuite` for comprehensive testing of unswizzling operations.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
- Moved the definition of `swizzle_row_scaling_kernel` to a new location for better organization.
- Ensured the kernel implementation is now properly defined and accessible for scaling operations in the swizzle module.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
- Introduced `multi_tensor_unswizzle_scaling_factors` to convert swizzled scaling factors back to their original row-major format.
- Implemented CUDA kernels for unswizzling in both row and column scaling, enhancing the swizzle module's functionality.
- Updated the launch function to handle multiple tensor unswizzling operations efficiently.

These changes improve the handling of scaling factors in tensor operations, ensuring better performance and organization within the swizzle module.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

greptile-apps bot commented Mar 4, 2026

Greptile Summary

This PR adds unswizzle (GEMM-swizzled → compact) support for MXFP8/NVFP4 1D scaling factors, mirroring the existing swizzle infrastructure. It introduces nvte_unswizzle_scaling_factors and nvte_multi_tensor_unswizzle_scaling_factors C APIs, new CUDA kernels (unswizzle_row/col_scaling_kernel_impl) with the corresponding inverse register-shuffle helpers (regs_unshuffle, regs_unshuffle_with_bit_shifts), and round-trip + standalone tests.

Key findings from this review:

  • Output-size validation bug (single-tensor path): unswizzle_scaling_factors validates the output buffer size using original_M * original_K (data-derived, potentially unpadded) instead of m * k (padded scale dimensions). For any matrix whose M dimension is not a multiple of 128, this check will reject a correctly-sized padded output tensor. The multi-tensor path and all swizzle paths correctly use m * k. Both the rowwise (line 1276) and columnwise (line 1348) paths share this bug.

  • API inconsistency: multi_tensor_unswizzle_scaling_factors accepts const std::vector<Tensor*>& output while the symmetric multi_tensor_swizzle_scaling_factors uses std::vector<Tensor*>& output. The const has no practical effect (the contained Tensor* pointers are non-const and the tensor data is mutated), but the inconsistency is unexpected.

  • The kernel logic, inverse shuffle operations, and test coverage all look correct. The new test functions properly avoid the uninitialized-variable issue present in the pre-existing performTestSwizzle1D.

Confidence Score: 3/5

  • Safe to merge for padded tensors (all current tests pass), but unswizzle_scaling_factors will incorrectly reject valid output tensors when M is not a multiple of 128 in production use.
  • The CUDA kernel implementations and register-shuffle inverses are algorithmically correct and well-tested. The round-trip tests provide good coverage. However, the output-size validation bug in the single-tensor unswizzle_scaling_factors (using original_M * original_K instead of m * k) is a latent logic error that will surface in production when the data M dimension is not already a multiple of 128 — a realistic scenario for arbitrary-size model layers.
  • transformer_engine/common/swizzle/swizzle.cu — specifically the output-size NVTE_CHECK calls at lines 1276-1279 (rowwise) and 1348-1351 (columnwise) inside unswizzle_scaling_factors.

Important Files Changed

  • transformer_engine/common/swizzle/swizzle.cu — Adds unswizzle_scaling_factors and multi_tensor_unswizzle_scaling_factors, the inverses of the existing swizzle operations. The new kernel implementations (unswizzle_row_scaling_kernel_impl, unswizzle_col_scaling_kernel_impl) and the regs_unshuffle/regs_unshuffle_with_bit_shifts helpers look algorithmically correct as inverses of their swizzle counterparts. However, the single-tensor unswizzle_scaling_factors validates the output buffer size using original_M * original_K (unpadded data dimensions) while every other path uses m * k (padded scale dimensions); this will incorrectly reject valid outputs whenever the data M is not a multiple of 128. Also, multi_tensor_unswizzle_scaling_factors uses const std::vector<Tensor*>& output inconsistently with the swizzle variant.
  • transformer_engine/common/include/transformer_engine/swizzle.h — Adds declarations for nvte_unswizzle_scaling_factors and nvte_multi_tensor_unswizzle_scaling_factors with well-written docstrings that mirror the swizzle counterparts. No issues found.
  • tests/cpp/operator/test_swizzle.cu — Adds compute_ref_unswizzle, performTestUnswizzle1D, and performTestSwizzleUnswizzleRoundtrip along with their test suite instantiations. The new test functions correctly handle the !rowwise && !columnwise edge case with a dedicated skip message (without touching uninitialized variables), unlike the pre-existing performTestSwizzle1D. Reference implementation and round-trip coverage are thorough.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant nvte_unswizzle as nvte_unswizzle_scaling_factors
    participant unswizzle as unswizzle_scaling_factors
    participant row_kernel as unswizzle_row_scaling_kernel
    participant col_kernel as unswizzle_col_scaling_kernel

    Caller->>nvte_unswizzle: (swizzled_tensor, compact_tensor, stream)
    nvte_unswizzle->>unswizzle: validate scaling_mode, dtype, flags
    unswizzle->>unswizzle: derive m, k from swizzled input shape
    unswizzle->>unswizzle: choose rowwise_unswizzle / columnwise_unswizzle

    alt rowwise_unswizzle
        unswizzle->>row_kernel: launch<<<(K_tiles, M_tiles), (32,32), slm>>>
        row_kernel->>row_kernel: load swizzled tiles into SLM
        row_kernel->>row_kernel: regs_unshuffle (inverse of regs_shuffle)
        row_kernel->>row_kernel: write compact bytes (bounds-checked)
    end

    alt columnwise_unswizzle
        unswizzle->>col_kernel: launch<<<(K_tiles, M_tiles), (32,32), slm>>>
        col_kernel->>col_kernel: load swizzled tiles into SLM
        col_kernel->>col_kernel: regs_unshuffle_with_bit_shifts
        col_kernel->>col_kernel: write compact bytes (bounds-checked)
    end

    unswizzle-->>Caller: output compact tensor populated

Last reviewed commit: d7b6d2d

@vthumbe1503 vthumbe1503 added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label Mar 4, 2026
Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
@int-smart int-smart force-pushed the feature/unswizzle branch from 85ea04b to 17dbb33 Compare March 5, 2026 02:13
int-smart and others added 2 commits March 4, 2026 18:49
Comment on lines +1276 to +1279
NVTE_CHECK(static_cast<size_t>(original_M) * original_K == output->scale_inv.numel(),
"Expected output tensor to have ", static_cast<size_t>(original_M) * original_K,
" row-wise scaling factors, but got shape=", output->scale_inv.shape, ".");
break;

Output size validation uses unpadded dimensions instead of padded

unswizzle_scaling_factors validates the rowwise output size with original_M * original_K, where original_M = output->flat_first_dim() is the actual data-matrix row count—not the padded scale row count m. This check will incorrectly reject a perfectly valid compact output tensor whenever M is not already a multiple of 128.

For example, with a matrix of shape [200, 4096] (M=200, K=4096):

  • m = ceil(200/128)*128 = 256 (required by the swizzle padding constraint)
  • original_M = 200, original_K = 128
  • output->scale_inv.numel() = 256 * 128 = 32768 (padded compact tensor)
  • But this check would require 200 * 128 = 25600 — and fail.

The equivalent check in swizzle_scaling_factors correctly uses m * k (see line 672-673). The corresponding check in multi_tensor_unswizzle_scaling_factors also uses m * k (line 1463), making this single-tensor path the outlier.

Suggested change
NVTE_CHECK(static_cast<size_t>(original_M) * original_K == output->scale_inv.numel(),
"Expected output tensor to have ", static_cast<size_t>(original_M) * original_K,
" row-wise scaling factors, but got shape=", output->scale_inv.shape, ".");
break;
NVTE_CHECK(static_cast<size_t>(m) * k == output->scale_inv.numel(),
"Expected output tensor to have ", static_cast<size_t>(m) * k,
" row-wise scaling factors, but got shape=", output->scale_inv.shape, ".");

Comment on lines +1348 to +1351
NVTE_CHECK(static_cast<size_t>(original_M) * original_K == output->columnwise_scale_inv.numel(),
"Expected output tensor to have ", static_cast<size_t>(original_M) * original_K,
" column-wise scaling factors, but got shape=", output->columnwise_scale_inv.shape,
".");

Same output-size validation bug in the columnwise unswizzle path

The same issue as in the rowwise path above: the check uses original_M * original_K (where original_M = output->flat_last_dim()) instead of the padded scale dimensions m * k. For any tensor where the column-wise scale M dimension is not already a multiple of 128, this will incorrectly fail even though the output buffer is correctly sized.

The equivalent check in multi_tensor_unswizzle_scaling_factors (line 1522) correctly accumulates the output shape, and swizzle_scaling_factors (line 678-681) uses m * k. This path should follow the same pattern:

Suggested change
NVTE_CHECK(static_cast<size_t>(original_M) * original_K == output->columnwise_scale_inv.numel(),
"Expected output tensor to have ", static_cast<size_t>(original_M) * original_K,
" column-wise scaling factors, but got shape=", output->columnwise_scale_inv.shape,
".");
NVTE_CHECK(static_cast<size_t>(m) * k == output->columnwise_scale_inv.numel(),
"Expected output tensor to have ", static_cast<size_t>(m) * k,
" column-wise scaling factors, but got shape=", output->columnwise_scale_inv.shape,
".");

Comment on lines +1389 to +1391
void multi_tensor_unswizzle_scaling_factors(const std::vector<Tensor*>& input,
const std::vector<Tensor*>& output,
cudaStream_t stream) {

Inconsistent const qualifier on output parameter

The output parameter here is const std::vector<Tensor*>&, but the analogous multi_tensor_swizzle_scaling_factors (line 990) uses std::vector<Tensor*>& (non-const). While adding const to the vector reference doesn't prevent mutating the pointed-to Tensor objects—so it has no effect on correctness—the inconsistency is surprising to callers and departs from the established pattern in this file. Consider aligning the signatures:

Suggested change
void multi_tensor_unswizzle_scaling_factors(const std::vector<Tensor*>& input,
const std::vector<Tensor*>& output,
cudaStream_t stream) {
void multi_tensor_unswizzle_scaling_factors(const std::vector<Tensor*>& input,
std::vector<Tensor*>& output,
cudaStream_t stream) {

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

