
Feature/unswizzle #2732

Open
int-smart wants to merge 10 commits into NVIDIA:main from int-smart:feature/unswizzle

Conversation


@int-smart int-smart commented Mar 4, 2026

Description

This PR adds unswizzle support for scaling factors and extends the swizzle module so scaling tensors can be converted from GEMM-swizzled layout back to compact layout, including multi-tensor paths. It also adds round-trip and standalone tests to validate unswizzle correctness.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Added unswizzle APIs and implementation in transformer_engine/common/swizzle/swizzle.cu and declarations in transformer_engine/common/include/transformer_engine/swizzle.h
  • Added multi-tensor unswizzle support with swizzle-like validation assumptions (homogeneous scaling mode/layout, swizzled input and compact output expectations)
  • Refactored the multi-tensor unswizzle launch/kernels to mirror the swizzle structure (separate row-wise and column-wise kernels) for better readability
  • Added/extended tests in tests/cpp/operator/test_swizzle.cu, including standalone unswizzle and swizzle→unswizzle round-trip coverage

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

int-smart and others added 6 commits March 3, 2026 20:40
- Introduced `nvte_unswizzle_scaling_factors` to convert swizzled scaling factors back to row-major format.
- Implemented `regs_unshuffle_with_bit_shifts` and `regs_unshuffle` for unshuffling operations in CUDA kernels.
- Added `unswizzle_row_scaling_kernel_impl` and `unswizzle_col_scaling_kernel_impl` for handling unswizzling in row and column scaling respectively.

These changes enhance the functionality of the swizzle module, enabling better handling of scaling factors in tensor operations.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
These tests exercise the changes introduced for unswizzling

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
- Introduced `compute_ref_unswizzle` to handle the conversion of swizzled scaling factors back to their original format.
- Added `performTestUnswizzle1D` to validate the unswizzling process with various scaling modes.
- Created `UnswizzleTestSuite` for comprehensive testing of unswizzling operations.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
- Moved the definition of `swizzle_row_scaling_kernel` to a new location for better organization.
- Ensured the kernel implementation is now properly defined and accessible for scaling operations in the swizzle module.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
- Introduced `multi_tensor_unswizzle_scaling_factors` to convert swizzled scaling factors back to their original row-major format.
- Implemented CUDA kernels for unswizzling in both row and column scaling, enhancing the swizzle module's functionality.
- Updated the launch function to handle multiple tensor unswizzling operations efficiently.

These changes improve the handling of scaling factors in tensor operations, ensuring better performance and organization within the swizzle module.

Signed-off-by: Abhishek <abhi.dtu11@gmail.com>

greptile-apps bot commented Mar 4, 2026

Greptile Summary

This PR adds unswizzle (GEMM-swizzled → compact) support for MXFP8/NVFP4 1D scaling factors, mirroring the existing swizzle infrastructure. It introduces nvte_unswizzle_scaling_factors and nvte_multi_tensor_unswizzle_scaling_factors C APIs, new CUDA kernels (unswizzle_row/col_scaling_kernel_impl) with the corresponding inverse register-shuffle helpers (regs_unshuffle, regs_unshuffle_with_bit_shifts), and round-trip + standalone tests.

Key findings from this review:

  • Output-size validation bug (single-tensor path): unswizzle_scaling_factors validates the output buffer size using original_M * original_K (data-derived, potentially unpadded) instead of m * k (padded scale dimensions). For any matrix whose M dimension is not a multiple of 128, this check will reject a correctly-sized padded output tensor. The multi-tensor path and all swizzle paths correctly use m * k. Both the rowwise (line 1276) and columnwise (line 1348) paths share this bug.

  • API inconsistency: multi_tensor_unswizzle_scaling_factors accepts const std::vector<Tensor*>& output while the symmetric multi_tensor_swizzle_scaling_factors uses std::vector<Tensor*>& output. The const has no practical effect (the contained Tensor* pointers are non-const and the tensor data is mutated), but the inconsistency is unexpected.

  • The kernel logic, inverse shuffle operations, and test coverage all look correct. The new test functions properly avoid the uninitialized-variable issue present in the pre-existing performTestSwizzle1D.

Confidence Score: 3/5

  • Safe to merge for padded tensors (all current tests pass), but unswizzle_scaling_factors will incorrectly reject valid output tensors when M is not a multiple of 128 in production use.
  • The CUDA kernel implementations and register-shuffle inverses are algorithmically correct and well-tested. The round-trip tests provide good coverage. However, the output-size validation bug in the single-tensor unswizzle_scaling_factors (using original_M * original_K instead of m * k) is a latent logic error that will surface in production when the data M dimension is not already a multiple of 128 — a realistic scenario for arbitrary-size model layers.
  • transformer_engine/common/swizzle/swizzle.cu — specifically the output-size NVTE_CHECK calls at lines 1276-1279 (rowwise) and 1348-1351 (columnwise) inside unswizzle_scaling_factors.

Important Files Changed

  • transformer_engine/common/swizzle/swizzle.cu — Adds unswizzle_scaling_factors and multi_tensor_unswizzle_scaling_factors, the inverses of the existing swizzle operations. The new kernel implementations (unswizzle_row_scaling_kernel_impl, unswizzle_col_scaling_kernel_impl) and the regs_unshuffle/regs_unshuffle_with_bit_shifts helpers look algorithmically correct as inverses of their swizzle counterparts. However, the single-tensor unswizzle_scaling_factors validates the output buffer size using original_M * original_K (unpadded data dimensions) while every other path uses m * k (padded scale dimensions); this will incorrectly reject valid outputs whenever the data M is not a multiple of 128. Also, multi_tensor_unswizzle_scaling_factors uses const std::vector<Tensor*>& output inconsistently with the swizzle variant.
  • transformer_engine/common/include/transformer_engine/swizzle.h — Adds declarations for nvte_unswizzle_scaling_factors and nvte_multi_tensor_unswizzle_scaling_factors with well-written docstrings that mirror the swizzle counterparts. No issues found.
  • tests/cpp/operator/test_swizzle.cu — Adds compute_ref_unswizzle, performTestUnswizzle1D, and performTestSwizzleUnswizzleRoundtrip along with their test suite instantiations. The new test functions correctly handle the !rowwise && !columnwise edge case with a dedicated skip message (without touching uninitialized variables), unlike the pre-existing performTestSwizzle1D. Reference implementation and round-trip coverage are thorough.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant nvte_unswizzle as nvte_unswizzle_scaling_factors
    participant unswizzle as unswizzle_scaling_factors
    participant row_kernel as unswizzle_row_scaling_kernel
    participant col_kernel as unswizzle_col_scaling_kernel

    Caller->>nvte_unswizzle: (swizzled_tensor, compact_tensor, stream)
    nvte_unswizzle->>unswizzle: validate scaling_mode, dtype, flags
    unswizzle->>unswizzle: derive m, k from swizzled input shape
    unswizzle->>unswizzle: choose rowwise_unswizzle / columnwise_unswizzle

    alt rowwise_unswizzle
        unswizzle->>row_kernel: launch<<<(K_tiles, M_tiles), (32,32), slm>>>
        row_kernel->>row_kernel: load swizzled tiles into SLM
        row_kernel->>row_kernel: regs_unshuffle (inverse of regs_shuffle)
        row_kernel->>row_kernel: write compact bytes (bounds-checked)
    end

    alt columnwise_unswizzle
        unswizzle->>col_kernel: launch<<<(K_tiles, M_tiles), (32,32), slm>>>
        col_kernel->>col_kernel: load swizzled tiles into SLM
        col_kernel->>col_kernel: regs_unshuffle_with_bit_shifts
        col_kernel->>col_kernel: write compact bytes (bounds-checked)
    end

    unswizzle-->>Caller: output compact tensor populated

Last reviewed commit: d7b6d2d

@vthumbe1503 vthumbe1503 added the community-contribution PRs from external contributor outside the core maintainers, representing community-driven work. label Mar 4, 2026
Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
@int-smart int-smart force-pushed the feature/unswizzle branch from 85ea04b to 17dbb33 Compare March 5, 2026 02:13
int-smart and others added 2 commits March 4, 2026 18:49
Comment on lines +1276 to +1279
NVTE_CHECK(static_cast<size_t>(original_M) * original_K == output->scale_inv.numel(),
"Expected output tensor to have ", static_cast<size_t>(original_M) * original_K,
" row-wise scaling factors, but got shape=", output->scale_inv.shape, ".");
break;

Output size validation uses unpadded dimensions instead of padded

unswizzle_scaling_factors validates the rowwise output size with original_M * original_K, where original_M = output->flat_first_dim() is the actual data-matrix row count—not the padded scale row count m. This check will incorrectly reject a perfectly valid compact output tensor whenever M is not already a multiple of 128.

For example, with a matrix of shape [200, 4096] (M=200, K=4096):

  • m = ceil(200/128)*128 = 256 (required by the swizzle padding constraint)
  • original_M = 200, original_K = 128
  • output->scale_inv.numel() = 256 * 128 = 32768 (padded compact tensor)
  • But this check would require 200 * 128 = 25600 — and fail.

The equivalent check in swizzle_scaling_factors correctly uses m * k (see line 672-673). The corresponding check in multi_tensor_unswizzle_scaling_factors also uses m * k (line 1463), making this single-tensor path the outlier.

Suggested change
NVTE_CHECK(static_cast<size_t>(original_M) * original_K == output->scale_inv.numel(),
"Expected output tensor to have ", static_cast<size_t>(original_M) * original_K,
" row-wise scaling factors, but got shape=", output->scale_inv.shape, ".");
break;
NVTE_CHECK(static_cast<size_t>(m) * k == output->scale_inv.numel(),
"Expected output tensor to have ", static_cast<size_t>(m) * k,
" row-wise scaling factors, but got shape=", output->scale_inv.shape, ".");

Comment on lines +1348 to +1351
NVTE_CHECK(static_cast<size_t>(original_M) * original_K == output->columnwise_scale_inv.numel(),
"Expected output tensor to have ", static_cast<size_t>(original_M) * original_K,
" column-wise scaling factors, but got shape=", output->columnwise_scale_inv.shape,
".");

Same output-size validation bug in the columnwise unswizzle path

The same issue as in the rowwise path above: the check uses original_M * original_K (where original_M = output->flat_last_dim()) instead of the padded scale dimensions m * k. For any tensor where the column-wise scale M dimension is not already a multiple of 128, this will incorrectly fail even though the output buffer is correctly sized.

The equivalent check in multi_tensor_unswizzle_scaling_factors (line 1522) correctly accumulates the output shape, and swizzle_scaling_factors (line 678-681) uses m * k. This path should follow the same pattern:

Suggested change
NVTE_CHECK(static_cast<size_t>(original_M) * original_K == output->columnwise_scale_inv.numel(),
"Expected output tensor to have ", static_cast<size_t>(original_M) * original_K,
" column-wise scaling factors, but got shape=", output->columnwise_scale_inv.shape,
".");
NVTE_CHECK(static_cast<size_t>(m) * k == output->columnwise_scale_inv.numel(),
"Expected output tensor to have ", static_cast<size_t>(m) * k,
" column-wise scaling factors, but got shape=", output->columnwise_scale_inv.shape,
".");

Comment on lines +1389 to +1391
void multi_tensor_unswizzle_scaling_factors(const std::vector<Tensor*>& input,
const std::vector<Tensor*>& output,
cudaStream_t stream) {

Inconsistent const qualifier on output parameter

The output parameter here is const std::vector<Tensor*>&, but the analogous multi_tensor_swizzle_scaling_factors (line 990) uses std::vector<Tensor*>& (non-const). While adding const to the vector reference doesn't prevent mutating the pointed-to Tensor objects—so it has no effect on correctness—the inconsistency is surprising to callers and departs from the established pattern in this file. Consider aligning the signatures:

Suggested change
void multi_tensor_unswizzle_scaling_factors(const std::vector<Tensor*>& input,
const std::vector<Tensor*>& output,
cudaStream_t stream) {
void multi_tensor_unswizzle_scaling_factors(const std::vector<Tensor*>& input,
std::vector<Tensor*>& output,
cudaStream_t stream) {

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

