Skip to content

gfx1250 mxfp8 gemm: loosen restrictions on K#627

Open
matthiasdiener wants to merge 2 commits into
devfrom
mdiener/loosen-gfx1250-mxfp8-k-restriction
Open

gfx1250 mxfp8 gemm: loosen restrictions on K#627
matthiasdiener wants to merge 2 commits into
devfrom
mdiener/loosen-gfx1250-mxfp8-k-restriction

Conversation

@matthiasdiener

@matthiasdiener matthiasdiener commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Description

Loosen restrictions on K on gfx1250 mxfp8 gemm (K must be multiple of 32), confirmed with hipblaslt developers.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@matthiasdiener matthiasdiener self-assigned this Jun 16, 2026
@matthiasdiener matthiasdiener added the ci-level 1 CI test level 1 label Jun 16, 2026
@matthiasdiener matthiasdiener changed the title gfx1250 gemm: loosen restrictions on K gfx1250 mxfp8 gemm: loosen restrictions on K Jun 16, 2026
// Check that K is compatible with the MXFP8 scale layout, and M/N are multiples of 16
if (inputA->scaling_mode == NVTE_MXFP8_1D_SCALING || inputB->scaling_mode == NVTE_MXFP8_1D_SCALING) {
const bool is_gfx1250 = cuda::sm_arch() == 125;
const int required_k_multiple = is_gfx1250 ? 32 : 128;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a TODO here to change this for gfx950 after scale preswizzle is in hipblasLt.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added in 3a7dd8f

}
if (params.k % 128) {
GTEST_SKIP() << "MXFP8 requires K to be a multiple of 128";
const size_t required_k_multiple = (prop.major == 12 && prop.minor == 5) ? 32 : 128;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be under HIP_PLATFORM_AMD below

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of 3a7dd8f? Or do you prefer to move the whole if (use_mxfp8) part into the #ifdef __HIP_PLATFORM_AMD__ below?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like that even previous constrains were actually ROCm specific despite the idea that ROCm specific constrains are added below under ifdef.
Moreover, some ROCm constrains are more relaxed than naive ones (multiple of 16 vs multiple of 32) so we cannot have naive generic constrains here.
Saying that, and bearing in mind it is AMD originated test it might be not worth of efforts to separate ROCm and generic constrains so you may revert guarding or keep it as is. Sorry for confusion.

}
if (params.k % 128) {
GTEST_SKIP() << "MXFP8 requires K to be a multiple of 128";
const size_t required_k_multiple = (prop.major == 12 && prop.minor == 5) ? 32 : 128;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like that even previous constrains were actually ROCm specific despite the idea that ROCm specific constrains are added below under ifdef.
Moreover, some ROCm constrains are more relaxed than naive ones (multiple of 16 vs multiple of 32) so we cannot have naive generic constrains here.
Saying that, and bearing in mind it is AMD originated test it might be not worth of efforts to separate ROCm and generic constrains so you may revert guarding or keep it as is. Sorry for confusion.

const bool use_mxfp8 = params.scaling_mode == NVTEScalingMode::NVTE_MXFP8_1D_SCALING;

cudaDeviceProp prop;
(void)cudaGetDeviceProperties(&prop, 0);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use the get_arch function here to avoid calling this for every test?

GTEST_SKIP() << "MXFP8 requires K to be a multiple of 128";
size_t required_k_multiple = 128;
#ifdef __HIP_PLATFORM_AMD__
required_k_multiple = (prop.major == 12 && prop.minor == 5) ? 32 : 128;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: gfx1250 requires a multiple of the block size, not necessarily 32. I believe 16 may also be supported.

NVTE_CHECK((k % required_k_multiple) == 0,
"GEMM K dimension must be multiple of ", required_k_multiple,
" for MXFP8 scaling (got K=", k, ")");
NVTE_CHECK((m % 16) == 0, "GEMM M dimension must be multiple of 16 for MXFP8 scaling (got M=", m, ")");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that hipblaslt supports arbitrary M/N for gfx1250?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-level 1 CI test level 1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants