enable blockwise FP8 quantization on rocm by asdfvg123 · Pull Request #609 · ROCm/TransformerEngine

asdfvg123 · 2026-06-03T17:53:36Z

Description

Please include a brief summary of the changes, relevant motivation and context.

Enable blockwise FP8 quantization on ROCm.

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Remove the HIP guard in quantization.py.
Guard the quantization kernels that use TMA.
Add a branch to handle the different number of threads per wave on ROCm.

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

matthiasdiener · 2026-06-03T18:28:00Z

 # TODO replace with call to fp8.py when recipe added.
-recipe_available = not IS_HIP_EXTENSION and (get_device_compute_capability() >= (9, 0) and float(torch.version.cuda) >= 12.8)
+if IS_HIP_EXTENSION:
+    recipe_available = get_device_compute_capability() >= (9, 0)


Wouldn't this be always True on ROCm TE?

This test targets MI300 and MI350, so I set it to (9, 0).

I believe MI250 is (9,0) so this should be a > rather than a >=, or (9,4)
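The point about `>=` versus `>` can be checked with plain tuple comparison. This is a minimal sketch, assuming `get_device_compute_capability()` returns a `(major, minor)` tuple; the `recipe_available` helper and the capability constants are hypothetical names for illustration, and the MI250 value follows the reviewer's belief above rather than a verified fact.

```python
# Hypothetical capability values for illustration only.
MI250_CAPABILITY = (9, 0)    # assumption taken from the review comment
FP8_MIN_CAPABILITY = (9, 4)  # threshold suggested in the review


def recipe_available(capability, minimum):
    """Python compares tuples lexicographically: major first, then minor."""
    return capability >= minimum


# With a (9, 0) threshold, a (9, 0) device passes the check.
print(recipe_available(MI250_CAPABILITY, (9, 0)))            # True
# With a (9, 4) threshold, the same device is excluded.
print(recipe_available(MI250_CAPABILITY, FP8_MIN_CAPABILITY))  # False
# A (9, 4) device still passes the stricter check.
print(recipe_available((9, 4), FP8_MIN_CAPABILITY))          # True
```

Using an explicit `(9, 4)` threshold rather than `> (9, 0)` states the intended minimum directly.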

matthiasdiener · 2026-06-03T18:29:11Z

@@ -1 +1 @@
 /*************************************************************************


Needs AMD copyright

matthiasdiener · 2026-06-03T18:30:03Z

+#ifndef __HIP_PLATFORM_AMD__
 #include <cudaTypedefs.h>
+#endif
 #include <cuda_bf16.h>
 #include <cuda_runtime.h>

 #include <cfloat>
+#ifndef __HIP_PLATFORM_AMD__
 #include <cuda/barrier>
+#endif

 #include "common/common.h"
 #include "common/recipe/recipe_common.cuh"
 #include "common/util/cuda_runtime.h"
+#ifndef __HIP_PLATFORM_AMD__
 #include "common/util/ptx.cuh"
+#endif


These #includes should be already disabled via hipify, so probably no need for the #ifndefs here.

matthiasdiener · 2026-06-03T18:43:26Z

+  static constexpr float max = 448.0f;
+  static constexpr float max_inverse = 1.0 / max;


Is this change necessary? fp8e4m3 max depends on the device type on AMD.

quantize_transpose_square_blockwise.cu and quantize_transpose_vector_blockwise.cu use
compute_scale_from_types<IType, fp8e4m3> for the first time, which exposed a latent bug in common.h.

The #else branch of TypeExtrema<fp8e4m3> declared max as a non-constexpr static float.
This caused the constexpr static float max_finite_value initializer in TypeInfo, in the same file, to fail when the template was instantiated on the host.

The fix uses HIP_FP8_TYPE_FNUZ, which hip_float8.h uses to select FNUZ at compile time, to make the host-pass branch constexpr as well.

If the value is really used on the host side, it should be detected at runtime. If it is only for host translation of GPU code (i.e., the results are discarded), you can keep 448 and no extra ifdefs are needed.

Reverted to the original upstream version. Instead, I changed recipe_common.cuh, following the convention used in quantize_transpose_vector_blockwise_fp4.cu L230.
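For context on why the e4m3 maximum is device dependent, the two 8-bit layouts discussed here have different largest finite values. The sketch below derives both from the bit layout; the function name `fp8_e4m3_max` is hypothetical, and which AMD device uses which encoding is not stated in this thread, so that mapping is left out.

```python
def fp8_e4m3_max(fnuz: bool) -> float:
    """Largest finite value of an FP8 format with 4 exponent and 3 mantissa bits.

    OCP e4m3: exponent bias 7; the top encoding (exponent 15, mantissa 0b111)
    is NaN, so the largest finite value uses mantissa 0b110.
    FNUZ e4m3: exponent bias 8; the top encoding is finite, so the largest
    finite value uses mantissa 0b111.
    """
    exp_bits, man_bits = 4, 3
    max_exp_field = 2**exp_bits - 1  # 15
    if fnuz:
        bias, mantissa = 8, 0b111
    else:
        bias, mantissa = 7, 0b110
    return 2.0 ** (max_exp_field - bias) * (1 + mantissa / 2**man_bits)


print(fp8_e4m3_max(fnuz=False))  # 448.0
print(fp8_e4m3_max(fnuz=True))   # 240.0
```

A hard-coded 448 therefore matches only the OCP layout, which is why the reviewers ask whether the host-side value is ever actually used.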

alextmagro · 2026-06-03T18:48:39Z

Could you give a description of what you want to achieve with this PR? My understanding is that block fp8 quantization relies on some upstream kernels that will need to be adapted for AMD.

If you're just trying to enable the interface, I would argue that we should do this last, after we have a working quantization and GEMM path (and enabled and passing C++/Python tests).

asdfvg123 · 2026-06-03T18:57:07Z

@alextmagro
This PR enables only the quantization on AMD GPUs, not the GEMM. Upstream has two kinds of quantization kernels: those that use TMA and those that do not. I guarded the kernels that use TMA and used the non-TMA kernels to quantize on AMD.

I tested with
tests/pytorch/test_float8blockwisetensor.py
and it passes [175 passed / 32 xpassed / 5 warnings]

alextmagro · 2026-06-03T19:10:43Z

@alextmagro This PR enables only the quantization on AMD GPUs, not the GEMM. Upstream has two kinds of quantization kernels: those that use TMA and those that do not. I guarded the kernels that use TMA and used the non-TMA kernels to quantize on AMD.

I tested with tests/pytorch/test_float8blockwisetensor.py and it passes [175 passed / 32 xpassed / 5 warnings]

OK, in that case we need to add the cpp blockwise tests to the CMake file, and the pytorch test file to ci/pytorch.sh.

alextmagro · 2026-06-04T18:12:20Z

By the way, to run CI you need to add a CI level label. L3 is required before merging; L1 is for lighter testing (mostly sGPU tests) when you are midway through the ticket and expect to make more changes.

ipanfilo · 2026-06-05T02:52:57Z

ipanfilo · 2026-06-05T02:58:50Z


+#ifdef __HIP_PLATFORM_AMD__
+using WarpSyncMask = uint64_t;
+constexpr WarpSyncMask kFullWarpMask = 0xFFFFFFFFFFFFFFFFULL;


ROCm should not use it. See how *_sync calls are guarded in other places

Removed the mask; the ROCm path now uses __shfl instead of __shfl_sync.

ipanfilo · 2026-06-05T02:59:39Z

    }
  }
  // Reduce amax in the warp (32x32 tile)
+#ifdef __HIP_PLATFORM_AMD__


This whole block of code is already under #ifndef __HIP_PLATFORM_AMD__.

removed the dead branch

ipanfilo · 2026-06-05T03:00:07Z

 // const values configuration

+#ifdef __HIP_PLATFORM_AMD__
+constexpr size_t kThreadsPerWarp = 64;


It is platform dependent.

Fixed. It is now guarded with gfx1250 for 32 threads.

Can we use warpSize from hipruntime here, since kThreadsPerWarp is only needed for device code?

I think warpSize is not constexpr anymore. Or is it?

I assume you are referring to

inline __device__ const struct { __device__ __attribute__((always_inline, const)) operator int() const noexcept { return __builtin_amdgcn_wavefrontsize(); } } warpSize{};

in amd_warp_functions.h.
This is not constexpr (its value is assigned at runtime), so it cannot be used here.

ipanfilo · 2026-06-05T03:03:22Z

     transpose/multi_cast_transpose.cu
-     transpose/quantize_transpose_vector_blockwise.cu #CUDA-only
+     transpose/quantize_transpose_vector_blockwise.cu
+     transpose/quantize_transpose_square_blockwise.cu


It should stay in transformer_engine_cuda_arch_specific_sources

ipanfilo · 2026-06-05T03:06:08Z

+  static constexpr float max = 448.0f;
+  static constexpr float max_inverse = 1.0 / max;


If the value is really used on the host side, it should be detected at runtime. If it is only for host translation of GPU code (i.e., the results are discarded), you can keep 448 and no extra ifdefs are needed.

asdfvg123 · 2026-06-10T20:07:06Z

MI300 has 64 KB of LDS, so loading a 128 * 128 tile of FP32 data into LDS overflows it. I created a helper and branched the kernel: when loading FP32 data, the kernel loads one 128 * 64 chunk at a time and iterates over the chunks to quantize. From the host's point of view, the kernel still quantizes 128 * 128 elements.
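The arithmetic behind the overflow can be sketched as follows. The names `LDS_BYTES`, `TILE_DIM`, `CHUNK_COLS`, and `tile_bytes` are hypothetical, and the sketch only counts the raw tile; any padding or additional shared buffers in the real kernel are not modeled.

```python
LDS_BYTES = 64 * 1024  # LDS capacity quoted in the comment above
TILE_DIM = 128         # tile is TILE_DIM x TILE_DIM elements
CHUNK_COLS = 64        # chunk width used by the branched FP32 path
FP32_BYTES = 4


def tile_bytes(rows: int, cols: int, elem_bytes: int) -> int:
    """Raw storage for a rows x cols tile, ignoring padding."""
    return rows * cols * elem_bytes


full_tile = tile_bytes(TILE_DIM, TILE_DIM, FP32_BYTES)
chunk = tile_bytes(TILE_DIM, CHUNK_COLS, FP32_BYTES)
num_chunks = TILE_DIM // CHUNK_COLS

print(full_tile)               # 65536
print(full_tile >= LDS_BYTES)  # True: the raw tile alone fills the whole LDS
print(chunk)                   # 32768
print(num_chunks)              # 2
```

A full FP32 tile consumes the entire 64 KB by itself, leaving no room for anything else, whereas a 128 * 64 chunk needs half of it, at the cost of iterating twice per tile.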

alextmagro · 2026-06-11T18:03:27Z

 # TODO replace with call to fp8.py when recipe added.
-recipe_available = not IS_HIP_EXTENSION and (get_device_compute_capability() >= (9, 0) and float(torch.version.cuda) >= 12.8)
+if IS_HIP_EXTENSION:
+    recipe_available = get_device_compute_capability() >= (9, 0)


I believe MI250 is (9,0) so this should be a > rather than a >=, or (9,4)

alextmagro · 2026-06-11T18:20:32Z

 // const values configuration

+#ifdef __HIP_PLATFORM_AMD__
+constexpr size_t kThreadsPerWarp = 64;


Can we use warpSize from hipruntime here, since kThreadsPerWarp is only needed for device code?

alextmagro · 2026-06-11T18:23:01Z

+    const int c_s = warp_in_chunk * num_smem_reads;
+    size_t r_g = static_cast<size_t>(blockIdx.y) * kTileDim + r_s;
+    for (int chunk = 0; chunk < kNumChunks; ++chunk) {
+      __syncthreads();


We can probably skip the syncthreads for the first iteration, also a pragma unroll might help here.

Fixed the first-iteration sync and added the pragma unroll.

Looking more closely, could we remove this syncthreads completely, and then do a double buffer for load_chunk_to_smem?

ipanfilo · 2026-06-12T03:06:41Z

 // const values configuration

+#ifdef __HIP_PLATFORM_AMD__
+constexpr size_t kThreadsPerWarp = 64;


I think warpSize is not constexpr anymore. Or is it?

ipanfilo · 2026-06-12T03:08:01Z


 #ifdef __HIP_PLATFORM_AMD__
-__device__ __forceinline__ float blockwise_warp_reduce_max(float val) {
+__device__ __forceinline__ float warp_reduce_max_64(float val) {


is now removed.

ipanfilo · 2026-06-12T03:17:08Z

+      // Step 2.3: Reduce amax
+#pragma unroll
+      for (int delta = kNumThreadsStore / 2; delta > 0; delta /= 2) {
+        const float other_amax = __shfl_down_sync(mask, amax, delta);


Use __shfl_down on ROCm

I removed all *_sync calls from both the AMD-only path and the common AMD & NVIDIA path. I added a guard and use the non-sync variants in the AMD path.

ipanfilo · 2026-06-12T03:19:07Z

 using transformer_engine::detail::FP8BlockwiseRowwiseOption;

+#ifdef __HIP_PLATFORM_AMD__
+using WarpSyncMask = uint64_t;


Review where it is used. Wavefront-level primitives on ROCm should not use a mask.

Removed the unnecessary mask definitions and switched to using only the non-sync variants in the AMD path. Reverted the rest to upstream.

ipanfilo · 2026-06-12T03:20:04Z

    if IS_HIP_EXTENSION:
-        return False, "FP8 block scaled gemm not yet supported for ROCm"
+        gpu_arch = get_device_compute_capability()
+        if gpu_arch >= (9, 0):


FP8 starts from 9.4

ipanfilo · 2026-06-16T18:42:50Z

 // const values configuration

+#if defined(__HIP_PLATFORM_AMD__) && !defined(__gfx1250__)
+constexpr size_t kThreadsPerWarp = 64;


is kThreadsPerWarp only used by device code and not any dispatch functions?

It is used to compute NUM_THREADS_Y_IN_WARP at L71, which is a constexpr computation. Other than that, it is only used in device code.

ipanfilo · 2026-06-17T04:06:14Z

  // Reduce amax in the warp (32x32 tile)
+#ifdef __HIP_PLATFORM_AMD__
+#pragma unroll
+  for (int delta = kThreadsPerWarp / 2; delta > 0; delta /= 2) {


Can you please clarify this logic with using xor?

The purpose is to share the amax value across all lanes in a wave.
shfl_xor(val, mask, width) works as follows: lane i sends its val to lane (i XOR mask) and receives that lane's val in return.
In the first iteration (delta = 64 / 2 = 32), lane 0 exchanges amax with lane 32, lane 1 exchanges amax with lane 33, and so on.
In the next iteration (delta = 16), lane 0 exchanges its amax (accumulated in the previous step) with lane 16, lane 1 exchanges with lane 17, and so on.
At the end of the loop, every lane holds the amax across the whole wave, so no separate broadcast is needed.

For reference: using xor for a warp-level all-reduce is faster than shuffling down and then broadcasting, since it skips the broadcast instructions, so it is generally best practice.

Upstream has subwarp_reduce_max_broadcast and warp_reduce_max_broadcast using the slower reduce + broadcast implementation, so I will put in a PR to fix those up. We can then use those directly here?
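The difference between the two reduction patterns discussed in this thread can be illustrated with a host-side simulation. This is only a sketch in plain Python, not the kernel code: `xor_all_reduce_max` and `shfl_down_reduce_max` are hypothetical names, each list element stands for one lane's amax, and the shuffle-down model assumes a lane whose source is out of range keeps its own value.

```python
def xor_all_reduce_max(lanes):
    """Butterfly reduction: lane i combines with lane (i XOR delta) each step."""
    width = len(lanes)  # assumed to be a power of two, e.g. 64 for wave64
    vals = list(lanes)
    delta = width // 2
    while delta > 0:
        vals = [max(vals[i], vals[i ^ delta]) for i in range(width)]
        delta //= 2
    return vals


def shfl_down_reduce_max(lanes):
    """Shuffle-down reduction: lane i combines with lane (i + delta) each step."""
    width = len(lanes)
    vals = list(lanes)
    delta = width // 2
    while delta > 0:
        # Modeling assumption: an out-of-range source leaves the lane unchanged.
        vals = [
            max(vals[i], vals[i + delta]) if i + delta < width else vals[i]
            for i in range(width)
        ]
        delta //= 2
    return vals


# Place the maximum in lane 0 so the difference is visible.
lanes = [float(63 - i) for i in range(64)]

xor_result = xor_all_reduce_max(lanes)
down_result = shfl_down_reduce_max(lanes)

print(all(v == 63.0 for v in xor_result))  # True: every lane holds the max
print(down_result[0])                      # 63.0: lane 0 holds the max
print(down_result[1])                      # 62.0: lane 1 does not, so a broadcast is needed
```

After log2(64) = 6 steps the xor variant leaves the result in every lane, while the shuffle-down variant guarantees it only in lane 0.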

enable blockwise FP8 quantization on rocm

8335488

asdfvg123 requested review from alextmagro, wangye805 and wenchenvincent June 3, 2026 17:53

asdfvg123 requested a review from ipanfilo as a code owner June 3, 2026 17:53

matthiasdiener reviewed Jun 3, 2026

enable blockwise FP8 C++ tests on ROCm, fix wave64 bugs, remove redun…

6226301

…dant HIP guards, revert unnecessary common.h change

alextmagro requested changes Jun 4, 2026

fix constexpr chain, arch guard

bdf905e

asdfvg123 added the ci-level 1 CI test level 1 label Jun 4, 2026

ipanfilo requested changes Jun 5, 2026

asdfvg123 added 4 commits June 9, 2026 22:58

guard kThreadsPerWarp for wave32, derive fp8 max on AMD host

676d1f0

revert unnecessary guard, add license

f8a0fc5

use non-sync __shfl on ROCm, drop kFullWarpMask

231e381

fix fp32 LDS overflow in vector blockwise on gfx942

e158d3e

asdfvg123 requested review from alextmagro, ipanfilo and matthiasdiener June 10, 2026 20:07

alextmagro reviewed Jun 11, 2026

ipanfilo reviewed Jun 12, 2026

drop WarpSyncMask, use __shfl_xor reduce, guard gfx1250 wave32, minor…

70c35df

… fix

asdfvg123 force-pushed the yeonsoo/blockwise_fp8 branch from dc4c5fd to 70c35df Compare June 15, 2026 23:15

use explicit gfx942/host guard

7ede21d

asdfvg123 requested review from alextmagro and ipanfilo June 16, 2026 00:00

ipanfilo reviewed Jun 17, 2026

asdfvg123 requested a review from ipanfilo June 17, 2026 17:56
