Introduce a fused padding + cast transpose kernel for grouped linear #632
alextmagro wants to merge 2 commits
Claude code review summary

Reviewed PR #632 (fused padding + cast-transpose for grouped linear) against

Overall: Approach is sound: a single padded MCT kernel replacing pad-then-MCT, exposed end-to-end through

Findings (see inline comments):

Copyright headers: OK. All 11 modified files have correctly updated AMD headers with the 2026 end-year and preserved NVIDIA lines.

Not duplicated: @ipanfilo's nit on
alextmagro left a comment:
Addressed Copilot comments.
Fuses the following two kernels:
1. Pad tensors in BF16
2. Cast from BF16 to FP8, transpose, and store

into a single kernel that casts from BF16 to FP8, transposes, and stores with padding.

This gives a 2x speedup over the unfused kernels and applies to grouped linear.
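To make the intended equivalence concrete, here is a minimal pure-Python reference sketch of the two paths. It is not the kernel from this PR. The names `fake_cast`, `padded_rows`, `unfused`, and `fused` are hypothetical, the padding is assumed to be zero-fill along the row dimension up to a multiple of `align`, and the FP8 cast is approximated by scale-and-clamp to the E4M3 range without mantissa rounding.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def fake_cast(x, scale):
    """Stand-in for the BF16 -> FP8 cast: scale and clamp only (no rounding)."""
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x * scale))


def padded_rows(rows, align):
    """Round the row count up to the next multiple of `align`."""
    return (rows + align - 1) // align * align


def unfused(x, scale, align=16):
    """Two passes: zero-pad the rows first, then cast and transpose."""
    cols = len(x[0])
    extra = padded_rows(len(x), align) - len(x)
    # Pass 1: materialize a padded copy of the input (assumed zero-fill).
    padded = [row[:] for row in x] + [[0.0] * cols for _ in range(extra)]
    # Pass 2: cast every element and write it transposed.
    return [[fake_cast(padded[r][c], scale) for r in range(len(padded))]
            for c in range(cols)]


def fused(x, scale, align=16):
    """One pass: cast and transpose straight into a zero-initialized padded output."""
    rows, cols = len(x), len(x[0])
    out = [[0.0] * padded_rows(rows, align) for _ in range(cols)]
    for r in range(rows):
        for c in range(cols):
            out[c][r] = fake_cast(x[r][c], scale)
    return out


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 1000.0]]
result = fused(x, scale=1.0, align=4)
```

Under these assumptions both paths produce the same transposed, padded output. The fused path never materializes the intermediate padded BF16 tensor, which is the memory traffic the fusion is meant to avoid.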