Ipanfilo/port fixes to 212 #633

Open
ipanfilo wants to merge 14 commits into release_v2.12_rocm from ipanfilo/port_fixes_to_212
@ipanfilo

Collaborator

Description

Backport fixes, compatibility with new framework versions, and infrastructure changes from dev.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Every change is a cherry-pick of a commit on dev or of a PR to dev.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

ipanfilo and others added 13 commits June 17, 2026 11:13
…ernels got run (#545)

* [ROCm] fix the bug in hipfied optimized cast tranpose flow that two kernels got run

(cherry picked from commit a6470b0)
* [ROCm] fix the bug in hipfied optimized cast tranpose flow that two kernels got run

* [ROCm] move the if(fallback_to_cost_model_rtc) branch into the upstream rtc branch

* [ROCm] address reviewer comments left in PR545

(cherry picked from commit 789035a)
* GHA to build release wheel set
* Suppress verbose logging from AOTriton build
* Decrease verbosity of hipification

(cherry picked from commit e6b79af)
* CI: Refactor ROCm CI to use GPU-sized runners and build-only jobs

* Update labels

* Shallow clone

* Address comments

* Add a missing submodule

* Address comments

* Cleanup

* Address comments

* Fix NVTE_FRAMEWORK

(cherry picked from commit 6b96c46)
* rocm-ci: scope test container to pod-allocated GPUs via podinfo

The sGPU/mGPU jobs launched the test container with
'--device=/dev/dri --device=/dev/kfd', exposing ALL host GPUs to the
nested (privileged-dind) container regardless of the GPUs Kubernetes
allocated to the pod. Combined with the hard-coded absolute
HIP_VISIBLE_DEVICES=0..3, two jobs co-scheduled on the same node both
pinned physical GPUs 0-3 and collided (OOM/hangs/test failures) while
4-7 sat idle. Jobs only passed when the node was otherwise idle --
arch-independent (mi300x and mi35x).

Build GPU_FLAG from /etc/podinfo/gha-render-devices, which the runner
populates with this pod's allocated '--device /dev/dri/renderD*' flags
(falls back to all GPUs on bare metal). /dev/kfd is always passed. The
container now sees only its allocated GPUs as 0..N-1, so the per-suite
HIP_VISIBLE_DEVICES=0/1/2/3 split is correct and collision-free across
co-scheduled pods.

Requires the runner ScaleSet to populate /etc/podinfo/gha-render-devices
(see companion rocOps change).
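
The flag-building step described above can be sketched roughly as follows. This is an illustration in Python, not the CI script from the commit. The function name `build_gpu_flags` is hypothetical, and the file is assumed to hold whitespace-separated `--device /dev/dri/renderD*` flags, as the message suggests. Only the path `/etc/podinfo/gha-render-devices`, the fallback to all GPUs, and the unconditional `/dev/kfd` come from the commit text.

```python
from pathlib import Path

# Path named in the commit message; populated by the runner ScaleSet.
PODINFO_FILE = Path("/etc/podinfo/gha-render-devices")


def build_gpu_flags(podinfo_file: Path = PODINFO_FILE) -> list[str]:
    """Return container device flags scoped to this pod's allocated GPUs.

    Hypothetical helper: the real CI job may build GPU_FLAG differently.
    """
    flags: list[str] = []
    try:
        # Assumption: the file contains whitespace-separated flags such as
        # "--device /dev/dri/renderD128 --device /dev/dri/renderD129".
        flags = podinfo_file.read_text().split()
    except OSError:
        # File missing or unreadable (for example on bare metal).
        pass

    if not flags:
        # Fallback described in the commit: expose all GPUs.
        flags = ["--device=/dev/dri"]

    # /dev/kfd is always passed, regardless of the allocation.
    return flags + ["--device=/dev/kfd"]
```

With flags built this way, the container would see only its allocated render nodes, so indices such as HIP_VISIBLE_DEVICES=0..3 are relative to the pod's own GPUs and not to the host's.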

---------

Co-authored-by: leo-automation <drleonid@amd.com>
(cherry picked from commit e66f431)
* [ROCm] add the bias all row -inf support for jax unfused-attn

* [ROCm] address reviewer comments and fix the pytest failure

* [ROCm] add the ck guard to newly added all row -inf bias test

* Update tests/jax/test_fused_attn.py

Co-authored-by: Meekail Zain <34613774+Micky774@users.noreply.github.com>

---------

Co-authored-by: Meekail Zain <34613774+Micky774@users.noreply.github.com>
(cherry picked from commit 001807b)
* Do not use deprecated pxla.thread_resources Meshs on JAX 0.9
* Fix typo in FFI target registration

(cherry picked from commit 5a5f7da)
(cherry picked from commit ef79328)
@ipanfilo ipanfilo added the ci-level 2 CI test level 2 label Jun 17, 2026
Labels

ci-level 2 CI test level 2

6 participants