Ipanfilo/port fixes to 212 #633
Open
ipanfilo wants to merge 14 commits into
(cherry picked from commit 4297f03)
* [ROCm] fix the bug in the hipified optimized cast transpose flow where two kernels were run
* [ROCm] move the if(fallback_to_cost_model_rtc) branch into the upstream rtc branch
* [ROCm] address reviewer comments left in PR545
(cherry picked from commit 789035a)
* GHA to build release wheel set
* Suppress verbose logging from AOTriton build
* Decrease verbosity of hipification
(cherry picked from commit e6b79af)
* CI: Refactor ROCm CI to use GPU-sized runners and build-only jobs
* Update labels
* Shallow clone
* Address comments
* Add a missing submodule
* Address comments
* Cleanup
* Address comments
* Fix NVTE_FRAMEWORK
(cherry picked from commit 6b96c46)
* rocm-ci: scope test container to pod-allocated GPUs via podinfo

The sGPU/mGPU jobs launched the test container with '--device=/dev/dri --device=/dev/kfd', exposing ALL host GPUs to the nested (privileged-dind) container regardless of the GPUs Kubernetes allocated to the pod. Combined with the hard-coded absolute HIP_VISIBLE_DEVICES=0..3, two jobs co-scheduled on the same node both pinned physical GPUs 0-3 and collided (OOM/hangs/test failures) while 4-7 sat idle. Jobs only passed when the node was otherwise idle -- arch-independent (mi300x and mi35x).

Build GPU_FLAG from /etc/podinfo/gha-render-devices, which the runner populates with this pod's allocated '--device /dev/dri/renderD*' flags (falls back to all GPUs on bare metal). /dev/kfd is always passed. The container now sees only its allocated GPUs as 0..N-1, so the per-suite HIP_VISIBLE_DEVICES=0/1/2/3 split is correct and collision-free across co-scheduled pods.

Requires the runner ScaleSet to populate /etc/podinfo/gha-render-devices (see companion rocOps change).

---------

Co-authored-by: leo-automation <drleonid@amd.com>
(cherry picked from commit e66f431)
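The GPU_FLAG selection described in this commit message can be illustrated with a minimal sketch. This is not the workflow code from the PR: the function name `build_gpu_flags` is hypothetical, and the sketch assumes the podinfo file holds whitespace-separated `--device /dev/dri/renderD*` flags, as the commit message describes.

```python
import shlex
from pathlib import Path

# Path named in the commit message; populated by the runner ScaleSet.
PODINFO_PATH = "/etc/podinfo/gha-render-devices"


def build_gpu_flags(podinfo_path=PODINFO_PATH):
    """Return container device flags scoped to the pod's allocated GPUs.

    Hypothetical helper. Assumes the podinfo file contains flags such as
    '--device /dev/dri/renderD128 --device /dev/dri/renderD129'.
    """
    path = Path(podinfo_path)
    flags = []
    if path.is_file():
        # Split the file content the way a shell would split arguments.
        flags = shlex.split(path.read_text())
    if not flags:
        # Bare-metal fallback: no podinfo available, expose all GPUs.
        flags = ["--device=/dev/dri"]
    # /dev/kfd is always passed, regardless of the allocation.
    return flags + ["--device=/dev/kfd"]
```

With only the allocated render nodes passed through, the container enumerates its GPUs as 0..N-1, which is what makes the per-suite HIP_VISIBLE_DEVICES split safe for co-scheduled pods.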
* [ROCm] add the bias all row -inf support for jax unfused-attn
* [ROCm] address reviewer comments and fix the pytest failure
* [ROCm] add the ck guard to newly added all row -inf bias test
* Update tests/jax/test_fused_attn.py

Co-authored-by: Meekail Zain <34613774+Micky774@users.noreply.github.com>

---------

Co-authored-by: Meekail Zain <34613774+Micky774@users.noreply.github.com>
(cherry picked from commit 001807b)
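For context on why an all -inf bias row needs special handling, here is a small pure-Python sketch, not the JAX code from this commit. A naive softmax subtracts the row maximum, and when every logit is -inf that subtraction yields NaN. The sketch assumes the intended result for such a row is all zeros; the function name `masked_softmax_row` is hypothetical.

```python
import math


def masked_softmax_row(scores, bias):
    """Softmax over one attention row with an additive bias.

    Hypothetical illustration. Assumes a fully masked row (all logits
    -inf) should yield zeros instead of NaN.
    """
    logits = [s + b for s, b in zip(scores, bias)]
    row_max = max(logits)
    if row_max == -math.inf:
        # Every position is masked: (-inf) - (-inf) would be NaN,
        # so return zeros for the whole row instead.
        return [0.0] * len(logits)
    # Standard max-subtracted softmax; exp(-inf) evaluates to 0.0.
    exps = [math.exp(x - row_max) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

A partially masked row still normalizes over the unmasked positions, so only the fully masked case takes the early return.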
(cherry picked from commit e0587a9)
(cherry picked from commit ed839f4)
(cherry picked from commit a47087b)
* Do not use deprecated pxla.thread_resources Meshes on JAX 0.9
* Fix typo in FFI target registration
(cherry picked from commit 5a5f7da)
(cherry picked from commit 5e7bf04)
(cherry picked from commit ef79328)
Description
Backport fixes, compatibility with new framework versions, and infrastructure changes from dev.
Type of change
Changes
Every change is a cherry-pick from dev or from a PR to dev.
Checklist: