
ROCm 7.2 on AMD Strix Halo (gfx1151): guide, benchmarks, and MIOpen cache fix #454

@realugbun

Description

## Summary

I got Kokoro-FastAPI running on an AMD Ryzen AI MAX+ 395 (Strix Halo) with the Radeon 8060S iGPU (gfx1151) using ROCm 7.2. Sharing findings since the existing docker/rocm/ setup targets ROCm 6.4, which doesn't support this GPU architecture.

Result: ~10x speedup over CPU — but only after solving a critical MIOpen kernel caching issue that causes 5-60 second delays on every request without the fix.

## Environment

| Component | Version |
|---|---|
| CPU / APU | AMD Ryzen AI MAX+ 395 (Strix Halo) |
| GPU | AMD Radeon 8060S iGPU (gfx1151), 32GB BAR on 96GB unified LPDDR5 |
| Host OS | Proxmox VE 9.x |
| Host kernel | 6.19.2-1-pve (critical: older kernels lack gfx1151 support) |
| Guest OS | Ubuntu 24.04 LTS (unprivileged LXC with GPU passthrough) |
| ROCm | 7.2.0 |
| PyTorch | 2.10.0+rocm7.2.0 (from AMD's repo.radeon.com, not pytorch.org) |
| Kokoro-FastAPI | 0.2.4 |
| kokoro | 0.9.4 |

## Key Finding: `MIOPEN_FIND_MODE=2` is critical

MIOpen compiles optimized GPU kernels for each unique tensor shape. In Kokoro, tensor shapes vary by phoneme count — every unique input length produces a different shape. Without caching, this means:

  • First request per unique phoneme length: 5-60 seconds (MIOpen kernel search)
  • Subsequent requests at same length: 0.3-5 seconds (cached)

The MIOpen cache persists to disk at ~/.config/miopen/ (~2.4MB for all lengths), but by default each new process re-searches instead of using the disk cache. Setting MIOPEN_FIND_MODE=2 forces MIOpen to use the disk cache, eliminating the per-process penalty entirely.

### Environment variables for the service

```shell
DEVICE=cuda                                    # PyTorch ROCm uses the CUDA API (HIP translates)
MIOPEN_FIND_MODE=2                             # Use the disk cache, don't re-search
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1      # Faster attention ops on gfx1151
```
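
If the service runs under systemd, these variables could be set with a drop-in like the following. This is a sketch only: the unit name `kokoro-fastapi.service` and the drop-in path are assumptions; adjust to however the service is actually launched.

```
# /etc/systemd/system/kokoro-fastapi.service.d/rocm.conf
# (hypothetical unit name -- match your actual service)
[Service]
Environment=DEVICE=cuda
Environment=MIOPEN_FIND_MODE=2
Environment=TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
```

After adding the drop-in, run `systemctl daemon-reload` and restart the service for the variables to take effect.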

## MIOpen Cache Warmup

The disk cache needs to be populated once. I wrote a warmup script that runs forward passes for all phoneme lengths 1-340 (which covers 100% of real-world traffic based on 25 days of production data — the text chunker caps at ~450 phonemes, and in practice nothing exceeded 331).

Warmup takes ~2 hours but only needs to run once. The cache survives reboots and service restarts. It's only invalidated by PyTorch or ROCm version upgrades.

After warmup with MIOPEN_FIND_MODE=2:

```
# Every request is fast from the first call after a service restart:
Short sentence:  1.1s  (was 13s on CPU)
Medium sentence: 1.3s  (was 18s on CPU)
Long paragraph:  5.0s  (was 18s/chunk on CPU)
```

### Warmup script
```python
#!/usr/bin/env python3
"""Exhaustive MIOpen kernel warmup for Kokoro TTS on ROCm.
Run WITHOUT MIOPEN_FIND_MODE=2 so it searches for optimal kernels.
After completion, set MIOPEN_FIND_MODE=2 for production."""

import os
import sys
import time

import torch

sys.path.insert(0, '/path/to/kokoro-fastapi')
sys.path.insert(0, '/path/to/kokoro-fastapi/api')
os.environ['ESPEAK_DATA_PATH'] = '/usr/lib/x86_64-linux-gnu/espeak-ng-data'
os.environ['TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL'] = '1'

from kokoro import KModel

print(f"[{time.strftime('%H:%M:%S')}] Loading model on GPU...")
model = KModel(config="api/src/models/v1_0/config.json",
               model="api/src/models/v1_0/kokoro-v1_0.pth").eval().cuda()
voice = torch.load("api/src/voices/v1_0/af_heart.pt", weights_only=True)
print(f"[{time.strftime('%H:%M:%S')}] Starting warmup...")

total_time = 0.0
for n in range(1, 341):  # 340 covers all practical input lengths
    ps = ('a ' * ((n + 1) // 2))[:n]        # phoneme string of exactly n chars
    ref_s = voice[min(len(ps), 509)]        # style vector indexed by length
    torch.cuda.synchronize()
    t0 = time.time()
    try:
        audio = model(ps, ref_s, speed=1)
    except Exception as e:
        print(f"ERROR at n={n}: {e}")
        continue
    torch.cuda.synchronize()
    elapsed = time.time() - t0
    total_time += elapsed
    if n % 10 == 0:
        print(f"[{time.strftime('%H:%M:%S')}] {n:3d}/340 | this={elapsed:5.1f}s | total={total_time/60:5.1f}m", flush=True)

print(f"\nWarmup complete in {total_time/60:.1f} minutes")
```
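
The run sequence around the script can be sketched as a shell session. The script name `warmup.py` is a placeholder, and the actual run is commented out here since it takes hours:

```shell
# 1. Warm up WITHOUT MIOPEN_FIND_MODE=2 so MIOpen searches for optimal kernels
unset MIOPEN_FIND_MODE
# python warmup.py        # ~2 hours on first run (illustrative, not executed here)

# 2. The cache lands in the per-user MIOpen directory
ls -lh ~/.config/miopen/ 2>/dev/null || echo "cache directory not present yet"

# 3. For production, switch to fast find mode so every process reuses the cache
export MIOPEN_FIND_MODE=2
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
```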

## ROCm 7.2 Setup Notes

### PyTorch wheels

The stock docker/rocm/ Dockerfile uses torch==2.8.0+rocm6.4 from pytorch.org, which does not include gfx1151 support. For Strix Halo you need AMD's official wheels from repo.radeon.com:

```shell
pip install torch --index-url https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/
```

The pytorch.org ROCm wheels segfault on ROCm 7.2 because they are built against ROCm 6.x libraries.

### Kernel version matters

gfx1151 (RDNA 3.5 / Strix Halo) requires a recent Linux kernel whose amdgpu driver supports this GPU ID. Kernel 6.19+ works; older kernels (e.g., 6.8) may not recognize the GPU at all.
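
A quick sanity check of the host can look like this. `rocminfo` ships with ROCm and lists the `gfx1151` agent when the GPU is recognized; the snippet degrades gracefully if ROCm isn't installed:

```shell
# Kernel version (must be new enough for gfx1151 support)
uname -r

# If ROCm is installed, confirm the GPU is enumerated
if command -v rocminfo >/dev/null 2>&1; then
    rocminfo | grep -i gfx1151 || echo "gfx1151 not found: check kernel/driver"
else
    echo "rocminfo not installed"
fi
```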

### GPU passthrough (LXC/VM)

If running in an LXC container, these devices need passthrough:

  • /dev/dri/card0
  • /dev/dri/renderD128
  • /dev/kfd
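
For a Proxmox LXC, the passthrough above could be expressed with `dev` entries in the container config. This is a sketch: the `gid` values are assumptions that must match the `video` and `render` group IDs inside the guest, and device numbers (`card0`, `renderD128`) can differ per host (check `/dev/dri` on the host):

```
# /etc/pve/lxc/<CTID>.conf -- hypothetical entries; verify gids in your guest
dev0: /dev/dri/card0,gid=44
dev1: /dev/dri/renderD128,gid=104
dev2: /dev/kfd,gid=104
```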

### Voice-independent cache

The MIOpen cache is keyed on tensor shapes, not values. Different voices change the style vector values but not dimensions (always [1, 256]). So the cache works for all 67 voice packs without re-warming.
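
This shape-keyed behavior can be illustrated with a toy cache (a pure-Python sketch, not MIOpen's actual implementation): values never enter the key, so all voices with the same [1, 256] style-vector shape hit the same entry, while a new phoneme length creates a new one.

```python
# Toy shape-keyed kernel cache: the key depends only on dimensions, never values.
kernel_cache = {}

def get_kernel(shape):
    """Return a 'compiled kernel' for a tensor shape, compiling on first use."""
    if shape not in kernel_cache:
        # In MIOpen this is the slow kernel search; here it is just a string.
        kernel_cache[shape] = "kernel_for_" + "x".join(map(str, shape))
    return kernel_cache[shape]

# Two different voices: different values, identical [1, 256] shape -> one entry
get_kernel((1, 256))
get_kernel((1, 256))
assert len(kernel_cache) == 1

# A different phoneme length changes a dimension -> new entry
get_kernel((1, 310))
assert len(kernel_cache) == 2
```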

## Benchmarks

Tested with Kokoro-FastAPI v0.2.4, kokoro 0.9.4, 67 voice packs:

| Input | CPU (torch 2.10+cpu) | GPU/ROCm (torch 2.10+rocm7.2) | Speedup |
|---|---|---|---|
| "Hello world." (13 phonemes) | 13.1s | 1.1s | 12x |
| 20-word sentence (~50 phonemes) | 18.2s | 1.3s | 14x |
| Long paragraph (~130 phonemes) | 17.9s | 1.8s | 10x |
| Very long (~300 phonemes) | 18.0s | 5.0s | 3.6x |

Model memory footprint is minimal (~1.5GB including PyTorch overhead) — the 82M param model coexists fine with other GPU workloads on the shared iGPU.

## Suggestion

Would it be worth adding a MIOPEN_FIND_MODE=2 default to the existing docker/rocm/ setup, and documenting the warmup pattern? The current ROCm Docker image will have the same per-process recompilation issue on any AMD GPU, not just Strix Halo.

Happy to contribute a PR if there's interest.
