
ROCm 7.2 on AMD Strix Halo (gfx1151): guide, benchmarks, and MIOpen cache fix #454

@realugbun

Description

## Summary

I got Kokoro-FastAPI running on an AMD Ryzen AI MAX+ 395 (Strix Halo) with the Radeon 8060S iGPU (gfx1151) using ROCm 7.2. Sharing findings since the existing docker/rocm/ setup targets ROCm 6.4, which doesn't support this GPU architecture.

Result: ~10x speedup over CPU — but only after solving a critical MIOpen kernel caching issue that causes 5-60 second delays on every request without the fix.

## Environment

| Component | Version |
|---|---|
| CPU / APU | AMD Ryzen AI MAX+ 395 (Strix Halo) |
| GPU | AMD Radeon 8060S iGPU (gfx1151), 32GB BAR on 96GB unified LPDDR5 |
| Host OS | Proxmox VE 9.x |
| Host kernel | 6.19.2-1-pve (critical: older kernels lack gfx1151 support) |
| Guest OS | Ubuntu 24.04 LTS (unprivileged LXC with GPU passthrough) |
| ROCm | 7.2.0 |
| PyTorch | 2.10.0+rocm7.2.0 (from AMD's repo.radeon.com, not pytorch.org) |
| Kokoro-FastAPI | 0.2.4 |
| kokoro | 0.9.4 |

## Key Finding: `MIOPEN_FIND_MODE=2` is critical

MIOpen compiles optimized GPU kernels for each unique tensor shape. In Kokoro, tensor shapes vary by phoneme count — every unique input length produces a different shape. Without caching, this means:

  • First request per unique phoneme length: 5-60 seconds (MIOpen kernel search)
  • Subsequent requests at same length: 0.3-5 seconds (cached)

The MIOpen cache persists to disk at ~/.config/miopen/ (~2.4MB for all lengths), but by default each new process re-searches instead of using the disk cache. Setting MIOPEN_FIND_MODE=2 forces MIOpen to use the disk cache, eliminating the per-process penalty entirely.

### Environment variables for the service

```shell
DEVICE=cuda                                    # PyTorch ROCm uses the CUDA API (HIP translates)
MIOPEN_FIND_MODE=2                             # Use the disk cache, don't re-search
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1      # Faster attention ops on gfx1151
```
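
If the service runs under systemd, these variables could be set with a drop-in like the following. This is a sketch only: the unit name `kokoro-fastapi.service` and the drop-in path are assumptions; adjust to however the service is actually launched.

```
# /etc/systemd/system/kokoro-fastapi.service.d/rocm.conf
# (hypothetical unit name -- match your actual service)
[Service]
Environment=DEVICE=cuda
Environment=MIOPEN_FIND_MODE=2
Environment=TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
```

After adding the drop-in, run `systemctl daemon-reload` and restart the service for the variables to take effect.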

## MIOpen Cache Warmup

The disk cache needs to be populated once. I wrote a warmup script that runs forward passes for all phoneme lengths 1-340 (which covers 100% of real-world traffic based on 25 days of production data — the text chunker caps at ~450 phonemes, and in practice nothing exceeded 331).

Warmup takes ~2 hours but only needs to run once. The cache survives reboots and service restarts. It's only invalidated by PyTorch or ROCm version upgrades.

After warmup with MIOPEN_FIND_MODE=2:

```
# Every request is fast from the first call after a service restart:
Short sentence:  1.1s  (was 13s on CPU)
Medium sentence: 1.3s  (was 18s on CPU)
Long paragraph:  5.0s  (was 18s/chunk on CPU)
```

### Warmup script
```python
#!/usr/bin/env python3
"""Exhaustive MIOpen kernel warmup for Kokoro TTS on ROCm.
Run WITHOUT MIOPEN_FIND_MODE=2 so it searches for optimal kernels.
After completion, set MIOPEN_FIND_MODE=2 for production."""

import os
import sys
import time

import torch

sys.path.insert(0, '/path/to/kokoro-fastapi')
sys.path.insert(0, '/path/to/kokoro-fastapi/api')
os.environ['ESPEAK_DATA_PATH'] = '/usr/lib/x86_64-linux-gnu/espeak-ng-data'
os.environ['TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL'] = '1'

from kokoro import KModel

print(f"[{time.strftime('%H:%M:%S')}] Loading model on GPU...")
model = KModel(config="api/src/models/v1_0/config.json",
               model="api/src/models/v1_0/kokoro-v1_0.pth").eval().cuda()
voice = torch.load("api/src/voices/v1_0/af_heart.pt", weights_only=True)
print(f"[{time.strftime('%H:%M:%S')}] Starting warmup...")

total_time = 0.0
for n in range(1, 341):  # 340 covers all practical input lengths
    ps = ('a ' * ((n + 1) // 2))[:n]        # phoneme string of exactly n chars
    ref_s = voice[min(len(ps), 509)]        # style vector indexed by length
    torch.cuda.synchronize()
    t0 = time.time()
    try:
        audio = model(ps, ref_s, speed=1)
    except Exception as e:
        print(f"ERROR at n={n}: {e}")
        continue
    torch.cuda.synchronize()
    elapsed = time.time() - t0
    total_time += elapsed
    if n % 10 == 0:
        print(f"[{time.strftime('%H:%M:%S')}] {n:3d}/340 | this={elapsed:5.1f}s | total={total_time/60:5.1f}m", flush=True)

print(f"\nWarmup complete in {total_time/60:.1f} minutes")
```
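
The run sequence around the script can be sketched as a shell session. The script name `warmup.py` is a placeholder, and the actual run is commented out here since it takes hours:

```shell
# 1. Warm up WITHOUT MIOPEN_FIND_MODE=2 so MIOpen searches for optimal kernels
unset MIOPEN_FIND_MODE
# python warmup.py        # ~2 hours on first run (illustrative, not executed here)

# 2. The cache lands in the per-user MIOpen directory
ls -lh ~/.config/miopen/ 2>/dev/null || echo "cache directory not present yet"

# 3. For production, switch to fast find mode so every process reuses the cache
export MIOPEN_FIND_MODE=2
echo "MIOPEN_FIND_MODE=$MIOPEN_FIND_MODE"
```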

## ROCm 7.2 Setup Notes

### PyTorch wheels

The stock docker/rocm/ Dockerfile uses torch==2.8.0+rocm6.4 from pytorch.org, which does not include gfx1151 support. For Strix Halo you need AMD's official wheels from repo.radeon.com:

```shell
pip install torch --index-url https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/
```

The pytorch.org ROCm wheels segfault on ROCm 7.2 because they are built against ROCm 6.x libraries.

### Kernel version matters

gfx1151 (RDNA 3.5 / Strix Halo) requires a recent Linux kernel whose amdgpu driver supports this GPU ID. Kernel 6.19+ works; older kernels (e.g., 6.8) may not recognize the GPU at all.
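
A quick sanity check of the host can look like this. `rocminfo` ships with ROCm and lists the `gfx1151` agent when the GPU is recognized; the snippet degrades gracefully if ROCm isn't installed:

```shell
# Kernel version (must be new enough for gfx1151 support)
uname -r

# If ROCm is installed, confirm the GPU is enumerated
if command -v rocminfo >/dev/null 2>&1; then
    rocminfo | grep -i gfx1151 || echo "gfx1151 not found: check kernel/driver"
else
    echo "rocminfo not installed"
fi
```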

### GPU passthrough (LXC/VM)

If running in an LXC container, these devices need passthrough:

  • /dev/dri/card0
  • /dev/dri/renderD128
  • /dev/kfd
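
For a Proxmox LXC, the passthrough above could be expressed with `dev` entries in the container config. This is a sketch: the `gid` values are assumptions that must match the `video` and `render` group IDs inside the guest, and device numbers (`card0`, `renderD128`) can differ per host (check `/dev/dri` on the host):

```
# /etc/pve/lxc/<CTID>.conf -- hypothetical entries; verify gids in your guest
dev0: /dev/dri/card0,gid=44
dev1: /dev/dri/renderD128,gid=104
dev2: /dev/kfd,gid=104
```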

### Voice-independent cache

The MIOpen cache is keyed on tensor shapes, not values. Different voices change the style vector values but not dimensions (always [1, 256]). So the cache works for all 67 voice packs without re-warming.
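
This shape-keyed behavior can be illustrated with a toy cache (a pure-Python sketch, not MIOpen's actual implementation): values never enter the key, so all voices with the same [1, 256] style-vector shape hit the same entry, while a new phoneme length creates a new one.

```python
# Toy shape-keyed kernel cache: the key depends only on dimensions, never values.
kernel_cache = {}

def get_kernel(shape):
    """Return a 'compiled kernel' for a tensor shape, compiling on first use."""
    if shape not in kernel_cache:
        # In MIOpen this is the slow kernel search; here it is just a string.
        kernel_cache[shape] = "kernel_for_" + "x".join(map(str, shape))
    return kernel_cache[shape]

# Two different voices: different values, identical [1, 256] shape -> one entry
get_kernel((1, 256))
get_kernel((1, 256))
assert len(kernel_cache) == 1

# A different phoneme length changes a dimension -> new entry
get_kernel((1, 310))
assert len(kernel_cache) == 2
```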

## Benchmarks

Tested with Kokoro-FastAPI v0.2.4, kokoro 0.9.4, 67 voice packs:

| Input | CPU (torch 2.10+cpu) | GPU/ROCm (torch 2.10+rocm7.2) | Speedup |
|---|---|---|---|
| "Hello world." (13 phonemes) | 13.1s | 1.1s | 12x |
| 20-word sentence (~50 phonemes) | 18.2s | 1.3s | 14x |
| Long paragraph (~130 phonemes) | 17.9s | 1.8s | 10x |
| Very long (~300 phonemes) | 18.0s | 5.0s | 3.6x |

Model memory footprint is minimal (~1.5GB including PyTorch overhead) — the 82M param model coexists fine with other GPU workloads on the shared iGPU.

## Suggestion

Would it be worth adding a MIOPEN_FIND_MODE=2 default to the existing docker/rocm/ setup, and documenting the warmup pattern? The current ROCm Docker image will have the same per-process recompilation issue on any AMD GPU, not just Strix Halo.

Happy to contribute a PR if there's interest.
