## Summary
I got Kokoro-FastAPI running on an AMD Ryzen AI MAX+ 395 (Strix Halo) with the Radeon 8060S iGPU (gfx1151) using ROCm 7.2. Sharing findings since the existing `docker/rocm/` setup targets ROCm 6.4, which doesn't support this GPU architecture.
**Result:** ~10x speedup over CPU — but only after solving a critical MIOpen kernel caching issue that causes 5-60 second delays on every request without the fix.
## Environment

| Component | Version |
|---|---|
| CPU / APU | AMD Ryzen AI MAX+ 395 (Strix Halo) |
| GPU | AMD Radeon 8060S iGPU (gfx1151), 32GB BAR on 96GB unified LPDDR5 |
| Host OS | Proxmox VE 9.x |
| Host kernel | 6.19.2-1-pve (critical — older kernels lack gfx1151 support) |
| Guest OS | Ubuntu 24.04 LTS (unprivileged LXC with GPU passthrough) |
| ROCm | 7.2.0 |
| PyTorch | 2.10.0+rocm7.2.0 (from AMD's repo.radeon.com, not pytorch.org) |
| Kokoro-FastAPI | 0.2.4 |
| kokoro | 0.9.4 |
## Key Finding: `MIOPEN_FIND_MODE=2` is critical
MIOpen compiles optimized GPU kernels for each unique tensor shape. In Kokoro, tensor shapes vary by phoneme count — every unique input length produces a different shape. Without caching, this means:
- First request per unique phoneme length: 5-60 seconds (MIOpen kernel search)
- Subsequent requests at same length: 0.3-5 seconds (cached)
The MIOpen cache persists to disk at `~/.config/miopen/` (~2.4MB for all lengths), but by default each new process re-searches instead of using the disk cache. Setting `MIOPEN_FIND_MODE=2` forces MIOpen to use the disk cache, eliminating the per-process penalty entirely.
### Environment variables for the service

```bash
DEVICE=cuda                                 # PyTorch ROCm uses CUDA API (HIP translates)
MIOPEN_FIND_MODE=2                          # Use disk cache, don't re-search
TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1   # Faster attention ops on gfx1151
```
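One way to wire these in, if the API runs as a systemd service, is a drop-in override (the unit name and file path here are placeholders, not from this setup):

```ini
# /etc/systemd/system/kokoro-fastapi.service.d/rocm.conf  (hypothetical unit name)
[Service]
Environment=DEVICE=cuda
Environment=MIOPEN_FIND_MODE=2
Environment=TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
```

Apply with `systemctl daemon-reload` followed by a restart of the service.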
## MIOpen Cache Warmup
The disk cache needs to be populated once. I wrote a warmup script that runs forward passes for all phoneme lengths 1-340 (which covers 100% of real-world traffic based on 25 days of production data — the text chunker caps at ~450 phonemes, and in practice nothing exceeded 331).
Warmup takes ~2 hours but only needs to run once. The cache survives reboots and service restarts. It's only invalidated by PyTorch or ROCm version upgrades.
After warmup with `MIOPEN_FIND_MODE=2`:

```
# Every request is fast from the first call after a service restart:
Short sentence:  1.1s  (was 13s on CPU)
Medium sentence: 1.3s  (was 18s on CPU)
Long paragraph:  5.0s  (was 18s/chunk on CPU)
```
### Warmup script

```python
#!/usr/bin/env python3
"""Exhaustive MIOpen kernel warmup for Kokoro TTS on ROCm.

Run WITHOUT MIOPEN_FIND_MODE=2 so it searches for optimal kernels.
After completion, set MIOPEN_FIND_MODE=2 for production."""
import torch
import time
import sys
import os

sys.path.insert(0, '/path/to/kokoro-fastapi')
sys.path.insert(0, '/path/to/kokoro-fastapi/api')
os.environ['ESPEAK_DATA_PATH'] = '/usr/lib/x86_64-linux-gnu/espeak-ng-data'
os.environ['TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL'] = '1'

from kokoro import KModel

print(f"[{time.strftime('%H:%M:%S')}] Loading model on GPU...")
model = KModel(config="api/src/models/v1_0/config.json",
               model="api/src/models/v1_0/kokoro-v1_0.pth").eval().cuda()
voice = torch.load("api/src/voices/v1_0/af_heart.pt", weights_only=True)

print(f"[{time.strftime('%H:%M:%S')}] Starting warmup...")
total_time = 0
for n in range(1, 341):  # 340 covers all practical input lengths
    ps = ('a ' * ((n + 1) // 2))[:n]
    ref_s = voice[min(len(ps), 509)]
    torch.cuda.synchronize()
    t0 = time.time()
    try:
        audio = model(ps, ref_s, speed=1)
    except Exception as e:
        print(f"ERROR at n={n}: {e}")
        continue
    torch.cuda.synchronize()
    elapsed = time.time() - t0
    total_time += elapsed
    if n % 10 == 0:
        print(f"[{time.strftime('%H:%M:%S')}] {n:3d}/340 | this={elapsed:5.1f}s | total={total_time/60:5.1f}m", flush=True)

print(f"\nWarmup complete in {total_time/60:.1f} minutes")
```
## ROCm 7.2 Setup Notes
### PyTorch wheels

The stock `docker/rocm/` Dockerfile uses `torch==2.8.0+rocm6.4` from pytorch.org. This does not include gfx1151 support. For Strix Halo you need AMD's official wheels from repo.radeon.com:

```bash
pip install torch --index-url https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/
```

The pytorch.org ROCm wheels segfault on ROCm 7.2 — they're built against ROCm 6.x libraries.
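A quick sanity check after installing (the exact version strings below are illustrative — they'll vary with the wheel you install):

```python
import torch

# On AMD's ROCm build, the version string includes the ROCm release and
# torch.version.hip is set (it is None on CPU/CUDA builds).
print(torch.__version__)          # e.g. 2.10.0+rocm7.2.0
print(torch.version.hip)          # set on a ROCm build, None otherwise
print(torch.cuda.is_available())  # True once the iGPU is visible via HIP
```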
### Kernel version matters

gfx1151 (RDNA 3.5 / Strix Halo) requires a recent Linux kernel with the `amdgpu` driver supporting this GPU ID. Kernel 6.19+ works. Older kernels (e.g., 6.8) may not recognize the GPU at all.
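Two quick checks before debugging anything higher in the stack (the grep pattern is just a convenience, not official tooling):

```shell
# Kernel must be new enough for gfx1151 (6.19+ in this setup)
uname -r

# With ROCm installed, the GPU should show up as a gfx1151 agent;
# print a note instead of failing if rocminfo isn't on PATH
rocminfo 2>/dev/null | grep -i "gfx" || echo "rocminfo not available"
```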
### GPU passthrough (LXC/VM)

If running in an LXC container, these devices need passthrough:

```
/dev/dri/card0
/dev/dri/renderD128
/dev/kfd
```
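For a Proxmox unprivileged LXC, one way to do this is the PVE 8+ `dev` passthrough syntax — a sketch, where the container ID and group IDs are examples (check `ls -ln /dev/dri /dev/kfd` on the host for the gids your distro uses):

```
# /etc/pve/lxc/<CTID>.conf
dev0: /dev/dri/card0,gid=44        # gid of the guest's "video" group
dev1: /dev/dri/renderD128,gid=993  # gid of the guest's "render" group
dev2: /dev/kfd,gid=993
```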
## Voice-independent cache

The MIOpen cache is keyed on tensor shapes, not values. Different voices change the style vector values but not dimensions (always `[1, 256]`). So the cache works for all 67 voice packs without re-warming.
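A toy illustration of why this holds — the random tensors here stand in for real voice packs such as `af_heart.pt`, using the `[510, 1, 256]` layout implied by how the warmup script indexes them:

```python
import torch

# Stand-ins for torch.load(".../af_heart.pt") etc.: per-length style vectors
# indexed by phoneme count. Values differ per voice; shapes do not.
voice_a = torch.randn(510, 1, 256)
voice_b = torch.randn(510, 1, 256)

# For any phoneme count n, every voice yields a [1, 256] style vector, so
# MIOpen sees identical tensor shapes and reuses the same cached kernels.
n = 37
assert voice_a[min(n, 509)].shape == voice_b[min(n, 509)].shape == (1, 256)
```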
## Benchmarks
Tested with Kokoro-FastAPI v0.2.4, kokoro 0.9.4, 67 voice packs:
| Input | CPU (torch 2.10+cpu) | GPU/ROCm (torch 2.10+rocm7.2) | Speedup |
|---|---|---|---|
| "Hello world." (13 phonemes) | 13.1s | 1.1s | 12x |
| 20-word sentence (~50 phonemes) | 18.2s | 1.3s | 14x |
| Long paragraph (~130 phonemes) | 17.9s | 1.8s | 10x |
| Very long (~300 phonemes) | 18.0s | 5.0s | 3.6x |
Model memory footprint is minimal (~1.5GB including PyTorch overhead) — the 82M param model coexists fine with other GPU workloads on the shared iGPU.
## Suggestion
Would it be worth adding a `MIOPEN_FIND_MODE=2` default to the existing `docker/rocm/` setup, and documenting the warmup pattern? The current ROCm Docker image will have the same per-process recompilation issue on any AMD GPU, not just Strix Halo.
Happy to contribute a PR if there's interest.