From 35112de85fb8adab7523fc37362e45de7f6ef57c Mon Sep 17 00:00:00 2001
From: geramyloveless <gloveless@jqluv.com>
Date: Wed, 6 May 2026 13:10:07 -0700
Subject: [PATCH] =?UTF-8?q?ci(linux):=20build=20fat=20package=20=E2=80=94?=
 =?UTF-8?q?=20GGML=5FBACKEND=5FDL=20+=20GGML=5FCPU=5FALL=5FVARIANTS?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Replace the AVX2/FMA/F16C portable baseline (#3) with a fat-package build
that produces one libstable-diffusion.so plus a libggml-cpu-*.so per CPU
variant: sandybridge (AVX), haswell (AVX2), skylakex (AVX-512F), icelake
(AVX-512 + VNNI), alderlake (AVX2 + AVX-VNNI), and a pure-x64 fallback.

At runtime ggml dlopens the variants and picks the highest-tier one the
host CPU supports. AVX-512 hosts get AVX-512 perf; older hosts load the
best variant they support. No -march=native runner lottery, no SIGILL.
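
The selection described above can be approximated from the shell when
debugging which variant a host is likely to load. This is only a sketch:
guess_variant and has_flag are hypothetical helpers, and the mapping from
/proc/cpuinfo flags to variant names is a simplified assumption, not ggml's
actual scoring code.

```shell
# Hypothetical helper: succeed if flag $2 appears in the space-separated list $1.
has_flag() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: print the variant name a host with the given CPU flags
# would most plausibly load. The flag requirements are assumptions.
guess_variant() {
    flags=$1
    if has_flag "$flags" avx512f && has_flag "$flags" avx512_vnni; then
        echo icelake
    elif has_flag "$flags" avx512f; then
        echo skylakex
    elif has_flag "$flags" avx2 && has_flag "$flags" avx_vnni; then
        echo alderlake
    elif has_flag "$flags" avx2 && has_flag "$flags" fma && has_flag "$flags" f16c; then
        echo haswell
    elif has_flag "$flags" avx; then
        echo sandybridge
    else
        echo x64
    fi
}
```

On a Linux host it could be called as
guess_variant "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)".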

Tradeoff: the zip grows from ~12 MB to ~50–80 MB. Acceptable for a one-time
download, especially since downstream consumers (Lemonade) cache the
extracted directory across model loads.

Applied to ubuntu-latest-cmake (CPU) and ubuntu-latest-rocm (HIP), since
the HIPBLAS build still uses ggml CPU ops for parts of the pipeline.

Windows AVX2 already pins GGML_NATIVE=OFF + AVX2 only, and macOS arm64
targets only Apple Silicon, which shares a NEON+DOTPROD baseline across all
generations (i8mm and bf16 arrive with M2), so neither needs the same
treatment.

Upstream PR leejet/stable-diffusion.cpp#1448 (commit b8079e2) already wired
the runtime backend discovery code into libstable-diffusion.so; this change
only enables the build flags that produce the variant .so files.
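
A quick way to confirm that the flags took effect is to count the variant
libraries in the build output. The sketch below assumes a hypothetical helper
name (check_variants) and takes the output directory as an argument, since
the actual directory layout is not stated here.

```shell
# Hypothetical post-build check: print how many libggml-cpu-*.so files the
# given directory holds, and fail if there are none.
check_variants() {
    dir=$1
    count=0
    for lib in "$dir"/libggml-cpu-*.so; do
        # Skip the unexpanded glob pattern when nothing matches.
        [ -e "$lib" ] || continue
        count=$((count + 1))
    done
    echo "$count"
    [ "$count" -gt 0 ]
}
```

Run as a CI step after the build, a non-zero exit status would flag a package
that silently shipped without any CPU variants.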
---
 .github/workflows/build.yml | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index eb2e3ab53..1cb149aef 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -53,16 +53,16 @@ jobs:
         run: |
           mkdir build
           cd build
-          # Disable -march=native and pin CPU instruction set to AVX2+FMA+F16C so
-          # the released x86_64 binary runs on hosts without AVX-512.
-          # Without GGML_NATIVE=OFF, ggml's CMake auto-enables every extension
-          # the build runner's CPU has (including AVX-512 on Azure Xeon
-          # Platinum 8370C runners), which then SIGILLs on AVX-512-less hosts.
+          # Build a fat package: one libstable-diffusion.so plus a libggml-cpu-*.so
+          # per CPU variant (sandybridge, haswell, skylakex, icelake, alderlake,
+          # x64). At runtime ggml dlopens the variants and picks the best one the
+          # host CPU supports, so an AVX-512 host gets AVX-512 perf and an
+          # AVX-512-less host falls back to haswell or lower. Same zip, no
+          # -march=native runner lottery, no SIGILL.
           cmake .. \
             -DGGML_NATIVE=OFF \
-            -DGGML_AVX2=ON \
-            -DGGML_FMA=ON \
-            -DGGML_F16C=ON \
+            -DGGML_BACKEND_DL=ON \
+            -DGGML_CPU_ALL_VARIANTS=ON \
             -DSD_BUILD_SHARED_LIBS=ON
           cmake --build . --config Release
 
@@ -513,16 +513,16 @@ jobs:
         run: |
           mkdir build
           cd build
-          # Same portability concern as ubuntu-latest-cmake: pin the host CPU
-          # instruction set so the binary runs on AVX-512-less ROCm hosts too.
+          # Fat package: same approach as ubuntu-latest-cmake. The HIPBLAS build
+          # still uses ggml's CPU ops for parts of the pipeline (CLIP encoding,
+          # etc.), so it benefits from per-CPU variants the same way.
           cmake .. -G Ninja \
             -DCMAKE_HIP_COMPILER="$(hipconfig -l)/clang" \
             -DCMAKE_HIP_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600" \
             -DCMAKE_BUILD_TYPE=Release \
             -DGGML_NATIVE=OFF \
-            -DGGML_AVX2=ON \
-            -DGGML_FMA=ON \
-            -DGGML_F16C=ON \
+            -DGGML_BACKEND_DL=ON \
+            -DGGML_CPU_ALL_VARIANTS=ON \
             -DSD_HIPBLAS=ON \
             -DHIP_PLATFORM=amd \
             -DGPU_TARGETS="${{ matrix.gpu_targets }}" \