From 35112de85fb8adab7523fc37362e45de7f6ef57c Mon Sep 17 00:00:00 2001
From: geramyloveless
Date: Wed, 6 May 2026 13:10:07 -0700
Subject: [PATCH] =?UTF-8?q?ci(linux):=20build=20fat=20package=20=E2=80=94?=
 =?UTF-8?q?=20GGML=5FBACKEND=5FDL=20+=20GGML=5FCPU=5FALL=5FVARIANTS?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Replace the AVX2/FMA/F16C portable baseline (#3) with a fat-package
build that produces one libstable-diffusion.so plus one
libggml-cpu-*.so per CPU variant: sandybridge, haswell, skylakex
(AVX-512F), icelake (AVX-512 + VNNI), alderlake (AVX2 + AVX-VNNI, no
AVX-512), and a pure-x64 fallback.

At runtime ggml dlopens the variants and picks the highest-tier one the
host CPU supports. AVX-512 hosts get AVX-512 perf; older boxes fall
back gracefully, with no -march=native runner lottery and no SIGILL.

Tradeoff: the zip grows from ~12 MB to ~50–80 MB. Acceptable for a
one-time download, especially since downstream consumers (Lemonade)
cache the extracted directory across model loads.

Applied to ubuntu-latest-cmake (CPU) and ubuntu-latest-rocm (HIP),
since the HIPBLAS build still uses ggml CPU ops for parts of the
pipeline. Windows AVX2 already pins GGML_NATIVE=OFF + AVX2 only, and
macOS arm64 shares a NEON+DOTPROD baseline across all Apple Silicon
generations (i8mm and bf16 are only present from M2 onward), so
neither needs the same treatment.

Upstream PR leejet/stable-diffusion.cpp#1448 (commit b8079e2) already
wired the runtime backend discovery code into libstable-diffusion.so;
this just enables the build flags that produce the variant .so files.
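
For a rough sanity check of which variant a Linux host is likely to
land on, the sketch below guesses it from the /proc/cpuinfo flags. The
helper names (has_all, guess_variant) and the per-variant feature lists
are assumptions made for illustration; ggml's real selection happens
inside each libggml-cpu-*.so at load time and may differ in detail.

```shell
#!/bin/sh
# Rough approximation only. The feature lists below are assumptions for
# illustration; they are not read from ggml or from the built libraries.

# has_all FLAGS WANTED...: succeed if every WANTED word occurs in FLAGS.
has_all() {
    flags=" $1 "
    shift
    for want in "$@"; do
        case "$flags" in
            *" $want "*) ;;      # whole-word match, so "avx" != "avx2"
            *) return 1 ;;
        esac
    done
    return 0
}

# guess_variant FLAGS: print the variant this sketch expects for FLAGS,
# checking the most demanding variant first.
guess_variant() {
    f=$1
    if   has_all "$f" avx2 fma f16c avx512f avx512vbmi avx512_vnni; then echo icelake
    elif has_all "$f" avx2 fma f16c avx512f; then echo skylakex
    elif has_all "$f" avx2 fma f16c avx_vnni; then echo alderlake
    elif has_all "$f" avx2 fma f16c; then echo haswell
    elif has_all "$f" avx; then echo sandybridge
    else echo x64
    fi
}

# On Linux, feed in the first "flags" line of /proc/cpuinfo.
if [ -r /proc/cpuinfo ]; then
    host_flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)
    echo "expected variant: $(guess_variant "$host_flags")"
fi
```

If the guess and the variant ggml actually loads disagree, trust ggml:
the script only mirrors the ordering described above.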
---
 .github/workflows/build.yml | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index eb2e3ab53..1cb149aef 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -53,16 +53,16 @@ jobs:
         run: |
           mkdir build
           cd build
-          # Disable -march=native and pin CPU instruction set to AVX2+FMA+F16C so
-          # the released x86_64 binary runs on hosts without AVX-512.
-          # Without GGML_NATIVE=OFF, ggml's CMake auto-enables every extension
-          # the build runner's CPU has (including AVX-512 on Azure Xeon
-          # Platinum 8370C runners), which then SIGILLs on AVX-512-less hosts.
+          # Build a fat package: one libstable-diffusion.so plus a libggml-cpu-*.so
+          # per CPU variant (sandybridge, haswell, skylakex, icelake, alderlake,
+          # x64). At runtime ggml dlopens the highest-priority variant the host
+          # CPU supports, so an AVX-512 host gets AVX-512 perf and an AVX-512-less
+          # host falls back to the best variant it has (e.g. haswell). Same zip,
+          # no -march=native runner lottery, no SIGILL.
           cmake .. \
             -DGGML_NATIVE=OFF \
-            -DGGML_AVX2=ON \
-            -DGGML_FMA=ON \
-            -DGGML_F16C=ON \
+            -DGGML_BACKEND_DL=ON \
+            -DGGML_CPU_ALL_VARIANTS=ON \
             -DSD_BUILD_SHARED_LIBS=ON

           cmake --build . --config Release
@@ -513,16 +513,16 @@ jobs:
         run: |
           mkdir build
           cd build
-          # Same portability concern as ubuntu-latest-cmake: pin the host CPU
-          # instruction set so the binary runs on AVX-512-less ROCm hosts too.
+          # Fat package: same approach as ubuntu-latest-cmake. The HIPBLAS build
+          # still uses ggml's CPU ops for parts of the pipeline (CLIP encoding,
+          # etc.), so it benefits from per-CPU variants the same way.
           cmake .. -G Ninja \
             -DCMAKE_HIP_COMPILER="$(hipconfig -l)/clang" \
             -DCMAKE_HIP_FLAGS="-mllvm --amdgpu-unroll-threshold-local=600" \
             -DCMAKE_BUILD_TYPE=Release \
             -DGGML_NATIVE=OFF \
-            -DGGML_AVX2=ON \
-            -DGGML_FMA=ON \
-            -DGGML_F16C=ON \
+            -DGGML_BACKEND_DL=ON \
+            -DGGML_CPU_ALL_VARIANTS=ON \
             -DSD_HIPBLAS=ON \
             -DHIP_PLATFORM=amd \
             -DGPU_TARGETS="${{ matrix.gpu_targets }}" \