Skip to content

Conversation

@loci-dev
Copy link

@loci-dev loci-dev commented Feb 6, 2026

Note

Source pull request: leejet/stable-diffusion.cpp#1254

Even though the name "SDXS-09" is similar to "SDXS", it is completely different from SDXS.
For this reason that "SDXS" was renamed to "SDXS (DS)", where DS stands for the DreamShaper edition by IDKiro.

Even though the name "SDXS-09" is similar to "SDXS", it is completely
different from SDXS. For this reason that SDXS was renamed to "SDXS (DS)",
where DS stands for the DreamShaper edition by IDKiro.
@loci-review
Copy link

loci-review bot commented Feb 6, 2026

Overview

Analysis of stable-diffusion.cpp compared 48,089 functions across two versions, finding 83 modified (0.17%), with no new or removed functions. The single commit adds SDXS-09 model variant support.

Binaries Analyzed:

  • build.bin.sd-server: 512,975.76 nJ → 512,800.65 nJ (-0.034%)
  • build.bin.sd-cli: 479,167.23 nJ → 479,107.36 nJ (-0.012%)

Power consumption improved marginally in both binaries despite localized function regressions, indicating stable overall performance.

Function Analysis

Intentional Feature Additions (Expected Overhead):

Four instances of sd_version_is_sd2 (model.h:63-68) across both binaries show identical changes: response time increased from 44.33ns to 56.04ns (+11.71ns, +26.4%), throughput time identical. The function added VERSION_SDXS_09 to SD2 classification logic, extending the conditional from 3 to 4 enum comparisons. This is necessary for SDXS-09 model support and occurs only during initialization.

Two instances of UNet lambda operator get_attention_layer (unet.hpp:260-274) show throughput time increased from 75.65ns to 103.85ns (+28.20ns, +37.3%), while response time remained essentially flat (+0.03-0.04%, ~668μs total). Added conditional logic normalizes SDXS-09 attention parameters (5 heads × 64 dims → 1 head × 320 dims) during model construction. The minimal response time impact confirms child operations (heap allocation) dominate execution.

Standard Library Variations (No Source Changes):

Three STL iterator functions in build.bin.sd-server show consistent regressions: std::_Rb_tree::end() (+183.30ns, +231.5% response time), std::vector::end() (+183.29ns, +227.3%), and std::vector<char>::begin() (+180.81ns, +216.9%). These simple O(1) accessors show 3-4x slowdowns without source modifications, suggesting compiler optimization differences between builds. Used in tensor map operations and tokenization, cumulative impact is ~5.4ms for 10,000 calls per function—negligible compared to model loading time.

std::shared_ptr::operator= for GGMLBlock (shared_ptr_base.h:1626-1630) shows throughput time doubled from 77.98ns to 157.96ns (+102.6%), while response time increased modestly (+8.4%, 957.90ns → 1037.92ns). Standard library code managing neural network block reference counting, with no application source changes.

std::map::operator[] (build.bin.sd-cli) shows throughput time increased from 154.20ns to 216.23ns (+40.2%), response time +1.05%. Used extensively for tensor_storage_map operations during model loading.

Performance Improvements:

ggml_e8m0_to_fp32_half (build.bin.sd-cli) improved significantly: throughput time decreased from 144.82ns to 109.67ns (-35.15ns, -24.3%), response time -23.0%. This GGML library quantization function benefits quantized model loading. If called 1 million times, saves ~35 milliseconds, offsetting other regressions.

Other analyzed functions showed negligible changes.

Additional Findings

ML Operations Impact: No analyzed functions are in the inference hot path. All changes affect initialization and model loading phases. The quantization improvement benefits quantized model workflows, while SDXS-09 additions enable new model variant support without inference overhead.

Cross-Function Patterns: STL regressions show consistent absolute increases (~180ns for iterators) only in the server binary, suggesting binary-specific build configuration differences rather than code quality issues. Feature additions show predictable overhead proportional to added logic complexity. Net effect for quantized models is ~30ms improvement (quantization gains offset STL regressions); for non-quantized models, ~4.4ms overhead is negligible compared to typical model loading times (seconds).

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-review
Copy link

loci-review bot commented Feb 7, 2026

Overview

Analysis of 48,089 functions across two binaries reveals minimal performance impact from SDXS-09 model support addition. Modified functions: 83 (0.17%), with no new or removed functions.

Binaries analyzed:

  • build.bin.sd-server: 512,975.76 nJ → 512,801.31 nJ (-0.034%)
  • build.bin.sd-cli: 479,167.23 nJ → 479,106.95 nJ (-0.013%)

Power consumption improved slightly in both binaries, indicating maintained energy efficiency despite localized regressions.

Function Analysis

Version Detection Functions (sd_version_is_sd2): Four instances across both binaries show identical changes—response and throughput time increased from 44.33ns to 56.04ns (+26.4%, +11.71ns absolute). This results from adding VERSION_SDXS_09 to the conditional check, enabling proper SD2-compatible configuration for the new model variant. Changes occur during initialization only.

UNet Attention Layer Factory (UnetModelBlock lambda operator): Both binaries show throughput time increased from 75.65ns to 103.85ns (+37.3%, +28.19ns absolute) while response time remained constant at ~668µs (+0.03%). Added SDXS-09 parameter remapping logic (n_head: 5→1, d_head: 64→320) accounts for the regression. Constructor-time operation with no inference impact.

Quantization Conversion (ggml_e8m0_to_fp32_half): Significant improvement with throughput time decreased from 144.82ns to 109.67ns (-24.3%, -35.15ns absolute). This upstream GGML optimization directly benefits inference performance for quantized models.

STL Functions: Multiple standard library functions (map/vector iterators, smart pointers) show 40-300% throughput increases with 60-183ns absolute changes. No application source code modifications detected—regressions likely stem from compiler optimization differences (GCC 13, aarch64). These occur primarily during initialization, not inference hot paths.

Additional Findings

The quantization improvement offsets initialization overhead, resulting in net positive power efficiency. All performance-critical inference operations remain unaffected or improved. Cross-binary consistency confirms deterministic implementation of SDXS-09 support with minimal architectural impact.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 5 times, most recently from 3ad80c4 to 74d69ae Compare February 12, 2026 04:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants