@loci-dev loci-dev commented Feb 7, 2026

Note

Source pull request: leejet/stable-diffusion.cpp#1255

After falling on my face with the first PR, it seemed necessary to get up and try again with a different issue.

This PR lets the sd-server command line set the default -H (--height) and -W (--width) values.

There is nothing special to this. I manually added .default_width and .default_height to struct SDSvrParams and initialized both endpoints with those values instead of the hard-coded 512.
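A minimal sketch of the idea, not the actual diff: only `SDSvrParams`, `default_width`, and `default_height` come from the description above. `ImageRequest`, `ImageSize`, and `resolve_size` are hypothetical names used for illustration, and the real endpoints may read the request differently.

```cpp
#include <optional>

// Simplified stand-in for the server parameter struct; only the two
// fields named in the PR description are shown here.
struct SDSvrParams {
    int default_width  = 512;  // assumed to be overridden by -W / --width
    int default_height = 512;  // assumed to be overridden by -H / --height
};

// Hypothetical request type: a dimension is empty when the client omits it.
struct ImageRequest {
    std::optional<int> width;
    std::optional<int> height;
};

// Hypothetical result type holding the dimensions actually used.
struct ImageSize {
    int width;
    int height;
};

// Hypothetical helper: fall back to the server-wide defaults rather than
// to a literal 512 when the request does not specify a dimension.
inline ImageSize resolve_size(const SDSvrParams& svr, const ImageRequest& req) {
    return ImageSize{req.width.value_or(svr.default_width),
                     req.height.value_or(svr.default_height)};
}
```

With this shape, both endpoints can share one fallback path, so a request that omits its dimensions picks up whatever was passed on the command line.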

loci-review bot commented Feb 7, 2026

Overview

Analysis of stable-diffusion.cpp compared 48,089 functions across two binaries following a single commit adding CLI options for image dimensions. Modified functions: 60 (0.12%), new: 2, removed: 1, unchanged: 48,026 (99.87%).

Power Consumption:

  • build.bin.sd-server: +0.044% (512,977 nJ → 513,205 nJ)
  • build.bin.sd-cli: 0.0% (479,167 nJ → 479,167 nJ, unchanged)

Function Analysis

SDSvrParams::get_options (directly modified): Throughput +82ns (+9.29%), response +8,959ns (+11.58%). Added two CLI options for default image height/width. The 9μs overhead occurs once at startup, not affecting inference performance. Change is justified by the feature addition.

apply_binary_op<op_div, ggml_bf16_t> (GGML tensor operation): Throughput +79ns (+6.64%), response +93ns (+3.59%). Division operations on bfloat16 tensors used in normalization and attention scaling. Potentially called thousands of times per inference, cumulative impact ~593μs per image. Source in GGML submodule (not accessible); regression warrants investigation.

apply_unary_op<op_hardsigmoid, ggml_bf16_t>: Throughput -71ns (-9.11%), response -71ns (-3.47%). Improvement in hard sigmoid activation partially offsets division regression.

Standard library functions (std::less, std::vector, std::unordered_map operations): Mixed results with throughput changes ranging from -74ns to +45ns. Most are compiler/toolchain artifacts affecting initialization code, not inference paths. Vector copy constructor improved (-33.91%), comparison operator regressed (+68.69%), but net impact is minimal as these operate during model loading.

Other analyzed functions showed negligible changes in non-critical paths.
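As a rough sanity check on the ~593μs cumulative figure quoted for apply_binary_op<op_div, ggml_bf16_t>, the sketch below back-computes the call count that figure would imply. It assumes the cumulative impact is simply the per-call delta multiplied by the number of calls, which the report does not state; `implied_calls` is a hypothetical helper.

```cpp
// Hypothetical helper: number of calls per image implied by a cumulative
// slowdown (in microseconds) and a per-call delta (in nanoseconds),
// assuming cumulative = calls * per-call delta.
constexpr double implied_calls(double cumulative_us, double per_call_ns) {
    return cumulative_us * 1000.0 / per_call_ns;
}

// Using the response-time delta (+93ns) and the throughput delta (+79ns)
// reported for the bfloat16 division operation.
constexpr double calls_from_response   = implied_calls(593.0, 93.0);
constexpr double calls_from_throughput = implied_calls(593.0, 79.0);
```

Under that assumption the implied count is roughly 6,400 calls per image from the response delta and roughly 7,500 from the throughput delta, which is consistent with the "thousands of times per inference" estimate.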

Additional Findings

The commit modified only CLI parsing code, so most of the measured variation likely stems from compiler and standard library differences between builds rather than from the change itself. The estimated ML inference impact is sub-millisecond (<1ms per image, <0.1% of total generation time). The division regression in GGML's bfloat16 handling is the only noteworthy concern for ML workloads, though its absolute impact remains small. Overall, performance is effectively unchanged, and the one-time startup cost of the new options is acceptable for the added functionality.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 3 times, most recently from 342c73d to 8c51734 Compare February 10, 2026 04:52