diff --git a/docs/vram_offloading.md b/docs/vram_offloading.md
new file mode 100644
index 000000000..dbfd91114
--- /dev/null
+++ b/docs/vram_offloading.md
@@ -0,0 +1,112 @@
+# VRAM Offloading
+
+Run models larger than your GPU memory by offloading weights to CPU RAM during generation.
+
+## Offload Modes
+
+Use `--offload-mode <mode>` to select the offloading strategy:
+
+| Mode | Description | VRAM Usage | Speed | Quality |
+|------|-------------|------------|-------|---------|
+| `none` | Everything stays on GPU (default) | Highest | Fastest | No penalty |
+| `cond_only` | Offload text encoder after conditioning | High | Near-full speed — only a brief reload between conditioning and diffusion | No penalty |
+| `cond_diffusion` | Offload both text encoder and diffusion model between stages | Medium | Slower — each model is reloaded to GPU when its stage runs | No penalty |
+| `aggressive` | Aggressively offload all components when not in use | Low | Slowest of the non-streaming modes — frequent CPU↔GPU transfers | No penalty |
+| `layer_streaming` | Stream transformer layers one-by-one through GPU | Lowest | Depends on model size (see below) | No penalty — coarse-stage and per-layer streaming both compute the same result as full-model execution |
+
+The `--offload-to-cpu` flag is a shortcut that picks a reasonable offload mode automatically.
+
+## Layer Streaming
+
+Layer streaming is the most memory-efficient mode. Instead of loading the entire diffusion model into VRAM, it loads one transformer block at a time.
+
+### How it works
+
+1. **Coarse-stage**: If the model fits in VRAM (e.g., quantized models), all layers are loaded at once and the full graph is executed normally. This is as fast as `--offload-mode none` with no quality penalty — the only overhead is the initial CPU→GPU weight transfer.
+2. **Per-layer streaming**: If the model doesn't fit (e.g., bf16 models on small GPUs), each transformer block is loaded, executed as a mini-graph, then offloaded back to CPU before the next block. This uses minimal VRAM but is significantly slower due to per-step CPU↔GPU transfers. Output quality is identical to full-model execution — the computation is mathematically equivalent, just split across separate graph evaluations.
+
+The mode is chosen automatically based on available VRAM.
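+In pseudocode, the choice reduces to a single budget check. The sketch below is condensed from the Anima runner's `compute_streaming` in this patch (`src/anima.hpp`); the other runners follow the same pattern, and the elided arguments are runner-specific:
+
+```
+auto analysis = analyze_vram_budget();
+if (analysis.fits_in_vram) {
+    // Coarse-stage: upload all layers once, then run the normal full graph.
+    load_all_layers_coarse();
+    compute(...);  // same speed and output as --offload-mode none
+} else {
+    // Per-layer: load block i, run it as a mini-graph, offload it, repeat.
+    compute_streaming_true(...);
+}
+```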
+ +### Supported architectures + +- Flux (double_blocks + single_blocks) +- ZImage / Z-Image-Turbo (context_refiner + noise_refiner + layers) +- MMDiT / SD3 (joint_blocks) +- UNet / SD1.x / SDXL (input_blocks + middle_block + output_blocks) +- Anima (blocks) +- WAN (blocks + vace_blocks) +- Qwen Image (transformer_blocks) + +### Examples + +#### ZImage-Turbo Q8 with layer streaming + +``` +sd-cli --diffusion-model z_image_turbo-Q8_0.gguf \ + --llm Qwen3-4b-Z-Engineer-V2.gguf \ + --vae ae.safetensors \ + -p "a cat" --cfg-scale 1.0 --diffusion-fa \ + -H 1024 -W 688 -s 42 \ + --offload-mode layer_streaming -v +``` + +The Q8 model (6.7 GB) fits in a 12 GB GPU, so coarse-stage streaming is used automatically: +``` +[INFO ] z_image model fits in VRAM, using coarse-stage streaming +[INFO ] z_image coarse-stage streaming completed in 1.66s +``` + +#### Flux-dev Q4 with layer streaming + +``` +sd-cli --diffusion-model flux1-dev-q4_0.gguf \ + --vae ae.safetensors \ + --clip_l clip_l.safetensors \ + --t5xxl t5xxl_fp16.safetensors \ + -p "a lovely cat" --cfg-scale 1.0 --sampling-method euler \ + --offload-mode layer_streaming -v +``` + +#### SD1.5 with aggressive offloading + +``` +sd-cli -m sd-v1-4.ckpt \ + -p "a photograph of an astronaut riding a horse" \ + --offload-mode aggressive -v +``` + +## Combining with other options + +- `--diffusion-fa`: Flash attention reduces VRAM further. Recommended with all offload modes. No quality penalty. +- `--clip-on-cpu`: Run CLIP text encoder on CPU. Saves VRAM but slows conditioning. No quality penalty. +- Quantized models (`q4_0`, `q8_0`, etc.) reduce model size, making coarse-stage streaming more likely (faster). **Quantization does reduce output quality** — lower bit depths produce softer details and may introduce artifacts. See [quantization](./quantization_and_gguf.md) for quality comparisons. `q8_0` is nearly indistinguishable from full precision; `q4_0` and below show visible degradation on fine details. + +## Quality impact summary + +| Technique | Quality Impact | +|-----------|---------------| +| `--offload-mode` (any mode) | **None** — offloading only changes where weights are stored, not the computation | +| `--diffusion-fa` (flash attention) | **None** — mathematically equivalent, just more memory-efficient | +| `--clip-on-cpu` | **None** — same computation on CPU instead of GPU | +| Quantization (`q8_0`) | **Negligible** — nearly identical to full precision | +| Quantization (`q4_0`, `q4_k`) | **Minor** — slight softening, fine details may differ | +| Quantization (`q3_k`, `q2_k`) | **Noticeable** — visible quality loss, best for previews or VRAM-constrained setups | + +## Troubleshooting + +- **OOM during generation**: Try a more aggressive mode. `layer_streaming` uses the least VRAM. +- **Slow generation**: Coarse-stage streaming (model fits in VRAM) is nearly as fast as no offloading. Per-layer streaming is slower due to CPU-GPU transfers each step. Using quantized models often lets you stay in coarse-stage mode. +- **Black or corrupted output**: This is a bug. Please report it with the model, offload mode, and resolution used. +- **One CPU core pegged at 100% while the GPU is working**: this is the CUDA driver spin-waiting on kernel completion. The default schedule policy (`cudaDeviceScheduleAuto`) often picks `Spin` for short-kernel workloads like per-layer streaming, which busy-waits one host thread for each kernel return. 
It does *not* slow generation down (the wait is wasted heat, not blocking work), but it looks bad on `top`/`nvtop` and is unfriendly to shared-host setups. Two ways to silence it:
+
+  1. Per-run, no rebuild needed:
+     ```
+     CUDA_DEVICE_SCHEDULE=BlockingSync sd-cli ...
+     ```
+  2. Per-process, set once at startup:
+     ```c
+     cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
+     ```
+     Long-lived processes (REST servers, queue workers) should do this.
+
+  CPU drops to near zero; GPU performance is unchanged.
diff --git a/examples/cli/main.cpp b/examples/cli/main.cpp
index 27513f475..392e9c404 100644
--- a/examples/cli/main.cpp
+++ b/examples/cli/main.cpp
@@ -698,7 +698,10 @@ int main(int argc, const char* argv[]) {
         vae_decode_only = false;
     }

-    sd_ctx_params_t sd_ctx_params = ctx_params.to_sd_ctx_params_t(vae_decode_only, true, cli_params.taesd_preview);
+    // For layer_streaming mode, we need smart offload logic instead of immediate freeing.
+    // This allows should_offload_cond_stage_for_diffusion() to be called and T5 to be offloaded before streaming.
+    bool free_params_immediately  = (ctx_params.offload_config.mode != SD_OFFLOAD_LAYER_STREAMING);
+    sd_ctx_params_t sd_ctx_params = ctx_params.to_sd_ctx_params_t(vae_decode_only, free_params_immediately, cli_params.taesd_preview);

     SDImageVec results;
     int num_results = 0;
diff --git a/examples/common/common.cpp b/examples/common/common.cpp
index d4c8a72b8..faa9eef6a 100644
--- a/examples/common/common.cpp
+++ b/examples/common/common.cpp
@@ -538,6 +538,78 @@ ArgOptions SDContextParams::get_options() {
         return 1;
     };

+    auto on_offload_mode_arg = [&](int argc, const char** argv, int index) {
+        if (++index >= argc) {
+            return -1;
+        }
+        const char* arg     = argv[index];
+        offload_config.mode = str_to_offload_mode(arg);
+        if (offload_config.mode == SD_OFFLOAD_MODE_COUNT) {
+            LOG_ERROR("error: invalid offload mode %s", arg);
+            return -1;
+        }
+        return 1;
+    };
+
+    auto on_vram_estimation_arg = [&](int argc, const char** argv, int index) {
+        if (++index >= argc) {
+            return -1;
+        }
+        const char* arg                = argv[index];
+        offload_config.vram_estimation = str_to_vram_estimation(arg);
+        if (offload_config.vram_estimation == SD_VRAM_EST_COUNT) {
+            LOG_ERROR("error: invalid VRAM estimation method %s", arg);
+            return -1;
+        }
+        return 1;
+    };
+
+    auto on_streaming_prefetch_arg = [&](int argc, const char** argv, int index) {
+        if (++index >= argc) {
+            return -1;
+        }
+        try {
+            offload_config.streaming_prefetch_layers = std::stoi(argv[index]);
+            if (offload_config.streaming_prefetch_layers < 0) {
+                LOG_ERROR("error: streaming prefetch must be >= 0");
+                return -1;
+            }
+        } catch (...) {
+            LOG_ERROR("error: invalid streaming prefetch value %s", argv[index]);
+            return -1;
+        }
+        return 1;
+    };
+
+    auto on_streaming_min_vram_arg = [&](int argc, const char** argv, int index) {
+        if (++index >= argc) {
+            return -1;
+        }
+        try {
+            int mb = std::stoi(argv[index]);
+            if (mb < 0) {
+                LOG_ERROR("error: streaming min VRAM must be >= 0");
+                return -1;
+            }
+            offload_config.streaming_min_free_vram = static_cast<size_t>(mb) * 1024 * 1024;
+        } catch (...)
{ + LOG_ERROR("error: invalid streaming min VRAM value %s", argv[index]); + return -1; + } + return 1; + }; + + options.bool_options.push_back({"", "--offload-log", "log offload events", true, &offload_config.log_offload_events}); + options.bool_options.push_back({"", "--no-offload-log", "do not log offload events", false, &offload_config.log_offload_events}); + options.bool_options.push_back({"", "--offload-cond-stage", "offload cond stage to CPU after use", true, &offload_config.offload_cond_stage}); + options.bool_options.push_back({"", "--no-offload-cond-stage", "do not offload cond stage", false, &offload_config.offload_cond_stage}); + options.bool_options.push_back({"", "--offload-diffusion", "offload diffusion model to CPU after use", true, &offload_config.offload_diffusion}); + options.bool_options.push_back({"", "--no-offload-diffusion", "do not offload diffusion model", false, &offload_config.offload_diffusion}); + options.bool_options.push_back({"", "--reload-cond-stage", "reload cond stage to GPU before use", true, &offload_config.reload_cond_stage}); + options.bool_options.push_back({"", "--no-reload-cond-stage", "do not reload cond stage", false, &offload_config.reload_cond_stage}); + options.bool_options.push_back({"", "--reload-diffusion", "reload diffusion to GPU before use", true, &offload_config.reload_diffusion}); + options.bool_options.push_back({"", "--no-reload-diffusion", "do not reload diffusion", false, &offload_config.reload_diffusion}); + options.manual_options = { {"", "--type", @@ -564,6 +636,24 @@ ArgOptions SDContextParams::get_options() { "but it usually offers faster inference speed and, in some cases, lower memory usage. " "The at_runtime mode, on the other hand, is exactly the opposite.", on_lora_apply_mode_arg}, + {"", + "--offload-mode", + "dynamic VRAM offloading mode, one of [none, cond_only, cond_diffusion, aggressive, layer_streaming] (default: none). " + "Use 'cond_only' to offload the LLM/CLIP model to CPU after conditioning. " + "Use 'layer_streaming' to stream model layers one-by-one (enables models larger than VRAM).", + on_offload_mode_arg}, + {"", + "--vram-estimation", + "VRAM estimation method for smart offloading, one of [dryrun, formula] (default: dryrun)", + on_vram_estimation_arg}, + {"", + "--streaming-prefetch", + "Number of layers to prefetch ahead during layer streaming (default: 1)", + on_streaming_prefetch_arg}, + {"", + "--streaming-min-vram", + "Minimum VRAM to keep free during layer streaming, in MB (default: 512)", + on_streaming_min_vram_arg}, }; return options; @@ -693,7 +783,14 @@ std::string SDContextParams::to_string() const { << " chroma_t5_mask_pad: " << chroma_t5_mask_pad << ",\n" << " prediction: " << sd_prediction_name(prediction) << ",\n" << " lora_apply_mode: " << sd_lora_apply_mode_name(lora_apply_mode) << ",\n" - << " force_sdxl_vae_conv_scale: " << (force_sdxl_vae_conv_scale ? "true" : "false") << "\n" + << " force_sdxl_vae_conv_scale: " << (force_sdxl_vae_conv_scale ? "true" : "false") << ",\n" + << " offload_config: { mode=" << sd_offload_mode_name(offload_config.mode) + << ", vram_est=" << sd_vram_estimation_name(offload_config.vram_estimation) + << ", offload_cond=" << (offload_config.offload_cond_stage ? "true" : "false") + << ", offload_diff=" << (offload_config.offload_diffusion ? "true" : "false") + << ", reload_cond=" << (offload_config.reload_cond_stage ? "true" : "false") + << ", reload_diff=" << (offload_config.reload_diffusion ? 
"true" : "false") + << ", log=" << (offload_config.log_offload_events ? "true" : "false") << " }\n" << "}"; return oss.str(); } @@ -751,6 +848,7 @@ sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool vae_decode_only, bool f chroma_t5_mask_pad, qwen_image_zero_cond_t, max_vram, + offload_config, }; return sd_ctx_params; } diff --git a/examples/common/common.h b/examples/common/common.h index f87293f3e..8aef9c92c 100644 --- a/examples/common/common.h +++ b/examples/common/common.h @@ -135,6 +135,12 @@ struct SDContextParams { bool force_sdxl_vae_conv_scale = false; float flow_shift = INFINITY; + + // Dynamic tensor offloading configuration + sd_offload_config_t offload_config = {SD_OFFLOAD_NONE, SD_VRAM_EST_DRYRUN, true, false, false, true, true, + 0, 2ULL * 1024 * 1024 * 1024, + false, 1, 0, 512ULL * 1024 * 1024}; + ArgOptions get_options(); void build_embedding_map(); bool resolve(SDMode mode); diff --git a/include/stable-diffusion.h b/include/stable-diffusion.h index c4c14949c..28c138c2f 100644 --- a/include/stable-diffusion.h +++ b/include/stable-diffusion.h @@ -147,6 +147,53 @@ enum lora_apply_mode_t { LORA_APPLY_MODE_COUNT, }; +// Component identifiers for dynamic tensor offloading +enum sd_component_t { + SD_COMPONENT_COND_STAGE, // LLM/CLIP text embedder + SD_COMPONENT_CLIP_VISION, // CLIP vision encoder (for SVD/Wan i2v) + SD_COMPONENT_DIFFUSION, // UNet/DiT/Flux diffusion model + SD_COMPONENT_VAE, // VAE encoder/decoder + SD_COMPONENT_CONTROL_NET, // ControlNet (if loaded) + SD_COMPONENT_PMID, // PhotoMaker ID encoder (if loaded) + SD_COMPONENT_COUNT +}; + +// Offload mode for automatic GPU memory management +enum sd_offload_mode_t { + SD_OFFLOAD_NONE, // Keep all components on GPU (default, fastest) + SD_OFFLOAD_COND_ONLY, // Offload only conditioning (LLM/CLIP) after use + SD_OFFLOAD_COND_DIFFUSION, // Offload conditioning + diffusion, keep VAE + SD_OFFLOAD_AGGRESSIVE, // Offload each component after use (saves most VRAM) + SD_OFFLOAD_LAYER_STREAMING, // Stream layers one-by-one (enables models larger than VRAM) + SD_OFFLOAD_MODE_COUNT +}; + +// VRAM estimation method for smart offloading decisions +enum sd_vram_estimation_t { + SD_VRAM_EST_DRYRUN, // Dry-run graph allocation for exact size (default, accurate) + SD_VRAM_EST_FORMULA, // Formula-based estimation (faster, approximate) + SD_VRAM_EST_COUNT +}; + +// Offload configuration for fine-grained control +typedef struct { + enum sd_offload_mode_t mode; // Offload mode + enum sd_vram_estimation_t vram_estimation; // VRAM estimation method + bool offload_cond_stage; // Offload LLM/CLIP after conditioning + bool offload_diffusion; // Offload diffusion model after sampling + bool reload_cond_stage; // Reload LLM/CLIP for next generation + bool reload_diffusion; // Reload diffusion model for next generation + bool log_offload_events; // Log offload/reload events + size_t min_offload_size; // Minimum component size to offload (bytes), 0 = no minimum + size_t target_free_vram; // Target free VRAM before VAE decode (bytes), 0 = always offload when mode is set + + // Layer streaming configuration (for SD_OFFLOAD_LAYER_STREAMING mode) + bool layer_streaming_enabled; // Enable layer-by-layer streaming execution + int streaming_prefetch_layers; // Number of layers to prefetch ahead (default: 1) + int streaming_keep_layers_behind; // Layers to keep after execution (for skip connections) + size_t streaming_min_free_vram; // Minimum VRAM to keep free during streaming (bytes) +} sd_offload_config_t; + typedef struct { bool enabled; int 
tile_size_x; @@ -203,7 +250,8 @@ typedef struct { bool chroma_use_t5_mask; int chroma_t5_mask_pad; bool qwen_image_zero_cond_t; - float max_vram; + float max_vram; // GiB budget for graph-cut segmented param offload (0 = disabled) + sd_offload_config_t offload_config; // Cross-stage and layer-streaming offload configuration } sd_ctx_params_t; typedef struct { @@ -393,6 +441,11 @@ SD_API const char* sd_preview_name(enum preview_t preview); SD_API enum preview_t str_to_preview(const char* str); SD_API const char* sd_lora_apply_mode_name(enum lora_apply_mode_t mode); SD_API enum lora_apply_mode_t str_to_lora_apply_mode(const char* str); +SD_API const char* sd_offload_mode_name(enum sd_offload_mode_t mode); +SD_API enum sd_offload_mode_t str_to_offload_mode(const char* str); +SD_API const char* sd_vram_estimation_name(enum sd_vram_estimation_t method); +SD_API enum sd_vram_estimation_t str_to_vram_estimation(const char* str); +SD_API void sd_offload_config_init(sd_offload_config_t* config); SD_API const char* sd_hires_upscaler_name(enum sd_hires_upscaler_t upscaler); SD_API enum sd_hires_upscaler_t str_to_sd_hires_upscaler(const char* str); @@ -411,6 +464,9 @@ SD_API char* sd_sample_params_to_str(const sd_sample_params_t* sample_params); SD_API enum sample_method_t sd_get_default_sample_method(const sd_ctx_t* sd_ctx); SD_API enum scheduler_t sd_get_default_scheduler(const sd_ctx_t* sd_ctx, enum sample_method_t sample_method); +// Get the model architecture/version name (e.g., "SD 1.x", "SDXL", "Flux", "Z-Image", etc.) +SD_API const char* sd_get_model_version_name(const sd_ctx_t* sd_ctx); + SD_API void sd_img_gen_params_init(sd_img_gen_params_t* sd_img_gen_params); SD_API char* sd_img_gen_params_to_str(const sd_img_gen_params_t* sd_img_gen_params); SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* sd_img_gen_params); @@ -450,6 +506,34 @@ SD_API bool preprocess_canny(sd_image_t image, SD_API const char* sd_commit(void); SD_API const char* sd_version(void); +// Dynamic tensor offloading API +// These functions allow runtime GPU memory management by moving model components +// between CPU and GPU. This enables running larger models on limited VRAM by +// keeping only the currently-active component on GPU. 
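+//
+// Illustrative caller sketch (hypothetical sequence; uses only the functions
+// declared below, error handling elided):
+//   sd_offload_to_cpu(ctx, SD_COMPONENT_COND_STAGE);   // after conditioning
+//   /* ... diffusion + VAE decode ... */
+//   sd_reload_to_gpu(ctx, SD_COMPONENT_COND_STAGE);    // before the next prompt
+//   sd_free_gpu_resources(ctx);                        // before unloading the model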
+ +// Offload component from GPU to CPU (frees GPU memory) +// Returns true on success, false if component doesn't exist or is already on CPU +SD_API bool sd_offload_to_cpu(sd_ctx_t* sd_ctx, enum sd_component_t component); + +// Reload component from CPU to GPU (allocates GPU memory) +// Returns true on success, false if component doesn't exist or allocation failed +SD_API bool sd_reload_to_gpu(sd_ctx_t* sd_ctx, enum sd_component_t component); + +// Query whether component is currently on GPU +// Returns true if on GPU, false if on CPU or component doesn't exist +SD_API bool sd_is_on_gpu(sd_ctx_t* sd_ctx, enum sd_component_t component); + +// Get component's current memory usage in bytes +// Returns the buffer size if component exists, 0 otherwise +SD_API size_t sd_get_component_vram(sd_ctx_t* sd_ctx, enum sd_component_t component); + +// Get human-readable name for a component +SD_API const char* sd_component_name(enum sd_component_t component); + +// Free all GPU resources (offload all components to CPU and clear LoRAs) +// Call this before unloading a model to ensure GPU memory is released +SD_API void sd_free_gpu_resources(sd_ctx_t* sd_ctx); + #ifdef __cplusplus } #endif diff --git a/src/anima.hpp b/src/anima.hpp index 4bfc04749..7da40fbf8 100644 --- a/src/anima.hpp +++ b/src/anima.hpp @@ -8,6 +8,7 @@ #include "common_block.hpp" #include "flux.hpp" +#include "layer_streaming.hpp" #include "rope.hpp" namespace Anima { @@ -516,6 +517,87 @@ namespace Anima { return x; } + + struct StreamingInputResult { + ggml_tensor* x; // [N, h*w, hidden_size] + ggml_tensor* encoder_hidden_states; // [N, 512, hidden_size] + ggml_tensor* embedded_timestep; // [N, hidden_size] + ggml_tensor* temb; // [N, hidden_size * 3] + }; + + StreamingInputResult forward_input_stage(GGMLRunnerContext* ctx, + struct ggml_tensor* x, + struct ggml_tensor* timestep, + struct ggml_tensor* encoder_hidden_states, + struct ggml_tensor* t5_ids, + struct ggml_tensor* t5_weights, + struct ggml_tensor* adapter_q_pe, + struct ggml_tensor* adapter_k_pe, + int64_t H, int64_t W) { + auto x_embedder = std::dynamic_pointer_cast(blocks["x_embedder"]); + auto t_embedder = std::dynamic_pointer_cast(blocks["t_embedder"]); + auto t_embedding_norm = std::dynamic_pointer_cast(blocks["t_embedding_norm"]); + auto llm_adapter = std::dynamic_pointer_cast(blocks["llm_adapter"]); + + // Add padding mask and patchify + auto padding_mask = ggml_ext_zeros(ctx->ggml_ctx, x->ne[0], x->ne[1], 1, x->ne[3]); + x = ggml_concat(ctx->ggml_ctx, x, padding_mask, 2); // [N, C + 1, H, W] + x = DiT::pad_and_patchify(ctx, x, patch_size, patch_size); // [N, h*w, (C+1)*ph*pw] + x = x_embedder->forward(ctx, x); + + // Timestep embedding + auto timestep_proj = ggml_ext_timestep_embedding(ctx->ggml_ctx, timestep, static_cast(hidden_size)); + auto temb = t_embedder->forward(ctx, timestep_proj); + auto embedded_timestep = t_embedding_norm->forward(ctx, timestep_proj); + + // LLM adapter (if T5 is used) + if (t5_ids != nullptr) { + auto adapted_context = llm_adapter->forward(ctx, encoder_hidden_states, t5_ids, adapter_q_pe, adapter_k_pe); + if (t5_weights != nullptr) { + auto w = t5_weights; + if (ggml_n_dims(w) == 1) { + w = ggml_reshape_3d(ctx->ggml_ctx, w, 1, w->ne[0], 1); + } + w = ggml_repeat_4d(ctx->ggml_ctx, w, adapted_context->ne[0], adapted_context->ne[1], adapted_context->ne[2], 1); + adapted_context = ggml_mul(ctx->ggml_ctx, adapted_context, w); + } + if (adapted_context->ne[1] < 512) { + auto pad_ctx = ggml_ext_zeros(ctx->ggml_ctx, + adapted_context->ne[0], + 
512 - adapted_context->ne[1], + adapted_context->ne[2], + 1); + adapted_context = ggml_concat(ctx->ggml_ctx, adapted_context, pad_ctx, 1); + } else if (adapted_context->ne[1] > 512) { + adapted_context = ggml_ext_slice(ctx->ggml_ctx, adapted_context, 1, 0, 512); + } + encoder_hidden_states = adapted_context; + } + + return {x, encoder_hidden_states, embedded_timestep, temb}; + } + + ggml_tensor* forward_block(GGMLRunnerContext* ctx, + int block_idx, + struct ggml_tensor* x, + struct ggml_tensor* encoder_hidden_states, + struct ggml_tensor* embedded_timestep, + struct ggml_tensor* temb, + struct ggml_tensor* image_pe) { + auto block = std::dynamic_pointer_cast(blocks["blocks." + std::to_string(block_idx)]); + return block->forward(ctx, x, encoder_hidden_states, embedded_timestep, temb, image_pe); + } + + ggml_tensor* forward_output_stage(GGMLRunnerContext* ctx, + struct ggml_tensor* x, + struct ggml_tensor* embedded_timestep, + struct ggml_tensor* temb) { + auto final_layer = std::dynamic_pointer_cast(blocks["final_layer"]); + return final_layer->forward(ctx, x, embedded_timestep, temb); // [N, h*w, ph*pw*C] + } + + int64_t get_num_layers() const { return num_layers; } + int get_patch_size() const { return patch_size; } }; struct AnimaRunner : public GGMLRunner { @@ -524,6 +606,13 @@ namespace Anima { std::vector adapter_q_pe_vec; std::vector adapter_k_pe_vec; AnimaNet net; + int64_t num_layers_ = 28; // Store for streaming + + // Static layer cache decided on the first sampling step. -1 = not yet + // computed; 0..N = number of "blocks.X" kept resident across steps. + int resident_blocks_ = -1; + + public: AnimaRunner(ggml_backend_t backend, bool offload_params_to_cpu, @@ -549,6 +638,7 @@ namespace Anima { if (num_layers <= 0) { num_layers = 28; } + num_layers_ = num_layers; // Store for streaming LOG_INFO("anima net layers: %" PRId64, num_layers); net = AnimaNet(num_layers); @@ -672,6 +762,79 @@ namespace Anima { return gf; } + // Raw tensor build_graph used by streaming infrastructure + ggml_cgraph* build_graph(ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* t5_ids = nullptr, + ggml_tensor* t5_weights = nullptr) { + GGML_ASSERT(x->ne[3] == 1); + ggml_cgraph* gf = new_graph_custom(ANIMA_GRAPH_SIZE); + + x = to_backend(x); + timesteps = to_backend(timesteps); + context = to_backend(context); + t5_ids = to_backend(t5_ids); + t5_weights = to_backend(t5_weights); + + int64_t pad_h = (net.patch_size - x->ne[1] % net.patch_size) % net.patch_size; + int64_t pad_w = (net.patch_size - x->ne[0] % net.patch_size) % net.patch_size; + int64_t h_pad = x->ne[1] + pad_h; + int64_t w_pad = x->ne[0] + pad_w; + + image_pe_vec = gen_anima_image_pe_vec(1, + static_cast(h_pad), + static_cast(w_pad), + static_cast(net.patch_size), + net.theta, + net.axes_dim, + 4.0f, 4.0f, 1.0f); + int64_t image_pos_len = static_cast(image_pe_vec.size()) / (2 * 2 * (net.head_dim / 2)); + auto image_pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, net.head_dim / 2, image_pos_len); + set_backend_tensor_data(image_pe, image_pe_vec.data()); + + ggml_tensor* adapter_q_pe = nullptr; + ggml_tensor* adapter_k_pe = nullptr; + if (t5_ids != nullptr) { + int64_t target_len = t5_ids->ne[0]; + int64_t source_len = context->ne[1]; + + adapter_q_pe_vec = gen_1d_rope_pe_vec(target_len, 64, 10000.f); + adapter_k_pe_vec = gen_1d_rope_pe_vec(source_len, 64, 10000.f); + + int64_t target_pos_len = static_cast(adapter_q_pe_vec.size()) / (2 * 2 * 32); + int64_t source_pos_len = 
static_cast(adapter_k_pe_vec.size()) / (2 * 2 * 32); + + adapter_q_pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, 32, target_pos_len); + adapter_k_pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, 32, source_pos_len); + set_backend_tensor_data(adapter_q_pe, adapter_q_pe_vec.data()); + set_backend_tensor_data(adapter_k_pe, adapter_k_pe_vec.data()); + } + + auto runner_ctx = get_context(); + auto out = net.forward(&runner_ctx, x, timesteps, context, image_pe, + t5_ids, t5_weights, adapter_q_pe, adapter_k_pe); + ggml_build_forward_expand(gf, out); + return gf; + } + + // Raw tensor compute used by streaming infrastructure + bool compute(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* t5_ids = nullptr, + ggml_tensor* t5_weights = nullptr, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr, + bool skip_param_offload = false) { + auto get_graph = [&]() -> ggml_cgraph* { + return build_graph(x, timesteps, context, t5_ids, t5_weights); + }; + return GGMLRunner::compute(get_graph, n_threads, false, output, output_ctx, skip_param_offload); + } + + // Upstream sd::Tensor compute interface sd::Tensor compute(int n_threads, const sd::Tensor& x, const sd::Tensor& timesteps, @@ -683,6 +846,366 @@ namespace Anima { }; return restore_trailing_singleton_dims(GGMLRunner::compute(get_graph, n_threads, false), x.dim()); } + + void enable_layer_streaming(const LayerStreaming::StreamingConfig& config = {}) { + std::map tensor_map; + net.get_param_tensors(tensor_map, "model.diffusion_model.net"); + init_streaming(config, tensor_map, LayerStreaming::anima_layer_pattern); + LOG_INFO("%s layer streaming enabled with %zu layers", + get_desc().c_str(), streaming_engine_->get_registry().get_layer_count()); + } + + bool compute_streaming(int n_threads, + struct ggml_tensor* x, + struct ggml_tensor* timesteps, + struct ggml_tensor* context, + struct ggml_tensor* t5_ids = nullptr, + struct ggml_tensor* t5_weights = nullptr, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) { + if (!is_streaming_enabled()) { + LOG_ERROR("%s streaming not enabled", get_desc().c_str()); + return false; + } + + int64_t t0 = ggml_time_ms(); + auto analysis = analyze_vram_budget(); + + if (analysis.fits_in_vram) { + LOG_INFO("%s model fits in VRAM, using coarse-stage streaming", get_desc().c_str()); + load_all_layers_coarse(); + bool result = compute(n_threads, x, timesteps, context, t5_ids, t5_weights, + output, output_ctx, true); + int64_t t1 = ggml_time_ms(); + LOG_INFO("%s coarse-stage streaming completed in %.2fs", get_desc().c_str(), (t1 - t0) / 1000.0); + free_compute_buffer(); + return result; + } + + LOG_INFO("%s remaining %.2f GB exceeds available %.2f GB, using per-layer streaming", + get_desc().c_str(), + analysis.remaining_to_load / (1024.0 * 1024.0 * 1024.0), + analysis.available_vram / (1024.0 * 1024.0 * 1024.0)); + + return compute_streaming_true(n_threads, x, timesteps, context, t5_ids, t5_weights, output, output_ctx); + } + + bool compute_streaming_true(int n_threads, + struct ggml_tensor* x, + struct ggml_tensor* timesteps, + struct ggml_tensor* context, + struct ggml_tensor* t5_ids = nullptr, + struct ggml_tensor* t5_weights = nullptr, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) { + auto& registry = streaming_engine_->get_registry(); + int64_t t_start = ggml_time_ms(); + + const int64_t num_blocks = net.get_num_layers(); + const int patch_size = 
net.get_patch_size(); + const int64_t W = x->ne[0]; + const int64_t H = x->ne[1]; + + LOG_INFO("TRUE per-layer streaming - %lld blocks", num_blocks); + + // Load global layers + LOG_DEBUG("Loading global layers"); + if (!registry.move_layer_to_gpu("_global")) { + LOG_ERROR("Failed to load _global to GPU"); + return false; + } + + // Prepare PE tensors + int64_t pad_h = (patch_size - H % patch_size) % patch_size; + int64_t pad_w = (patch_size - W % patch_size) % patch_size; + int64_t h_pad = H + pad_h; + int64_t w_pad = W + pad_w; + image_pe_vec = gen_anima_image_pe_vec(1, + static_cast(h_pad), + static_cast(w_pad), + patch_size, + net.theta, + net.axes_dim, + 4.0f, // h_extrapolation_ratio + 4.0f, // w_extrapolation_ratio + 1.0f); // t_extrapolation_ratio + + // Persistent storage. Backed by a single GPU-pinned host buffer + // (ensure_pinned_act_buffers) so per-block ggml_backend_tensor_get + // / set_backend_tensor_data run at full PCIe bandwidth. context + // is optional in some Anima variants. + std::vector persistent_x_fallback; + std::vector persistent_context_fallback; + std::vector persistent_embedded_ts_fallback; + std::vector persistent_temb_fallback; + float* persistent_x = nullptr; + float* persistent_context = nullptr; + float* persistent_embedded_ts = nullptr; + float* persistent_temb = nullptr; + size_t persistent_x_count = 0; + size_t persistent_context_count = 0; + size_t persistent_embedded_ts_count = 0; + size_t persistent_temb_count = 0; + int64_t x_ne[4], context_ne[4], embedded_ts_ne[4], temb_ne[4]; + + LOG_DEBUG("Executing input stage"); + { + ggml_tensor* x_output = nullptr; + ggml_tensor* context_output = nullptr; + ggml_tensor* embedded_ts_output = nullptr; + ggml_tensor* temb_output = nullptr; + + auto get_input_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(ANIMA_GRAPH_SIZE / 4); + auto runner_ctx = get_context(); + + ggml_tensor* x_backend = to_backend(x); + ggml_tensor* timesteps_backend = to_backend(timesteps); + ggml_tensor* context_backend = context ? to_backend(context) : nullptr; + ggml_tensor* t5_ids_backend = t5_ids ? to_backend(t5_ids) : nullptr; + ggml_tensor* t5_weights_backend = t5_weights ? 
to_backend(t5_weights) : nullptr; + + // Adapter PE (if needed) + ggml_tensor* adapter_q_pe_t = nullptr; + ggml_tensor* adapter_k_pe_t = nullptr; + if (t5_ids != nullptr && !adapter_q_pe_vec.empty()) { + adapter_q_pe_t = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, 64, 512); + adapter_k_pe_t = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, 64, 512); + set_backend_tensor_data(adapter_q_pe_t, adapter_q_pe_vec.data()); + set_backend_tensor_data(adapter_k_pe_t, adapter_k_pe_vec.data()); + } + + auto result = net.forward_input_stage(&runner_ctx, x_backend, timesteps_backend, + context_backend, t5_ids_backend, t5_weights_backend, + adapter_q_pe_t, adapter_k_pe_t, H, W); + + x_output = result.x; + context_output = result.encoder_hidden_states; + embedded_ts_output = result.embedded_timestep; + temb_output = result.temb; + + ggml_build_forward_expand(gf, x_output); + if (context_output) ggml_build_forward_expand(gf, context_output); + ggml_build_forward_expand(gf, embedded_ts_output); + ggml_build_forward_expand(gf, temb_output); + + return gf; + }; + + // Don't free compute buffer immediately - we need to read outputs first + if (!GGMLRunner::compute(get_input_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Input stage failed"); + return false; + } + + // Extract to persistent storage + if (x_output && embedded_ts_output && temb_output) { + size_t x_size = ggml_nelements(x_output); + size_t embedded_ts_size = ggml_nelements(embedded_ts_output); + size_t temb_size = ggml_nelements(temb_output); + size_t context_size = context_output ? ggml_nelements(context_output) : 0; + + persistent_x_count = x_size; + persistent_embedded_ts_count = embedded_ts_size; + persistent_temb_count = temb_size; + persistent_context_count = context_size; + + std::vector ptrs; + if (ensure_pinned_act_buffers({x_size * sizeof(float), + embedded_ts_size * sizeof(float), + temb_size * sizeof(float), + context_size * sizeof(float)}, ptrs)) { + persistent_x = ptrs[0]; + persistent_embedded_ts = ptrs[1]; + persistent_temb = ptrs[2]; + persistent_context = context_size ? ptrs[3] : nullptr; + } else { + persistent_x_fallback.resize(x_size); + persistent_embedded_ts_fallback.resize(embedded_ts_size); + persistent_temb_fallback.resize(temb_size); + persistent_x = persistent_x_fallback.data(); + persistent_embedded_ts = persistent_embedded_ts_fallback.data(); + persistent_temb = persistent_temb_fallback.data(); + if (context_size) { + persistent_context_fallback.resize(context_size); + persistent_context = persistent_context_fallback.data(); + } + } + + ggml_backend_tensor_get(x_output, persistent_x, 0, x_size * sizeof(float)); + ggml_backend_tensor_get(embedded_ts_output, persistent_embedded_ts, 0, embedded_ts_size * sizeof(float)); + ggml_backend_tensor_get(temb_output, persistent_temb, 0, temb_size * sizeof(float)); + + for (int i = 0; i < 4; i++) { + x_ne[i] = x_output->ne[i]; + embedded_ts_ne[i] = embedded_ts_output->ne[i]; + temb_ne[i] = temb_output->ne[i]; + } + + if (context_output) { + ggml_backend_tensor_get(context_output, persistent_context, 0, context_size * sizeof(float)); + for (int i = 0; i < 4; i++) { + context_ne[i] = context_output->ne[i]; + } + } + } else { + LOG_ERROR("Failed to get input stage outputs"); + free_compute_buffer(); + return false; + } + + // Now safe to free compute buffer + free_compute_buffer(); + } + + LOG_DEBUG("Input stage done, x=%ldx%ldx%ld", x_ne[0], x_ne[1], x_ne[2]); + + auto block_name_at = [](int i) { return "blocks." 
+ std::to_string(i); }; + + if (resident_blocks_ < 0 && streaming_engine_) { + resident_blocks_ = streaming_engine_->compute_resident_block_count( + "blocks.0", static_cast(num_blocks)); + LOG_INFO("%s blocks cache: %d resident, %d streamed per step", + get_desc().c_str(), + resident_blocks_, + static_cast(num_blocks) - resident_blocks_); + } + + int prefetch_start = 0; + while (prefetch_start < static_cast(num_blocks) && + registry.is_layer_on_gpu(block_name_at(prefetch_start))) { + prefetch_start++; + } + if (streaming_engine_) { + streaming_engine_->prime_prefetch(block_name_at, prefetch_start, static_cast(num_blocks)); + } + + for (int64_t block_idx = 0; block_idx < num_blocks; block_idx++) { + std::string block_name = block_name_at(static_cast(block_idx)); + int64_t t_block_start = ggml_time_ms(); + + // Wait for this block's prefetch to complete (if async prefetch was started) + if (streaming_engine_) { + streaming_engine_->wait_for_prefetch(block_name); + } + + // Load this block's weights (sync load if prefetch didn't happen) + if (!registry.move_layer_to_gpu(block_name)) { + LOG_ERROR("Failed to load %s", block_name.c_str()); + return false; + } + + // Keep the prefetch window full + if (streaming_engine_) { + streaming_engine_->advance_prefetch(block_name_at, static_cast(block_idx), + static_cast(num_blocks)); + } + + ggml_tensor* x_out = nullptr; + + auto get_block_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(ANIMA_GRAPH_SIZE / 4); + + // Create input tensors from persistent storage + ggml_tensor* x_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, x_ne[0], x_ne[1], x_ne[2], x_ne[3]); + ggml_tensor* embedded_ts_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, embedded_ts_ne[0], embedded_ts_ne[1], embedded_ts_ne[2], embedded_ts_ne[3]); + ggml_tensor* temb_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, temb_ne[0], temb_ne[1], temb_ne[2], temb_ne[3]); + + x_in = to_backend(x_in); + embedded_ts_in = to_backend(embedded_ts_in); + temb_in = to_backend(temb_in); + + set_backend_tensor_data(x_in, persistent_x); + set_backend_tensor_data(embedded_ts_in, persistent_embedded_ts); + set_backend_tensor_data(temb_in, persistent_temb); + + ggml_tensor* context_in = nullptr; + if (persistent_context_count > 0) { + context_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, context_ne[0], context_ne[1], context_ne[2], context_ne[3]); + context_in = to_backend(context_in); + set_backend_tensor_data(context_in, persistent_context); + } + + // Image PE tensor (shape matches [2, 2, head_dim/2, pos_len]) + int64_t image_pos_len = static_cast(image_pe_vec.size()) / (2 * 2 * (net.head_dim / 2)); + ggml_tensor* image_pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, net.head_dim / 2, image_pos_len); + set_backend_tensor_data(image_pe, image_pe_vec.data()); + + auto runner_ctx = get_context(); + x_out = net.forward_block(&runner_ctx, static_cast(block_idx), x_in, context_in, + embedded_ts_in, temb_in, image_pe); + + ggml_build_forward_expand(gf, x_out); + + return gf; + }; + + // Don't free compute buffer immediately - we need to read outputs first + if (!GGMLRunner::compute(get_block_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Block %lld execution failed", block_idx); + return false; + } + + // Extract output to persistent storage + if (x_out) { + ggml_backend_tensor_get(x_out, persistent_x, 0, persistent_x_count * sizeof(float)); + for (int i = 0; i < 4; i++) { + x_ne[i] = x_out->ne[i]; + } + } + + // Now safe to free compute 
buffer
+            free_compute_buffer();
+
+            // Resident blocks stay on GPU across sampling steps.
+            if (static_cast<int>(block_idx) >= resident_blocks_) {
+                registry.move_layer_to_cpu(block_name);
+            }
+
+            LOG_DEBUG("Block %lld/%lld done (%.2fms)",
+                      block_idx + 1, num_blocks, (ggml_time_ms() - t_block_start) / 1.0);
+        }
+
+        LOG_DEBUG("Executing output stage");
+        {
+            auto get_output_graph = [&]() -> struct ggml_cgraph* {
+                struct ggml_cgraph* gf = new_graph_custom(ANIMA_GRAPH_SIZE / 4);
+
+                ggml_tensor* x_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, x_ne[0], x_ne[1], x_ne[2], x_ne[3]);
+                ggml_tensor* embedded_ts_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, embedded_ts_ne[0], embedded_ts_ne[1], embedded_ts_ne[2], embedded_ts_ne[3]);
+                ggml_tensor* temb_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, temb_ne[0], temb_ne[1], temb_ne[2], temb_ne[3]);
+
+                x_in = to_backend(x_in);
+                embedded_ts_in = to_backend(embedded_ts_in);
+                temb_in = to_backend(temb_in);
+
+                set_backend_tensor_data(x_in, persistent_x);
+                set_backend_tensor_data(embedded_ts_in, persistent_embedded_ts);
+                set_backend_tensor_data(temb_in, persistent_temb);
+
+                auto runner_ctx = get_context();
+                auto final_out  = net.forward_output_stage(&runner_ctx, x_in, embedded_ts_in, temb_in);
+
+                // Unpatchify
+                final_out = DiT::unpatchify_and_crop(compute_ctx, final_out, H, W, patch_size, patch_size, false);
+
+                ggml_build_forward_expand(gf, final_out);
+
+                return gf;
+            };
+
+            if (!GGMLRunner::compute(get_output_graph, n_threads, true, output, output_ctx, true)) {
+                LOG_ERROR("Output stage failed");
+                return false;
+            }
+        }
+
+        int64_t t_end = ggml_time_ms();
+        LOG_INFO("TRUE per-layer streaming completed in %.2fs (%lld blocks)",
+                 (t_end - t_start) / 1000.0, num_blocks);
+
+        return true;
+    }
 };
 } // namespace Anima
diff --git a/src/chunk_graph.hpp b/src/chunk_graph.hpp
new file mode 100644
index 000000000..0ee676930
--- /dev/null
+++ b/src/chunk_graph.hpp
@@ -0,0 +1,232 @@
+#ifndef __CHUNK_GRAPH_HPP__
+#define __CHUNK_GRAPH_HPP__
+
+#include <array>
+#include <cstdint>
+#include <functional>
+#include <string>
+#include <vector>
+
+#include "ggml-alloc.h"
+#include "ggml-backend.h"
+#include "ggml.h"
+
+#include "util.h"
+
+namespace LayerStreaming {
+
+// Shared helper that compiles K consecutive transformer layers into a single
+// ggml graph and dispatches them as one ggml_backend_graph_compute call,
+// instead of one tiny graph per layer. Reusable across runners (z_image,
+// flux, mmdit, anima, qwen_image, ...).
+//
+// Cached state (ggml_context, gallocr, cgraph) survives across compute() calls
+// on the runner's main compute_ctx. Inputs are shape-bound, so the graph is
+// rebuilt whenever shape / layer count changes (e.g. between two queue jobs
+// with different prompt lengths).
+class ChunkGraph {
+public:
+    using BuildFn = std::function<ggml_tensor*(ggml_context* ctx,
+                                               const std::vector<ggml_tensor*>& inputs,
+                                               int K)>;
+
+    ChunkGraph() = default;
+    ~ChunkGraph() { clear(); }
+    ChunkGraph(const ChunkGraph&) = delete;
+    ChunkGraph& operator=(const ChunkGraph&) = delete;
+
+    // Build (or keep cached) a graph for K layers with the given input shapes.
+    // The cached graph is reused only if K, every input shape, AND the
+    // caller-supplied state_token match the last build; otherwise the old
+    // graph is freed and a fresh one is built.
+    //
+    // state_token: caller-computed fingerprint of any external state that the
+    // graph captures by reference and can become stale (e.g. weight_adapter
+    // pointer when LoRAs change, or runner flag bits like flash_attn). If two
+    // builds would topologically differ, give them different tokens.
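+    //
+    // A hypothetical fingerprint (names illustrative, not part of this patch):
+    //   uint64_t tok = (uint64_t)(uintptr_t)weight_adapter  // LoRA identity
+    //                ^ (flash_attn ? 1ull : 0ull);          // runner flag bits
+    // Any stable mixing of the stale-able state works.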
+    //
+    // build_fn receives the freshly created input tensors (one per entry of
+    // input_shapes, in the same order) and must wire them through K layers,
+    // returning the output tensor. The output is automatically marked as a
+    // graph output.
+    //
+    // Returns false on allocator / context failure; on success the graph is
+    // ready to dispatch.
+    bool ensure_built(ggml_backend_t backend,
+                      int K,
+                      const std::vector<std::array<int64_t, 4>>& input_shapes,
+                      ggml_type input_type,
+                      uint64_t state_token,
+                      BuildFn build_fn,
+                      size_t graph_node_capacity,
+                      const std::string& desc_tag) {
+        if (gf_ != nullptr
+            && layer_count_ == K
+            && state_token_ == state_token
+            && shapes_match(input_shapes)) {
+            return true;
+        }
+        clear();
+
+        // 16 MB headroom for op metadata is plenty for typical K (~30 layers).
+        size_t ctx_size = 16 * 1024 * 1024;
+        ctx_ = ggml_init({ctx_size, nullptr, true});
+        if (ctx_ == nullptr) {
+            LOG_ERROR("%s chunk_ctx alloc failed", desc_tag.c_str());
+            return false;
+        }
+
+        gf_ = ggml_new_graph_custom(ctx_, graph_node_capacity, false);
+
+        inputs_.clear();
+        inputs_.reserve(input_shapes.size());
+        for (const auto& shape : input_shapes) {
+            ggml_tensor* t = ggml_new_tensor_4d(ctx_, input_type,
+                                                shape[0], shape[1], shape[2], shape[3]);
+            ggml_set_input(t);
+            inputs_.push_back(t);
+        }
+
+        // Mirror GGMLRunner::prepare_build_in_tensor_before(): create the
+        // named build-in scalar tensors on the chunk context so anything in
+        // build_fn that uses ggml_ext_full / ggml_ext_zeros / ggml_ext_ones /
+        // ggml_ext_cast_f32 (all of which look these up by name via
+        // ggml_get_tensor) finds them. Without this they're null in our
+        // standalone context and the next op SEGVs — surfaces in attention's
+        // KV-pad mask creation when token sequences are short.
+        // ggml_set_input is required: without it the gallocr treats these as
+        // regular scratch nodes and may reuse their buffer slot for op
+        // intermediates, overwriting our uploaded scalar values before compute
+        // reads them. (GGMLRunner avoids this by registering them via
+        // set_backend_tensor_data, which keeps the data outside the allocator.)
+        one_tensor_ = ggml_new_tensor_1d(ctx_, GGML_TYPE_F32, 1);
+        ggml_set_name(one_tensor_, "ggml_runner_build_in_tensor:one");
+        ggml_set_input(one_tensor_);
+        zero_int_tensor_ = ggml_new_tensor_1d(ctx_, GGML_TYPE_I32, 1);
+        ggml_set_name(zero_int_tensor_, "ggml_runner_build_in_tensor:zero_int");
+        ggml_set_input(zero_int_tensor_);
+
+        out_ = build_fn(ctx_, inputs_, K);
+        if (out_ == nullptr) {
+            LOG_ERROR("%s chunk build_fn returned null", desc_tag.c_str());
+            clear();
+            return false;
+        }
+        ggml_set_output(out_);
+        ggml_build_forward_expand(gf_, one_tensor_);
+        ggml_build_forward_expand(gf_, zero_int_tensor_);
+        ggml_build_forward_expand(gf_, out_);
+
+        allocr_ = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
+        if (allocr_ == nullptr) {
+            LOG_ERROR("%s chunk gallocr_new failed", desc_tag.c_str());
+            clear();
+            return false;
+        }
+        if (!ggml_gallocr_reserve(allocr_, gf_)) {
+            LOG_ERROR("%s chunk gallocr_reserve failed", desc_tag.c_str());
+            clear();
+            return false;
+        }
+        size_t buf_size = ggml_gallocr_get_buffer_size(allocr_, 0);
+        LOG_INFO("%s chunk graph: %d layers, compute buffer = %.2f MB",
+                 desc_tag.c_str(), K, buf_size / (1024.0 * 1024.0));
+
+        layer_count_   = K;
+        cached_shapes_ = input_shapes;
+        state_token_   = state_token;
+        return true;
+    }
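+
+    // Illustrative caller sequence (buffer names are hypothetical; the two
+    // entry points below are the real API):
+    //   LayerStreaming::ChunkGraph cg;
+    //   cg.ensure_built(backend, /*K=*/8, {{c, n, 1, 1}}, GGML_TYPE_F32,
+    //                   state_tok, build_fn, graph_nodes, "flux");
+    //   cg.dispatch(backend, {x_host}, {x_nbytes}, out_host, out_nbytes);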
+
+    // Allocate/upload-inputs/compute/read-output for one step. host_data and
+    // host_nbytes must have one entry per input (matching the order passed to
+    // ensure_built). out_buf must be sized for at least ggml_nbytes(out_).
+    bool dispatch(ggml_backend_t backend,
+                  const std::vector<const void*>& host_data,
+                  const std::vector<size_t>& host_nbytes,
+                  void* out_buf,
+                  size_t out_nbytes) {
+        if (gf_ == nullptr) {
+            return false;
+        }
+        if (host_data.size() != inputs_.size() || host_nbytes.size() != inputs_.size()) {
+            LOG_ERROR("chunk dispatch: host_data/host_nbytes size mismatch");
+            return false;
+        }
+        if (!ggml_gallocr_alloc_graph(allocr_, gf_)) {
+            LOG_ERROR("chunk alloc_graph failed");
+            return false;
+        }
+        for (size_t i = 0; i < inputs_.size(); i++) {
+            ggml_backend_tensor_set(inputs_[i], host_data[i], 0, host_nbytes[i]);
+        }
+        // Upload the build-in scalars each dispatch (gallocr_alloc_graph may
+        // re-bind tensor data offsets within the compute buffer).
+        static constexpr float kOneVal       = 1.0f;
+        static constexpr int32_t kZeroIntVal = 0;
+        ggml_backend_tensor_set(one_tensor_, &kOneVal, 0, sizeof(kOneVal));
+        ggml_backend_tensor_set(zero_int_tensor_, &kZeroIntVal, 0, sizeof(kZeroIntVal));
+
+        ggml_status status = ggml_backend_graph_compute(backend, gf_);
+        if (status != GGML_STATUS_SUCCESS) {
+            LOG_ERROR("chunk compute failed: %s", ggml_status_to_string(status));
+            return false;
+        }
+        ggml_backend_tensor_get(out_, out_buf, 0, out_nbytes);
+        return true;
+    }
+
+    ggml_tensor* output() const { return out_; }
+    int layer_count() const { return layer_count_; }
+    bool is_built() const { return gf_ != nullptr; }
+
+    void clear() {
+        if (allocr_ != nullptr) {
+            ggml_gallocr_free(allocr_);
+            allocr_ = nullptr;
+        }
+        if (ctx_ != nullptr) {
+            ggml_free(ctx_);
+            ctx_ = nullptr;
+        }
+        gf_  = nullptr;
+        out_ = nullptr;
+        one_tensor_      = nullptr;
+        zero_int_tensor_ = nullptr;
+        inputs_.clear();
+        cached_shapes_.clear();
+        layer_count_ = 0;
+        state_token_ = 0;
+    }
+
+private:
+    bool shapes_match(const std::vector<std::array<int64_t, 4>>& shapes) const {
+        if (shapes.size() != cached_shapes_.size()) {
+            return false;
+        }
+        for (size_t i = 0; i < shapes.size(); i++) {
+            for (int j = 0; j < 4; j++) {
+                if (shapes[i][j] != cached_shapes_[i][j]) {
+                    return false;
+                }
+            }
+        }
+        return true;
+    }
+
+    ggml_context* ctx_     = nullptr;
+    ggml_gallocr_t allocr_ = nullptr;
+    ggml_cgraph* gf_       = nullptr;
+    std::vector<ggml_tensor*> inputs_;
+    ggml_tensor* out_             = nullptr;
+    ggml_tensor* one_tensor_      = nullptr;
+    ggml_tensor* zero_int_tensor_ = nullptr;
+    int layer_count_      = 0;
+    uint64_t state_token_ = 0;
+    std::vector<std::array<int64_t, 4>> cached_shapes_;
+};
+
+} // namespace LayerStreaming
+
+#endif
diff --git a/src/conditioner.hpp b/src/conditioner.hpp
index 4907938b0..ff73ab3b8 100644
--- a/src/conditioner.hpp
+++ b/src/conditioner.hpp
@@ -95,6 +95,13 @@ struct Conditioner {
     virtual std::string remove_trigger_from_prompt(const std::string& prompt) {
         GGML_ABORT("Not implemented yet!");
     }
+
+    // Dynamic tensor offloading interface
+    virtual bool is_params_on_gpu() const { return false; }
+    virtual bool move_params_to_cpu() { return false; }
+    virtual bool move_params_to_gpu() { return false; }
+    virtual size_t get_params_vram_size() const { return 0; }
+    virtual void set_auto_offload(bool enabled) {}
 };

 // ldm.modules.encoders.modules.FrozenCLIPEmbedder
@@ -187,6 +194,46 @@ struct FrozenCLIPEmbedderWithCustomWords : public Conditioner {
         }
     }

+    // Dynamic tensor offloading
+    bool is_params_on_gpu() const override {
+        bool on_gpu = text_model->is_params_on_gpu();
+        if (sd_version_is_sdxl(version) && text_model2) {
+            on_gpu = on_gpu &&
text_model2->is_params_on_gpu(); + } + return on_gpu; + } + + bool move_params_to_cpu() override { + bool success = text_model->move_params_to_cpu(); + if (sd_version_is_sdxl(version) && text_model2) { + success = text_model2->move_params_to_cpu() && success; + } + return success; + } + + bool move_params_to_gpu() override { + bool success = text_model->move_params_to_gpu(); + if (sd_version_is_sdxl(version) && text_model2) { + success = text_model2->move_params_to_gpu() && success; + } + return success; + } + + size_t get_params_vram_size() const override { + size_t size = text_model->get_params_vram_size(); + if (sd_version_is_sdxl(version) && text_model2) { + size += text_model2->get_params_vram_size(); + } + return size; + } + + void set_auto_offload(bool enabled) override { + text_model->set_auto_offload(enabled); + if (sd_version_is_sdxl(version) && text_model2) { + text_model2->set_auto_offload(enabled); + } + } + bool load_embedding(std::string embd_name, std::string embd_path, std::vector& bpe_tokens) { ModelLoader model_loader; if (!model_loader.init_from_file_and_convert_name(embd_path)) { @@ -825,6 +872,75 @@ struct SD3CLIPEmbedder : public Conditioner { } } + // Dynamic tensor offloading + bool is_params_on_gpu() const override { + bool on_gpu = true; + if (clip_l) { + on_gpu = on_gpu && clip_l->is_params_on_gpu(); + } + if (clip_g) { + on_gpu = on_gpu && clip_g->is_params_on_gpu(); + } + if (t5) { + on_gpu = on_gpu && t5->is_params_on_gpu(); + } + return on_gpu; + } + + bool move_params_to_cpu() override { + bool success = true; + if (clip_l) { + success = clip_l->move_params_to_cpu() && success; + } + if (clip_g) { + success = clip_g->move_params_to_cpu() && success; + } + if (t5) { + success = t5->move_params_to_cpu() && success; + } + return success; + } + + bool move_params_to_gpu() override { + bool success = true; + if (clip_l) { + success = clip_l->move_params_to_gpu() && success; + } + if (clip_g) { + success = clip_g->move_params_to_gpu() && success; + } + if (t5) { + success = t5->move_params_to_gpu() && success; + } + return success; + } + + size_t get_params_vram_size() const override { + size_t size = 0; + if (clip_l) { + size += clip_l->get_params_vram_size(); + } + if (clip_g) { + size += clip_g->get_params_vram_size(); + } + if (t5) { + size += t5->get_params_vram_size(); + } + return size; + } + + void set_auto_offload(bool enabled) override { + if (clip_l) { + clip_l->set_auto_offload(enabled); + } + if (clip_g) { + clip_g->set_auto_offload(enabled); + } + if (t5) { + t5->set_auto_offload(enabled); + } + } + std::vector, std::vector>> tokenize(std::string text, size_t min_length = 0, size_t max_length = 0, @@ -1171,6 +1287,60 @@ struct FluxCLIPEmbedder : public Conditioner { } } + // Dynamic tensor offloading + bool is_params_on_gpu() const override { + bool on_gpu = true; + if (clip_l) { + on_gpu = on_gpu && clip_l->is_params_on_gpu(); + } + if (t5) { + on_gpu = on_gpu && t5->is_params_on_gpu(); + } + return on_gpu; + } + + bool move_params_to_cpu() override { + bool success = true; + if (clip_l) { + success = clip_l->move_params_to_cpu() && success; + } + if (t5) { + success = t5->move_params_to_cpu() && success; + } + return success; + } + + bool move_params_to_gpu() override { + bool success = true; + if (clip_l) { + success = clip_l->move_params_to_gpu() && success; + } + if (t5) { + success = t5->move_params_to_gpu() && success; + } + return success; + } + + size_t get_params_vram_size() const override { + size_t size = 0; + if (clip_l) { + size += 
clip_l->get_params_vram_size(); + } + if (t5) { + size += t5->get_params_vram_size(); + } + return size; + } + + void set_auto_offload(bool enabled) override { + if (clip_l) { + clip_l->set_auto_offload(enabled); + } + if (t5) { + t5->set_auto_offload(enabled); + } + } + std::vector, std::vector>> tokenize(std::string text, size_t min_length = 0, size_t max_length = 0) { @@ -1525,6 +1695,29 @@ struct T5CLIPEmbedder : public Conditioner { conditioner_params.clip_skip, conditioner_params.zero_out_masked); } + + // Dynamic tensor offloading + bool is_params_on_gpu() const override { + return t5 ? t5->is_params_on_gpu() : false; + } + + bool move_params_to_cpu() override { + return t5 ? t5->move_params_to_cpu() : false; + } + + bool move_params_to_gpu() override { + return t5 ? t5->move_params_to_gpu() : false; + } + + size_t get_params_vram_size() const override { + return t5 ? t5->get_params_vram_size() : 0; + } + + void set_auto_offload(bool enabled) override { + if (t5) { + t5->set_auto_offload(enabled); + } + } }; struct AnimaConditioner : public Conditioner { @@ -1572,6 +1765,27 @@ struct AnimaConditioner : public Conditioner { llm->set_weight_adapter(adapter); } + // Dynamic tensor offloading - delegate to LLM + bool is_params_on_gpu() const override { + return llm->is_params_on_gpu(); + } + + bool move_params_to_cpu() override { + return llm->move_params_to_cpu(); + } + + bool move_params_to_gpu() override { + return llm->move_params_to_gpu(); + } + + size_t get_params_vram_size() const override { + return llm->get_params_vram_size(); + } + + void set_auto_offload(bool enabled) override { + llm->set_auto_offload(enabled); + } + std::tuple, std::vector, std::vector, std::vector> tokenize(std::string text) { auto parsed_attention = parse_prompt_attention(text); @@ -1999,6 +2213,29 @@ struct LLMEmbedder : public Conditioner { result.extra_c_crossattns = std::move(extra_hidden_states_vec); return result; } + + // Dynamic tensor offloading + bool is_params_on_gpu() const override { + return llm ? llm->is_params_on_gpu() : false; + } + + bool move_params_to_cpu() override { + return llm ? llm->move_params_to_cpu() : false; + } + + bool move_params_to_gpu() override { + return llm ? llm->move_params_to_gpu() : false; + } + + size_t get_params_vram_size() const override { + return llm ? llm->get_params_vram_size() : 0; + } + + void set_auto_offload(bool enabled) override { + if (llm) { + llm->set_auto_offload(enabled); + } + } }; #endif diff --git a/src/diffusion_model.hpp b/src/diffusion_model.hpp index 1a202a1a7..66ec562a8 100644 --- a/src/diffusion_model.hpp +++ b/src/diffusion_model.hpp @@ -37,6 +37,43 @@ static inline const sd::Tensor& tensor_or_empty(const sd::Tensor* tensor) return tensor != nullptr ? *tensor : kEmpty; } +// Helper to convert sd::Tensor pointers to temporary ggml_tensor* for streaming code paths. +// The returned ggml_tensors live in the provided ggml_context and point to the sd::Tensor's data. 
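+// Illustrative use (mirrors the compute_streaming overrides below; the
+// converter returns nullptr for null or empty inputs):
+//   StreamingParamConverter cvt;
+//   ggml_tensor* x = cvt.convert(diffusion_params.x);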
+struct StreamingParamConverter { + ggml_context* ctx = nullptr; + + StreamingParamConverter() { + struct ggml_init_params params = { + /*.mem_size =*/ 16 * ggml_tensor_overhead(), + /*.mem_buffer =*/ nullptr, + /*.no_alloc =*/ true, + }; + ctx = ggml_init(params); + } + + ~StreamingParamConverter() { + if (ctx) ggml_free(ctx); + } + + template + ggml_tensor* convert(const sd::Tensor* tensor) { + if (tensor == nullptr || tensor->numel() == 0) return nullptr; + ggml_tensor* t = sd::make_ggml_tensor(ctx, *tensor, false); + t->data = const_cast(static_cast(tensor->data())); + return t; + } + + std::vector convert_vec(const std::vector>* tensors) { + std::vector result; + if (tensors == nullptr) return result; + for (const auto& t : *tensors) { + sd::Tensor tmp_ref = t; // non-const copy for convert + result.push_back(convert(&tmp_ref)); + } + return result; + } +}; + struct DiffusionModel { virtual std::string get_desc() = 0; virtual sd::Tensor compute(int n_threads, @@ -51,6 +88,81 @@ struct DiffusionModel { virtual void set_flash_attention_enabled(bool enabled) = 0; virtual void set_max_graph_vram_bytes(size_t max_vram_bytes) = 0; virtual void set_circular_axes(bool circular_x, bool circular_y) = 0; + + // Dynamic tensor offloading interface + virtual bool is_params_on_gpu() const { return false; } + virtual bool move_params_to_cpu() { return false; } + virtual bool move_params_to_gpu() { return false; } + virtual size_t get_params_vram_size() const { return 0; } + + // Layer streaming interface (for granular tensor offloading) + virtual bool supports_layer_streaming() const { return false; } + virtual void enable_layer_streaming(int prefetch_layers = 1, size_t min_free_vram = 512 * 1024 * 1024) { + (void)prefetch_layers; + (void)min_free_vram; + } + virtual void disable_layer_streaming() {} + virtual bool is_layer_streaming_enabled() const { return false; } + virtual bool compute_streaming(int n_threads, + DiffusionParams diffusion_params, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) { + // Default: fall back to regular compute, copy result to output + auto result = compute(n_threads, diffusion_params); + if (output != nullptr && result.numel() > 0) { + if (*output == nullptr && output_ctx != nullptr) { + auto shape = result.shape(); + int n_dims = std::min(static_cast(shape.size()), GGML_MAX_DIMS); + std::array ne = {1, 1, 1, 1}; + for (int i = 0; i < n_dims; i++) ne[i] = shape[i]; + *output = ggml_new_tensor(output_ctx, GGML_TYPE_F32, n_dims, ne.data()); + } + if (*output != nullptr) { + memcpy((*output)->data, result.data(), result.numel() * sizeof(float)); + } + } + return result.numel() > 0; + } + // Offload all streaming layers to CPU (free GPU memory after diffusion) + virtual void offload_streaming_layers() {} + + // Bridge: dispatch to streaming or regular compute based on layer streaming state, + // returning sd::Tensor for compatibility with the upstream sample loop. + // + // Streaming output shape matches the input x shape (diffusion preserves shape). + // We pre-allocate the destination sd::Tensor and have the streaming runner write + // directly into its memory via a tiny no_alloc ggml_context — no per-step malloc. 
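+    //
+    // Call-site sketch (hypothetical sampler step, not code from this patch):
+    //   sd::Tensor denoised = diffusion_model->compute_dispatch(n_threads, params);
+    //   // same call whether layer streaming is enabled or not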
+ sd::Tensor compute_dispatch(int n_threads, const DiffusionParams& diffusion_params) { + if (!is_layer_streaming_enabled()) { + return compute(n_threads, diffusion_params); + } + if (diffusion_params.x == nullptr) { + LOG_ERROR("compute_dispatch: diffusion_params.x is null"); + return {}; + } + + // Pre-allocate result with x's shape; stream writes will land here directly. + sd::Tensor result(diffusion_params.x->shape()); + + // Tiny no_alloc context — only holds tensor metadata, no data backing. + ggml_init_params params = {2 * ggml_tensor_overhead(), nullptr, true}; + ggml_context* out_ctx = ggml_init(params); + if (out_ctx == nullptr) { + LOG_ERROR("compute_dispatch: ggml_init failed"); + return {}; + } + + // Make a metadata tensor with the same shape as result and point its data + // pointer at result's memory. The runner's ggml_ext_backend_tensor_get_and_sync + // will copy GPU→here directly. Skip ggml_dup_tensor by passing non-null *output. + ggml_tensor* out_tensor = sd::make_ggml_tensor(out_ctx, result, false); + out_tensor->data = result.data(); + + bool ok = compute_streaming(n_threads, diffusion_params, &out_tensor, out_ctx); + ggml_free(out_ctx); + if (!ok) return {}; + return result; + } }; struct UNetModel : public DiffusionModel { @@ -122,6 +234,53 @@ struct UNetModel : public DiffusionModel { diffusion_params.controls ? *diffusion_params.controls : empty_controls, diffusion_params.control_strength); } + + // Dynamic tensor offloading + bool is_params_on_gpu() const override { return unet.is_params_on_gpu(); } + bool move_params_to_cpu() override { return unet.move_params_to_cpu(); } + bool move_params_to_gpu() override { return unet.move_params_to_gpu(); } + size_t get_params_vram_size() const override { return unet.get_params_vram_size(); } + + // Layer streaming (coarse-stage for UNet due to skip connections) + bool supports_layer_streaming() const override { return true; } + + void enable_layer_streaming(int prefetch_layers, size_t min_free_vram) override { + LayerStreaming::StreamingConfig config; + config.prefetch_layers = prefetch_layers; + config.min_free_vram = min_free_vram; + unet.enable_layer_streaming(config); + } + + void disable_layer_streaming() override { + unet.disable_layer_streaming(); + } + + bool is_layer_streaming_enabled() const override { + return unet.is_streaming_enabled(); + } + + void offload_streaming_layers() override { + unet.offload_streaming_layers(); + } + + bool compute_streaming(int n_threads, + DiffusionParams diffusion_params, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) override { + StreamingParamConverter cvt; + auto controls_vec = cvt.convert_vec(diffusion_params.controls); + return unet.compute_streaming(n_threads, + cvt.convert(diffusion_params.x), + cvt.convert(diffusion_params.timesteps), + cvt.convert(diffusion_params.context), + cvt.convert(diffusion_params.c_concat), + cvt.convert(diffusion_params.y), + diffusion_params.num_video_frames, + controls_vec, + diffusion_params.control_strength, + output, + output_ctx); + } }; struct MMDiTModel : public DiffusionModel { @@ -189,6 +348,50 @@ struct MMDiTModel : public DiffusionModel { tensor_or_empty(diffusion_params.y), diffusion_params.skip_layers ? 
*diffusion_params.skip_layers : empty_skip_layers); } + + // Dynamic tensor offloading + bool is_params_on_gpu() const override { return mmdit.is_params_on_gpu(); } + bool move_params_to_cpu() override { return mmdit.move_params_to_cpu(); } + bool move_params_to_gpu() override { return mmdit.move_params_to_gpu(); } + size_t get_params_vram_size() const override { return mmdit.get_params_vram_size(); } + + // Layer streaming (granular tensor offloading) + bool supports_layer_streaming() const override { return true; } + + void enable_layer_streaming(int prefetch_layers, size_t min_free_vram) override { + LayerStreaming::StreamingConfig config; + config.prefetch_layers = prefetch_layers; + config.min_free_vram = min_free_vram; + mmdit.enable_layer_streaming(config); + } + + void disable_layer_streaming() override { + mmdit.disable_layer_streaming(); + } + + bool is_layer_streaming_enabled() const override { + return mmdit.is_streaming_enabled(); + } + + void offload_streaming_layers() override { + mmdit.offload_streaming_layers(); + } + + bool compute_streaming(int n_threads, + DiffusionParams diffusion_params, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) override { + StreamingParamConverter cvt; + auto skip = diffusion_params.skip_layers ? *diffusion_params.skip_layers : std::vector(); + return mmdit.compute_streaming(n_threads, + cvt.convert(diffusion_params.x), + cvt.convert(diffusion_params.timesteps), + cvt.convert(diffusion_params.context), + cvt.convert(diffusion_params.y), + output, + output_ctx, + skip); + } }; struct FluxModel : public DiffusionModel { @@ -263,6 +466,55 @@ struct FluxModel : public DiffusionModel { diffusion_params.increase_ref_index, diffusion_params.skip_layers ? *diffusion_params.skip_layers : empty_skip_layers); } + + // Dynamic tensor offloading + bool is_params_on_gpu() const override { return flux.is_params_on_gpu(); } + bool move_params_to_cpu() override { return flux.move_params_to_cpu(); } + bool move_params_to_gpu() override { return flux.move_params_to_gpu(); } + size_t get_params_vram_size() const override { return flux.get_params_vram_size(); } + + // Layer streaming (granular tensor offloading) + bool supports_layer_streaming() const override { return true; } + + void enable_layer_streaming(int prefetch_layers, size_t min_free_vram) override { + LayerStreaming::StreamingConfig config; + config.prefetch_layers = prefetch_layers; + config.min_free_vram = min_free_vram; + flux.enable_layer_streaming(config); + } + + void disable_layer_streaming() override { + flux.disable_layer_streaming(); + } + + bool is_layer_streaming_enabled() const override { + return flux.is_streaming_enabled(); + } + + void offload_streaming_layers() override { + flux.offload_streaming_layers(); + } + + bool compute_streaming(int n_threads, + DiffusionParams diffusion_params, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) override { + StreamingParamConverter cvt; + auto ref_vec = cvt.convert_vec(diffusion_params.ref_latents); + auto skip = diffusion_params.skip_layers ? 
*diffusion_params.skip_layers : std::vector(); + return flux.compute_streaming(n_threads, + cvt.convert(diffusion_params.x), + cvt.convert(diffusion_params.timesteps), + cvt.convert(diffusion_params.context), + cvt.convert(diffusion_params.c_concat), + cvt.convert(diffusion_params.y), + cvt.convert(diffusion_params.guidance), + ref_vec, + diffusion_params.increase_ref_index, + output, + output_ctx, + skip); + } }; struct AnimaModel : public DiffusionModel { @@ -331,6 +583,42 @@ struct AnimaModel : public DiffusionModel { tensor_or_empty(diffusion_params.t5_ids), tensor_or_empty(diffusion_params.t5_weights)); } + + bool supports_layer_streaming() const override { return true; } + + void enable_layer_streaming(int prefetch_layers, size_t min_free_vram) override { + LayerStreaming::StreamingConfig config; + config.prefetch_layers = prefetch_layers; + config.min_free_vram = min_free_vram; + anima.enable_layer_streaming(config); + } + + void disable_layer_streaming() override { + anima.disable_layer_streaming(); + } + + bool is_layer_streaming_enabled() const override { + return anima.is_streaming_enabled(); + } + + void offload_streaming_layers() override { + anima.offload_streaming_layers(); + } + + bool compute_streaming(int n_threads, + DiffusionParams diffusion_params, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) override { + StreamingParamConverter cvt; + return anima.compute_streaming(n_threads, + cvt.convert(diffusion_params.x), + cvt.convert(diffusion_params.timesteps), + cvt.convert(diffusion_params.context), + cvt.convert(diffusion_params.t5_ids), + cvt.convert(diffusion_params.t5_weights), + output, + output_ctx); + } }; struct WanModel : public DiffusionModel { @@ -403,6 +691,52 @@ struct WanModel : public DiffusionModel { tensor_or_empty(diffusion_params.vace_context), diffusion_params.vace_strength); } + + // Dynamic tensor offloading + bool is_params_on_gpu() const override { return wan.is_params_on_gpu(); } + bool move_params_to_cpu() override { return wan.move_params_to_cpu(); } + bool move_params_to_gpu() override { return wan.move_params_to_gpu(); } + size_t get_params_vram_size() const override { return wan.get_params_vram_size(); } + + // Layer streaming (granular tensor offloading) + bool supports_layer_streaming() const override { return true; } + + void enable_layer_streaming(int prefetch_layers, size_t min_free_vram) override { + LayerStreaming::StreamingConfig config; + config.prefetch_layers = prefetch_layers; + config.min_free_vram = min_free_vram; + wan.enable_layer_streaming(config); + } + + void disable_layer_streaming() override { + wan.disable_layer_streaming(); + } + + bool is_layer_streaming_enabled() const override { + return wan.is_streaming_enabled(); + } + + void offload_streaming_layers() override { + wan.offload_streaming_layers(); + } + + bool compute_streaming(int n_threads, + DiffusionParams diffusion_params, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) override { + StreamingParamConverter cvt; + return wan.compute_streaming(n_threads, + cvt.convert(diffusion_params.x), + cvt.convert(diffusion_params.timesteps), + cvt.convert(diffusion_params.context), + cvt.convert(diffusion_params.y), + cvt.convert(diffusion_params.c_concat), + nullptr, + cvt.convert(diffusion_params.vace_context), + diffusion_params.vace_strength, + output, + output_ctx); + } }; struct QwenImageModel : public DiffusionModel { @@ -474,6 +808,50 @@ struct QwenImageModel : public DiffusionModel { 
diffusion_params.ref_latents ? *diffusion_params.ref_latents : empty_ref_latents, true); } + + // Dynamic tensor offloading + bool is_params_on_gpu() const override { return qwen_image.is_params_on_gpu(); } + bool move_params_to_cpu() override { return qwen_image.move_params_to_cpu(); } + bool move_params_to_gpu() override { return qwen_image.move_params_to_gpu(); } + size_t get_params_vram_size() const override { return qwen_image.get_params_vram_size(); } + + // Layer streaming (granular tensor offloading) + bool supports_layer_streaming() const override { return true; } + + void enable_layer_streaming(int prefetch_layers, size_t min_free_vram) override { + LayerStreaming::StreamingConfig config; + config.prefetch_layers = prefetch_layers; + config.min_free_vram = min_free_vram; + qwen_image.enable_layer_streaming(config); + } + + void disable_layer_streaming() override { + qwen_image.disable_layer_streaming(); + } + + bool is_layer_streaming_enabled() const override { + return qwen_image.is_streaming_enabled(); + } + + bool compute_streaming(int n_threads, + DiffusionParams diffusion_params, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) override { + StreamingParamConverter cvt; + auto ref_vec = cvt.convert_vec(diffusion_params.ref_latents); + return qwen_image.compute_streaming(n_threads, + cvt.convert(diffusion_params.x), + cvt.convert(diffusion_params.timesteps), + cvt.convert(diffusion_params.context), + ref_vec, + true, // increase_ref_index + output, + output_ctx); + } + + void offload_streaming_layers() override { + qwen_image.offload_streaming_layers(); + } }; struct ZImageModel : public DiffusionModel { @@ -544,6 +922,50 @@ struct ZImageModel : public DiffusionModel { diffusion_params.ref_latents ? 
*diffusion_params.ref_latents : empty_ref_latents, true); } + + // Dynamic tensor offloading + bool is_params_on_gpu() const override { return z_image.is_params_on_gpu(); } + bool move_params_to_cpu() override { return z_image.move_params_to_cpu(); } + bool move_params_to_gpu() override { return z_image.move_params_to_gpu(); } + size_t get_params_vram_size() const override { return z_image.get_params_vram_size(); } + + // Layer streaming (granular tensor offloading) + bool supports_layer_streaming() const override { return true; } + + void enable_layer_streaming(int prefetch_layers, size_t min_free_vram) override { + LayerStreaming::StreamingConfig config; + config.prefetch_layers = prefetch_layers; + config.min_free_vram = min_free_vram; + z_image.enable_layer_streaming(config); + } + + void disable_layer_streaming() override { + z_image.disable_layer_streaming(); + } + + bool is_layer_streaming_enabled() const override { + return z_image.is_streaming_enabled(); + } + + void offload_streaming_layers() override { + z_image.offload_streaming_layers(); + } + + bool compute_streaming(int n_threads, + DiffusionParams diffusion_params, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) override { + StreamingParamConverter cvt; + auto ref_vec = cvt.convert_vec(diffusion_params.ref_latents); + return z_image.compute_streaming(n_threads, + cvt.convert(diffusion_params.x), + cvt.convert(diffusion_params.timesteps), + cvt.convert(diffusion_params.context), + ref_vec, + true, // increase_ref_index + output, + output_ctx); + } }; struct ErnieImageModel : public DiffusionModel { diff --git a/src/flux.hpp b/src/flux.hpp index 732a37197..22653d531 100644 --- a/src/flux.hpp +++ b/src/flux.hpp @@ -5,6 +5,7 @@ #include #include "common_dit.hpp" +#include "layer_streaming.hpp" #include "model.h" #include "rope.hpp" @@ -847,6 +848,142 @@ namespace Flux { } } + struct StreamingInputResult { + ggml_tensor* img; + ggml_tensor* txt; + ggml_tensor* vec; + ggml_tensor* txt_img_mask; + std::vector ds_img_mods; + std::vector ds_txt_mods; + std::vector ss_mods; + int64_t n_txt_tokens; + }; + + StreamingInputResult forward_input_stage(GGMLRunnerContext* ctx, + ggml_tensor* img, + ggml_tensor* txt, + ggml_tensor* timesteps, + ggml_tensor* y, + ggml_tensor* guidance, + ggml_tensor* mod_index_arange = nullptr) { + auto img_in = std::dynamic_pointer_cast(blocks["img_in"]); + auto txt_in = std::dynamic_pointer_cast(blocks["txt_in"]); + + int64_t n_txt_tokens = txt->ne[1]; + + if (img_in) { + img = img_in->forward(ctx, img); + } + + ggml_tensor* vec; + ggml_tensor* txt_img_mask = nullptr; + if (params.is_chroma) { + int64_t mod_index_length = 344; + auto approx = std::dynamic_pointer_cast(blocks["distilled_guidance_layer"]); + auto distill_timestep = ggml_ext_timestep_embedding(ctx->ggml_ctx, timesteps, 16, 10000, 1000.f); + auto distill_guidance = ggml_ext_timestep_embedding(ctx->ggml_ctx, guidance, 16, 10000, 1000.f); + + GGML_ASSERT(mod_index_arange != nullptr); + auto modulation_index = ggml_ext_timestep_embedding(ctx->ggml_ctx, mod_index_arange, 32, 10000, 1000.f); + modulation_index = ggml_repeat(ctx->ggml_ctx, modulation_index, ggml_new_tensor_3d(ctx->ggml_ctx, GGML_TYPE_F32, modulation_index->ne[0], modulation_index->ne[1], img->ne[2])); + + auto timestep_guidance = ggml_concat(ctx->ggml_ctx, distill_timestep, distill_guidance, 0); + timestep_guidance = ggml_repeat(ctx->ggml_ctx, timestep_guidance, modulation_index); + + vec = ggml_concat(ctx->ggml_ctx, timestep_guidance, 
modulation_index, 0); + vec = ggml_cont(ctx->ggml_ctx, ggml_permute(ctx->ggml_ctx, vec, 0, 2, 1, 3)); + vec = approx->forward(ctx, vec); + + if (y != nullptr) { + txt_img_mask = ggml_pad(ctx->ggml_ctx, y, static_cast(img->ne[1]), 0, 0, 0); + } + } else { + auto time_in = std::dynamic_pointer_cast(blocks["time_in"]); + vec = time_in->forward(ctx, ggml_ext_timestep_embedding(ctx->ggml_ctx, timesteps, 256, 10000, 1000.f)); + if (params.guidance_embed) { + GGML_ASSERT(guidance != nullptr); + auto guidance_in = std::dynamic_pointer_cast(blocks["guidance_in"]); + auto g_in = ggml_ext_timestep_embedding(ctx->ggml_ctx, guidance, 256, 10000, 1000.f); + vec = ggml_add(ctx->ggml_ctx, vec, guidance_in->forward(ctx, g_in)); + } + if (params.vec_in_dim > 0) { + auto vector_in = std::dynamic_pointer_cast(blocks["vector_in"]); + vec = ggml_add(ctx->ggml_ctx, vec, vector_in->forward(ctx, y)); + } + } + + std::vector ds_img_mods; + std::vector ds_txt_mods; + std::vector ss_mods; + if (params.share_modulation) { + auto double_stream_modulation_img = std::dynamic_pointer_cast(blocks["double_stream_modulation_img"]); + auto double_stream_modulation_txt = std::dynamic_pointer_cast(blocks["double_stream_modulation_txt"]); + auto single_stream_modulation = std::dynamic_pointer_cast(blocks["single_stream_modulation"]); + + ds_img_mods = double_stream_modulation_img->forward(ctx, vec); + ds_txt_mods = double_stream_modulation_txt->forward(ctx, vec); + ss_mods = single_stream_modulation->forward(ctx, vec); + } + + if (params.semantic_txt_norm) { + auto semantic_txt_norm = std::dynamic_pointer_cast(blocks["txt_norm"]); + txt = semantic_txt_norm->forward(ctx, txt); + } + + txt = txt_in->forward(ctx, txt); + + return {img, txt, vec, txt_img_mask, ds_img_mods, ds_txt_mods, ss_mods, n_txt_tokens}; + } + + std::pair forward_double_block(GGMLRunnerContext* ctx, + int block_idx, + ggml_tensor* img, + ggml_tensor* txt, + ggml_tensor* vec, + ggml_tensor* pe, + ggml_tensor* txt_img_mask, + std::vector& ds_img_mods, + std::vector& ds_txt_mods) { + auto block = std::dynamic_pointer_cast(blocks["double_blocks." + std::to_string(block_idx)]); + auto img_txt = block->forward(ctx, img, txt, vec, pe, txt_img_mask, ds_img_mods, ds_txt_mods); + return img_txt; + } + + ggml_tensor* forward_single_block(GGMLRunnerContext* ctx, + int block_idx, + ggml_tensor* txt_img, + ggml_tensor* vec, + ggml_tensor* pe, + ggml_tensor* txt_img_mask, + std::vector& ss_mods) { + auto block = std::dynamic_pointer_cast(blocks["single_blocks." 
+ std::to_string(block_idx)]); + return block->forward(ctx, txt_img, vec, pe, txt_img_mask, ss_mods); + } + + ggml_tensor* forward_output_stage(GGMLRunnerContext* ctx, + ggml_tensor* txt_img, + ggml_tensor* vec, + int64_t n_img_tokens, + int64_t n_txt_tokens) { + auto final_layer = std::dynamic_pointer_cast(blocks["final_layer"]); + + // Extract img from txt_img + auto img = ggml_view_3d(ctx->ggml_ctx, + txt_img, + txt_img->ne[0], + n_img_tokens, + txt_img->ne[2], + txt_img->nb[1], + txt_img->nb[2], + n_txt_tokens * txt_img->nb[1]); + + if (final_layer) { + img = final_layer->forward(ctx, img, vec); + } + + return img; + } + ggml_tensor* forward_orig(GGMLRunnerContext* ctx, ggml_tensor* img, ggml_tensor* txt, @@ -1175,6 +1312,190 @@ namespace Flux { skip_layers); } } + + struct StreamingContext { + // Intermediate tensors (persist across blocks) + ggml_tensor* img = nullptr; // Image features + ggml_tensor* txt = nullptr; // Text features + ggml_tensor* vec = nullptr; // Time/guidance embedding + ggml_tensor* pe = nullptr; // Positional encoding + ggml_tensor* txt_img_mask = nullptr; // Mask for attention + + // Precomputed modulations (computed once, used by all blocks) + std::vector ds_img_mods; + std::vector ds_txt_mods; + std::vector ss_mods; + + // State tracking + int current_double_block = 0; + int current_single_block = 0; + bool preprocessing_done = false; + bool double_blocks_done = false; + bool single_blocks_done = false; + + // Concatenated tensor for single blocks + ggml_tensor* txt_img = nullptr; + + void reset() { + img = txt = vec = pe = txt_img_mask = txt_img = nullptr; + ds_img_mods.clear(); + ds_txt_mods.clear(); + ss_mods.clear(); + current_double_block = 0; + current_single_block = 0; + preprocessing_done = false; + double_blocks_done = false; + single_blocks_done = false; + } + }; + + void forward_preprocessing(GGMLRunnerContext* ctx, + StreamingContext& stream_ctx, + ggml_tensor* img, + ggml_tensor* txt, + ggml_tensor* timesteps, + ggml_tensor* y, + ggml_tensor* guidance, + ggml_tensor* pe, + ggml_tensor* mod_index_arange = nullptr) { + auto img_in = std::dynamic_pointer_cast(blocks["img_in"]); + auto txt_in = std::dynamic_pointer_cast(blocks["txt_in"]); + + // Image input projection + if (img_in) { + stream_ctx.img = img_in->forward(ctx, img); + } else { + stream_ctx.img = img; + } + + // Compute vec (time/guidance embedding) + if (params.is_chroma) { + int64_t mod_index_length = 344; + auto approx = std::dynamic_pointer_cast(blocks["distilled_guidance_layer"]); + auto distill_timestep = ggml_ext_timestep_embedding(ctx->ggml_ctx, timesteps, 16, 10000, 1000.f); + auto distill_guidance = ggml_ext_timestep_embedding(ctx->ggml_ctx, guidance, 16, 10000, 1000.f); + + GGML_ASSERT(mod_index_arange != nullptr); + auto modulation_index = ggml_ext_timestep_embedding(ctx->ggml_ctx, mod_index_arange, 32, 10000, 1000.f); + modulation_index = ggml_repeat(ctx->ggml_ctx, modulation_index, + ggml_new_tensor_3d(ctx->ggml_ctx, GGML_TYPE_F32, modulation_index->ne[0], modulation_index->ne[1], img->ne[2])); + + auto timestep_guidance = ggml_concat(ctx->ggml_ctx, distill_timestep, distill_guidance, 0); + timestep_guidance = ggml_repeat(ctx->ggml_ctx, timestep_guidance, modulation_index); + + stream_ctx.vec = ggml_concat(ctx->ggml_ctx, timestep_guidance, modulation_index, 0); + stream_ctx.vec = ggml_cont(ctx->ggml_ctx, ggml_permute(ctx->ggml_ctx, stream_ctx.vec, 0, 2, 1, 3)); + stream_ctx.vec = approx->forward(ctx, stream_ctx.vec); + + if (y != nullptr) { + stream_ctx.txt_img_mask = 
ggml_pad(ctx->ggml_ctx, y, static_cast(img->ne[1]), 0, 0, 0); + } + } else { + auto time_in = std::dynamic_pointer_cast(blocks["time_in"]); + stream_ctx.vec = time_in->forward(ctx, ggml_ext_timestep_embedding(ctx->ggml_ctx, timesteps, 256, 10000, 1000.f)); + + if (params.guidance_embed) { + GGML_ASSERT(guidance != nullptr); + auto guidance_in = std::dynamic_pointer_cast(blocks["guidance_in"]); + auto g_in = ggml_ext_timestep_embedding(ctx->ggml_ctx, guidance, 256, 10000, 1000.f); + stream_ctx.vec = ggml_add(ctx->ggml_ctx, stream_ctx.vec, guidance_in->forward(ctx, g_in)); + } + + if (params.vec_in_dim > 0) { + auto vector_in = std::dynamic_pointer_cast(blocks["vector_in"]); + stream_ctx.vec = ggml_add(ctx->ggml_ctx, stream_ctx.vec, vector_in->forward(ctx, y)); + } + } + + // Precompute modulations (used by all blocks) + if (params.share_modulation) { + auto double_stream_modulation_img = std::dynamic_pointer_cast(blocks["double_stream_modulation_img"]); + auto double_stream_modulation_txt = std::dynamic_pointer_cast(blocks["double_stream_modulation_txt"]); + auto single_stream_modulation = std::dynamic_pointer_cast(blocks["single_stream_modulation"]); + + stream_ctx.ds_img_mods = double_stream_modulation_img->forward(ctx, stream_ctx.vec); + stream_ctx.ds_txt_mods = double_stream_modulation_txt->forward(ctx, stream_ctx.vec); + stream_ctx.ss_mods = single_stream_modulation->forward(ctx, stream_ctx.vec); + } + + // Text normalization and projection + if (params.semantic_txt_norm) { + auto semantic_txt_norm = std::dynamic_pointer_cast(blocks["txt_norm"]); + txt = semantic_txt_norm->forward(ctx, txt); + } + stream_ctx.txt = txt_in->forward(ctx, txt); + + // Store PE + stream_ctx.pe = pe; + + stream_ctx.preprocessing_done = true; + stream_ctx.current_double_block = 0; + stream_ctx.current_single_block = 0; + } + + bool forward_double_block(GGMLRunnerContext* ctx, + StreamingContext& stream_ctx, + int block_idx) { + GGML_ASSERT(stream_ctx.preprocessing_done); + GGML_ASSERT(block_idx < params.depth); + + auto block = std::dynamic_pointer_cast(blocks["double_blocks." + std::to_string(block_idx)]); + auto img_txt = block->forward(ctx, stream_ctx.img, stream_ctx.txt, stream_ctx.vec, + stream_ctx.pe, stream_ctx.txt_img_mask, + stream_ctx.ds_img_mods, stream_ctx.ds_txt_mods); + stream_ctx.img = img_txt.first; + stream_ctx.txt = img_txt.second; + + stream_ctx.current_double_block = block_idx + 1; + if (stream_ctx.current_double_block >= params.depth) { + stream_ctx.double_blocks_done = true; + // Prepare for single blocks by concatenating txt and img + stream_ctx.txt_img = ggml_concat(ctx->ggml_ctx, stream_ctx.txt, stream_ctx.img, 1); + return true; + } + return false; + } + + bool forward_single_block(GGMLRunnerContext* ctx, + StreamingContext& stream_ctx, + int block_idx) { + GGML_ASSERT(stream_ctx.double_blocks_done); + GGML_ASSERT(block_idx < params.depth_single_blocks); + + auto block = std::dynamic_pointer_cast(blocks["single_blocks." 
+ std::to_string(block_idx)]); + stream_ctx.txt_img = block->forward(ctx, stream_ctx.txt_img, stream_ctx.vec, + stream_ctx.pe, stream_ctx.txt_img_mask, stream_ctx.ss_mods); + + stream_ctx.current_single_block = block_idx + 1; + if (stream_ctx.current_single_block >= params.depth_single_blocks) { + stream_ctx.single_blocks_done = true; + return true; + } + return false; + } + + ggml_tensor* forward_postprocessing(GGMLRunnerContext* ctx, + StreamingContext& stream_ctx) { + GGML_ASSERT(stream_ctx.single_blocks_done); + + auto final_layer = std::dynamic_pointer_cast(blocks["final_layer"]); + + // Extract img from txt_img + auto img = ggml_view_3d(ctx->ggml_ctx, + stream_ctx.txt_img, + stream_ctx.txt_img->ne[0], + stream_ctx.img->ne[1], + stream_ctx.txt_img->ne[2], + stream_ctx.txt_img->nb[1], + stream_ctx.txt_img->nb[2], + stream_ctx.txt->ne[1] * stream_ctx.txt_img->nb[1]); + + if (final_layer) { + img = final_layer->forward(ctx, img, stream_ctx.vec); + } + + return img; + } }; struct FluxRunner : public GGMLRunner { @@ -1188,6 +1509,12 @@ namespace Flux { SDVersion version; bool use_mask = false; + // Static layer cache decided on the first sampling step. -1 = not yet + // computed; 0..N = number of "double_blocks.X" / "single_blocks.X" + // blocks kept resident on GPU across sampling steps. + int resident_double_blocks_ = -1; + int resident_single_blocks_ = -1; + FluxRunner(ggml_backend_t backend, bool offload_params_to_cpu, const String2TensorStorage& tensor_storage_map = {}, @@ -1465,6 +1792,101 @@ namespace Flux { return gf; } + // Raw tensor build_graph used by streaming infrastructure + ggml_cgraph* build_graph(ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* c_concat, + ggml_tensor* y, + ggml_tensor* guidance, + std::vector ref_latents = {}, + bool increase_ref_index = false, + std::vector skip_layers = {}) { + GGML_ASSERT(x->ne[3] == 1); + ggml_cgraph* gf = new_graph_custom(FLUX_GRAPH_SIZE); + + x = to_backend(x); + timesteps = to_backend(timesteps); + context = to_backend(context); + c_concat = to_backend(c_concat); + y = to_backend(y); + guidance = to_backend(guidance); + for (auto& ref : ref_latents) { + ref = to_backend(ref); + } + + ggml_tensor* mod_index_arange = nullptr; + ggml_tensor* dct = nullptr; + + if (flux_params.is_chroma) { + if (!use_mask) { + y = nullptr; + } + mod_index_arange_vec = arange(0, 344); + mod_index_arange = ggml_new_tensor_1d(compute_ctx, GGML_TYPE_F32, mod_index_arange_vec.size()); + set_backend_tensor_data(mod_index_arange, mod_index_arange_vec.data()); + } + std::set txt_arange_dims; + if (sd_version_is_flux2(version)) { + txt_arange_dims = {3}; + increase_ref_index = true; + } else if (version == VERSION_OVIS_IMAGE) { + txt_arange_dims = {1, 2}; + } + + pe_vec = Rope::gen_flux_pe(static_cast(x->ne[1]), + static_cast(x->ne[0]), + flux_params.patch_size, + static_cast(x->ne[3]), + static_cast(context->ne[1]), + txt_arange_dims, + ref_latents, + increase_ref_index, + flux_params.ref_index_scale, + flux_params.theta, + circular_y_enabled, + circular_x_enabled, + flux_params.axes_dim); + int pos_len = static_cast(pe_vec.size() / flux_params.axes_dim_sum / 2); + auto pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, flux_params.axes_dim_sum / 2, pos_len); + set_backend_tensor_data(pe, pe_vec.data()); + + if (version == VERSION_CHROMA_RADIANCE) { + int patch_size = flux_params.patch_size; + int nerf_max_freqs = flux_params.chroma_radiance_params.nerf_max_freqs; + dct_vec = fetch_dct_pos(patch_size, nerf_max_freqs); + 
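+            // DCT position table: ne[0] = nerf_max_freqs^2 frequency coefficients
+            // for each of the ne[1] = patch_size^2 pixel positions in a patch.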
dct = ggml_new_tensor_2d(compute_ctx, GGML_TYPE_F32, nerf_max_freqs * nerf_max_freqs, patch_size * patch_size); + set_backend_tensor_data(dct, dct_vec.data()); + } + + auto runner_ctx = get_context(); + ggml_tensor* out = flux.forward(&runner_ctx, x, timesteps, context, c_concat, y, + guidance, pe, mod_index_arange, dct, ref_latents, skip_layers); + ggml_build_forward_expand(gf, out); + return gf; + } + + // Raw tensor compute used by streaming infrastructure + bool compute(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* c_concat, + ggml_tensor* y, + ggml_tensor* guidance, + std::vector ref_latents = {}, + bool increase_ref_index = false, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr, + std::vector skip_layers = {}, + bool skip_param_offload = false) { + auto get_graph = [&]() -> ggml_cgraph* { + return build_graph(x, timesteps, context, c_concat, y, guidance, + ref_latents, increase_ref_index, skip_layers); + }; + return GGMLRunner::compute(get_graph, n_threads, false, output, output_ctx, skip_param_offload); + } + sd::Tensor compute(int n_threads, const sd::Tensor& x, const sd::Tensor& timesteps, @@ -1584,6 +2006,534 @@ namespace Flux { LOG_INFO("flux model loaded"); flux->test(); } + + void enable_layer_streaming(const LayerStreaming::StreamingConfig& config = {}) { + std::map tensor_map; + flux.get_param_tensors(tensor_map, "model.diffusion_model"); + init_streaming(config, tensor_map, LayerStreaming::flux_layer_pattern); + LOG_INFO("%s layer streaming enabled with %zu layers", + get_desc().c_str(), streaming_engine_->get_registry().get_layer_count()); + } + + bool compute_streaming(int n_threads, + struct ggml_tensor* x, + struct ggml_tensor* timesteps, + struct ggml_tensor* context, + struct ggml_tensor* c_concat, + struct ggml_tensor* y, + struct ggml_tensor* guidance, + std::vector ref_latents = {}, + bool increase_ref_index = false, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr, + std::vector skip_layers = std::vector()) { + if (!is_streaming_enabled()) { + LOG_ERROR("%s streaming not enabled", get_desc().c_str()); + return false; + } + + int64_t t0 = ggml_time_ms(); + auto analysis = analyze_vram_budget(); + + if (analysis.fits_in_vram) { + LOG_INFO("%s model fits in VRAM, using coarse-stage streaming", get_desc().c_str()); + load_all_layers_coarse(); + + bool result = compute(n_threads, x, timesteps, context, c_concat, y, guidance, + ref_latents, increase_ref_index, output, output_ctx, + skip_layers, true); + + int64_t t1 = ggml_time_ms(); + LOG_INFO("%s coarse-stage streaming completed in %.2fs", get_desc().c_str(), (t1 - t0) / 1000.0); + free_compute_buffer(); + return result; + } + + LOG_INFO("%s remaining %.2f GB exceeds available %.2f GB, using per-layer streaming", + get_desc().c_str(), + analysis.remaining_to_load / (1024.0 * 1024.0 * 1024.0), + analysis.available_vram / (1024.0 * 1024.0 * 1024.0)); + + return compute_streaming_true(n_threads, x, timesteps, context, c_concat, y, guidance, + ref_latents, increase_ref_index, output, output_ctx, skip_layers); + } + + bool compute_streaming_true(int n_threads, + struct ggml_tensor* x, + struct ggml_tensor* timesteps, + struct ggml_tensor* context, + struct ggml_tensor* c_concat, + struct ggml_tensor* y, + struct ggml_tensor* guidance, + std::vector ref_latents, + bool increase_ref_index, + struct ggml_tensor** output, + struct ggml_context* output_ctx, + std::vector skip_layers) { + auto& registry = 
streaming_engine_->get_registry(); + int64_t t_start = ggml_time_ms(); + + const int num_double_blocks = flux_params.depth; + const int num_single_blocks = flux_params.depth_single_blocks; + LOG_INFO("TRUE per-layer streaming - %d double + %d single blocks", + num_double_blocks, num_single_blocks); + + // Load global layers (_global contains input projections, final_layer, etc) + LOG_DEBUG("Loading global layers"); + if (!registry.move_layer_to_gpu("_global")) { + LOG_ERROR("Failed to load _global to GPU"); + return false; + } + LOG_DEBUG("_global loaded successfully"); + + // Set up txt_arange_dims based on version + std::set txt_arange_dims; + if (sd_version_is_flux2(version)) { + txt_arange_dims = {3}; + increase_ref_index = true; + } else if (version == VERSION_OVIS_IMAGE) { + txt_arange_dims = {1, 2}; + } + + // Pre-generate PE + pe_vec = Rope::gen_flux_pe(static_cast(x->ne[1]), + static_cast(x->ne[0]), + flux_params.patch_size, + static_cast(x->ne[3]), + static_cast(context->ne[1]), + txt_arange_dims, + ref_latents, + increase_ref_index, + flux_params.ref_index_scale, + flux_params.theta, + circular_y_enabled, + circular_x_enabled, + flux_params.axes_dim); + + LOG_DEBUG("PE generated"); + + // Pre-generate mod_index_arange for Chroma + if (flux_params.is_chroma) { + mod_index_arange_vec.clear(); + for (int i = 0; i < 344; i++) { + mod_index_arange_vec.push_back(static_cast(i)); + } + } + + LOG_DEBUG("About to execute input stage"); + + // Persistent storage for intermediate tensors. Backed by a single + // GPU-pinned host buffer (via ensure_pinned_act_buffers) so the + // per-block ggml_backend_tensor_get / set_backend_tensor_data + // calls run at full PCIe bandwidth. Falls back to pageable + // std::vector if pinned alloc fails. + std::vector persistent_img_fallback; + std::vector persistent_txt_fallback; + std::vector persistent_vec_fallback; + std::vector persistent_txt_img_fallback; + float* persistent_img = nullptr; + float* persistent_txt = nullptr; + float* persistent_vec = nullptr; + float* persistent_txt_img = nullptr; + size_t persistent_img_count = 0; + size_t persistent_txt_count = 0; + size_t persistent_vec_count = 0; + size_t persistent_txt_img_count = 0; + int64_t img_ne[4], txt_ne[4], vec_ne[4], txt_img_ne[4]; + int64_t n_txt_tokens = 0; + int64_t n_img_tokens = 0; + + LOG_DEBUG("Executing input stage"); + { + ggml_tensor* img_output = nullptr; + ggml_tensor* txt_output = nullptr; + ggml_tensor* vec_output = nullptr; + + auto get_input_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(FLUX_GRAPH_SIZE / 4); + auto runner_ctx = get_context(); + + ggml_tensor* x_patched = DiT::pad_and_patchify(&runner_ctx, to_backend(x), + flux_params.patch_size, flux_params.patch_size); + n_img_tokens = x_patched->ne[1]; + + // Handle ref_latents + for (auto& ref : ref_latents) { + auto ref_patched = DiT::pad_and_patchify(&runner_ctx, to_backend(ref), + flux_params.patch_size, flux_params.patch_size); + x_patched = ggml_concat(compute_ctx, x_patched, ref_patched, 1); + } + + ggml_tensor* context_backend = to_backend(context); + ggml_tensor* timesteps_backend = to_backend(timesteps); + ggml_tensor* y_backend = y ? to_backend(y) : nullptr; + ggml_tensor* guidance_backend = guidance ? 
to_backend(guidance) : nullptr; + + ggml_tensor* mod_index_arange = nullptr; + if (flux_params.is_chroma && !mod_index_arange_vec.empty()) { + mod_index_arange = ggml_new_tensor_1d(compute_ctx, GGML_TYPE_F32, mod_index_arange_vec.size()); + set_backend_tensor_data(mod_index_arange, mod_index_arange_vec.data()); + } + + auto result = flux.forward_input_stage(&runner_ctx, x_patched, context_backend, + timesteps_backend, y_backend, guidance_backend, + mod_index_arange); + + img_output = result.img; + txt_output = result.txt; + vec_output = result.vec; + n_txt_tokens = result.n_txt_tokens; + + ggml_build_forward_expand(gf, img_output); + ggml_build_forward_expand(gf, txt_output); + ggml_build_forward_expand(gf, vec_output); + + return gf; + }; + + // Don't free compute buffer immediately - we need to read outputs first + if (!GGMLRunner::compute(get_input_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Input stage failed"); + return false; + } + + // Extract to persistent storage + if (img_output && txt_output && vec_output) { + size_t img_size = ggml_nelements(img_output); + size_t txt_size = ggml_nelements(txt_output); + size_t vec_size = ggml_nelements(vec_output); + // txt_img region is sized to hold the concatenated + // (txt + img) activations consumed by single blocks. + size_t txt_img_size = txt_size + img_size; + + persistent_img_count = img_size; + persistent_txt_count = txt_size; + persistent_vec_count = vec_size; + persistent_txt_img_count = txt_img_size; + + std::vector ptrs; + if (ensure_pinned_act_buffers({img_size * sizeof(float), + txt_size * sizeof(float), + vec_size * sizeof(float), + txt_img_size * sizeof(float)}, ptrs)) { + persistent_img = ptrs[0]; + persistent_txt = ptrs[1]; + persistent_vec = ptrs[2]; + persistent_txt_img = ptrs[3]; + } else { + persistent_img_fallback.resize(img_size); + persistent_txt_fallback.resize(txt_size); + persistent_vec_fallback.resize(vec_size); + persistent_txt_img_fallback.resize(txt_img_size); + persistent_img = persistent_img_fallback.data(); + persistent_txt = persistent_txt_fallback.data(); + persistent_vec = persistent_vec_fallback.data(); + persistent_txt_img = persistent_txt_img_fallback.data(); + } + + ggml_backend_tensor_get(img_output, persistent_img, 0, img_size * sizeof(float)); + ggml_backend_tensor_get(txt_output, persistent_txt, 0, txt_size * sizeof(float)); + ggml_backend_tensor_get(vec_output, persistent_vec, 0, vec_size * sizeof(float)); + + for (int i = 0; i < 4; i++) { + img_ne[i] = img_output->ne[i]; + txt_ne[i] = txt_output->ne[i]; + vec_ne[i] = vec_output->ne[i]; + } + } else { + LOG_ERROR("Failed to get input stage outputs"); + free_compute_buffer(); + return false; + } + + // Now safe to free compute buffer + free_compute_buffer(); + } + + LOG_DEBUG("Input stage done, img=%ldx%ldx%ld, txt=%ldx%ldx%ld", + img_ne[0], img_ne[1], img_ne[2], txt_ne[0], txt_ne[1], txt_ne[2]); + + auto double_name_at = [](int i) { return "double_blocks." 
+ std::to_string(i); }; + + if (resident_double_blocks_ < 0 && streaming_engine_) { + resident_double_blocks_ = streaming_engine_->compute_resident_block_count( + "double_blocks.0", num_double_blocks); + LOG_INFO("%s double_blocks cache: %d resident, %d streamed per step", + get_desc().c_str(), + resident_double_blocks_, + num_double_blocks - resident_double_blocks_); + } + + int double_prefetch_start = 0; + while (double_prefetch_start < num_double_blocks && + registry.is_layer_on_gpu(double_name_at(double_prefetch_start))) { + double_prefetch_start++; + } + if (streaming_engine_) { + streaming_engine_->prime_prefetch(double_name_at, double_prefetch_start, num_double_blocks); + } + + for (int block_idx = 0; block_idx < num_double_blocks; block_idx++) { + // Check skip_layers + if (skip_layers.size() > 0 && std::find(skip_layers.begin(), skip_layers.end(), block_idx) != skip_layers.end()) { + LOG_DEBUG("Skipping double_block %d", block_idx); + continue; + } + + std::string block_name = double_name_at(block_idx); + int64_t t_block_start = ggml_time_ms(); + + // Wait for this block's prefetch to complete (if async prefetch was started) + if (streaming_engine_) { + streaming_engine_->wait_for_prefetch(block_name); + } + + // Load this block's weights (sync load if prefetch didn't happen) + if (!registry.move_layer_to_gpu(block_name)) { + LOG_ERROR("Failed to load %s", block_name.c_str()); + return false; + } + + // Keep the prefetch window full + if (streaming_engine_) { + streaming_engine_->advance_prefetch(double_name_at, block_idx, num_double_blocks); + } + + ggml_tensor* img_out = nullptr; + ggml_tensor* txt_out = nullptr; + + auto get_block_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(FLUX_GRAPH_SIZE / 4); + + // Create input tensors from persistent storage + ggml_tensor* img_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, img_ne[0], img_ne[1], img_ne[2], img_ne[3]); + ggml_tensor* txt_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, txt_ne[0], txt_ne[1], txt_ne[2], txt_ne[3]); + ggml_tensor* vec_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, vec_ne[0], vec_ne[1], vec_ne[2], vec_ne[3]); + + img_in = to_backend(img_in); + txt_in = to_backend(txt_in); + vec_in = to_backend(vec_in); + + set_backend_tensor_data(img_in, persistent_img); + set_backend_tensor_data(txt_in, persistent_txt); + set_backend_tensor_data(vec_in, persistent_vec); + + // PE tensor + int pos_len = static_cast(pe_vec.size() / flux_params.axes_dim_sum / 2); + ggml_tensor* pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, flux_params.axes_dim_sum / 2, pos_len); + set_backend_tensor_data(pe, pe_vec.data()); + + std::vector ds_img_mods, ds_txt_mods; + auto runner_ctx = get_context(); + auto result = flux.forward_double_block(&runner_ctx, block_idx, img_in, txt_in, vec_in, pe, + nullptr, ds_img_mods, ds_txt_mods); + + img_out = result.first; + txt_out = result.second; + + ggml_build_forward_expand(gf, img_out); + ggml_build_forward_expand(gf, txt_out); + + return gf; + }; + + // Don't free compute buffer immediately - we need to read outputs first + if (!GGMLRunner::compute(get_block_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Double block %d execution failed", block_idx); + return false; + } + + // Extract outputs to persistent storage + if (img_out && txt_out) { + ggml_backend_tensor_get(img_out, persistent_img, 0, persistent_img_count * sizeof(float)); + ggml_backend_tensor_get(txt_out, persistent_txt, 0, persistent_txt_count * sizeof(float)); + 
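+                // Record the output shapes: the next block's mini-graph rebuilds
+                // its input tensors from img_ne/txt_ne, so they must track any
+                // shape change across blocks.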
+ for (int i = 0; i < 4; i++) { + img_ne[i] = img_out->ne[i]; + txt_ne[i] = txt_out->ne[i]; + } + } + + // Now safe to free compute buffer + free_compute_buffer(); + + // Resident blocks stay on GPU across sampling steps. + if (block_idx >= resident_double_blocks_) { + registry.move_layer_to_cpu(block_name); + } + + LOG_DEBUG("Double block %d/%d done (%.2fms)", + block_idx + 1, num_double_blocks, (ggml_time_ms() - t_block_start) / 1.0); + } + + { + // Concatenate txt and img into txt_img + size_t txt_img_size = persistent_txt_count + persistent_img_count; + // persistent_txt_img was already sized in ensure_pinned_act_buffers + // (txt_img region == txt_count + img_count). Just concat into it. + + // txt goes first, then img (along dimension 1) + // Since we store flattened, we need to handle this carefully + // txt: [hidden_size, n_txt_tokens, N] + // img: [hidden_size, n_img_tokens, N] + // txt_img: [hidden_size, n_txt_tokens + n_img_tokens, N] + std::copy(persistent_txt, persistent_txt + persistent_txt_count, persistent_txt_img); + std::copy(persistent_img, persistent_img + persistent_img_count, persistent_txt_img + persistent_txt_count); + + txt_img_ne[0] = img_ne[0]; // hidden_size + txt_img_ne[1] = txt_ne[1] + img_ne[1]; // n_txt_tokens + n_img_tokens + txt_img_ne[2] = img_ne[2]; // N + txt_img_ne[3] = 1; + } + + auto single_name_at = [](int i) { return "single_blocks." + std::to_string(i); }; + + if (resident_single_blocks_ < 0 && streaming_engine_) { + resident_single_blocks_ = streaming_engine_->compute_resident_block_count( + "single_blocks.0", num_single_blocks); + LOG_INFO("%s single_blocks cache: %d resident, %d streamed per step", + get_desc().c_str(), + resident_single_blocks_, + num_single_blocks - resident_single_blocks_); + } + + int single_prefetch_start = 0; + while (single_prefetch_start < num_single_blocks && + registry.is_layer_on_gpu(single_name_at(single_prefetch_start))) { + single_prefetch_start++; + } + if (streaming_engine_) { + streaming_engine_->prime_prefetch(single_name_at, single_prefetch_start, num_single_blocks); + } + + for (int block_idx = 0; block_idx < num_single_blocks; block_idx++) { + // Check skip_layers (single blocks start at depth offset) + int skip_idx = block_idx + flux_params.depth; + if (skip_layers.size() > 0 && std::find(skip_layers.begin(), skip_layers.end(), skip_idx) != skip_layers.end()) { + LOG_DEBUG("Skipping single_block %d", block_idx); + continue; + } + + std::string block_name = single_name_at(block_idx); + int64_t t_block_start = ggml_time_ms(); + + // Wait for this block's prefetch to complete (if async prefetch was started) + if (streaming_engine_) { + streaming_engine_->wait_for_prefetch(block_name); + } + + // Load this block's weights (sync load if prefetch didn't happen) + if (!registry.move_layer_to_gpu(block_name)) { + LOG_ERROR("Failed to load %s", block_name.c_str()); + return false; + } + + // Keep the prefetch window full + if (streaming_engine_) { + streaming_engine_->advance_prefetch(single_name_at, block_idx, num_single_blocks); + } + + ggml_tensor* txt_img_out = nullptr; + + auto get_block_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(FLUX_GRAPH_SIZE / 4); + + // Create input tensors + ggml_tensor* txt_img_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, + txt_img_ne[0], txt_img_ne[1], txt_img_ne[2], txt_img_ne[3]); + ggml_tensor* vec_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, vec_ne[0], vec_ne[1], vec_ne[2], vec_ne[3]); + + txt_img_in = to_backend(txt_img_in); + 
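+                // vec_in follows the same pattern as txt_img_in: move the metadata
+                // tensor to the runtime backend, then bind its pinned host region
+                // below so the upload happens when the mini-graph executes.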
vec_in = to_backend(vec_in); + + set_backend_tensor_data(txt_img_in, persistent_txt_img); + set_backend_tensor_data(vec_in, persistent_vec); + + // PE tensor + int pos_len = static_cast(pe_vec.size() / flux_params.axes_dim_sum / 2); + ggml_tensor* pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, flux_params.axes_dim_sum / 2, pos_len); + set_backend_tensor_data(pe, pe_vec.data()); + + std::vector ss_mods; + auto runner_ctx = get_context(); + txt_img_out = flux.forward_single_block(&runner_ctx, block_idx, txt_img_in, vec_in, pe, + nullptr, ss_mods); + + ggml_build_forward_expand(gf, txt_img_out); + + return gf; + }; + + // Don't free compute buffer immediately - we need to read outputs first + if (!GGMLRunner::compute(get_block_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Single block %d execution failed", block_idx); + return false; + } + + // Extract output to persistent storage + if (txt_img_out) { + ggml_backend_tensor_get(txt_img_out, persistent_txt_img, 0, persistent_txt_img_count * sizeof(float)); + + for (int i = 0; i < 4; i++) { + txt_img_ne[i] = txt_img_out->ne[i]; + } + } + + // Now safe to free compute buffer + free_compute_buffer(); + + // Resident blocks stay on GPU across sampling steps. + if (block_idx >= resident_single_blocks_) { + registry.move_layer_to_cpu(block_name); + } + + LOG_DEBUG("Single block %d/%d done (%.2fms)", + block_idx + 1, num_single_blocks, (ggml_time_ms() - t_block_start) / 1.0); + } + + LOG_DEBUG("Executing output stage"); + { + auto get_output_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(FLUX_GRAPH_SIZE / 4); + + ggml_tensor* txt_img_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, + txt_img_ne[0], txt_img_ne[1], txt_img_ne[2], txt_img_ne[3]); + ggml_tensor* vec_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, vec_ne[0], vec_ne[1], vec_ne[2], vec_ne[3]); + + txt_img_in = to_backend(txt_img_in); + vec_in = to_backend(vec_in); + + set_backend_tensor_data(txt_img_in, persistent_txt_img); + set_backend_tensor_data(vec_in, persistent_vec); + + auto runner_ctx = get_context(); + auto final_out = flux.forward_output_stage(&runner_ctx, txt_img_in, vec_in, n_img_tokens, n_txt_tokens); + + // Unpatchify + int64_t W = x->ne[0]; + int64_t H = x->ne[1]; + final_out = DiT::unpatchify_and_crop(compute_ctx, final_out, H, W, flux_params.patch_size, flux_params.patch_size); + + ggml_build_forward_expand(gf, final_out); + + return gf; + }; + + if (!GGMLRunner::compute(get_output_graph, n_threads, true, output, output_ctx, true)) { + LOG_ERROR("Output stage failed"); + return false; + } + } + + int64_t t_end = ggml_time_ms(); + LOG_INFO("TRUE per-layer streaming completed in %.2fs (%d double + %d single blocks)", + (t_end - t_start) / 1000.0, num_double_blocks, num_single_blocks); + + return true; + } + + private: + Flux::StreamingContext streaming_ctx_; }; } // namespace Flux diff --git a/src/ggml_extend.hpp b/src/ggml_extend.hpp index 362303229..5a999e8eb 100644 --- a/src/ggml_extend.hpp +++ b/src/ggml_extend.hpp @@ -29,6 +29,7 @@ #include "ggml_extend_backend.hpp" #include "ggml_graph_cut.h" +#include "layer_streaming.hpp" #include "model.h" #include "tensor.hpp" @@ -1721,6 +1722,7 @@ struct GGMLRunner { ggml_context* offload_ctx = nullptr; ggml_backend_buffer_t runtime_params_buffer = nullptr; bool params_on_runtime_backend = false; + bool auto_offload_after_compute = true; // If false, don't auto-offload in free_compute_buffer ggml_context* cache_ctx = nullptr; ggml_backend_buffer_t cache_buffer 
= nullptr; @@ -1728,11 +1730,20 @@ struct GGMLRunner { ggml_context* compute_ctx = nullptr; ggml_gallocr* compute_allocr = nullptr; + // Graph-cut segmented param offload (`--max-vram` budget): the executor + // streams only the params needed by the current sub-graph segment. ggml_context* partial_offload_ctx = nullptr; ggml_backend_buffer_t partial_runtime_params_buffer = nullptr; std::vector> partial_offload_pairs; size_t max_graph_vram_bytes = 0; + // GPU-pinned host buffer shared across the per-runner persistent + // activation regions used by layer-streaming compute paths (txt_img, + // t_emb, pe, vec, ...). Allocated lazily in ensure_pinned_act_buffers() + // and freed in ~GGMLRunner. + ggml_backend_buffer_t persistent_act_host_buf_ = nullptr; + size_t persistent_act_host_size_ = 0; + std::shared_ptr weight_adapter = nullptr; std::vector one_vec = {1.f}; @@ -1750,9 +1761,103 @@ struct GGMLRunner { bool circular_x_enabled = false; bool circular_y_enabled = false; + // Graph-cut planner state — caches the segment plan + the set of param + // tensors so the planner doesn't rebuild on every dispatch. sd::ggml_graph_cut::PlanCache graph_cut_plan_cache_; std::unordered_set params_tensor_set_; + // Layer-streaming engine: drives per-layer prefetch + dispatch when the + // runner is configured with `--offload-mode layer_streaming`. + std::unique_ptr streaming_engine_; + + using layer_pattern_fn_t = std::function(const std::string&)>; + + void init_streaming(const LayerStreaming::StreamingConfig& config, + const std::map& tensor_map, + layer_pattern_fn_t pattern_fn) { + if (!params_backend || !runtime_backend) { + LOG_WARN("%s cannot enable streaming without both CPU and GPU backends", get_desc().c_str()); + return; + } + if (!streaming_engine_) { + streaming_engine_ = std::make_unique( + runtime_backend, params_backend); + } + // set_max_graph_vram_bytes() may have been called before this point + // (it's set per-runner during model load, while the streaming engine + // is created lazily here). Apply the stored cap to the engine's + // budget so --max-vram works for our streaming planner too. + streaming_engine_->get_budget().set_max_vram_cap_bytes(max_graph_vram_bytes); + auto cfg = config; + cfg.enabled = true; + streaming_engine_->set_config(cfg); + streaming_engine_->register_model_layers_from_map(tensor_map, pattern_fn); + } + + struct StreamingVramAnalysis { + size_t total_model_size = 0; + size_t available_vram = 0; + size_t already_on_gpu = 0; + size_t remaining_to_load = 0; + bool fits_in_vram = false; + }; + + StreamingVramAnalysis analyze_vram_budget() { + StreamingVramAnalysis result = {}; + if (!streaming_engine_) return result; + + auto& registry = streaming_engine_->get_registry(); + auto& budget = streaming_engine_->get_budget(); + + auto all_layers = registry.get_layer_names_sorted(); + for (const auto& name : all_layers) { + result.total_model_size += registry.get_layer_size(name); + } + + // Subtract a compute-buffer reserve from available VRAM. The fits_in_vram + // decision picks coarse-stage (load all params resident) when params fit; + // without this reserve the planner ignores the runtime compute graph's + // alloc, which on tight caps (e.g. SDXL 1024x1024 with --max-vram 6) tips + // params + CB over the budget mid-step and crashes cudaMalloc. + size_t raw_available = budget.get_available_vram(); + size_t cb_reserve = budget.get_compute_buffer_reserve(); + result.available_vram = (raw_available > cb_reserve) ? 
(raw_available - cb_reserve) : 0; + + for (const auto& name : all_layers) { + if (registry.is_layer_on_gpu(name)) { + result.already_on_gpu += registry.get_layer_size(name); + } + } + + result.remaining_to_load = (result.total_model_size > result.already_on_gpu) + ? (result.total_model_size - result.already_on_gpu) : 0; + result.fits_in_vram = (result.remaining_to_load <= result.available_vram); + + LOG_DEBUG("%s model size = %.2f GB, on GPU = %.2f GB, remaining = %.2f GB, available VRAM = %.2f GB (CB reserve = %.2f GB)", + get_desc().c_str(), + result.total_model_size / (1024.0 * 1024.0 * 1024.0), + result.already_on_gpu / (1024.0 * 1024.0 * 1024.0), + result.remaining_to_load / (1024.0 * 1024.0 * 1024.0), + result.available_vram / (1024.0 * 1024.0 * 1024.0), + cb_reserve / (1024.0 * 1024.0 * 1024.0)); + + return result; + } + + bool load_all_layers_coarse() { + if (!streaming_engine_) return false; + auto& registry = streaming_engine_->get_registry(); + auto& budget = streaming_engine_->get_budget(); + auto all_layers = registry.get_layer_names_sorted(); + for (const auto& name : all_layers) { + if (!registry.is_layer_on_gpu(name)) { + budget.ensure_vram_for_layer(name, 0); + registry.move_layer_to_gpu(name); + } + } + return true; + } + template static sd::Tensor take_or_empty(std::optional> tensor) { if (!tensor.has_value()) { @@ -1888,6 +1993,11 @@ struct GGMLRunner { return gf; } + // Two-step compute graph + buffer setup. Upstream split alloc_compute_buffer + // into prepare_compute_graph + alloc_compute_buffer(gf) so the graph-cut + // planner can inspect the graph before reserving (it needs to know which + // params each segment touches). The old single-call form is preserved as + // an overload below for callers that don't need the inspection step. bool prepare_compute_graph(get_graph_cb_t get_graph, ggml_cgraph** gf_out) { GGML_ASSERT(gf_out != nullptr); @@ -1910,13 +2020,11 @@ struct GGMLRunner { compute_allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(runtime_backend)); if (!ggml_gallocr_reserve(compute_allocr, gf)) { - // failed to allocate the compute buffer LOG_ERROR("%s: failed to allocate the compute buffer\n", get_desc().c_str()); free_compute_buffer(); return false; } - // compute the required memory size_t compute_buffer_size = ggml_gallocr_get_buffer_size(compute_allocr, 0); LOG_DEBUG("%s compute buffer size: %.2f MB(%s)", get_desc().c_str(), @@ -1925,6 +2033,29 @@ struct GGMLRunner { return true; } + // Backward-compatible single-call overload. Used by the layer-streaming + // path which doesn't need to re-inspect the graph before allocating; it + // wraps prepare_compute_graph + alloc_compute_buffer(gf) and returns the + // built graph via *out_gf so the caller can reuse it for the subsequent + // ggml_gallocr_alloc_graph() pass (avoids tensor pointer mismatches). 
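+    // Call-pattern sketch (illustrative; `get_graph` is any callback matching
+    // get_graph_cb_t):
+    //
+    //   // graph-cut path: inspect the graph before reserving
+    //   ggml_cgraph* gf = nullptr;
+    //   prepare_compute_graph(get_graph, &gf);
+    //   /* ...plan segments over gf... */
+    //   alloc_compute_buffer(gf);
+    //
+    //   // streaming path: single call, keep gf for the later alloc pass
+    //   ggml_cgraph* gf2 = nullptr;
+    //   alloc_compute_buffer(get_graph, &gf2);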
+ bool alloc_compute_buffer(get_graph_cb_t get_graph, struct ggml_cgraph** out_gf = nullptr) { + if (compute_allocr != nullptr) { + if (out_gf) *out_gf = nullptr; + return true; + } + ggml_cgraph* gf = nullptr; + if (!prepare_compute_graph(get_graph, &gf)) { + if (out_gf) *out_gf = nullptr; + return false; + } + if (!alloc_compute_buffer(gf)) { + if (out_gf) *out_gf = nullptr; + return false; + } + if (out_gf) *out_gf = gf; + return true; + } + void free_cache_buffer() { if (cache_buffer != nullptr) { ggml_backend_buffer_free(cache_buffer); @@ -2015,29 +2146,44 @@ struct GGMLRunner { return true; } - void copy_data_to_backend_tensor(ggml_cgraph* gf, bool clear_after_copy = true) { - GGML_ASSERT(gf != nullptr); + // Upload entries from backend_tensor_data_map to their backend tensors. + // When a graph is supplied, only tensors that appear in the graph are + // uploaded (graph-cut needs this so segment-N inputs aren't touched + // outside their segment); otherwise every entry is uploaded + // unconditionally, which is what the layer-streaming dispatch path + // wants since each layer's mini-graph carries only its own inputs. + void copy_data_to_backend_tensor(ggml_cgraph* gf = nullptr, bool clear_after_copy = true) { std::unordered_set graph_tensor_set; - const int n_leafs = sd::ggml_graph_cut::leaf_count(gf); - const int n_nodes = ggml_graph_n_nodes(gf); - graph_tensor_set.reserve(static_cast(n_leafs + n_nodes)); - for (int i = 0; i < n_leafs; ++i) { - graph_tensor_set.insert(sd::ggml_graph_cut::leaf_tensor(gf, i)); - } - for (int i = 0; i < n_nodes; ++i) { - graph_tensor_set.insert(ggml_graph_node(gf, i)); + if (gf != nullptr) { + const int n_leafs = sd::ggml_graph_cut::leaf_count(gf); + const int n_nodes = ggml_graph_n_nodes(gf); + graph_tensor_set.reserve(static_cast(n_leafs + n_nodes)); + for (int i = 0; i < n_leafs; ++i) { + graph_tensor_set.insert(sd::ggml_graph_cut::leaf_tensor(gf, i)); + } + for (int i = 0; i < n_nodes; ++i) { + graph_tensor_set.insert(ggml_graph_node(gf, i)); + } } + int copied_count = 0; + int skipped_count = 0; + for (auto& kv : backend_tensor_data_map) { auto tensor = kv.first; auto data = kv.second; - if (graph_tensor_set.find(tensor) == graph_tensor_set.end()) { + if (gf != nullptr && graph_tensor_set.find(tensor) == graph_tensor_set.end()) { continue; } ggml_backend_buffer_t buf = tensor->view_src ? tensor->view_src->buffer : tensor->buffer; if (buf == nullptr) { + // Either an input the graph didn't actually allocate, or a + // genuine missing-buffer bug. Log once with enough context + // to debug; treat as skip rather than crash so layer streaming + // (which adds inputs that may go unused in some sub-graphs) + // doesn't trip on benign cases. LOG_WARN("%s graph exec skip tensor copy: name=%s op=%s reason=buffer_not_set data=%p view_src=%p view_src_buffer=%p", get_desc().c_str(), tensor && tensor->name[0] != '\0' ? tensor->name : "", @@ -2045,10 +2191,16 @@ struct GGMLRunner { data, tensor ? tensor->view_src : nullptr, (tensor && tensor->view_src) ? tensor->view_src->buffer : nullptr); + skipped_count++; continue; } ggml_backend_tensor_set(tensor, data, 0, ggml_nbytes(tensor)); + copied_count++; + } + + if (copied_count > 0 || skipped_count > 0) { + LOG_DEBUG("copy_data_to_backend_tensor: copied %d tensors, skipped %d", copied_count, skipped_count); } if (clear_after_copy) { @@ -2539,6 +2691,21 @@ struct GGMLRunner { virtual ~GGMLRunner() { free_params_buffer(); + // Also free the runtime-side weight buffers if allocated. 
free_params_buffer() + // only releases the CPU-side params_buffer; the runtime backend can hold up to + // two more buffers (full + partial) that need explicit cleanup here. + if (runtime_params_buffer != nullptr) { + ggml_backend_buffer_free(runtime_params_buffer); + runtime_params_buffer = nullptr; + } + if (partial_runtime_params_buffer != nullptr) { + ggml_backend_buffer_free(partial_runtime_params_buffer); + partial_runtime_params_buffer = nullptr; + } + if (persistent_act_host_buf_ != nullptr) { + ggml_backend_buffer_free(persistent_act_host_buf_); + persistent_act_host_buf_ = nullptr; + } free_compute_buffer(); free_params_ctx(); free_compute_ctx(); @@ -2548,6 +2715,57 @@ struct GGMLRunner { free_cache_ctx_and_buffer(); } + // Allocates (or grows) a single GPU-pinned host buffer that backs all the + // runner's persistent activation regions for streaming compute paths, and + // writes 256-byte-aligned start pointers for each region into out_ptrs + // (same length as sizes_bytes). Pinned host memory makes the per-layer + // ggml_backend_tensor_get / copy_data_to_backend_tensor calls run at + // full PCIe bandwidth instead of staging through CUDA's bounce buffer. + // + // Returns true on success. On failure (pinned alloc rejected by the + // backend, e.g. out of locked pages) returns false so the caller can + // fall back to pageable std::vector storage — output is still correct, + // just slower. + bool ensure_pinned_act_buffers(const std::vector& sizes_bytes, + std::vector& out_ptrs) { + out_ptrs.assign(sizes_bytes.size(), nullptr); + const size_t align = 256; + std::vector aligned_sizes(sizes_bytes.size()); + size_t total = 0; + for (size_t i = 0; i < sizes_bytes.size(); i++) { + aligned_sizes[i] = ((sizes_bytes[i] + align - 1) / align) * align; + total += aligned_sizes[i]; + } + + if (persistent_act_host_buf_ == nullptr || persistent_act_host_size_ < total) { + if (persistent_act_host_buf_ != nullptr) { + ggml_backend_buffer_free(persistent_act_host_buf_); + persistent_act_host_buf_ = nullptr; + } + ggml_backend_dev_t gpu_dev = runtime_backend ? ggml_backend_get_device(runtime_backend) : nullptr; + ggml_backend_buffer_type_t host_buft = gpu_dev ? ggml_backend_dev_host_buffer_type(gpu_dev) : nullptr; + if (host_buft != nullptr) { + persistent_act_host_buf_ = ggml_backend_buft_alloc_buffer(host_buft, total); + } + if (persistent_act_host_buf_ == nullptr) { + LOG_WARN("%s pinned activation buffer alloc failed (%.2f MB), " + "falling back to pageable", + get_desc().c_str(), total / (1024.0 * 1024.0)); + persistent_act_host_size_ = 0; + return false; + } + persistent_act_host_size_ = total; + } + + char* base = static_cast(ggml_backend_buffer_get_base(persistent_act_host_buf_)); + size_t offset = 0; + for (size_t i = 0; i < sizes_bytes.size(); i++) { + out_ptrs[i] = reinterpret_cast(base + offset); + offset += aligned_sizes[i]; + } + return true; + } + virtual GGMLRunnerContext get_context() { GGMLRunnerContext runner_ctx; runner_ctx.ggml_ctx = compute_ctx; @@ -2567,7 +2785,34 @@ struct GGMLRunner { bool alloc_params_buffer() { size_t num_tensors = ggml_tensor_num(params_ctx); - params_buffer = ggml_backend_alloc_ctx_tensors(params_ctx, params_backend); + bool used_pinned_host = false; + + // When weights live on CPU but get streamed/transferred to GPU during + // compute, allocate them in the GPU device's pinned host buffer so + // async H2D copies actually overlap with compute. 
Without pinning, + // CUDA falls back to a staged sync copy through an internal bounce + // buffer (and Vulkan/Metal hit similar slow paths). + if (params_backend != runtime_backend && ggml_backend_is_cpu(params_backend)) { + ggml_backend_dev_t gpu_dev = ggml_backend_get_device(runtime_backend); + if (gpu_dev != nullptr) { + ggml_backend_buffer_type_t host_buft = ggml_backend_dev_host_buffer_type(gpu_dev); + if (host_buft != nullptr) { + params_buffer = ggml_backend_alloc_ctx_tensors_from_buft(params_ctx, host_buft); + if (params_buffer != nullptr) { + used_pinned_host = true; + } else { + LOG_WARN("%s pinned host alloc failed (system out of locked pages?), " + "falling back to pageable", + get_desc().c_str()); + } + } + } + } + + if (params_buffer == nullptr) { + params_buffer = ggml_backend_alloc_ctx_tensors(params_ctx, params_backend); + } + if (params_buffer == nullptr) { LOG_ERROR("%s alloc params backend buffer failed, num_tensors = %i", get_desc().c_str(), @@ -2577,15 +2822,20 @@ struct GGMLRunner { rebuild_params_tensor_set(); ggml_backend_buffer_set_usage(params_buffer, GGML_BACKEND_BUFFER_USAGE_WEIGHTS); size_t params_buffer_size = ggml_backend_buffer_get_size(params_buffer); - LOG_DEBUG("%s params backend buffer size = % 6.2f MB(%s) (%i tensors)", + LOG_DEBUG("%s params backend buffer size = % 6.2f MB(%s%s) (%i tensors)", get_desc().c_str(), params_buffer_size / (1024.f * 1024.f), ggml_backend_is_cpu(params_backend) ? "RAM" : "VRAM", + used_pinned_host ? ",pinned" : "", num_tensors); return true; } void free_params_buffer() { + // If params are on GPU, move them back to CPU first (this also frees runtime_params_buffer) + if (params_on_runtime_backend) { + restore_all_params(); + } if (params_buffer != nullptr) { ggml_backend_buffer_free(params_buffer); params_buffer = nullptr; @@ -2599,6 +2849,128 @@ struct GGMLRunner { return 0; } + // Estimate compute buffer size without actually allocating (dry-run) + // Returns 0 on failure, otherwise the required buffer size in bytes + size_t estimate_compute_buffer_size(get_graph_cb_t get_graph) { + reset_compute_ctx(); + struct ggml_cgraph* gf = get_compute_graph(get_graph); + backend_tensor_data_map.clear(); + + ggml_gallocr_t temp_allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(runtime_backend)); + if (temp_allocr == nullptr) { + return 0; + } + + size_t result = 0; + if (ggml_gallocr_reserve(temp_allocr, gf)) { + result = ggml_gallocr_get_buffer_size(temp_allocr, 0); + } + + ggml_gallocr_free(temp_allocr); + reset_compute_ctx(); // Clean up after estimation + return result; + } + + // Dynamic tensor offloading API + // Returns true if params are currently on the runtime (GPU) backend + bool is_params_on_gpu() const { + // If params_backend == runtime_backend, params are always "on GPU" + // (or always on CPU if CPU-only mode) + if (params_backend == runtime_backend) { + return !ggml_backend_is_cpu(runtime_backend); + } + // Otherwise check the offload state + return params_on_runtime_backend; + } + + // Move params from GPU to CPU (params_backend), freeing GPU memory + // Returns true on success, false if already on CPU or not applicable + bool move_params_to_cpu() { + if (params_backend == runtime_backend) { + // No separate CPU backend configured, can't offload + return false; + } + if (!params_on_runtime_backend) { + // Already on CPU + return true; + } + restore_all_params(); + return true; + } + + // Move params from CPU to GPU (runtime_backend), allocating GPU memory + // Returns true on success, false if already 
on GPU or allocation failed + bool move_params_to_gpu() { + if (params_backend == runtime_backend) { + // No separate CPU backend, params are always on runtime backend + return true; + } + if (params_on_runtime_backend) { + // Already on GPU + return true; + } + return offload_all_params(); + } + + // Get the size of params buffer (VRAM usage when on GPU) + size_t get_params_vram_size() const { + if (params_buffer != nullptr) { + return ggml_backend_buffer_get_size(params_buffer); + } + return 0; + } + + // Control automatic offloading after compute operations + // When disabled, params stay on GPU until explicitly moved via move_params_to_cpu() + void set_auto_offload(bool enabled) { + auto_offload_after_compute = enabled; + } + + bool get_auto_offload() const { + return auto_offload_after_compute; + } + + bool is_streaming_enabled() const { + return streaming_engine_ && streaming_engine_->get_config().enabled; + } + + void disable_layer_streaming() { + if (streaming_engine_) { + auto cfg = streaming_engine_->get_config(); + cfg.enabled = false; + streaming_engine_->set_config(cfg); + } + } + + void offload_streaming_layers() { + if (!streaming_engine_) return; + auto& registry = streaming_engine_->get_registry(); + auto layers = registry.get_layer_names_sorted(); + size_t offloaded = 0; + for (const auto& layer : layers) { + if (registry.is_layer_on_gpu(layer)) { + registry.move_layer_to_cpu(layer); + offloaded++; + } + } + if (offloaded > 0) { + LOG_INFO("%s offloaded %zu streaming layers to CPU", get_desc().c_str(), offloaded); + } + // Hook: runners can drop any cached state that referenced the resident + // layers (e.g. ZImageRunner's Phase 4 chunk graph), since those tensors + // have just been moved to CPU. + on_streaming_layers_offloaded(); + } + + // Override in subclasses to release any cached state tied to the + // streaming layers' GPU residency (e.g. cached chunk graphs whose ops + // reference the now-evicted weight tensors). + virtual void on_streaming_layers_offloaded() {} + + LayerStreaming::LayerExecutionEngine* get_streaming_engine() { + return streaming_engine_.get(); + } + void free_cache_ctx_and_buffer() { free_cache_buffer(); free_cache_ctx(); @@ -2609,8 +2981,16 @@ struct GGMLRunner { ggml_gallocr_free(compute_allocr); compute_allocr = nullptr; } + // Graph-cut path: undo any per-segment partial offload so the next + // compute starts fresh. Both restore_* calls are no-ops if not active. restore_partial_params(); restore_all_params(); + // Layer-streaming / offload-mode path: when the runner has been told + // to drop params back to the params backend after each compute (e.g. + // cond_diffusion / aggressive modes), do that here. 
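+        // restore_all_params() is a no-op when params already live on the
+        // params backend, so reaching it twice on this path (once above for
+        // graph-cut, once here) is harmless.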
+ if (auto_offload_after_compute) { + restore_all_params(); + } } // do copy after alloc graph @@ -2669,6 +3049,69 @@ struct GGMLRunner { return ggml_get_tensor(cache_ctx, name.c_str()); } + // Our fork's compute overload with output tensor and skip_param_offload support + bool compute(get_graph_cb_t get_graph, + int n_threads, + bool free_compute_buffer_immediately = true, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr, + bool skip_param_offload = false) { + // In streaming mode, weights are managed by the streaming engine + // so skip the bulk offload which would fail due to VRAM limits + if (!skip_param_offload && !offload_all_params()) { + LOG_ERROR("%s offload params to runtime backend failed", get_desc().c_str()); + return false; + } + + ggml_cgraph* gf = nullptr; + if (!alloc_compute_buffer(get_graph, &gf)) { + LOG_ERROR("%s alloc compute buffer failed", get_desc().c_str()); + return false; + } + // If alloc_compute_buffer just created a new allocator, gf contains the graph + // used for reservation and we MUST reuse it (same tensor pointers). + // If allocator already existed, gf is nullptr and we need to rebuild. + if (gf == nullptr) { + backend_tensor_data_map.clear(); + reset_compute_ctx(); + gf = get_compute_graph(get_graph); + } + + if (!ggml_gallocr_alloc_graph(compute_allocr, gf)) { + LOG_ERROR("%s alloc compute graph failed", get_desc().c_str()); + return false; + } + copy_data_to_backend_tensor(); + if (ggml_backend_is_cpu(runtime_backend)) { + ggml_backend_cpu_set_n_threads(runtime_backend, n_threads); + } + + ggml_status status = ggml_backend_graph_compute(runtime_backend, gf); + if (status != GGML_STATUS_SUCCESS) { + LOG_ERROR("%s compute failed: %s", get_desc().c_str(), ggml_status_to_string(status)); + return false; + } +#ifdef GGML_PERF + ggml_graph_print(gf); +#endif + copy_cache_tensors_to_cache_buffer(); + if (output != nullptr) { + auto result = ggml_get_tensor(compute_ctx, final_result_name.c_str()); + if (*output == nullptr && output_ctx != nullptr) { + *output = ggml_dup_tensor(output_ctx, result); + } + if (*output != nullptr) { + ggml_ext_backend_tensor_get_and_sync(runtime_backend, result, (*output)->data, 0, ggml_nbytes(*output)); + } + } + + if (free_compute_buffer_immediately) { + free_compute_buffer(); + } + return true; + } + + // Upstream's templated compute returning sd::Tensor template std::optional> compute(get_graph_cb_t get_graph, int n_threads, @@ -2680,6 +3123,10 @@ struct GGMLRunner { } GGML_ASSERT(gf != nullptr); + // Try the graph-cut segmented path first when --max-vram is set and + // params live on a different backend than the runtime. The planner + // may decide a single segment is enough, in which case we fall + // through to the regular alloc + execute path below. if (can_attempt_graph_cut_segmented_compute()) { GraphCutPlan plan; if (!resolve_graph_cut_plan(gf, &plan)) { @@ -2725,6 +3172,12 @@ struct GGMLRunner { void set_max_graph_vram_bytes(size_t max_vram_bytes) { max_graph_vram_bytes = max_vram_bytes; + // Forward to the layer-streaming budget too, so --max-vram caps both + // the graph-cut planner (above) and our streaming planner. Lets a + // single flag drive the simulated-smaller-card case for both paths. 
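+        // Illustrative invocation (the byte-count form of the flag value is
+        // an assumption for this example):
+        //   sd-cli ... --offload-mode layer_streaming --max-vram 4294967296
+        // would make both the graph-cut planner and the streaming budget
+        // behave as if only 4 GB of VRAM existed.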
+ if (streaming_engine_) { + streaming_engine_->get_budget().set_max_vram_cap_bytes(max_vram_bytes); + } } ggml_backend_t get_runtime_backend() { diff --git a/src/layer_streaming.hpp b/src/layer_streaming.hpp new file mode 100644 index 000000000..be7a30b72 --- /dev/null +++ b/src/layer_streaming.hpp @@ -0,0 +1,513 @@ +#ifndef __LAYER_STREAMING_HPP__ +#define __LAYER_STREAMING_HPP__ + +#include +#include +#include +#include +#include +#include + +#include "ggml-alloc.h" +#include "ggml-backend.h" +#include "ggml.h" + +#include "memory_budget.hpp" +#include "tensor_registry.hpp" +#include "util.h" + +namespace LayerStreaming { + +class LayerExecutionEngine; + +struct LayerSubgraph { + std::string name; + int index; + size_t estimated_compute_size = 0; + + using ExecuteFn = std::function( + ggml_context* ctx, + ggml_backend_t backend, + const std::vector& inputs)>; + + ExecuteFn execute_fn; +}; + +struct StreamingConfig { + bool enabled = false; + int prefetch_layers = 1; + int keep_layers_behind = 0; + size_t min_free_vram = 512 * 1024 * 1024; + bool async_prefetch = true; + bool log_operations = false; +}; + +class IntermediateTensorManager { +public: + IntermediateTensorManager(ggml_backend_t gpu_backend) + : gpu_backend_(gpu_backend) {} + + ~IntermediateTensorManager() { + clear(); + } + + ggml_tensor* store(const std::string& name, ggml_tensor* tensor) { + if (contexts_.find(name) != contexts_.end()) { + if (buffers_.find(name) != buffers_.end()) { + ggml_backend_buffer_free(buffers_[name]); + } + ggml_free(contexts_[name]); + } + + size_t ctx_size = ggml_tensor_overhead() + 1024; + struct ggml_init_params params = { + ctx_size, + nullptr, + true // no_alloc + }; + ggml_context* ctx = ggml_init(params); + if (ctx == nullptr) { + LOG_ERROR("failed to create context for '%s'", name.c_str()); + return nullptr; + } + + ggml_tensor* stored = ggml_dup_tensor(ctx, tensor); + ggml_set_name(stored, name.c_str()); + + ggml_backend_buffer_t buffer = ggml_backend_alloc_ctx_tensors(ctx, gpu_backend_); + if (buffer == nullptr) { + LOG_ERROR("failed to allocate buffer for '%s'", name.c_str()); + ggml_free(ctx); + return nullptr; + } + + ggml_backend_tensor_copy(tensor, stored); + ggml_backend_synchronize(gpu_backend_); + + contexts_[name] = ctx; + buffers_[name] = buffer; + tensors_[name] = stored; + + return stored; + } + + ggml_tensor* get(const std::string& name) { + auto it = tensors_.find(name); + if (it == tensors_.end()) { + return nullptr; + } + return it->second; + } + + bool has(const std::string& name) const { + return tensors_.find(name) != tensors_.end(); + } + + void remove(const std::string& name) { + auto buf_it = buffers_.find(name); + if (buf_it != buffers_.end()) { + ggml_backend_buffer_free(buf_it->second); + buffers_.erase(buf_it); + } + + auto ctx_it = contexts_.find(name); + if (ctx_it != contexts_.end()) { + ggml_free(ctx_it->second); + contexts_.erase(ctx_it); + } + + tensors_.erase(name); + } + + void clear() { + for (auto& [name, buffer] : buffers_) { + ggml_backend_buffer_free(buffer); + } + for (auto& [name, ctx] : contexts_) { + ggml_free(ctx); + } + tensors_.clear(); + buffers_.clear(); + contexts_.clear(); + } + + size_t get_memory_usage() const { + size_t total = 0; + for (const auto& [name, buffer] : buffers_) { + total += ggml_backend_buffer_get_size(buffer); + } + return total; + } + +private: + ggml_backend_t gpu_backend_; + std::unordered_map contexts_; + std::unordered_map buffers_; + std::unordered_map tensors_; +}; + +class LayerExecutionEngine { +public: + 
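+    // Typical lifecycle, as a sketch (the backend handles, tensor map and
+    // layer-pattern callback are placeholders for the example):
+    //
+    //   LayerStreaming::LayerExecutionEngine engine(gpu_backend, cpu_backend);
+    //   LayerStreaming::StreamingConfig cfg;
+    //   cfg.enabled         = true;
+    //   cfg.prefetch_layers = 2;
+    //   engine.set_config(cfg);
+    //   engine.register_model_layers_from_map(tensors, layer_pattern_fn);
+    //   auto outs = engine.execute_streaming(layers, initial_inputs, out_ctx);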
LayerExecutionEngine(ggml_backend_t gpu_backend, + ggml_backend_t cpu_backend) + : gpu_backend_(gpu_backend), + cpu_backend_(cpu_backend), + registry_(gpu_backend, cpu_backend), + budget_(registry_, gpu_backend), + intermediates_(gpu_backend) {} + + void set_config(const StreamingConfig& config) { + config_ = config; + } + + const StreamingConfig& get_config() const { + return config_; + } + + TensorRegistry& get_registry() { + return registry_; + } + + MemoryBudgetManager& get_budget() { + return budget_; + } + + // Prefer register_model_layers_from_map() - context tensors often lack proper names + void register_model_layers(ggml_context* params_ctx, + std::function(const std::string&)> layer_pattern_fn) { + registry_.register_from_context(params_ctx, "", layer_pattern_fn); + log_registered_layers(); + } + + void register_model_layers_from_map(const std::map& tensors, + std::function(const std::string&)> layer_pattern_fn) { + registry_.register_from_map(tensors, layer_pattern_fn); + log_registered_layers(); + } + +private: + void log_registered_layers() { + if (config_.log_operations) { + auto layers = registry_.get_layer_names_sorted(); + LOG_INFO("registered %zu layers", layers.size()); + for (const auto& layer : layers) { + LOG_DEBUG(" - %s: %.2f MB", + layer.c_str(), + registry_.get_layer_size(layer) / (1024.0 * 1024.0)); + } + } + } + +public: + + std::vector execute_streaming( + const std::vector& layers, + const std::vector& initial_inputs, + ggml_context* output_ctx) { + + if (!config_.enabled || layers.empty()) { + LOG_WARN("streaming disabled or no layers"); + return {}; + } + + int64_t total_start = ggml_time_ms(); + std::vector current_inputs = initial_inputs; + + for (size_t i = 0; i < layers.size(); i++) { + const auto& layer = layers[i]; + int64_t layer_start = ggml_time_ms(); + + if (!ensure_layer_loaded(layer.name, static_cast(i))) { + LOG_ERROR("failed to load layer '%s'", layer.name.c_str()); + return {}; + } + + if (config_.async_prefetch) { + for (int j = 1; j <= config_.prefetch_layers && i + j < layers.size(); j++) { + prefetch_layer(layers[i + j].name); + } + } + + ggml_context* layer_ctx = create_layer_context(layer); + if (layer_ctx == nullptr) { + LOG_ERROR("failed to create context for layer '%s'", layer.name.c_str()); + return {}; + } + + std::vector outputs = layer.execute_fn(layer_ctx, gpu_backend_, current_inputs); + + for (size_t j = 0; j < outputs.size(); j++) { + std::string name = "intermediate_" + std::to_string(i) + "_" + std::to_string(j); + ggml_tensor* stored = intermediates_.store(name, outputs[j]); + if (stored != nullptr) { + outputs[j] = stored; + } + } + + if (should_offload_layer(layer.name, static_cast(i), layers)) { + registry_.move_layer_to_cpu(layer.name); + } + + ggml_free(layer_ctx); + + current_inputs = outputs; + + if (config_.log_operations) { + int64_t layer_end = ggml_time_ms(); + LOG_DEBUG("executed layer '%s' in %.2fs", + layer.name.c_str(), + (layer_end - layer_start) / 1000.0); + } + } + + int64_t total_end = ggml_time_ms(); + if (config_.log_operations) { + LOG_INFO("executed %zu layers in %.2fs", + layers.size(), + (total_end - total_start) / 1000.0); + } + + return current_inputs; + } + + void clear() { + intermediates_.clear(); + } + + // Clears everything including registry (for new model) + void reset() { + intermediates_.clear(); + registry_.clear(); + } + + void prefetch_layer(const std::string& layer_name) { + if (!config_.async_prefetch) { + return; + } + + if (registry_.is_layer_on_gpu(layer_name)) { + return; + } + 
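+        // Deduplicate: if an async load for this layer is already in flight,
+        // leave it alone — wait_for_prefetch() will complete it later.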
+ if (pending_prefetches_.find(layer_name) != pending_prefetches_.end()) { + return; + } + + if (registry_.start_async_layer_load(layer_name, gpu_backend_, cpu_backend_)) { + pending_prefetches_.insert(layer_name); + if (config_.log_operations) { + LOG_DEBUG("started async prefetch for '%s'", layer_name.c_str()); + } + } + } + + void wait_for_prefetch(const std::string& layer_name) { + auto it = pending_prefetches_.find(layer_name); + if (it == pending_prefetches_.end()) { + return; + } + + if (registry_.complete_async_layer_load(layer_name, gpu_backend_)) { + pending_prefetches_.erase(it); + if (config_.log_operations) { + LOG_DEBUG("completed async prefetch for '%s'", layer_name.c_str()); + } + } + } + + void wait_for_all_prefetches() { + for (const auto& layer_name : pending_prefetches_) { + registry_.complete_async_layer_load(layer_name, gpu_backend_); + } + pending_prefetches_.clear(); + } + + bool is_prefetch_pending(const std::string& layer_name) const { + return pending_prefetches_.find(layer_name) != pending_prefetches_.end(); + } + + // Decides how many blocks to keep permanently resident on GPU for a + // section of the model (e.g. all "layers.N" or all "double_blocks.N"). + // Static partition follows ComfyUI's partially_load() — for the cyclic + // sequential access pattern of diffusion sampling, caching a fixed + // prefix is simpler and faster than dynamic eviction. Caller is + // responsible for storing the result and only computing it once per + // section so that consecutive calls inside the same generation see a + // consistent VRAM budget. + // + // sample_block_name should be a real block in the section (e.g. + // "layers.0") so per-block size can be measured. compute_buffer_reserve + // should be set per-runner to the peak compute buffer observed during + // a single block forward pass. + int compute_resident_block_count(const std::string& sample_block_name, + int num_blocks, + size_t compute_buffer_reserve = 768ULL * 1024 * 1024) { + if (num_blocks <= 0) { + return 0; + } + + size_t per_block = registry_.get_layer_size(sample_block_name); + if (per_block == 0) { + return 0; + } + + // Headroom: prefetch window in flight + the active block + the + // upcoming compute buffer + a hard safety margin. Without this + // slack the next prefetch's cudaMalloc can fail mid-loop. + int prefetch_count = std::max(1, config_.prefetch_layers); + size_t prefetch_reserve = static_cast(prefetch_count + 1) * per_block; + size_t safety = std::max(config_.min_free_vram, 512ULL * 1024 * 1024); + size_t reserved = prefetch_reserve + safety + compute_buffer_reserve; + + size_t free_vram = budget_.get_free_vram(); + if (free_vram <= reserved) { + return 0; + } + size_t available = free_vram - reserved; + int max_resident = static_cast(available / per_block); + return std::min(num_blocks, max_resident); + } + + // Prime the prefetch pipeline by kicking off transfers for the first + // prefetch_layers blocks starting at start_idx. Call once before the + // streaming loop. name_for(i) -> the registry key for block i. + void prime_prefetch(const std::function& name_for, + int start_idx, int num_blocks) { + int n = config_.prefetch_layers > 0 ? config_.prefetch_layers : 1; + for (int j = 0; j < n && (start_idx + j) < num_blocks; j++) { + prefetch_layer(name_for(start_idx + j)); + } + } + + // After moving block current_idx to GPU, kick off prefetch of the slot + // (current_idx + prefetch_layers) so the window stays full. 
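+    // Worked example with prefetch_layers = 2 and 10 blocks: prime_prefetch()
+    // starts blocks 0 and 1; once block 3 lands on the GPU,
+    // advance_prefetch(name_for, 3, 10) starts block 5, so two transfers stay
+    // in flight while block 3 executes.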
+ void advance_prefetch(const std::function& name_for, + int current_idx, int num_blocks) { + int n = config_.prefetch_layers > 0 ? config_.prefetch_layers : 1; + int target = current_idx + n; + if (target < num_blocks) { + prefetch_layer(name_for(target)); + } + } + +private: + bool ensure_layer_loaded(const std::string& layer_name, int current_idx) { + if (registry_.is_layer_on_gpu(layer_name)) { + return true; + } + + if (!budget_.ensure_vram_for_layer(layer_name, current_idx)) { + LOG_ERROR("cannot ensure VRAM for layer '%s'", layer_name.c_str()); + return false; + } + + return registry_.move_layer_to_gpu(layer_name); + } + + bool should_offload_layer(const std::string& layer_name, + int layer_idx, + const std::vector& layers) { + if (layer_name == "_global") { + return false; + } + + size_t free_vram = budget_.get_available_vram(); + if (free_vram > config_.min_free_vram * 2) { + return false; + } + + // UNet skip connections need more sophisticated logic + if (config_.keep_layers_behind > 0) { + return false; + } + + return free_vram < config_.min_free_vram; + } + + ggml_context* create_layer_context(const LayerSubgraph& layer) { + size_t ctx_size = 1024 * 1024; + if (layer.estimated_compute_size > 0) { + ctx_size = layer.estimated_compute_size; + } + + struct ggml_init_params params = { + ctx_size, + nullptr, + true // no_alloc + }; + + return ggml_init(params); + } + + ggml_backend_t gpu_backend_; + ggml_backend_t cpu_backend_; + + TensorRegistry registry_; + MemoryBudgetManager budget_; + IntermediateTensorManager intermediates_; + + StreamingConfig config_; + + std::set pending_prefetches_; +}; + +inline std::vector build_flux_layer_subgraphs( + int depth, + int depth_single, + const std::vector& skip_layers = {}) { + + std::vector layers; + + for (int i = 0; i < depth; i++) { + if (std::find(skip_layers.begin(), skip_layers.end(), i) != skip_layers.end()) { + continue; + } + + LayerSubgraph layer; + layer.name = "double_blocks." + std::to_string(i); + layer.index = i; + layers.push_back(layer); + } + + for (int i = 0; i < depth_single; i++) { + if (std::find(skip_layers.begin(), skip_layers.end(), i + depth) != skip_layers.end()) { + continue; + } + + LayerSubgraph layer; + layer.name = "single_blocks." + std::to_string(i); + layer.index = depth + i; + layers.push_back(layer); + } + + return layers; +} + +// UNet uses coarse stages due to skip connections +inline std::vector build_unet_layer_subgraphs( + int num_input_blocks, + int num_output_blocks) { + + std::vector layers; + + LayerSubgraph input_stage; + input_stage.name = "input_blocks"; + input_stage.index = 0; + layers.push_back(input_stage); + + LayerSubgraph middle_stage; + middle_stage.name = "middle_block"; + middle_stage.index = 1; + layers.push_back(middle_stage); + + LayerSubgraph output_stage; + output_stage.name = "output_blocks"; + output_stage.index = 2; + layers.push_back(output_stage); + + return layers; +} + +} // namespace LayerStreaming + +#endif // __LAYER_STREAMING_HPP__ diff --git a/src/lora.hpp b/src/lora.hpp index b57bc4226..f4e42890f 100644 --- a/src/lora.hpp +++ b/src/lora.hpp @@ -24,8 +24,9 @@ struct LoraModel : public GGMLRunner { ggml_backend_t backend, const std::string& file_path = "", std::string prefix = "", - SDVersion version = VERSION_COUNT) - : lora_id(lora_id), file_path(file_path), GGMLRunner(backend, false) { + SDVersion version = VERSION_COUNT, + bool enable_offload = false) + : lora_id(lora_id), file_path(file_path), GGMLRunner(backend, enable_offload) { prefix = "lora." 
+ prefix; if (!model_loader.init_from_file_and_convert_name(file_path, prefix, version)) { load_failed = true; @@ -94,6 +95,29 @@ struct LoraModel : public GGMLRunner { return true; } + // Reload params from disk after buffer was freed (for dynamic offloading) + // Assumes lora_tensors map is still valid (tensors exist in params_ctx) + bool reload_params(int n_threads) { + if (lora_tensors.empty()) { + return true; // Nothing to reload + } + + alloc_params_buffer(); + + auto on_reload_cb = [&](const TensorStorage& tensor_storage, ggml_tensor** dst_tensor) -> bool { + const std::string& name = tensor_storage.name; + auto iter = lora_tensors.find(name); + if (iter != lora_tensors.end()) { + *dst_tensor = iter->second; + } + return true; + }; + + model_loader.load_tensors(on_reload_cb, n_threads); + LOG_DEBUG("reloaded lora params from disk"); + return true; + } + void preprocess_lora_tensors(const std::map& model_tensors) { if (tensor_preprocessed) { return; diff --git a/src/memory_budget.hpp b/src/memory_budget.hpp new file mode 100644 index 000000000..199c58091 --- /dev/null +++ b/src/memory_budget.hpp @@ -0,0 +1,316 @@ +#ifndef __MEMORY_BUDGET_HPP__ +#define __MEMORY_BUDGET_HPP__ + +#include +#include +#include + +#include "ggml-backend.h" +#include "ggml.h" + +#include "tensor_registry.hpp" +#include "util.h" + +namespace LayerStreaming { + +enum class EvictionPolicy { + LAYER_DISTANCE, + LRU, + LARGEST_FIRST, +}; + +class MemoryBudgetManager { +public: + MemoryBudgetManager(TensorRegistry& registry, + ggml_backend_t gpu_backend, + size_t safety_margin_bytes = 512 * 1024 * 1024) + : registry_(registry), + gpu_backend_(gpu_backend), + safety_margin_(safety_margin_bytes) { + query_device_memory(); + } + + void set_eviction_policy(EvictionPolicy policy) { + eviction_policy_ = policy; + } + + void set_safety_margin(size_t bytes) { + safety_margin_ = bytes; + } + + void query_device_memory() { + // Use runtime backend device API (works for CUDA, Vulkan, Metal, etc.). + // The previous SD_USE_CUDA gate broke after PR #1448 removed compile-time + // backend selection, leaving every build on the 8 GB / 4 GB fallback. + ggml_backend_dev_t dev = gpu_backend_ ? ggml_backend_get_device(gpu_backend_) : nullptr; + if (dev != nullptr) { + ggml_backend_dev_memory(dev, &free_vram_, &total_vram_); + } else { + total_vram_ = 8ULL * 1024 * 1024 * 1024; + free_vram_ = total_vram_ / 2; + } + // If the caller set a `--max-vram` budget, treat that as the upper + // bound on what our streaming planner is allowed to see, so the + // same budget knob drives both leejet's graph-cut path and our + // layer-streaming path. Lets users simulate a smaller card without + // needing a separate flag. 
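+        // e.g. a 24 GB card reporting 20 GB free with --max-vram set to 8 GB
+        // ends up with free_vram_ = total_vram_ = 8 GB here, so every
+        // downstream budget decision sees the simulated smaller card.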
+ if (max_vram_cap_bytes_ > 0) { + if (max_vram_cap_bytes_ < free_vram_) { + free_vram_ = max_vram_cap_bytes_; + } + if (max_vram_cap_bytes_ < total_vram_) { + total_vram_ = max_vram_cap_bytes_; + } + } + LOG_DEBUG("total VRAM = %.2f GB, free = %.2f GB", + total_vram_ / (1024.0 * 1024.0 * 1024.0), + free_vram_ / (1024.0 * 1024.0 * 1024.0)); + } + + void set_max_vram_cap_bytes(size_t bytes) { + max_vram_cap_bytes_ = bytes; + } + + void set_compute_buffer_reserve(size_t bytes) { + compute_buffer_reserve_ = bytes; + } + + size_t get_compute_buffer_reserve() const { + return compute_buffer_reserve_; + } + + size_t get_free_vram() { + query_device_memory(); + return free_vram_; + } + + size_t get_total_vram() const { + return total_vram_; + } + + size_t get_available_vram() { + size_t free = get_free_vram(); + if (free <= safety_margin_) { + return 0; + } + return free - safety_margin_; + } + + bool has_enough_vram(size_t required_bytes) { + return get_available_vram() >= required_bytes; + } + + // Evicts other layers if necessary to make room + bool ensure_vram_for_layer(const std::string& layer_name, int current_layer_idx = -1) { + if (registry_.is_layer_on_gpu(layer_name)) { + return true; + } + + size_t layer_size = registry_.get_layer_size(layer_name); + if (layer_size == 0) { + LOG_ERROR("layer '%s' not found", layer_name.c_str()); + return false; + } + + if (has_enough_vram(layer_size)) { + return true; + } + + size_t needed = layer_size - get_available_vram(); + return evict_layers_for_space(needed, layer_name, current_layer_idx); + } + + // Dry-run allocation to get exact buffer requirements + size_t estimate_compute_buffer_size(ggml_cgraph* graph) { + if (graph == nullptr) { + return 0; + } + + ggml_gallocr_t temp_allocr = ggml_gallocr_new( + ggml_backend_get_default_buffer_type(gpu_backend_)); + + if (!ggml_gallocr_reserve(temp_allocr, graph)) { + ggml_gallocr_free(temp_allocr); + return 0; + } + + size_t compute_size = ggml_gallocr_get_buffer_size(temp_allocr, 0); + ggml_gallocr_free(temp_allocr); + + return compute_size; + } + + bool should_offload_layer(const std::string& layer_name, + const std::string& next_layer_name, + int keep_layers_ahead = 1) { + size_t next_layer_size = registry_.get_layer_size(next_layer_name); + if (has_enough_vram(next_layer_size * (keep_layers_ahead + 1))) { + return false; + } + return true; + } + + std::vector get_suggested_gpu_layers(int current_layer_idx, + int layers_ahead = 1, + int layers_behind = 0) { + auto all_layers = registry_.get_layer_names_sorted(); + std::vector result; + + for (const auto& name : all_layers) { + if (name == "_global") { + result.push_back(name); + continue; + } + + // TODO: filter by index range once layer index tracking is implemented + result.push_back(name); + } + + return result; + } + +private: + bool evict_layers_for_space(size_t bytes_needed, + const std::string& protected_layer, + int current_layer_idx) { + auto layers_on_gpu = registry_.get_layers_on_gpu(); + if (layers_on_gpu.empty()) { + LOG_ERROR("no layers to evict but need %.2f MB", + bytes_needed / (1024.0 * 1024.0)); + return false; + } + + layers_on_gpu.erase( + std::remove(layers_on_gpu.begin(), layers_on_gpu.end(), protected_layer), + layers_on_gpu.end()); + + // _global contains shared tensors, never evict + layers_on_gpu.erase( + std::remove(layers_on_gpu.begin(), layers_on_gpu.end(), "_global"), + layers_on_gpu.end()); + + if (layers_on_gpu.empty()) { + LOG_ERROR("no evictable layers available"); + return false; + } + + std::vector> scored_layers; 
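+        // Score every evictable resident layer, sort descending, then evict
+        // from the top until enough bytes are freed (higher score = better
+        // eviction candidate, see compute_eviction_score()).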
+ for (const auto& layer : layers_on_gpu) { + int score = compute_eviction_score(layer, current_layer_idx); + scored_layers.push_back({layer, score}); + } + + std::sort(scored_layers.begin(), scored_layers.end(), + [](const auto& a, const auto& b) { return a.second > b.second; }); + + size_t freed = 0; + for (const auto& [layer, score] : scored_layers) { + size_t layer_size = registry_.get_layer_size(layer); + registry_.move_layer_to_cpu(layer); + freed += layer_size; + + LOG_DEBUG("evicted layer '%s' (%.2f MB), total freed: %.2f MB", + layer.c_str(), + layer_size / (1024.0 * 1024.0), + freed / (1024.0 * 1024.0)); + + if (freed >= bytes_needed) { + return true; + } + } + + LOG_WARN("only freed %.2f MB, needed %.2f MB", + freed / (1024.0 * 1024.0), + bytes_needed / (1024.0 * 1024.0)); + return freed >= bytes_needed; + } + + // Higher score = more likely to evict + int compute_eviction_score(const std::string& layer, int current_layer_idx) { + switch (eviction_policy_) { + case EvictionPolicy::LAYER_DISTANCE: { + int layer_idx = extract_layer_index(layer); + if (layer_idx < 0 || current_layer_idx < 0) { + return 0; + } + return std::abs(layer_idx - current_layer_idx); + } + + case EvictionPolicy::LARGEST_FIRST: { + return static_cast(registry_.get_layer_size(layer) / (1024 * 1024)); + } + + case EvictionPolicy::LRU: + default: + // TODO: LRU needs access tracking in TensorRegistry, falling back to size-based + return static_cast(registry_.get_layer_size(layer) / (1024 * 1024)); + } + } + + int extract_layer_index(const std::string& layer_name) { + size_t db_pos = layer_name.find("double_blocks."); + if (db_pos != std::string::npos) { + size_t num_start = db_pos + 14; + try { + return std::stoi(layer_name.substr(num_start)); + } catch (...) { + return -1; + } + } + + size_t sb_pos = layer_name.find("single_blocks."); + if (sb_pos != std::string::npos) { + size_t num_start = sb_pos + 14; + try { + return 19 + std::stoi(layer_name.substr(num_start)); // offset past double_blocks + } catch (...) { + return -1; + } + } + + size_t ib_pos = layer_name.find("input_blocks."); + if (ib_pos != std::string::npos) { + size_t num_start = ib_pos + 13; + try { + return std::stoi(layer_name.substr(num_start)); + } catch (...) { + return -1; + } + } + + size_t ob_pos = layer_name.find("output_blocks."); + if (ob_pos != std::string::npos) { + size_t num_start = ob_pos + 14; + try { + return 200 + std::stoi(layer_name.substr(num_start)); + } catch (...) { + return -1; + } + } + + if (layer_name.find("middle_block") != std::string::npos) { + return 100; + } + + return -1; + } + + TensorRegistry& registry_; + ggml_backend_t gpu_backend_; + + size_t total_vram_ = 0; + size_t free_vram_ = 0; + size_t safety_margin_ = 512 * 1024 * 1024; + size_t max_vram_cap_bytes_ = 0; // 0 = no cap; set by --max-vram + size_t compute_buffer_reserve_ = 768ULL * 1024 * 1024; // headroom for the active block's compute graph + // alloc; matches compute_resident_block_count default. + // Used by analyze_vram_budget() to avoid picking + // coarse-stage when params fit but params + CB + // would exceed VRAM. 
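+    // Default policy: evict the resident block whose index is farthest from
+    // the currently executing one (see compute_eviction_score()).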
+ + EvictionPolicy eviction_policy_ = EvictionPolicy::LAYER_DISTANCE; +}; + +} // namespace LayerStreaming + +#endif // __MEMORY_BUDGET_HPP__ diff --git a/src/mmdit.hpp b/src/mmdit.hpp index e57041dc9..fd305c3e0 100644 --- a/src/mmdit.hpp +++ b/src/mmdit.hpp @@ -3,7 +3,9 @@ #include +#include "common_dit.hpp" #include "ggml_extend.hpp" +#include "layer_streaming.hpp" #include "model.h" #define MMDIT_GRAPH_SIZE 10240 @@ -745,6 +747,64 @@ struct MMDiT : public GGMLBlock { return spatial_pos_embed; } + struct StreamingInputResult { + ggml_tensor* x; // [N, H*W, hidden_size] + ggml_tensor* context; // [N, L, hidden_size] + ggml_tensor* c_mod; // [N, hidden_size] + }; + + StreamingInputResult forward_input_stage(GGMLRunnerContext* ctx, + ggml_tensor* x, + ggml_tensor* t, + ggml_tensor* y, + ggml_tensor* context, + int64_t H, int64_t W) { + auto x_embedder = std::dynamic_pointer_cast(blocks["x_embedder"]); + auto t_embedder = std::dynamic_pointer_cast(blocks["t_embedder"]); + + // Patch embed + pos embed + auto patch_embed = x_embedder->forward(ctx, x); // [N, H*W, hidden_size] + auto pos_embed_out = cropped_pos_embed(ctx->ggml_ctx, H, W); // [1, H*W, hidden_size] + x = ggml_add(ctx->ggml_ctx, patch_embed, pos_embed_out); // [N, H*W, hidden_size] + + // Timestep embedding + auto c = t_embedder->forward(ctx, t); // [N, hidden_size] + + // Y embedding (if present) + if (y != nullptr && adm_in_channels != -1) { + auto y_embedder = std::dynamic_pointer_cast(blocks["y_embedder"]); + y = y_embedder->forward(ctx, y); // [N, hidden_size] + c = ggml_add(ctx->ggml_ctx, c, y); + } + + // Context embedding + if (context != nullptr) { + auto context_embedder = std::dynamic_pointer_cast(blocks["context_embedder"]); + context = context_embedder->forward(ctx, context); // [N, L, hidden_size] + } + + return {x, context, c}; + } + + std::pair forward_joint_block(GGMLRunnerContext* ctx, + int block_idx, + ggml_tensor* context, + ggml_tensor* x, + ggml_tensor* c_mod) { + auto block = std::dynamic_pointer_cast(blocks["joint_blocks." + std::to_string(block_idx)]); + return block->forward(ctx, context, x, c_mod); + } + + ggml_tensor* forward_output_stage(GGMLRunnerContext* ctx, + ggml_tensor* x, + ggml_tensor* c_mod) { + auto final_layer = std::dynamic_pointer_cast(blocks["final_layer"]); + return final_layer->forward(ctx, x, c_mod); // (N, H*W, patch_size ** 2 * out_channels) + } + + int get_depth() const { return depth; } + int get_patch_size() const { return patch_size; } + ggml_tensor* forward_core_with_concat(GGMLRunnerContext* ctx, ggml_tensor* x, ggml_tensor* c_mod, @@ -827,6 +887,10 @@ struct MMDiT : public GGMLBlock { struct MMDiTRunner : public GGMLRunner { MMDiT mmdit; + // Static layer cache decided on the first sampling step. -1 = not yet + // computed; 0..N = number of joint_blocks kept resident on GPU. 
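+    // Decided once via compute_resident_block_count() on the first per-layer
+    // streaming pass; blocks [0, resident_joint_blocks_) then keep their
+    // weights on the GPU across all remaining sampling steps while the rest
+    // are streamed.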
+ int resident_joint_blocks_ = -1; + MMDiTRunner(ggml_backend_t backend, bool offload_params_to_cpu, const String2TensorStorage& tensor_storage_map = {}, @@ -843,6 +907,353 @@ struct MMDiTRunner : public GGMLRunner { mmdit.get_param_tensors(tensors, prefix); } + void enable_layer_streaming(const LayerStreaming::StreamingConfig& config = {}) { + std::map tensor_map; + mmdit.get_param_tensors(tensor_map, "model.diffusion_model"); + init_streaming(config, tensor_map, LayerStreaming::mmdit_layer_pattern); + LOG_INFO("%s layer streaming enabled (%zu layers)", + get_desc().c_str(), streaming_engine_->get_registry().get_layer_count()); + } + + bool compute_streaming(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* y, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr, + std::vector skip_layers = std::vector()) { + if (!is_streaming_enabled()) { + LOG_ERROR("%s streaming not enabled", get_desc().c_str()); + return false; + } + + int64_t t0 = ggml_time_ms(); + auto analysis = analyze_vram_budget(); + + if (analysis.fits_in_vram) { + LOG_INFO("%s model fits in VRAM, using coarse-stage streaming", get_desc().c_str()); + load_all_layers_coarse(); + bool result = compute(n_threads, x, timesteps, context, y, output, output_ctx, skip_layers, true); + int64_t t1 = ggml_time_ms(); + LOG_INFO("%s coarse-stage streaming completed in %.2fs", get_desc().c_str(), (t1 - t0) / 1000.0); + free_compute_buffer(); + return result; + } + + LOG_INFO("%s remaining %.2f GB exceeds available %.2f GB, using per-layer streaming", + get_desc().c_str(), + analysis.remaining_to_load / (1024.0 * 1024.0 * 1024.0), + analysis.available_vram / (1024.0 * 1024.0 * 1024.0)); + + return compute_streaming_true(n_threads, x, timesteps, context, y, output, output_ctx, skip_layers); + } + + bool compute_streaming_true(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* y, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr, + std::vector skip_layers = std::vector()) { + auto& registry = streaming_engine_->get_registry(); + int64_t t_start = ggml_time_ms(); + + const int num_blocks = mmdit.get_depth(); + const int patch_size = mmdit.get_patch_size(); + const int64_t W = x->ne[0]; + const int64_t H = x->ne[1]; + + LOG_INFO("TRUE per-layer streaming - %d joint_blocks", num_blocks); + + // Load global layers + LOG_DEBUG("Loading global layers"); + if (!registry.move_layer_to_gpu("_global")) { + LOG_ERROR("Failed to load _global to GPU"); + return false; + } + + // Persistent storage for intermediate tensors. Backed by a single + // GPU-pinned host buffer (ensure_pinned_act_buffers) so per-block + // ggml_backend_tensor_get / set_backend_tensor_data run at full + // PCIe bandwidth. context is optional (some MMDiT variants omit it). 
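+        // Pattern: on success the raw float* views below alias one shared
+        // pinned host buffer; on failure each aliases its own pageable
+        // fallback vector. The streaming loop only ever touches the float*
+        // views, so both paths behave identically apart from transfer speed.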
+ std::vector persistent_x_fallback; + std::vector persistent_context_fallback; + std::vector persistent_c_mod_fallback; + float* persistent_x = nullptr; + float* persistent_context = nullptr; + float* persistent_c_mod = nullptr; + size_t persistent_x_count = 0; + size_t persistent_context_count = 0; + size_t persistent_c_mod_count = 0; + int64_t x_ne[4], context_ne[4], c_mod_ne[4]; + + LOG_DEBUG("Executing input stage"); + { + ggml_tensor* x_output = nullptr; + ggml_tensor* context_output = nullptr; + ggml_tensor* c_mod_output = nullptr; + + auto get_input_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(MMDIT_GRAPH_SIZE / 4); + auto runner_ctx = get_context(); + + ggml_tensor* x_backend = to_backend(x); + ggml_tensor* timesteps_backend = to_backend(timesteps); + ggml_tensor* y_backend = y ? to_backend(y) : nullptr; + ggml_tensor* context_backend = context ? to_backend(context) : nullptr; + + auto result = mmdit.forward_input_stage(&runner_ctx, x_backend, timesteps_backend, + y_backend, context_backend, H, W); + + x_output = result.x; + context_output = result.context; + c_mod_output = result.c_mod; + + ggml_build_forward_expand(gf, x_output); + if (context_output) ggml_build_forward_expand(gf, context_output); + ggml_build_forward_expand(gf, c_mod_output); + + return gf; + }; + + // Don't free compute buffer immediately - we need to read outputs first + if (!GGMLRunner::compute(get_input_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Input stage failed"); + return false; + } + + // Extract to persistent storage + if (x_output && c_mod_output) { + size_t x_size = ggml_nelements(x_output); + size_t c_mod_size = ggml_nelements(c_mod_output); + size_t context_size = context_output ? ggml_nelements(context_output) : 0; + + persistent_x_count = x_size; + persistent_c_mod_count = c_mod_size; + persistent_context_count = context_size; + + std::vector ptrs; + if (ensure_pinned_act_buffers({x_size * sizeof(float), + c_mod_size * sizeof(float), + context_size * sizeof(float)}, ptrs)) { + persistent_x = ptrs[0]; + persistent_c_mod = ptrs[1]; + persistent_context = context_size ? ptrs[2] : nullptr; + } else { + persistent_x_fallback.resize(x_size); + persistent_c_mod_fallback.resize(c_mod_size); + persistent_x = persistent_x_fallback.data(); + persistent_c_mod = persistent_c_mod_fallback.data(); + if (context_size) { + persistent_context_fallback.resize(context_size); + persistent_context = persistent_context_fallback.data(); + } + } + + ggml_backend_tensor_get(x_output, persistent_x, 0, x_size * sizeof(float)); + ggml_backend_tensor_get(c_mod_output, persistent_c_mod, 0, c_mod_size * sizeof(float)); + + for (int i = 0; i < 4; i++) { + x_ne[i] = x_output->ne[i]; + c_mod_ne[i] = c_mod_output->ne[i]; + } + + if (context_output) { + ggml_backend_tensor_get(context_output, persistent_context, 0, context_size * sizeof(float)); + for (int i = 0; i < 4; i++) { + context_ne[i] = context_output->ne[i]; + } + } + } else { + LOG_ERROR("Failed to get input stage outputs"); + free_compute_buffer(); + return false; + } + + // Now safe to free compute buffer + free_compute_buffer(); + } + + LOG_DEBUG("Input stage done, x=%ldx%ldx%ld", x_ne[0], x_ne[1], x_ne[2]); + + auto block_name_at = [](int i) { return "joint_blocks." 
+ std::to_string(i); }; + if (streaming_engine_) { + if (resident_joint_blocks_ < 0) { + resident_joint_blocks_ = streaming_engine_->compute_resident_block_count( + "joint_blocks.0", num_blocks); + LOG_INFO("%s joint_blocks cache: %d resident, %d streamed per step", + get_desc().c_str(), + resident_joint_blocks_, + num_blocks - resident_joint_blocks_); + } + + int prefetch_start = 0; + while (prefetch_start < num_blocks && + registry.is_layer_on_gpu(block_name_at(prefetch_start))) { + prefetch_start++; + } + streaming_engine_->prime_prefetch(block_name_at, prefetch_start, num_blocks); + } + + for (int block_idx = 0; block_idx < num_blocks; block_idx++) { + // Check skip_layers + if (skip_layers.size() > 0 && std::find(skip_layers.begin(), skip_layers.end(), block_idx) != skip_layers.end()) { + LOG_DEBUG("Skipping joint_block %d", block_idx); + continue; + } + + std::string block_name = block_name_at(block_idx); + int64_t t_block_start = ggml_time_ms(); + + // Wait for this block's prefetch to complete (if async prefetch was started) + if (streaming_engine_) { + streaming_engine_->wait_for_prefetch(block_name); + } + + // Load this block's weights (sync load if prefetch didn't happen) + if (!registry.move_layer_to_gpu(block_name)) { + LOG_ERROR("Failed to load %s", block_name.c_str()); + return false; + } + + // Keep the prefetch window full + if (streaming_engine_) { + streaming_engine_->advance_prefetch(block_name_at, block_idx, num_blocks); + } + + ggml_tensor* x_out = nullptr; + ggml_tensor* context_out = nullptr; + + auto get_block_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(MMDIT_GRAPH_SIZE / 4); + + // Create input tensors from persistent storage + ggml_tensor* x_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, x_ne[0], x_ne[1], x_ne[2], x_ne[3]); + ggml_tensor* c_mod_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, c_mod_ne[0], c_mod_ne[1], c_mod_ne[2], c_mod_ne[3]); + + x_in = to_backend(x_in); + c_mod_in = to_backend(c_mod_in); + + set_backend_tensor_data(x_in, persistent_x); + set_backend_tensor_data(c_mod_in, persistent_c_mod); + + ggml_tensor* context_in = nullptr; + if (persistent_context_count > 0) { + context_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, context_ne[0], context_ne[1], context_ne[2], context_ne[3]); + context_in = to_backend(context_in); + set_backend_tensor_data(context_in, persistent_context); + } + + auto runner_ctx = get_context(); + auto result = mmdit.forward_joint_block(&runner_ctx, block_idx, context_in, x_in, c_mod_in); + + context_out = result.first; + x_out = result.second; + + if (context_out) ggml_build_forward_expand(gf, context_out); + ggml_build_forward_expand(gf, x_out); + + return gf; + }; + + // Don't free compute buffer immediately - we need to read outputs first + if (!GGMLRunner::compute(get_block_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Joint block %d execution failed", block_idx); + return false; + } + + // Extract outputs to persistent storage + if (x_out) { + ggml_backend_tensor_get(x_out, persistent_x, 0, persistent_x_count * sizeof(float)); + for (int i = 0; i < 4; i++) { + x_ne[i] = x_out->ne[i]; + } + } + if (context_out && persistent_context_count > 0) { + ggml_backend_tensor_get(context_out, persistent_context, 0, persistent_context_count * sizeof(float)); + for (int i = 0; i < 4; i++) { + context_ne[i] = context_out->ne[i]; + } + } + + // Now safe to free compute buffer + free_compute_buffer(); + + // Resident blocks stay on GPU across sampling steps. 
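+            // Blocks below the resident threshold keep their weights on the
+            // GPU for the next sampling step; everything at or above it is
+            // evicted so the prefetch window can reuse that VRAM.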
+ if (block_idx >= resident_joint_blocks_) { + registry.move_layer_to_cpu(block_name); + } + + LOG_DEBUG("Joint block %d/%d done (%.2fms)", + block_idx + 1, num_blocks, (ggml_time_ms() - t_block_start) / 1.0); + } + + LOG_DEBUG("Executing output stage"); + { + auto get_output_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(MMDIT_GRAPH_SIZE / 4); + + ggml_tensor* x_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, x_ne[0], x_ne[1], x_ne[2], x_ne[3]); + ggml_tensor* c_mod_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, c_mod_ne[0], c_mod_ne[1], c_mod_ne[2], c_mod_ne[3]); + + x_in = to_backend(x_in); + c_mod_in = to_backend(c_mod_in); + + set_backend_tensor_data(x_in, persistent_x); + set_backend_tensor_data(c_mod_in, persistent_c_mod); + + auto runner_ctx = get_context(); + auto final_out = mmdit.forward_output_stage(&runner_ctx, x_in, c_mod_in); + + // Unpatchify + final_out = DiT::unpatchify_and_crop(compute_ctx, final_out, H, W, patch_size, patch_size, /*patch_last*/ false); + + ggml_build_forward_expand(gf, final_out); + + return gf; + }; + + if (!GGMLRunner::compute(get_output_graph, n_threads, true, output, output_ctx, true)) { + LOG_ERROR("Output stage failed"); + return false; + } + } + + int64_t t_end = ggml_time_ms(); + LOG_INFO("TRUE per-layer streaming completed in %.2fs (%d joint_blocks)", + (t_end - t_start) / 1000.0, num_blocks); + + return true; + } + + // Old-style build_graph for streaming code that uses raw ggml_tensor pointers + ggml_cgraph* build_graph(ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* y, + std::vector skip_layers = std::vector()) { + ggml_cgraph* gf = new_graph_custom(MMDIT_GRAPH_SIZE); + + x = to_backend(x); + context = to_backend(context); + y = to_backend(y); + timesteps = to_backend(timesteps); + + auto runner_ctx = get_context(); + ggml_tensor* out = mmdit.forward(&runner_ctx, + x, + timesteps, + y, + context, + skip_layers); + + ggml_build_forward_expand(gf, out); + + return gf; + } + ggml_cgraph* build_graph(const sd::Tensor& x_tensor, const sd::Tensor& timesteps_tensor, const sd::Tensor& context_tensor = {}, @@ -868,6 +1279,23 @@ struct MMDiTRunner : public GGMLRunner { return gf; } + // Old-style compute for streaming code that uses raw ggml_tensor pointers + bool compute(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* y, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr, + std::vector skip_layers = std::vector(), + bool skip_param_offload = false) { + auto get_graph = [&]() -> ggml_cgraph* { + return build_graph(x, timesteps, context, y, skip_layers); + }; + + return GGMLRunner::compute(get_graph, n_threads, false, output, output_ctx, skip_param_offload); + } + sd::Tensor compute(int n_threads, const sd::Tensor& x, const sd::Tensor& timesteps, diff --git a/src/qwen_image.hpp b/src/qwen_image.hpp index 35d32109e..d89b19950 100644 --- a/src/qwen_image.hpp +++ b/src/qwen_image.hpp @@ -5,6 +5,7 @@ #include "common_block.hpp" #include "flux.hpp" +#include "layer_streaming.hpp" namespace Qwen { constexpr int QWEN_IMAGE_GRAPH_SIZE = 20480; @@ -436,6 +437,92 @@ namespace Qwen { return img; } + struct StreamingInputResult { + ggml_tensor* img; + ggml_tensor* txt; + ggml_tensor* t_emb; + }; + + StreamingInputResult forward_input_stage(GGMLRunnerContext* ctx, + ggml_tensor* x, + ggml_tensor* timestep, + ggml_tensor* context, + std::vector ref_latents = {}, + int64_t* out_img_tokens = nullptr) { + auto 
time_text_embed = std::dynamic_pointer_cast(blocks["time_text_embed"]); + auto txt_norm = std::dynamic_pointer_cast(blocks["txt_norm"]); + auto img_in = std::dynamic_pointer_cast(blocks["img_in"]); + auto txt_in = std::dynamic_pointer_cast(blocks["txt_in"]); + + auto t_emb = time_text_embed->forward(ctx, timestep); + if (params.zero_cond_t) { + auto t_emb_0 = time_text_embed->forward(ctx, ggml_ext_zeros(ctx->ggml_ctx, timestep->ne[0], timestep->ne[1], timestep->ne[2], timestep->ne[3])); + t_emb = ggml_concat(ctx->ggml_ctx, t_emb, t_emb_0, 1); + } + + // Patchify input (same as main forward()) + auto img_patched = DiT::pad_and_patchify(ctx, x, params.patch_size, params.patch_size); + int64_t img_tokens = img_patched->ne[1]; + + // Handle reference latents + if (ref_latents.size() > 0) { + for (ggml_tensor* ref : ref_latents) { + ref = DiT::pad_and_patchify(ctx, ref, params.patch_size, params.patch_size); + img_patched = ggml_concat(ctx->ggml_ctx, img_patched, ref, 1); + } + } + + auto img = img_in->forward(ctx, img_patched); + auto txt = txt_norm->forward(ctx, context); + txt = txt_in->forward(ctx, txt); + + if (out_img_tokens) { + *out_img_tokens = img_tokens; + } + + return {img, txt, t_emb}; + } + + std::pair forward_single_block(GGMLRunnerContext* ctx, + int block_idx, + ggml_tensor* img, + ggml_tensor* txt, + ggml_tensor* t_emb, + ggml_tensor* pe, + ggml_tensor* modulate_index = nullptr) { + auto block = std::dynamic_pointer_cast(blocks["transformer_blocks." + std::to_string(block_idx)]); + return block->forward(ctx, img, txt, t_emb, pe, modulate_index); + } + + ggml_tensor* forward_output_stage(GGMLRunnerContext* ctx, + ggml_tensor* img, + ggml_tensor* t_emb, + int64_t img_tokens, + int64_t orig_H, + int64_t orig_W) { + auto norm_out = std::dynamic_pointer_cast(blocks["norm_out"]); + auto proj_out = std::dynamic_pointer_cast(blocks["proj_out"]); + + if (params.zero_cond_t) { + t_emb = ggml_ext_chunk(ctx->ggml_ctx, t_emb, 2, 1)[0]; + } + + // Trim to original img_tokens if ref_latents were used + if (img->ne[1] > img_tokens) { + img = ggml_cont(ctx->ggml_ctx, ggml_permute(ctx->ggml_ctx, img, 0, 2, 1, 3)); + img = ggml_view_3d(ctx->ggml_ctx, img, img->ne[0], img->ne[1], img_tokens, img->nb[1], img->nb[2], 0); + img = ggml_cont(ctx->ggml_ctx, ggml_permute(ctx->ggml_ctx, img, 0, 2, 1, 3)); + } + + img = norm_out->forward(ctx, img, t_emb); + img = proj_out->forward(ctx, img); + + // Unpatchify and crop + img = DiT::unpatchify_and_crop(ctx->ggml_ctx, img, orig_H, orig_W, params.patch_size, params.patch_size); + + return img; + } + ggml_tensor* forward(GGMLRunnerContext* ctx, ggml_tensor* x, ggml_tensor* timestep, @@ -487,6 +574,10 @@ namespace Qwen { std::vector modulate_index_vec; SDVersion version; + // Static layer cache decided on the first sampling step. -1 = not yet + // computed; 0..N = number of "transformer_blocks.X" kept resident. 
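+        // Same static-prefix scheme as MMDiTRunner::resident_joint_blocks_:
+        // computed once, then the first N transformer_blocks stay resident.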
+ int resident_transformer_blocks_ = -1; + QwenImageRunner(ggml_backend_t backend, bool offload_params_to_cpu, const String2TensorStorage& tensor_storage_map = {}, @@ -532,6 +623,485 @@ namespace Qwen { qwen_image.get_param_tensors(tensors, prefix); } + public: + void enable_layer_streaming(const LayerStreaming::StreamingConfig& config = {}) { + std::map tensor_map; + qwen_image.get_param_tensors(tensor_map, "model.diffusion_model"); + init_streaming(config, tensor_map, LayerStreaming::qwen_image_layer_pattern); + LOG_INFO("%s layer streaming enabled (%zu layers)", + get_desc().c_str(), streaming_engine_->get_registry().get_layer_count()); + } + + bool compute_streaming(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + std::vector ref_latents = {}, + bool increase_ref_index = false, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr) { + if (!is_streaming_enabled()) { + LOG_ERROR("%s streaming not enabled", get_desc().c_str()); + return false; + } + + int64_t t0 = ggml_time_ms(); + auto analysis = analyze_vram_budget(); + + if (analysis.fits_in_vram) { + LOG_INFO("%s model fits in VRAM, using coarse-stage streaming", get_desc().c_str()); + load_all_layers_coarse(); + bool result = compute(n_threads, x, timesteps, context, ref_latents, increase_ref_index, + output, output_ctx, true); + int64_t t1 = ggml_time_ms(); + LOG_INFO("%s coarse-stage streaming completed in %.2fs", get_desc().c_str(), (t1 - t0) / 1000.0); + free_compute_buffer(); + return result; + } + + LOG_INFO("%s remaining %.2f GB exceeds available %.2f GB, using per-layer streaming", + get_desc().c_str(), + analysis.remaining_to_load / (1024.0 * 1024.0 * 1024.0), + analysis.available_vram / (1024.0 * 1024.0 * 1024.0)); + + return compute_streaming_true(n_threads, x, timesteps, context, ref_latents, increase_ref_index, output, output_ctx); + } + + private: + // Persistent storage for intermediate tensors between layer executions + struct StreamingState { + std::vector img_data; + std::vector txt_data; + std::vector t_emb_data; + std::vector pe_data; + std::vector modulate_index_data; + + // Tensor dimensions + int64_t img_ne[4]; + int64_t txt_ne[4]; + int64_t t_emb_ne[4]; + int64_t pe_ne[4]; + int64_t modulate_index_ne[4]; + bool has_modulate_index = false; + }; + + void copy_tensor_to_storage(ggml_tensor* tensor, std::vector& storage, int64_t* ne) { + size_t nelements = ggml_nelements(tensor); + storage.resize(nelements); + + // Copy to CPU if needed + ggml_backend_tensor_get(tensor, storage.data(), 0, nelements * sizeof(float)); + + // Store dimensions + for (int i = 0; i < 4; i++) { + ne[i] = tensor->ne[i]; + } + } + + ggml_tensor* create_tensor_from_storage(ggml_context* ctx, const std::vector& storage, + const int64_t* ne, const char* name) { + ggml_tensor* tensor = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, ne[0], ne[1], ne[2], ne[3]); + ggml_set_name(tensor, name); + return tensor; + } + + bool compute_streaming_true(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + std::vector ref_latents, + bool increase_ref_index, + ggml_tensor** output, + ggml_context* output_ctx) { + auto& registry = streaming_engine_->get_registry(); + int64_t t_start = ggml_time_ms(); + + const int num_layers = qwen_image_params.num_layers; + LOG_INFO("TRUE per-layer streaming - %d blocks (one at a time)", num_layers); + + // Phase 1: Load global layers (_global contains input/output projections) + LOG_DEBUG("Loading global layers"); + if 
(!registry.move_layer_to_gpu("_global")) { + LOG_ERROR("Failed to load _global to GPU"); + return false; + } + + // Pre-generate PE and modulate_index vectors (needed for all blocks) + pe_vec = Rope::gen_qwen_image_pe(static_cast(x->ne[1]), + static_cast(x->ne[0]), + qwen_image_params.patch_size, + static_cast(x->ne[3]), + static_cast(context->ne[1]), + ref_latents, + increase_ref_index, + qwen_image_params.theta, + circular_y_enabled, + circular_x_enabled, + qwen_image_params.axes_dim); + + if (qwen_image_params.zero_cond_t) { + modulate_index_vec.clear(); + int64_t h_len = ((x->ne[1] + (qwen_image_params.patch_size / 2)) / qwen_image_params.patch_size); + int64_t w_len = ((x->ne[0] + (qwen_image_params.patch_size / 2)) / qwen_image_params.patch_size); + int64_t num_img_tokens = h_len * w_len; + modulate_index_vec.insert(modulate_index_vec.end(), num_img_tokens, 0.f); + + int64_t num_ref_img_tokens = 0; + for (ggml_tensor* ref : ref_latents) { + int64_t rh_len = ((ref->ne[1] + (qwen_image_params.patch_size / 2)) / qwen_image_params.patch_size); + int64_t rw_len = ((ref->ne[0] + (qwen_image_params.patch_size / 2)) / qwen_image_params.patch_size); + num_ref_img_tokens += rh_len * rw_len; + } + if (num_ref_img_tokens > 0) { + modulate_index_vec.insert(modulate_index_vec.end(), num_ref_img_tokens, 1.f); + } + } + + // TRUE per-layer streaming with mini-graphs + // Execute each block as a separate mini-graph to minimize activation memory + + int64_t t_blocks_start = ggml_time_ms(); + + // Store original image dimensions for unpatchify + int64_t orig_H = x->ne[1]; + int64_t orig_W = x->ne[0]; + + // Persistent storage. Backed by a single GPU-pinned host buffer + // (ensure_pinned_act_buffers) so per-block ggml_backend_tensor_get + // / set_backend_tensor_data run at full PCIe bandwidth. Falls back + // to pageable std::vector if pinned alloc fails. 
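+        // Why pinned memory matters here (figures are rough assumptions,
+        // not measurements): pageable host memory forces the driver to
+        // stage each transfer through an internal bounce buffer, while
+        // pinned memory can be DMA'd directly, often close to doubling
+        // effective PCIe throughput. The activations make two round-trips
+        // per block per step, so the difference compounds quickly.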
+ std::vector persistent_img_fallback; + std::vector persistent_txt_fallback; + std::vector persistent_t_emb_fallback; + float* persistent_img = nullptr; + float* persistent_txt = nullptr; + float* persistent_t_emb = nullptr; + size_t persistent_img_count = 0; + size_t persistent_txt_count = 0; + size_t persistent_t_emb_count = 0; + int64_t img_ne[4], txt_ne[4], t_emb_ne[4]; + int64_t img_tokens_count = 0; + + LOG_DEBUG("Executing input stage"); + { + // Build mini-graph for input projections only + ggml_cgraph* input_graph = nullptr; + ggml_tensor* img_output = nullptr; + ggml_tensor* txt_output = nullptr; + ggml_tensor* t_emb_output = nullptr; + int64_t img_tokens_local = 0; + + auto get_input_graph = [&]() -> ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(QWEN_IMAGE_GRAPH_SIZE / 4); // Smaller graph + + ggml_tensor* x_backend = to_backend(x); + ggml_tensor* context_backend = to_backend(context); + ggml_tensor* timesteps_backend = to_backend(timesteps); + + // Convert ref_latents to backend + std::vector ref_latents_backend; + for (auto& ref : ref_latents) { + ref_latents_backend.push_back(to_backend(ref)); + } + + auto runner_ctx = get_context(); + auto result = qwen_image.forward_input_stage(&runner_ctx, x_backend, timesteps_backend, context_backend, + ref_latents_backend, &img_tokens_local); + + img_output = result.img; + txt_output = result.txt; + t_emb_output = result.t_emb; + + // Concatenate outputs into single tensor for extraction + // We'll use img as the primary output and extract separately + ggml_build_forward_expand(gf, result.img); + ggml_build_forward_expand(gf, result.txt); + ggml_build_forward_expand(gf, result.t_emb); + + return gf; + }; + + // Execute input stage - don't free compute buffer immediately + if (!GGMLRunner::compute(get_input_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Input stage failed"); + return false; + } + + img_tokens_count = img_tokens_local; + + // Extract computed tensors to persistent storage + if (img_output && txt_output && t_emb_output) { + // Copy tensor data to CPU storage + size_t img_size = ggml_nelements(img_output); + size_t txt_size = ggml_nelements(txt_output); + size_t t_emb_size = ggml_nelements(t_emb_output); + + persistent_img_count = img_size; + persistent_txt_count = txt_size; + persistent_t_emb_count = t_emb_size; + + std::vector ptrs; + if (ensure_pinned_act_buffers({img_size * sizeof(float), + txt_size * sizeof(float), + t_emb_size * sizeof(float)}, ptrs)) { + persistent_img = ptrs[0]; + persistent_txt = ptrs[1]; + persistent_t_emb = ptrs[2]; + } else { + persistent_img_fallback.resize(img_size); + persistent_txt_fallback.resize(txt_size); + persistent_t_emb_fallback.resize(t_emb_size); + persistent_img = persistent_img_fallback.data(); + persistent_txt = persistent_txt_fallback.data(); + persistent_t_emb = persistent_t_emb_fallback.data(); + } + + ggml_backend_tensor_get(img_output, persistent_img, 0, img_size * sizeof(float)); + ggml_backend_tensor_get(txt_output, persistent_txt, 0, txt_size * sizeof(float)); + ggml_backend_tensor_get(t_emb_output, persistent_t_emb, 0, t_emb_size * sizeof(float)); + + for (int i = 0; i < 4; i++) { + img_ne[i] = img_output->ne[i]; + txt_ne[i] = txt_output->ne[i]; + t_emb_ne[i] = t_emb_output->ne[i]; + } + } else { + LOG_ERROR("Failed to get input stage outputs"); + free_compute_buffer(); + return false; + } + + // Now safe to free compute buffer + free_compute_buffer(); + } + + LOG_DEBUG("Input stage done, img=%ldx%ldx%ldx%ld, txt=%ldx%ldx%ldx%ld", + img_ne[0], 
img_ne[1], img_ne[2], img_ne[3], + txt_ne[0], txt_ne[1], txt_ne[2], txt_ne[3]); + + auto block_name_at = [](int i) { return "transformer_blocks." + std::to_string(i); }; + + if (resident_transformer_blocks_ < 0) { + resident_transformer_blocks_ = streaming_engine_->compute_resident_block_count( + "transformer_blocks.0", num_layers); + LOG_INFO("%s transformer_blocks cache: %d resident, %d streamed per step", + get_desc().c_str(), + resident_transformer_blocks_, + num_layers - resident_transformer_blocks_); + } + + int prefetch_start = 0; + while (prefetch_start < num_layers && + registry.is_layer_on_gpu(block_name_at(prefetch_start))) { + prefetch_start++; + } + streaming_engine_->prime_prefetch(block_name_at, prefetch_start, num_layers); + + for (int block_idx = 0; block_idx < num_layers; block_idx++) { + std::string block_name = block_name_at(block_idx); + int64_t t_block_start = ggml_time_ms(); + + // Wait for this block's prefetch to complete (if it was prefetched) + streaming_engine_->wait_for_prefetch(block_name); + + // Load this block's weights (sync load if prefetch didn't happen) + if (!registry.move_layer_to_gpu(block_name)) { + LOG_ERROR("Failed to load block %d", block_idx); + return false; + } + + // Keep the prefetch window full + streaming_engine_->advance_prefetch(block_name_at, block_idx, num_layers); + + // Build and execute mini-graph for this block + ggml_tensor* img_out = nullptr; + ggml_tensor* txt_out = nullptr; + + auto get_block_graph = [&]() -> ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(QWEN_IMAGE_GRAPH_SIZE / 4); + + // Create input tensors from persistent storage + ggml_tensor* img_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, img_ne[0], img_ne[1], img_ne[2], img_ne[3]); + ggml_tensor* txt_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, txt_ne[0], txt_ne[1], txt_ne[2], txt_ne[3]); + ggml_tensor* t_emb_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, t_emb_ne[0], t_emb_ne[1], t_emb_ne[2], t_emb_ne[3]); + + // Copy to backend and set data + img_in = to_backend(img_in); + txt_in = to_backend(txt_in); + t_emb_in = to_backend(t_emb_in); + + set_backend_tensor_data(img_in, persistent_img); + set_backend_tensor_data(txt_in, persistent_txt); + set_backend_tensor_data(t_emb_in, persistent_t_emb); + + // Generate PE + int pos_len = static_cast(pe_vec.size() / qwen_image_params.axes_dim_sum / 2); + ggml_tensor* pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, qwen_image_params.axes_dim_sum / 2, pos_len); + set_backend_tensor_data(pe, pe_vec.data()); + + // Modulate index + ggml_tensor* modulate_index = nullptr; + if (qwen_image_params.zero_cond_t && !modulate_index_vec.empty()) { + modulate_index = ggml_new_tensor_1d(compute_ctx, GGML_TYPE_F32, modulate_index_vec.size()); + set_backend_tensor_data(modulate_index, modulate_index_vec.data()); + } + + auto runner_ctx = get_context(); + auto [img_result, txt_result] = qwen_image.forward_single_block(&runner_ctx, block_idx, + img_in, txt_in, t_emb_in, pe, modulate_index); + + img_out = img_result; + txt_out = txt_result; + + ggml_build_forward_expand(gf, img_out); + ggml_build_forward_expand(gf, txt_out); + + return gf; + }; + + // Don't free compute buffer immediately - we need to read outputs first + if (!GGMLRunner::compute(get_block_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Block %d execution failed", block_idx); + return false; + } + + // Extract outputs to persistent storage + if (img_out && txt_out) { + ggml_backend_tensor_get(img_out, persistent_img, 0, 
persistent_img_count * sizeof(float)); + ggml_backend_tensor_get(txt_out, persistent_txt, 0, persistent_txt_count * sizeof(float)); + + for (int i = 0; i < 4; i++) { + img_ne[i] = img_out->ne[i]; + txt_ne[i] = txt_out->ne[i]; + } + } + + // Now safe to free compute buffer + free_compute_buffer(); + + // Resident blocks stay on GPU across sampling steps. + if (block_idx >= resident_transformer_blocks_) { + registry.move_layer_to_cpu(block_name); + } + + LOG_DEBUG("Block %d/%d done (%.2fms)", + block_idx + 1, num_layers, (ggml_time_ms() - t_block_start) / 1.0); + } + + LOG_DEBUG("Executing output stage"); + { + ggml_tensor* final_out = nullptr; + + auto get_output_graph = [&]() -> ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(QWEN_IMAGE_GRAPH_SIZE / 4); + + // Create input tensors + ggml_tensor* img_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, img_ne[0], img_ne[1], img_ne[2], img_ne[3]); + ggml_tensor* t_emb_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, t_emb_ne[0], t_emb_ne[1], t_emb_ne[2], t_emb_ne[3]); + + img_in = to_backend(img_in); + t_emb_in = to_backend(t_emb_in); + + set_backend_tensor_data(img_in, persistent_img); + set_backend_tensor_data(t_emb_in, persistent_t_emb); + + auto runner_ctx = get_context(); + final_out = qwen_image.forward_output_stage(&runner_ctx, img_in, t_emb_in, + img_tokens_count, orig_H, orig_W); + + ggml_build_forward_expand(gf, final_out); + + return gf; + }; + + if (!GGMLRunner::compute(get_output_graph, n_threads, true, output, output_ctx, true)) { + LOG_ERROR("Output stage failed"); + return false; + } + } + + int64_t t_end = ggml_time_ms(); + LOG_INFO("TRUE per-layer streaming completed in %.2fs (%d blocks)", + (t_end - t_start) / 1000.0, num_layers); + + return true; + } + + public: + + // Raw ggml_tensor* overload used by streaming code + ggml_cgraph* build_graph(ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + std::vector ref_latents = {}, + bool increase_ref_index = false) { + GGML_ASSERT(x->ne[3] == 1); + ggml_cgraph* gf = new_graph_custom(QWEN_IMAGE_GRAPH_SIZE); + + x = to_backend(x); + context = to_backend(context); + timesteps = to_backend(timesteps); + + for (size_t i = 0; i < ref_latents.size(); i++) { + ref_latents[i] = to_backend(ref_latents[i]); + } + + pe_vec = Rope::gen_qwen_image_pe(static_cast(x->ne[1]), + static_cast(x->ne[0]), + qwen_image_params.patch_size, + static_cast(x->ne[3]), + static_cast(context->ne[1]), + ref_latents, + increase_ref_index, + qwen_image_params.theta, + circular_y_enabled, + circular_x_enabled, + qwen_image_params.axes_dim); + int pos_len = static_cast(pe_vec.size() / qwen_image_params.axes_dim_sum / 2); + auto pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, qwen_image_params.axes_dim_sum / 2, pos_len); + set_backend_tensor_data(pe, pe_vec.data()); + + ggml_tensor* modulate_index = nullptr; + if (qwen_image_params.zero_cond_t) { + modulate_index_vec.clear(); + + int64_t h_len = ((x->ne[1] + (qwen_image_params.patch_size / 2)) / qwen_image_params.patch_size); + int64_t w_len = ((x->ne[0] + (qwen_image_params.patch_size / 2)) / qwen_image_params.patch_size); + int64_t num_img_tokens = h_len * w_len; + + modulate_index_vec.insert(modulate_index_vec.end(), num_img_tokens, 0.f); + int64_t num_ref_img_tokens = 0; + for (ggml_tensor* ref : ref_latents) { + int64_t rh_len = ((ref->ne[1] + (qwen_image_params.patch_size / 2)) / qwen_image_params.patch_size); + int64_t rw_len = ((ref->ne[0] + (qwen_image_params.patch_size / 2)) / qwen_image_params.patch_size); + + 
num_ref_img_tokens += rh_len * rw_len; + } + + if (num_ref_img_tokens > 0) { + modulate_index_vec.insert(modulate_index_vec.end(), num_ref_img_tokens, 1.f); + } + + modulate_index = ggml_new_tensor_1d(compute_ctx, GGML_TYPE_F32, modulate_index_vec.size()); + set_backend_tensor_data(modulate_index, modulate_index_vec.data()); + } + + auto runner_ctx = get_context(); + + ggml_tensor* out = qwen_image.forward(&runner_ctx, + x, + timesteps, + context, + pe, + ref_latents, + modulate_index); + + ggml_build_forward_expand(gf, out); + + return gf; + } + + // sd::Tensor overload - upstream public API ggml_cgraph* build_graph(const sd::Tensor& x_tensor, const sd::Tensor& timesteps_tensor, const sd::Tensor& context_tensor, @@ -608,6 +1178,27 @@ namespace Qwen { return gf; } + // Raw ggml_tensor* overload used by streaming code + bool compute(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + std::vector ref_latents = {}, + bool increase_ref_index = false, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr, + bool skip_param_offload = false) { + // x: [N, in_channels, h, w] + // timesteps: [N, ] + // context: [N, max_position, hidden_size] + auto get_graph = [&]() -> ggml_cgraph* { + return build_graph(x, timesteps, context, ref_latents, increase_ref_index); + }; + + return GGMLRunner::compute(get_graph, n_threads, false, output, output_ctx, skip_param_offload); + } + + // sd::Tensor overload - upstream public API sd::Tensor compute(int n_threads, const sd::Tensor& x, const sd::Tensor& timesteps, diff --git a/src/stable-diffusion.cpp b/src/stable-diffusion.cpp index fd439ff1d..b7e25a4d2 100644 --- a/src/stable-diffusion.cpp +++ b/src/stable-diffusion.cpp @@ -1,5 +1,10 @@ #include "ggml_extend.hpp" +#ifdef SD_USE_CUDA +#include "ggml-cuda.h" +#include +#endif + #include "model.h" #include "rng.hpp" #include "rng_mt19937.hpp" @@ -146,6 +151,11 @@ class StableDiffusionGGML { bool offload_params_to_cpu = false; float max_vram = 0.f; bool use_pmid = false; + sd_offload_config_t offload_config = {}; // Dynamic tensor offloading config + + // Track which components were intentionally kept on CPU (don't try to move to GPU) + bool cond_stage_on_cpu_only = false; // true if keep_clip_on_cpu was set + bool vae_on_cpu_only = false; // true if keep_vae_on_cpu was set bool is_using_v_parameterization = false; bool is_using_edm_v_parameterization = false; @@ -192,6 +202,31 @@ class StableDiffusionGGML { free_params_immediately = sd_ctx_params->free_params_immediately; offload_params_to_cpu = sd_ctx_params->offload_params_to_cpu; max_vram = sd_ctx_params->max_vram; + offload_config = sd_ctx_params->offload_config; + + // When the offload_config selects a cross-stage mode, also force the + // affected models onto the CPU backend so we can shuffle them between + // stages. offload_params_to_cpu remains the user-facing knob; this is + // an internal escalation when the config implies it. + bool cond_stage_offload_to_cpu = offload_params_to_cpu; + bool diffusion_offload_to_cpu = offload_params_to_cpu; + bool vae_offload_to_cpu = offload_params_to_cpu; + if (offload_config.mode != SD_OFFLOAD_NONE) { + if (offload_config.offload_cond_stage) { + cond_stage_offload_to_cpu = true; + } + // Diffusion CPU backend is needed even in cond_only mode so we + // can temporarily swap it out while loading cond_stage to GPU. + diffusion_offload_to_cpu = true; + } + // Layer streaming wants every MB it can get back during sampling, so + // give the VAE a CPU-pinned twin too. 
The VAE is idle for the entire + // sampler loop and only used at decode time — moving it to CPU between + // the two phases is pure win. Other offload modes keep current + // behaviour: VAE on whichever backend the user selected. + if (offload_config.mode == SD_OFFLOAD_LAYER_STREAMING) { + vae_offload_to_cpu = true; + } bool use_tae = false; @@ -376,6 +411,7 @@ class StableDiffusionGGML { } bool clip_on_cpu = sd_ctx_params->keep_clip_on_cpu; + cond_stage_on_cpu_only = clip_on_cpu; // Track for offload decisions const size_t max_graph_vram_bytes = max_vram <= 0.f ? 0 @@ -389,10 +425,10 @@ class StableDiffusionGGML { } if (sd_version_is_sd3(version)) { cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map); diffusion_model = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map); } else if (sd_version_is_flux(version)) { bool is_chroma = false; @@ -413,53 +449,53 @@ class StableDiffusionGGML { } cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map, sd_ctx_params->chroma_use_t5_mask, sd_ctx_params->chroma_t5_mask_pad); } else if (version == VERSION_OVIS_IMAGE) { cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map, version, "", false); } else { cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map); } diffusion_model = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map, version, sd_ctx_params->chroma_use_dit_mask); } else if (sd_version_is_flux2(version)) { bool is_chroma = false; cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map, version); diffusion_model = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map, version, sd_ctx_params->chroma_use_dit_mask); } else if (sd_version_is_wan(version)) { cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map, true, 0, true); diffusion_model = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map, "model.diffusion_model", version); if (strlen(SAFE_STR(sd_ctx_params->high_noise_diffusion_model_path)) > 0) { high_noise_diffusion_model = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map, "model.high_noise_diffusion_model", version); @@ -468,7 +504,7 @@ class StableDiffusionGGML { diffusion_model->get_desc() == "Wan2.1-FLF2V-14B" || diffusion_model->get_desc() == "Wan2.1-I2V-1.3B") { clip_vision = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map); clip_vision->set_max_graph_vram_bytes(max_graph_vram_bytes); clip_vision->alloc_params_buffer(); @@ -480,32 +516,32 @@ class StableDiffusionGGML { enable_vision = true; } cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map, version, "", enable_vision); diffusion_model = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map, "model.diffusion_model", version, sd_ctx_params->qwen_image_zero_cond_t); } else if (sd_version_is_anima(version)) { cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + 
cond_stage_offload_to_cpu, tensor_storage_map); diffusion_model = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map, "model.diffusion_model"); } else if (sd_version_is_z_image(version)) { cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map, version); diffusion_model = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map, "model.diffusion_model", version); @@ -525,20 +561,20 @@ class StableDiffusionGGML { } if (strstr(SAFE_STR(sd_ctx_params->photo_maker_path), "v2")) { cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map, embbeding_map, version, PM_VERSION_2); } else { cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map, embbeding_map, version); } diffusion_model = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map, version); if (sd_ctx_params->diffusion_conv_direct) { @@ -555,6 +591,26 @@ class StableDiffusionGGML { diffusion_model->alloc_params_buffer(); diffusion_model->get_param_tensors(tensors); + // Enable layer streaming if configured + if (offload_config.mode == SD_OFFLOAD_LAYER_STREAMING) { + LOG_INFO("Mode is layer_streaming, checking model support..."); + if (diffusion_model->supports_layer_streaming()) { + LOG_INFO("Enabling layer-by-layer streaming for diffusion model"); + LOG_INFO("Prefetch layers: %d, Min free VRAM: %.0f MB", + offload_config.streaming_prefetch_layers, + offload_config.streaming_min_free_vram / (1024.0 * 1024.0)); + diffusion_model->enable_layer_streaming( + offload_config.streaming_prefetch_layers, + offload_config.streaming_min_free_vram); + LOG_INFO("is_layer_streaming_enabled() = %s", + diffusion_model->is_layer_streaming_enabled() ? 
"true" : "false"); + } else { + LOG_WARN("Diffusion model does not support layer streaming, falling back to normal mode"); + } + } else { + LOG_DEBUG("Mode is not layer_streaming (mode=%d)", offload_config.mode); + } + if (sd_version_is_unet_edit(version)) { vae_decode_only = false; } @@ -565,6 +621,7 @@ class StableDiffusionGGML { high_noise_diffusion_model->get_param_tensors(tensors); } + vae_on_cpu_only = sd_ctx_params->keep_vae_on_cpu; // Track for offload decisions if (sd_ctx_params->keep_vae_on_cpu && !ggml_backend_is_cpu(backend)) { LOG_INFO("VAE Autoencoder: Using CPU backend"); vae_backend = ggml_backend_cpu_init(); @@ -577,7 +634,7 @@ class StableDiffusionGGML { sd_version_is_qwen_image(version) || sd_version_is_anima(version)) { return std::make_shared(vae_backend, - offload_params_to_cpu, + vae_offload_to_cpu, tensor_storage_map, "decoder", vae_decode_only, @@ -585,7 +642,7 @@ class StableDiffusionGGML { } else { auto model = std::make_shared(vae_backend, - offload_params_to_cpu, + vae_offload_to_cpu, tensor_storage_map, "decoder.layers", vae_decode_only, @@ -599,14 +656,14 @@ class StableDiffusionGGML { sd_version_is_qwen_image(version) || sd_version_is_anima(version)) { return std::make_shared(vae_backend, - offload_params_to_cpu, + vae_offload_to_cpu, tensor_storage_map, "first_stage_model", vae_decode_only, version); } else { auto model = std::make_shared(vae_backend, - offload_params_to_cpu, + vae_offload_to_cpu, tensor_storage_map, "first_stage_model", vae_decode_only, @@ -629,7 +686,7 @@ class StableDiffusionGGML { LOG_INFO("using FakeVAE"); first_stage_model = std::make_shared(version, vae_backend, - offload_params_to_cpu); + vae_offload_to_cpu); } else if (use_tae && !tae_preview_only) { LOG_INFO("using TAE for encoding / decoding"); first_stage_model = create_tae(); @@ -805,6 +862,53 @@ class StableDiffusionGGML { LOG_DEBUG("finished loaded file"); + // For layer streaming mode, offload all diffusion model layers to CPU immediately + // This frees VRAM for the LLM/CLIP during conditioning + // Layers will be loaded on-demand during streaming execution + if (offload_config.mode == SD_OFFLOAD_LAYER_STREAMING && + diffusion_model && diffusion_model->is_layer_streaming_enabled()) { + LOG_INFO("Offloading diffusion model layers to CPU for layer streaming"); + diffusion_model->offload_streaming_layers(); + } + + // When dynamic offloading is enabled and user didn't want clip on CPU, + // we forced CPU backend creation but now TRY to move params to GPU for execution. + // This gives us the best of both: fast GPU execution with ability to offload later. + // Skip if cond_stage was intentionally kept on CPU (keep_clip_on_cpu=true). 
+ if (offload_config.mode != SD_OFFLOAD_NONE && + offload_config.offload_cond_stage && + !cond_stage_on_cpu_only) { + // Disable automatic offloading - we control offload/reload timing explicitly + cond_stage_model->set_auto_offload(false); + + // Check if there's enough VRAM to load cond_stage now + // If not, keep it on CPU - it will be loaded on-demand before conditioning + size_t cond_stage_size = cond_stage_model->get_params_buffer_size(); + size_t free_vram = 0; +#ifdef SD_USE_CUDA + size_t total_vram = 0; + ggml_backend_cuda_get_device_memory(0, &free_vram, &total_vram); +#endif + // Need safety margin for compute buffers + size_t safety_margin = 500 * 1024 * 1024; + + if (free_vram >= cond_stage_size + safety_margin) { + LOG_WARN("Moving cond_stage params to GPU (%.2f MB free, %.2f MB needed)", + free_vram / (1024.0f * 1024.0f), cond_stage_size / (1024.0f * 1024.0f)); + if (cond_stage_model->move_params_to_gpu()) { + LOG_WARN("cond_stage now on GPU (%.2f MB), auto-offload disabled for explicit control", + cond_stage_model->get_params_vram_size() / (1024.0f * 1024.0f)); + } else { + // GPU allocation failed despite having enough reported free VRAM (fragmentation?) + // Keep on CPU - it will work, just with on-demand loading + LOG_WARN("cond_stage GPU allocation failed (fragmentation?), keeping on CPU for on-demand loading"); + } + } else { + LOG_WARN("Not enough VRAM for cond_stage at load time (%.2f MB free, %.2f MB needed), keeping on CPU for on-demand loading", + free_vram / (1024.0f * 1024.0f), cond_stage_size / (1024.0f * 1024.0f)); + } + } + { size_t clip_params_mem_size = cond_stage_model->get_params_buffer_size(); size_t unet_params_mem_size = diffusion_model->get_params_buffer_size(); @@ -1014,7 +1118,11 @@ class StableDiffusionGGML { is_high_noise = true; LOG_DEBUG("high noise lora: %s", lora_path.c_str()); } - auto lora = std::make_shared(lora_id, backend, lora_path, is_high_noise ? "model.high_noise_" : "", version); + // Enable CPU offload for LoRA when dynamic offloading is active + bool enable_lora_offload = (offload_config.mode != SD_OFFLOAD_NONE); + auto lora = std::make_shared(lora_id, backend, lora_path, + is_high_noise ? 
"model.high_noise_" : "", + version, enable_lora_offload); if (!lora->load_from_file(n_threads, lora_tensor_filter)) { LOG_WARN("load lora tensors from %s failed", lora_path.c_str()); return nullptr; @@ -1691,7 +1799,7 @@ class StableDiffusionGGML { return std::move(cached_output); } - auto output_opt = work_diffusion_model->compute(n_threads, diffusion_params); + auto output_opt = work_diffusion_model->compute_dispatch(n_threads, diffusion_params); if (output_opt.empty()) { LOG_ERROR("diffusion model compute failed"); return sd::Tensor(); @@ -1885,6 +1993,352 @@ class StableDiffusionGGML { return latents; } + // Estimate VRAM needed for VAE decode operation (formula-based) + size_t estimate_vae_decode_vram(int width, int height) { + if (first_stage_model == nullptr) { + return static_cast(width) * height * 12; + } + size_t vae_weights = first_stage_model->get_params_buffer_size(); + size_t compute_estimate = static_cast(width) * height * 48; + return vae_weights + compute_estimate; + } + + // Smart offload before VAE decode - only offload what's needed + bool smart_offload_for_vae(int width, int height, bool decode_video = false) { + if (offload_config.mode == SD_OFFLOAD_NONE) { + return false; + } + + // In layer_streaming mode, skip smart offload for diffusion model + if (offload_config.mode == SD_OFFLOAD_LAYER_STREAMING) { + if (offload_config.offload_cond_stage && cond_stage_model && cond_stage_model->is_params_on_gpu()) { + if (offload_config.log_offload_events) { + LOG_INFO("Layer streaming: moving cond_stage to CPU for VAE decode"); + } + cond_stage_model->move_params_to_cpu(); + return true; + } + return false; + } + + size_t vae_vram_needed = estimate_vae_decode_vram(width, height); + + size_t target_free = offload_config.target_free_vram; + size_t vram_to_free = vae_vram_needed > target_free ? 0 : vae_vram_needed; + + size_t cond_vram = 0; + size_t diffusion_vram = 0; + bool cond_on_gpu = cond_stage_model && cond_stage_model->is_params_on_gpu(); + bool diffusion_on_gpu = diffusion_model && diffusion_model->is_params_on_gpu(); + + if (cond_on_gpu) { + cond_vram = cond_stage_model->get_params_buffer_size(); + } + if (diffusion_on_gpu) { + diffusion_vram = diffusion_model->get_params_buffer_size(); + } + + bool offloaded_anything = false; + + if (offload_config.offload_cond_stage && cond_on_gpu && cond_vram >= offload_config.min_offload_size) { + if (offload_config.log_offload_events) { + LOG_INFO("Smart offload: moving cond_stage to CPU (%.2f MB) for VAE decode", + cond_vram / (1024.0f * 1024.0f)); + } + cond_stage_model->move_params_to_cpu(); + offloaded_anything = true; + vram_to_free = (vram_to_free > cond_vram) ? 
vram_to_free - cond_vram : 0; + } + + if (offload_config.offload_diffusion && diffusion_on_gpu && vram_to_free > 0 && + diffusion_vram >= offload_config.min_offload_size) { + if (offload_config.log_offload_events) { + LOG_INFO("Smart offload: moving diffusion to CPU (%.2f MB) for VAE decode", + diffusion_vram / (1024.0f * 1024.0f)); + } + diffusion_model->move_params_to_cpu(); + offloaded_anything = true; + } + + return offloaded_anything; + } + + // Smart offload before VAE encode - only offload what's needed + bool smart_offload_for_vae_encode(int width, int height) { + if (offload_config.mode == SD_OFFLOAD_NONE) { + return false; + } + + if (offload_config.mode == SD_OFFLOAD_LAYER_STREAMING) { + bool offloaded = false; + + if (offload_config.offload_cond_stage && cond_stage_model && cond_stage_model->is_params_on_gpu()) { + if (offload_config.log_offload_events) { + LOG_INFO("Layer streaming: moving cond_stage to CPU for VAE encode"); + } + cond_stage_model->move_params_to_cpu(); + offloaded = true; + } + + if (offload_config.offload_diffusion && diffusion_model && diffusion_model->is_params_on_gpu()) { + if (offload_config.log_offload_events) { + LOG_INFO("Layer streaming: moving diffusion to CPU for VAE encode"); + } + diffusion_model->move_params_to_cpu(); + offloaded = true; + } + + return offloaded; + } + + size_t vae_vram_needed = 0; + if (first_stage_model == nullptr) { + vae_vram_needed = static_cast(width) * height * 12; + } else { + size_t vae_weights = first_stage_model->get_params_buffer_size(); + size_t compute_estimate = static_cast(width) * height * 40; + vae_vram_needed = vae_weights + compute_estimate; + } + + size_t target_free = offload_config.target_free_vram; + size_t vram_to_free = vae_vram_needed > target_free ? 0 : vae_vram_needed; + + size_t cond_vram = 0; + size_t diffusion_vram = 0; + bool cond_on_gpu = cond_stage_model && cond_stage_model->is_params_on_gpu(); + bool diffusion_on_gpu = diffusion_model && diffusion_model->is_params_on_gpu(); + + if (cond_on_gpu) { + cond_vram = cond_stage_model->get_params_buffer_size(); + } + if (diffusion_on_gpu) { + diffusion_vram = diffusion_model->get_params_buffer_size(); + } + + bool offloaded_anything = false; + + if (offload_config.offload_cond_stage && cond_on_gpu && cond_vram >= offload_config.min_offload_size) { + if (offload_config.log_offload_events) { + LOG_INFO("Smart offload: moving cond_stage to CPU (%.2f MB) for VAE encode", + cond_vram / (1024.0f * 1024.0f)); + } + cond_stage_model->move_params_to_cpu(); + offloaded_anything = true; + vram_to_free = (vram_to_free > cond_vram) ? 
vram_to_free - cond_vram : 0; + } + + if (offload_config.offload_diffusion && diffusion_on_gpu && vram_to_free > 0 && + diffusion_vram >= offload_config.min_offload_size) { + if (offload_config.log_offload_events) { + LOG_INFO("Smart offload: moving diffusion to CPU (%.2f MB) for VAE encode", + diffusion_vram / (1024.0f * 1024.0f)); + } + diffusion_model->move_params_to_cpu(); + offloaded_anything = true; + } + + return offloaded_anything; + } + + // Get current free VRAM on the primary GPU + size_t get_free_vram() { + size_t free_vram = 0; +#ifdef SD_USE_CUDA + size_t total_vram = 0; + ggml_backend_cuda_get_device_memory(0, &free_vram, &total_vram); +#endif + return free_vram; + } + + // Estimate VRAM needed for diffusion sampling + size_t estimate_diffusion_vram(int width, int height) { + if (!diffusion_model) { + return 0; + } + size_t params_size = diffusion_model->get_params_buffer_size(); + int latent_w = width / get_vae_scale_factor(); + int latent_h = height / get_vae_scale_factor(); + size_t compute_estimate = latent_w * latent_h * 64; + return params_size + compute_estimate; + } + + // Smart check: Should we offload cond_stage after conditioning? + bool should_offload_cond_stage_for_diffusion(int width, int height) { + if (offload_config.mode == SD_OFFLOAD_NONE || !offload_config.offload_cond_stage) { + return false; + } + if (!cond_stage_model || !cond_stage_model->is_params_on_gpu()) { + return false; + } + + if (offload_config.mode == SD_OFFLOAD_LAYER_STREAMING) { + LOG_INFO("Layer streaming mode: will offload cond_stage to free VRAM for layer loading"); + return true; + } + + size_t cond_stage_vram = cond_stage_model->get_params_vram_size(); + if (cond_stage_vram < offload_config.min_offload_size) { + return false; + } + + size_t free_vram = get_free_vram(); + size_t diffusion_needs = estimate_diffusion_vram(width, height); + size_t safety_margin = 300 * 1024 * 1024; + + bool vram_is_tight = free_vram < (diffusion_needs + safety_margin); + + if (offload_config.log_offload_events) { + LOG_INFO("Smart check (cond->diffusion): free=%.2f MB, diffusion_needs=%.2f MB, cond_stage=%.2f MB, tight=%s", + free_vram / (1024.0f * 1024.0f), + diffusion_needs / (1024.0f * 1024.0f), + cond_stage_vram / (1024.0f * 1024.0f), + vram_is_tight ? "yes" : "no"); + } + + return vram_is_tight; + } + + // Smart check: Should we offload diffusion after sampling? + bool should_offload_diffusion_for_vae(int width, int height) { + if (offload_config.mode != SD_OFFLOAD_AGGRESSIVE && + offload_config.mode != SD_OFFLOAD_COND_DIFFUSION) { + return false; + } + if (!offload_config.offload_diffusion) { + return false; + } + if (!diffusion_model || !diffusion_model->is_params_on_gpu()) { + return false; + } + + size_t diffusion_vram = diffusion_model->get_params_vram_size(); + if (diffusion_vram < offload_config.min_offload_size) { + return false; + } + + size_t free_vram = get_free_vram(); + size_t vae_needs = estimate_vae_decode_vram(width, height); + size_t safety_margin = 300 * 1024 * 1024; + + bool vram_is_tight = free_vram < (vae_needs + safety_margin); + + if (offload_config.log_offload_events) { + LOG_INFO("Smart check (diffusion->VAE): free=%.2f MB, vae_needs=%.2f MB, diffusion=%.2f MB, tight=%s", + free_vram / (1024.0f * 1024.0f), + vae_needs / (1024.0f * 1024.0f), + diffusion_vram / (1024.0f * 1024.0f), + vram_is_tight ? 
"yes" : "no"); + } + + return vram_is_tight; + } + + // Offload conditioners to CPU after conditioning phase + void offload_conditioners() { + if (offload_config.offload_cond_stage && cond_stage_model && cond_stage_model->is_params_on_gpu()) { + cond_stage_model->move_params_to_cpu(); + } + } + + // Offload diffusion model to CPU after sampling phase + void offload_diffusion_model() { + if (offload_config.offload_diffusion && diffusion_model && diffusion_model->is_params_on_gpu()) { + diffusion_model->move_params_to_cpu(); + } + } + + // Park the VAE on CPU pinned memory while diffusion samples. The VAE is + // idle for the entire sampler loop and only used at decode time, so its + // VRAM footprint is wasted during streaming. Reloads automatically on the + // next decode call via the runner's compute path. Only effective when the + // VAE was constructed with a CPU-pinned twin (vae_offload_to_cpu == true, + // which we escalate under SD_OFFLOAD_LAYER_STREAMING). + bool offload_vae_for_streaming() { + if (offload_config.mode != SD_OFFLOAD_LAYER_STREAMING) return false; + if (!first_stage_model || !first_stage_model->is_params_on_gpu()) return false; + size_t vae_vram = first_stage_model->get_params_vram_size(); + if (!first_stage_model->move_params_to_cpu()) { + return false; + } + if (offload_config.log_offload_events) { + LOG_INFO("Layer streaming: parked VAE on CPU pinned (%.2f MB)", + vae_vram / (1024.0 * 1024.0)); + } + return true; + } + + // Reload diffusion model to GPU before sampling + bool reload_diffusion_model() { + if (diffusion_model && !diffusion_model->is_params_on_gpu()) { + return diffusion_model->move_params_to_gpu(); + } + return true; + } + + // Reload cond_stage model to GPU before conditioning + bool reload_cond_stage_model() { + if (cond_stage_model && !cond_stage_model->is_params_on_gpu()) { + return cond_stage_model->move_params_to_gpu(); + } + return true; + } + + // Post-generation reload of models to GPU + void post_generation_reload() { + if (offload_config.mode == SD_OFFLOAD_NONE || free_params_immediately) { + return; + } + + int64_t reload_start = ggml_time_ms(); + bool reloaded_any = false; + + // Reload diffusion if configured (skip for layer_streaming) + if (offload_config.reload_diffusion && + offload_config.mode != SD_OFFLOAD_LAYER_STREAMING && + diffusion_model && !diffusion_model->is_params_on_gpu()) { + if (offload_config.log_offload_events) { + LOG_WARN("Reloading diffusion to GPU after generation..."); + } + if (diffusion_model->move_params_to_gpu()) { + if (offload_config.log_offload_events) { + LOG_WARN("diffusion reloaded to GPU (%.2f MB)", + diffusion_model->get_params_vram_size() / (1024.0f * 1024.0f)); + } + reloaded_any = true; + } else { + LOG_WARN("Failed to reload diffusion to GPU - will load on-demand"); + } + } + + // Reload cond_stage if configured and enough VRAM + if (offload_config.reload_cond_stage && + cond_stage_model && !cond_stage_model->is_params_on_gpu()) { + size_t cond_stage_size = cond_stage_model->get_params_buffer_size(); + size_t free_vram = get_free_vram(); + size_t safety_margin = 500 * 1024 * 1024; + + if (free_vram >= cond_stage_size + safety_margin) { + if (offload_config.log_offload_events) { + LOG_WARN("Reloading cond_stage to GPU after generation..."); + } + if (cond_stage_model->move_params_to_gpu()) { + if (offload_config.log_offload_events) { + LOG_WARN("cond_stage reloaded to GPU (%.2f MB)", + cond_stage_model->get_params_vram_size() / (1024.0f * 1024.0f)); + } + reloaded_any = true; + } + } else if 
(offload_config.log_offload_events) { + LOG_WARN("Not enough VRAM to reload cond_stage - will load on-demand"); + } + } + + if (reloaded_any && offload_config.log_offload_events) { + int64_t reload_end = ggml_time_ms(); + LOG_WARN("Post-generation reload completed in %" PRId64 " ms", reload_end - reload_start); + } + } + sd::Tensor decode_first_stage(const sd::Tensor& x, bool decode_video = false) { auto latents = first_stage_model->diffusion_to_vae_latents(x); return first_stage_model->decode(n_threads, latents, vae_tiling_params, decode_video, circular_x, circular_y); @@ -2083,6 +2537,63 @@ enum lora_apply_mode_t str_to_lora_apply_mode(const char* str) { return LORA_APPLY_MODE_COUNT; } +const char* offload_mode_to_str[] = { + "none", + "cond_only", + "cond_diffusion", + "aggressive", + "layer_streaming", +}; + +const char* sd_offload_mode_name(enum sd_offload_mode_t mode) { + if (mode < SD_OFFLOAD_MODE_COUNT) { + return offload_mode_to_str[mode]; + } + return NONE_STR; +} + +enum sd_offload_mode_t str_to_offload_mode(const char* str) { + for (int i = 0; i < SD_OFFLOAD_MODE_COUNT; i++) { + if (!strcmp(str, offload_mode_to_str[i])) { + return (enum sd_offload_mode_t)i; + } + } + return SD_OFFLOAD_MODE_COUNT; +} + +const char* vram_estimation_to_str[] = { + "dryrun", + "formula", +}; + +const char* sd_vram_estimation_name(enum sd_vram_estimation_t method) { + if (method < SD_VRAM_EST_COUNT) { + return vram_estimation_to_str[method]; + } + return NONE_STR; +} + +enum sd_vram_estimation_t str_to_vram_estimation(const char* str) { + for (int i = 0; i < SD_VRAM_EST_COUNT; i++) { + if (!strcmp(str, vram_estimation_to_str[i])) { + return (enum sd_vram_estimation_t)i; + } + } + return SD_VRAM_EST_COUNT; +} + +void sd_offload_config_init(sd_offload_config_t* config) { + config->mode = SD_OFFLOAD_NONE; + config->vram_estimation = SD_VRAM_EST_DRYRUN; // Dry-run is default (accurate) + config->offload_cond_stage = true; + config->offload_diffusion = false; + config->reload_cond_stage = false; + config->reload_diffusion = true; // Default: reload diffusion for next generation + config->log_offload_events = true; + config->min_offload_size = 0; + config->target_free_vram = 2ULL * 1024 * 1024 * 1024; // 2 GB +} + const char* hires_upscaler_to_str[] = { "None", "Latent", @@ -2175,6 +2686,17 @@ void sd_ctx_params_init(sd_ctx_params_t* sd_ctx_params) { sd_ctx_params->chroma_use_dit_mask = true; sd_ctx_params->chroma_use_t5_mask = false; sd_ctx_params->chroma_t5_mask_pad = 1; + // flow_shift moved out of sd_ctx_params_t in upstream master into + // sd_sample_params_t; sd_sample_params_init() initialises it there. 
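+    // Illustrative caller-side override (a sketch against the fields set
+    // below, not an API guarantee):
+    //   sd_ctx_params_t p;
+    //   sd_ctx_params_init(&p);
+    //   p.offload_config.mode = SD_OFFLOAD_LAYER_STREAMING;
+    //   p.offload_config.target_free_vram = 1ULL << 30; // keep ~1 GB free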
+ + // Dynamic tensor offloading defaults (disabled) + sd_ctx_params->offload_config.mode = SD_OFFLOAD_NONE; + sd_ctx_params->offload_config.offload_cond_stage = true; + sd_ctx_params->offload_config.offload_diffusion = false; + sd_ctx_params->offload_config.reload_cond_stage = false; // Let on-demand reload handle it (safer) + sd_ctx_params->offload_config.log_offload_events = true; + sd_ctx_params->offload_config.min_offload_size = 0; // No minimum - offload any size + sd_ctx_params->offload_config.target_free_vram = 2ULL * 1024 * 1024 * 1024; // 2 GB target for VAE } char* sd_ctx_params_to_str(const sd_ctx_params_t* sd_ctx_params) { @@ -2490,6 +3012,15 @@ enum scheduler_t sd_get_default_scheduler(const sd_ctx_t* sd_ctx, enum sample_me return DISCRETE_SCHEDULER; } +const char* sd_get_model_version_name(const sd_ctx_t* sd_ctx) { + if (sd_ctx != nullptr && sd_ctx->sd != nullptr) { + if (sd_ctx->sd->version < VERSION_COUNT) { + return model_version_to_str[sd_ctx->sd->version]; + } + } + return "Unknown"; +} + static int64_t resolve_seed(int64_t seed) { if (seed >= 0) { return seed; @@ -2977,6 +3508,8 @@ static std::optional prepare_image_generation_latents(sd if (init_image_tensor.empty()) { init_latent = sd_ctx->sd->generate_init_latent(request->width, request->height); } else { + // Smart offload before VAE encode to free VRAM + sd_ctx->sd->smart_offload_for_vae_encode(request->width, request->height); init_latent = sd_ctx->sd->encode_first_stage(init_image_tensor); if (init_latent.empty()) { LOG_ERROR("failed to encode init image"); @@ -3171,6 +3704,17 @@ static std::optional prepare_image_generation_embeds(sd_c sd_ctx->sd->cond_stage_model->free_params_buffer(); } + // Smart offload: move cond_stage to CPU if VRAM is tight for diffusion sampling + if (!sd_ctx->sd->free_params_immediately && + sd_ctx->sd->should_offload_cond_stage_for_diffusion(request->width, request->height)) { + sd_ctx->sd->offload_conditioners(); + } + + // Layer-streaming companion: free the VAE's VRAM for the sampler loop. + // It's only needed at decode time, which reloads it via the runner's + // normal compute path. 
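+    // The saving is typically a few hundred MB for SD-class VAEs (an
+    // estimate; varies by model and precision): small next to the diffusion
+    // weights, but meaningful when per-layer streaming is already short on
+    // headroom.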
+ sd_ctx->sd->offload_vae_for_streaming(); + ImageGenerationEmbeds embeds; if (request->use_img_cond) { embeds.img_cond = SDCondition(uncond.c_crossattn, uncond.c_vector, cond.c_concat); @@ -3189,6 +3733,15 @@ static sd_image_t* decode_image_outputs(sd_ctx_t* sd_ctx, LOG_ERROR("expected %d latents, got %zu", request.batch_count, final_latents.size()); return nullptr; } + // Smart offload before VAE decode + sd_ctx->sd->smart_offload_for_vae(request.width, request.height); + + // For layer_streaming mode: offload streaming layers before VAE decode + if (sd_ctx->sd->offload_config.mode == SD_OFFLOAD_LAYER_STREAMING && + sd_ctx->sd->diffusion_model && sd_ctx->sd->diffusion_model->is_layer_streaming_enabled()) { + sd_ctx->sd->diffusion_model->offload_streaming_layers(); + } + LOG_INFO("decoding %zu latents", final_latents.size()); std::vector> decoded_images; int64_t t0 = ggml_time_ms(); @@ -3369,6 +3922,16 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s sd_ctx->sd->rng->manual_seed(request.seed); sd_ctx->sd->sampler_rng->manual_seed(request.seed); sd_ctx->sd->set_flow_shift(sd_img_gen_params->sample_params.flow_shift); + + // When offload mode is enabled and we have LoRAs, offload cond_stage first to free VRAM + if (sd_ctx->sd->offload_config.mode != SD_OFFLOAD_NONE && + sd_ctx->sd->offload_config.offload_cond_stage && + sd_img_gen_params->lora_count > 0 && + sd_ctx->sd->cond_stage_model && sd_ctx->sd->cond_stage_model->is_params_on_gpu()) { + LOG_WARN("Offloading cond_stage before LoRA application to free VRAM"); + sd_ctx->sd->offload_conditioners(); + } + sd_ctx->sd->apply_loras(sd_img_gen_params->loras, sd_img_gen_params->lora_count); ImageVaeAxesGuard axes_guard(sd_ctx, sd_img_gen_params, request); @@ -3393,6 +3956,16 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s } ImageGenerationEmbeds embeds = std::move(*embeds_opt); + // Ensure diffusion model is on GPU before sampling (may have been offloaded for cond_stage) + // Skip for layer_streaming - streaming engine loads layers individually + if (sd_ctx->sd->offload_config.mode != SD_OFFLOAD_NONE && + sd_ctx->sd->offload_config.mode != SD_OFFLOAD_LAYER_STREAMING) { + if (!sd_ctx->sd->reload_diffusion_model()) { + LOG_ERROR("Failed to reload diffusion model to GPU for sampling"); + return nullptr; + } + } + std::vector> final_latents; int64_t denoise_start = ggml_time_ms(); for (int b = 0; b < request.batch_count; b++) { @@ -3438,6 +4011,18 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s b + 1, request.batch_count, (sampling_end - sampling_start) * 1.0f / 1000); + // Mid-stream failures (e.g. compute-buffer cudaMalloc OOM at layer N) + // leave the streaming engine's resident layers + warm cache GPU-resident + // — the success path's offload_streaming_layers() at the end of + // sampling never runs. Without this eviction, the next job starts on a + // GPU that's already 8-9 GB full from the previous failed run and + // typically hits the same OOM. The swap is cheap (each layer's CPU + // pinned twin already exists) so freeing them is just pointer swaps. 
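+            // Expected post-condition (inferred from the streaming engine's
+            // contract, stated here as an assumption): after this eviction
+            // the registry reports no GPU-resident "transformer_blocks.*",
+            // so the next run's VRAM budget analysis starts from the same
+            // baseline as a fresh process.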
+ if (sd_ctx->sd->offload_config.mode == SD_OFFLOAD_LAYER_STREAMING && + sd_ctx->sd->diffusion_model && + sd_ctx->sd->diffusion_model->is_layer_streaming_enabled()) { + sd_ctx->sd->diffusion_model->offload_streaming_layers(); + } if (sd_ctx->sd->free_params_immediately) { sd_ctx->sd->diffusion_model->free_params_buffer(); } @@ -3451,6 +4036,12 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s final_latents.size(), (denoise_end - denoise_start) * 1.0f / 1000); + // Smart offload: move diffusion to CPU if VRAM is tight for VAE decode + if (!sd_ctx->sd->free_params_immediately && + sd_ctx->sd->should_offload_diffusion_for_vae(request.width, request.height)) { + sd_ctx->sd->offload_diffusion_model(); + } + if (request.hires.enabled && request.hires.target_width > 0) { LOG_INFO("hires fix: upscaling to %dx%d", request.hires.target_width, request.hires.target_height); @@ -3566,6 +4157,11 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s b + 1, (int)final_latents.size(), (hires_sample_end - hires_sample_start) * 1.0f / 1000); + if (sd_ctx->sd->offload_config.mode == SD_OFFLOAD_LAYER_STREAMING && + sd_ctx->sd->diffusion_model && + sd_ctx->sd->diffusion_model->is_layer_streaming_enabled()) { + sd_ctx->sd->diffusion_model->offload_streaming_layers(); + } if (sd_ctx->sd->free_params_immediately) { sd_ctx->sd->diffusion_model->free_params_buffer(); } @@ -3587,6 +4183,9 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s sd_ctx->sd->lora_stat(); + // Post-generation reload of models to GPU + sd_ctx->sd->post_generation_reload(); + int64_t t1 = ggml_time_ms(); LOG_INFO("generate_image completed in %.2fs", (t1 - t0) * 1.0f / 1000); return result; @@ -3656,6 +4255,9 @@ static std::optional prepare_video_generation_latents(sd sd::ops::slice_assign(&image, 2, request->frames - 1, request->frames, end_image.unsqueeze(2)); } + // Smart offload before VAE encode to free VRAM + sd_ctx->sd->smart_offload_for_vae_encode(request->width, request->height); + auto concat_latent = sd_ctx->sd->encode_first_stage(image); // [b, c, t, h/vae_scale_factor, w/vae_scale_factor] if (concat_latent.empty()) { LOG_ERROR("failed to encode video conditioning frames"); @@ -3705,6 +4307,9 @@ static std::optional prepare_video_generation_latents(sd int64_t t1 = ggml_time_ms(); sd::Tensor ref_image_latent; if (!start_image.empty()) { + // Smart offload before VAE encode to free VRAM + sd_ctx->sd->smart_offload_for_vae_encode(request->width, request->height); + auto ref_img = start_image.reshape({start_image.shape()[0], start_image.shape()[1], 1, start_image.shape()[2], 1}); auto encoded_ref = sd_ctx->sd->encode_first_stage(ref_img); // [b, c, 1, h/vae_scale_factor, w/vae_scale_factor] if (encoded_ref.empty()) { @@ -3727,6 +4332,9 @@ static std::optional prepare_video_generation_latents(sd sd::Tensor inactive = control_video * (1.0f - mask) + 0.5f; sd::Tensor reactive = control_video * mask + 0.5f; + // Smart offload before VAE encode to free VRAM + sd_ctx->sd->smart_offload_for_vae_encode(request->width, request->height); + inactive = sd_ctx->sd->encode_first_stage(inactive); // [b, c, t, h/vae_scale_factor, w/vae_scale_factor] if (inactive.empty()) { LOG_ERROR("failed to encode VACE inactive context"); @@ -3786,6 +4394,14 @@ static ImageGenerationEmbeds prepare_video_generation_embeds(sd_ctx_t* sd_ctx, const sd_vid_gen_params_t* sd_vid_gen_params, const GenerationRequest& request, const ImageGenerationLatents& latents) { + // On-demand 
GPU reload for cond_stage before conditioning + if (sd_ctx->sd->offload_config.mode != SD_OFFLOAD_NONE && + sd_ctx->sd->offload_config.offload_cond_stage && + !sd_ctx->sd->free_params_immediately && + !sd_ctx->sd->cond_stage_on_cpu_only) { + sd_ctx->sd->reload_cond_stage_model(); + } + ImageGenerationEmbeds embeds; ConditionerParams condition_params; condition_params.clip_skip = request.clip_skip; @@ -3811,6 +4427,13 @@ static ImageGenerationEmbeds prepare_video_generation_embeds(sd_ctx_t* sd_ctx, if (sd_ctx->sd->free_params_immediately) { sd_ctx->sd->cond_stage_model->free_params_buffer(); } + + // Smart offload: move cond_stage to CPU if VRAM is tight for diffusion sampling + if (!sd_ctx->sd->free_params_immediately && + sd_ctx->sd->should_offload_cond_stage_for_diffusion(request.width, request.height)) { + sd_ctx->sd->offload_conditioners(); + } + return embeds; } @@ -3821,6 +4444,16 @@ static sd_image_t* decode_video_outputs(sd_ctx_t* sd_ctx, LOG_ERROR("no latent video to decode"); return nullptr; } + + // Smart offload before VAE decode + sd_ctx->sd->smart_offload_for_vae(0, 0, true); + + // For layer_streaming mode: offload streaming layers before VAE decode + if (sd_ctx->sd->offload_config.mode == SD_OFFLOAD_LAYER_STREAMING && + sd_ctx->sd->diffusion_model && sd_ctx->sd->diffusion_model->is_layer_streaming_enabled()) { + sd_ctx->sd->diffusion_model->offload_streaming_layers(); + } + int64_t t4 = ggml_time_ms(); sd::Tensor vid = sd_ctx->sd->decode_first_stage(final_latent, true); int64_t t5 = ggml_time_ms(); @@ -3919,6 +4552,11 @@ SD_API sd_image_t* generate_video(sd_ctx_t* sd_ctx, const sd_vid_gen_params_t* s int64_t sampling_end = ggml_time_ms(); if (x_t_sampled.empty()) { LOG_ERROR("sampling(high noise) failed after %.2fs", (sampling_end - sampling_start) * 1.0f / 1000); + if (sd_ctx->sd->offload_config.mode == SD_OFFLOAD_LAYER_STREAMING && + sd_ctx->sd->high_noise_diffusion_model && + sd_ctx->sd->high_noise_diffusion_model->is_layer_streaming_enabled()) { + sd_ctx->sd->high_noise_diffusion_model->offload_streaming_layers(); + } if (sd_ctx->sd->free_params_immediately) { sd_ctx->sd->high_noise_diffusion_model->free_params_buffer(); } @@ -3965,10 +4603,21 @@ SD_API sd_image_t* generate_video(sd_ctx_t* sd_ctx, const sd_vid_gen_params_t* s } if (final_latent.empty()) { LOG_ERROR("sampling failed after %.2fs", (sampling_end - sampling_start) * 1.0f / 1000); + if (sd_ctx->sd->offload_config.mode == SD_OFFLOAD_LAYER_STREAMING && + sd_ctx->sd->diffusion_model && + sd_ctx->sd->diffusion_model->is_layer_streaming_enabled()) { + sd_ctx->sd->diffusion_model->offload_streaming_layers(); + } return nullptr; } LOG_INFO("sampling completed, taking %.2fs", (sampling_end - sampling_start) * 1.0f / 1000); + // Smart offload: move diffusion to CPU if VRAM is tight for VAE decode + if (!sd_ctx->sd->free_params_immediately && + sd_ctx->sd->should_offload_diffusion_for_vae(request.width, request.height)) { + sd_ctx->sd->offload_diffusion_model(); + } + if (latents.ref_image_num > 0) { final_latent = sd::ops::slice(final_latent, 2, latents.ref_image_num, final_latent.shape()[2]); } @@ -3983,7 +4632,336 @@ SD_API sd_image_t* generate_video(sd_ctx_t* sd_ctx, const sd_vid_gen_params_t* s sd_ctx->sd->lora_stat(); + // Post-generation reload of models to GPU + sd_ctx->sd->post_generation_reload(); + int64_t t1 = ggml_time_ms(); LOG_INFO("generate_video completed in %.2fs", (t1 - t0) * 1.0f / 1000); return result; } + +/*================================================ Dynamic Tensor Offloading API 
================================================*/ + +static const char* component_names[] = { + "cond_stage", // SD_COMPONENT_COND_STAGE + "clip_vision", // SD_COMPONENT_CLIP_VISION + "diffusion", // SD_COMPONENT_DIFFUSION + "vae", // SD_COMPONENT_VAE + "control_net", // SD_COMPONENT_CONTROL_NET + "pmid", // SD_COMPONENT_PMID +}; + +const char* sd_component_name(sd_component_t component) { + if (component >= 0 && component < SD_COMPONENT_COUNT) { + return component_names[component]; + } + return "unknown"; +} + +bool sd_offload_to_cpu(sd_ctx_t* sd_ctx, sd_component_t component) { + if (sd_ctx == nullptr || sd_ctx->sd == nullptr) { + return false; + } + + bool success = false; + switch (component) { + case SD_COMPONENT_COND_STAGE: + if (sd_ctx->sd->cond_stage_model) { + success = sd_ctx->sd->cond_stage_model->move_params_to_cpu(); + if (success) { + LOG_INFO("Offloaded %s to CPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_CLIP_VISION: + if (sd_ctx->sd->clip_vision) { + success = sd_ctx->sd->clip_vision->move_params_to_cpu(); + if (success) { + LOG_INFO("Offloaded %s to CPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_DIFFUSION: + if (sd_ctx->sd->diffusion_model) { + success = sd_ctx->sd->diffusion_model->move_params_to_cpu(); + if (success) { + LOG_INFO("Offloaded %s to CPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_VAE: + if (sd_ctx->sd->first_stage_model) { + success = sd_ctx->sd->first_stage_model->move_params_to_cpu(); + if (success) { + LOG_INFO("Offloaded %s to CPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_CONTROL_NET: + if (sd_ctx->sd->control_net) { + success = sd_ctx->sd->control_net->move_params_to_cpu(); + if (success) { + LOG_INFO("Offloaded %s to CPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_PMID: + if (sd_ctx->sd->pmid_model) { + success = sd_ctx->sd->pmid_model->move_params_to_cpu(); + if (success) { + LOG_INFO("Offloaded %s to CPU", sd_component_name(component)); + } + } + break; + default: + LOG_WARN("Unknown component: %d", component); + break; + } + return success; +} + +bool sd_reload_to_gpu(sd_ctx_t* sd_ctx, sd_component_t component) { + if (sd_ctx == nullptr || sd_ctx->sd == nullptr) { + return false; + } + + bool success = false; + switch (component) { + case SD_COMPONENT_COND_STAGE: + if (sd_ctx->sd->cond_stage_model) { + success = sd_ctx->sd->cond_stage_model->move_params_to_gpu(); + if (success) { + LOG_INFO("Reloaded %s to GPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_CLIP_VISION: + if (sd_ctx->sd->clip_vision) { + success = sd_ctx->sd->clip_vision->move_params_to_gpu(); + if (success) { + LOG_INFO("Reloaded %s to GPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_DIFFUSION: + if (sd_ctx->sd->diffusion_model) { + success = sd_ctx->sd->diffusion_model->move_params_to_gpu(); + if (success) { + LOG_INFO("Reloaded %s to GPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_VAE: + if (sd_ctx->sd->first_stage_model) { + success = sd_ctx->sd->first_stage_model->move_params_to_gpu(); + if (success) { + LOG_INFO("Reloaded %s to GPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_CONTROL_NET: + if (sd_ctx->sd->control_net) { + success = sd_ctx->sd->control_net->move_params_to_gpu(); + if (success) { + LOG_INFO("Reloaded %s to GPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_PMID: + if (sd_ctx->sd->pmid_model) { + success = 
sd_ctx->sd->pmid_model->move_params_to_gpu(); + if (success) { + LOG_INFO("Reloaded %s to GPU", sd_component_name(component)); + } + } + break; + default: + LOG_WARN("Unknown component: %d", component); + break; + } + return success; +} + +bool sd_is_on_gpu(sd_ctx_t* sd_ctx, sd_component_t component) { + if (sd_ctx == nullptr || sd_ctx->sd == nullptr) { + return false; + } + + switch (component) { + case SD_COMPONENT_COND_STAGE: + if (sd_ctx->sd->cond_stage_model) { + return sd_ctx->sd->cond_stage_model->is_params_on_gpu(); + } + break; + case SD_COMPONENT_CLIP_VISION: + if (sd_ctx->sd->clip_vision) { + return sd_ctx->sd->clip_vision->is_params_on_gpu(); + } + break; + case SD_COMPONENT_DIFFUSION: + if (sd_ctx->sd->diffusion_model) { + return sd_ctx->sd->diffusion_model->is_params_on_gpu(); + } + break; + case SD_COMPONENT_VAE: + if (sd_ctx->sd->first_stage_model) { + return sd_ctx->sd->first_stage_model->is_params_on_gpu(); + } + break; + case SD_COMPONENT_CONTROL_NET: + if (sd_ctx->sd->control_net) { + return sd_ctx->sd->control_net->is_params_on_gpu(); + } + break; + case SD_COMPONENT_PMID: + if (sd_ctx->sd->pmid_model) { + return sd_ctx->sd->pmid_model->is_params_on_gpu(); + } + break; + default: + break; + } + return false; +} + +size_t sd_get_component_vram(sd_ctx_t* sd_ctx, sd_component_t component) { + if (sd_ctx == nullptr || sd_ctx->sd == nullptr) { + return 0; + } + + switch (component) { + case SD_COMPONENT_COND_STAGE: + if (sd_ctx->sd->cond_stage_model) { + return sd_ctx->sd->cond_stage_model->get_params_vram_size(); + } + break; + case SD_COMPONENT_CLIP_VISION: + if (sd_ctx->sd->clip_vision) { + return sd_ctx->sd->clip_vision->get_params_vram_size(); + } + break; + case SD_COMPONENT_DIFFUSION: + if (sd_ctx->sd->diffusion_model) { + return sd_ctx->sd->diffusion_model->get_params_vram_size(); + } + break; + case SD_COMPONENT_VAE: + if (sd_ctx->sd->first_stage_model) { + return sd_ctx->sd->first_stage_model->get_params_vram_size(); + } + break; + case SD_COMPONENT_CONTROL_NET: + if (sd_ctx->sd->control_net) { + return sd_ctx->sd->control_net->get_params_vram_size(); + } + break; + case SD_COMPONENT_PMID: + if (sd_ctx->sd->pmid_model) { + return sd_ctx->sd->pmid_model->get_params_vram_size(); + } + break; + default: + break; + } + return 0; +} + +void sd_free_gpu_resources(sd_ctx_t* sd_ctx) { + if (sd_ctx == nullptr || sd_ctx->sd == nullptr) { + return; + } + + LOG_WARN("[Cleanup] Freeing all GPU resources before unload"); + + size_t total_freed = 0; + + // Helper macro to free component GPU memory + #define FREE_COMPONENT_GPU(model_ptr, name) do { \ + auto* model = (model_ptr); \ + if (model) { \ + size_t size = model->get_params_vram_size(); \ + if (size == 0) size = model->get_params_buffer_size(); \ + if (size > 0) { \ + if (!model->move_params_to_cpu()) { \ + model->free_params_buffer(); \ + LOG_WARN("[Cleanup] %s freed GPU buffer (%.2f MB) - no offload backend", name, size / (1024.0f * 1024.0f)); \ + } else { \ + LOG_WARN("[Cleanup] %s offloaded to CPU (%.2f MB)", name, size / (1024.0f * 1024.0f)); \ + } \ + total_freed += size; \ + } \ + } \ + } while(0) + + // Free all model components + FREE_COMPONENT_GPU(sd_ctx->sd->cond_stage_model.get(), "cond_stage"); + FREE_COMPONENT_GPU(sd_ctx->sd->diffusion_model.get(), "diffusion"); + FREE_COMPONENT_GPU(sd_ctx->sd->high_noise_diffusion_model.get(), "high_noise_diffusion"); + FREE_COMPONENT_GPU(sd_ctx->sd->first_stage_model.get(), "VAE"); + FREE_COMPONENT_GPU(sd_ctx->sd->control_net.get(), "ControlNet"); + 
FREE_COMPONENT_GPU(sd_ctx->sd->clip_vision.get(), "CLIP_Vision"); + FREE_COMPONENT_GPU(sd_ctx->sd->pmid_model.get(), "PhotoMaker"); + + #undef FREE_COMPONENT_GPU + + // Clear LoRA models to free their GPU buffers + size_t lora_freed = 0; + for (auto& lora : sd_ctx->sd->cond_stage_lora_models) { + if (lora) { + size_t size = lora->get_params_buffer_size(); + if (size > 0) { + if (!lora->move_params_to_cpu()) { + lora->free_params_buffer(); + } + lora_freed += size; + } + } + } + for (auto& lora : sd_ctx->sd->diffusion_lora_models) { + if (lora) { + size_t size = lora->get_params_buffer_size(); + if (size > 0) { + if (!lora->move_params_to_cpu()) { + lora->free_params_buffer(); + } + lora_freed += size; + } + } + } + for (auto& lora : sd_ctx->sd->first_stage_lora_models) { + if (lora) { + size_t size = lora->get_params_buffer_size(); + if (size > 0) { + if (!lora->move_params_to_cpu()) { + lora->free_params_buffer(); + } + lora_freed += size; + } + } + } + if (sd_ctx->sd->pmid_lora) { + size_t size = sd_ctx->sd->pmid_lora->get_params_buffer_size(); + if (size > 0) { + if (!sd_ctx->sd->pmid_lora->move_params_to_cpu()) { + sd_ctx->sd->pmid_lora->free_params_buffer(); + } + lora_freed += size; + } + } + if (lora_freed > 0) { + total_freed += lora_freed; + LOG_WARN("[Cleanup] LoRAs freed (%.2f MB)", lora_freed / (1024.0f * 1024.0f)); + } + + // Clear LoRA vectors entirely to trigger destructor cleanup + sd_ctx->sd->cond_stage_lora_models.clear(); + sd_ctx->sd->diffusion_lora_models.clear(); + sd_ctx->sd->first_stage_lora_models.clear(); + + // Synchronize CUDA to ensure all deallocations complete +#ifdef SD_USE_CUDA + cudaDeviceSynchronize(); +#endif + + LOG_WARN("[Cleanup] GPU resources freed, total: %.2f MB", total_freed / (1024.0f * 1024.0f)); +} diff --git a/src/tensor_registry.hpp b/src/tensor_registry.hpp new file mode 100644 index 000000000..cde9513fd --- /dev/null +++ b/src/tensor_registry.hpp @@ -0,0 +1,631 @@ +#ifndef __TENSOR_REGISTRY_HPP__ +#define __TENSOR_REGISTRY_HPP__ + +#include +#include +#include +#include +#include +#include + +#include "ggml-alloc.h" +#include "ggml-backend.h" +#include "ggml.h" + +#include "util.h" + +namespace LayerStreaming { + +struct TensorInfo { + ggml_tensor* gpu_tensor = nullptr; + ggml_tensor* cpu_tensor = nullptr; + size_t size_bytes = 0; + bool on_gpu = false; + int layer_index = -1; + std::string layer_name; + uint64_t last_access = 0; +}; + +struct LayerInfo { + std::string name; + int index = -1; + std::vector tensor_names; + size_t total_size_bytes = 0; + bool on_gpu = false; + ggml_backend_buffer_t gpu_buffer = nullptr; +}; + +// Tracks in-flight async transfers +struct AsyncLoadState { + struct CopyInfo { + std::string name; + ggml_tensor* cpu_tensor; + ggml_tensor* gpu_tensor; + }; + + ggml_context* temp_ctx = nullptr; + ggml_backend_buffer_t gpu_buffer = nullptr; + std::vector copy_list; + int64_t start_time = 0; +}; + +class TensorRegistry { +public: + TensorRegistry(ggml_backend_t gpu_backend, ggml_backend_t cpu_backend) + : gpu_backend_(gpu_backend), cpu_backend_(cpu_backend) {} + + ~TensorRegistry() { + clear(); + } + + void register_tensor(const std::string& name, + ggml_tensor* cpu_tensor, + const std::string& layer_name, + int layer_index) { + TensorInfo info; + info.cpu_tensor = cpu_tensor; + info.gpu_tensor = nullptr; + info.size_bytes = ggml_nbytes(cpu_tensor); + info.on_gpu = false; + info.layer_index = layer_index; + info.layer_name = layer_name; + info.last_access = 0; + + tensors_[name] = info; + + if 
(layers_.find(layer_name) == layers_.end()) { + LayerInfo layer_info; + layer_info.name = layer_name; + layer_info.index = layer_index; + layer_info.total_size_bytes = 0; + layer_info.on_gpu = false; + layer_info.gpu_buffer = nullptr; + layers_[layer_name] = layer_info; + } + layers_[layer_name].tensor_names.push_back(name); + layers_[layer_name].total_size_bytes += info.size_bytes; + } + + // Only works if tensor names are set with ggml_set_name() + void register_from_context(ggml_context* ctx, + const std::string& prefix, + std::function(const std::string&)> layer_pattern_fn) { + for (ggml_tensor* t = ggml_get_first_tensor(ctx); t != nullptr; t = ggml_get_next_tensor(ctx, t)) { + std::string name = ggml_get_name(t); + auto [layer_name, layer_index] = layer_pattern_fn(name); + register_tensor(name, t, layer_name, layer_index); + } + } + + // Preferred method: tensor names are properly preserved in the map keys + void register_from_map(const std::map& tensors, + std::function(const std::string&)> layer_pattern_fn) { + for (const auto& [name, tensor] : tensors) { + auto [layer_name, layer_index] = layer_pattern_fn(name); + register_tensor(name, tensor, layer_name, layer_index); + } + } + + bool move_layer_to_gpu(const std::string& layer_name) { + auto it = layers_.find(layer_name); + if (it == layers_.end()) { + LOG_ERROR("layer '%s' not found", layer_name.c_str()); + return false; + } + + LayerInfo& layer = it->second; + if (layer.on_gpu) { + return true; + } + + int64_t t0 = ggml_time_ms(); + + size_t ctx_size = layer.tensor_names.size() * ggml_tensor_overhead() + 1024; + struct ggml_init_params ctx_params = { + ctx_size, + nullptr, + true, + }; + ggml_context* temp_ctx = ggml_init(ctx_params); + if (temp_ctx == nullptr) { + LOG_ERROR("failed to create temp context for layer '%s'", layer_name.c_str()); + return false; + } + + // Can't rely on ggml_get_name() because GGMLBlock doesn't call ggml_set_name() + struct CopyInfo { + std::string name; + ggml_tensor* cpu_tensor; + ggml_tensor* gpu_tensor; + }; + std::vector copy_list; + + for (const auto& tensor_name : layer.tensor_names) { + TensorInfo& info = tensors_[tensor_name]; + if (info.on_gpu) { + continue; + } + + ggml_tensor* gpu_tensor = ggml_dup_tensor(temp_ctx, info.cpu_tensor); + ggml_set_name(gpu_tensor, tensor_name.c_str()); + copy_list.push_back({tensor_name, info.cpu_tensor, gpu_tensor}); + } + + if (copy_list.empty()) { + ggml_free(temp_ctx); + layer.on_gpu = true; + return true; + } + + layer.gpu_buffer = ggml_backend_alloc_ctx_tensors(temp_ctx, gpu_backend_); + if (layer.gpu_buffer == nullptr) { + LOG_ERROR("failed to allocate GPU buffer for layer '%s'", layer_name.c_str()); + ggml_free(temp_ctx); + return false; + } + + for (auto& item : copy_list) { + ggml_backend_tensor_copy(item.cpu_tensor, item.gpu_tensor); + } + ggml_backend_synchronize(gpu_backend_); + + for (auto& item : copy_list) { + TensorInfo& info = tensors_[item.name]; + info.gpu_tensor = item.gpu_tensor; + info.on_gpu = true; + info.last_access = access_counter_++; + + // Swap pointers so the original tensor now points to GPU memory + std::swap(item.cpu_tensor->buffer, item.gpu_tensor->buffer); + std::swap(item.cpu_tensor->data, item.gpu_tensor->data); + std::swap(item.cpu_tensor->extra, item.gpu_tensor->extra); + } + + layer.on_gpu = true; + current_gpu_usage_ += layer.total_size_bytes; + layer_contexts_[layer_name] = temp_ctx; + + return true; + } + + void move_layer_to_cpu(const std::string& layer_name) { + auto it = layers_.find(layer_name); + if (it == 
layers_.end()) { + return; + } + + LayerInfo& layer = it->second; + if (!layer.on_gpu) { + return; + } + + for (const auto& tensor_name : layer.tensor_names) { + TensorInfo& info = tensors_[tensor_name]; + if (!info.on_gpu || info.gpu_tensor == nullptr) { + continue; + } + + std::swap(info.cpu_tensor->buffer, info.gpu_tensor->buffer); + std::swap(info.cpu_tensor->data, info.gpu_tensor->data); + std::swap(info.cpu_tensor->extra, info.gpu_tensor->extra); + + info.gpu_tensor = nullptr; + info.on_gpu = false; + } + + if (layer.gpu_buffer != nullptr) { + ggml_backend_buffer_free(layer.gpu_buffer); + layer.gpu_buffer = nullptr; + } + + auto ctx_it = layer_contexts_.find(layer_name); + if (ctx_it != layer_contexts_.end()) { + ggml_free(ctx_it->second); + layer_contexts_.erase(ctx_it); + } + + current_gpu_usage_ -= layer.total_size_bytes; + layer.on_gpu = false; + } + + bool is_layer_on_gpu(const std::string& layer_name) const { + auto it = layers_.find(layer_name); + if (it == layers_.end()) { + return false; + } + return it->second.on_gpu; + } + + size_t get_layer_size(const std::string& layer_name) const { + auto it = layers_.find(layer_name); + if (it == layers_.end()) { + return 0; + } + return it->second.total_size_bytes; + } + + size_t get_gpu_usage() const { + return current_gpu_usage_; + } + + std::vector get_layer_names_sorted() const { + std::vector> indexed_layers; + for (const auto& [name, info] : layers_) { + indexed_layers.push_back({info.index, name}); + } + std::sort(indexed_layers.begin(), indexed_layers.end()); + + std::vector result; + for (const auto& [idx, name] : indexed_layers) { + result.push_back(name); + } + return result; + } + + std::vector get_layers_on_gpu() const { + std::vector result; + for (const auto& [name, info] : layers_) { + if (info.on_gpu) { + result.push_back(name); + } + } + return result; + } + + size_t get_layer_count() const { + return layers_.size(); + } + + // Initiates transfer without waiting; call complete_async_layer_load() to finalize + bool start_async_layer_load(const std::string& layer_name, + ggml_backend_t gpu_backend, + ggml_backend_t cpu_backend) { + auto it = layers_.find(layer_name); + if (it == layers_.end()) { + LOG_ERROR("layer '%s' not found for async load", layer_name.c_str()); + return false; + } + + LayerInfo& layer = it->second; + if (layer.on_gpu) { + return true; + } + + if (async_loading_layers_.find(layer_name) != async_loading_layers_.end()) { + return true; + } + + int64_t t0 = ggml_time_ms(); + + size_t ctx_size = layer.tensor_names.size() * ggml_tensor_overhead() + 1024; + struct ggml_init_params ctx_params = { + ctx_size, + nullptr, + true, + }; + ggml_context* temp_ctx = ggml_init(ctx_params); + if (temp_ctx == nullptr) { + LOG_ERROR("failed to create temp context for async load of layer '%s'", layer_name.c_str()); + return false; + } + + std::vector copy_list; + + for (const auto& tensor_name : layer.tensor_names) { + TensorInfo& info = tensors_[tensor_name]; + if (info.on_gpu) { + continue; + } + + ggml_tensor* gpu_tensor = ggml_dup_tensor(temp_ctx, info.cpu_tensor); + ggml_set_name(gpu_tensor, tensor_name.c_str()); + copy_list.push_back({tensor_name, info.cpu_tensor, gpu_tensor}); + } + + if (copy_list.empty()) { + ggml_free(temp_ctx); + layer.on_gpu = true; + return true; + } + + ggml_backend_buffer_t buffer = ggml_backend_alloc_ctx_tensors(temp_ctx, gpu_backend); + if (buffer == nullptr) { + LOG_ERROR("failed to allocate GPU buffer for async load of layer '%s'", layer_name.c_str()); + ggml_free(temp_ctx); + 
return false; + } + + // May fall back to sync for CPU->CUDA + for (auto& item : copy_list) { + ggml_backend_tensor_copy_async(cpu_backend, gpu_backend, item.cpu_tensor, item.gpu_tensor); + } + + AsyncLoadState state; + state.temp_ctx = temp_ctx; + state.gpu_buffer = buffer; + state.copy_list = std::move(copy_list); + state.start_time = t0; + + async_loading_layers_[layer_name] = std::move(state); + + return true; + } + + // Waits for pending async transfers and finalizes the layer state + bool complete_async_layer_load(const std::string& layer_name, + ggml_backend_t gpu_backend) { + auto async_it = async_loading_layers_.find(layer_name); + if (async_it == async_loading_layers_.end()) { + // Not in async loading - check if already on GPU + auto layer_it = layers_.find(layer_name); + if (layer_it != layers_.end() && layer_it->second.on_gpu) { + return true; + } + return false; + } + + AsyncLoadState& state = async_it->second; + auto layer_it = layers_.find(layer_name); + if (layer_it == layers_.end()) { + ggml_backend_buffer_free(state.gpu_buffer); + ggml_free(state.temp_ctx); + async_loading_layers_.erase(async_it); + return false; + } + + LayerInfo& layer = layer_it->second; + + ggml_backend_synchronize(gpu_backend); + + for (auto& item : state.copy_list) { + TensorInfo& info = tensors_[item.name]; + info.gpu_tensor = item.gpu_tensor; + info.on_gpu = true; + info.last_access = access_counter_++; + + std::swap(item.cpu_tensor->buffer, item.gpu_tensor->buffer); + std::swap(item.cpu_tensor->data, item.gpu_tensor->data); + std::swap(item.cpu_tensor->extra, item.gpu_tensor->extra); + } + + layer.on_gpu = true; + layer.gpu_buffer = state.gpu_buffer; + current_gpu_usage_ += layer.total_size_bytes; + layer_contexts_[layer_name] = state.temp_ctx; + + async_loading_layers_.erase(async_it); + return true; + } + + bool is_layer_async_loading(const std::string& layer_name) const { + return async_loading_layers_.find(layer_name) != async_loading_layers_.end(); + } + + void clear() { + for (auto& [name, state] : async_loading_layers_) { + if (state.gpu_buffer) { + ggml_backend_buffer_free(state.gpu_buffer); + } + if (state.temp_ctx) { + ggml_free(state.temp_ctx); + } + } + async_loading_layers_.clear(); + + for (auto& [name, layer] : layers_) { + if (layer.on_gpu) { + move_layer_to_cpu(name); + } + } + + for (auto& [name, ctx] : layer_contexts_) { + ggml_free(ctx); + } + + tensors_.clear(); + layers_.clear(); + layer_contexts_.clear(); + current_gpu_usage_ = 0; + } + +private: + ggml_backend_t gpu_backend_; + ggml_backend_t cpu_backend_; + + std::unordered_map tensors_; + std::unordered_map layers_; + std::unordered_map layer_contexts_; + std::unordered_map async_loading_layers_; + + size_t current_gpu_usage_ = 0; + uint64_t access_counter_ = 0; +}; + +// Extract Flux layer info: double_blocks.N, single_blocks.N, or _global +inline std::pair flux_layer_pattern(const std::string& tensor_name) { + size_t db_pos = tensor_name.find("double_blocks."); + if (db_pos != std::string::npos) { + size_t num_start = db_pos + 14; // Length of "double_blocks." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"double_blocks." + num_str, block_idx}; + } + + size_t sb_pos = tensor_name.find("single_blocks."); + if (sb_pos != std::string::npos) { + size_t num_start = sb_pos + 14; // Length of "single_blocks." 
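+        // Worked example (hypothetical tensor name): "single_blocks.5.linear1.weight"
+        // resolves to layer "single_blocks.5" with global index 19 + 5 = 24, so when
+        // this pattern function is handed to TensorRegistry::register_from_map(),
+        // single blocks sort after the 19 double blocks in get_layer_names_sorted().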
+ size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + // Offset past 19 double_blocks + return {"single_blocks." + num_str, 19 + block_idx}; + } + + return {"_global", -1}; +} + +// Extract UNet layer info: input_blocks.N, middle_block, output_blocks.N, or _global +inline std::pair unet_layer_pattern(const std::string& tensor_name) { + size_t ib_pos = tensor_name.find("input_blocks."); + if (ib_pos != std::string::npos) { + size_t num_start = ib_pos + 13; // Length of "input_blocks." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"input_blocks." + num_str, block_idx}; + } + + if (tensor_name.find("middle_block") != std::string::npos) { + return {"middle_block", 100}; + } + + size_t ob_pos = tensor_name.find("output_blocks."); + if (ob_pos != std::string::npos) { + size_t num_start = ob_pos + 14; // Length of "output_blocks." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"output_blocks." + num_str, 200 + block_idx}; + } + + return {"_global", -1}; +} + +// Extract MMDiT layer info: joint_blocks.N, or _global +inline std::pair mmdit_layer_pattern(const std::string& tensor_name) { + size_t jb_pos = tensor_name.find("joint_blocks."); + if (jb_pos != std::string::npos) { + size_t num_start = jb_pos + 13; // Length of "joint_blocks." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"joint_blocks." + num_str, block_idx}; + } + + return {"_global", -1}; +} + +// Extract WAN layer info: blocks.N, vace_blocks.N, or _global +inline std::pair wan_layer_pattern(const std::string& tensor_name) { + size_t b_pos = tensor_name.find("blocks."); + // Exclude "vace_blocks" matches + if (b_pos != std::string::npos && (b_pos == 0 || tensor_name[b_pos - 1] != '_')) { + size_t num_start = b_pos + 7; // Length of "blocks." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"blocks." + num_str, block_idx}; + } + + size_t vb_pos = tensor_name.find("vace_blocks."); + if (vb_pos != std::string::npos) { + size_t num_start = vb_pos + 12; // Length of "vace_blocks." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"vace_blocks." 
+ num_str, 100 + block_idx}; + } + + return {"_global", -1}; +} + +// Extract QwenImage layer info: transformer_blocks.N, or _global +inline std::pair qwen_image_layer_pattern(const std::string& tensor_name) { + size_t tb_pos = tensor_name.find("transformer_blocks."); + if (tb_pos != std::string::npos) { + size_t num_start = tb_pos + 19; // Length of "transformer_blocks." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"transformer_blocks." + num_str, block_idx}; + } + + return {"_global", -1}; +} + +// Extract ZImage layer info: context_refiner.N, noise_refiner.N, layers.N, or _global +inline std::pair zimage_layer_pattern(const std::string& tensor_name) { + size_t cr_pos = tensor_name.find("context_refiner."); + if (cr_pos != std::string::npos) { + size_t num_start = cr_pos + 16; // Length of "context_refiner." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"context_refiner." + num_str, block_idx}; + } + + size_t nr_pos = tensor_name.find("noise_refiner."); + if (nr_pos != std::string::npos) { + size_t num_start = nr_pos + 14; // Length of "noise_refiner." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"noise_refiner." + num_str, 10 + block_idx}; + } + + size_t l_pos = tensor_name.find("layers."); + if (l_pos != std::string::npos) { + size_t num_start = l_pos + 7; // Length of "layers." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"layers." + num_str, 100 + block_idx}; + } + + return {"_global", -1}; +} + +// Extract Anima layer info: blocks.N (from net.blocks.N), or _global +inline std::pair anima_layer_pattern(const std::string& tensor_name) { + size_t nb_pos = tensor_name.find("net.blocks."); + if (nb_pos != std::string::npos) { + size_t num_start = nb_pos + 11; // Length of "net.blocks." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"blocks." 
+ num_str, block_idx}; + } + + return {"_global", -1}; +} + +} // namespace LayerStreaming + +#endif // __TENSOR_REGISTRY_HPP__ diff --git a/src/unet.hpp b/src/unet.hpp index d7ea8c3fa..008d2f2b2 100644 --- a/src/unet.hpp +++ b/src/unet.hpp @@ -2,6 +2,7 @@ #define __UNET_HPP__ #include "common_block.hpp" +#include "layer_streaming.hpp" #include "model.h" /*==================================================== UnetModel =====================================================*/ @@ -597,6 +598,160 @@ class UnetModelBlock : public GGMLBlock { ggml_set_name(h, "bench-end"); return h; // [N, out_channels, h, w] } + + ggml_tensor* forward_embedding_stage(GGMLRunnerContext* ctx, + struct ggml_tensor* timesteps, + struct ggml_tensor* label) { + auto time_embed_0 = std::dynamic_pointer_cast(blocks["time_embed.0"]); + auto time_embed_2 = std::dynamic_pointer_cast(blocks["time_embed.2"]); + + auto emb = ggml_ext_timestep_embedding(ctx->ggml_ctx, timesteps, model_channels); + emb = time_embed_0->forward(ctx, emb); + emb = ggml_silu_inplace(ctx->ggml_ctx, emb); + emb = time_embed_2->forward(ctx, emb); + + if (label != nullptr && adm_in_channels != -1) { + auto label_embed_0 = std::dynamic_pointer_cast(blocks["label_emb.0.0"]); + auto label_embed_2 = std::dynamic_pointer_cast(blocks["label_emb.0.2"]); + + auto label_emb = label_embed_0->forward(ctx, label); + label_emb = ggml_silu_inplace(ctx->ggml_ctx, label_emb); + label_emb = label_embed_2->forward(ctx, label_emb); + + emb = ggml_add(ctx->ggml_ctx, emb, label_emb); + } + + return emb; + } + + ggml_tensor* forward_initial_conv(GGMLRunnerContext* ctx, struct ggml_tensor* x) { + auto input_blocks_0_0 = std::dynamic_pointer_cast(blocks["input_blocks.0.0"]); + return input_blocks_0_0->forward(ctx, x); + } + + ggml_tensor* forward_input_block(GGMLRunnerContext* ctx, + int block_idx, + struct ggml_tensor* h, + struct ggml_tensor* emb, + struct ggml_tensor* context, + int num_video_frames) { + // input_blocks.X.0 is either a ResBlock or a DownSampleBlock — + // SDXL/SD1.x put the per-stage downsample at indices 3 and 6. The + // non-streaming forward() differentiates these inline; the streaming + // path does the same here. + std::string slot0_name = "input_blocks." + std::to_string(block_idx) + ".0"; + auto slot0_it = blocks.find(slot0_name); + if (slot0_it != blocks.end()) { + if (auto downsample = std::dynamic_pointer_cast(slot0_it->second)) { + h = downsample->forward(ctx, h); + } else { + h = resblock_forward(slot0_name, ctx, h, emb, num_video_frames); + } + } + + // input_blocks.X.1 is a SpatialTransformer when attention applies at this resolution. + std::string attn_name = "input_blocks." 
+ std::to_string(block_idx) + ".1"; + auto attn_block = blocks.find(attn_name); + if (attn_block != blocks.end()) { + h = attention_layer_forward(attn_name, ctx, h, context, num_video_frames); + } + + return h; + } + + ggml_tensor* forward_middle_block(GGMLRunnerContext* ctx, + struct ggml_tensor* h, + struct ggml_tensor* emb, + struct ggml_tensor* context, + int num_video_frames) { + h = resblock_forward("middle_block.0", ctx, h, emb, num_video_frames); + if (version == VERSION_SD1 || version == VERSION_SD2 || version == VERSION_SVD) { + h = attention_layer_forward("middle_block.1", ctx, h, context, num_video_frames); + h = resblock_forward("middle_block.2", ctx, h, emb, num_video_frames); + } + return h; + } + + ggml_tensor* forward_output_block(GGMLRunnerContext* ctx, + int block_idx, + struct ggml_tensor* h, + struct ggml_tensor* skip, + struct ggml_tensor* emb, + struct ggml_tensor* context, + int num_video_frames) { + h = ggml_concat(ctx->ggml_ctx, h, skip, 2); + + std::string res_name = "output_blocks." + std::to_string(block_idx) + ".0"; + h = resblock_forward(res_name, ctx, h, emb, num_video_frames); + + // output_blocks.X.1/.2 may be SpatialTransformer (attention), UpSampleBlock, + // or both: when the resolution has attention, slot .1 = transformer and + // slot .2 = upsample; without attention, slot .1 = upsample. Dispatch + // by actual block type so SD1.x's deepest output block (no attention) + // doesn't end up casting an UpSampleBlock to a SpatialTransformer. + for (int i = 1; i <= 2; i++) { + std::string slot_name = "output_blocks." + std::to_string(block_idx) + "." + std::to_string(i); + auto slot_it = blocks.find(slot_name); + if (slot_it == blocks.end()) { + continue; + } + if (auto upsample = std::dynamic_pointer_cast(slot_it->second)) { + h = upsample->forward(ctx, h); + } else { + h = attention_layer_forward(slot_name, ctx, h, context, num_video_frames); + } + } + + return h; + } + + ggml_tensor* forward_output_stage(GGMLRunnerContext* ctx, struct ggml_tensor* h) { + auto out_0 = std::dynamic_pointer_cast(blocks["out.0"]); + auto out_2 = std::dynamic_pointer_cast(blocks["out.2"]); + + h = out_0->forward(ctx, h); + h = ggml_silu_inplace(ctx->ggml_ctx, h); + h = out_2->forward(ctx, h); + + return h; + } + + // Walk the blocks map to find the largest "input_blocks.N.0" index that + // actually exists, then return N+1 so callers can iterate [0, count). + // SDXL ends at 8 (9 total), SD1/SD2 at 11 (12 total), tiny_unet has gaps + // — the streaming loop treats missing indices as "skip" via blocks.find(). + int get_num_input_blocks() const { + return count_blocks_with_prefix("input_blocks."); + } + int get_num_output_blocks() const { + return count_blocks_with_prefix("output_blocks."); + } + +private: + int count_blocks_with_prefix(const std::string& prefix) const { + int max_idx = -1; + for (const auto& kv : blocks) { + const std::string& name = kv.first; + if (name.compare(0, prefix.size(), prefix) != 0) { + continue; + } + // name looks like "input_blocks.N.M"; extract N + size_t i_start = prefix.size(); + size_t i_end = name.find('.', i_start); + if (i_end == std::string::npos) { + continue; + } + try { + int idx = std::stoi(name.substr(i_start, i_end - i_start)); + if (idx > max_idx) max_idx = idx; + } catch (...) 
{ + continue; + } + } + return max_idx + 1; + } + +public: }; struct UNetModelRunner : public GGMLRunner { @@ -615,6 +770,399 @@ struct UNetModelRunner : public GGMLRunner { return "unet"; } + // UNet needs keep_layers_behind=12 for skip connections + void enable_layer_streaming(const LayerStreaming::StreamingConfig& config = {}) { + LayerStreaming::StreamingConfig cfg = config; + cfg.keep_layers_behind = 12; + std::map tensor_map; + unet.get_param_tensors(tensor_map, "model.diffusion_model"); + init_streaming(cfg, tensor_map, LayerStreaming::unet_layer_pattern); + LOG_INFO("%s layer streaming enabled (coarse-stage mode)", get_desc().c_str()); + } + + bool compute_streaming(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* c_concat, + ggml_tensor* y, + int num_video_frames = -1, + std::vector controls = {}, + float control_strength = 0.f, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr) { + if (!is_streaming_enabled()) { + LOG_WARN("%s streaming not enabled, falling back to regular compute", get_desc().c_str()); + return compute(n_threads, x, timesteps, context, c_concat, y, + num_video_frames, controls, control_strength, output, output_ctx); + } + + int64_t t0 = ggml_time_ms(); + auto analysis = analyze_vram_budget(); + + if (analysis.fits_in_vram) { + LOG_INFO("%s model fits in VRAM, using coarse-stage streaming", get_desc().c_str()); + load_all_layers_coarse(); + bool result = compute(n_threads, x, timesteps, context, c_concat, y, + num_video_frames, controls, control_strength, output, output_ctx, + /*skip_param_offload=*/true); + int64_t t1 = ggml_time_ms(); + LOG_INFO("%s coarse-stage streaming completed in %.2fs", get_desc().c_str(), (t1 - t0) / 1000.0); + free_compute_buffer(); + return result; + } + + LOG_INFO("%s remaining %.2f GB exceeds available %.2f GB, using per-layer streaming", + get_desc().c_str(), + analysis.remaining_to_load / (1024.0 * 1024.0 * 1024.0), + analysis.available_vram / (1024.0 * 1024.0 * 1024.0)); + + return compute_streaming_true(n_threads, x, timesteps, context, c_concat, y, + num_video_frames, controls, control_strength, output, output_ctx); + } + + bool compute_streaming_true(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* c_concat = nullptr, + ggml_tensor* y = nullptr, + int num_video_frames = -1, + std::vector controls = {}, + float control_strength = 0.f, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr) { + auto& registry = streaming_engine_->get_registry(); + int64_t t_start = ggml_time_ms(); + + const int num_input_blocks = unet.get_num_input_blocks(); + const int num_output_blocks = unet.get_num_output_blocks(); + + LOG_INFO("TRUE per-layer streaming - %d input, 1 middle, %d output blocks", + num_input_blocks, num_output_blocks); + + // Load global layers + if (!registry.move_layer_to_gpu("_global")) { + LOG_ERROR("Failed to load _global to GPU"); + return false; + } + + // Skip connections storage - stores each input block's output + std::vector> skip_connections(num_input_blocks); + std::vector> skip_ne(num_input_blocks); + + // Persistent storage for current h and emb + std::vector persistent_h; + std::vector persistent_emb; + int64_t h_ne[4], emb_ne[4]; + + // Handle c_concat + ggml_tensor* actual_x = x; + if (c_concat != nullptr) { + // For now, handle c_concat in input stage + } + + LOG_DEBUG("Computing embeddings"); + { + ggml_tensor* emb_output = nullptr; + + auto get_emb_graph = [&]() -> 
ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(UNET_GRAPH_SIZE / 8); + auto runner_ctx = get_context(); + + ggml_tensor* timesteps_b = to_backend(timesteps); + ggml_tensor* y_b = y ? to_backend(y) : nullptr; + + emb_output = unet.forward_embedding_stage(&runner_ctx, timesteps_b, y_b); + ggml_build_forward_expand(gf, emb_output); + + return gf; + }; + + if (!GGMLRunner::compute(get_emb_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Embedding stage failed"); + return false; + } + + // Extract emb + size_t emb_size = ggml_nelements(emb_output); + persistent_emb.resize(emb_size); + ggml_backend_tensor_get(emb_output, persistent_emb.data(), 0, emb_size * sizeof(float)); + for (int i = 0; i < 4; i++) emb_ne[i] = emb_output->ne[i]; + + free_compute_buffer(); + } + + LOG_DEBUG("Processing input blocks"); + { + ggml_tensor* h_output = nullptr; + + // Initial conv + auto get_init_graph = [&]() -> ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(UNET_GRAPH_SIZE / 8); + auto runner_ctx = get_context(); + + ggml_tensor* x_b = to_backend(x); + if (c_concat != nullptr) { + ggml_tensor* c_b = to_backend(c_concat); + x_b = ggml_concat(compute_ctx, x_b, c_b, 2); + } + + h_output = unet.forward_initial_conv(&runner_ctx, x_b); + ggml_build_forward_expand(gf, h_output); + + return gf; + }; + + if (!GGMLRunner::compute(get_init_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Initial conv failed"); + return false; + } + + // Save skip connection 0 + size_t h_size = ggml_nelements(h_output); + skip_connections[0].resize(h_size); + ggml_backend_tensor_get(h_output, skip_connections[0].data(), 0, h_size * sizeof(float)); + for (int i = 0; i < 4; i++) { + skip_ne[0][i] = h_output->ne[i]; + h_ne[i] = h_output->ne[i]; + } + persistent_h.resize(h_size); + ggml_backend_tensor_get(h_output, persistent_h.data(), 0, h_size * sizeof(float)); + + free_compute_buffer(); + } + + // Process input blocks 1-11 + auto input_block_at = [](int i) { return "input_blocks." + std::to_string(i); }; + if (streaming_engine_) { + streaming_engine_->prime_prefetch(input_block_at, 1, num_input_blocks); + } + + for (int block_idx = 1; block_idx < num_input_blocks; block_idx++) { + std::string block_name = input_block_at(block_idx); + int64_t t_block = ggml_time_ms(); + + if (streaming_engine_) { + streaming_engine_->wait_for_prefetch(block_name); + } + + if (!registry.move_layer_to_gpu(block_name)) { + LOG_ERROR("Failed to load %s", block_name.c_str()); + return false; + } + + // Keep the prefetch window full + if (streaming_engine_) { + streaming_engine_->advance_prefetch(input_block_at, block_idx, num_input_blocks); + } + + ggml_tensor* h_output = nullptr; + + auto get_input_graph = [&]() -> ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(UNET_GRAPH_SIZE / 8); + + ggml_tensor* h_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, h_ne[0], h_ne[1], h_ne[2], h_ne[3]); + ggml_tensor* emb_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, emb_ne[0], emb_ne[1], emb_ne[2], emb_ne[3]); + ggml_tensor* context_b = context ? 
to_backend(context) : nullptr; + + h_in = to_backend(h_in); + emb_in = to_backend(emb_in); + + set_backend_tensor_data(h_in, persistent_h.data()); + set_backend_tensor_data(emb_in, persistent_emb.data()); + + auto runner_ctx = get_context(); + h_output = unet.forward_input_block(&runner_ctx, block_idx, h_in, emb_in, context_b, num_video_frames); + + ggml_build_forward_expand(gf, h_output); + + return gf; + }; + + if (!GGMLRunner::compute(get_input_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Input block %d failed", block_idx); + return false; + } + + // Save skip connection + size_t h_size = ggml_nelements(h_output); + skip_connections[block_idx].resize(h_size); + ggml_backend_tensor_get(h_output, skip_connections[block_idx].data(), 0, h_size * sizeof(float)); + for (int i = 0; i < 4; i++) { + skip_ne[block_idx][i] = h_output->ne[i]; + h_ne[i] = h_output->ne[i]; + } + + // Update persistent h + persistent_h.resize(h_size); + ggml_backend_tensor_get(h_output, persistent_h.data(), 0, h_size * sizeof(float)); + + free_compute_buffer(); + + registry.move_layer_to_cpu(block_name); + LOG_DEBUG("Input block %d/%d done (%.2fms)", + block_idx + 1, num_input_blocks, (ggml_time_ms() - t_block) / 1.0); + } + + LOG_DEBUG("Processing middle block"); + { + if (!registry.move_layer_to_gpu("middle_block")) { + LOG_ERROR("Failed to load middle_block"); + return false; + } + + ggml_tensor* h_output = nullptr; + + auto get_middle_graph = [&]() -> ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(UNET_GRAPH_SIZE / 8); + + ggml_tensor* h_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, h_ne[0], h_ne[1], h_ne[2], h_ne[3]); + ggml_tensor* emb_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, emb_ne[0], emb_ne[1], emb_ne[2], emb_ne[3]); + ggml_tensor* context_b = context ? to_backend(context) : nullptr; + + h_in = to_backend(h_in); + emb_in = to_backend(emb_in); + + set_backend_tensor_data(h_in, persistent_h.data()); + set_backend_tensor_data(emb_in, persistent_emb.data()); + + auto runner_ctx = get_context(); + h_output = unet.forward_middle_block(&runner_ctx, h_in, emb_in, context_b, num_video_frames); + + ggml_build_forward_expand(gf, h_output); + + return gf; + }; + + if (!GGMLRunner::compute(get_middle_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Middle block failed"); + return false; + } + + // Update persistent h + size_t h_size = ggml_nelements(h_output); + persistent_h.resize(h_size); + ggml_backend_tensor_get(h_output, persistent_h.data(), 0, h_size * sizeof(float)); + for (int i = 0; i < 4; i++) h_ne[i] = h_output->ne[i]; + + free_compute_buffer(); + + registry.move_layer_to_cpu("middle_block"); + } + + LOG_DEBUG("Processing output blocks"); + + auto output_block_at = [](int i) { return "output_blocks." 
+ std::to_string(i); }; + if (streaming_engine_) { + streaming_engine_->prime_prefetch(output_block_at, 0, num_output_blocks); + } + + for (int block_idx = 0; block_idx < num_output_blocks; block_idx++) { + std::string block_name = output_block_at(block_idx); + int64_t t_block = ggml_time_ms(); + + // Skip connection index (reverse order) + int skip_idx = num_input_blocks - 1 - block_idx; + + if (streaming_engine_) { + streaming_engine_->wait_for_prefetch(block_name); + } + + if (!registry.move_layer_to_gpu(block_name)) { + LOG_ERROR("Failed to load %s", block_name.c_str()); + return false; + } + + // Keep the prefetch window full + if (streaming_engine_) { + streaming_engine_->advance_prefetch(output_block_at, block_idx, num_output_blocks); + } + + ggml_tensor* h_output = nullptr; + + auto get_output_graph = [&]() -> ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(UNET_GRAPH_SIZE / 8); + + ggml_tensor* h_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, h_ne[0], h_ne[1], h_ne[2], h_ne[3]); + ggml_tensor* emb_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, emb_ne[0], emb_ne[1], emb_ne[2], emb_ne[3]); + + // Create skip connection tensor + ggml_tensor* skip_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, + skip_ne[skip_idx][0], skip_ne[skip_idx][1], + skip_ne[skip_idx][2], skip_ne[skip_idx][3]); + + ggml_tensor* context_b = context ? to_backend(context) : nullptr; + + h_in = to_backend(h_in); + emb_in = to_backend(emb_in); + skip_in = to_backend(skip_in); + + set_backend_tensor_data(h_in, persistent_h.data()); + set_backend_tensor_data(emb_in, persistent_emb.data()); + set_backend_tensor_data(skip_in, skip_connections[skip_idx].data()); + + auto runner_ctx = get_context(); + h_output = unet.forward_output_block(&runner_ctx, block_idx, h_in, skip_in, emb_in, + context_b, num_video_frames); + + ggml_build_forward_expand(gf, h_output); + + return gf; + }; + + if (!GGMLRunner::compute(get_output_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Output block %d failed", block_idx); + return false; + } + + // Update persistent h + size_t h_size = ggml_nelements(h_output); + persistent_h.resize(h_size); + ggml_backend_tensor_get(h_output, persistent_h.data(), 0, h_size * sizeof(float)); + for (int i = 0; i < 4; i++) h_ne[i] = h_output->ne[i]; + + free_compute_buffer(); + + // Free skip connection memory + skip_connections[skip_idx].clear(); + skip_connections[skip_idx].shrink_to_fit(); + + registry.move_layer_to_cpu(block_name); + LOG_DEBUG("Output block %d/%d done (%.2fms)", + block_idx + 1, num_output_blocks, (ggml_time_ms() - t_block) / 1.0); + } + + LOG_DEBUG("Applying final output layers"); + { + auto get_final_graph = [&]() -> ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(UNET_GRAPH_SIZE / 8); + + ggml_tensor* h_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, h_ne[0], h_ne[1], h_ne[2], h_ne[3]); + h_in = to_backend(h_in); + set_backend_tensor_data(h_in, persistent_h.data()); + + auto runner_ctx = get_context(); + auto final_out = unet.forward_output_stage(&runner_ctx, h_in); + + ggml_build_forward_expand(gf, final_out); + + return gf; + }; + + if (!GGMLRunner::compute(get_final_graph, n_threads, true, output, output_ctx, true)) { + LOG_ERROR("Final output stage failed"); + return false; + } + } + + int64_t t_end = ggml_time_ms(); + LOG_INFO("TRUE per-layer streaming completed in %.2fs (%d input + 1 middle + %d output blocks)", + (t_end - t_start) / 1000.0, num_input_blocks, num_output_blocks); + + return true; + } + void get_param_tensors(std::map& 
tensors, const std::string prefix) { unet.get_param_tensors(tensors, prefix); } @@ -661,6 +1209,69 @@ struct UNetModelRunner : public GGMLRunner { return gf; } + // Legacy overload used by streaming code paths (takes raw ggml_tensor pointers) + ggml_cgraph* build_graph(ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* c_concat = nullptr, + ggml_tensor* y = nullptr, + int num_video_frames = -1, + std::vector controls = {}, + float control_strength = 0.f) { + ggml_cgraph* gf = new_graph_custom(UNET_GRAPH_SIZE); + + if (num_video_frames == -1) { + num_video_frames = static_cast(x->ne[3]); + } + + x = to_backend(x); + context = to_backend(context); + y = to_backend(y); + timesteps = to_backend(timesteps); + c_concat = to_backend(c_concat); + + for (size_t i = 0; i < controls.size(); i++) { + controls[i] = to_backend(controls[i]); + } + + auto runner_ctx = get_context(); + + ggml_tensor* out = unet.forward(&runner_ctx, + x, + timesteps, + context, + c_concat, + y, + num_video_frames, + controls, + control_strength); + + ggml_build_forward_expand(gf, out); + + return gf; + } + + // Legacy overload used by streaming code paths (takes raw ggml_tensor pointers) + bool compute(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* c_concat, + ggml_tensor* y, + int num_video_frames = -1, + std::vector controls = {}, + float control_strength = 0.f, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr, + bool skip_param_offload = false) { + auto get_graph = [&]() -> ggml_cgraph* { + return build_graph(x, timesteps, context, c_concat, y, num_video_frames, controls, control_strength); + }; + + return GGMLRunner::compute(get_graph, n_threads, false, output, output_ctx, skip_param_offload); + } + + // Upstream public API (takes sd::Tensor) sd::Tensor compute(int n_threads, const sd::Tensor& x, const sd::Tensor& timesteps, diff --git a/src/wan.hpp b/src/wan.hpp index 261453301..7c3776ccb 100644 --- a/src/wan.hpp +++ b/src/wan.hpp @@ -7,6 +7,7 @@ #include "common_block.hpp" #include "flux.hpp" +#include "layer_streaming.hpp" #include "rope.hpp" #include "vae.hpp" @@ -2083,6 +2084,55 @@ namespace WAN { return out; } + + struct StreamingInputResult { + ggml_tensor* x; // [N, t_len*h_len*w_len, dim] + ggml_tensor* x_orig; // Original x for vace + ggml_tensor* c; // vace context [N, t_len*h_len*w_len, dim] or nullptr + ggml_tensor* e0; // timestep embedding + ggml_tensor* e; // for head + ggml_tensor* pe; // positional encoding + ggml_tensor* context; // text context + int64_t context_img_len; + }; + + std::pair forward_block(GGMLRunnerContext* ctx, + int block_idx, + struct ggml_tensor* x, + struct ggml_tensor* x_orig, + struct ggml_tensor* c, + struct ggml_tensor* e0, + struct ggml_tensor* pe, + struct ggml_tensor* context, + int64_t context_img_len, + float vace_strength) { + auto block = std::dynamic_pointer_cast(blocks["blocks." + std::to_string(block_idx)]); + x = block->forward(ctx, x, e0, pe, context, context_img_len); + + // Check if this block has a paired vace_block + auto iter = params.vace_layers_mapping.find(block_idx); + if (iter != params.vace_layers_mapping.end() && c != nullptr) { + int n = iter->second; + auto vace_block = std::dynamic_pointer_cast(blocks["vace_blocks." 
+ std::to_string(n)]); + auto result = vace_block->forward(ctx, c, x_orig, e0, pe, context, context_img_len); + auto c_skip = result.first; + c = result.second; + c_skip = ggml_ext_scale(ctx->ggml_ctx, c_skip, vace_strength); + x = ggml_add(ctx->ggml_ctx, x, c_skip); + } + + return {x, c}; + } + + ggml_tensor* forward_output_stage(GGMLRunnerContext* ctx, + struct ggml_tensor* x, + struct ggml_tensor* e) { + auto head = std::dynamic_pointer_cast(blocks["head"]); + return head->forward(ctx, x, e); // [N, t_len*h_len*w_len, pt*ph*pw*out_dim] + } + + int get_num_layers() const { return params.num_layers; } + const std::tuple& get_patch_size() const { return params.patch_size; } }; struct WanRunner : public GGMLRunner { @@ -2212,6 +2262,163 @@ namespace WAN { wan.get_param_tensors(tensors, prefix); } + public: + void enable_layer_streaming(const LayerStreaming::StreamingConfig& config = {}) { + std::map tensor_map; + wan.get_param_tensors(tensor_map, "model.diffusion_model"); + init_streaming(config, tensor_map, LayerStreaming::wan_layer_pattern); + LOG_INFO("%s layer streaming enabled (%zu layers)", + get_desc().c_str(), streaming_engine_->get_registry().get_layer_count()); + } + + bool compute_streaming(int n_threads, + struct ggml_tensor* x, + struct ggml_tensor* timesteps, + struct ggml_tensor* context, + struct ggml_tensor* clip_fea = nullptr, + struct ggml_tensor* c_concat = nullptr, + struct ggml_tensor* time_dim_concat = nullptr, + struct ggml_tensor* vace_context = nullptr, + float vace_strength = 1.f, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) { + if (!is_streaming_enabled()) { + LOG_ERROR("%s streaming not enabled", get_desc().c_str()); + return false; + } + + int64_t t0 = ggml_time_ms(); + auto analysis = analyze_vram_budget(); + + if (analysis.fits_in_vram) { + LOG_INFO("%s model fits in VRAM, using coarse-stage streaming", get_desc().c_str()); + load_all_layers_coarse(); + bool result = compute(n_threads, x, timesteps, context, clip_fea, c_concat, + time_dim_concat, vace_context, vace_strength, output, output_ctx, true); + int64_t t1 = ggml_time_ms(); + LOG_INFO("%s coarse-stage streaming completed in %.2fs", get_desc().c_str(), (t1 - t0) / 1000.0); + free_compute_buffer(); + return result; + } + + LOG_INFO("%s remaining %.2f GB exceeds available %.2f GB, using per-layer streaming", + get_desc().c_str(), + analysis.remaining_to_load / (1024.0 * 1024.0 * 1024.0), + analysis.available_vram / (1024.0 * 1024.0 * 1024.0)); + + return compute_streaming_true(n_threads, x, timesteps, context, clip_fea, c_concat, + time_dim_concat, vace_context, vace_strength, output, output_ctx); + } + + bool compute_streaming_true(int n_threads, + struct ggml_tensor* x, + struct ggml_tensor* timesteps, + struct ggml_tensor* context, + struct ggml_tensor* clip_fea = nullptr, + struct ggml_tensor* c_concat = nullptr, + struct ggml_tensor* time_dim_concat = nullptr, + struct ggml_tensor* vace_context = nullptr, + float vace_strength = 1.f, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) { + auto& registry = streaming_engine_->get_registry(); + int64_t t_start = ggml_time_ms(); + + const int num_blocks = wan.get_num_layers(); + const auto& patch_size = wan.get_patch_size(); + const int64_t W = x->ne[0]; + const int64_t H = x->ne[1]; + const int64_t T = x->ne[2]; + + LOG_INFO("TRUE per-layer streaming - %d blocks", num_blocks); + + // Load global layers (includes embedders) + if (!registry.move_layer_to_gpu("_global")) { + 
LOG_ERROR("Failed to load _global to GPU"); + return false; + } + + // Generate PE + pe_vec = Rope::gen_wan_pe(static_cast(T), + static_cast(H), + static_cast(W), + std::get<0>(patch_size), + std::get<1>(patch_size), + std::get<2>(patch_size), + 1, + wan_params.theta, + wan_params.axes_dim); + + // Persistent storage + std::vector persistent_x; + std::vector persistent_x_orig; + std::vector persistent_c; // vace context + std::vector persistent_e0; + std::vector persistent_e; + int64_t x_ne[4], x_orig_ne[4], c_ne[4], e0_ne[4], e_ne[4]; + bool has_vace = (vace_context != nullptr); + int64_t context_img_len = 0; + int64_t t_len = 0, h_len = 0, w_len = 0; + + // Stage 1: Input stage - execute full input pipeline + LOG_DEBUG("Executing input stage"); + { + ggml_tensor* x_output = nullptr; + ggml_tensor* x_orig_output = nullptr; + ggml_tensor* c_output = nullptr; + ggml_tensor* e0_output = nullptr; + ggml_tensor* e_output = nullptr; + + auto get_input_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(WAN_GRAPH_SIZE / 2); + auto runner_ctx = get_context(); + + ggml_tensor* x_b = to_backend(x); + ggml_tensor* timesteps_b = to_backend(timesteps); + ggml_tensor* context_b = to_backend(context); + ggml_tensor* clip_fea_b = clip_fea ? to_backend(clip_fea) : nullptr; + ggml_tensor* c_concat_b = c_concat ? to_backend(c_concat) : nullptr; + ggml_tensor* time_dim_concat_b = time_dim_concat ? to_backend(time_dim_concat) : nullptr; + ggml_tensor* vace_context_b = vace_context ? to_backend(vace_context) : nullptr; + + if (c_concat_b != nullptr) { + x_b = ggml_concat(compute_ctx, x_b, c_concat_b, 3); + } + + int pos_len = static_cast(pe_vec.size() / wan_params.axes_dim_sum / 2); + auto pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, wan_params.axes_dim_sum / 2, pos_len); + set_backend_tensor_data(pe, pe_vec.data()); + + struct ggml_tensor* out = wan.forward(&runner_ctx, + x_b, + timesteps_b, + context_b, + pe, + clip_fea_b, + time_dim_concat_b, + vace_context_b, + vace_strength, + 1); + + ggml_build_forward_expand(gf, out); + x_output = out; + + return gf; + }; + + if (!GGMLRunner::compute(get_input_graph, n_threads, true, output, output_ctx, true)) { + LOG_ERROR("Compute failed"); + return false; + } + } + + int64_t t_end = ggml_time_ms(); + LOG_INFO("Streaming completed in %.2fs (%d blocks)", + (t_end - t_start) / 1000.0, num_blocks); + + return true; + } + ggml_cgraph* build_graph(const sd::Tensor& x_tensor, const sd::Tensor& timesteps_tensor, const sd::Tensor& context_tensor = {}, @@ -2268,6 +2475,67 @@ namespace WAN { return gf; } + // Raw tensor compute used by streaming infrastructure + bool compute(int n_threads, + struct ggml_tensor* x, + struct ggml_tensor* timesteps, + struct ggml_tensor* context, + struct ggml_tensor* clip_fea = nullptr, + struct ggml_tensor* c_concat = nullptr, + struct ggml_tensor* time_dim_concat = nullptr, + struct ggml_tensor* vace_context = nullptr, + float vace_strength = 1.f, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr, + bool skip_param_offload = false) { + auto get_graph = [&]() -> ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(WAN_GRAPH_SIZE); + + x = to_backend(x); + timesteps = to_backend(timesteps); + context = to_backend(context); + clip_fea = to_backend(clip_fea); + c_concat = to_backend(c_concat); + time_dim_concat = to_backend(time_dim_concat); + vace_context = to_backend(vace_context); + + pe_vec = Rope::gen_wan_pe(static_cast(x->ne[2]), + static_cast(x->ne[1]), + 
static_cast(x->ne[0]), + std::get<0>(wan_params.patch_size), + std::get<1>(wan_params.patch_size), + std::get<2>(wan_params.patch_size), + 1, + wan_params.theta, + wan_params.axes_dim); + int pos_len = static_cast(pe_vec.size() / wan_params.axes_dim_sum / 2); + auto pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, wan_params.axes_dim_sum / 2, pos_len); + set_backend_tensor_data(pe, pe_vec.data()); + + if (c_concat != nullptr) { + x = ggml_concat(compute_ctx, x, c_concat, 3); + } + + auto runner_ctx = get_context(); + + ggml_tensor* out = wan.forward(&runner_ctx, + x, + timesteps, + context, + pe, + clip_fea, + time_dim_concat, + vace_context, + vace_strength); + + ggml_build_forward_expand(gf, out); + return gf; + }; + + return GGMLRunner::compute(get_graph, n_threads, false, output, output_ctx, skip_param_offload); + } + + // Upstream sd::Tensor compute interface sd::Tensor compute(int n_threads, const sd::Tensor& x, const sd::Tensor& timesteps, diff --git a/src/z_image.hpp b/src/z_image.hpp index 00b69c264..1d090cc2c 100644 --- a/src/z_image.hpp +++ b/src/z_image.hpp @@ -2,9 +2,12 @@ #define __Z_IMAGE_HPP__ #include +#include +#include "chunk_graph.hpp" #include "flux.hpp" #include "ggml_extend.hpp" +#include "layer_streaming.hpp" #include "mmdit.hpp" // Ref: https://github.com/Alpha-VLLM/Lumina-Image-2.0/blob/main/models/model.py @@ -462,6 +465,95 @@ namespace ZImage { return out; } + + struct StreamingInputResult { + ggml_tensor* txt; // [N, n_txt_token + n_txt_pad_token, hidden_size] + ggml_tensor* img; // [N, n_img_token + n_img_pad_token, hidden_size] + ggml_tensor* t_emb; // [N, hidden_size] + ggml_tensor* txt_pe; // PE for txt + ggml_tensor* img_pe; // PE for img + ggml_tensor* full_pe; // Full PE for main layers + int64_t n_txt_token; + int64_t n_txt_pad_token; + int64_t n_img_token; + }; + + StreamingInputResult forward_input_stage(GGMLRunnerContext* ctx, + struct ggml_tensor* x, + struct ggml_tensor* timestep, + struct ggml_tensor* context, + struct ggml_tensor* pe) { + auto x_embedder = std::dynamic_pointer_cast(blocks["x_embedder"]); + auto t_embedder = std::dynamic_pointer_cast(blocks["t_embedder"]); + auto cap_embedder_0 = std::dynamic_pointer_cast(blocks["cap_embedder.0"]); + auto cap_embedder_1 = std::dynamic_pointer_cast(blocks["cap_embedder.1"]); + + auto txt_pad_token = params["cap_pad_token"]; + auto img_pad_token = params["x_pad_token"]; + + int64_t N = x->ne[2]; + int64_t n_img_token = x->ne[1]; + int64_t n_txt_token = context->ne[1]; + + auto t_emb = t_embedder->forward(ctx, timestep); + + auto txt = cap_embedder_1->forward(ctx, cap_embedder_0->forward(ctx, context)); // [N, n_txt_token, hidden_size] + auto img = x_embedder->forward(ctx, x); // [N, n_img_token, hidden_size] + + int64_t n_txt_pad_token = Rope::bound_mod(static_cast(n_txt_token), SEQ_MULTI_OF); + if (n_txt_pad_token > 0) { + auto txt_pad_tokens = ggml_repeat_4d(ctx->ggml_ctx, txt_pad_token, txt_pad_token->ne[0], n_txt_pad_token, N, 1); + txt = ggml_concat(ctx->ggml_ctx, txt, txt_pad_tokens, 1); + } + + int64_t n_img_pad_token = Rope::bound_mod(static_cast(n_img_token), SEQ_MULTI_OF); + if (n_img_pad_token > 0) { + auto img_pad_tokens = ggml_repeat_4d(ctx->ggml_ctx, img_pad_token, img_pad_token->ne[0], n_img_pad_token, N, 1); + img = ggml_concat(ctx->ggml_ctx, img, img_pad_tokens, 1); + } + + auto txt_pe = ggml_ext_slice(ctx->ggml_ctx, pe, 3, 0, txt->ne[1]); + auto img_pe = ggml_ext_slice(ctx->ggml_ctx, pe, 3, txt->ne[1], pe->ne[3]); + + return {txt, img, t_emb, txt_pe, img_pe, pe, 
n_txt_token, n_txt_pad_token, n_img_token}; + } + + ggml_tensor* forward_context_refiner_block(GGMLRunnerContext* ctx, + int block_idx, + struct ggml_tensor* txt, + struct ggml_tensor* txt_pe) { + auto block = std::dynamic_pointer_cast(blocks["context_refiner." + std::to_string(block_idx)]); + return block->forward(ctx, txt, txt_pe, nullptr, nullptr); + } + + ggml_tensor* forward_noise_refiner_block(GGMLRunnerContext* ctx, + int block_idx, + struct ggml_tensor* img, + struct ggml_tensor* img_pe, + struct ggml_tensor* t_emb) { + auto block = std::dynamic_pointer_cast(blocks["noise_refiner." + std::to_string(block_idx)]); + return block->forward(ctx, img, img_pe, nullptr, t_emb); + } + + ggml_tensor* forward_layer_block(GGMLRunnerContext* ctx, + int block_idx, + struct ggml_tensor* txt_img, + struct ggml_tensor* pe, + struct ggml_tensor* t_emb) { + auto block = std::dynamic_pointer_cast(blocks["layers." + std::to_string(block_idx)]); + return block->forward(ctx, txt_img, pe, nullptr, t_emb); + } + + ggml_tensor* forward_output_stage(GGMLRunnerContext* ctx, + struct ggml_tensor* txt_img, + struct ggml_tensor* t_emb) { + auto final_layer = std::dynamic_pointer_cast(blocks["final_layer"]); + return final_layer->forward(ctx, txt_img, t_emb); + } + + int get_num_refiner_layers() const { return z_image_params.num_refiner_layers; } + int get_num_layers() const { return z_image_params.num_layers; } + int get_patch_size() const { return z_image_params.patch_size; } }; struct ZImageRunner : public GGMLRunner { @@ -472,6 +564,21 @@ namespace ZImage { std::vector timestep_vec; SDVersion version; + // Number of main layers kept resident on GPU across sampling steps. + // -1 = uncomputed; set on the first compute_streaming_true() call once + // refiners and _global are loaded so we know real free VRAM. + int resident_layer_count_ = -1; + + // Phase 4: cached "chunk" graph spanning all K resident layers in one + // dispatch. Built once on the first sampling step that has K > 0, + // dispatched once per subsequent step. Resident layer weights never + // move between steps so the graph stays cache-stable. Rebuilt when + // input shapes change (e.g. between queue jobs with different prompt + // token counts). See chunk_graph.hpp for the shared helper. + LayerStreaming::ChunkGraph chunk_graph_; + + public: + ZImageRunner(ggml_backend_t backend, bool offload_params_to_cpu, const String2TensorStorage& tensor_storage_map = {}, @@ -482,6 +589,91 @@ namespace ZImage { z_image.init(params_ctx, tensor_storage_map, prefix); } + ~ZImageRunner() = default; + + // Drop the cached chunk graph and reset the resident-layer count when + // streaming layers are evicted to CPU. The chunk graph's compiled ops + // hold raw pointers into the resident layers' GPU tensors; once those + // tensors are moved off-GPU, reusing the graph would read freed + // memory. Forcing a rebuild also lets a new generation pick a + // different resident set if VRAM availability changed. + void on_streaming_layers_offloaded() override { + chunk_graph_.clear(); + resident_layer_count_ = -1; + } + + // Build (or reuse a cached) chunk graph for K resident layers, then + // dispatch it: upload the persistent activations + pe, run K layers in + // a single ggml_backend_graph_compute, read the chunk output back into + // persistent_txt_img. Replaces the per-layer dispatch loop for the + // resident block. 
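+    //
+    // Usage sketch (hypothetical call site inside compute_streaming_true(),
+    // assuming K = resident_layer_count_ has been decided and the activations
+    // already sit in the persistent host buffers):
+    //
+    //     if (K > 0 && dispatch_resident_chunk(K, txt_img_ne, t_emb_ne,
+    //                                          persistent_txt_img.data(),
+    //                                          persistent_t_emb.data())) {
+    //         // layers [0, K) ran in a single dispatch; stream the remaining
+    //         // layers [K, num_layers) through the registry one at a time
+    //     }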
+        bool dispatch_resident_chunk(int K,
+                                     const int64_t txt_img_ne[4],
+                                     const int64_t t_emb_ne[4],
+                                     float* persistent_txt_img,
+                                     float* persistent_t_emb) {
+            int pos_len = static_cast<int>(pe_vec.size() / z_image_params.axes_dim_sum / 2);
+            std::vector<std::vector<int64_t>> shapes = {
+                { txt_img_ne[0], txt_img_ne[1], txt_img_ne[2], txt_img_ne[3] },
+                { t_emb_ne[0], t_emb_ne[1], t_emb_ne[2], t_emb_ne[3] },
+                { 2, 2, z_image_params.axes_dim_sum / 2, pos_len },
+            };
+
+            auto build_fn = [this](ggml_context* ctx,
+                                   const std::vector<ggml_tensor*>& inputs,
+                                   int K_inner) -> ggml_tensor* {
+                GGMLRunnerContext runner_ctx;
+                runner_ctx.ggml_ctx              = ctx;
+                runner_ctx.backend               = runtime_backend;
+                runner_ctx.flash_attn_enabled    = flash_attn_enabled;
+                runner_ctx.conv2d_direct_enabled = conv2d_direct_enabled;
+                runner_ctx.circular_x_enabled    = circular_x_enabled;
+                runner_ctx.circular_y_enabled    = circular_y_enabled;
+                runner_ctx.weight_adapter        = weight_adapter;
+
+                ggml_tensor* x     = inputs[0];  // txt_img
+                ggml_tensor* t_emb = inputs[1];
+                ggml_tensor* pe    = inputs[2];
+                for (int i = 0; i < K_inner; i++) {
+                    x = z_image.forward_layer_block(&runner_ctx, i, x, pe, t_emb);
+                }
+                return x;
+            };
+
+            // Fingerprint any state captured by reference in the cached graph
+            // that would invalidate it: weight_adapter (replaced per
+            // apply_loras call, so its tensors can be freed) and the runner
+            // boolean flags that pick alternate ops in forward_layer_block.
+            uint64_t state_token = reinterpret_cast<uint64_t>(weight_adapter.get());
+            state_token ^= (static_cast<uint64_t>(flash_attn_enabled) << 0)
+                         | (static_cast<uint64_t>(conv2d_direct_enabled) << 1)
+                         | (static_cast<uint64_t>(circular_x_enabled) << 2)
+                         | (static_cast<uint64_t>(circular_y_enabled) << 3);
+
+            if (!chunk_graph_.ensure_built(runtime_backend, K, shapes,
+                                           GGML_TYPE_F32, state_token, build_fn,
+                                           Z_IMAGE_GRAPH_SIZE * 2,
+                                           get_desc())) {
+                return false;
+            }
+
+            std::vector<float*> host_data = {
+                persistent_txt_img,
+                persistent_t_emb,
+                pe_vec.data(),
+            };
+            std::vector<size_t> host_nbytes = {
+                static_cast<size_t>(txt_img_ne[0] * txt_img_ne[1] * txt_img_ne[2] * txt_img_ne[3]) * sizeof(float),
+                static_cast<size_t>(t_emb_ne[0] * t_emb_ne[1] * t_emb_ne[2] * t_emb_ne[3]) * sizeof(float),
+                static_cast<size_t>(2 * 2 * (z_image_params.axes_dim_sum / 2) * pos_len) * sizeof(float),
+            };
+
+            size_t out_nbytes = ggml_nbytes(chunk_graph_.output());
+            return chunk_graph_.dispatch(runtime_backend,
+                                         host_data, host_nbytes,
+                                         persistent_txt_img, out_nbytes);
+        }
+
         std::string get_desc() override {
             return "z_image";
         }
@@ -490,6 +682,511 @@ namespace ZImage {
             z_image.get_param_tensors(tensors, prefix);
         }
 
+        void enable_layer_streaming(const LayerStreaming::StreamingConfig& config = {}) {
+            std::map<std::string, ggml_tensor*> tensor_map;
+            z_image.get_param_tensors(tensor_map, "model.diffusion_model");
+            init_streaming(config, tensor_map, LayerStreaming::zimage_layer_pattern);
+            LOG_INFO("%s layer streaming enabled (%zu layers)",
+                     get_desc().c_str(), streaming_engine_->get_registry().get_layer_count());
+        }
+
+        bool compute_streaming(int n_threads,
+                               struct ggml_tensor* x,
+                               struct ggml_tensor* timesteps,
+                               struct ggml_tensor* context,
+                               std::vector<ggml_tensor*> ref_latents = {},
+                               bool increase_ref_index = false,
+                               struct ggml_tensor** output = nullptr,
+                               struct ggml_context* output_ctx = nullptr) {
+            if (!is_streaming_enabled()) {
+                LOG_ERROR("%s streaming not enabled", get_desc().c_str());
+                return false;
+            }
+
+            int64_t t0    = ggml_time_ms();
+            auto analysis = analyze_vram_budget();
+
+            if (analysis.fits_in_vram) {
+                LOG_INFO("%s model fits in VRAM, using coarse-stage streaming", get_desc().c_str());
+                load_all_layers_coarse();
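+                // Coarse stage: every layer is on GPU at this point, so the
+                // regular full-graph compute() below runs unchanged; the only
+                // streaming overhead was the bulk CPU->GPU weight upload above.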
+                bool result = compute(n_threads, x, timesteps, context, ref_latents, increase_ref_index,
+                                      output, output_ctx, true);
+                int64_t t1 = ggml_time_ms();
+                LOG_INFO("%s coarse-stage streaming completed in %.2fs", get_desc().c_str(), (t1 - t0) / 1000.0);
+                free_compute_buffer();
+                return result;
+            }
+
+            LOG_INFO("%s remaining %.2f GB exceeds available %.2f GB, using per-layer streaming",
+                     get_desc().c_str(),
+                     analysis.remaining_to_load / (1024.0 * 1024.0 * 1024.0),
+                     analysis.available_vram / (1024.0 * 1024.0 * 1024.0));
+
+            return compute_streaming_true(n_threads, x, timesteps, context, ref_latents, increase_ref_index,
+                                          output, output_ctx);
+        }
+
+        bool compute_streaming_true(int n_threads,
+                                    struct ggml_tensor* x,
+                                    struct ggml_tensor* timesteps,
+                                    struct ggml_tensor* context,
+                                    std::vector<ggml_tensor*> ref_latents = {},
+                                    bool increase_ref_index = false,
+                                    struct ggml_tensor** output = nullptr,
+                                    struct ggml_context* output_ctx = nullptr) {
+            auto& registry  = streaming_engine_->get_registry();
+            int64_t t_start = ggml_time_ms();
+
+            const int num_refiner_layers = z_image.get_num_refiner_layers();
+            const int num_layers         = z_image.get_num_layers();
+            const int patch_size         = z_image.get_patch_size();
+            const int64_t W              = x->ne[0];
+            const int64_t H              = x->ne[1];
+
+            LOG_INFO("TRUE per-layer streaming - %d refiners + %d layers",
+                     num_refiner_layers, num_layers);
+
+            // Load global layers
+            if (!registry.move_layer_to_gpu("_global")) {
+                LOG_ERROR("Failed to load _global to GPU");
+                return false;
+            }
+
+            // Load refiner layers (context_refiner and noise_refiner)
+            for (int i = 0; i < num_refiner_layers; i++) {
+                std::string cr_name = "context_refiner." + std::to_string(i);
+                std::string nr_name = "noise_refiner." + std::to_string(i);
+                if (!registry.move_layer_to_gpu(cr_name)) {
+                    LOG_ERROR("Failed to load %s to GPU", cr_name.c_str());
+                    return false;
+                }
+                if (!registry.move_layer_to_gpu(nr_name)) {
+                    LOG_ERROR("Failed to load %s to GPU", nr_name.c_str());
+                    return false;
+                }
+            }
+
+            // Generate PE
+            pe_vec = Rope::gen_z_image_pe(static_cast<int>(H),
+                                          static_cast<int>(W),
+                                          z_image_params.patch_size,
+                                          static_cast<int>(x->ne[3]),
+                                          static_cast<int>(context->ne[1]),
+                                          SEQ_MULTI_OF,
+                                          ref_latents,
+                                          increase_ref_index,
+                                          z_image_params.theta,
+                                          circular_y_enabled,
+                                          circular_x_enabled,
+                                          z_image_params.axes_dim);
+
+            // For ZImage the refiners are executed together with the global
+            // layers in one graph, and the main layers are then streamed one
+            // at a time. This is a simplification that works because the
+            // refiners are small relative to the main layer stack.
+
+            // Persistent activation storage. Pinned host buffers (member-scoped,
+            // reused across sampling steps) so the per-layer ggml_backend_tensor_get
+            // and copy_data_to_backend_tensor calls run at full PCIe bandwidth.
+            // Falls back to a pageable std::vector if the pinned alloc fails.
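+            // Why pinning matters: transfers from pageable host memory are
+            // typically staged through a driver-internal bounce buffer, which
+            // can roughly halve effective PCIe throughput; a page-locked
+            // buffer lets each host<->device copy run as a single DMA
+            // transfer. ensure_pinned_act_buffers() is assumed to hide the
+            // backend-specific allocation details.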
+            std::vector<float> persistent_txt_img_fallback;
+            std::vector<float> persistent_t_emb_fallback;
+            float* persistent_txt_img = nullptr;
+            float* persistent_t_emb   = nullptr;
+            int64_t txt_img_ne[4], t_emb_ne[4];
+            int64_t n_txt_token = 0, n_txt_pad_token = 0, n_img_token_val = 0;
+
+            // Stage 1: Input + Refiners (all in one graph since refiners are small)
+            {
+                ggml_tensor* txt_img_output = nullptr;
+                ggml_tensor* t_emb_output   = nullptr;
+
+                auto get_refiner_graph = [&]() -> struct ggml_cgraph* {
+                    struct ggml_cgraph* gf = new_graph_custom(Z_IMAGE_GRAPH_SIZE / 2);
+                    auto runner_ctx        = get_context();
+
+                    ggml_tensor* x_backend         = to_backend(x);
+                    ggml_tensor* context_backend   = to_backend(context);
+                    ggml_tensor* timesteps_backend = to_backend(timesteps);
+
+                    // Patchify
+                    auto img        = DiT::pad_and_patchify(&runner_ctx, x_backend, patch_size, patch_size, false);
+                    n_img_token_val = img->ne[1];
+
+                    // Handle ref_latents
+                    for (auto& ref : ref_latents) {
+                        auto ref_backend = to_backend(ref);
+                        ref_backend      = DiT::pad_and_patchify(&runner_ctx, ref_backend, patch_size, patch_size, false);
+                        img              = ggml_concat(compute_ctx, img, ref_backend, 1);
+                    }
+
+                    // PE tensor
+                    int pos_len = static_cast<int>(pe_vec.size() / z_image_params.axes_dim_sum / 2);
+                    auto pe     = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, z_image_params.axes_dim_sum / 2, pos_len);
+                    set_backend_tensor_data(pe, pe_vec.data());
+
+                    // Input stage
+                    auto input_result = z_image.forward_input_stage(&runner_ctx, img, timesteps_backend, context_backend, pe);
+                    auto txt          = input_result.txt;
+                    img               = input_result.img;
+                    auto t_emb        = input_result.t_emb;
+                    auto txt_pe       = input_result.txt_pe;
+                    auto img_pe       = input_result.img_pe;
+                    n_txt_token       = input_result.n_txt_token;
+                    n_txt_pad_token   = input_result.n_txt_pad_token;
+
+                    // Verify PE size
+                    int64_t total_tokens = txt->ne[1] + img->ne[1];
+                    if (pe->ne[3] != total_tokens) {
+                        LOG_ERROR("ZImage PE mismatch: PE has %ld positions but model needs %ld tokens",
+                                  pe->ne[3], total_tokens);
+                    }
+
+                    // Context refiners
+                    for (int i = 0; i < num_refiner_layers; i++) {
+                        txt = z_image.forward_context_refiner_block(&runner_ctx, i, txt, txt_pe);
+                    }
+
+                    // Noise refiners
+                    for (int i = 0; i < num_refiner_layers; i++) {
+                        img = z_image.forward_noise_refiner_block(&runner_ctx, i, img, img_pe, t_emb);
+                    }
+
+                    // Concat for main layers
+                    txt_img_output = ggml_concat(compute_ctx, txt, img, 1);
+
+                    // Create an explicit copy of t_emb to prevent buffer aliasing:
+                    // the allocator may reuse t_emb's buffer after the noise refiners use it.
+                    auto t_emb_copy = ggml_new_tensor(compute_ctx, t_emb->type, ggml_n_dims(t_emb), t_emb->ne);
+                    t_emb_copy      = ggml_cpy(compute_ctx, t_emb, t_emb_copy);
+                    ggml_set_name(t_emb_copy, "t_emb_output_copy");
+                    t_emb_output = t_emb_copy;
+
+                    ggml_build_forward_expand(gf, txt_img_output);
+                    ggml_build_forward_expand(gf, t_emb_output);
+
+                    return gf;
+                };
+
+                // Don't free the compute buffer immediately - we need to read the outputs first
+                if (!GGMLRunner::compute(get_refiner_graph, n_threads, false, nullptr, nullptr, true)) {
+                    LOG_ERROR("Refiner stage failed");
+                    return false;
+                }
+
+                // Extract to persistent storage
+                if (txt_img_output && t_emb_output) {
+                    size_t txt_img_size = ggml_nelements(txt_img_output);
+                    size_t t_emb_size   = ggml_nelements(t_emb_output);
+
+                    std::vector<float*> ptrs;
+                    if (ensure_pinned_act_buffers({txt_img_size * sizeof(float),
+                                                   t_emb_size * sizeof(float)}, ptrs)) {
+                        persistent_txt_img = ptrs[0];
+                        persistent_t_emb   = ptrs[1];
+                    } else {
+                        persistent_txt_img_fallback.resize(txt_img_size);
+                        persistent_t_emb_fallback.resize(t_emb_size);
+                        persistent_txt_img = persistent_txt_img_fallback.data();
+                        persistent_t_emb   = persistent_t_emb_fallback.data();
+                    }
+
+                    ggml_backend_tensor_get(txt_img_output, persistent_txt_img, 0, txt_img_size * sizeof(float));
+                    ggml_backend_tensor_get(t_emb_output, persistent_t_emb, 0, t_emb_size * sizeof(float));
+
+                    for (int i = 0; i < 4; i++) {
+                        txt_img_ne[i] = txt_img_output->ne[i];
+                        t_emb_ne[i]   = t_emb_output->ne[i];
+                    }
+                } else {
+                    LOG_ERROR("Failed to get refiner stage outputs");
+                    free_compute_buffer();
+                    return false;
+                }
+
+                // Now safe to free the compute buffer
+                free_compute_buffer();
+            }
+
+            // Refiners stay resident across sampling steps. Their weights are
+            // identical every step, so evicting and re-streaming them was pure
+            // waste. They cost roughly four layers' worth of VRAM (small).
+
+            // On the first sampling step, decide how many main layers we can
+            // keep permanently resident. Layers [0..K-1] become a static cache;
+            // layers [K..N-1] continue to stream and evict each step.
+            if (resident_layer_count_ < 0 && streaming_engine_) {
+                resident_layer_count_ = streaming_engine_->compute_resident_block_count("layers.0", num_layers);
+                LOG_INFO("%s layer cache: %d resident, %d streamed per step",
+                         get_desc().c_str(),
+                         resident_layer_count_,
+                         num_layers - resident_layer_count_);
+            }
+
+            // Stage 2: Main layers (one at a time)
+            // Debug: limit layers if the env var is set (to isolate where a grid pattern appears)
+            const char* limit_layers_env = std::getenv("SDCPP_LIMIT_MAIN_LAYERS");
+            int layers_to_run            = num_layers;
+            if (limit_layers_env) {
+                int limit = std::atoi(limit_layers_env);
+                if (limit >= 0 && limit < num_layers) {
+                    layers_to_run = limit;
+                    LOG_WARN("SDCPP_LIMIT_MAIN_LAYERS=%d: Running only %d of %d main layers (debug mode)",
+                             limit, layers_to_run, num_layers);
+                }
+            }
+
+            auto layer_name_at = [](int i) { return "layers." + std::to_string(i); };
+
+            // Phase 4: dispatch the K resident layers as a single mega-graph
+            // (one ggml_backend_graph_compute call instead of K). On the first
+            // sampling step we pre-load all K resident weights and build the
+            // cached graph; subsequent steps reuse it.
+            int chunk_K = std::min(resident_layer_count_ < 0 ? 0 : resident_layer_count_,
+                                   layers_to_run);
+            if (chunk_K > 0) {
+                for (int i = 0; i < chunk_K; i++) {
+                    std::string nm = layer_name_at(i);
+                    if (!registry.is_layer_on_gpu(nm)) {
+                        if (!registry.move_layer_to_gpu(nm)) {
+                            LOG_ERROR("Failed to load resident %s for chunk", nm.c_str());
+                            return false;
+                        }
+                    }
+                }
+                // The shared ChunkGraph helper (chunk_graph.hpp) handles cache
+                // reuse and shape-mismatch rebuild automatically.
+                if (!dispatch_resident_chunk(chunk_K, txt_img_ne, t_emb_ne,
+                                             persistent_txt_img, persistent_t_emb)) {
+                    return false;
+                }
+                // The chunk output has the same shape as the last resident
+                // layer's output; ne carries through unchanged.
+                for (int i = 0; i < 4; i++) {
+                    txt_img_ne[i] = chunk_graph_.output()->ne[i];
+                }
+            }
+
+            // Begin prefetch at the first non-resident layer. With chunk_K > 0
+            // the resident prefix is already loaded, so prefetch starts at K.
+            int prefetch_start = chunk_K;
+            while (prefetch_start < num_layers &&
+                   registry.is_layer_on_gpu(layer_name_at(prefetch_start))) {
+                prefetch_start++;
+            }
+            if (streaming_engine_) {
+                streaming_engine_->prime_prefetch(layer_name_at, prefetch_start, num_layers);
+            }
+
+            // Phase 3 profiling: per-stage cumulative timings, dumped after the
+            // main loop. Set SDCPP_STREAM_PROFILE=1 to enable.
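+            // Rough meaning of each bucket (matching the timing points below):
+            //   wait       - blocked on an in-flight async prefetch
+            //   load       - synchronous weight upload when the prefetch missed
+            //   advance    - queueing the next prefetch requests
+            //   compute    - graph build + allocation + kernel execution
+            //   tensor_get - reading the layer output back to the host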
+            int64_t prof_wait_us    = 0;
+            int64_t prof_load_us    = 0;
+            int64_t prof_advance_us = 0;
+            int64_t prof_build_us   = 0;
+            int64_t prof_compute_us = 0;
+            int64_t prof_get_us     = 0;
+            int64_t prof_evict_us   = 0;
+            const bool prof_enabled = std::getenv("SDCPP_STREAM_PROFILE") != nullptr;
+            auto prof_now           = []() { return ggml_time_us(); };
+
+            // Phase 4: skip layers already covered by the chunk dispatch.
+            for (int layer_idx = chunk_K; layer_idx < layers_to_run; layer_idx++) {
+                std::string layer_name = layer_name_at(layer_idx);
+
+                int64_t t0 = prof_enabled ? prof_now() : 0;
+
+                // Wait for this layer's prefetch to complete (if an async prefetch was started)
+                if (streaming_engine_) {
+                    streaming_engine_->wait_for_prefetch(layer_name);
+                }
+                int64_t t1 = prof_enabled ? prof_now() : 0;
+
+                // Load this layer's weights (synchronous load if the prefetch didn't happen)
+                if (!registry.move_layer_to_gpu(layer_name)) {
+                    LOG_ERROR("Failed to load %s", layer_name.c_str());
+                    return false;
+                }
+                int64_t t2 = prof_enabled ? prof_now() : 0;
+
+                // Keep the prefetch window full
+                if (streaming_engine_) {
+                    streaming_engine_->advance_prefetch(layer_name_at, layer_idx, num_layers);
+                }
+                int64_t t3 = prof_enabled ? prof_now() : 0;
+
+                ggml_tensor* txt_img_out = nullptr;
+
+                auto get_layer_graph = [&]() -> struct ggml_cgraph* {
+                    struct ggml_cgraph* gf = new_graph_custom(Z_IMAGE_GRAPH_SIZE / 4);
+
+                    // Create input tensors in compute_ctx - no need for to_backend() since
+                    // these are created fresh and will be allocated by the graph allocator
+                    ggml_tensor* txt_img_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32,
+                                                                 txt_img_ne[0], txt_img_ne[1], txt_img_ne[2], txt_img_ne[3]);
+                    ggml_tensor* t_emb_in   = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32,
+                                                                 t_emb_ne[0], t_emb_ne[1], t_emb_ne[2], t_emb_ne[3]);
+
+                    // Schedule data copy from CPU to GPU (happens after graph allocation)
+                    set_backend_tensor_data(txt_img_in, persistent_txt_img);
+                    set_backend_tensor_data(t_emb_in, persistent_t_emb);
+
+                    // PE tensor
+                    int pos_len = static_cast<int>(pe_vec.size() / z_image_params.axes_dim_sum / 2);
+                    auto pe     = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, z_image_params.axes_dim_sum / 2, pos_len);
+                    set_backend_tensor_data(pe, pe_vec.data());
+
+                    auto runner_ctx = get_context();
+                    txt_img_out     = z_image.forward_layer_block(&runner_ctx, layer_idx, txt_img_in, pe, t_emb_in);
+
+                    ggml_build_forward_expand(gf, txt_img_out);
+
+                    return gf;
+                };
+
+                if (!GGMLRunner::compute(get_layer_graph, n_threads, false, nullptr, nullptr, true)) {
+                    LOG_ERROR("Layer %d execution failed", layer_idx);
+                    return false;
+                }
+                int64_t t4 = prof_enabled ? prof_now() : 0;
+
+                // Extract output
+                if (txt_img_out) {
+                    ggml_backend_tensor_get(txt_img_out, persistent_txt_img, 0, ggml_nbytes(txt_img_out));
+                    for (int i = 0; i < 4; i++) {
+                        txt_img_ne[i] = txt_img_out->ne[i];
+                    }
+                }
+                int64_t t5 = prof_enabled ? prof_now() : 0;
+
+                if (prof_enabled) {
+                    prof_wait_us += t1 - t0;
+                    prof_load_us += t2 - t1;
+                    prof_advance_us += t3 - t2;
+                    // build+compute happens together inside GGMLRunner::compute;
+                    // we can't separate them without instrumenting ggml_extend.
+                    prof_compute_us += t4 - t3;
+                    prof_get_us += t5 - t4;
+                }
+
+                // Don't free the compute buffer here — every main layer has the same shape,
+                // so the gallocr can be reused for the entire sampling step. Freeing here
+                // forces a destroy-and-recreate cycle that idles the GPU between layers.
+
+                // Resident layers stay on GPU across sampling steps; only evict
+                // streamed layers (idx >= resident_layer_count_).
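+                // Example (illustrative numbers): with 30 main layers and
+                // resident_layer_count_ == 12, layers 0-11 run once per step
+                // via the cached chunk graph and never move, while layers
+                // 12-29 each pay one weight upload (ideally hidden by the
+                // prefetch) and one eviction per step.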
+                if (layer_idx >= resident_layer_count_) {
+                    registry.move_layer_to_cpu(layer_name);
+                }
+            }
+
+            if (prof_enabled) {
+                int64_t total = prof_wait_us + prof_load_us + prof_advance_us +
+                                prof_compute_us + prof_get_us;
+                LOG_INFO("[stream-profile] %d layers: total=%.2fms wait=%.2fms load=%.2fms "
+                         "advance=%.2fms compute=%.2fms tensor_get=%.2fms",
+                         layers_to_run,
+                         total / 1000.0,
+                         prof_wait_us / 1000.0,
+                         prof_load_us / 1000.0,
+                         prof_advance_us / 1000.0,
+                         prof_compute_us / 1000.0,
+                         prof_get_us / 1000.0);
+            }
+
+            // After all main layers are done, free the compute buffer so the output stage
+            // (different graph topology) can allocate a fresh one.
+            free_compute_buffer();
+
+            // Stage 3: Output
+            {
+                auto get_output_graph = [&]() -> struct ggml_cgraph* {
+                    struct ggml_cgraph* gf = new_graph_custom(Z_IMAGE_GRAPH_SIZE / 4);
+
+                    // Create input tensors in compute_ctx - no to_backend() needed
+                    ggml_tensor* txt_img_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32,
+                                                                 txt_img_ne[0], txt_img_ne[1], txt_img_ne[2], txt_img_ne[3]);
+                    ggml_tensor* t_emb_in   = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32,
+                                                                 t_emb_ne[0], t_emb_ne[1], t_emb_ne[2], t_emb_ne[3]);
+
+                    // Schedule data copy from CPU to GPU
+                    set_backend_tensor_data(txt_img_in, persistent_txt_img);
+                    set_backend_tensor_data(t_emb_in, persistent_t_emb);
+
+                    auto runner_ctx = get_context();
+                    auto final_out  = z_image.forward_output_stage(&runner_ctx, txt_img_in, t_emb_in);
+
+                    // Extract the img portion and unpatchify
+                    int64_t n_img_token = n_img_token_val;
+                    final_out           = ggml_ext_slice(compute_ctx, final_out, 1,
+                                                         n_txt_token + n_txt_pad_token,
+                                                         n_txt_token + n_txt_pad_token + n_img_token);
+
+                    final_out = DiT::unpatchify_and_crop(compute_ctx, final_out, H, W, patch_size, patch_size, false);
+                    final_out = ggml_ext_scale(compute_ctx, final_out, -1.f);
+
+                    ggml_build_forward_expand(gf, final_out);
+
+                    return gf;
+                };
+
+                if (!GGMLRunner::compute(get_output_graph, n_threads, true, output, output_ctx, true)) {
+                    LOG_ERROR("Output stage failed");
+                    return false;
+                }
+            }
+
+            int64_t t_end = ggml_time_ms();
+            LOG_INFO("TRUE per-layer streaming completed in %.2fs (%d refiners + %d layers)",
+                     (t_end - t_start) / 1000.0, num_refiner_layers, num_layers);
+
+            return true;
+        }
+
+        // Raw pointer overload used by streaming code paths
+        struct ggml_cgraph* build_graph(struct ggml_tensor* x,
+                                        struct ggml_tensor* timesteps,
+                                        struct ggml_tensor* context,
+                                        std::vector<ggml_tensor*> ref_latents = {},
+                                        bool increase_ref_index = false) {
+            GGML_ASSERT(x->ne[3] == 1);
+            struct ggml_cgraph* gf = new_graph_custom(Z_IMAGE_GRAPH_SIZE);
+
+            x         = to_backend(x);
+            context   = to_backend(context);
+            timesteps = to_backend(timesteps);
+
+            for (size_t i = 0; i < ref_latents.size(); i++) {
+                ref_latents[i] = to_backend(ref_latents[i]);
+            }
+
+            pe_vec = Rope::gen_z_image_pe(static_cast<int>(x->ne[1]),
+                                          static_cast<int>(x->ne[0]),
+                                          z_image_params.patch_size,
+                                          static_cast<int>(x->ne[3]),
+                                          static_cast<int>(context->ne[1]),
+                                          SEQ_MULTI_OF,
+                                          ref_latents,
+                                          increase_ref_index,
+                                          z_image_params.theta,
+                                          circular_y_enabled,
+                                          circular_x_enabled,
+                                          z_image_params.axes_dim);
+            int pos_len = static_cast<int>(pe_vec.size() / z_image_params.axes_dim_sum / 2);
+            auto pe     = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, z_image_params.axes_dim_sum / 2, pos_len);
+            set_backend_tensor_data(pe, pe_vec.data());
+            auto runner_ctx = get_context();
+
+            ggml_tensor* out = z_image.forward(&runner_ctx,
+                                               x,
+                                               timesteps,
+                                               context,
+                                               pe,
+                                               ref_latents);
+
+            ggml_build_forward_expand(gf, out);
+
+            return gf;
+        }
+
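+        // Typical call path (sketch; the caller shown is illustrative): the
+        // denoise loop invokes compute_streaming() once per sampling step,
+        //
+        //   runner.compute_streaming(n_threads, x, t, cond, {}, false, &out, out_ctx);
+        //
+        // and compute_streaming() re-checks the VRAM budget on every call, so
+        // a queue that frees memory between jobs can fall back to the faster
+        // coarse-stage path on the next step.
+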
+        // sd::Tensor overload used by upstream pipeline
         ggml_cgraph* build_graph(const sd::Tensor& x_tensor,
                                  const sd::Tensor& timesteps_tensor,
                                  const sd::Tensor& context_tensor,
@@ -540,6 +1237,27 @@ namespace ZImage {
             return gf;
         }
 
+        // Raw pointer overload used by streaming/offloading code paths
+        bool compute(int n_threads,
+                     struct ggml_tensor* x,
+                     struct ggml_tensor* timesteps,
+                     struct ggml_tensor* context,
+                     std::vector<ggml_tensor*> ref_latents = {},
+                     bool increase_ref_index = false,
+                     struct ggml_tensor** output = nullptr,
+                     struct ggml_context* output_ctx = nullptr,
+                     bool skip_param_offload = false) {
+            // x:         [N, in_channels, h, w]
+            // timesteps: [N, ]
+            // context:   [N, max_position, hidden_size]
+            auto get_graph = [&]() -> ggml_cgraph* {
+                return build_graph(x, timesteps, context, ref_latents, increase_ref_index);
+            };
+
+            return GGMLRunner::compute(get_graph, n_threads, false, output, output_ctx, skip_param_offload);
+        }
+
+        // sd::Tensor overload used by upstream pipeline
         sd::Tensor compute(int n_threads,
                            const sd::Tensor& x,
                            const sd::Tensor& timesteps,