diff --git a/docs/vram_offloading.md b/docs/vram_offloading.md
new file mode 100644
index 000000000..dbfd91114
--- /dev/null
+++ b/docs/vram_offloading.md
@@ -0,0 +1,112 @@
+# VRAM Offloading
+
+Run models larger than your GPU memory by offloading weights to CPU RAM during generation.
+
+## Offload Modes
+
+Use `--offload-mode <mode>` to select the offloading strategy:
+
+| Mode | Description | VRAM Usage | Speed | Quality |
+|------|-------------|------------|-------|---------|
+| `none` | Everything stays on GPU (default) | Highest | Fastest | No penalty |
+| `cond_only` | Offload text encoder after conditioning | High | Near-full speed — only a brief reload between conditioning and diffusion | No penalty |
+| `cond_diffusion` | Offload both text encoder and diffusion model between stages | Medium | Slower — each model is reloaded to GPU when its stage runs | No penalty |
+| `aggressive` | Aggressively offload all components when not in use | Low | Slowest of the non-streaming modes — frequent CPU↔GPU transfers | No penalty |
+| `layer_streaming` | Stream transformer layers one-by-one through GPU | Lowest | Depends on model size (see below) | No penalty — coarse-stage and per-layer streaming both compute the same result as full-model execution |
+
+The `--offload-to-cpu` flag is a shortcut that picks a reasonable offload mode automatically.
+
+## Layer Streaming
+
+Layer streaming is the most memory-efficient mode. Instead of loading the entire diffusion model into VRAM, it loads one transformer block at a time.
+
+### How it works
+
+1. **Coarse-stage**: If the model fits in VRAM (e.g., quantized models), all layers are loaded at once and the full graph is executed normally. This is as fast as `--offload-mode none` with no quality penalty — the only overhead is the initial CPU→GPU weight transfer.
+2. **Per-layer streaming**: If the model doesn't fit (e.g., bf16 models on small GPUs), each transformer block is loaded, executed as a mini-graph, then offloaded back to CPU before the next block. This uses minimal VRAM but is significantly slower due to per-step CPU↔GPU transfers. Output quality is identical to full-model execution — the computation is mathematically equivalent, just split across separate graph evaluations.
+
+The mode is chosen automatically based on available VRAM.
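+In pseudocode, the choice reduces to a single budget check. The sketch below is condensed from the Anima runner's `compute_streaming` in this patch (`src/anima.hpp`); the other runners follow the same pattern, and the elided arguments are runner-specific:
+
+```
+auto analysis = analyze_vram_budget();
+if (analysis.fits_in_vram) {
+    // Coarse-stage: upload all layers once, then run the normal full graph.
+    load_all_layers_coarse();
+    compute(...);  // same speed and output as --offload-mode none
+} else {
+    // Per-layer: load block i, run it as a mini-graph, offload it, repeat.
+    compute_streaming_true(...);
+}
+```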
+ +### Supported architectures + +- Flux (double_blocks + single_blocks) +- ZImage / Z-Image-Turbo (context_refiner + noise_refiner + layers) +- MMDiT / SD3 (joint_blocks) +- UNet / SD1.x / SDXL (input_blocks + middle_block + output_blocks) +- Anima (blocks) +- WAN (blocks + vace_blocks) +- Qwen Image (transformer_blocks) + +### Examples + +#### ZImage-Turbo Q8 with layer streaming + +``` +sd-cli --diffusion-model z_image_turbo-Q8_0.gguf \ + --llm Qwen3-4b-Z-Engineer-V2.gguf \ + --vae ae.safetensors \ + -p "a cat" --cfg-scale 1.0 --diffusion-fa \ + -H 1024 -W 688 -s 42 \ + --offload-mode layer_streaming -v +``` + +The Q8 model (6.7 GB) fits in a 12 GB GPU, so coarse-stage streaming is used automatically: +``` +[INFO ] z_image model fits in VRAM, using coarse-stage streaming +[INFO ] z_image coarse-stage streaming completed in 1.66s +``` + +#### Flux-dev Q4 with layer streaming + +``` +sd-cli --diffusion-model flux1-dev-q4_0.gguf \ + --vae ae.safetensors \ + --clip_l clip_l.safetensors \ + --t5xxl t5xxl_fp16.safetensors \ + -p "a lovely cat" --cfg-scale 1.0 --sampling-method euler \ + --offload-mode layer_streaming -v +``` + +#### SD1.5 with aggressive offloading + +``` +sd-cli -m sd-v1-4.ckpt \ + -p "a photograph of an astronaut riding a horse" \ + --offload-mode aggressive -v +``` + +## Combining with other options + +- `--diffusion-fa`: Flash attention reduces VRAM further. Recommended with all offload modes. No quality penalty. +- `--clip-on-cpu`: Run CLIP text encoder on CPU. Saves VRAM but slows conditioning. No quality penalty. +- Quantized models (`q4_0`, `q8_0`, etc.) reduce model size, making coarse-stage streaming more likely (faster). **Quantization does reduce output quality** — lower bit depths produce softer details and may introduce artifacts. See [quantization](./quantization_and_gguf.md) for quality comparisons. `q8_0` is nearly indistinguishable from full precision; `q4_0` and below show visible degradation on fine details. + +## Quality impact summary + +| Technique | Quality Impact | +|-----------|---------------| +| `--offload-mode` (any mode) | **None** — offloading only changes where weights are stored, not the computation | +| `--diffusion-fa` (flash attention) | **None** — mathematically equivalent, just more memory-efficient | +| `--clip-on-cpu` | **None** — same computation on CPU instead of GPU | +| Quantization (`q8_0`) | **Negligible** — nearly identical to full precision | +| Quantization (`q4_0`, `q4_k`) | **Minor** — slight softening, fine details may differ | +| Quantization (`q3_k`, `q2_k`) | **Noticeable** — visible quality loss, best for previews or VRAM-constrained setups | + +## Troubleshooting + +- **OOM during generation**: Try a more aggressive mode. `layer_streaming` uses the least VRAM. +- **Slow generation**: Coarse-stage streaming (model fits in VRAM) is nearly as fast as no offloading. Per-layer streaming is slower due to CPU-GPU transfers each step. Using quantized models often lets you stay in coarse-stage mode. +- **Black or corrupted output**: This is a bug. Please report it with the model, offload mode, and resolution used. +- **One CPU core pegged at 100% while the GPU is working**: this is the CUDA driver spin-waiting on kernel completion. The default schedule policy (`cudaDeviceScheduleAuto`) often picks `Spin` for short-kernel workloads like per-layer streaming, which busy-waits one host thread for each kernel return. 
It does *not* slow generation down (the wait is wasted heat, not blocking work), but it looks bad on `top`/`nvtop` and is unfriendly to shared-host setups. Two ways to silence it:
+
+  1. Per-run, no rebuild needed:
+     ```
+     CUDA_DEVICE_SCHEDULE=BlockingSync sd-cli ...
+     ```
+  2. Per-process, set once at startup:
+     ```c
+     cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
+     ```
+     Long-lived processes (REST servers, queue workers) should do this.
+
+  CPU drops to near zero; GPU performance is unchanged.
diff --git a/examples/cli/main.cpp b/examples/cli/main.cpp
index 27513f475..392e9c404 100644
--- a/examples/cli/main.cpp
+++ b/examples/cli/main.cpp
@@ -698,7 +698,10 @@ int main(int argc, const char* argv[]) {
         vae_decode_only = false;
     }

-    sd_ctx_params_t sd_ctx_params = ctx_params.to_sd_ctx_params_t(vae_decode_only, true, cli_params.taesd_preview);
+    // For layer_streaming mode, we need smart offload logic instead of immediate freeing.
+    // This allows should_offload_cond_stage_for_diffusion() to be called and T5 to be offloaded before streaming.
+    bool free_params_immediately  = (ctx_params.offload_config.mode != SD_OFFLOAD_LAYER_STREAMING);
+    sd_ctx_params_t sd_ctx_params = ctx_params.to_sd_ctx_params_t(vae_decode_only, free_params_immediately, cli_params.taesd_preview);

     SDImageVec results;
     int num_results = 0;
diff --git a/examples/common/common.cpp b/examples/common/common.cpp
index d4c8a72b8..faa9eef6a 100644
--- a/examples/common/common.cpp
+++ b/examples/common/common.cpp
@@ -538,6 +538,78 @@ ArgOptions SDContextParams::get_options() {
         return 1;
     };

+    auto on_offload_mode_arg = [&](int argc, const char** argv, int index) {
+        if (++index >= argc) {
+            return -1;
+        }
+        const char* arg     = argv[index];
+        offload_config.mode = str_to_offload_mode(arg);
+        if (offload_config.mode == SD_OFFLOAD_MODE_COUNT) {
+            LOG_ERROR("error: invalid offload mode %s", arg);
+            return -1;
+        }
+        return 1;
+    };
+
+    auto on_vram_estimation_arg = [&](int argc, const char** argv, int index) {
+        if (++index >= argc) {
+            return -1;
+        }
+        const char* arg                = argv[index];
+        offload_config.vram_estimation = str_to_vram_estimation(arg);
+        if (offload_config.vram_estimation == SD_VRAM_EST_COUNT) {
+            LOG_ERROR("error: invalid VRAM estimation method %s", arg);
+            return -1;
+        }
+        return 1;
+    };
+
+    auto on_streaming_prefetch_arg = [&](int argc, const char** argv, int index) {
+        if (++index >= argc) {
+            return -1;
+        }
+        try {
+            offload_config.streaming_prefetch_layers = std::stoi(argv[index]);
+            if (offload_config.streaming_prefetch_layers < 0) {
+                LOG_ERROR("error: streaming prefetch must be >= 0");
+                return -1;
+            }
+        } catch (...) {
+            LOG_ERROR("error: invalid streaming prefetch value %s", argv[index]);
+            return -1;
+        }
+        return 1;
+    };
+
+    auto on_streaming_min_vram_arg = [&](int argc, const char** argv, int index) {
+        if (++index >= argc) {
+            return -1;
+        }
+        try {
+            int mb = std::stoi(argv[index]);
+            if (mb < 0) {
+                LOG_ERROR("error: streaming min VRAM must be >= 0");
+                return -1;
+            }
+            offload_config.streaming_min_free_vram = static_cast<size_t>(mb) * 1024 * 1024;
+        } catch (...)
{ + LOG_ERROR("error: invalid streaming min VRAM value %s", argv[index]); + return -1; + } + return 1; + }; + + options.bool_options.push_back({"", "--offload-log", "log offload events", true, &offload_config.log_offload_events}); + options.bool_options.push_back({"", "--no-offload-log", "do not log offload events", false, &offload_config.log_offload_events}); + options.bool_options.push_back({"", "--offload-cond-stage", "offload cond stage to CPU after use", true, &offload_config.offload_cond_stage}); + options.bool_options.push_back({"", "--no-offload-cond-stage", "do not offload cond stage", false, &offload_config.offload_cond_stage}); + options.bool_options.push_back({"", "--offload-diffusion", "offload diffusion model to CPU after use", true, &offload_config.offload_diffusion}); + options.bool_options.push_back({"", "--no-offload-diffusion", "do not offload diffusion model", false, &offload_config.offload_diffusion}); + options.bool_options.push_back({"", "--reload-cond-stage", "reload cond stage to GPU before use", true, &offload_config.reload_cond_stage}); + options.bool_options.push_back({"", "--no-reload-cond-stage", "do not reload cond stage", false, &offload_config.reload_cond_stage}); + options.bool_options.push_back({"", "--reload-diffusion", "reload diffusion to GPU before use", true, &offload_config.reload_diffusion}); + options.bool_options.push_back({"", "--no-reload-diffusion", "do not reload diffusion", false, &offload_config.reload_diffusion}); + options.manual_options = { {"", "--type", @@ -564,6 +636,24 @@ ArgOptions SDContextParams::get_options() { "but it usually offers faster inference speed and, in some cases, lower memory usage. " "The at_runtime mode, on the other hand, is exactly the opposite.", on_lora_apply_mode_arg}, + {"", + "--offload-mode", + "dynamic VRAM offloading mode, one of [none, cond_only, cond_diffusion, aggressive, layer_streaming] (default: none). " + "Use 'cond_only' to offload the LLM/CLIP model to CPU after conditioning. " + "Use 'layer_streaming' to stream model layers one-by-one (enables models larger than VRAM).", + on_offload_mode_arg}, + {"", + "--vram-estimation", + "VRAM estimation method for smart offloading, one of [dryrun, formula] (default: dryrun)", + on_vram_estimation_arg}, + {"", + "--streaming-prefetch", + "Number of layers to prefetch ahead during layer streaming (default: 1)", + on_streaming_prefetch_arg}, + {"", + "--streaming-min-vram", + "Minimum VRAM to keep free during layer streaming, in MB (default: 512)", + on_streaming_min_vram_arg}, }; return options; @@ -693,7 +783,14 @@ std::string SDContextParams::to_string() const { << " chroma_t5_mask_pad: " << chroma_t5_mask_pad << ",\n" << " prediction: " << sd_prediction_name(prediction) << ",\n" << " lora_apply_mode: " << sd_lora_apply_mode_name(lora_apply_mode) << ",\n" - << " force_sdxl_vae_conv_scale: " << (force_sdxl_vae_conv_scale ? "true" : "false") << "\n" + << " force_sdxl_vae_conv_scale: " << (force_sdxl_vae_conv_scale ? "true" : "false") << ",\n" + << " offload_config: { mode=" << sd_offload_mode_name(offload_config.mode) + << ", vram_est=" << sd_vram_estimation_name(offload_config.vram_estimation) + << ", offload_cond=" << (offload_config.offload_cond_stage ? "true" : "false") + << ", offload_diff=" << (offload_config.offload_diffusion ? "true" : "false") + << ", reload_cond=" << (offload_config.reload_cond_stage ? "true" : "false") + << ", reload_diff=" << (offload_config.reload_diffusion ? 
"true" : "false") + << ", log=" << (offload_config.log_offload_events ? "true" : "false") << " }\n" << "}"; return oss.str(); } @@ -751,6 +848,7 @@ sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool vae_decode_only, bool f chroma_t5_mask_pad, qwen_image_zero_cond_t, max_vram, + offload_config, }; return sd_ctx_params; } diff --git a/examples/common/common.h b/examples/common/common.h index f87293f3e..8aef9c92c 100644 --- a/examples/common/common.h +++ b/examples/common/common.h @@ -135,6 +135,12 @@ struct SDContextParams { bool force_sdxl_vae_conv_scale = false; float flow_shift = INFINITY; + + // Dynamic tensor offloading configuration + sd_offload_config_t offload_config = {SD_OFFLOAD_NONE, SD_VRAM_EST_DRYRUN, true, false, false, true, true, + 0, 2ULL * 1024 * 1024 * 1024, + false, 1, 0, 512ULL * 1024 * 1024}; + ArgOptions get_options(); void build_embedding_map(); bool resolve(SDMode mode); diff --git a/include/stable-diffusion.h b/include/stable-diffusion.h index c4c14949c..28c138c2f 100644 --- a/include/stable-diffusion.h +++ b/include/stable-diffusion.h @@ -147,6 +147,53 @@ enum lora_apply_mode_t { LORA_APPLY_MODE_COUNT, }; +// Component identifiers for dynamic tensor offloading +enum sd_component_t { + SD_COMPONENT_COND_STAGE, // LLM/CLIP text embedder + SD_COMPONENT_CLIP_VISION, // CLIP vision encoder (for SVD/Wan i2v) + SD_COMPONENT_DIFFUSION, // UNet/DiT/Flux diffusion model + SD_COMPONENT_VAE, // VAE encoder/decoder + SD_COMPONENT_CONTROL_NET, // ControlNet (if loaded) + SD_COMPONENT_PMID, // PhotoMaker ID encoder (if loaded) + SD_COMPONENT_COUNT +}; + +// Offload mode for automatic GPU memory management +enum sd_offload_mode_t { + SD_OFFLOAD_NONE, // Keep all components on GPU (default, fastest) + SD_OFFLOAD_COND_ONLY, // Offload only conditioning (LLM/CLIP) after use + SD_OFFLOAD_COND_DIFFUSION, // Offload conditioning + diffusion, keep VAE + SD_OFFLOAD_AGGRESSIVE, // Offload each component after use (saves most VRAM) + SD_OFFLOAD_LAYER_STREAMING, // Stream layers one-by-one (enables models larger than VRAM) + SD_OFFLOAD_MODE_COUNT +}; + +// VRAM estimation method for smart offloading decisions +enum sd_vram_estimation_t { + SD_VRAM_EST_DRYRUN, // Dry-run graph allocation for exact size (default, accurate) + SD_VRAM_EST_FORMULA, // Formula-based estimation (faster, approximate) + SD_VRAM_EST_COUNT +}; + +// Offload configuration for fine-grained control +typedef struct { + enum sd_offload_mode_t mode; // Offload mode + enum sd_vram_estimation_t vram_estimation; // VRAM estimation method + bool offload_cond_stage; // Offload LLM/CLIP after conditioning + bool offload_diffusion; // Offload diffusion model after sampling + bool reload_cond_stage; // Reload LLM/CLIP for next generation + bool reload_diffusion; // Reload diffusion model for next generation + bool log_offload_events; // Log offload/reload events + size_t min_offload_size; // Minimum component size to offload (bytes), 0 = no minimum + size_t target_free_vram; // Target free VRAM before VAE decode (bytes), 0 = always offload when mode is set + + // Layer streaming configuration (for SD_OFFLOAD_LAYER_STREAMING mode) + bool layer_streaming_enabled; // Enable layer-by-layer streaming execution + int streaming_prefetch_layers; // Number of layers to prefetch ahead (default: 1) + int streaming_keep_layers_behind; // Layers to keep after execution (for skip connections) + size_t streaming_min_free_vram; // Minimum VRAM to keep free during streaming (bytes) +} sd_offload_config_t; + typedef struct { bool enabled; int 
tile_size_x; @@ -203,7 +250,8 @@ typedef struct { bool chroma_use_t5_mask; int chroma_t5_mask_pad; bool qwen_image_zero_cond_t; - float max_vram; + float max_vram; // GiB budget for graph-cut segmented param offload (0 = disabled) + sd_offload_config_t offload_config; // Cross-stage and layer-streaming offload configuration } sd_ctx_params_t; typedef struct { @@ -393,6 +441,11 @@ SD_API const char* sd_preview_name(enum preview_t preview); SD_API enum preview_t str_to_preview(const char* str); SD_API const char* sd_lora_apply_mode_name(enum lora_apply_mode_t mode); SD_API enum lora_apply_mode_t str_to_lora_apply_mode(const char* str); +SD_API const char* sd_offload_mode_name(enum sd_offload_mode_t mode); +SD_API enum sd_offload_mode_t str_to_offload_mode(const char* str); +SD_API const char* sd_vram_estimation_name(enum sd_vram_estimation_t method); +SD_API enum sd_vram_estimation_t str_to_vram_estimation(const char* str); +SD_API void sd_offload_config_init(sd_offload_config_t* config); SD_API const char* sd_hires_upscaler_name(enum sd_hires_upscaler_t upscaler); SD_API enum sd_hires_upscaler_t str_to_sd_hires_upscaler(const char* str); @@ -411,6 +464,9 @@ SD_API char* sd_sample_params_to_str(const sd_sample_params_t* sample_params); SD_API enum sample_method_t sd_get_default_sample_method(const sd_ctx_t* sd_ctx); SD_API enum scheduler_t sd_get_default_scheduler(const sd_ctx_t* sd_ctx, enum sample_method_t sample_method); +// Get the model architecture/version name (e.g., "SD 1.x", "SDXL", "Flux", "Z-Image", etc.) +SD_API const char* sd_get_model_version_name(const sd_ctx_t* sd_ctx); + SD_API void sd_img_gen_params_init(sd_img_gen_params_t* sd_img_gen_params); SD_API char* sd_img_gen_params_to_str(const sd_img_gen_params_t* sd_img_gen_params); SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* sd_img_gen_params); @@ -450,6 +506,34 @@ SD_API bool preprocess_canny(sd_image_t image, SD_API const char* sd_commit(void); SD_API const char* sd_version(void); +// Dynamic tensor offloading API +// These functions allow runtime GPU memory management by moving model components +// between CPU and GPU. This enables running larger models on limited VRAM by +// keeping only the currently-active component on GPU. 
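+//
+// Illustrative caller sketch (hypothetical sequence; uses only the functions
+// declared below, error handling elided):
+//   sd_offload_to_cpu(ctx, SD_COMPONENT_COND_STAGE);   // after conditioning
+//   /* ... diffusion + VAE decode ... */
+//   sd_reload_to_gpu(ctx, SD_COMPONENT_COND_STAGE);    // before the next prompt
+//   sd_free_gpu_resources(ctx);                        // before unloading the model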
+ +// Offload component from GPU to CPU (frees GPU memory) +// Returns true on success, false if component doesn't exist or is already on CPU +SD_API bool sd_offload_to_cpu(sd_ctx_t* sd_ctx, enum sd_component_t component); + +// Reload component from CPU to GPU (allocates GPU memory) +// Returns true on success, false if component doesn't exist or allocation failed +SD_API bool sd_reload_to_gpu(sd_ctx_t* sd_ctx, enum sd_component_t component); + +// Query whether component is currently on GPU +// Returns true if on GPU, false if on CPU or component doesn't exist +SD_API bool sd_is_on_gpu(sd_ctx_t* sd_ctx, enum sd_component_t component); + +// Get component's current memory usage in bytes +// Returns the buffer size if component exists, 0 otherwise +SD_API size_t sd_get_component_vram(sd_ctx_t* sd_ctx, enum sd_component_t component); + +// Get human-readable name for a component +SD_API const char* sd_component_name(enum sd_component_t component); + +// Free all GPU resources (offload all components to CPU and clear LoRAs) +// Call this before unloading a model to ensure GPU memory is released +SD_API void sd_free_gpu_resources(sd_ctx_t* sd_ctx); + #ifdef __cplusplus } #endif diff --git a/src/anima.hpp b/src/anima.hpp index 4bfc04749..7da40fbf8 100644 --- a/src/anima.hpp +++ b/src/anima.hpp @@ -8,6 +8,7 @@ #include "common_block.hpp" #include "flux.hpp" +#include "layer_streaming.hpp" #include "rope.hpp" namespace Anima { @@ -516,6 +517,87 @@ namespace Anima { return x; } + + struct StreamingInputResult { + ggml_tensor* x; // [N, h*w, hidden_size] + ggml_tensor* encoder_hidden_states; // [N, 512, hidden_size] + ggml_tensor* embedded_timestep; // [N, hidden_size] + ggml_tensor* temb; // [N, hidden_size * 3] + }; + + StreamingInputResult forward_input_stage(GGMLRunnerContext* ctx, + struct ggml_tensor* x, + struct ggml_tensor* timestep, + struct ggml_tensor* encoder_hidden_states, + struct ggml_tensor* t5_ids, + struct ggml_tensor* t5_weights, + struct ggml_tensor* adapter_q_pe, + struct ggml_tensor* adapter_k_pe, + int64_t H, int64_t W) { + auto x_embedder = std::dynamic_pointer_cast(blocks["x_embedder"]); + auto t_embedder = std::dynamic_pointer_cast(blocks["t_embedder"]); + auto t_embedding_norm = std::dynamic_pointer_cast(blocks["t_embedding_norm"]); + auto llm_adapter = std::dynamic_pointer_cast(blocks["llm_adapter"]); + + // Add padding mask and patchify + auto padding_mask = ggml_ext_zeros(ctx->ggml_ctx, x->ne[0], x->ne[1], 1, x->ne[3]); + x = ggml_concat(ctx->ggml_ctx, x, padding_mask, 2); // [N, C + 1, H, W] + x = DiT::pad_and_patchify(ctx, x, patch_size, patch_size); // [N, h*w, (C+1)*ph*pw] + x = x_embedder->forward(ctx, x); + + // Timestep embedding + auto timestep_proj = ggml_ext_timestep_embedding(ctx->ggml_ctx, timestep, static_cast(hidden_size)); + auto temb = t_embedder->forward(ctx, timestep_proj); + auto embedded_timestep = t_embedding_norm->forward(ctx, timestep_proj); + + // LLM adapter (if T5 is used) + if (t5_ids != nullptr) { + auto adapted_context = llm_adapter->forward(ctx, encoder_hidden_states, t5_ids, adapter_q_pe, adapter_k_pe); + if (t5_weights != nullptr) { + auto w = t5_weights; + if (ggml_n_dims(w) == 1) { + w = ggml_reshape_3d(ctx->ggml_ctx, w, 1, w->ne[0], 1); + } + w = ggml_repeat_4d(ctx->ggml_ctx, w, adapted_context->ne[0], adapted_context->ne[1], adapted_context->ne[2], 1); + adapted_context = ggml_mul(ctx->ggml_ctx, adapted_context, w); + } + if (adapted_context->ne[1] < 512) { + auto pad_ctx = ggml_ext_zeros(ctx->ggml_ctx, + adapted_context->ne[0], + 
512 - adapted_context->ne[1], + adapted_context->ne[2], + 1); + adapted_context = ggml_concat(ctx->ggml_ctx, adapted_context, pad_ctx, 1); + } else if (adapted_context->ne[1] > 512) { + adapted_context = ggml_ext_slice(ctx->ggml_ctx, adapted_context, 1, 0, 512); + } + encoder_hidden_states = adapted_context; + } + + return {x, encoder_hidden_states, embedded_timestep, temb}; + } + + ggml_tensor* forward_block(GGMLRunnerContext* ctx, + int block_idx, + struct ggml_tensor* x, + struct ggml_tensor* encoder_hidden_states, + struct ggml_tensor* embedded_timestep, + struct ggml_tensor* temb, + struct ggml_tensor* image_pe) { + auto block = std::dynamic_pointer_cast(blocks["blocks." + std::to_string(block_idx)]); + return block->forward(ctx, x, encoder_hidden_states, embedded_timestep, temb, image_pe); + } + + ggml_tensor* forward_output_stage(GGMLRunnerContext* ctx, + struct ggml_tensor* x, + struct ggml_tensor* embedded_timestep, + struct ggml_tensor* temb) { + auto final_layer = std::dynamic_pointer_cast(blocks["final_layer"]); + return final_layer->forward(ctx, x, embedded_timestep, temb); // [N, h*w, ph*pw*C] + } + + int64_t get_num_layers() const { return num_layers; } + int get_patch_size() const { return patch_size; } }; struct AnimaRunner : public GGMLRunner { @@ -524,6 +606,13 @@ namespace Anima { std::vector adapter_q_pe_vec; std::vector adapter_k_pe_vec; AnimaNet net; + int64_t num_layers_ = 28; // Store for streaming + + // Static layer cache decided on the first sampling step. -1 = not yet + // computed; 0..N = number of "blocks.X" kept resident across steps. + int resident_blocks_ = -1; + + public: AnimaRunner(ggml_backend_t backend, bool offload_params_to_cpu, @@ -549,6 +638,7 @@ namespace Anima { if (num_layers <= 0) { num_layers = 28; } + num_layers_ = num_layers; // Store for streaming LOG_INFO("anima net layers: %" PRId64, num_layers); net = AnimaNet(num_layers); @@ -672,6 +762,79 @@ namespace Anima { return gf; } + // Raw tensor build_graph used by streaming infrastructure + ggml_cgraph* build_graph(ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* t5_ids = nullptr, + ggml_tensor* t5_weights = nullptr) { + GGML_ASSERT(x->ne[3] == 1); + ggml_cgraph* gf = new_graph_custom(ANIMA_GRAPH_SIZE); + + x = to_backend(x); + timesteps = to_backend(timesteps); + context = to_backend(context); + t5_ids = to_backend(t5_ids); + t5_weights = to_backend(t5_weights); + + int64_t pad_h = (net.patch_size - x->ne[1] % net.patch_size) % net.patch_size; + int64_t pad_w = (net.patch_size - x->ne[0] % net.patch_size) % net.patch_size; + int64_t h_pad = x->ne[1] + pad_h; + int64_t w_pad = x->ne[0] + pad_w; + + image_pe_vec = gen_anima_image_pe_vec(1, + static_cast(h_pad), + static_cast(w_pad), + static_cast(net.patch_size), + net.theta, + net.axes_dim, + 4.0f, 4.0f, 1.0f); + int64_t image_pos_len = static_cast(image_pe_vec.size()) / (2 * 2 * (net.head_dim / 2)); + auto image_pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, net.head_dim / 2, image_pos_len); + set_backend_tensor_data(image_pe, image_pe_vec.data()); + + ggml_tensor* adapter_q_pe = nullptr; + ggml_tensor* adapter_k_pe = nullptr; + if (t5_ids != nullptr) { + int64_t target_len = t5_ids->ne[0]; + int64_t source_len = context->ne[1]; + + adapter_q_pe_vec = gen_1d_rope_pe_vec(target_len, 64, 10000.f); + adapter_k_pe_vec = gen_1d_rope_pe_vec(source_len, 64, 10000.f); + + int64_t target_pos_len = static_cast(adapter_q_pe_vec.size()) / (2 * 2 * 32); + int64_t source_pos_len = 
static_cast(adapter_k_pe_vec.size()) / (2 * 2 * 32); + + adapter_q_pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, 32, target_pos_len); + adapter_k_pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, 32, source_pos_len); + set_backend_tensor_data(adapter_q_pe, adapter_q_pe_vec.data()); + set_backend_tensor_data(adapter_k_pe, adapter_k_pe_vec.data()); + } + + auto runner_ctx = get_context(); + auto out = net.forward(&runner_ctx, x, timesteps, context, image_pe, + t5_ids, t5_weights, adapter_q_pe, adapter_k_pe); + ggml_build_forward_expand(gf, out); + return gf; + } + + // Raw tensor compute used by streaming infrastructure + bool compute(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* t5_ids = nullptr, + ggml_tensor* t5_weights = nullptr, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr, + bool skip_param_offload = false) { + auto get_graph = [&]() -> ggml_cgraph* { + return build_graph(x, timesteps, context, t5_ids, t5_weights); + }; + return GGMLRunner::compute(get_graph, n_threads, false, output, output_ctx, skip_param_offload); + } + + // Upstream sd::Tensor compute interface sd::Tensor compute(int n_threads, const sd::Tensor& x, const sd::Tensor& timesteps, @@ -683,6 +846,366 @@ namespace Anima { }; return restore_trailing_singleton_dims(GGMLRunner::compute(get_graph, n_threads, false), x.dim()); } + + void enable_layer_streaming(const LayerStreaming::StreamingConfig& config = {}) { + std::map tensor_map; + net.get_param_tensors(tensor_map, "model.diffusion_model.net"); + init_streaming(config, tensor_map, LayerStreaming::anima_layer_pattern); + LOG_INFO("%s layer streaming enabled with %zu layers", + get_desc().c_str(), streaming_engine_->get_registry().get_layer_count()); + } + + bool compute_streaming(int n_threads, + struct ggml_tensor* x, + struct ggml_tensor* timesteps, + struct ggml_tensor* context, + struct ggml_tensor* t5_ids = nullptr, + struct ggml_tensor* t5_weights = nullptr, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) { + if (!is_streaming_enabled()) { + LOG_ERROR("%s streaming not enabled", get_desc().c_str()); + return false; + } + + int64_t t0 = ggml_time_ms(); + auto analysis = analyze_vram_budget(); + + if (analysis.fits_in_vram) { + LOG_INFO("%s model fits in VRAM, using coarse-stage streaming", get_desc().c_str()); + load_all_layers_coarse(); + bool result = compute(n_threads, x, timesteps, context, t5_ids, t5_weights, + output, output_ctx, true); + int64_t t1 = ggml_time_ms(); + LOG_INFO("%s coarse-stage streaming completed in %.2fs", get_desc().c_str(), (t1 - t0) / 1000.0); + free_compute_buffer(); + return result; + } + + LOG_INFO("%s remaining %.2f GB exceeds available %.2f GB, using per-layer streaming", + get_desc().c_str(), + analysis.remaining_to_load / (1024.0 * 1024.0 * 1024.0), + analysis.available_vram / (1024.0 * 1024.0 * 1024.0)); + + return compute_streaming_true(n_threads, x, timesteps, context, t5_ids, t5_weights, output, output_ctx); + } + + bool compute_streaming_true(int n_threads, + struct ggml_tensor* x, + struct ggml_tensor* timesteps, + struct ggml_tensor* context, + struct ggml_tensor* t5_ids = nullptr, + struct ggml_tensor* t5_weights = nullptr, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) { + auto& registry = streaming_engine_->get_registry(); + int64_t t_start = ggml_time_ms(); + + const int64_t num_blocks = net.get_num_layers(); + const int patch_size = 
net.get_patch_size(); + const int64_t W = x->ne[0]; + const int64_t H = x->ne[1]; + + LOG_INFO("TRUE per-layer streaming - %lld blocks", num_blocks); + + // Load global layers + LOG_DEBUG("Loading global layers"); + if (!registry.move_layer_to_gpu("_global")) { + LOG_ERROR("Failed to load _global to GPU"); + return false; + } + + // Prepare PE tensors + int64_t pad_h = (patch_size - H % patch_size) % patch_size; + int64_t pad_w = (patch_size - W % patch_size) % patch_size; + int64_t h_pad = H + pad_h; + int64_t w_pad = W + pad_w; + image_pe_vec = gen_anima_image_pe_vec(1, + static_cast(h_pad), + static_cast(w_pad), + patch_size, + net.theta, + net.axes_dim, + 4.0f, // h_extrapolation_ratio + 4.0f, // w_extrapolation_ratio + 1.0f); // t_extrapolation_ratio + + // Persistent storage. Backed by a single GPU-pinned host buffer + // (ensure_pinned_act_buffers) so per-block ggml_backend_tensor_get + // / set_backend_tensor_data run at full PCIe bandwidth. context + // is optional in some Anima variants. + std::vector persistent_x_fallback; + std::vector persistent_context_fallback; + std::vector persistent_embedded_ts_fallback; + std::vector persistent_temb_fallback; + float* persistent_x = nullptr; + float* persistent_context = nullptr; + float* persistent_embedded_ts = nullptr; + float* persistent_temb = nullptr; + size_t persistent_x_count = 0; + size_t persistent_context_count = 0; + size_t persistent_embedded_ts_count = 0; + size_t persistent_temb_count = 0; + int64_t x_ne[4], context_ne[4], embedded_ts_ne[4], temb_ne[4]; + + LOG_DEBUG("Executing input stage"); + { + ggml_tensor* x_output = nullptr; + ggml_tensor* context_output = nullptr; + ggml_tensor* embedded_ts_output = nullptr; + ggml_tensor* temb_output = nullptr; + + auto get_input_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(ANIMA_GRAPH_SIZE / 4); + auto runner_ctx = get_context(); + + ggml_tensor* x_backend = to_backend(x); + ggml_tensor* timesteps_backend = to_backend(timesteps); + ggml_tensor* context_backend = context ? to_backend(context) : nullptr; + ggml_tensor* t5_ids_backend = t5_ids ? to_backend(t5_ids) : nullptr; + ggml_tensor* t5_weights_backend = t5_weights ? 
to_backend(t5_weights) : nullptr; + + // Adapter PE (if needed) + ggml_tensor* adapter_q_pe_t = nullptr; + ggml_tensor* adapter_k_pe_t = nullptr; + if (t5_ids != nullptr && !adapter_q_pe_vec.empty()) { + adapter_q_pe_t = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, 64, 512); + adapter_k_pe_t = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, 64, 512); + set_backend_tensor_data(adapter_q_pe_t, adapter_q_pe_vec.data()); + set_backend_tensor_data(adapter_k_pe_t, adapter_k_pe_vec.data()); + } + + auto result = net.forward_input_stage(&runner_ctx, x_backend, timesteps_backend, + context_backend, t5_ids_backend, t5_weights_backend, + adapter_q_pe_t, adapter_k_pe_t, H, W); + + x_output = result.x; + context_output = result.encoder_hidden_states; + embedded_ts_output = result.embedded_timestep; + temb_output = result.temb; + + ggml_build_forward_expand(gf, x_output); + if (context_output) ggml_build_forward_expand(gf, context_output); + ggml_build_forward_expand(gf, embedded_ts_output); + ggml_build_forward_expand(gf, temb_output); + + return gf; + }; + + // Don't free compute buffer immediately - we need to read outputs first + if (!GGMLRunner::compute(get_input_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Input stage failed"); + return false; + } + + // Extract to persistent storage + if (x_output && embedded_ts_output && temb_output) { + size_t x_size = ggml_nelements(x_output); + size_t embedded_ts_size = ggml_nelements(embedded_ts_output); + size_t temb_size = ggml_nelements(temb_output); + size_t context_size = context_output ? ggml_nelements(context_output) : 0; + + persistent_x_count = x_size; + persistent_embedded_ts_count = embedded_ts_size; + persistent_temb_count = temb_size; + persistent_context_count = context_size; + + std::vector ptrs; + if (ensure_pinned_act_buffers({x_size * sizeof(float), + embedded_ts_size * sizeof(float), + temb_size * sizeof(float), + context_size * sizeof(float)}, ptrs)) { + persistent_x = ptrs[0]; + persistent_embedded_ts = ptrs[1]; + persistent_temb = ptrs[2]; + persistent_context = context_size ? ptrs[3] : nullptr; + } else { + persistent_x_fallback.resize(x_size); + persistent_embedded_ts_fallback.resize(embedded_ts_size); + persistent_temb_fallback.resize(temb_size); + persistent_x = persistent_x_fallback.data(); + persistent_embedded_ts = persistent_embedded_ts_fallback.data(); + persistent_temb = persistent_temb_fallback.data(); + if (context_size) { + persistent_context_fallback.resize(context_size); + persistent_context = persistent_context_fallback.data(); + } + } + + ggml_backend_tensor_get(x_output, persistent_x, 0, x_size * sizeof(float)); + ggml_backend_tensor_get(embedded_ts_output, persistent_embedded_ts, 0, embedded_ts_size * sizeof(float)); + ggml_backend_tensor_get(temb_output, persistent_temb, 0, temb_size * sizeof(float)); + + for (int i = 0; i < 4; i++) { + x_ne[i] = x_output->ne[i]; + embedded_ts_ne[i] = embedded_ts_output->ne[i]; + temb_ne[i] = temb_output->ne[i]; + } + + if (context_output) { + ggml_backend_tensor_get(context_output, persistent_context, 0, context_size * sizeof(float)); + for (int i = 0; i < 4; i++) { + context_ne[i] = context_output->ne[i]; + } + } + } else { + LOG_ERROR("Failed to get input stage outputs"); + free_compute_buffer(); + return false; + } + + // Now safe to free compute buffer + free_compute_buffer(); + } + + LOG_DEBUG("Input stage done, x=%ldx%ldx%ld", x_ne[0], x_ne[1], x_ne[2]); + + auto block_name_at = [](int i) { return "blocks." 
+ std::to_string(i); }; + + if (resident_blocks_ < 0 && streaming_engine_) { + resident_blocks_ = streaming_engine_->compute_resident_block_count( + "blocks.0", static_cast(num_blocks)); + LOG_INFO("%s blocks cache: %d resident, %d streamed per step", + get_desc().c_str(), + resident_blocks_, + static_cast(num_blocks) - resident_blocks_); + } + + int prefetch_start = 0; + while (prefetch_start < static_cast(num_blocks) && + registry.is_layer_on_gpu(block_name_at(prefetch_start))) { + prefetch_start++; + } + if (streaming_engine_) { + streaming_engine_->prime_prefetch(block_name_at, prefetch_start, static_cast(num_blocks)); + } + + for (int64_t block_idx = 0; block_idx < num_blocks; block_idx++) { + std::string block_name = block_name_at(static_cast(block_idx)); + int64_t t_block_start = ggml_time_ms(); + + // Wait for this block's prefetch to complete (if async prefetch was started) + if (streaming_engine_) { + streaming_engine_->wait_for_prefetch(block_name); + } + + // Load this block's weights (sync load if prefetch didn't happen) + if (!registry.move_layer_to_gpu(block_name)) { + LOG_ERROR("Failed to load %s", block_name.c_str()); + return false; + } + + // Keep the prefetch window full + if (streaming_engine_) { + streaming_engine_->advance_prefetch(block_name_at, static_cast(block_idx), + static_cast(num_blocks)); + } + + ggml_tensor* x_out = nullptr; + + auto get_block_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(ANIMA_GRAPH_SIZE / 4); + + // Create input tensors from persistent storage + ggml_tensor* x_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, x_ne[0], x_ne[1], x_ne[2], x_ne[3]); + ggml_tensor* embedded_ts_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, embedded_ts_ne[0], embedded_ts_ne[1], embedded_ts_ne[2], embedded_ts_ne[3]); + ggml_tensor* temb_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, temb_ne[0], temb_ne[1], temb_ne[2], temb_ne[3]); + + x_in = to_backend(x_in); + embedded_ts_in = to_backend(embedded_ts_in); + temb_in = to_backend(temb_in); + + set_backend_tensor_data(x_in, persistent_x); + set_backend_tensor_data(embedded_ts_in, persistent_embedded_ts); + set_backend_tensor_data(temb_in, persistent_temb); + + ggml_tensor* context_in = nullptr; + if (persistent_context_count > 0) { + context_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, context_ne[0], context_ne[1], context_ne[2], context_ne[3]); + context_in = to_backend(context_in); + set_backend_tensor_data(context_in, persistent_context); + } + + // Image PE tensor (shape matches [2, 2, head_dim/2, pos_len]) + int64_t image_pos_len = static_cast(image_pe_vec.size()) / (2 * 2 * (net.head_dim / 2)); + ggml_tensor* image_pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, net.head_dim / 2, image_pos_len); + set_backend_tensor_data(image_pe, image_pe_vec.data()); + + auto runner_ctx = get_context(); + x_out = net.forward_block(&runner_ctx, static_cast(block_idx), x_in, context_in, + embedded_ts_in, temb_in, image_pe); + + ggml_build_forward_expand(gf, x_out); + + return gf; + }; + + // Don't free compute buffer immediately - we need to read outputs first + if (!GGMLRunner::compute(get_block_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Block %lld execution failed", block_idx); + return false; + } + + // Extract output to persistent storage + if (x_out) { + ggml_backend_tensor_get(x_out, persistent_x, 0, persistent_x_count * sizeof(float)); + for (int i = 0; i < 4; i++) { + x_ne[i] = x_out->ne[i]; + } + } + + // Now safe to free compute 
buffer
+            free_compute_buffer();
+
+            // Resident blocks stay on GPU across sampling steps.
+            if (static_cast<int>(block_idx) >= resident_blocks_) {
+                registry.move_layer_to_cpu(block_name);
+            }
+
+            LOG_DEBUG("Block %lld/%lld done (%.2fms)",
+                      block_idx + 1, num_blocks, (ggml_time_ms() - t_block_start) / 1.0);
+        }
+
+        LOG_DEBUG("Executing output stage");
+        {
+            auto get_output_graph = [&]() -> struct ggml_cgraph* {
+                struct ggml_cgraph* gf = new_graph_custom(ANIMA_GRAPH_SIZE / 4);
+
+                ggml_tensor* x_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, x_ne[0], x_ne[1], x_ne[2], x_ne[3]);
+                ggml_tensor* embedded_ts_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, embedded_ts_ne[0], embedded_ts_ne[1], embedded_ts_ne[2], embedded_ts_ne[3]);
+                ggml_tensor* temb_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, temb_ne[0], temb_ne[1], temb_ne[2], temb_ne[3]);
+
+                x_in = to_backend(x_in);
+                embedded_ts_in = to_backend(embedded_ts_in);
+                temb_in = to_backend(temb_in);
+
+                set_backend_tensor_data(x_in, persistent_x);
+                set_backend_tensor_data(embedded_ts_in, persistent_embedded_ts);
+                set_backend_tensor_data(temb_in, persistent_temb);
+
+                auto runner_ctx = get_context();
+                auto final_out  = net.forward_output_stage(&runner_ctx, x_in, embedded_ts_in, temb_in);
+
+                // Unpatchify
+                final_out = DiT::unpatchify_and_crop(compute_ctx, final_out, H, W, patch_size, patch_size, false);
+
+                ggml_build_forward_expand(gf, final_out);
+
+                return gf;
+            };
+
+            if (!GGMLRunner::compute(get_output_graph, n_threads, true, output, output_ctx, true)) {
+                LOG_ERROR("Output stage failed");
+                return false;
+            }
+        }
+
+        int64_t t_end = ggml_time_ms();
+        LOG_INFO("TRUE per-layer streaming completed in %.2fs (%lld blocks)",
+                 (t_end - t_start) / 1000.0, num_blocks);
+
+        return true;
+    }
 };
 } // namespace Anima
diff --git a/src/chunk_graph.hpp b/src/chunk_graph.hpp
new file mode 100644
index 000000000..0ee676930
--- /dev/null
+++ b/src/chunk_graph.hpp
@@ -0,0 +1,232 @@
+#ifndef __CHUNK_GRAPH_HPP__
+#define __CHUNK_GRAPH_HPP__
+
+#include <array>
+#include <cstdint>
+#include <functional>
+#include <string>
+#include <vector>
+
+#include "ggml-alloc.h"
+#include "ggml-backend.h"
+#include "ggml.h"
+
+#include "util.h"
+
+namespace LayerStreaming {
+
+// Shared helper that compiles K consecutive transformer layers into a single
+// ggml graph and dispatches them as one ggml_backend_graph_compute call,
+// instead of one tiny graph per layer. Reusable across runners (z_image,
+// flux, mmdit, anima, qwen_image, ...).
+//
+// Cached state (ggml_context, gallocr, cgraph) survives across compute() calls
+// on the runner's main compute_ctx. Inputs are shape-bound, so the graph is
+// rebuilt whenever shape / layer count changes (e.g. between two queue jobs
+// with different prompt lengths).
+class ChunkGraph {
+public:
+    using BuildFn = std::function<ggml_tensor*(ggml_context* ctx,
+                                               const std::vector<ggml_tensor*>& inputs,
+                                               int K)>;
+
+    ChunkGraph() = default;
+    ~ChunkGraph() { clear(); }
+    ChunkGraph(const ChunkGraph&) = delete;
+    ChunkGraph& operator=(const ChunkGraph&) = delete;
+
+    // Build (or keep cached) a graph for K layers with the given input shapes.
+    // The cached graph is reused only if K, every input shape, AND the
+    // caller-supplied state_token match the last build; otherwise the old
+    // graph is freed and a fresh one is built.
+    //
+    // state_token: caller-computed fingerprint of any external state that the
+    // graph captures by reference and can become stale (e.g. weight_adapter
+    // pointer when LoRAs change, or runner flag bits like flash_attn). If two
+    // builds would topologically differ, give them different tokens.
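+    //
+    // A hypothetical fingerprint (names illustrative, not part of this patch):
+    //   uint64_t tok = (uint64_t)(uintptr_t)weight_adapter  // LoRA identity
+    //                ^ (flash_attn ? 1ull : 0ull);          // runner flag bits
+    // Any stable mixing of the stale-able state works.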
+    //
+    // build_fn receives the freshly created input tensors (one per entry of
+    // input_shapes, in the same order) and must wire them through K layers,
+    // returning the output tensor. The output is automatically marked as a
+    // graph output.
+    //
+    // Returns false on allocator / context failure; on success the graph is
+    // ready to dispatch.
+    bool ensure_built(ggml_backend_t backend,
+                      int K,
+                      const std::vector<std::array<int64_t, 4>>& input_shapes,
+                      ggml_type input_type,
+                      uint64_t state_token,
+                      BuildFn build_fn,
+                      size_t graph_node_capacity,
+                      const std::string& desc_tag) {
+        if (gf_ != nullptr
+            && layer_count_ == K
+            && state_token_ == state_token
+            && shapes_match(input_shapes)) {
+            return true;
+        }
+        clear();
+
+        // 16 MB headroom for op metadata is plenty for typical K (~30 layers).
+        size_t ctx_size = 16 * 1024 * 1024;
+        ctx_ = ggml_init({ctx_size, nullptr, true});
+        if (ctx_ == nullptr) {
+            LOG_ERROR("%s chunk_ctx alloc failed", desc_tag.c_str());
+            return false;
+        }
+
+        gf_ = ggml_new_graph_custom(ctx_, graph_node_capacity, false);
+
+        inputs_.clear();
+        inputs_.reserve(input_shapes.size());
+        for (const auto& shape : input_shapes) {
+            ggml_tensor* t = ggml_new_tensor_4d(ctx_, input_type,
+                                                shape[0], shape[1], shape[2], shape[3]);
+            ggml_set_input(t);
+            inputs_.push_back(t);
+        }
+
+        // Mirror GGMLRunner::prepare_build_in_tensor_before(): create the
+        // named build-in scalar tensors on the chunk context so anything in
+        // build_fn that uses ggml_ext_full / ggml_ext_zeros / ggml_ext_ones /
+        // ggml_ext_cast_f32 (all of which look these up by name via
+        // ggml_get_tensor) finds them. Without this they're null in our
+        // standalone context and the next op SEGVs — surfaces in attention's
+        // KV-pad mask creation when token sequences are short.
+        // ggml_set_input is required: without it the gallocr treats these as
+        // regular scratch nodes and may reuse their buffer slot for op
+        // intermediates, overwriting our uploaded scalar values before compute
+        // reads them. (GGMLRunner avoids this by registering them via
+        // set_backend_tensor_data, which keeps the data outside the allocator.)
+        one_tensor_ = ggml_new_tensor_1d(ctx_, GGML_TYPE_F32, 1);
+        ggml_set_name(one_tensor_, "ggml_runner_build_in_tensor:one");
+        ggml_set_input(one_tensor_);
+        zero_int_tensor_ = ggml_new_tensor_1d(ctx_, GGML_TYPE_I32, 1);
+        ggml_set_name(zero_int_tensor_, "ggml_runner_build_in_tensor:zero_int");
+        ggml_set_input(zero_int_tensor_);
+
+        out_ = build_fn(ctx_, inputs_, K);
+        if (out_ == nullptr) {
+            LOG_ERROR("%s chunk build_fn returned null", desc_tag.c_str());
+            clear();
+            return false;
+        }
+        ggml_set_output(out_);
+        ggml_build_forward_expand(gf_, one_tensor_);
+        ggml_build_forward_expand(gf_, zero_int_tensor_);
+        ggml_build_forward_expand(gf_, out_);
+
+        allocr_ = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
+        if (allocr_ == nullptr) {
+            LOG_ERROR("%s chunk gallocr_new failed", desc_tag.c_str());
+            clear();
+            return false;
+        }
+        if (!ggml_gallocr_reserve(allocr_, gf_)) {
+            LOG_ERROR("%s chunk gallocr_reserve failed", desc_tag.c_str());
+            clear();
+            return false;
+        }
+        size_t buf_size = ggml_gallocr_get_buffer_size(allocr_, 0);
+        LOG_INFO("%s chunk graph: %d layers, compute buffer = %.2f MB",
+                 desc_tag.c_str(), K, buf_size / (1024.0 * 1024.0));
+
+        layer_count_   = K;
+        cached_shapes_ = input_shapes;
+        state_token_   = state_token;
+        return true;
+    }
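+
+    // Illustrative caller sequence (buffer names are hypothetical; the two
+    // entry points below are the real API):
+    //   LayerStreaming::ChunkGraph cg;
+    //   cg.ensure_built(backend, /*K=*/8, {{c, n, 1, 1}}, GGML_TYPE_F32,
+    //                   state_tok, build_fn, graph_nodes, "flux");
+    //   cg.dispatch(backend, {x_host}, {x_nbytes}, out_host, out_nbytes);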
+
+    // Allocate/upload-inputs/compute/read-output for one step. host_data and
+    // host_nbytes must have one entry per input (matching the order passed to
+    // ensure_built). out_buf must be sized for at least ggml_nbytes(out_).
+    bool dispatch(ggml_backend_t backend,
+                  const std::vector<const void*>& host_data,
+                  const std::vector<size_t>& host_nbytes,
+                  void* out_buf,
+                  size_t out_nbytes) {
+        if (gf_ == nullptr) {
+            return false;
+        }
+        if (host_data.size() != inputs_.size() || host_nbytes.size() != inputs_.size()) {
+            LOG_ERROR("chunk dispatch: host_data/host_nbytes size mismatch");
+            return false;
+        }
+        if (!ggml_gallocr_alloc_graph(allocr_, gf_)) {
+            LOG_ERROR("chunk alloc_graph failed");
+            return false;
+        }
+        for (size_t i = 0; i < inputs_.size(); i++) {
+            ggml_backend_tensor_set(inputs_[i], host_data[i], 0, host_nbytes[i]);
+        }
+        // Upload the build-in scalars each dispatch (gallocr_alloc_graph may
+        // re-bind tensor data offsets within the compute buffer).
+        static constexpr float kOneVal       = 1.0f;
+        static constexpr int32_t kZeroIntVal = 0;
+        ggml_backend_tensor_set(one_tensor_, &kOneVal, 0, sizeof(kOneVal));
+        ggml_backend_tensor_set(zero_int_tensor_, &kZeroIntVal, 0, sizeof(kZeroIntVal));
+
+        ggml_status status = ggml_backend_graph_compute(backend, gf_);
+        if (status != GGML_STATUS_SUCCESS) {
+            LOG_ERROR("chunk compute failed: %s", ggml_status_to_string(status));
+            return false;
+        }
+        ggml_backend_tensor_get(out_, out_buf, 0, out_nbytes);
+        return true;
+    }
+
+    ggml_tensor* output() const { return out_; }
+    int layer_count() const { return layer_count_; }
+    bool is_built() const { return gf_ != nullptr; }
+
+    void clear() {
+        if (allocr_ != nullptr) {
+            ggml_gallocr_free(allocr_);
+            allocr_ = nullptr;
+        }
+        if (ctx_ != nullptr) {
+            ggml_free(ctx_);
+            ctx_ = nullptr;
+        }
+        gf_  = nullptr;
+        out_ = nullptr;
+        one_tensor_      = nullptr;
+        zero_int_tensor_ = nullptr;
+        inputs_.clear();
+        cached_shapes_.clear();
+        layer_count_ = 0;
+        state_token_ = 0;
+    }
+
+private:
+    bool shapes_match(const std::vector<std::array<int64_t, 4>>& shapes) const {
+        if (shapes.size() != cached_shapes_.size()) {
+            return false;
+        }
+        for (size_t i = 0; i < shapes.size(); i++) {
+            for (int j = 0; j < 4; j++) {
+                if (shapes[i][j] != cached_shapes_[i][j]) {
+                    return false;
+                }
+            }
+        }
+        return true;
+    }
+
+    ggml_context* ctx_     = nullptr;
+    ggml_gallocr_t allocr_ = nullptr;
+    ggml_cgraph* gf_       = nullptr;
+    std::vector<ggml_tensor*> inputs_;
+    ggml_tensor* out_             = nullptr;
+    ggml_tensor* one_tensor_      = nullptr;
+    ggml_tensor* zero_int_tensor_ = nullptr;
+    int layer_count_      = 0;
+    uint64_t state_token_ = 0;
+    std::vector<std::array<int64_t, 4>> cached_shapes_;
+};
+
+} // namespace LayerStreaming
+
+#endif
diff --git a/src/conditioner.hpp b/src/conditioner.hpp
index 4907938b0..ff73ab3b8 100644
--- a/src/conditioner.hpp
+++ b/src/conditioner.hpp
@@ -95,6 +95,13 @@ struct Conditioner {
     virtual std::string remove_trigger_from_prompt(const std::string& prompt) {
         GGML_ABORT("Not implemented yet!");
     }
+
+    // Dynamic tensor offloading interface
+    virtual bool is_params_on_gpu() const { return false; }
+    virtual bool move_params_to_cpu() { return false; }
+    virtual bool move_params_to_gpu() { return false; }
+    virtual size_t get_params_vram_size() const { return 0; }
+    virtual void set_auto_offload(bool enabled) {}
 };

 // ldm.modules.encoders.modules.FrozenCLIPEmbedder
@@ -187,6 +194,46 @@ struct FrozenCLIPEmbedderWithCustomWords : public Conditioner {
         }
     }

+    // Dynamic tensor offloading
+    bool is_params_on_gpu() const override {
+        bool on_gpu = text_model->is_params_on_gpu();
+        if (sd_version_is_sdxl(version) && text_model2) {
+            on_gpu = on_gpu &&
text_model2->is_params_on_gpu(); + } + return on_gpu; + } + + bool move_params_to_cpu() override { + bool success = text_model->move_params_to_cpu(); + if (sd_version_is_sdxl(version) && text_model2) { + success = text_model2->move_params_to_cpu() && success; + } + return success; + } + + bool move_params_to_gpu() override { + bool success = text_model->move_params_to_gpu(); + if (sd_version_is_sdxl(version) && text_model2) { + success = text_model2->move_params_to_gpu() && success; + } + return success; + } + + size_t get_params_vram_size() const override { + size_t size = text_model->get_params_vram_size(); + if (sd_version_is_sdxl(version) && text_model2) { + size += text_model2->get_params_vram_size(); + } + return size; + } + + void set_auto_offload(bool enabled) override { + text_model->set_auto_offload(enabled); + if (sd_version_is_sdxl(version) && text_model2) { + text_model2->set_auto_offload(enabled); + } + } + bool load_embedding(std::string embd_name, std::string embd_path, std::vector& bpe_tokens) { ModelLoader model_loader; if (!model_loader.init_from_file_and_convert_name(embd_path)) { @@ -825,6 +872,75 @@ struct SD3CLIPEmbedder : public Conditioner { } } + // Dynamic tensor offloading + bool is_params_on_gpu() const override { + bool on_gpu = true; + if (clip_l) { + on_gpu = on_gpu && clip_l->is_params_on_gpu(); + } + if (clip_g) { + on_gpu = on_gpu && clip_g->is_params_on_gpu(); + } + if (t5) { + on_gpu = on_gpu && t5->is_params_on_gpu(); + } + return on_gpu; + } + + bool move_params_to_cpu() override { + bool success = true; + if (clip_l) { + success = clip_l->move_params_to_cpu() && success; + } + if (clip_g) { + success = clip_g->move_params_to_cpu() && success; + } + if (t5) { + success = t5->move_params_to_cpu() && success; + } + return success; + } + + bool move_params_to_gpu() override { + bool success = true; + if (clip_l) { + success = clip_l->move_params_to_gpu() && success; + } + if (clip_g) { + success = clip_g->move_params_to_gpu() && success; + } + if (t5) { + success = t5->move_params_to_gpu() && success; + } + return success; + } + + size_t get_params_vram_size() const override { + size_t size = 0; + if (clip_l) { + size += clip_l->get_params_vram_size(); + } + if (clip_g) { + size += clip_g->get_params_vram_size(); + } + if (t5) { + size += t5->get_params_vram_size(); + } + return size; + } + + void set_auto_offload(bool enabled) override { + if (clip_l) { + clip_l->set_auto_offload(enabled); + } + if (clip_g) { + clip_g->set_auto_offload(enabled); + } + if (t5) { + t5->set_auto_offload(enabled); + } + } + std::vector, std::vector>> tokenize(std::string text, size_t min_length = 0, size_t max_length = 0, @@ -1171,6 +1287,60 @@ struct FluxCLIPEmbedder : public Conditioner { } } + // Dynamic tensor offloading + bool is_params_on_gpu() const override { + bool on_gpu = true; + if (clip_l) { + on_gpu = on_gpu && clip_l->is_params_on_gpu(); + } + if (t5) { + on_gpu = on_gpu && t5->is_params_on_gpu(); + } + return on_gpu; + } + + bool move_params_to_cpu() override { + bool success = true; + if (clip_l) { + success = clip_l->move_params_to_cpu() && success; + } + if (t5) { + success = t5->move_params_to_cpu() && success; + } + return success; + } + + bool move_params_to_gpu() override { + bool success = true; + if (clip_l) { + success = clip_l->move_params_to_gpu() && success; + } + if (t5) { + success = t5->move_params_to_gpu() && success; + } + return success; + } + + size_t get_params_vram_size() const override { + size_t size = 0; + if (clip_l) { + size += 
clip_l->get_params_vram_size(); + } + if (t5) { + size += t5->get_params_vram_size(); + } + return size; + } + + void set_auto_offload(bool enabled) override { + if (clip_l) { + clip_l->set_auto_offload(enabled); + } + if (t5) { + t5->set_auto_offload(enabled); + } + } + std::vector, std::vector>> tokenize(std::string text, size_t min_length = 0, size_t max_length = 0) { @@ -1525,6 +1695,29 @@ struct T5CLIPEmbedder : public Conditioner { conditioner_params.clip_skip, conditioner_params.zero_out_masked); } + + // Dynamic tensor offloading + bool is_params_on_gpu() const override { + return t5 ? t5->is_params_on_gpu() : false; + } + + bool move_params_to_cpu() override { + return t5 ? t5->move_params_to_cpu() : false; + } + + bool move_params_to_gpu() override { + return t5 ? t5->move_params_to_gpu() : false; + } + + size_t get_params_vram_size() const override { + return t5 ? t5->get_params_vram_size() : 0; + } + + void set_auto_offload(bool enabled) override { + if (t5) { + t5->set_auto_offload(enabled); + } + } }; struct AnimaConditioner : public Conditioner { @@ -1572,6 +1765,27 @@ struct AnimaConditioner : public Conditioner { llm->set_weight_adapter(adapter); } + // Dynamic tensor offloading - delegate to LLM + bool is_params_on_gpu() const override { + return llm->is_params_on_gpu(); + } + + bool move_params_to_cpu() override { + return llm->move_params_to_cpu(); + } + + bool move_params_to_gpu() override { + return llm->move_params_to_gpu(); + } + + size_t get_params_vram_size() const override { + return llm->get_params_vram_size(); + } + + void set_auto_offload(bool enabled) override { + llm->set_auto_offload(enabled); + } + std::tuple, std::vector, std::vector, std::vector> tokenize(std::string text) { auto parsed_attention = parse_prompt_attention(text); @@ -1999,6 +2213,29 @@ struct LLMEmbedder : public Conditioner { result.extra_c_crossattns = std::move(extra_hidden_states_vec); return result; } + + // Dynamic tensor offloading + bool is_params_on_gpu() const override { + return llm ? llm->is_params_on_gpu() : false; + } + + bool move_params_to_cpu() override { + return llm ? llm->move_params_to_cpu() : false; + } + + bool move_params_to_gpu() override { + return llm ? llm->move_params_to_gpu() : false; + } + + size_t get_params_vram_size() const override { + return llm ? llm->get_params_vram_size() : 0; + } + + void set_auto_offload(bool enabled) override { + if (llm) { + llm->set_auto_offload(enabled); + } + } }; #endif diff --git a/src/diffusion_model.hpp b/src/diffusion_model.hpp index 1a202a1a7..66ec562a8 100644 --- a/src/diffusion_model.hpp +++ b/src/diffusion_model.hpp @@ -37,6 +37,43 @@ static inline const sd::Tensor& tensor_or_empty(const sd::Tensor* tensor) return tensor != nullptr ? *tensor : kEmpty; } +// Helper to convert sd::Tensor pointers to temporary ggml_tensor* for streaming code paths. +// The returned ggml_tensors live in the provided ggml_context and point to the sd::Tensor's data. 
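+// Illustrative use (mirrors the compute_streaming overrides below; the
+// converter returns nullptr for null or empty inputs):
+//   StreamingParamConverter cvt;
+//   ggml_tensor* x = cvt.convert(diffusion_params.x);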
+struct StreamingParamConverter { + ggml_context* ctx = nullptr; + + StreamingParamConverter() { + struct ggml_init_params params = { + /*.mem_size =*/ 16 * ggml_tensor_overhead(), + /*.mem_buffer =*/ nullptr, + /*.no_alloc =*/ true, + }; + ctx = ggml_init(params); + } + + ~StreamingParamConverter() { + if (ctx) ggml_free(ctx); + } + + template + ggml_tensor* convert(const sd::Tensor* tensor) { + if (tensor == nullptr || tensor->numel() == 0) return nullptr; + ggml_tensor* t = sd::make_ggml_tensor(ctx, *tensor, false); + t->data = const_cast(static_cast(tensor->data())); + return t; + } + + std::vector convert_vec(const std::vector>* tensors) { + std::vector result; + if (tensors == nullptr) return result; + for (const auto& t : *tensors) { + sd::Tensor tmp_ref = t; // non-const copy for convert + result.push_back(convert(&tmp_ref)); + } + return result; + } +}; + struct DiffusionModel { virtual std::string get_desc() = 0; virtual sd::Tensor compute(int n_threads, @@ -51,6 +88,81 @@ struct DiffusionModel { virtual void set_flash_attention_enabled(bool enabled) = 0; virtual void set_max_graph_vram_bytes(size_t max_vram_bytes) = 0; virtual void set_circular_axes(bool circular_x, bool circular_y) = 0; + + // Dynamic tensor offloading interface + virtual bool is_params_on_gpu() const { return false; } + virtual bool move_params_to_cpu() { return false; } + virtual bool move_params_to_gpu() { return false; } + virtual size_t get_params_vram_size() const { return 0; } + + // Layer streaming interface (for granular tensor offloading) + virtual bool supports_layer_streaming() const { return false; } + virtual void enable_layer_streaming(int prefetch_layers = 1, size_t min_free_vram = 512 * 1024 * 1024) { + (void)prefetch_layers; + (void)min_free_vram; + } + virtual void disable_layer_streaming() {} + virtual bool is_layer_streaming_enabled() const { return false; } + virtual bool compute_streaming(int n_threads, + DiffusionParams diffusion_params, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) { + // Default: fall back to regular compute, copy result to output + auto result = compute(n_threads, diffusion_params); + if (output != nullptr && result.numel() > 0) { + if (*output == nullptr && output_ctx != nullptr) { + auto shape = result.shape(); + int n_dims = std::min(static_cast(shape.size()), GGML_MAX_DIMS); + std::array ne = {1, 1, 1, 1}; + for (int i = 0; i < n_dims; i++) ne[i] = shape[i]; + *output = ggml_new_tensor(output_ctx, GGML_TYPE_F32, n_dims, ne.data()); + } + if (*output != nullptr) { + memcpy((*output)->data, result.data(), result.numel() * sizeof(float)); + } + } + return result.numel() > 0; + } + // Offload all streaming layers to CPU (free GPU memory after diffusion) + virtual void offload_streaming_layers() {} + + // Bridge: dispatch to streaming or regular compute based on layer streaming state, + // returning sd::Tensor for compatibility with the upstream sample loop. + // + // Streaming output shape matches the input x shape (diffusion preserves shape). + // We pre-allocate the destination sd::Tensor and have the streaming runner write + // directly into its memory via a tiny no_alloc ggml_context — no per-step malloc. 
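+    //
+    // Call-site sketch (hypothetical sampler step, not code from this patch):
+    //   sd::Tensor denoised = diffusion_model->compute_dispatch(n_threads, params);
+    //   // same call whether layer streaming is enabled or not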
+ sd::Tensor compute_dispatch(int n_threads, const DiffusionParams& diffusion_params) { + if (!is_layer_streaming_enabled()) { + return compute(n_threads, diffusion_params); + } + if (diffusion_params.x == nullptr) { + LOG_ERROR("compute_dispatch: diffusion_params.x is null"); + return {}; + } + + // Pre-allocate result with x's shape; stream writes will land here directly. + sd::Tensor result(diffusion_params.x->shape()); + + // Tiny no_alloc context — only holds tensor metadata, no data backing. + ggml_init_params params = {2 * ggml_tensor_overhead(), nullptr, true}; + ggml_context* out_ctx = ggml_init(params); + if (out_ctx == nullptr) { + LOG_ERROR("compute_dispatch: ggml_init failed"); + return {}; + } + + // Make a metadata tensor with the same shape as result and point its data + // pointer at result's memory. The runner's ggml_ext_backend_tensor_get_and_sync + // will copy GPU→here directly. Skip ggml_dup_tensor by passing non-null *output. + ggml_tensor* out_tensor = sd::make_ggml_tensor(out_ctx, result, false); + out_tensor->data = result.data(); + + bool ok = compute_streaming(n_threads, diffusion_params, &out_tensor, out_ctx); + ggml_free(out_ctx); + if (!ok) return {}; + return result; + } }; struct UNetModel : public DiffusionModel { @@ -122,6 +234,53 @@ struct UNetModel : public DiffusionModel { diffusion_params.controls ? *diffusion_params.controls : empty_controls, diffusion_params.control_strength); } + + // Dynamic tensor offloading + bool is_params_on_gpu() const override { return unet.is_params_on_gpu(); } + bool move_params_to_cpu() override { return unet.move_params_to_cpu(); } + bool move_params_to_gpu() override { return unet.move_params_to_gpu(); } + size_t get_params_vram_size() const override { return unet.get_params_vram_size(); } + + // Layer streaming (coarse-stage for UNet due to skip connections) + bool supports_layer_streaming() const override { return true; } + + void enable_layer_streaming(int prefetch_layers, size_t min_free_vram) override { + LayerStreaming::StreamingConfig config; + config.prefetch_layers = prefetch_layers; + config.min_free_vram = min_free_vram; + unet.enable_layer_streaming(config); + } + + void disable_layer_streaming() override { + unet.disable_layer_streaming(); + } + + bool is_layer_streaming_enabled() const override { + return unet.is_streaming_enabled(); + } + + void offload_streaming_layers() override { + unet.offload_streaming_layers(); + } + + bool compute_streaming(int n_threads, + DiffusionParams diffusion_params, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) override { + StreamingParamConverter cvt; + auto controls_vec = cvt.convert_vec(diffusion_params.controls); + return unet.compute_streaming(n_threads, + cvt.convert(diffusion_params.x), + cvt.convert(diffusion_params.timesteps), + cvt.convert(diffusion_params.context), + cvt.convert(diffusion_params.c_concat), + cvt.convert(diffusion_params.y), + diffusion_params.num_video_frames, + controls_vec, + diffusion_params.control_strength, + output, + output_ctx); + } }; struct MMDiTModel : public DiffusionModel { @@ -189,6 +348,50 @@ struct MMDiTModel : public DiffusionModel { tensor_or_empty(diffusion_params.y), diffusion_params.skip_layers ? 
*diffusion_params.skip_layers : empty_skip_layers); } + + // Dynamic tensor offloading + bool is_params_on_gpu() const override { return mmdit.is_params_on_gpu(); } + bool move_params_to_cpu() override { return mmdit.move_params_to_cpu(); } + bool move_params_to_gpu() override { return mmdit.move_params_to_gpu(); } + size_t get_params_vram_size() const override { return mmdit.get_params_vram_size(); } + + // Layer streaming (granular tensor offloading) + bool supports_layer_streaming() const override { return true; } + + void enable_layer_streaming(int prefetch_layers, size_t min_free_vram) override { + LayerStreaming::StreamingConfig config; + config.prefetch_layers = prefetch_layers; + config.min_free_vram = min_free_vram; + mmdit.enable_layer_streaming(config); + } + + void disable_layer_streaming() override { + mmdit.disable_layer_streaming(); + } + + bool is_layer_streaming_enabled() const override { + return mmdit.is_streaming_enabled(); + } + + void offload_streaming_layers() override { + mmdit.offload_streaming_layers(); + } + + bool compute_streaming(int n_threads, + DiffusionParams diffusion_params, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) override { + StreamingParamConverter cvt; + auto skip = diffusion_params.skip_layers ? *diffusion_params.skip_layers : std::vector(); + return mmdit.compute_streaming(n_threads, + cvt.convert(diffusion_params.x), + cvt.convert(diffusion_params.timesteps), + cvt.convert(diffusion_params.context), + cvt.convert(diffusion_params.y), + output, + output_ctx, + skip); + } }; struct FluxModel : public DiffusionModel { @@ -263,6 +466,55 @@ struct FluxModel : public DiffusionModel { diffusion_params.increase_ref_index, diffusion_params.skip_layers ? *diffusion_params.skip_layers : empty_skip_layers); } + + // Dynamic tensor offloading + bool is_params_on_gpu() const override { return flux.is_params_on_gpu(); } + bool move_params_to_cpu() override { return flux.move_params_to_cpu(); } + bool move_params_to_gpu() override { return flux.move_params_to_gpu(); } + size_t get_params_vram_size() const override { return flux.get_params_vram_size(); } + + // Layer streaming (granular tensor offloading) + bool supports_layer_streaming() const override { return true; } + + void enable_layer_streaming(int prefetch_layers, size_t min_free_vram) override { + LayerStreaming::StreamingConfig config; + config.prefetch_layers = prefetch_layers; + config.min_free_vram = min_free_vram; + flux.enable_layer_streaming(config); + } + + void disable_layer_streaming() override { + flux.disable_layer_streaming(); + } + + bool is_layer_streaming_enabled() const override { + return flux.is_streaming_enabled(); + } + + void offload_streaming_layers() override { + flux.offload_streaming_layers(); + } + + bool compute_streaming(int n_threads, + DiffusionParams diffusion_params, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) override { + StreamingParamConverter cvt; + auto ref_vec = cvt.convert_vec(diffusion_params.ref_latents); + auto skip = diffusion_params.skip_layers ? 
*diffusion_params.skip_layers : std::vector(); + return flux.compute_streaming(n_threads, + cvt.convert(diffusion_params.x), + cvt.convert(diffusion_params.timesteps), + cvt.convert(diffusion_params.context), + cvt.convert(diffusion_params.c_concat), + cvt.convert(diffusion_params.y), + cvt.convert(diffusion_params.guidance), + ref_vec, + diffusion_params.increase_ref_index, + output, + output_ctx, + skip); + } }; struct AnimaModel : public DiffusionModel { @@ -331,6 +583,42 @@ struct AnimaModel : public DiffusionModel { tensor_or_empty(diffusion_params.t5_ids), tensor_or_empty(diffusion_params.t5_weights)); } + + bool supports_layer_streaming() const override { return true; } + + void enable_layer_streaming(int prefetch_layers, size_t min_free_vram) override { + LayerStreaming::StreamingConfig config; + config.prefetch_layers = prefetch_layers; + config.min_free_vram = min_free_vram; + anima.enable_layer_streaming(config); + } + + void disable_layer_streaming() override { + anima.disable_layer_streaming(); + } + + bool is_layer_streaming_enabled() const override { + return anima.is_streaming_enabled(); + } + + void offload_streaming_layers() override { + anima.offload_streaming_layers(); + } + + bool compute_streaming(int n_threads, + DiffusionParams diffusion_params, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) override { + StreamingParamConverter cvt; + return anima.compute_streaming(n_threads, + cvt.convert(diffusion_params.x), + cvt.convert(diffusion_params.timesteps), + cvt.convert(diffusion_params.context), + cvt.convert(diffusion_params.t5_ids), + cvt.convert(diffusion_params.t5_weights), + output, + output_ctx); + } }; struct WanModel : public DiffusionModel { @@ -403,6 +691,52 @@ struct WanModel : public DiffusionModel { tensor_or_empty(diffusion_params.vace_context), diffusion_params.vace_strength); } + + // Dynamic tensor offloading + bool is_params_on_gpu() const override { return wan.is_params_on_gpu(); } + bool move_params_to_cpu() override { return wan.move_params_to_cpu(); } + bool move_params_to_gpu() override { return wan.move_params_to_gpu(); } + size_t get_params_vram_size() const override { return wan.get_params_vram_size(); } + + // Layer streaming (granular tensor offloading) + bool supports_layer_streaming() const override { return true; } + + void enable_layer_streaming(int prefetch_layers, size_t min_free_vram) override { + LayerStreaming::StreamingConfig config; + config.prefetch_layers = prefetch_layers; + config.min_free_vram = min_free_vram; + wan.enable_layer_streaming(config); + } + + void disable_layer_streaming() override { + wan.disable_layer_streaming(); + } + + bool is_layer_streaming_enabled() const override { + return wan.is_streaming_enabled(); + } + + void offload_streaming_layers() override { + wan.offload_streaming_layers(); + } + + bool compute_streaming(int n_threads, + DiffusionParams diffusion_params, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) override { + StreamingParamConverter cvt; + return wan.compute_streaming(n_threads, + cvt.convert(diffusion_params.x), + cvt.convert(diffusion_params.timesteps), + cvt.convert(diffusion_params.context), + cvt.convert(diffusion_params.y), + cvt.convert(diffusion_params.c_concat), + nullptr, + cvt.convert(diffusion_params.vace_context), + diffusion_params.vace_strength, + output, + output_ctx); + } }; struct QwenImageModel : public DiffusionModel { @@ -474,6 +808,50 @@ struct QwenImageModel : public DiffusionModel { 
diffusion_params.ref_latents ? *diffusion_params.ref_latents : empty_ref_latents, true); } + + // Dynamic tensor offloading + bool is_params_on_gpu() const override { return qwen_image.is_params_on_gpu(); } + bool move_params_to_cpu() override { return qwen_image.move_params_to_cpu(); } + bool move_params_to_gpu() override { return qwen_image.move_params_to_gpu(); } + size_t get_params_vram_size() const override { return qwen_image.get_params_vram_size(); } + + // Layer streaming (granular tensor offloading) + bool supports_layer_streaming() const override { return true; } + + void enable_layer_streaming(int prefetch_layers, size_t min_free_vram) override { + LayerStreaming::StreamingConfig config; + config.prefetch_layers = prefetch_layers; + config.min_free_vram = min_free_vram; + qwen_image.enable_layer_streaming(config); + } + + void disable_layer_streaming() override { + qwen_image.disable_layer_streaming(); + } + + bool is_layer_streaming_enabled() const override { + return qwen_image.is_streaming_enabled(); + } + + bool compute_streaming(int n_threads, + DiffusionParams diffusion_params, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) override { + StreamingParamConverter cvt; + auto ref_vec = cvt.convert_vec(diffusion_params.ref_latents); + return qwen_image.compute_streaming(n_threads, + cvt.convert(diffusion_params.x), + cvt.convert(diffusion_params.timesteps), + cvt.convert(diffusion_params.context), + ref_vec, + true, // increase_ref_index + output, + output_ctx); + } + + void offload_streaming_layers() override { + qwen_image.offload_streaming_layers(); + } }; struct ZImageModel : public DiffusionModel { @@ -544,6 +922,50 @@ struct ZImageModel : public DiffusionModel { diffusion_params.ref_latents ? 
*diffusion_params.ref_latents : empty_ref_latents, true); } + + // Dynamic tensor offloading + bool is_params_on_gpu() const override { return z_image.is_params_on_gpu(); } + bool move_params_to_cpu() override { return z_image.move_params_to_cpu(); } + bool move_params_to_gpu() override { return z_image.move_params_to_gpu(); } + size_t get_params_vram_size() const override { return z_image.get_params_vram_size(); } + + // Layer streaming (granular tensor offloading) + bool supports_layer_streaming() const override { return true; } + + void enable_layer_streaming(int prefetch_layers, size_t min_free_vram) override { + LayerStreaming::StreamingConfig config; + config.prefetch_layers = prefetch_layers; + config.min_free_vram = min_free_vram; + z_image.enable_layer_streaming(config); + } + + void disable_layer_streaming() override { + z_image.disable_layer_streaming(); + } + + bool is_layer_streaming_enabled() const override { + return z_image.is_streaming_enabled(); + } + + void offload_streaming_layers() override { + z_image.offload_streaming_layers(); + } + + bool compute_streaming(int n_threads, + DiffusionParams diffusion_params, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) override { + StreamingParamConverter cvt; + auto ref_vec = cvt.convert_vec(diffusion_params.ref_latents); + return z_image.compute_streaming(n_threads, + cvt.convert(diffusion_params.x), + cvt.convert(diffusion_params.timesteps), + cvt.convert(diffusion_params.context), + ref_vec, + true, // increase_ref_index + output, + output_ctx); + } }; struct ErnieImageModel : public DiffusionModel { diff --git a/src/flux.hpp b/src/flux.hpp index 732a37197..22653d531 100644 --- a/src/flux.hpp +++ b/src/flux.hpp @@ -5,6 +5,7 @@ #include #include "common_dit.hpp" +#include "layer_streaming.hpp" #include "model.h" #include "rope.hpp" @@ -847,6 +848,142 @@ namespace Flux { } } + struct StreamingInputResult { + ggml_tensor* img; + ggml_tensor* txt; + ggml_tensor* vec; + ggml_tensor* txt_img_mask; + std::vector ds_img_mods; + std::vector ds_txt_mods; + std::vector ss_mods; + int64_t n_txt_tokens; + }; + + StreamingInputResult forward_input_stage(GGMLRunnerContext* ctx, + ggml_tensor* img, + ggml_tensor* txt, + ggml_tensor* timesteps, + ggml_tensor* y, + ggml_tensor* guidance, + ggml_tensor* mod_index_arange = nullptr) { + auto img_in = std::dynamic_pointer_cast(blocks["img_in"]); + auto txt_in = std::dynamic_pointer_cast(blocks["txt_in"]); + + int64_t n_txt_tokens = txt->ne[1]; + + if (img_in) { + img = img_in->forward(ctx, img); + } + + ggml_tensor* vec; + ggml_tensor* txt_img_mask = nullptr; + if (params.is_chroma) { + int64_t mod_index_length = 344; + auto approx = std::dynamic_pointer_cast(blocks["distilled_guidance_layer"]); + auto distill_timestep = ggml_ext_timestep_embedding(ctx->ggml_ctx, timesteps, 16, 10000, 1000.f); + auto distill_guidance = ggml_ext_timestep_embedding(ctx->ggml_ctx, guidance, 16, 10000, 1000.f); + + GGML_ASSERT(mod_index_arange != nullptr); + auto modulation_index = ggml_ext_timestep_embedding(ctx->ggml_ctx, mod_index_arange, 32, 10000, 1000.f); + modulation_index = ggml_repeat(ctx->ggml_ctx, modulation_index, ggml_new_tensor_3d(ctx->ggml_ctx, GGML_TYPE_F32, modulation_index->ne[0], modulation_index->ne[1], img->ne[2])); + + auto timestep_guidance = ggml_concat(ctx->ggml_ctx, distill_timestep, distill_guidance, 0); + timestep_guidance = ggml_repeat(ctx->ggml_ctx, timestep_guidance, modulation_index); + + vec = ggml_concat(ctx->ggml_ctx, timestep_guidance, 
modulation_index, 0); + vec = ggml_cont(ctx->ggml_ctx, ggml_permute(ctx->ggml_ctx, vec, 0, 2, 1, 3)); + vec = approx->forward(ctx, vec); + + if (y != nullptr) { + txt_img_mask = ggml_pad(ctx->ggml_ctx, y, static_cast(img->ne[1]), 0, 0, 0); + } + } else { + auto time_in = std::dynamic_pointer_cast(blocks["time_in"]); + vec = time_in->forward(ctx, ggml_ext_timestep_embedding(ctx->ggml_ctx, timesteps, 256, 10000, 1000.f)); + if (params.guidance_embed) { + GGML_ASSERT(guidance != nullptr); + auto guidance_in = std::dynamic_pointer_cast(blocks["guidance_in"]); + auto g_in = ggml_ext_timestep_embedding(ctx->ggml_ctx, guidance, 256, 10000, 1000.f); + vec = ggml_add(ctx->ggml_ctx, vec, guidance_in->forward(ctx, g_in)); + } + if (params.vec_in_dim > 0) { + auto vector_in = std::dynamic_pointer_cast(blocks["vector_in"]); + vec = ggml_add(ctx->ggml_ctx, vec, vector_in->forward(ctx, y)); + } + } + + std::vector ds_img_mods; + std::vector ds_txt_mods; + std::vector ss_mods; + if (params.share_modulation) { + auto double_stream_modulation_img = std::dynamic_pointer_cast(blocks["double_stream_modulation_img"]); + auto double_stream_modulation_txt = std::dynamic_pointer_cast(blocks["double_stream_modulation_txt"]); + auto single_stream_modulation = std::dynamic_pointer_cast(blocks["single_stream_modulation"]); + + ds_img_mods = double_stream_modulation_img->forward(ctx, vec); + ds_txt_mods = double_stream_modulation_txt->forward(ctx, vec); + ss_mods = single_stream_modulation->forward(ctx, vec); + } + + if (params.semantic_txt_norm) { + auto semantic_txt_norm = std::dynamic_pointer_cast(blocks["txt_norm"]); + txt = semantic_txt_norm->forward(ctx, txt); + } + + txt = txt_in->forward(ctx, txt); + + return {img, txt, vec, txt_img_mask, ds_img_mods, ds_txt_mods, ss_mods, n_txt_tokens}; + } + + std::pair forward_double_block(GGMLRunnerContext* ctx, + int block_idx, + ggml_tensor* img, + ggml_tensor* txt, + ggml_tensor* vec, + ggml_tensor* pe, + ggml_tensor* txt_img_mask, + std::vector& ds_img_mods, + std::vector& ds_txt_mods) { + auto block = std::dynamic_pointer_cast(blocks["double_blocks." + std::to_string(block_idx)]); + auto img_txt = block->forward(ctx, img, txt, vec, pe, txt_img_mask, ds_img_mods, ds_txt_mods); + return img_txt; + } + + ggml_tensor* forward_single_block(GGMLRunnerContext* ctx, + int block_idx, + ggml_tensor* txt_img, + ggml_tensor* vec, + ggml_tensor* pe, + ggml_tensor* txt_img_mask, + std::vector& ss_mods) { + auto block = std::dynamic_pointer_cast(blocks["single_blocks." 
+ std::to_string(block_idx)]); + return block->forward(ctx, txt_img, vec, pe, txt_img_mask, ss_mods); + } + + ggml_tensor* forward_output_stage(GGMLRunnerContext* ctx, + ggml_tensor* txt_img, + ggml_tensor* vec, + int64_t n_img_tokens, + int64_t n_txt_tokens) { + auto final_layer = std::dynamic_pointer_cast(blocks["final_layer"]); + + // Extract img from txt_img + auto img = ggml_view_3d(ctx->ggml_ctx, + txt_img, + txt_img->ne[0], + n_img_tokens, + txt_img->ne[2], + txt_img->nb[1], + txt_img->nb[2], + n_txt_tokens * txt_img->nb[1]); + + if (final_layer) { + img = final_layer->forward(ctx, img, vec); + } + + return img; + } + ggml_tensor* forward_orig(GGMLRunnerContext* ctx, ggml_tensor* img, ggml_tensor* txt, @@ -1175,6 +1312,190 @@ namespace Flux { skip_layers); } } + + struct StreamingContext { + // Intermediate tensors (persist across blocks) + ggml_tensor* img = nullptr; // Image features + ggml_tensor* txt = nullptr; // Text features + ggml_tensor* vec = nullptr; // Time/guidance embedding + ggml_tensor* pe = nullptr; // Positional encoding + ggml_tensor* txt_img_mask = nullptr; // Mask for attention + + // Precomputed modulations (computed once, used by all blocks) + std::vector ds_img_mods; + std::vector ds_txt_mods; + std::vector ss_mods; + + // State tracking + int current_double_block = 0; + int current_single_block = 0; + bool preprocessing_done = false; + bool double_blocks_done = false; + bool single_blocks_done = false; + + // Concatenated tensor for single blocks + ggml_tensor* txt_img = nullptr; + + void reset() { + img = txt = vec = pe = txt_img_mask = txt_img = nullptr; + ds_img_mods.clear(); + ds_txt_mods.clear(); + ss_mods.clear(); + current_double_block = 0; + current_single_block = 0; + preprocessing_done = false; + double_blocks_done = false; + single_blocks_done = false; + } + }; + + void forward_preprocessing(GGMLRunnerContext* ctx, + StreamingContext& stream_ctx, + ggml_tensor* img, + ggml_tensor* txt, + ggml_tensor* timesteps, + ggml_tensor* y, + ggml_tensor* guidance, + ggml_tensor* pe, + ggml_tensor* mod_index_arange = nullptr) { + auto img_in = std::dynamic_pointer_cast(blocks["img_in"]); + auto txt_in = std::dynamic_pointer_cast(blocks["txt_in"]); + + // Image input projection + if (img_in) { + stream_ctx.img = img_in->forward(ctx, img); + } else { + stream_ctx.img = img; + } + + // Compute vec (time/guidance embedding) + if (params.is_chroma) { + int64_t mod_index_length = 344; + auto approx = std::dynamic_pointer_cast(blocks["distilled_guidance_layer"]); + auto distill_timestep = ggml_ext_timestep_embedding(ctx->ggml_ctx, timesteps, 16, 10000, 1000.f); + auto distill_guidance = ggml_ext_timestep_embedding(ctx->ggml_ctx, guidance, 16, 10000, 1000.f); + + GGML_ASSERT(mod_index_arange != nullptr); + auto modulation_index = ggml_ext_timestep_embedding(ctx->ggml_ctx, mod_index_arange, 32, 10000, 1000.f); + modulation_index = ggml_repeat(ctx->ggml_ctx, modulation_index, + ggml_new_tensor_3d(ctx->ggml_ctx, GGML_TYPE_F32, modulation_index->ne[0], modulation_index->ne[1], img->ne[2])); + + auto timestep_guidance = ggml_concat(ctx->ggml_ctx, distill_timestep, distill_guidance, 0); + timestep_guidance = ggml_repeat(ctx->ggml_ctx, timestep_guidance, modulation_index); + + stream_ctx.vec = ggml_concat(ctx->ggml_ctx, timestep_guidance, modulation_index, 0); + stream_ctx.vec = ggml_cont(ctx->ggml_ctx, ggml_permute(ctx->ggml_ctx, stream_ctx.vec, 0, 2, 1, 3)); + stream_ctx.vec = approx->forward(ctx, stream_ctx.vec); + + if (y != nullptr) { + stream_ctx.txt_img_mask = 
ggml_pad(ctx->ggml_ctx, y, static_cast(img->ne[1]), 0, 0, 0); + } + } else { + auto time_in = std::dynamic_pointer_cast(blocks["time_in"]); + stream_ctx.vec = time_in->forward(ctx, ggml_ext_timestep_embedding(ctx->ggml_ctx, timesteps, 256, 10000, 1000.f)); + + if (params.guidance_embed) { + GGML_ASSERT(guidance != nullptr); + auto guidance_in = std::dynamic_pointer_cast(blocks["guidance_in"]); + auto g_in = ggml_ext_timestep_embedding(ctx->ggml_ctx, guidance, 256, 10000, 1000.f); + stream_ctx.vec = ggml_add(ctx->ggml_ctx, stream_ctx.vec, guidance_in->forward(ctx, g_in)); + } + + if (params.vec_in_dim > 0) { + auto vector_in = std::dynamic_pointer_cast(blocks["vector_in"]); + stream_ctx.vec = ggml_add(ctx->ggml_ctx, stream_ctx.vec, vector_in->forward(ctx, y)); + } + } + + // Precompute modulations (used by all blocks) + if (params.share_modulation) { + auto double_stream_modulation_img = std::dynamic_pointer_cast(blocks["double_stream_modulation_img"]); + auto double_stream_modulation_txt = std::dynamic_pointer_cast(blocks["double_stream_modulation_txt"]); + auto single_stream_modulation = std::dynamic_pointer_cast(blocks["single_stream_modulation"]); + + stream_ctx.ds_img_mods = double_stream_modulation_img->forward(ctx, stream_ctx.vec); + stream_ctx.ds_txt_mods = double_stream_modulation_txt->forward(ctx, stream_ctx.vec); + stream_ctx.ss_mods = single_stream_modulation->forward(ctx, stream_ctx.vec); + } + + // Text normalization and projection + if (params.semantic_txt_norm) { + auto semantic_txt_norm = std::dynamic_pointer_cast(blocks["txt_norm"]); + txt = semantic_txt_norm->forward(ctx, txt); + } + stream_ctx.txt = txt_in->forward(ctx, txt); + + // Store PE + stream_ctx.pe = pe; + + stream_ctx.preprocessing_done = true; + stream_ctx.current_double_block = 0; + stream_ctx.current_single_block = 0; + } + + bool forward_double_block(GGMLRunnerContext* ctx, + StreamingContext& stream_ctx, + int block_idx) { + GGML_ASSERT(stream_ctx.preprocessing_done); + GGML_ASSERT(block_idx < params.depth); + + auto block = std::dynamic_pointer_cast(blocks["double_blocks." + std::to_string(block_idx)]); + auto img_txt = block->forward(ctx, stream_ctx.img, stream_ctx.txt, stream_ctx.vec, + stream_ctx.pe, stream_ctx.txt_img_mask, + stream_ctx.ds_img_mods, stream_ctx.ds_txt_mods); + stream_ctx.img = img_txt.first; + stream_ctx.txt = img_txt.second; + + stream_ctx.current_double_block = block_idx + 1; + if (stream_ctx.current_double_block >= params.depth) { + stream_ctx.double_blocks_done = true; + // Prepare for single blocks by concatenating txt and img + stream_ctx.txt_img = ggml_concat(ctx->ggml_ctx, stream_ctx.txt, stream_ctx.img, 1); + return true; + } + return false; + } + + bool forward_single_block(GGMLRunnerContext* ctx, + StreamingContext& stream_ctx, + int block_idx) { + GGML_ASSERT(stream_ctx.double_blocks_done); + GGML_ASSERT(block_idx < params.depth_single_blocks); + + auto block = std::dynamic_pointer_cast(blocks["single_blocks." 
+ std::to_string(block_idx)]); + stream_ctx.txt_img = block->forward(ctx, stream_ctx.txt_img, stream_ctx.vec, + stream_ctx.pe, stream_ctx.txt_img_mask, stream_ctx.ss_mods); + + stream_ctx.current_single_block = block_idx + 1; + if (stream_ctx.current_single_block >= params.depth_single_blocks) { + stream_ctx.single_blocks_done = true; + return true; + } + return false; + } + + ggml_tensor* forward_postprocessing(GGMLRunnerContext* ctx, + StreamingContext& stream_ctx) { + GGML_ASSERT(stream_ctx.single_blocks_done); + + auto final_layer = std::dynamic_pointer_cast(blocks["final_layer"]); + + // Extract img from txt_img + auto img = ggml_view_3d(ctx->ggml_ctx, + stream_ctx.txt_img, + stream_ctx.txt_img->ne[0], + stream_ctx.img->ne[1], + stream_ctx.txt_img->ne[2], + stream_ctx.txt_img->nb[1], + stream_ctx.txt_img->nb[2], + stream_ctx.txt->ne[1] * stream_ctx.txt_img->nb[1]); + + if (final_layer) { + img = final_layer->forward(ctx, img, stream_ctx.vec); + } + + return img; + } }; struct FluxRunner : public GGMLRunner { @@ -1188,6 +1509,12 @@ namespace Flux { SDVersion version; bool use_mask = false; + // Static layer cache decided on the first sampling step. -1 = not yet + // computed; 0..N = number of "double_blocks.X" / "single_blocks.X" + // blocks kept resident on GPU across sampling steps. + int resident_double_blocks_ = -1; + int resident_single_blocks_ = -1; + FluxRunner(ggml_backend_t backend, bool offload_params_to_cpu, const String2TensorStorage& tensor_storage_map = {}, @@ -1465,6 +1792,101 @@ namespace Flux { return gf; } + // Raw tensor build_graph used by streaming infrastructure + ggml_cgraph* build_graph(ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* c_concat, + ggml_tensor* y, + ggml_tensor* guidance, + std::vector ref_latents = {}, + bool increase_ref_index = false, + std::vector skip_layers = {}) { + GGML_ASSERT(x->ne[3] == 1); + ggml_cgraph* gf = new_graph_custom(FLUX_GRAPH_SIZE); + + x = to_backend(x); + timesteps = to_backend(timesteps); + context = to_backend(context); + c_concat = to_backend(c_concat); + y = to_backend(y); + guidance = to_backend(guidance); + for (auto& ref : ref_latents) { + ref = to_backend(ref); + } + + ggml_tensor* mod_index_arange = nullptr; + ggml_tensor* dct = nullptr; + + if (flux_params.is_chroma) { + if (!use_mask) { + y = nullptr; + } + mod_index_arange_vec = arange(0, 344); + mod_index_arange = ggml_new_tensor_1d(compute_ctx, GGML_TYPE_F32, mod_index_arange_vec.size()); + set_backend_tensor_data(mod_index_arange, mod_index_arange_vec.data()); + } + std::set txt_arange_dims; + if (sd_version_is_flux2(version)) { + txt_arange_dims = {3}; + increase_ref_index = true; + } else if (version == VERSION_OVIS_IMAGE) { + txt_arange_dims = {1, 2}; + } + + pe_vec = Rope::gen_flux_pe(static_cast(x->ne[1]), + static_cast(x->ne[0]), + flux_params.patch_size, + static_cast(x->ne[3]), + static_cast(context->ne[1]), + txt_arange_dims, + ref_latents, + increase_ref_index, + flux_params.ref_index_scale, + flux_params.theta, + circular_y_enabled, + circular_x_enabled, + flux_params.axes_dim); + int pos_len = static_cast(pe_vec.size() / flux_params.axes_dim_sum / 2); + auto pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, flux_params.axes_dim_sum / 2, pos_len); + set_backend_tensor_data(pe, pe_vec.data()); + + if (version == VERSION_CHROMA_RADIANCE) { + int patch_size = flux_params.patch_size; + int nerf_max_freqs = flux_params.chroma_radiance_params.nerf_max_freqs; + dct_vec = fetch_dct_pos(patch_size, nerf_max_freqs); + 
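+            // DCT position table: ne[0] = nerf_max_freqs^2 frequency coefficients
+            // for each of the ne[1] = patch_size^2 pixel positions in a patch.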
dct = ggml_new_tensor_2d(compute_ctx, GGML_TYPE_F32, nerf_max_freqs * nerf_max_freqs, patch_size * patch_size); + set_backend_tensor_data(dct, dct_vec.data()); + } + + auto runner_ctx = get_context(); + ggml_tensor* out = flux.forward(&runner_ctx, x, timesteps, context, c_concat, y, + guidance, pe, mod_index_arange, dct, ref_latents, skip_layers); + ggml_build_forward_expand(gf, out); + return gf; + } + + // Raw tensor compute used by streaming infrastructure + bool compute(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* c_concat, + ggml_tensor* y, + ggml_tensor* guidance, + std::vector ref_latents = {}, + bool increase_ref_index = false, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr, + std::vector skip_layers = {}, + bool skip_param_offload = false) { + auto get_graph = [&]() -> ggml_cgraph* { + return build_graph(x, timesteps, context, c_concat, y, guidance, + ref_latents, increase_ref_index, skip_layers); + }; + return GGMLRunner::compute(get_graph, n_threads, false, output, output_ctx, skip_param_offload); + } + sd::Tensor compute(int n_threads, const sd::Tensor& x, const sd::Tensor& timesteps, @@ -1584,6 +2006,534 @@ namespace Flux { LOG_INFO("flux model loaded"); flux->test(); } + + void enable_layer_streaming(const LayerStreaming::StreamingConfig& config = {}) { + std::map tensor_map; + flux.get_param_tensors(tensor_map, "model.diffusion_model"); + init_streaming(config, tensor_map, LayerStreaming::flux_layer_pattern); + LOG_INFO("%s layer streaming enabled with %zu layers", + get_desc().c_str(), streaming_engine_->get_registry().get_layer_count()); + } + + bool compute_streaming(int n_threads, + struct ggml_tensor* x, + struct ggml_tensor* timesteps, + struct ggml_tensor* context, + struct ggml_tensor* c_concat, + struct ggml_tensor* y, + struct ggml_tensor* guidance, + std::vector ref_latents = {}, + bool increase_ref_index = false, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr, + std::vector skip_layers = std::vector()) { + if (!is_streaming_enabled()) { + LOG_ERROR("%s streaming not enabled", get_desc().c_str()); + return false; + } + + int64_t t0 = ggml_time_ms(); + auto analysis = analyze_vram_budget(); + + if (analysis.fits_in_vram) { + LOG_INFO("%s model fits in VRAM, using coarse-stage streaming", get_desc().c_str()); + load_all_layers_coarse(); + + bool result = compute(n_threads, x, timesteps, context, c_concat, y, guidance, + ref_latents, increase_ref_index, output, output_ctx, + skip_layers, true); + + int64_t t1 = ggml_time_ms(); + LOG_INFO("%s coarse-stage streaming completed in %.2fs", get_desc().c_str(), (t1 - t0) / 1000.0); + free_compute_buffer(); + return result; + } + + LOG_INFO("%s remaining %.2f GB exceeds available %.2f GB, using per-layer streaming", + get_desc().c_str(), + analysis.remaining_to_load / (1024.0 * 1024.0 * 1024.0), + analysis.available_vram / (1024.0 * 1024.0 * 1024.0)); + + return compute_streaming_true(n_threads, x, timesteps, context, c_concat, y, guidance, + ref_latents, increase_ref_index, output, output_ctx, skip_layers); + } + + bool compute_streaming_true(int n_threads, + struct ggml_tensor* x, + struct ggml_tensor* timesteps, + struct ggml_tensor* context, + struct ggml_tensor* c_concat, + struct ggml_tensor* y, + struct ggml_tensor* guidance, + std::vector ref_latents, + bool increase_ref_index, + struct ggml_tensor** output, + struct ggml_context* output_ctx, + std::vector skip_layers) { + auto& registry = 
streaming_engine_->get_registry(); + int64_t t_start = ggml_time_ms(); + + const int num_double_blocks = flux_params.depth; + const int num_single_blocks = flux_params.depth_single_blocks; + LOG_INFO("TRUE per-layer streaming - %d double + %d single blocks", + num_double_blocks, num_single_blocks); + + // Load global layers (_global contains input projections, final_layer, etc) + LOG_DEBUG("Loading global layers"); + if (!registry.move_layer_to_gpu("_global")) { + LOG_ERROR("Failed to load _global to GPU"); + return false; + } + LOG_DEBUG("_global loaded successfully"); + + // Set up txt_arange_dims based on version + std::set txt_arange_dims; + if (sd_version_is_flux2(version)) { + txt_arange_dims = {3}; + increase_ref_index = true; + } else if (version == VERSION_OVIS_IMAGE) { + txt_arange_dims = {1, 2}; + } + + // Pre-generate PE + pe_vec = Rope::gen_flux_pe(static_cast(x->ne[1]), + static_cast(x->ne[0]), + flux_params.patch_size, + static_cast(x->ne[3]), + static_cast(context->ne[1]), + txt_arange_dims, + ref_latents, + increase_ref_index, + flux_params.ref_index_scale, + flux_params.theta, + circular_y_enabled, + circular_x_enabled, + flux_params.axes_dim); + + LOG_DEBUG("PE generated"); + + // Pre-generate mod_index_arange for Chroma + if (flux_params.is_chroma) { + mod_index_arange_vec.clear(); + for (int i = 0; i < 344; i++) { + mod_index_arange_vec.push_back(static_cast(i)); + } + } + + LOG_DEBUG("About to execute input stage"); + + // Persistent storage for intermediate tensors. Backed by a single + // GPU-pinned host buffer (via ensure_pinned_act_buffers) so the + // per-block ggml_backend_tensor_get / set_backend_tensor_data + // calls run at full PCIe bandwidth. Falls back to pageable + // std::vector if pinned alloc fails. + std::vector persistent_img_fallback; + std::vector persistent_txt_fallback; + std::vector persistent_vec_fallback; + std::vector persistent_txt_img_fallback; + float* persistent_img = nullptr; + float* persistent_txt = nullptr; + float* persistent_vec = nullptr; + float* persistent_txt_img = nullptr; + size_t persistent_img_count = 0; + size_t persistent_txt_count = 0; + size_t persistent_vec_count = 0; + size_t persistent_txt_img_count = 0; + int64_t img_ne[4], txt_ne[4], vec_ne[4], txt_img_ne[4]; + int64_t n_txt_tokens = 0; + int64_t n_img_tokens = 0; + + LOG_DEBUG("Executing input stage"); + { + ggml_tensor* img_output = nullptr; + ggml_tensor* txt_output = nullptr; + ggml_tensor* vec_output = nullptr; + + auto get_input_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(FLUX_GRAPH_SIZE / 4); + auto runner_ctx = get_context(); + + ggml_tensor* x_patched = DiT::pad_and_patchify(&runner_ctx, to_backend(x), + flux_params.patch_size, flux_params.patch_size); + n_img_tokens = x_patched->ne[1]; + + // Handle ref_latents + for (auto& ref : ref_latents) { + auto ref_patched = DiT::pad_and_patchify(&runner_ctx, to_backend(ref), + flux_params.patch_size, flux_params.patch_size); + x_patched = ggml_concat(compute_ctx, x_patched, ref_patched, 1); + } + + ggml_tensor* context_backend = to_backend(context); + ggml_tensor* timesteps_backend = to_backend(timesteps); + ggml_tensor* y_backend = y ? to_backend(y) : nullptr; + ggml_tensor* guidance_backend = guidance ? 
to_backend(guidance) : nullptr; + + ggml_tensor* mod_index_arange = nullptr; + if (flux_params.is_chroma && !mod_index_arange_vec.empty()) { + mod_index_arange = ggml_new_tensor_1d(compute_ctx, GGML_TYPE_F32, mod_index_arange_vec.size()); + set_backend_tensor_data(mod_index_arange, mod_index_arange_vec.data()); + } + + auto result = flux.forward_input_stage(&runner_ctx, x_patched, context_backend, + timesteps_backend, y_backend, guidance_backend, + mod_index_arange); + + img_output = result.img; + txt_output = result.txt; + vec_output = result.vec; + n_txt_tokens = result.n_txt_tokens; + + ggml_build_forward_expand(gf, img_output); + ggml_build_forward_expand(gf, txt_output); + ggml_build_forward_expand(gf, vec_output); + + return gf; + }; + + // Don't free compute buffer immediately - we need to read outputs first + if (!GGMLRunner::compute(get_input_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Input stage failed"); + return false; + } + + // Extract to persistent storage + if (img_output && txt_output && vec_output) { + size_t img_size = ggml_nelements(img_output); + size_t txt_size = ggml_nelements(txt_output); + size_t vec_size = ggml_nelements(vec_output); + // txt_img region is sized to hold the concatenated + // (txt + img) activations consumed by single blocks. + size_t txt_img_size = txt_size + img_size; + + persistent_img_count = img_size; + persistent_txt_count = txt_size; + persistent_vec_count = vec_size; + persistent_txt_img_count = txt_img_size; + + std::vector ptrs; + if (ensure_pinned_act_buffers({img_size * sizeof(float), + txt_size * sizeof(float), + vec_size * sizeof(float), + txt_img_size * sizeof(float)}, ptrs)) { + persistent_img = ptrs[0]; + persistent_txt = ptrs[1]; + persistent_vec = ptrs[2]; + persistent_txt_img = ptrs[3]; + } else { + persistent_img_fallback.resize(img_size); + persistent_txt_fallback.resize(txt_size); + persistent_vec_fallback.resize(vec_size); + persistent_txt_img_fallback.resize(txt_img_size); + persistent_img = persistent_img_fallback.data(); + persistent_txt = persistent_txt_fallback.data(); + persistent_vec = persistent_vec_fallback.data(); + persistent_txt_img = persistent_txt_img_fallback.data(); + } + + ggml_backend_tensor_get(img_output, persistent_img, 0, img_size * sizeof(float)); + ggml_backend_tensor_get(txt_output, persistent_txt, 0, txt_size * sizeof(float)); + ggml_backend_tensor_get(vec_output, persistent_vec, 0, vec_size * sizeof(float)); + + for (int i = 0; i < 4; i++) { + img_ne[i] = img_output->ne[i]; + txt_ne[i] = txt_output->ne[i]; + vec_ne[i] = vec_output->ne[i]; + } + } else { + LOG_ERROR("Failed to get input stage outputs"); + free_compute_buffer(); + return false; + } + + // Now safe to free compute buffer + free_compute_buffer(); + } + + LOG_DEBUG("Input stage done, img=%ldx%ldx%ld, txt=%ldx%ldx%ld", + img_ne[0], img_ne[1], img_ne[2], txt_ne[0], txt_ne[1], txt_ne[2]); + + auto double_name_at = [](int i) { return "double_blocks." 
+ std::to_string(i); }; + + if (resident_double_blocks_ < 0 && streaming_engine_) { + resident_double_blocks_ = streaming_engine_->compute_resident_block_count( + "double_blocks.0", num_double_blocks); + LOG_INFO("%s double_blocks cache: %d resident, %d streamed per step", + get_desc().c_str(), + resident_double_blocks_, + num_double_blocks - resident_double_blocks_); + } + + int double_prefetch_start = 0; + while (double_prefetch_start < num_double_blocks && + registry.is_layer_on_gpu(double_name_at(double_prefetch_start))) { + double_prefetch_start++; + } + if (streaming_engine_) { + streaming_engine_->prime_prefetch(double_name_at, double_prefetch_start, num_double_blocks); + } + + for (int block_idx = 0; block_idx < num_double_blocks; block_idx++) { + // Check skip_layers + if (skip_layers.size() > 0 && std::find(skip_layers.begin(), skip_layers.end(), block_idx) != skip_layers.end()) { + LOG_DEBUG("Skipping double_block %d", block_idx); + continue; + } + + std::string block_name = double_name_at(block_idx); + int64_t t_block_start = ggml_time_ms(); + + // Wait for this block's prefetch to complete (if async prefetch was started) + if (streaming_engine_) { + streaming_engine_->wait_for_prefetch(block_name); + } + + // Load this block's weights (sync load if prefetch didn't happen) + if (!registry.move_layer_to_gpu(block_name)) { + LOG_ERROR("Failed to load %s", block_name.c_str()); + return false; + } + + // Keep the prefetch window full + if (streaming_engine_) { + streaming_engine_->advance_prefetch(double_name_at, block_idx, num_double_blocks); + } + + ggml_tensor* img_out = nullptr; + ggml_tensor* txt_out = nullptr; + + auto get_block_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(FLUX_GRAPH_SIZE / 4); + + // Create input tensors from persistent storage + ggml_tensor* img_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, img_ne[0], img_ne[1], img_ne[2], img_ne[3]); + ggml_tensor* txt_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, txt_ne[0], txt_ne[1], txt_ne[2], txt_ne[3]); + ggml_tensor* vec_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, vec_ne[0], vec_ne[1], vec_ne[2], vec_ne[3]); + + img_in = to_backend(img_in); + txt_in = to_backend(txt_in); + vec_in = to_backend(vec_in); + + set_backend_tensor_data(img_in, persistent_img); + set_backend_tensor_data(txt_in, persistent_txt); + set_backend_tensor_data(vec_in, persistent_vec); + + // PE tensor + int pos_len = static_cast(pe_vec.size() / flux_params.axes_dim_sum / 2); + ggml_tensor* pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, flux_params.axes_dim_sum / 2, pos_len); + set_backend_tensor_data(pe, pe_vec.data()); + + std::vector ds_img_mods, ds_txt_mods; + auto runner_ctx = get_context(); + auto result = flux.forward_double_block(&runner_ctx, block_idx, img_in, txt_in, vec_in, pe, + nullptr, ds_img_mods, ds_txt_mods); + + img_out = result.first; + txt_out = result.second; + + ggml_build_forward_expand(gf, img_out); + ggml_build_forward_expand(gf, txt_out); + + return gf; + }; + + // Don't free compute buffer immediately - we need to read outputs first + if (!GGMLRunner::compute(get_block_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Double block %d execution failed", block_idx); + return false; + } + + // Extract outputs to persistent storage + if (img_out && txt_out) { + ggml_backend_tensor_get(img_out, persistent_img, 0, persistent_img_count * sizeof(float)); + ggml_backend_tensor_get(txt_out, persistent_txt, 0, persistent_txt_count * sizeof(float)); + 
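+                // Record the output shapes: the next block's mini-graph rebuilds
+                // its input tensors from img_ne/txt_ne, so they must track any
+                // shape change across blocks.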
+ for (int i = 0; i < 4; i++) { + img_ne[i] = img_out->ne[i]; + txt_ne[i] = txt_out->ne[i]; + } + } + + // Now safe to free compute buffer + free_compute_buffer(); + + // Resident blocks stay on GPU across sampling steps. + if (block_idx >= resident_double_blocks_) { + registry.move_layer_to_cpu(block_name); + } + + LOG_DEBUG("Double block %d/%d done (%.2fms)", + block_idx + 1, num_double_blocks, (ggml_time_ms() - t_block_start) / 1.0); + } + + { + // Concatenate txt and img into txt_img + size_t txt_img_size = persistent_txt_count + persistent_img_count; + // persistent_txt_img was already sized in ensure_pinned_act_buffers + // (txt_img region == txt_count + img_count). Just concat into it. + + // txt goes first, then img (along dimension 1) + // Since we store flattened, we need to handle this carefully + // txt: [hidden_size, n_txt_tokens, N] + // img: [hidden_size, n_img_tokens, N] + // txt_img: [hidden_size, n_txt_tokens + n_img_tokens, N] + std::copy(persistent_txt, persistent_txt + persistent_txt_count, persistent_txt_img); + std::copy(persistent_img, persistent_img + persistent_img_count, persistent_txt_img + persistent_txt_count); + + txt_img_ne[0] = img_ne[0]; // hidden_size + txt_img_ne[1] = txt_ne[1] + img_ne[1]; // n_txt_tokens + n_img_tokens + txt_img_ne[2] = img_ne[2]; // N + txt_img_ne[3] = 1; + } + + auto single_name_at = [](int i) { return "single_blocks." + std::to_string(i); }; + + if (resident_single_blocks_ < 0 && streaming_engine_) { + resident_single_blocks_ = streaming_engine_->compute_resident_block_count( + "single_blocks.0", num_single_blocks); + LOG_INFO("%s single_blocks cache: %d resident, %d streamed per step", + get_desc().c_str(), + resident_single_blocks_, + num_single_blocks - resident_single_blocks_); + } + + int single_prefetch_start = 0; + while (single_prefetch_start < num_single_blocks && + registry.is_layer_on_gpu(single_name_at(single_prefetch_start))) { + single_prefetch_start++; + } + if (streaming_engine_) { + streaming_engine_->prime_prefetch(single_name_at, single_prefetch_start, num_single_blocks); + } + + for (int block_idx = 0; block_idx < num_single_blocks; block_idx++) { + // Check skip_layers (single blocks start at depth offset) + int skip_idx = block_idx + flux_params.depth; + if (skip_layers.size() > 0 && std::find(skip_layers.begin(), skip_layers.end(), skip_idx) != skip_layers.end()) { + LOG_DEBUG("Skipping single_block %d", block_idx); + continue; + } + + std::string block_name = single_name_at(block_idx); + int64_t t_block_start = ggml_time_ms(); + + // Wait for this block's prefetch to complete (if async prefetch was started) + if (streaming_engine_) { + streaming_engine_->wait_for_prefetch(block_name); + } + + // Load this block's weights (sync load if prefetch didn't happen) + if (!registry.move_layer_to_gpu(block_name)) { + LOG_ERROR("Failed to load %s", block_name.c_str()); + return false; + } + + // Keep the prefetch window full + if (streaming_engine_) { + streaming_engine_->advance_prefetch(single_name_at, block_idx, num_single_blocks); + } + + ggml_tensor* txt_img_out = nullptr; + + auto get_block_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(FLUX_GRAPH_SIZE / 4); + + // Create input tensors + ggml_tensor* txt_img_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, + txt_img_ne[0], txt_img_ne[1], txt_img_ne[2], txt_img_ne[3]); + ggml_tensor* vec_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, vec_ne[0], vec_ne[1], vec_ne[2], vec_ne[3]); + + txt_img_in = to_backend(txt_img_in); + 
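+                // vec_in follows the same pattern as txt_img_in: move the metadata
+                // tensor to the runtime backend, then bind its pinned host region
+                // below so the upload happens when the mini-graph executes.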
vec_in = to_backend(vec_in); + + set_backend_tensor_data(txt_img_in, persistent_txt_img); + set_backend_tensor_data(vec_in, persistent_vec); + + // PE tensor + int pos_len = static_cast(pe_vec.size() / flux_params.axes_dim_sum / 2); + ggml_tensor* pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, flux_params.axes_dim_sum / 2, pos_len); + set_backend_tensor_data(pe, pe_vec.data()); + + std::vector ss_mods; + auto runner_ctx = get_context(); + txt_img_out = flux.forward_single_block(&runner_ctx, block_idx, txt_img_in, vec_in, pe, + nullptr, ss_mods); + + ggml_build_forward_expand(gf, txt_img_out); + + return gf; + }; + + // Don't free compute buffer immediately - we need to read outputs first + if (!GGMLRunner::compute(get_block_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Single block %d execution failed", block_idx); + return false; + } + + // Extract output to persistent storage + if (txt_img_out) { + ggml_backend_tensor_get(txt_img_out, persistent_txt_img, 0, persistent_txt_img_count * sizeof(float)); + + for (int i = 0; i < 4; i++) { + txt_img_ne[i] = txt_img_out->ne[i]; + } + } + + // Now safe to free compute buffer + free_compute_buffer(); + + // Resident blocks stay on GPU across sampling steps. + if (block_idx >= resident_single_blocks_) { + registry.move_layer_to_cpu(block_name); + } + + LOG_DEBUG("Single block %d/%d done (%.2fms)", + block_idx + 1, num_single_blocks, (ggml_time_ms() - t_block_start) / 1.0); + } + + LOG_DEBUG("Executing output stage"); + { + auto get_output_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(FLUX_GRAPH_SIZE / 4); + + ggml_tensor* txt_img_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, + txt_img_ne[0], txt_img_ne[1], txt_img_ne[2], txt_img_ne[3]); + ggml_tensor* vec_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, vec_ne[0], vec_ne[1], vec_ne[2], vec_ne[3]); + + txt_img_in = to_backend(txt_img_in); + vec_in = to_backend(vec_in); + + set_backend_tensor_data(txt_img_in, persistent_txt_img); + set_backend_tensor_data(vec_in, persistent_vec); + + auto runner_ctx = get_context(); + auto final_out = flux.forward_output_stage(&runner_ctx, txt_img_in, vec_in, n_img_tokens, n_txt_tokens); + + // Unpatchify + int64_t W = x->ne[0]; + int64_t H = x->ne[1]; + final_out = DiT::unpatchify_and_crop(compute_ctx, final_out, H, W, flux_params.patch_size, flux_params.patch_size); + + ggml_build_forward_expand(gf, final_out); + + return gf; + }; + + if (!GGMLRunner::compute(get_output_graph, n_threads, true, output, output_ctx, true)) { + LOG_ERROR("Output stage failed"); + return false; + } + } + + int64_t t_end = ggml_time_ms(); + LOG_INFO("TRUE per-layer streaming completed in %.2fs (%d double + %d single blocks)", + (t_end - t_start) / 1000.0, num_double_blocks, num_single_blocks); + + return true; + } + + private: + Flux::StreamingContext streaming_ctx_; }; } // namespace Flux diff --git a/src/ggml_extend.hpp b/src/ggml_extend.hpp index 362303229..5a999e8eb 100644 --- a/src/ggml_extend.hpp +++ b/src/ggml_extend.hpp @@ -29,6 +29,7 @@ #include "ggml_extend_backend.hpp" #include "ggml_graph_cut.h" +#include "layer_streaming.hpp" #include "model.h" #include "tensor.hpp" @@ -1721,6 +1722,7 @@ struct GGMLRunner { ggml_context* offload_ctx = nullptr; ggml_backend_buffer_t runtime_params_buffer = nullptr; bool params_on_runtime_backend = false; + bool auto_offload_after_compute = true; // If false, don't auto-offload in free_compute_buffer ggml_context* cache_ctx = nullptr; ggml_backend_buffer_t cache_buffer 
= nullptr; @@ -1728,11 +1730,20 @@ struct GGMLRunner { ggml_context* compute_ctx = nullptr; ggml_gallocr* compute_allocr = nullptr; + // Graph-cut segmented param offload (`--max-vram` budget): the executor + // streams only the params needed by the current sub-graph segment. ggml_context* partial_offload_ctx = nullptr; ggml_backend_buffer_t partial_runtime_params_buffer = nullptr; std::vector> partial_offload_pairs; size_t max_graph_vram_bytes = 0; + // GPU-pinned host buffer shared across the per-runner persistent + // activation regions used by layer-streaming compute paths (txt_img, + // t_emb, pe, vec, ...). Allocated lazily in ensure_pinned_act_buffers() + // and freed in ~GGMLRunner. + ggml_backend_buffer_t persistent_act_host_buf_ = nullptr; + size_t persistent_act_host_size_ = 0; + std::shared_ptr weight_adapter = nullptr; std::vector one_vec = {1.f}; @@ -1750,9 +1761,103 @@ struct GGMLRunner { bool circular_x_enabled = false; bool circular_y_enabled = false; + // Graph-cut planner state — caches the segment plan + the set of param + // tensors so the planner doesn't rebuild on every dispatch. sd::ggml_graph_cut::PlanCache graph_cut_plan_cache_; std::unordered_set params_tensor_set_; + // Layer-streaming engine: drives per-layer prefetch + dispatch when the + // runner is configured with `--offload-mode layer_streaming`. + std::unique_ptr streaming_engine_; + + using layer_pattern_fn_t = std::function(const std::string&)>; + + void init_streaming(const LayerStreaming::StreamingConfig& config, + const std::map& tensor_map, + layer_pattern_fn_t pattern_fn) { + if (!params_backend || !runtime_backend) { + LOG_WARN("%s cannot enable streaming without both CPU and GPU backends", get_desc().c_str()); + return; + } + if (!streaming_engine_) { + streaming_engine_ = std::make_unique( + runtime_backend, params_backend); + } + // set_max_graph_vram_bytes() may have been called before this point + // (it's set per-runner during model load, while the streaming engine + // is created lazily here). Apply the stored cap to the engine's + // budget so --max-vram works for our streaming planner too. + streaming_engine_->get_budget().set_max_vram_cap_bytes(max_graph_vram_bytes); + auto cfg = config; + cfg.enabled = true; + streaming_engine_->set_config(cfg); + streaming_engine_->register_model_layers_from_map(tensor_map, pattern_fn); + } + + struct StreamingVramAnalysis { + size_t total_model_size = 0; + size_t available_vram = 0; + size_t already_on_gpu = 0; + size_t remaining_to_load = 0; + bool fits_in_vram = false; + }; + + StreamingVramAnalysis analyze_vram_budget() { + StreamingVramAnalysis result = {}; + if (!streaming_engine_) return result; + + auto& registry = streaming_engine_->get_registry(); + auto& budget = streaming_engine_->get_budget(); + + auto all_layers = registry.get_layer_names_sorted(); + for (const auto& name : all_layers) { + result.total_model_size += registry.get_layer_size(name); + } + + // Subtract a compute-buffer reserve from available VRAM. The fits_in_vram + // decision picks coarse-stage (load all params resident) when params fit; + // without this reserve the planner ignores the runtime compute graph's + // alloc, which on tight caps (e.g. SDXL 1024x1024 with --max-vram 6) tips + // params + CB over the budget mid-step and crashes cudaMalloc. + size_t raw_available = budget.get_available_vram(); + size_t cb_reserve = budget.get_compute_buffer_reserve(); + result.available_vram = (raw_available > cb_reserve) ? 
(raw_available - cb_reserve) : 0; + + for (const auto& name : all_layers) { + if (registry.is_layer_on_gpu(name)) { + result.already_on_gpu += registry.get_layer_size(name); + } + } + + result.remaining_to_load = (result.total_model_size > result.already_on_gpu) + ? (result.total_model_size - result.already_on_gpu) : 0; + result.fits_in_vram = (result.remaining_to_load <= result.available_vram); + + LOG_DEBUG("%s model size = %.2f GB, on GPU = %.2f GB, remaining = %.2f GB, available VRAM = %.2f GB (CB reserve = %.2f GB)", + get_desc().c_str(), + result.total_model_size / (1024.0 * 1024.0 * 1024.0), + result.already_on_gpu / (1024.0 * 1024.0 * 1024.0), + result.remaining_to_load / (1024.0 * 1024.0 * 1024.0), + result.available_vram / (1024.0 * 1024.0 * 1024.0), + cb_reserve / (1024.0 * 1024.0 * 1024.0)); + + return result; + } + + bool load_all_layers_coarse() { + if (!streaming_engine_) return false; + auto& registry = streaming_engine_->get_registry(); + auto& budget = streaming_engine_->get_budget(); + auto all_layers = registry.get_layer_names_sorted(); + for (const auto& name : all_layers) { + if (!registry.is_layer_on_gpu(name)) { + budget.ensure_vram_for_layer(name, 0); + registry.move_layer_to_gpu(name); + } + } + return true; + } + template static sd::Tensor take_or_empty(std::optional> tensor) { if (!tensor.has_value()) { @@ -1888,6 +1993,11 @@ struct GGMLRunner { return gf; } + // Two-step compute graph + buffer setup. Upstream split alloc_compute_buffer + // into prepare_compute_graph + alloc_compute_buffer(gf) so the graph-cut + // planner can inspect the graph before reserving (it needs to know which + // params each segment touches). The old single-call form is preserved as + // an overload below for callers that don't need the inspection step. bool prepare_compute_graph(get_graph_cb_t get_graph, ggml_cgraph** gf_out) { GGML_ASSERT(gf_out != nullptr); @@ -1910,13 +2020,11 @@ struct GGMLRunner { compute_allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(runtime_backend)); if (!ggml_gallocr_reserve(compute_allocr, gf)) { - // failed to allocate the compute buffer LOG_ERROR("%s: failed to allocate the compute buffer\n", get_desc().c_str()); free_compute_buffer(); return false; } - // compute the required memory size_t compute_buffer_size = ggml_gallocr_get_buffer_size(compute_allocr, 0); LOG_DEBUG("%s compute buffer size: %.2f MB(%s)", get_desc().c_str(), @@ -1925,6 +2033,29 @@ struct GGMLRunner { return true; } + // Backward-compatible single-call overload. Used by the layer-streaming + // path which doesn't need to re-inspect the graph before allocating; it + // wraps prepare_compute_graph + alloc_compute_buffer(gf) and returns the + // built graph via *out_gf so the caller can reuse it for the subsequent + // ggml_gallocr_alloc_graph() pass (avoids tensor pointer mismatches). 
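+    // Call-pattern sketch (illustrative; `get_graph` is any callback matching
+    // get_graph_cb_t):
+    //
+    //   // graph-cut path: inspect the graph before reserving
+    //   ggml_cgraph* gf = nullptr;
+    //   prepare_compute_graph(get_graph, &gf);
+    //   /* ...plan segments over gf... */
+    //   alloc_compute_buffer(gf);
+    //
+    //   // streaming path: single call, keep gf for the later alloc pass
+    //   ggml_cgraph* gf2 = nullptr;
+    //   alloc_compute_buffer(get_graph, &gf2);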
+ bool alloc_compute_buffer(get_graph_cb_t get_graph, struct ggml_cgraph** out_gf = nullptr) { + if (compute_allocr != nullptr) { + if (out_gf) *out_gf = nullptr; + return true; + } + ggml_cgraph* gf = nullptr; + if (!prepare_compute_graph(get_graph, &gf)) { + if (out_gf) *out_gf = nullptr; + return false; + } + if (!alloc_compute_buffer(gf)) { + if (out_gf) *out_gf = nullptr; + return false; + } + if (out_gf) *out_gf = gf; + return true; + } + void free_cache_buffer() { if (cache_buffer != nullptr) { ggml_backend_buffer_free(cache_buffer); @@ -2015,29 +2146,44 @@ struct GGMLRunner { return true; } - void copy_data_to_backend_tensor(ggml_cgraph* gf, bool clear_after_copy = true) { - GGML_ASSERT(gf != nullptr); + // Upload entries from backend_tensor_data_map to their backend tensors. + // When a graph is supplied, only tensors that appear in the graph are + // uploaded (graph-cut needs this so segment-N inputs aren't touched + // outside their segment); otherwise every entry is uploaded + // unconditionally, which is what the layer-streaming dispatch path + // wants since each layer's mini-graph carries only its own inputs. + void copy_data_to_backend_tensor(ggml_cgraph* gf = nullptr, bool clear_after_copy = true) { std::unordered_set graph_tensor_set; - const int n_leafs = sd::ggml_graph_cut::leaf_count(gf); - const int n_nodes = ggml_graph_n_nodes(gf); - graph_tensor_set.reserve(static_cast(n_leafs + n_nodes)); - for (int i = 0; i < n_leafs; ++i) { - graph_tensor_set.insert(sd::ggml_graph_cut::leaf_tensor(gf, i)); - } - for (int i = 0; i < n_nodes; ++i) { - graph_tensor_set.insert(ggml_graph_node(gf, i)); + if (gf != nullptr) { + const int n_leafs = sd::ggml_graph_cut::leaf_count(gf); + const int n_nodes = ggml_graph_n_nodes(gf); + graph_tensor_set.reserve(static_cast(n_leafs + n_nodes)); + for (int i = 0; i < n_leafs; ++i) { + graph_tensor_set.insert(sd::ggml_graph_cut::leaf_tensor(gf, i)); + } + for (int i = 0; i < n_nodes; ++i) { + graph_tensor_set.insert(ggml_graph_node(gf, i)); + } } + int copied_count = 0; + int skipped_count = 0; + for (auto& kv : backend_tensor_data_map) { auto tensor = kv.first; auto data = kv.second; - if (graph_tensor_set.find(tensor) == graph_tensor_set.end()) { + if (gf != nullptr && graph_tensor_set.find(tensor) == graph_tensor_set.end()) { continue; } ggml_backend_buffer_t buf = tensor->view_src ? tensor->view_src->buffer : tensor->buffer; if (buf == nullptr) { + // Either an input the graph didn't actually allocate, or a + // genuine missing-buffer bug. Log once with enough context + // to debug; treat as skip rather than crash so layer streaming + // (which adds inputs that may go unused in some sub-graphs) + // doesn't trip on benign cases. LOG_WARN("%s graph exec skip tensor copy: name=%s op=%s reason=buffer_not_set data=%p view_src=%p view_src_buffer=%p", get_desc().c_str(), tensor && tensor->name[0] != '\0' ? tensor->name : "", @@ -2045,10 +2191,16 @@ struct GGMLRunner { data, tensor ? tensor->view_src : nullptr, (tensor && tensor->view_src) ? tensor->view_src->buffer : nullptr); + skipped_count++; continue; } ggml_backend_tensor_set(tensor, data, 0, ggml_nbytes(tensor)); + copied_count++; + } + + if (copied_count > 0 || skipped_count > 0) { + LOG_DEBUG("copy_data_to_backend_tensor: copied %d tensors, skipped %d", copied_count, skipped_count); } if (clear_after_copy) { @@ -2539,6 +2691,21 @@ struct GGMLRunner { virtual ~GGMLRunner() { free_params_buffer(); + // Also free the runtime-side weight buffers if allocated. 
free_params_buffer() + // only releases the CPU-side params_buffer; the runtime backend can hold up to + // two more buffers (full + partial) that need explicit cleanup here. + if (runtime_params_buffer != nullptr) { + ggml_backend_buffer_free(runtime_params_buffer); + runtime_params_buffer = nullptr; + } + if (partial_runtime_params_buffer != nullptr) { + ggml_backend_buffer_free(partial_runtime_params_buffer); + partial_runtime_params_buffer = nullptr; + } + if (persistent_act_host_buf_ != nullptr) { + ggml_backend_buffer_free(persistent_act_host_buf_); + persistent_act_host_buf_ = nullptr; + } free_compute_buffer(); free_params_ctx(); free_compute_ctx(); @@ -2548,6 +2715,57 @@ struct GGMLRunner { free_cache_ctx_and_buffer(); } + // Allocates (or grows) a single GPU-pinned host buffer that backs all the + // runner's persistent activation regions for streaming compute paths, and + // writes 256-byte-aligned start pointers for each region into out_ptrs + // (same length as sizes_bytes). Pinned host memory makes the per-layer + // ggml_backend_tensor_get / copy_data_to_backend_tensor calls run at + // full PCIe bandwidth instead of staging through CUDA's bounce buffer. + // + // Returns true on success. On failure (pinned alloc rejected by the + // backend, e.g. out of locked pages) returns false so the caller can + // fall back to pageable std::vector storage — output is still correct, + // just slower. + bool ensure_pinned_act_buffers(const std::vector& sizes_bytes, + std::vector& out_ptrs) { + out_ptrs.assign(sizes_bytes.size(), nullptr); + const size_t align = 256; + std::vector aligned_sizes(sizes_bytes.size()); + size_t total = 0; + for (size_t i = 0; i < sizes_bytes.size(); i++) { + aligned_sizes[i] = ((sizes_bytes[i] + align - 1) / align) * align; + total += aligned_sizes[i]; + } + + if (persistent_act_host_buf_ == nullptr || persistent_act_host_size_ < total) { + if (persistent_act_host_buf_ != nullptr) { + ggml_backend_buffer_free(persistent_act_host_buf_); + persistent_act_host_buf_ = nullptr; + } + ggml_backend_dev_t gpu_dev = runtime_backend ? ggml_backend_get_device(runtime_backend) : nullptr; + ggml_backend_buffer_type_t host_buft = gpu_dev ? ggml_backend_dev_host_buffer_type(gpu_dev) : nullptr; + if (host_buft != nullptr) { + persistent_act_host_buf_ = ggml_backend_buft_alloc_buffer(host_buft, total); + } + if (persistent_act_host_buf_ == nullptr) { + LOG_WARN("%s pinned activation buffer alloc failed (%.2f MB), " + "falling back to pageable", + get_desc().c_str(), total / (1024.0 * 1024.0)); + persistent_act_host_size_ = 0; + return false; + } + persistent_act_host_size_ = total; + } + + char* base = static_cast(ggml_backend_buffer_get_base(persistent_act_host_buf_)); + size_t offset = 0; + for (size_t i = 0; i < sizes_bytes.size(); i++) { + out_ptrs[i] = reinterpret_cast(base + offset); + offset += aligned_sizes[i]; + } + return true; + } + virtual GGMLRunnerContext get_context() { GGMLRunnerContext runner_ctx; runner_ctx.ggml_ctx = compute_ctx; @@ -2567,7 +2785,34 @@ struct GGMLRunner { bool alloc_params_buffer() { size_t num_tensors = ggml_tensor_num(params_ctx); - params_buffer = ggml_backend_alloc_ctx_tensors(params_ctx, params_backend); + bool used_pinned_host = false; + + // When weights live on CPU but get streamed/transferred to GPU during + // compute, allocate them in the GPU device's pinned host buffer so + // async H2D copies actually overlap with compute. 
Without pinning, + // CUDA falls back to a staged sync copy through an internal bounce + // buffer (and Vulkan/Metal hit similar slow paths). + if (params_backend != runtime_backend && ggml_backend_is_cpu(params_backend)) { + ggml_backend_dev_t gpu_dev = ggml_backend_get_device(runtime_backend); + if (gpu_dev != nullptr) { + ggml_backend_buffer_type_t host_buft = ggml_backend_dev_host_buffer_type(gpu_dev); + if (host_buft != nullptr) { + params_buffer = ggml_backend_alloc_ctx_tensors_from_buft(params_ctx, host_buft); + if (params_buffer != nullptr) { + used_pinned_host = true; + } else { + LOG_WARN("%s pinned host alloc failed (system out of locked pages?), " + "falling back to pageable", + get_desc().c_str()); + } + } + } + } + + if (params_buffer == nullptr) { + params_buffer = ggml_backend_alloc_ctx_tensors(params_ctx, params_backend); + } + if (params_buffer == nullptr) { LOG_ERROR("%s alloc params backend buffer failed, num_tensors = %i", get_desc().c_str(), @@ -2577,15 +2822,20 @@ struct GGMLRunner { rebuild_params_tensor_set(); ggml_backend_buffer_set_usage(params_buffer, GGML_BACKEND_BUFFER_USAGE_WEIGHTS); size_t params_buffer_size = ggml_backend_buffer_get_size(params_buffer); - LOG_DEBUG("%s params backend buffer size = % 6.2f MB(%s) (%i tensors)", + LOG_DEBUG("%s params backend buffer size = % 6.2f MB(%s%s) (%i tensors)", get_desc().c_str(), params_buffer_size / (1024.f * 1024.f), ggml_backend_is_cpu(params_backend) ? "RAM" : "VRAM", + used_pinned_host ? ",pinned" : "", num_tensors); return true; } void free_params_buffer() { + // If params are on GPU, move them back to CPU first (this also frees runtime_params_buffer) + if (params_on_runtime_backend) { + restore_all_params(); + } if (params_buffer != nullptr) { ggml_backend_buffer_free(params_buffer); params_buffer = nullptr; @@ -2599,6 +2849,128 @@ struct GGMLRunner { return 0; } + // Estimate compute buffer size without actually allocating (dry-run) + // Returns 0 on failure, otherwise the required buffer size in bytes + size_t estimate_compute_buffer_size(get_graph_cb_t get_graph) { + reset_compute_ctx(); + struct ggml_cgraph* gf = get_compute_graph(get_graph); + backend_tensor_data_map.clear(); + + ggml_gallocr_t temp_allocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(runtime_backend)); + if (temp_allocr == nullptr) { + return 0; + } + + size_t result = 0; + if (ggml_gallocr_reserve(temp_allocr, gf)) { + result = ggml_gallocr_get_buffer_size(temp_allocr, 0); + } + + ggml_gallocr_free(temp_allocr); + reset_compute_ctx(); // Clean up after estimation + return result; + } + + // Dynamic tensor offloading API + // Returns true if params are currently on the runtime (GPU) backend + bool is_params_on_gpu() const { + // If params_backend == runtime_backend, params are always "on GPU" + // (or always on CPU if CPU-only mode) + if (params_backend == runtime_backend) { + return !ggml_backend_is_cpu(runtime_backend); + } + // Otherwise check the offload state + return params_on_runtime_backend; + } + + // Move params from GPU to CPU (params_backend), freeing GPU memory + // Returns true on success, false if already on CPU or not applicable + bool move_params_to_cpu() { + if (params_backend == runtime_backend) { + // No separate CPU backend configured, can't offload + return false; + } + if (!params_on_runtime_backend) { + // Already on CPU + return true; + } + restore_all_params(); + return true; + } + + // Move params from CPU to GPU (runtime_backend), allocating GPU memory + // Returns true on success, false if already 
on GPU or allocation failed + bool move_params_to_gpu() { + if (params_backend == runtime_backend) { + // No separate CPU backend, params are always on runtime backend + return true; + } + if (params_on_runtime_backend) { + // Already on GPU + return true; + } + return offload_all_params(); + } + + // Get the size of params buffer (VRAM usage when on GPU) + size_t get_params_vram_size() const { + if (params_buffer != nullptr) { + return ggml_backend_buffer_get_size(params_buffer); + } + return 0; + } + + // Control automatic offloading after compute operations + // When disabled, params stay on GPU until explicitly moved via move_params_to_cpu() + void set_auto_offload(bool enabled) { + auto_offload_after_compute = enabled; + } + + bool get_auto_offload() const { + return auto_offload_after_compute; + } + + bool is_streaming_enabled() const { + return streaming_engine_ && streaming_engine_->get_config().enabled; + } + + void disable_layer_streaming() { + if (streaming_engine_) { + auto cfg = streaming_engine_->get_config(); + cfg.enabled = false; + streaming_engine_->set_config(cfg); + } + } + + void offload_streaming_layers() { + if (!streaming_engine_) return; + auto& registry = streaming_engine_->get_registry(); + auto layers = registry.get_layer_names_sorted(); + size_t offloaded = 0; + for (const auto& layer : layers) { + if (registry.is_layer_on_gpu(layer)) { + registry.move_layer_to_cpu(layer); + offloaded++; + } + } + if (offloaded > 0) { + LOG_INFO("%s offloaded %zu streaming layers to CPU", get_desc().c_str(), offloaded); + } + // Hook: runners can drop any cached state that referenced the resident + // layers (e.g. ZImageRunner's Phase 4 chunk graph), since those tensors + // have just been moved to CPU. + on_streaming_layers_offloaded(); + } + + // Override in subclasses to release any cached state tied to the + // streaming layers' GPU residency (e.g. cached chunk graphs whose ops + // reference the now-evicted weight tensors). + virtual void on_streaming_layers_offloaded() {} + + LayerStreaming::LayerExecutionEngine* get_streaming_engine() { + return streaming_engine_.get(); + } + void free_cache_ctx_and_buffer() { free_cache_buffer(); free_cache_ctx(); @@ -2609,8 +2981,16 @@ struct GGMLRunner { ggml_gallocr_free(compute_allocr); compute_allocr = nullptr; } + // Graph-cut path: undo any per-segment partial offload so the next + // compute starts fresh. Both restore_* calls are no-ops if not active. restore_partial_params(); restore_all_params(); + // Layer-streaming / offload-mode path: when the runner has been told + // to drop params back to the params backend after each compute (e.g. + // cond_diffusion / aggressive modes), do that here. 
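+        // restore_all_params() is a no-op when params already live on the
+        // params backend, so reaching it twice on this path (once above for
+        // graph-cut, once here) is harmless.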
+ if (auto_offload_after_compute) { + restore_all_params(); + } } // do copy after alloc graph @@ -2669,6 +3049,69 @@ struct GGMLRunner { return ggml_get_tensor(cache_ctx, name.c_str()); } + // Our fork's compute overload with output tensor and skip_param_offload support + bool compute(get_graph_cb_t get_graph, + int n_threads, + bool free_compute_buffer_immediately = true, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr, + bool skip_param_offload = false) { + // In streaming mode, weights are managed by the streaming engine + // so skip the bulk offload which would fail due to VRAM limits + if (!skip_param_offload && !offload_all_params()) { + LOG_ERROR("%s offload params to runtime backend failed", get_desc().c_str()); + return false; + } + + ggml_cgraph* gf = nullptr; + if (!alloc_compute_buffer(get_graph, &gf)) { + LOG_ERROR("%s alloc compute buffer failed", get_desc().c_str()); + return false; + } + // If alloc_compute_buffer just created a new allocator, gf contains the graph + // used for reservation and we MUST reuse it (same tensor pointers). + // If allocator already existed, gf is nullptr and we need to rebuild. + if (gf == nullptr) { + backend_tensor_data_map.clear(); + reset_compute_ctx(); + gf = get_compute_graph(get_graph); + } + + if (!ggml_gallocr_alloc_graph(compute_allocr, gf)) { + LOG_ERROR("%s alloc compute graph failed", get_desc().c_str()); + return false; + } + copy_data_to_backend_tensor(); + if (ggml_backend_is_cpu(runtime_backend)) { + ggml_backend_cpu_set_n_threads(runtime_backend, n_threads); + } + + ggml_status status = ggml_backend_graph_compute(runtime_backend, gf); + if (status != GGML_STATUS_SUCCESS) { + LOG_ERROR("%s compute failed: %s", get_desc().c_str(), ggml_status_to_string(status)); + return false; + } +#ifdef GGML_PERF + ggml_graph_print(gf); +#endif + copy_cache_tensors_to_cache_buffer(); + if (output != nullptr) { + auto result = ggml_get_tensor(compute_ctx, final_result_name.c_str()); + if (*output == nullptr && output_ctx != nullptr) { + *output = ggml_dup_tensor(output_ctx, result); + } + if (*output != nullptr) { + ggml_ext_backend_tensor_get_and_sync(runtime_backend, result, (*output)->data, 0, ggml_nbytes(*output)); + } + } + + if (free_compute_buffer_immediately) { + free_compute_buffer(); + } + return true; + } + + // Upstream's templated compute returning sd::Tensor template std::optional> compute(get_graph_cb_t get_graph, int n_threads, @@ -2680,6 +3123,10 @@ struct GGMLRunner { } GGML_ASSERT(gf != nullptr); + // Try the graph-cut segmented path first when --max-vram is set and + // params live on a different backend than the runtime. The planner + // may decide a single segment is enough, in which case we fall + // through to the regular alloc + execute path below. if (can_attempt_graph_cut_segmented_compute()) { GraphCutPlan plan; if (!resolve_graph_cut_plan(gf, &plan)) { @@ -2725,6 +3172,12 @@ struct GGMLRunner { void set_max_graph_vram_bytes(size_t max_vram_bytes) { max_graph_vram_bytes = max_vram_bytes; + // Forward to the layer-streaming budget too, so --max-vram caps both + // the graph-cut planner (above) and our streaming planner. Lets a + // single flag drive the simulated-smaller-card case for both paths. 
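+        // Illustrative invocation (the byte-count form of the flag value is
+        // an assumption for this example):
+        //   sd-cli ... --offload-mode layer_streaming --max-vram 4294967296
+        // would make both the graph-cut planner and the streaming budget
+        // behave as if only 4 GB of VRAM existed.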
+ if (streaming_engine_) { + streaming_engine_->get_budget().set_max_vram_cap_bytes(max_vram_bytes); + } } ggml_backend_t get_runtime_backend() { diff --git a/src/layer_streaming.hpp b/src/layer_streaming.hpp new file mode 100644 index 000000000..be7a30b72 --- /dev/null +++ b/src/layer_streaming.hpp @@ -0,0 +1,513 @@ +#ifndef __LAYER_STREAMING_HPP__ +#define __LAYER_STREAMING_HPP__ + +#include +#include +#include +#include +#include +#include + +#include "ggml-alloc.h" +#include "ggml-backend.h" +#include "ggml.h" + +#include "memory_budget.hpp" +#include "tensor_registry.hpp" +#include "util.h" + +namespace LayerStreaming { + +class LayerExecutionEngine; + +struct LayerSubgraph { + std::string name; + int index; + size_t estimated_compute_size = 0; + + using ExecuteFn = std::function( + ggml_context* ctx, + ggml_backend_t backend, + const std::vector& inputs)>; + + ExecuteFn execute_fn; +}; + +struct StreamingConfig { + bool enabled = false; + int prefetch_layers = 1; + int keep_layers_behind = 0; + size_t min_free_vram = 512 * 1024 * 1024; + bool async_prefetch = true; + bool log_operations = false; +}; + +class IntermediateTensorManager { +public: + IntermediateTensorManager(ggml_backend_t gpu_backend) + : gpu_backend_(gpu_backend) {} + + ~IntermediateTensorManager() { + clear(); + } + + ggml_tensor* store(const std::string& name, ggml_tensor* tensor) { + if (contexts_.find(name) != contexts_.end()) { + if (buffers_.find(name) != buffers_.end()) { + ggml_backend_buffer_free(buffers_[name]); + } + ggml_free(contexts_[name]); + } + + size_t ctx_size = ggml_tensor_overhead() + 1024; + struct ggml_init_params params = { + ctx_size, + nullptr, + true // no_alloc + }; + ggml_context* ctx = ggml_init(params); + if (ctx == nullptr) { + LOG_ERROR("failed to create context for '%s'", name.c_str()); + return nullptr; + } + + ggml_tensor* stored = ggml_dup_tensor(ctx, tensor); + ggml_set_name(stored, name.c_str()); + + ggml_backend_buffer_t buffer = ggml_backend_alloc_ctx_tensors(ctx, gpu_backend_); + if (buffer == nullptr) { + LOG_ERROR("failed to allocate buffer for '%s'", name.c_str()); + ggml_free(ctx); + return nullptr; + } + + ggml_backend_tensor_copy(tensor, stored); + ggml_backend_synchronize(gpu_backend_); + + contexts_[name] = ctx; + buffers_[name] = buffer; + tensors_[name] = stored; + + return stored; + } + + ggml_tensor* get(const std::string& name) { + auto it = tensors_.find(name); + if (it == tensors_.end()) { + return nullptr; + } + return it->second; + } + + bool has(const std::string& name) const { + return tensors_.find(name) != tensors_.end(); + } + + void remove(const std::string& name) { + auto buf_it = buffers_.find(name); + if (buf_it != buffers_.end()) { + ggml_backend_buffer_free(buf_it->second); + buffers_.erase(buf_it); + } + + auto ctx_it = contexts_.find(name); + if (ctx_it != contexts_.end()) { + ggml_free(ctx_it->second); + contexts_.erase(ctx_it); + } + + tensors_.erase(name); + } + + void clear() { + for (auto& [name, buffer] : buffers_) { + ggml_backend_buffer_free(buffer); + } + for (auto& [name, ctx] : contexts_) { + ggml_free(ctx); + } + tensors_.clear(); + buffers_.clear(); + contexts_.clear(); + } + + size_t get_memory_usage() const { + size_t total = 0; + for (const auto& [name, buffer] : buffers_) { + total += ggml_backend_buffer_get_size(buffer); + } + return total; + } + +private: + ggml_backend_t gpu_backend_; + std::unordered_map contexts_; + std::unordered_map buffers_; + std::unordered_map tensors_; +}; + +class LayerExecutionEngine { +public: + 
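+    // Typical lifecycle, as a sketch (the backend handles, tensor map and
+    // layer-pattern callback are placeholders for the example):
+    //
+    //   LayerStreaming::LayerExecutionEngine engine(gpu_backend, cpu_backend);
+    //   LayerStreaming::StreamingConfig cfg;
+    //   cfg.enabled         = true;
+    //   cfg.prefetch_layers = 2;
+    //   engine.set_config(cfg);
+    //   engine.register_model_layers_from_map(tensors, layer_pattern_fn);
+    //   auto outs = engine.execute_streaming(layers, initial_inputs, out_ctx);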
LayerExecutionEngine(ggml_backend_t gpu_backend, + ggml_backend_t cpu_backend) + : gpu_backend_(gpu_backend), + cpu_backend_(cpu_backend), + registry_(gpu_backend, cpu_backend), + budget_(registry_, gpu_backend), + intermediates_(gpu_backend) {} + + void set_config(const StreamingConfig& config) { + config_ = config; + } + + const StreamingConfig& get_config() const { + return config_; + } + + TensorRegistry& get_registry() { + return registry_; + } + + MemoryBudgetManager& get_budget() { + return budget_; + } + + // Prefer register_model_layers_from_map() - context tensors often lack proper names + void register_model_layers(ggml_context* params_ctx, + std::function(const std::string&)> layer_pattern_fn) { + registry_.register_from_context(params_ctx, "", layer_pattern_fn); + log_registered_layers(); + } + + void register_model_layers_from_map(const std::map& tensors, + std::function(const std::string&)> layer_pattern_fn) { + registry_.register_from_map(tensors, layer_pattern_fn); + log_registered_layers(); + } + +private: + void log_registered_layers() { + if (config_.log_operations) { + auto layers = registry_.get_layer_names_sorted(); + LOG_INFO("registered %zu layers", layers.size()); + for (const auto& layer : layers) { + LOG_DEBUG(" - %s: %.2f MB", + layer.c_str(), + registry_.get_layer_size(layer) / (1024.0 * 1024.0)); + } + } + } + +public: + + std::vector execute_streaming( + const std::vector& layers, + const std::vector& initial_inputs, + ggml_context* output_ctx) { + + if (!config_.enabled || layers.empty()) { + LOG_WARN("streaming disabled or no layers"); + return {}; + } + + int64_t total_start = ggml_time_ms(); + std::vector current_inputs = initial_inputs; + + for (size_t i = 0; i < layers.size(); i++) { + const auto& layer = layers[i]; + int64_t layer_start = ggml_time_ms(); + + if (!ensure_layer_loaded(layer.name, static_cast(i))) { + LOG_ERROR("failed to load layer '%s'", layer.name.c_str()); + return {}; + } + + if (config_.async_prefetch) { + for (int j = 1; j <= config_.prefetch_layers && i + j < layers.size(); j++) { + prefetch_layer(layers[i + j].name); + } + } + + ggml_context* layer_ctx = create_layer_context(layer); + if (layer_ctx == nullptr) { + LOG_ERROR("failed to create context for layer '%s'", layer.name.c_str()); + return {}; + } + + std::vector outputs = layer.execute_fn(layer_ctx, gpu_backend_, current_inputs); + + for (size_t j = 0; j < outputs.size(); j++) { + std::string name = "intermediate_" + std::to_string(i) + "_" + std::to_string(j); + ggml_tensor* stored = intermediates_.store(name, outputs[j]); + if (stored != nullptr) { + outputs[j] = stored; + } + } + + if (should_offload_layer(layer.name, static_cast(i), layers)) { + registry_.move_layer_to_cpu(layer.name); + } + + ggml_free(layer_ctx); + + current_inputs = outputs; + + if (config_.log_operations) { + int64_t layer_end = ggml_time_ms(); + LOG_DEBUG("executed layer '%s' in %.2fs", + layer.name.c_str(), + (layer_end - layer_start) / 1000.0); + } + } + + int64_t total_end = ggml_time_ms(); + if (config_.log_operations) { + LOG_INFO("executed %zu layers in %.2fs", + layers.size(), + (total_end - total_start) / 1000.0); + } + + return current_inputs; + } + + void clear() { + intermediates_.clear(); + } + + // Clears everything including registry (for new model) + void reset() { + intermediates_.clear(); + registry_.clear(); + } + + void prefetch_layer(const std::string& layer_name) { + if (!config_.async_prefetch) { + return; + } + + if (registry_.is_layer_on_gpu(layer_name)) { + return; + } + 
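+        // Deduplicate: if an async load for this layer is already in flight,
+        // leave it alone — wait_for_prefetch() will complete it later.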
+ if (pending_prefetches_.find(layer_name) != pending_prefetches_.end()) { + return; + } + + if (registry_.start_async_layer_load(layer_name, gpu_backend_, cpu_backend_)) { + pending_prefetches_.insert(layer_name); + if (config_.log_operations) { + LOG_DEBUG("started async prefetch for '%s'", layer_name.c_str()); + } + } + } + + void wait_for_prefetch(const std::string& layer_name) { + auto it = pending_prefetches_.find(layer_name); + if (it == pending_prefetches_.end()) { + return; + } + + if (registry_.complete_async_layer_load(layer_name, gpu_backend_)) { + pending_prefetches_.erase(it); + if (config_.log_operations) { + LOG_DEBUG("completed async prefetch for '%s'", layer_name.c_str()); + } + } + } + + void wait_for_all_prefetches() { + for (const auto& layer_name : pending_prefetches_) { + registry_.complete_async_layer_load(layer_name, gpu_backend_); + } + pending_prefetches_.clear(); + } + + bool is_prefetch_pending(const std::string& layer_name) const { + return pending_prefetches_.find(layer_name) != pending_prefetches_.end(); + } + + // Decides how many blocks to keep permanently resident on GPU for a + // section of the model (e.g. all "layers.N" or all "double_blocks.N"). + // Static partition follows ComfyUI's partially_load() — for the cyclic + // sequential access pattern of diffusion sampling, caching a fixed + // prefix is simpler and faster than dynamic eviction. Caller is + // responsible for storing the result and only computing it once per + // section so that consecutive calls inside the same generation see a + // consistent VRAM budget. + // + // sample_block_name should be a real block in the section (e.g. + // "layers.0") so per-block size can be measured. compute_buffer_reserve + // should be set per-runner to the peak compute buffer observed during + // a single block forward pass. + int compute_resident_block_count(const std::string& sample_block_name, + int num_blocks, + size_t compute_buffer_reserve = 768ULL * 1024 * 1024) { + if (num_blocks <= 0) { + return 0; + } + + size_t per_block = registry_.get_layer_size(sample_block_name); + if (per_block == 0) { + return 0; + } + + // Headroom: prefetch window in flight + the active block + the + // upcoming compute buffer + a hard safety margin. Without this + // slack the next prefetch's cudaMalloc can fail mid-loop. + int prefetch_count = std::max(1, config_.prefetch_layers); + size_t prefetch_reserve = static_cast(prefetch_count + 1) * per_block; + size_t safety = std::max(config_.min_free_vram, 512ULL * 1024 * 1024); + size_t reserved = prefetch_reserve + safety + compute_buffer_reserve; + + size_t free_vram = budget_.get_free_vram(); + if (free_vram <= reserved) { + return 0; + } + size_t available = free_vram - reserved; + int max_resident = static_cast(available / per_block); + return std::min(num_blocks, max_resident); + } + + // Prime the prefetch pipeline by kicking off transfers for the first + // prefetch_layers blocks starting at start_idx. Call once before the + // streaming loop. name_for(i) -> the registry key for block i. + void prime_prefetch(const std::function& name_for, + int start_idx, int num_blocks) { + int n = config_.prefetch_layers > 0 ? config_.prefetch_layers : 1; + for (int j = 0; j < n && (start_idx + j) < num_blocks; j++) { + prefetch_layer(name_for(start_idx + j)); + } + } + + // After moving block current_idx to GPU, kick off prefetch of the slot + // (current_idx + prefetch_layers) so the window stays full. 
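+    // Worked example with prefetch_layers = 2 and 10 blocks: prime_prefetch()
+    // starts blocks 0 and 1; once block 3 lands on the GPU,
+    // advance_prefetch(name_for, 3, 10) starts block 5, so two transfers stay
+    // in flight while block 3 executes.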
+ void advance_prefetch(const std::function& name_for, + int current_idx, int num_blocks) { + int n = config_.prefetch_layers > 0 ? config_.prefetch_layers : 1; + int target = current_idx + n; + if (target < num_blocks) { + prefetch_layer(name_for(target)); + } + } + +private: + bool ensure_layer_loaded(const std::string& layer_name, int current_idx) { + if (registry_.is_layer_on_gpu(layer_name)) { + return true; + } + + if (!budget_.ensure_vram_for_layer(layer_name, current_idx)) { + LOG_ERROR("cannot ensure VRAM for layer '%s'", layer_name.c_str()); + return false; + } + + return registry_.move_layer_to_gpu(layer_name); + } + + bool should_offload_layer(const std::string& layer_name, + int layer_idx, + const std::vector& layers) { + if (layer_name == "_global") { + return false; + } + + size_t free_vram = budget_.get_available_vram(); + if (free_vram > config_.min_free_vram * 2) { + return false; + } + + // UNet skip connections need more sophisticated logic + if (config_.keep_layers_behind > 0) { + return false; + } + + return free_vram < config_.min_free_vram; + } + + ggml_context* create_layer_context(const LayerSubgraph& layer) { + size_t ctx_size = 1024 * 1024; + if (layer.estimated_compute_size > 0) { + ctx_size = layer.estimated_compute_size; + } + + struct ggml_init_params params = { + ctx_size, + nullptr, + true // no_alloc + }; + + return ggml_init(params); + } + + ggml_backend_t gpu_backend_; + ggml_backend_t cpu_backend_; + + TensorRegistry registry_; + MemoryBudgetManager budget_; + IntermediateTensorManager intermediates_; + + StreamingConfig config_; + + std::set pending_prefetches_; +}; + +inline std::vector build_flux_layer_subgraphs( + int depth, + int depth_single, + const std::vector& skip_layers = {}) { + + std::vector layers; + + for (int i = 0; i < depth; i++) { + if (std::find(skip_layers.begin(), skip_layers.end(), i) != skip_layers.end()) { + continue; + } + + LayerSubgraph layer; + layer.name = "double_blocks." + std::to_string(i); + layer.index = i; + layers.push_back(layer); + } + + for (int i = 0; i < depth_single; i++) { + if (std::find(skip_layers.begin(), skip_layers.end(), i + depth) != skip_layers.end()) { + continue; + } + + LayerSubgraph layer; + layer.name = "single_blocks." + std::to_string(i); + layer.index = depth + i; + layers.push_back(layer); + } + + return layers; +} + +// UNet uses coarse stages due to skip connections +inline std::vector build_unet_layer_subgraphs( + int num_input_blocks, + int num_output_blocks) { + + std::vector layers; + + LayerSubgraph input_stage; + input_stage.name = "input_blocks"; + input_stage.index = 0; + layers.push_back(input_stage); + + LayerSubgraph middle_stage; + middle_stage.name = "middle_block"; + middle_stage.index = 1; + layers.push_back(middle_stage); + + LayerSubgraph output_stage; + output_stage.name = "output_blocks"; + output_stage.index = 2; + layers.push_back(output_stage); + + return layers; +} + +} // namespace LayerStreaming + +#endif // __LAYER_STREAMING_HPP__ diff --git a/src/lora.hpp b/src/lora.hpp index b57bc4226..f4e42890f 100644 --- a/src/lora.hpp +++ b/src/lora.hpp @@ -24,8 +24,9 @@ struct LoraModel : public GGMLRunner { ggml_backend_t backend, const std::string& file_path = "", std::string prefix = "", - SDVersion version = VERSION_COUNT) - : lora_id(lora_id), file_path(file_path), GGMLRunner(backend, false) { + SDVersion version = VERSION_COUNT, + bool enable_offload = false) + : lora_id(lora_id), file_path(file_path), GGMLRunner(backend, enable_offload) { prefix = "lora." 
+ prefix; if (!model_loader.init_from_file_and_convert_name(file_path, prefix, version)) { load_failed = true; @@ -94,6 +95,29 @@ struct LoraModel : public GGMLRunner { return true; } + // Reload params from disk after buffer was freed (for dynamic offloading) + // Assumes lora_tensors map is still valid (tensors exist in params_ctx) + bool reload_params(int n_threads) { + if (lora_tensors.empty()) { + return true; // Nothing to reload + } + + alloc_params_buffer(); + + auto on_reload_cb = [&](const TensorStorage& tensor_storage, ggml_tensor** dst_tensor) -> bool { + const std::string& name = tensor_storage.name; + auto iter = lora_tensors.find(name); + if (iter != lora_tensors.end()) { + *dst_tensor = iter->second; + } + return true; + }; + + model_loader.load_tensors(on_reload_cb, n_threads); + LOG_DEBUG("reloaded lora params from disk"); + return true; + } + void preprocess_lora_tensors(const std::map& model_tensors) { if (tensor_preprocessed) { return; diff --git a/src/memory_budget.hpp b/src/memory_budget.hpp new file mode 100644 index 000000000..199c58091 --- /dev/null +++ b/src/memory_budget.hpp @@ -0,0 +1,316 @@ +#ifndef __MEMORY_BUDGET_HPP__ +#define __MEMORY_BUDGET_HPP__ + +#include +#include +#include + +#include "ggml-backend.h" +#include "ggml.h" + +#include "tensor_registry.hpp" +#include "util.h" + +namespace LayerStreaming { + +enum class EvictionPolicy { + LAYER_DISTANCE, + LRU, + LARGEST_FIRST, +}; + +class MemoryBudgetManager { +public: + MemoryBudgetManager(TensorRegistry& registry, + ggml_backend_t gpu_backend, + size_t safety_margin_bytes = 512 * 1024 * 1024) + : registry_(registry), + gpu_backend_(gpu_backend), + safety_margin_(safety_margin_bytes) { + query_device_memory(); + } + + void set_eviction_policy(EvictionPolicy policy) { + eviction_policy_ = policy; + } + + void set_safety_margin(size_t bytes) { + safety_margin_ = bytes; + } + + void query_device_memory() { + // Use runtime backend device API (works for CUDA, Vulkan, Metal, etc.). + // The previous SD_USE_CUDA gate broke after PR #1448 removed compile-time + // backend selection, leaving every build on the 8 GB / 4 GB fallback. + ggml_backend_dev_t dev = gpu_backend_ ? ggml_backend_get_device(gpu_backend_) : nullptr; + if (dev != nullptr) { + ggml_backend_dev_memory(dev, &free_vram_, &total_vram_); + } else { + total_vram_ = 8ULL * 1024 * 1024 * 1024; + free_vram_ = total_vram_ / 2; + } + // If the caller set a `--max-vram` budget, treat that as the upper + // bound on what our streaming planner is allowed to see, so the + // same budget knob drives both leejet's graph-cut path and our + // layer-streaming path. Lets users simulate a smaller card without + // needing a separate flag. 
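+        // e.g. a 24 GB card reporting 20 GB free with --max-vram set to 8 GB
+        // ends up with free_vram_ = total_vram_ = 8 GB here, so every
+        // downstream budget decision sees the simulated smaller card.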
+ if (max_vram_cap_bytes_ > 0) { + if (max_vram_cap_bytes_ < free_vram_) { + free_vram_ = max_vram_cap_bytes_; + } + if (max_vram_cap_bytes_ < total_vram_) { + total_vram_ = max_vram_cap_bytes_; + } + } + LOG_DEBUG("total VRAM = %.2f GB, free = %.2f GB", + total_vram_ / (1024.0 * 1024.0 * 1024.0), + free_vram_ / (1024.0 * 1024.0 * 1024.0)); + } + + void set_max_vram_cap_bytes(size_t bytes) { + max_vram_cap_bytes_ = bytes; + } + + void set_compute_buffer_reserve(size_t bytes) { + compute_buffer_reserve_ = bytes; + } + + size_t get_compute_buffer_reserve() const { + return compute_buffer_reserve_; + } + + size_t get_free_vram() { + query_device_memory(); + return free_vram_; + } + + size_t get_total_vram() const { + return total_vram_; + } + + size_t get_available_vram() { + size_t free = get_free_vram(); + if (free <= safety_margin_) { + return 0; + } + return free - safety_margin_; + } + + bool has_enough_vram(size_t required_bytes) { + return get_available_vram() >= required_bytes; + } + + // Evicts other layers if necessary to make room + bool ensure_vram_for_layer(const std::string& layer_name, int current_layer_idx = -1) { + if (registry_.is_layer_on_gpu(layer_name)) { + return true; + } + + size_t layer_size = registry_.get_layer_size(layer_name); + if (layer_size == 0) { + LOG_ERROR("layer '%s' not found", layer_name.c_str()); + return false; + } + + if (has_enough_vram(layer_size)) { + return true; + } + + size_t needed = layer_size - get_available_vram(); + return evict_layers_for_space(needed, layer_name, current_layer_idx); + } + + // Dry-run allocation to get exact buffer requirements + size_t estimate_compute_buffer_size(ggml_cgraph* graph) { + if (graph == nullptr) { + return 0; + } + + ggml_gallocr_t temp_allocr = ggml_gallocr_new( + ggml_backend_get_default_buffer_type(gpu_backend_)); + + if (!ggml_gallocr_reserve(temp_allocr, graph)) { + ggml_gallocr_free(temp_allocr); + return 0; + } + + size_t compute_size = ggml_gallocr_get_buffer_size(temp_allocr, 0); + ggml_gallocr_free(temp_allocr); + + return compute_size; + } + + bool should_offload_layer(const std::string& layer_name, + const std::string& next_layer_name, + int keep_layers_ahead = 1) { + size_t next_layer_size = registry_.get_layer_size(next_layer_name); + if (has_enough_vram(next_layer_size * (keep_layers_ahead + 1))) { + return false; + } + return true; + } + + std::vector get_suggested_gpu_layers(int current_layer_idx, + int layers_ahead = 1, + int layers_behind = 0) { + auto all_layers = registry_.get_layer_names_sorted(); + std::vector result; + + for (const auto& name : all_layers) { + if (name == "_global") { + result.push_back(name); + continue; + } + + // TODO: filter by index range once layer index tracking is implemented + result.push_back(name); + } + + return result; + } + +private: + bool evict_layers_for_space(size_t bytes_needed, + const std::string& protected_layer, + int current_layer_idx) { + auto layers_on_gpu = registry_.get_layers_on_gpu(); + if (layers_on_gpu.empty()) { + LOG_ERROR("no layers to evict but need %.2f MB", + bytes_needed / (1024.0 * 1024.0)); + return false; + } + + layers_on_gpu.erase( + std::remove(layers_on_gpu.begin(), layers_on_gpu.end(), protected_layer), + layers_on_gpu.end()); + + // _global contains shared tensors, never evict + layers_on_gpu.erase( + std::remove(layers_on_gpu.begin(), layers_on_gpu.end(), "_global"), + layers_on_gpu.end()); + + if (layers_on_gpu.empty()) { + LOG_ERROR("no evictable layers available"); + return false; + } + + std::vector> scored_layers; 
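+        // Score every evictable resident layer, sort descending, then evict
+        // from the top until enough bytes are freed (higher score = better
+        // eviction candidate, see compute_eviction_score()).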
+ for (const auto& layer : layers_on_gpu) { + int score = compute_eviction_score(layer, current_layer_idx); + scored_layers.push_back({layer, score}); + } + + std::sort(scored_layers.begin(), scored_layers.end(), + [](const auto& a, const auto& b) { return a.second > b.second; }); + + size_t freed = 0; + for (const auto& [layer, score] : scored_layers) { + size_t layer_size = registry_.get_layer_size(layer); + registry_.move_layer_to_cpu(layer); + freed += layer_size; + + LOG_DEBUG("evicted layer '%s' (%.2f MB), total freed: %.2f MB", + layer.c_str(), + layer_size / (1024.0 * 1024.0), + freed / (1024.0 * 1024.0)); + + if (freed >= bytes_needed) { + return true; + } + } + + LOG_WARN("only freed %.2f MB, needed %.2f MB", + freed / (1024.0 * 1024.0), + bytes_needed / (1024.0 * 1024.0)); + return freed >= bytes_needed; + } + + // Higher score = more likely to evict + int compute_eviction_score(const std::string& layer, int current_layer_idx) { + switch (eviction_policy_) { + case EvictionPolicy::LAYER_DISTANCE: { + int layer_idx = extract_layer_index(layer); + if (layer_idx < 0 || current_layer_idx < 0) { + return 0; + } + return std::abs(layer_idx - current_layer_idx); + } + + case EvictionPolicy::LARGEST_FIRST: { + return static_cast(registry_.get_layer_size(layer) / (1024 * 1024)); + } + + case EvictionPolicy::LRU: + default: + // TODO: LRU needs access tracking in TensorRegistry, falling back to size-based + return static_cast(registry_.get_layer_size(layer) / (1024 * 1024)); + } + } + + int extract_layer_index(const std::string& layer_name) { + size_t db_pos = layer_name.find("double_blocks."); + if (db_pos != std::string::npos) { + size_t num_start = db_pos + 14; + try { + return std::stoi(layer_name.substr(num_start)); + } catch (...) { + return -1; + } + } + + size_t sb_pos = layer_name.find("single_blocks."); + if (sb_pos != std::string::npos) { + size_t num_start = sb_pos + 14; + try { + return 19 + std::stoi(layer_name.substr(num_start)); // offset past double_blocks + } catch (...) { + return -1; + } + } + + size_t ib_pos = layer_name.find("input_blocks."); + if (ib_pos != std::string::npos) { + size_t num_start = ib_pos + 13; + try { + return std::stoi(layer_name.substr(num_start)); + } catch (...) { + return -1; + } + } + + size_t ob_pos = layer_name.find("output_blocks."); + if (ob_pos != std::string::npos) { + size_t num_start = ob_pos + 14; + try { + return 200 + std::stoi(layer_name.substr(num_start)); + } catch (...) { + return -1; + } + } + + if (layer_name.find("middle_block") != std::string::npos) { + return 100; + } + + return -1; + } + + TensorRegistry& registry_; + ggml_backend_t gpu_backend_; + + size_t total_vram_ = 0; + size_t free_vram_ = 0; + size_t safety_margin_ = 512 * 1024 * 1024; + size_t max_vram_cap_bytes_ = 0; // 0 = no cap; set by --max-vram + size_t compute_buffer_reserve_ = 768ULL * 1024 * 1024; // headroom for the active block's compute graph + // alloc; matches compute_resident_block_count default. + // Used by analyze_vram_budget() to avoid picking + // coarse-stage when params fit but params + CB + // would exceed VRAM. 
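+    // Default policy: evict the resident block whose index is farthest from
+    // the currently executing one (see compute_eviction_score()).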
+ + EvictionPolicy eviction_policy_ = EvictionPolicy::LAYER_DISTANCE; +}; + +} // namespace LayerStreaming + +#endif // __MEMORY_BUDGET_HPP__ diff --git a/src/mmdit.hpp b/src/mmdit.hpp index e57041dc9..fd305c3e0 100644 --- a/src/mmdit.hpp +++ b/src/mmdit.hpp @@ -3,7 +3,9 @@ #include +#include "common_dit.hpp" #include "ggml_extend.hpp" +#include "layer_streaming.hpp" #include "model.h" #define MMDIT_GRAPH_SIZE 10240 @@ -745,6 +747,64 @@ struct MMDiT : public GGMLBlock { return spatial_pos_embed; } + struct StreamingInputResult { + ggml_tensor* x; // [N, H*W, hidden_size] + ggml_tensor* context; // [N, L, hidden_size] + ggml_tensor* c_mod; // [N, hidden_size] + }; + + StreamingInputResult forward_input_stage(GGMLRunnerContext* ctx, + ggml_tensor* x, + ggml_tensor* t, + ggml_tensor* y, + ggml_tensor* context, + int64_t H, int64_t W) { + auto x_embedder = std::dynamic_pointer_cast(blocks["x_embedder"]); + auto t_embedder = std::dynamic_pointer_cast(blocks["t_embedder"]); + + // Patch embed + pos embed + auto patch_embed = x_embedder->forward(ctx, x); // [N, H*W, hidden_size] + auto pos_embed_out = cropped_pos_embed(ctx->ggml_ctx, H, W); // [1, H*W, hidden_size] + x = ggml_add(ctx->ggml_ctx, patch_embed, pos_embed_out); // [N, H*W, hidden_size] + + // Timestep embedding + auto c = t_embedder->forward(ctx, t); // [N, hidden_size] + + // Y embedding (if present) + if (y != nullptr && adm_in_channels != -1) { + auto y_embedder = std::dynamic_pointer_cast(blocks["y_embedder"]); + y = y_embedder->forward(ctx, y); // [N, hidden_size] + c = ggml_add(ctx->ggml_ctx, c, y); + } + + // Context embedding + if (context != nullptr) { + auto context_embedder = std::dynamic_pointer_cast(blocks["context_embedder"]); + context = context_embedder->forward(ctx, context); // [N, L, hidden_size] + } + + return {x, context, c}; + } + + std::pair forward_joint_block(GGMLRunnerContext* ctx, + int block_idx, + ggml_tensor* context, + ggml_tensor* x, + ggml_tensor* c_mod) { + auto block = std::dynamic_pointer_cast(blocks["joint_blocks." + std::to_string(block_idx)]); + return block->forward(ctx, context, x, c_mod); + } + + ggml_tensor* forward_output_stage(GGMLRunnerContext* ctx, + ggml_tensor* x, + ggml_tensor* c_mod) { + auto final_layer = std::dynamic_pointer_cast(blocks["final_layer"]); + return final_layer->forward(ctx, x, c_mod); // (N, H*W, patch_size ** 2 * out_channels) + } + + int get_depth() const { return depth; } + int get_patch_size() const { return patch_size; } + ggml_tensor* forward_core_with_concat(GGMLRunnerContext* ctx, ggml_tensor* x, ggml_tensor* c_mod, @@ -827,6 +887,10 @@ struct MMDiT : public GGMLBlock { struct MMDiTRunner : public GGMLRunner { MMDiT mmdit; + // Static layer cache decided on the first sampling step. -1 = not yet + // computed; 0..N = number of joint_blocks kept resident on GPU. 
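+    // Decided once via compute_resident_block_count() on the first per-layer
+    // streaming pass; blocks [0, resident_joint_blocks_) then keep their
+    // weights on the GPU across all remaining sampling steps while the rest
+    // are streamed.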
+ int resident_joint_blocks_ = -1; + MMDiTRunner(ggml_backend_t backend, bool offload_params_to_cpu, const String2TensorStorage& tensor_storage_map = {}, @@ -843,6 +907,353 @@ struct MMDiTRunner : public GGMLRunner { mmdit.get_param_tensors(tensors, prefix); } + void enable_layer_streaming(const LayerStreaming::StreamingConfig& config = {}) { + std::map tensor_map; + mmdit.get_param_tensors(tensor_map, "model.diffusion_model"); + init_streaming(config, tensor_map, LayerStreaming::mmdit_layer_pattern); + LOG_INFO("%s layer streaming enabled (%zu layers)", + get_desc().c_str(), streaming_engine_->get_registry().get_layer_count()); + } + + bool compute_streaming(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* y, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr, + std::vector skip_layers = std::vector()) { + if (!is_streaming_enabled()) { + LOG_ERROR("%s streaming not enabled", get_desc().c_str()); + return false; + } + + int64_t t0 = ggml_time_ms(); + auto analysis = analyze_vram_budget(); + + if (analysis.fits_in_vram) { + LOG_INFO("%s model fits in VRAM, using coarse-stage streaming", get_desc().c_str()); + load_all_layers_coarse(); + bool result = compute(n_threads, x, timesteps, context, y, output, output_ctx, skip_layers, true); + int64_t t1 = ggml_time_ms(); + LOG_INFO("%s coarse-stage streaming completed in %.2fs", get_desc().c_str(), (t1 - t0) / 1000.0); + free_compute_buffer(); + return result; + } + + LOG_INFO("%s remaining %.2f GB exceeds available %.2f GB, using per-layer streaming", + get_desc().c_str(), + analysis.remaining_to_load / (1024.0 * 1024.0 * 1024.0), + analysis.available_vram / (1024.0 * 1024.0 * 1024.0)); + + return compute_streaming_true(n_threads, x, timesteps, context, y, output, output_ctx, skip_layers); + } + + bool compute_streaming_true(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* y, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr, + std::vector skip_layers = std::vector()) { + auto& registry = streaming_engine_->get_registry(); + int64_t t_start = ggml_time_ms(); + + const int num_blocks = mmdit.get_depth(); + const int patch_size = mmdit.get_patch_size(); + const int64_t W = x->ne[0]; + const int64_t H = x->ne[1]; + + LOG_INFO("TRUE per-layer streaming - %d joint_blocks", num_blocks); + + // Load global layers + LOG_DEBUG("Loading global layers"); + if (!registry.move_layer_to_gpu("_global")) { + LOG_ERROR("Failed to load _global to GPU"); + return false; + } + + // Persistent storage for intermediate tensors. Backed by a single + // GPU-pinned host buffer (ensure_pinned_act_buffers) so per-block + // ggml_backend_tensor_get / set_backend_tensor_data run at full + // PCIe bandwidth. context is optional (some MMDiT variants omit it). 
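+        // Pattern: on success the raw float* views below alias one shared
+        // pinned host buffer; on failure each aliases its own pageable
+        // fallback vector. The streaming loop only ever touches the float*
+        // views, so both paths behave identically apart from transfer speed.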
+ std::vector persistent_x_fallback; + std::vector persistent_context_fallback; + std::vector persistent_c_mod_fallback; + float* persistent_x = nullptr; + float* persistent_context = nullptr; + float* persistent_c_mod = nullptr; + size_t persistent_x_count = 0; + size_t persistent_context_count = 0; + size_t persistent_c_mod_count = 0; + int64_t x_ne[4], context_ne[4], c_mod_ne[4]; + + LOG_DEBUG("Executing input stage"); + { + ggml_tensor* x_output = nullptr; + ggml_tensor* context_output = nullptr; + ggml_tensor* c_mod_output = nullptr; + + auto get_input_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(MMDIT_GRAPH_SIZE / 4); + auto runner_ctx = get_context(); + + ggml_tensor* x_backend = to_backend(x); + ggml_tensor* timesteps_backend = to_backend(timesteps); + ggml_tensor* y_backend = y ? to_backend(y) : nullptr; + ggml_tensor* context_backend = context ? to_backend(context) : nullptr; + + auto result = mmdit.forward_input_stage(&runner_ctx, x_backend, timesteps_backend, + y_backend, context_backend, H, W); + + x_output = result.x; + context_output = result.context; + c_mod_output = result.c_mod; + + ggml_build_forward_expand(gf, x_output); + if (context_output) ggml_build_forward_expand(gf, context_output); + ggml_build_forward_expand(gf, c_mod_output); + + return gf; + }; + + // Don't free compute buffer immediately - we need to read outputs first + if (!GGMLRunner::compute(get_input_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Input stage failed"); + return false; + } + + // Extract to persistent storage + if (x_output && c_mod_output) { + size_t x_size = ggml_nelements(x_output); + size_t c_mod_size = ggml_nelements(c_mod_output); + size_t context_size = context_output ? ggml_nelements(context_output) : 0; + + persistent_x_count = x_size; + persistent_c_mod_count = c_mod_size; + persistent_context_count = context_size; + + std::vector ptrs; + if (ensure_pinned_act_buffers({x_size * sizeof(float), + c_mod_size * sizeof(float), + context_size * sizeof(float)}, ptrs)) { + persistent_x = ptrs[0]; + persistent_c_mod = ptrs[1]; + persistent_context = context_size ? ptrs[2] : nullptr; + } else { + persistent_x_fallback.resize(x_size); + persistent_c_mod_fallback.resize(c_mod_size); + persistent_x = persistent_x_fallback.data(); + persistent_c_mod = persistent_c_mod_fallback.data(); + if (context_size) { + persistent_context_fallback.resize(context_size); + persistent_context = persistent_context_fallback.data(); + } + } + + ggml_backend_tensor_get(x_output, persistent_x, 0, x_size * sizeof(float)); + ggml_backend_tensor_get(c_mod_output, persistent_c_mod, 0, c_mod_size * sizeof(float)); + + for (int i = 0; i < 4; i++) { + x_ne[i] = x_output->ne[i]; + c_mod_ne[i] = c_mod_output->ne[i]; + } + + if (context_output) { + ggml_backend_tensor_get(context_output, persistent_context, 0, context_size * sizeof(float)); + for (int i = 0; i < 4; i++) { + context_ne[i] = context_output->ne[i]; + } + } + } else { + LOG_ERROR("Failed to get input stage outputs"); + free_compute_buffer(); + return false; + } + + // Now safe to free compute buffer + free_compute_buffer(); + } + + LOG_DEBUG("Input stage done, x=%ldx%ldx%ld", x_ne[0], x_ne[1], x_ne[2]); + + auto block_name_at = [](int i) { return "joint_blocks." 
+ std::to_string(i); }; + if (streaming_engine_) { + if (resident_joint_blocks_ < 0) { + resident_joint_blocks_ = streaming_engine_->compute_resident_block_count( + "joint_blocks.0", num_blocks); + LOG_INFO("%s joint_blocks cache: %d resident, %d streamed per step", + get_desc().c_str(), + resident_joint_blocks_, + num_blocks - resident_joint_blocks_); + } + + int prefetch_start = 0; + while (prefetch_start < num_blocks && + registry.is_layer_on_gpu(block_name_at(prefetch_start))) { + prefetch_start++; + } + streaming_engine_->prime_prefetch(block_name_at, prefetch_start, num_blocks); + } + + for (int block_idx = 0; block_idx < num_blocks; block_idx++) { + // Check skip_layers + if (skip_layers.size() > 0 && std::find(skip_layers.begin(), skip_layers.end(), block_idx) != skip_layers.end()) { + LOG_DEBUG("Skipping joint_block %d", block_idx); + continue; + } + + std::string block_name = block_name_at(block_idx); + int64_t t_block_start = ggml_time_ms(); + + // Wait for this block's prefetch to complete (if async prefetch was started) + if (streaming_engine_) { + streaming_engine_->wait_for_prefetch(block_name); + } + + // Load this block's weights (sync load if prefetch didn't happen) + if (!registry.move_layer_to_gpu(block_name)) { + LOG_ERROR("Failed to load %s", block_name.c_str()); + return false; + } + + // Keep the prefetch window full + if (streaming_engine_) { + streaming_engine_->advance_prefetch(block_name_at, block_idx, num_blocks); + } + + ggml_tensor* x_out = nullptr; + ggml_tensor* context_out = nullptr; + + auto get_block_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(MMDIT_GRAPH_SIZE / 4); + + // Create input tensors from persistent storage + ggml_tensor* x_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, x_ne[0], x_ne[1], x_ne[2], x_ne[3]); + ggml_tensor* c_mod_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, c_mod_ne[0], c_mod_ne[1], c_mod_ne[2], c_mod_ne[3]); + + x_in = to_backend(x_in); + c_mod_in = to_backend(c_mod_in); + + set_backend_tensor_data(x_in, persistent_x); + set_backend_tensor_data(c_mod_in, persistent_c_mod); + + ggml_tensor* context_in = nullptr; + if (persistent_context_count > 0) { + context_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, context_ne[0], context_ne[1], context_ne[2], context_ne[3]); + context_in = to_backend(context_in); + set_backend_tensor_data(context_in, persistent_context); + } + + auto runner_ctx = get_context(); + auto result = mmdit.forward_joint_block(&runner_ctx, block_idx, context_in, x_in, c_mod_in); + + context_out = result.first; + x_out = result.second; + + if (context_out) ggml_build_forward_expand(gf, context_out); + ggml_build_forward_expand(gf, x_out); + + return gf; + }; + + // Don't free compute buffer immediately - we need to read outputs first + if (!GGMLRunner::compute(get_block_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Joint block %d execution failed", block_idx); + return false; + } + + // Extract outputs to persistent storage + if (x_out) { + ggml_backend_tensor_get(x_out, persistent_x, 0, persistent_x_count * sizeof(float)); + for (int i = 0; i < 4; i++) { + x_ne[i] = x_out->ne[i]; + } + } + if (context_out && persistent_context_count > 0) { + ggml_backend_tensor_get(context_out, persistent_context, 0, persistent_context_count * sizeof(float)); + for (int i = 0; i < 4; i++) { + context_ne[i] = context_out->ne[i]; + } + } + + // Now safe to free compute buffer + free_compute_buffer(); + + // Resident blocks stay on GPU across sampling steps. 
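+            // Blocks below the resident threshold keep their weights on the
+            // GPU for the next sampling step; everything at or above it is
+            // evicted so the prefetch window can reuse that VRAM.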
+ if (block_idx >= resident_joint_blocks_) { + registry.move_layer_to_cpu(block_name); + } + + LOG_DEBUG("Joint block %d/%d done (%.2fms)", + block_idx + 1, num_blocks, (ggml_time_ms() - t_block_start) / 1.0); + } + + LOG_DEBUG("Executing output stage"); + { + auto get_output_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(MMDIT_GRAPH_SIZE / 4); + + ggml_tensor* x_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, x_ne[0], x_ne[1], x_ne[2], x_ne[3]); + ggml_tensor* c_mod_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, c_mod_ne[0], c_mod_ne[1], c_mod_ne[2], c_mod_ne[3]); + + x_in = to_backend(x_in); + c_mod_in = to_backend(c_mod_in); + + set_backend_tensor_data(x_in, persistent_x); + set_backend_tensor_data(c_mod_in, persistent_c_mod); + + auto runner_ctx = get_context(); + auto final_out = mmdit.forward_output_stage(&runner_ctx, x_in, c_mod_in); + + // Unpatchify + final_out = DiT::unpatchify_and_crop(compute_ctx, final_out, H, W, patch_size, patch_size, /*patch_last*/ false); + + ggml_build_forward_expand(gf, final_out); + + return gf; + }; + + if (!GGMLRunner::compute(get_output_graph, n_threads, true, output, output_ctx, true)) { + LOG_ERROR("Output stage failed"); + return false; + } + } + + int64_t t_end = ggml_time_ms(); + LOG_INFO("TRUE per-layer streaming completed in %.2fs (%d joint_blocks)", + (t_end - t_start) / 1000.0, num_blocks); + + return true; + } + + // Old-style build_graph for streaming code that uses raw ggml_tensor pointers + ggml_cgraph* build_graph(ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* y, + std::vector skip_layers = std::vector()) { + ggml_cgraph* gf = new_graph_custom(MMDIT_GRAPH_SIZE); + + x = to_backend(x); + context = to_backend(context); + y = to_backend(y); + timesteps = to_backend(timesteps); + + auto runner_ctx = get_context(); + ggml_tensor* out = mmdit.forward(&runner_ctx, + x, + timesteps, + y, + context, + skip_layers); + + ggml_build_forward_expand(gf, out); + + return gf; + } + ggml_cgraph* build_graph(const sd::Tensor& x_tensor, const sd::Tensor& timesteps_tensor, const sd::Tensor& context_tensor = {}, @@ -868,6 +1279,23 @@ struct MMDiTRunner : public GGMLRunner { return gf; } + // Old-style compute for streaming code that uses raw ggml_tensor pointers + bool compute(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* y, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr, + std::vector skip_layers = std::vector(), + bool skip_param_offload = false) { + auto get_graph = [&]() -> ggml_cgraph* { + return build_graph(x, timesteps, context, y, skip_layers); + }; + + return GGMLRunner::compute(get_graph, n_threads, false, output, output_ctx, skip_param_offload); + } + sd::Tensor compute(int n_threads, const sd::Tensor& x, const sd::Tensor& timesteps, diff --git a/src/qwen_image.hpp b/src/qwen_image.hpp index 35d32109e..d89b19950 100644 --- a/src/qwen_image.hpp +++ b/src/qwen_image.hpp @@ -5,6 +5,7 @@ #include "common_block.hpp" #include "flux.hpp" +#include "layer_streaming.hpp" namespace Qwen { constexpr int QWEN_IMAGE_GRAPH_SIZE = 20480; @@ -436,6 +437,92 @@ namespace Qwen { return img; } + struct StreamingInputResult { + ggml_tensor* img; + ggml_tensor* txt; + ggml_tensor* t_emb; + }; + + StreamingInputResult forward_input_stage(GGMLRunnerContext* ctx, + ggml_tensor* x, + ggml_tensor* timestep, + ggml_tensor* context, + std::vector ref_latents = {}, + int64_t* out_img_tokens = nullptr) { + auto 
time_text_embed = std::dynamic_pointer_cast(blocks["time_text_embed"]); + auto txt_norm = std::dynamic_pointer_cast(blocks["txt_norm"]); + auto img_in = std::dynamic_pointer_cast(blocks["img_in"]); + auto txt_in = std::dynamic_pointer_cast(blocks["txt_in"]); + + auto t_emb = time_text_embed->forward(ctx, timestep); + if (params.zero_cond_t) { + auto t_emb_0 = time_text_embed->forward(ctx, ggml_ext_zeros(ctx->ggml_ctx, timestep->ne[0], timestep->ne[1], timestep->ne[2], timestep->ne[3])); + t_emb = ggml_concat(ctx->ggml_ctx, t_emb, t_emb_0, 1); + } + + // Patchify input (same as main forward()) + auto img_patched = DiT::pad_and_patchify(ctx, x, params.patch_size, params.patch_size); + int64_t img_tokens = img_patched->ne[1]; + + // Handle reference latents + if (ref_latents.size() > 0) { + for (ggml_tensor* ref : ref_latents) { + ref = DiT::pad_and_patchify(ctx, ref, params.patch_size, params.patch_size); + img_patched = ggml_concat(ctx->ggml_ctx, img_patched, ref, 1); + } + } + + auto img = img_in->forward(ctx, img_patched); + auto txt = txt_norm->forward(ctx, context); + txt = txt_in->forward(ctx, txt); + + if (out_img_tokens) { + *out_img_tokens = img_tokens; + } + + return {img, txt, t_emb}; + } + + std::pair forward_single_block(GGMLRunnerContext* ctx, + int block_idx, + ggml_tensor* img, + ggml_tensor* txt, + ggml_tensor* t_emb, + ggml_tensor* pe, + ggml_tensor* modulate_index = nullptr) { + auto block = std::dynamic_pointer_cast(blocks["transformer_blocks." + std::to_string(block_idx)]); + return block->forward(ctx, img, txt, t_emb, pe, modulate_index); + } + + ggml_tensor* forward_output_stage(GGMLRunnerContext* ctx, + ggml_tensor* img, + ggml_tensor* t_emb, + int64_t img_tokens, + int64_t orig_H, + int64_t orig_W) { + auto norm_out = std::dynamic_pointer_cast(blocks["norm_out"]); + auto proj_out = std::dynamic_pointer_cast(blocks["proj_out"]); + + if (params.zero_cond_t) { + t_emb = ggml_ext_chunk(ctx->ggml_ctx, t_emb, 2, 1)[0]; + } + + // Trim to original img_tokens if ref_latents were used + if (img->ne[1] > img_tokens) { + img = ggml_cont(ctx->ggml_ctx, ggml_permute(ctx->ggml_ctx, img, 0, 2, 1, 3)); + img = ggml_view_3d(ctx->ggml_ctx, img, img->ne[0], img->ne[1], img_tokens, img->nb[1], img->nb[2], 0); + img = ggml_cont(ctx->ggml_ctx, ggml_permute(ctx->ggml_ctx, img, 0, 2, 1, 3)); + } + + img = norm_out->forward(ctx, img, t_emb); + img = proj_out->forward(ctx, img); + + // Unpatchify and crop + img = DiT::unpatchify_and_crop(ctx->ggml_ctx, img, orig_H, orig_W, params.patch_size, params.patch_size); + + return img; + } + ggml_tensor* forward(GGMLRunnerContext* ctx, ggml_tensor* x, ggml_tensor* timestep, @@ -487,6 +574,10 @@ namespace Qwen { std::vector modulate_index_vec; SDVersion version; + // Static layer cache decided on the first sampling step. -1 = not yet + // computed; 0..N = number of "transformer_blocks.X" kept resident. 
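+        // Same static-prefix scheme as MMDiTRunner::resident_joint_blocks_:
+        // computed once, then the first N transformer_blocks stay resident.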
+ int resident_transformer_blocks_ = -1; + QwenImageRunner(ggml_backend_t backend, bool offload_params_to_cpu, const String2TensorStorage& tensor_storage_map = {}, @@ -532,6 +623,485 @@ namespace Qwen { qwen_image.get_param_tensors(tensors, prefix); } + public: + void enable_layer_streaming(const LayerStreaming::StreamingConfig& config = {}) { + std::map tensor_map; + qwen_image.get_param_tensors(tensor_map, "model.diffusion_model"); + init_streaming(config, tensor_map, LayerStreaming::qwen_image_layer_pattern); + LOG_INFO("%s layer streaming enabled (%zu layers)", + get_desc().c_str(), streaming_engine_->get_registry().get_layer_count()); + } + + bool compute_streaming(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + std::vector ref_latents = {}, + bool increase_ref_index = false, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr) { + if (!is_streaming_enabled()) { + LOG_ERROR("%s streaming not enabled", get_desc().c_str()); + return false; + } + + int64_t t0 = ggml_time_ms(); + auto analysis = analyze_vram_budget(); + + if (analysis.fits_in_vram) { + LOG_INFO("%s model fits in VRAM, using coarse-stage streaming", get_desc().c_str()); + load_all_layers_coarse(); + bool result = compute(n_threads, x, timesteps, context, ref_latents, increase_ref_index, + output, output_ctx, true); + int64_t t1 = ggml_time_ms(); + LOG_INFO("%s coarse-stage streaming completed in %.2fs", get_desc().c_str(), (t1 - t0) / 1000.0); + free_compute_buffer(); + return result; + } + + LOG_INFO("%s remaining %.2f GB exceeds available %.2f GB, using per-layer streaming", + get_desc().c_str(), + analysis.remaining_to_load / (1024.0 * 1024.0 * 1024.0), + analysis.available_vram / (1024.0 * 1024.0 * 1024.0)); + + return compute_streaming_true(n_threads, x, timesteps, context, ref_latents, increase_ref_index, output, output_ctx); + } + + private: + // Persistent storage for intermediate tensors between layer executions + struct StreamingState { + std::vector img_data; + std::vector txt_data; + std::vector t_emb_data; + std::vector pe_data; + std::vector modulate_index_data; + + // Tensor dimensions + int64_t img_ne[4]; + int64_t txt_ne[4]; + int64_t t_emb_ne[4]; + int64_t pe_ne[4]; + int64_t modulate_index_ne[4]; + bool has_modulate_index = false; + }; + + void copy_tensor_to_storage(ggml_tensor* tensor, std::vector& storage, int64_t* ne) { + size_t nelements = ggml_nelements(tensor); + storage.resize(nelements); + + // Copy to CPU if needed + ggml_backend_tensor_get(tensor, storage.data(), 0, nelements * sizeof(float)); + + // Store dimensions + for (int i = 0; i < 4; i++) { + ne[i] = tensor->ne[i]; + } + } + + ggml_tensor* create_tensor_from_storage(ggml_context* ctx, const std::vector& storage, + const int64_t* ne, const char* name) { + ggml_tensor* tensor = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, ne[0], ne[1], ne[2], ne[3]); + ggml_set_name(tensor, name); + return tensor; + } + + bool compute_streaming_true(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + std::vector ref_latents, + bool increase_ref_index, + ggml_tensor** output, + ggml_context* output_ctx) { + auto& registry = streaming_engine_->get_registry(); + int64_t t_start = ggml_time_ms(); + + const int num_layers = qwen_image_params.num_layers; + LOG_INFO("TRUE per-layer streaming - %d blocks (one at a time)", num_layers); + + // Phase 1: Load global layers (_global contains input/output projections) + LOG_DEBUG("Loading global layers"); + if 
(!registry.move_layer_to_gpu("_global")) { + LOG_ERROR("Failed to load _global to GPU"); + return false; + } + + // Pre-generate PE and modulate_index vectors (needed for all blocks) + pe_vec = Rope::gen_qwen_image_pe(static_cast(x->ne[1]), + static_cast(x->ne[0]), + qwen_image_params.patch_size, + static_cast(x->ne[3]), + static_cast(context->ne[1]), + ref_latents, + increase_ref_index, + qwen_image_params.theta, + circular_y_enabled, + circular_x_enabled, + qwen_image_params.axes_dim); + + if (qwen_image_params.zero_cond_t) { + modulate_index_vec.clear(); + int64_t h_len = ((x->ne[1] + (qwen_image_params.patch_size / 2)) / qwen_image_params.patch_size); + int64_t w_len = ((x->ne[0] + (qwen_image_params.patch_size / 2)) / qwen_image_params.patch_size); + int64_t num_img_tokens = h_len * w_len; + modulate_index_vec.insert(modulate_index_vec.end(), num_img_tokens, 0.f); + + int64_t num_ref_img_tokens = 0; + for (ggml_tensor* ref : ref_latents) { + int64_t rh_len = ((ref->ne[1] + (qwen_image_params.patch_size / 2)) / qwen_image_params.patch_size); + int64_t rw_len = ((ref->ne[0] + (qwen_image_params.patch_size / 2)) / qwen_image_params.patch_size); + num_ref_img_tokens += rh_len * rw_len; + } + if (num_ref_img_tokens > 0) { + modulate_index_vec.insert(modulate_index_vec.end(), num_ref_img_tokens, 1.f); + } + } + + // TRUE per-layer streaming with mini-graphs + // Execute each block as a separate mini-graph to minimize activation memory + + int64_t t_blocks_start = ggml_time_ms(); + + // Store original image dimensions for unpatchify + int64_t orig_H = x->ne[1]; + int64_t orig_W = x->ne[0]; + + // Persistent storage. Backed by a single GPU-pinned host buffer + // (ensure_pinned_act_buffers) so per-block ggml_backend_tensor_get + // / set_backend_tensor_data run at full PCIe bandwidth. Falls back + // to pageable std::vector if pinned alloc fails. 
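+        // Why pinned memory matters here (figures are rough assumptions,
+        // not measurements): pageable host memory forces the driver to
+        // stage each transfer through an internal bounce buffer, while
+        // pinned memory can be DMA'd directly, often close to doubling
+        // effective PCIe throughput. The activations make two round-trips
+        // per block per step, so the difference compounds quickly.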
+ std::vector persistent_img_fallback; + std::vector persistent_txt_fallback; + std::vector persistent_t_emb_fallback; + float* persistent_img = nullptr; + float* persistent_txt = nullptr; + float* persistent_t_emb = nullptr; + size_t persistent_img_count = 0; + size_t persistent_txt_count = 0; + size_t persistent_t_emb_count = 0; + int64_t img_ne[4], txt_ne[4], t_emb_ne[4]; + int64_t img_tokens_count = 0; + + LOG_DEBUG("Executing input stage"); + { + // Build mini-graph for input projections only + ggml_cgraph* input_graph = nullptr; + ggml_tensor* img_output = nullptr; + ggml_tensor* txt_output = nullptr; + ggml_tensor* t_emb_output = nullptr; + int64_t img_tokens_local = 0; + + auto get_input_graph = [&]() -> ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(QWEN_IMAGE_GRAPH_SIZE / 4); // Smaller graph + + ggml_tensor* x_backend = to_backend(x); + ggml_tensor* context_backend = to_backend(context); + ggml_tensor* timesteps_backend = to_backend(timesteps); + + // Convert ref_latents to backend + std::vector ref_latents_backend; + for (auto& ref : ref_latents) { + ref_latents_backend.push_back(to_backend(ref)); + } + + auto runner_ctx = get_context(); + auto result = qwen_image.forward_input_stage(&runner_ctx, x_backend, timesteps_backend, context_backend, + ref_latents_backend, &img_tokens_local); + + img_output = result.img; + txt_output = result.txt; + t_emb_output = result.t_emb; + + // Concatenate outputs into single tensor for extraction + // We'll use img as the primary output and extract separately + ggml_build_forward_expand(gf, result.img); + ggml_build_forward_expand(gf, result.txt); + ggml_build_forward_expand(gf, result.t_emb); + + return gf; + }; + + // Execute input stage - don't free compute buffer immediately + if (!GGMLRunner::compute(get_input_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Input stage failed"); + return false; + } + + img_tokens_count = img_tokens_local; + + // Extract computed tensors to persistent storage + if (img_output && txt_output && t_emb_output) { + // Copy tensor data to CPU storage + size_t img_size = ggml_nelements(img_output); + size_t txt_size = ggml_nelements(txt_output); + size_t t_emb_size = ggml_nelements(t_emb_output); + + persistent_img_count = img_size; + persistent_txt_count = txt_size; + persistent_t_emb_count = t_emb_size; + + std::vector ptrs; + if (ensure_pinned_act_buffers({img_size * sizeof(float), + txt_size * sizeof(float), + t_emb_size * sizeof(float)}, ptrs)) { + persistent_img = ptrs[0]; + persistent_txt = ptrs[1]; + persistent_t_emb = ptrs[2]; + } else { + persistent_img_fallback.resize(img_size); + persistent_txt_fallback.resize(txt_size); + persistent_t_emb_fallback.resize(t_emb_size); + persistent_img = persistent_img_fallback.data(); + persistent_txt = persistent_txt_fallback.data(); + persistent_t_emb = persistent_t_emb_fallback.data(); + } + + ggml_backend_tensor_get(img_output, persistent_img, 0, img_size * sizeof(float)); + ggml_backend_tensor_get(txt_output, persistent_txt, 0, txt_size * sizeof(float)); + ggml_backend_tensor_get(t_emb_output, persistent_t_emb, 0, t_emb_size * sizeof(float)); + + for (int i = 0; i < 4; i++) { + img_ne[i] = img_output->ne[i]; + txt_ne[i] = txt_output->ne[i]; + t_emb_ne[i] = t_emb_output->ne[i]; + } + } else { + LOG_ERROR("Failed to get input stage outputs"); + free_compute_buffer(); + return false; + } + + // Now safe to free compute buffer + free_compute_buffer(); + } + + LOG_DEBUG("Input stage done, img=%ldx%ldx%ldx%ld, txt=%ldx%ldx%ldx%ld", + img_ne[0], 
img_ne[1], img_ne[2], img_ne[3], + txt_ne[0], txt_ne[1], txt_ne[2], txt_ne[3]); + + auto block_name_at = [](int i) { return "transformer_blocks." + std::to_string(i); }; + + if (resident_transformer_blocks_ < 0) { + resident_transformer_blocks_ = streaming_engine_->compute_resident_block_count( + "transformer_blocks.0", num_layers); + LOG_INFO("%s transformer_blocks cache: %d resident, %d streamed per step", + get_desc().c_str(), + resident_transformer_blocks_, + num_layers - resident_transformer_blocks_); + } + + int prefetch_start = 0; + while (prefetch_start < num_layers && + registry.is_layer_on_gpu(block_name_at(prefetch_start))) { + prefetch_start++; + } + streaming_engine_->prime_prefetch(block_name_at, prefetch_start, num_layers); + + for (int block_idx = 0; block_idx < num_layers; block_idx++) { + std::string block_name = block_name_at(block_idx); + int64_t t_block_start = ggml_time_ms(); + + // Wait for this block's prefetch to complete (if it was prefetched) + streaming_engine_->wait_for_prefetch(block_name); + + // Load this block's weights (sync load if prefetch didn't happen) + if (!registry.move_layer_to_gpu(block_name)) { + LOG_ERROR("Failed to load block %d", block_idx); + return false; + } + + // Keep the prefetch window full + streaming_engine_->advance_prefetch(block_name_at, block_idx, num_layers); + + // Build and execute mini-graph for this block + ggml_tensor* img_out = nullptr; + ggml_tensor* txt_out = nullptr; + + auto get_block_graph = [&]() -> ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(QWEN_IMAGE_GRAPH_SIZE / 4); + + // Create input tensors from persistent storage + ggml_tensor* img_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, img_ne[0], img_ne[1], img_ne[2], img_ne[3]); + ggml_tensor* txt_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, txt_ne[0], txt_ne[1], txt_ne[2], txt_ne[3]); + ggml_tensor* t_emb_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, t_emb_ne[0], t_emb_ne[1], t_emb_ne[2], t_emb_ne[3]); + + // Copy to backend and set data + img_in = to_backend(img_in); + txt_in = to_backend(txt_in); + t_emb_in = to_backend(t_emb_in); + + set_backend_tensor_data(img_in, persistent_img); + set_backend_tensor_data(txt_in, persistent_txt); + set_backend_tensor_data(t_emb_in, persistent_t_emb); + + // Generate PE + int pos_len = static_cast(pe_vec.size() / qwen_image_params.axes_dim_sum / 2); + ggml_tensor* pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, qwen_image_params.axes_dim_sum / 2, pos_len); + set_backend_tensor_data(pe, pe_vec.data()); + + // Modulate index + ggml_tensor* modulate_index = nullptr; + if (qwen_image_params.zero_cond_t && !modulate_index_vec.empty()) { + modulate_index = ggml_new_tensor_1d(compute_ctx, GGML_TYPE_F32, modulate_index_vec.size()); + set_backend_tensor_data(modulate_index, modulate_index_vec.data()); + } + + auto runner_ctx = get_context(); + auto [img_result, txt_result] = qwen_image.forward_single_block(&runner_ctx, block_idx, + img_in, txt_in, t_emb_in, pe, modulate_index); + + img_out = img_result; + txt_out = txt_result; + + ggml_build_forward_expand(gf, img_out); + ggml_build_forward_expand(gf, txt_out); + + return gf; + }; + + // Don't free compute buffer immediately - we need to read outputs first + if (!GGMLRunner::compute(get_block_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Block %d execution failed", block_idx); + return false; + } + + // Extract outputs to persistent storage + if (img_out && txt_out) { + ggml_backend_tensor_get(img_out, persistent_img, 0, 
persistent_img_count * sizeof(float)); + ggml_backend_tensor_get(txt_out, persistent_txt, 0, persistent_txt_count * sizeof(float)); + + for (int i = 0; i < 4; i++) { + img_ne[i] = img_out->ne[i]; + txt_ne[i] = txt_out->ne[i]; + } + } + + // Now safe to free compute buffer + free_compute_buffer(); + + // Resident blocks stay on GPU across sampling steps. + if (block_idx >= resident_transformer_blocks_) { + registry.move_layer_to_cpu(block_name); + } + + LOG_DEBUG("Block %d/%d done (%.2fms)", + block_idx + 1, num_layers, (ggml_time_ms() - t_block_start) / 1.0); + } + + LOG_DEBUG("Executing output stage"); + { + ggml_tensor* final_out = nullptr; + + auto get_output_graph = [&]() -> ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(QWEN_IMAGE_GRAPH_SIZE / 4); + + // Create input tensors + ggml_tensor* img_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, img_ne[0], img_ne[1], img_ne[2], img_ne[3]); + ggml_tensor* t_emb_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, t_emb_ne[0], t_emb_ne[1], t_emb_ne[2], t_emb_ne[3]); + + img_in = to_backend(img_in); + t_emb_in = to_backend(t_emb_in); + + set_backend_tensor_data(img_in, persistent_img); + set_backend_tensor_data(t_emb_in, persistent_t_emb); + + auto runner_ctx = get_context(); + final_out = qwen_image.forward_output_stage(&runner_ctx, img_in, t_emb_in, + img_tokens_count, orig_H, orig_W); + + ggml_build_forward_expand(gf, final_out); + + return gf; + }; + + if (!GGMLRunner::compute(get_output_graph, n_threads, true, output, output_ctx, true)) { + LOG_ERROR("Output stage failed"); + return false; + } + } + + int64_t t_end = ggml_time_ms(); + LOG_INFO("TRUE per-layer streaming completed in %.2fs (%d blocks)", + (t_end - t_start) / 1000.0, num_layers); + + return true; + } + + public: + + // Raw ggml_tensor* overload used by streaming code + ggml_cgraph* build_graph(ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + std::vector ref_latents = {}, + bool increase_ref_index = false) { + GGML_ASSERT(x->ne[3] == 1); + ggml_cgraph* gf = new_graph_custom(QWEN_IMAGE_GRAPH_SIZE); + + x = to_backend(x); + context = to_backend(context); + timesteps = to_backend(timesteps); + + for (size_t i = 0; i < ref_latents.size(); i++) { + ref_latents[i] = to_backend(ref_latents[i]); + } + + pe_vec = Rope::gen_qwen_image_pe(static_cast(x->ne[1]), + static_cast(x->ne[0]), + qwen_image_params.patch_size, + static_cast(x->ne[3]), + static_cast(context->ne[1]), + ref_latents, + increase_ref_index, + qwen_image_params.theta, + circular_y_enabled, + circular_x_enabled, + qwen_image_params.axes_dim); + int pos_len = static_cast(pe_vec.size() / qwen_image_params.axes_dim_sum / 2); + auto pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, qwen_image_params.axes_dim_sum / 2, pos_len); + set_backend_tensor_data(pe, pe_vec.data()); + + ggml_tensor* modulate_index = nullptr; + if (qwen_image_params.zero_cond_t) { + modulate_index_vec.clear(); + + int64_t h_len = ((x->ne[1] + (qwen_image_params.patch_size / 2)) / qwen_image_params.patch_size); + int64_t w_len = ((x->ne[0] + (qwen_image_params.patch_size / 2)) / qwen_image_params.patch_size); + int64_t num_img_tokens = h_len * w_len; + + modulate_index_vec.insert(modulate_index_vec.end(), num_img_tokens, 0.f); + int64_t num_ref_img_tokens = 0; + for (ggml_tensor* ref : ref_latents) { + int64_t rh_len = ((ref->ne[1] + (qwen_image_params.patch_size / 2)) / qwen_image_params.patch_size); + int64_t rw_len = ((ref->ne[0] + (qwen_image_params.patch_size / 2)) / qwen_image_params.patch_size); + + 
num_ref_img_tokens += rh_len * rw_len; + } + + if (num_ref_img_tokens > 0) { + modulate_index_vec.insert(modulate_index_vec.end(), num_ref_img_tokens, 1.f); + } + + modulate_index = ggml_new_tensor_1d(compute_ctx, GGML_TYPE_F32, modulate_index_vec.size()); + set_backend_tensor_data(modulate_index, modulate_index_vec.data()); + } + + auto runner_ctx = get_context(); + + ggml_tensor* out = qwen_image.forward(&runner_ctx, + x, + timesteps, + context, + pe, + ref_latents, + modulate_index); + + ggml_build_forward_expand(gf, out); + + return gf; + } + + // sd::Tensor overload - upstream public API ggml_cgraph* build_graph(const sd::Tensor& x_tensor, const sd::Tensor& timesteps_tensor, const sd::Tensor& context_tensor, @@ -608,6 +1178,27 @@ namespace Qwen { return gf; } + // Raw ggml_tensor* overload used by streaming code + bool compute(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + std::vector ref_latents = {}, + bool increase_ref_index = false, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr, + bool skip_param_offload = false) { + // x: [N, in_channels, h, w] + // timesteps: [N, ] + // context: [N, max_position, hidden_size] + auto get_graph = [&]() -> ggml_cgraph* { + return build_graph(x, timesteps, context, ref_latents, increase_ref_index); + }; + + return GGMLRunner::compute(get_graph, n_threads, false, output, output_ctx, skip_param_offload); + } + + // sd::Tensor overload - upstream public API sd::Tensor compute(int n_threads, const sd::Tensor& x, const sd::Tensor& timesteps, diff --git a/src/stable-diffusion.cpp b/src/stable-diffusion.cpp index fd439ff1d..b7e25a4d2 100644 --- a/src/stable-diffusion.cpp +++ b/src/stable-diffusion.cpp @@ -1,5 +1,10 @@ #include "ggml_extend.hpp" +#ifdef SD_USE_CUDA +#include "ggml-cuda.h" +#include +#endif + #include "model.h" #include "rng.hpp" #include "rng_mt19937.hpp" @@ -146,6 +151,11 @@ class StableDiffusionGGML { bool offload_params_to_cpu = false; float max_vram = 0.f; bool use_pmid = false; + sd_offload_config_t offload_config = {}; // Dynamic tensor offloading config + + // Track which components were intentionally kept on CPU (don't try to move to GPU) + bool cond_stage_on_cpu_only = false; // true if keep_clip_on_cpu was set + bool vae_on_cpu_only = false; // true if keep_vae_on_cpu was set bool is_using_v_parameterization = false; bool is_using_edm_v_parameterization = false; @@ -192,6 +202,31 @@ class StableDiffusionGGML { free_params_immediately = sd_ctx_params->free_params_immediately; offload_params_to_cpu = sd_ctx_params->offload_params_to_cpu; max_vram = sd_ctx_params->max_vram; + offload_config = sd_ctx_params->offload_config; + + // When the offload_config selects a cross-stage mode, also force the + // affected models onto the CPU backend so we can shuffle them between + // stages. offload_params_to_cpu remains the user-facing knob; this is + // an internal escalation when the config implies it. + bool cond_stage_offload_to_cpu = offload_params_to_cpu; + bool diffusion_offload_to_cpu = offload_params_to_cpu; + bool vae_offload_to_cpu = offload_params_to_cpu; + if (offload_config.mode != SD_OFFLOAD_NONE) { + if (offload_config.offload_cond_stage) { + cond_stage_offload_to_cpu = true; + } + // Diffusion CPU backend is needed even in cond_only mode so we + // can temporarily swap it out while loading cond_stage to GPU. + diffusion_offload_to_cpu = true; + } + // Layer streaming wants every MB it can get back during sampling, so + // give the VAE a CPU-pinned twin too. 
The VAE is idle for the entire + // sampler loop and only used at decode time — moving it to CPU between + // the two phases is pure win. Other offload modes keep current + // behaviour: VAE on whichever backend the user selected. + if (offload_config.mode == SD_OFFLOAD_LAYER_STREAMING) { + vae_offload_to_cpu = true; + } bool use_tae = false; @@ -376,6 +411,7 @@ class StableDiffusionGGML { } bool clip_on_cpu = sd_ctx_params->keep_clip_on_cpu; + cond_stage_on_cpu_only = clip_on_cpu; // Track for offload decisions const size_t max_graph_vram_bytes = max_vram <= 0.f ? 0 @@ -389,10 +425,10 @@ class StableDiffusionGGML { } if (sd_version_is_sd3(version)) { cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map); diffusion_model = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map); } else if (sd_version_is_flux(version)) { bool is_chroma = false; @@ -413,53 +449,53 @@ class StableDiffusionGGML { } cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map, sd_ctx_params->chroma_use_t5_mask, sd_ctx_params->chroma_t5_mask_pad); } else if (version == VERSION_OVIS_IMAGE) { cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map, version, "", false); } else { cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map); } diffusion_model = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map, version, sd_ctx_params->chroma_use_dit_mask); } else if (sd_version_is_flux2(version)) { bool is_chroma = false; cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map, version); diffusion_model = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map, version, sd_ctx_params->chroma_use_dit_mask); } else if (sd_version_is_wan(version)) { cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map, true, 0, true); diffusion_model = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map, "model.diffusion_model", version); if (strlen(SAFE_STR(sd_ctx_params->high_noise_diffusion_model_path)) > 0) { high_noise_diffusion_model = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map, "model.high_noise_diffusion_model", version); @@ -468,7 +504,7 @@ class StableDiffusionGGML { diffusion_model->get_desc() == "Wan2.1-FLF2V-14B" || diffusion_model->get_desc() == "Wan2.1-I2V-1.3B") { clip_vision = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map); clip_vision->set_max_graph_vram_bytes(max_graph_vram_bytes); clip_vision->alloc_params_buffer(); @@ -480,32 +516,32 @@ class StableDiffusionGGML { enable_vision = true; } cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map, version, "", enable_vision); diffusion_model = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map, "model.diffusion_model", version, sd_ctx_params->qwen_image_zero_cond_t); } else if (sd_version_is_anima(version)) { cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + 
cond_stage_offload_to_cpu, tensor_storage_map); diffusion_model = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map, "model.diffusion_model"); } else if (sd_version_is_z_image(version)) { cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map, version); diffusion_model = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map, "model.diffusion_model", version); @@ -525,20 +561,20 @@ class StableDiffusionGGML { } if (strstr(SAFE_STR(sd_ctx_params->photo_maker_path), "v2")) { cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map, embbeding_map, version, PM_VERSION_2); } else { cond_stage_model = std::make_shared(clip_backend, - offload_params_to_cpu, + cond_stage_offload_to_cpu, tensor_storage_map, embbeding_map, version); } diffusion_model = std::make_shared(backend, - offload_params_to_cpu, + diffusion_offload_to_cpu, tensor_storage_map, version); if (sd_ctx_params->diffusion_conv_direct) { @@ -555,6 +591,26 @@ class StableDiffusionGGML { diffusion_model->alloc_params_buffer(); diffusion_model->get_param_tensors(tensors); + // Enable layer streaming if configured + if (offload_config.mode == SD_OFFLOAD_LAYER_STREAMING) { + LOG_INFO("Mode is layer_streaming, checking model support..."); + if (diffusion_model->supports_layer_streaming()) { + LOG_INFO("Enabling layer-by-layer streaming for diffusion model"); + LOG_INFO("Prefetch layers: %d, Min free VRAM: %.0f MB", + offload_config.streaming_prefetch_layers, + offload_config.streaming_min_free_vram / (1024.0 * 1024.0)); + diffusion_model->enable_layer_streaming( + offload_config.streaming_prefetch_layers, + offload_config.streaming_min_free_vram); + LOG_INFO("is_layer_streaming_enabled() = %s", + diffusion_model->is_layer_streaming_enabled() ? 
"true" : "false"); + } else { + LOG_WARN("Diffusion model does not support layer streaming, falling back to normal mode"); + } + } else { + LOG_DEBUG("Mode is not layer_streaming (mode=%d)", offload_config.mode); + } + if (sd_version_is_unet_edit(version)) { vae_decode_only = false; } @@ -565,6 +621,7 @@ class StableDiffusionGGML { high_noise_diffusion_model->get_param_tensors(tensors); } + vae_on_cpu_only = sd_ctx_params->keep_vae_on_cpu; // Track for offload decisions if (sd_ctx_params->keep_vae_on_cpu && !ggml_backend_is_cpu(backend)) { LOG_INFO("VAE Autoencoder: Using CPU backend"); vae_backend = ggml_backend_cpu_init(); @@ -577,7 +634,7 @@ class StableDiffusionGGML { sd_version_is_qwen_image(version) || sd_version_is_anima(version)) { return std::make_shared(vae_backend, - offload_params_to_cpu, + vae_offload_to_cpu, tensor_storage_map, "decoder", vae_decode_only, @@ -585,7 +642,7 @@ class StableDiffusionGGML { } else { auto model = std::make_shared(vae_backend, - offload_params_to_cpu, + vae_offload_to_cpu, tensor_storage_map, "decoder.layers", vae_decode_only, @@ -599,14 +656,14 @@ class StableDiffusionGGML { sd_version_is_qwen_image(version) || sd_version_is_anima(version)) { return std::make_shared(vae_backend, - offload_params_to_cpu, + vae_offload_to_cpu, tensor_storage_map, "first_stage_model", vae_decode_only, version); } else { auto model = std::make_shared(vae_backend, - offload_params_to_cpu, + vae_offload_to_cpu, tensor_storage_map, "first_stage_model", vae_decode_only, @@ -629,7 +686,7 @@ class StableDiffusionGGML { LOG_INFO("using FakeVAE"); first_stage_model = std::make_shared(version, vae_backend, - offload_params_to_cpu); + vae_offload_to_cpu); } else if (use_tae && !tae_preview_only) { LOG_INFO("using TAE for encoding / decoding"); first_stage_model = create_tae(); @@ -805,6 +862,53 @@ class StableDiffusionGGML { LOG_DEBUG("finished loaded file"); + // For layer streaming mode, offload all diffusion model layers to CPU immediately + // This frees VRAM for the LLM/CLIP during conditioning + // Layers will be loaded on-demand during streaming execution + if (offload_config.mode == SD_OFFLOAD_LAYER_STREAMING && + diffusion_model && diffusion_model->is_layer_streaming_enabled()) { + LOG_INFO("Offloading diffusion model layers to CPU for layer streaming"); + diffusion_model->offload_streaming_layers(); + } + + // When dynamic offloading is enabled and user didn't want clip on CPU, + // we forced CPU backend creation but now TRY to move params to GPU for execution. + // This gives us the best of both: fast GPU execution with ability to offload later. + // Skip if cond_stage was intentionally kept on CPU (keep_clip_on_cpu=true). 
+ if (offload_config.mode != SD_OFFLOAD_NONE && + offload_config.offload_cond_stage && + !cond_stage_on_cpu_only) { + // Disable automatic offloading - we control offload/reload timing explicitly + cond_stage_model->set_auto_offload(false); + + // Check if there's enough VRAM to load cond_stage now + // If not, keep it on CPU - it will be loaded on-demand before conditioning + size_t cond_stage_size = cond_stage_model->get_params_buffer_size(); + size_t free_vram = 0; +#ifdef SD_USE_CUDA + size_t total_vram = 0; + ggml_backend_cuda_get_device_memory(0, &free_vram, &total_vram); +#endif + // Need safety margin for compute buffers + size_t safety_margin = 500 * 1024 * 1024; + + if (free_vram >= cond_stage_size + safety_margin) { + LOG_WARN("Moving cond_stage params to GPU (%.2f MB free, %.2f MB needed)", + free_vram / (1024.0f * 1024.0f), cond_stage_size / (1024.0f * 1024.0f)); + if (cond_stage_model->move_params_to_gpu()) { + LOG_WARN("cond_stage now on GPU (%.2f MB), auto-offload disabled for explicit control", + cond_stage_model->get_params_vram_size() / (1024.0f * 1024.0f)); + } else { + // GPU allocation failed despite having enough reported free VRAM (fragmentation?) + // Keep on CPU - it will work, just with on-demand loading + LOG_WARN("cond_stage GPU allocation failed (fragmentation?), keeping on CPU for on-demand loading"); + } + } else { + LOG_WARN("Not enough VRAM for cond_stage at load time (%.2f MB free, %.2f MB needed), keeping on CPU for on-demand loading", + free_vram / (1024.0f * 1024.0f), cond_stage_size / (1024.0f * 1024.0f)); + } + } + { size_t clip_params_mem_size = cond_stage_model->get_params_buffer_size(); size_t unet_params_mem_size = diffusion_model->get_params_buffer_size(); @@ -1014,7 +1118,11 @@ class StableDiffusionGGML { is_high_noise = true; LOG_DEBUG("high noise lora: %s", lora_path.c_str()); } - auto lora = std::make_shared(lora_id, backend, lora_path, is_high_noise ? "model.high_noise_" : "", version); + // Enable CPU offload for LoRA when dynamic offloading is active + bool enable_lora_offload = (offload_config.mode != SD_OFFLOAD_NONE); + auto lora = std::make_shared(lora_id, backend, lora_path, + is_high_noise ? 
"model.high_noise_" : "", + version, enable_lora_offload); if (!lora->load_from_file(n_threads, lora_tensor_filter)) { LOG_WARN("load lora tensors from %s failed", lora_path.c_str()); return nullptr; @@ -1691,7 +1799,7 @@ class StableDiffusionGGML { return std::move(cached_output); } - auto output_opt = work_diffusion_model->compute(n_threads, diffusion_params); + auto output_opt = work_diffusion_model->compute_dispatch(n_threads, diffusion_params); if (output_opt.empty()) { LOG_ERROR("diffusion model compute failed"); return sd::Tensor(); @@ -1885,6 +1993,352 @@ class StableDiffusionGGML { return latents; } + // Estimate VRAM needed for VAE decode operation (formula-based) + size_t estimate_vae_decode_vram(int width, int height) { + if (first_stage_model == nullptr) { + return static_cast(width) * height * 12; + } + size_t vae_weights = first_stage_model->get_params_buffer_size(); + size_t compute_estimate = static_cast(width) * height * 48; + return vae_weights + compute_estimate; + } + + // Smart offload before VAE decode - only offload what's needed + bool smart_offload_for_vae(int width, int height, bool decode_video = false) { + if (offload_config.mode == SD_OFFLOAD_NONE) { + return false; + } + + // In layer_streaming mode, skip smart offload for diffusion model + if (offload_config.mode == SD_OFFLOAD_LAYER_STREAMING) { + if (offload_config.offload_cond_stage && cond_stage_model && cond_stage_model->is_params_on_gpu()) { + if (offload_config.log_offload_events) { + LOG_INFO("Layer streaming: moving cond_stage to CPU for VAE decode"); + } + cond_stage_model->move_params_to_cpu(); + return true; + } + return false; + } + + size_t vae_vram_needed = estimate_vae_decode_vram(width, height); + + size_t target_free = offload_config.target_free_vram; + size_t vram_to_free = vae_vram_needed > target_free ? 0 : vae_vram_needed; + + size_t cond_vram = 0; + size_t diffusion_vram = 0; + bool cond_on_gpu = cond_stage_model && cond_stage_model->is_params_on_gpu(); + bool diffusion_on_gpu = diffusion_model && diffusion_model->is_params_on_gpu(); + + if (cond_on_gpu) { + cond_vram = cond_stage_model->get_params_buffer_size(); + } + if (diffusion_on_gpu) { + diffusion_vram = diffusion_model->get_params_buffer_size(); + } + + bool offloaded_anything = false; + + if (offload_config.offload_cond_stage && cond_on_gpu && cond_vram >= offload_config.min_offload_size) { + if (offload_config.log_offload_events) { + LOG_INFO("Smart offload: moving cond_stage to CPU (%.2f MB) for VAE decode", + cond_vram / (1024.0f * 1024.0f)); + } + cond_stage_model->move_params_to_cpu(); + offloaded_anything = true; + vram_to_free = (vram_to_free > cond_vram) ? 
vram_to_free - cond_vram : 0; + } + + if (offload_config.offload_diffusion && diffusion_on_gpu && vram_to_free > 0 && + diffusion_vram >= offload_config.min_offload_size) { + if (offload_config.log_offload_events) { + LOG_INFO("Smart offload: moving diffusion to CPU (%.2f MB) for VAE decode", + diffusion_vram / (1024.0f * 1024.0f)); + } + diffusion_model->move_params_to_cpu(); + offloaded_anything = true; + } + + return offloaded_anything; + } + + // Smart offload before VAE encode - only offload what's needed + bool smart_offload_for_vae_encode(int width, int height) { + if (offload_config.mode == SD_OFFLOAD_NONE) { + return false; + } + + if (offload_config.mode == SD_OFFLOAD_LAYER_STREAMING) { + bool offloaded = false; + + if (offload_config.offload_cond_stage && cond_stage_model && cond_stage_model->is_params_on_gpu()) { + if (offload_config.log_offload_events) { + LOG_INFO("Layer streaming: moving cond_stage to CPU for VAE encode"); + } + cond_stage_model->move_params_to_cpu(); + offloaded = true; + } + + if (offload_config.offload_diffusion && diffusion_model && diffusion_model->is_params_on_gpu()) { + if (offload_config.log_offload_events) { + LOG_INFO("Layer streaming: moving diffusion to CPU for VAE encode"); + } + diffusion_model->move_params_to_cpu(); + offloaded = true; + } + + return offloaded; + } + + size_t vae_vram_needed = 0; + if (first_stage_model == nullptr) { + vae_vram_needed = static_cast(width) * height * 12; + } else { + size_t vae_weights = first_stage_model->get_params_buffer_size(); + size_t compute_estimate = static_cast(width) * height * 40; + vae_vram_needed = vae_weights + compute_estimate; + } + + size_t target_free = offload_config.target_free_vram; + size_t vram_to_free = vae_vram_needed > target_free ? 0 : vae_vram_needed; + + size_t cond_vram = 0; + size_t diffusion_vram = 0; + bool cond_on_gpu = cond_stage_model && cond_stage_model->is_params_on_gpu(); + bool diffusion_on_gpu = diffusion_model && diffusion_model->is_params_on_gpu(); + + if (cond_on_gpu) { + cond_vram = cond_stage_model->get_params_buffer_size(); + } + if (diffusion_on_gpu) { + diffusion_vram = diffusion_model->get_params_buffer_size(); + } + + bool offloaded_anything = false; + + if (offload_config.offload_cond_stage && cond_on_gpu && cond_vram >= offload_config.min_offload_size) { + if (offload_config.log_offload_events) { + LOG_INFO("Smart offload: moving cond_stage to CPU (%.2f MB) for VAE encode", + cond_vram / (1024.0f * 1024.0f)); + } + cond_stage_model->move_params_to_cpu(); + offloaded_anything = true; + vram_to_free = (vram_to_free > cond_vram) ? 
vram_to_free - cond_vram : 0; + } + + if (offload_config.offload_diffusion && diffusion_on_gpu && vram_to_free > 0 && + diffusion_vram >= offload_config.min_offload_size) { + if (offload_config.log_offload_events) { + LOG_INFO("Smart offload: moving diffusion to CPU (%.2f MB) for VAE encode", + diffusion_vram / (1024.0f * 1024.0f)); + } + diffusion_model->move_params_to_cpu(); + offloaded_anything = true; + } + + return offloaded_anything; + } + + // Get current free VRAM on the primary GPU + size_t get_free_vram() { + size_t free_vram = 0; +#ifdef SD_USE_CUDA + size_t total_vram = 0; + ggml_backend_cuda_get_device_memory(0, &free_vram, &total_vram); +#endif + return free_vram; + } + + // Estimate VRAM needed for diffusion sampling + size_t estimate_diffusion_vram(int width, int height) { + if (!diffusion_model) { + return 0; + } + size_t params_size = diffusion_model->get_params_buffer_size(); + int latent_w = width / get_vae_scale_factor(); + int latent_h = height / get_vae_scale_factor(); + size_t compute_estimate = latent_w * latent_h * 64; + return params_size + compute_estimate; + } + + // Smart check: Should we offload cond_stage after conditioning? + bool should_offload_cond_stage_for_diffusion(int width, int height) { + if (offload_config.mode == SD_OFFLOAD_NONE || !offload_config.offload_cond_stage) { + return false; + } + if (!cond_stage_model || !cond_stage_model->is_params_on_gpu()) { + return false; + } + + if (offload_config.mode == SD_OFFLOAD_LAYER_STREAMING) { + LOG_INFO("Layer streaming mode: will offload cond_stage to free VRAM for layer loading"); + return true; + } + + size_t cond_stage_vram = cond_stage_model->get_params_vram_size(); + if (cond_stage_vram < offload_config.min_offload_size) { + return false; + } + + size_t free_vram = get_free_vram(); + size_t diffusion_needs = estimate_diffusion_vram(width, height); + size_t safety_margin = 300 * 1024 * 1024; + + bool vram_is_tight = free_vram < (diffusion_needs + safety_margin); + + if (offload_config.log_offload_events) { + LOG_INFO("Smart check (cond->diffusion): free=%.2f MB, diffusion_needs=%.2f MB, cond_stage=%.2f MB, tight=%s", + free_vram / (1024.0f * 1024.0f), + diffusion_needs / (1024.0f * 1024.0f), + cond_stage_vram / (1024.0f * 1024.0f), + vram_is_tight ? "yes" : "no"); + } + + return vram_is_tight; + } + + // Smart check: Should we offload diffusion after sampling? + bool should_offload_diffusion_for_vae(int width, int height) { + if (offload_config.mode != SD_OFFLOAD_AGGRESSIVE && + offload_config.mode != SD_OFFLOAD_COND_DIFFUSION) { + return false; + } + if (!offload_config.offload_diffusion) { + return false; + } + if (!diffusion_model || !diffusion_model->is_params_on_gpu()) { + return false; + } + + size_t diffusion_vram = diffusion_model->get_params_vram_size(); + if (diffusion_vram < offload_config.min_offload_size) { + return false; + } + + size_t free_vram = get_free_vram(); + size_t vae_needs = estimate_vae_decode_vram(width, height); + size_t safety_margin = 300 * 1024 * 1024; + + bool vram_is_tight = free_vram < (vae_needs + safety_margin); + + if (offload_config.log_offload_events) { + LOG_INFO("Smart check (diffusion->VAE): free=%.2f MB, vae_needs=%.2f MB, diffusion=%.2f MB, tight=%s", + free_vram / (1024.0f * 1024.0f), + vae_needs / (1024.0f * 1024.0f), + diffusion_vram / (1024.0f * 1024.0f), + vram_is_tight ? 
"yes" : "no"); + } + + return vram_is_tight; + } + + // Offload conditioners to CPU after conditioning phase + void offload_conditioners() { + if (offload_config.offload_cond_stage && cond_stage_model && cond_stage_model->is_params_on_gpu()) { + cond_stage_model->move_params_to_cpu(); + } + } + + // Offload diffusion model to CPU after sampling phase + void offload_diffusion_model() { + if (offload_config.offload_diffusion && diffusion_model && diffusion_model->is_params_on_gpu()) { + diffusion_model->move_params_to_cpu(); + } + } + + // Park the VAE on CPU pinned memory while diffusion samples. The VAE is + // idle for the entire sampler loop and only used at decode time, so its + // VRAM footprint is wasted during streaming. Reloads automatically on the + // next decode call via the runner's compute path. Only effective when the + // VAE was constructed with a CPU-pinned twin (vae_offload_to_cpu == true, + // which we escalate under SD_OFFLOAD_LAYER_STREAMING). + bool offload_vae_for_streaming() { + if (offload_config.mode != SD_OFFLOAD_LAYER_STREAMING) return false; + if (!first_stage_model || !first_stage_model->is_params_on_gpu()) return false; + size_t vae_vram = first_stage_model->get_params_vram_size(); + if (!first_stage_model->move_params_to_cpu()) { + return false; + } + if (offload_config.log_offload_events) { + LOG_INFO("Layer streaming: parked VAE on CPU pinned (%.2f MB)", + vae_vram / (1024.0 * 1024.0)); + } + return true; + } + + // Reload diffusion model to GPU before sampling + bool reload_diffusion_model() { + if (diffusion_model && !diffusion_model->is_params_on_gpu()) { + return diffusion_model->move_params_to_gpu(); + } + return true; + } + + // Reload cond_stage model to GPU before conditioning + bool reload_cond_stage_model() { + if (cond_stage_model && !cond_stage_model->is_params_on_gpu()) { + return cond_stage_model->move_params_to_gpu(); + } + return true; + } + + // Post-generation reload of models to GPU + void post_generation_reload() { + if (offload_config.mode == SD_OFFLOAD_NONE || free_params_immediately) { + return; + } + + int64_t reload_start = ggml_time_ms(); + bool reloaded_any = false; + + // Reload diffusion if configured (skip for layer_streaming) + if (offload_config.reload_diffusion && + offload_config.mode != SD_OFFLOAD_LAYER_STREAMING && + diffusion_model && !diffusion_model->is_params_on_gpu()) { + if (offload_config.log_offload_events) { + LOG_WARN("Reloading diffusion to GPU after generation..."); + } + if (diffusion_model->move_params_to_gpu()) { + if (offload_config.log_offload_events) { + LOG_WARN("diffusion reloaded to GPU (%.2f MB)", + diffusion_model->get_params_vram_size() / (1024.0f * 1024.0f)); + } + reloaded_any = true; + } else { + LOG_WARN("Failed to reload diffusion to GPU - will load on-demand"); + } + } + + // Reload cond_stage if configured and enough VRAM + if (offload_config.reload_cond_stage && + cond_stage_model && !cond_stage_model->is_params_on_gpu()) { + size_t cond_stage_size = cond_stage_model->get_params_buffer_size(); + size_t free_vram = get_free_vram(); + size_t safety_margin = 500 * 1024 * 1024; + + if (free_vram >= cond_stage_size + safety_margin) { + if (offload_config.log_offload_events) { + LOG_WARN("Reloading cond_stage to GPU after generation..."); + } + if (cond_stage_model->move_params_to_gpu()) { + if (offload_config.log_offload_events) { + LOG_WARN("cond_stage reloaded to GPU (%.2f MB)", + cond_stage_model->get_params_vram_size() / (1024.0f * 1024.0f)); + } + reloaded_any = true; + } + } else if 
(offload_config.log_offload_events) { + LOG_WARN("Not enough VRAM to reload cond_stage - will load on-demand"); + } + } + + if (reloaded_any && offload_config.log_offload_events) { + int64_t reload_end = ggml_time_ms(); + LOG_WARN("Post-generation reload completed in %" PRId64 " ms", reload_end - reload_start); + } + } + sd::Tensor decode_first_stage(const sd::Tensor& x, bool decode_video = false) { auto latents = first_stage_model->diffusion_to_vae_latents(x); return first_stage_model->decode(n_threads, latents, vae_tiling_params, decode_video, circular_x, circular_y); @@ -2083,6 +2537,63 @@ enum lora_apply_mode_t str_to_lora_apply_mode(const char* str) { return LORA_APPLY_MODE_COUNT; } +const char* offload_mode_to_str[] = { + "none", + "cond_only", + "cond_diffusion", + "aggressive", + "layer_streaming", +}; + +const char* sd_offload_mode_name(enum sd_offload_mode_t mode) { + if (mode < SD_OFFLOAD_MODE_COUNT) { + return offload_mode_to_str[mode]; + } + return NONE_STR; +} + +enum sd_offload_mode_t str_to_offload_mode(const char* str) { + for (int i = 0; i < SD_OFFLOAD_MODE_COUNT; i++) { + if (!strcmp(str, offload_mode_to_str[i])) { + return (enum sd_offload_mode_t)i; + } + } + return SD_OFFLOAD_MODE_COUNT; +} + +const char* vram_estimation_to_str[] = { + "dryrun", + "formula", +}; + +const char* sd_vram_estimation_name(enum sd_vram_estimation_t method) { + if (method < SD_VRAM_EST_COUNT) { + return vram_estimation_to_str[method]; + } + return NONE_STR; +} + +enum sd_vram_estimation_t str_to_vram_estimation(const char* str) { + for (int i = 0; i < SD_VRAM_EST_COUNT; i++) { + if (!strcmp(str, vram_estimation_to_str[i])) { + return (enum sd_vram_estimation_t)i; + } + } + return SD_VRAM_EST_COUNT; +} + +void sd_offload_config_init(sd_offload_config_t* config) { + config->mode = SD_OFFLOAD_NONE; + config->vram_estimation = SD_VRAM_EST_DRYRUN; // Dry-run is default (accurate) + config->offload_cond_stage = true; + config->offload_diffusion = false; + config->reload_cond_stage = false; + config->reload_diffusion = true; // Default: reload diffusion for next generation + config->log_offload_events = true; + config->min_offload_size = 0; + config->target_free_vram = 2ULL * 1024 * 1024 * 1024; // 2 GB +} + const char* hires_upscaler_to_str[] = { "None", "Latent", @@ -2175,6 +2686,17 @@ void sd_ctx_params_init(sd_ctx_params_t* sd_ctx_params) { sd_ctx_params->chroma_use_dit_mask = true; sd_ctx_params->chroma_use_t5_mask = false; sd_ctx_params->chroma_t5_mask_pad = 1; + // flow_shift moved out of sd_ctx_params_t in upstream master into + // sd_sample_params_t; sd_sample_params_init() initialises it there. 
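+    // Illustrative caller-side override (a sketch against the fields set
+    // below, not an API guarantee):
+    //   sd_ctx_params_t p;
+    //   sd_ctx_params_init(&p);
+    //   p.offload_config.mode = SD_OFFLOAD_LAYER_STREAMING;
+    //   p.offload_config.target_free_vram = 1ULL << 30; // keep ~1 GB free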
+ + // Dynamic tensor offloading defaults (disabled) + sd_ctx_params->offload_config.mode = SD_OFFLOAD_NONE; + sd_ctx_params->offload_config.offload_cond_stage = true; + sd_ctx_params->offload_config.offload_diffusion = false; + sd_ctx_params->offload_config.reload_cond_stage = false; // Let on-demand reload handle it (safer) + sd_ctx_params->offload_config.log_offload_events = true; + sd_ctx_params->offload_config.min_offload_size = 0; // No minimum - offload any size + sd_ctx_params->offload_config.target_free_vram = 2ULL * 1024 * 1024 * 1024; // 2 GB target for VAE } char* sd_ctx_params_to_str(const sd_ctx_params_t* sd_ctx_params) { @@ -2490,6 +3012,15 @@ enum scheduler_t sd_get_default_scheduler(const sd_ctx_t* sd_ctx, enum sample_me return DISCRETE_SCHEDULER; } +const char* sd_get_model_version_name(const sd_ctx_t* sd_ctx) { + if (sd_ctx != nullptr && sd_ctx->sd != nullptr) { + if (sd_ctx->sd->version < VERSION_COUNT) { + return model_version_to_str[sd_ctx->sd->version]; + } + } + return "Unknown"; +} + static int64_t resolve_seed(int64_t seed) { if (seed >= 0) { return seed; @@ -2977,6 +3508,8 @@ static std::optional prepare_image_generation_latents(sd if (init_image_tensor.empty()) { init_latent = sd_ctx->sd->generate_init_latent(request->width, request->height); } else { + // Smart offload before VAE encode to free VRAM + sd_ctx->sd->smart_offload_for_vae_encode(request->width, request->height); init_latent = sd_ctx->sd->encode_first_stage(init_image_tensor); if (init_latent.empty()) { LOG_ERROR("failed to encode init image"); @@ -3171,6 +3704,17 @@ static std::optional prepare_image_generation_embeds(sd_c sd_ctx->sd->cond_stage_model->free_params_buffer(); } + // Smart offload: move cond_stage to CPU if VRAM is tight for diffusion sampling + if (!sd_ctx->sd->free_params_immediately && + sd_ctx->sd->should_offload_cond_stage_for_diffusion(request->width, request->height)) { + sd_ctx->sd->offload_conditioners(); + } + + // Layer-streaming companion: free the VAE's VRAM for the sampler loop. + // It's only needed at decode time, which reloads it via the runner's + // normal compute path. 
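+    // The saving is typically a few hundred MB for SD-class VAEs (an
+    // estimate; varies by model and precision): small next to the diffusion
+    // weights, but meaningful when per-layer streaming is already short on
+    // headroom.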
+ sd_ctx->sd->offload_vae_for_streaming(); + ImageGenerationEmbeds embeds; if (request->use_img_cond) { embeds.img_cond = SDCondition(uncond.c_crossattn, uncond.c_vector, cond.c_concat); @@ -3189,6 +3733,15 @@ static sd_image_t* decode_image_outputs(sd_ctx_t* sd_ctx, LOG_ERROR("expected %d latents, got %zu", request.batch_count, final_latents.size()); return nullptr; } + // Smart offload before VAE decode + sd_ctx->sd->smart_offload_for_vae(request.width, request.height); + + // For layer_streaming mode: offload streaming layers before VAE decode + if (sd_ctx->sd->offload_config.mode == SD_OFFLOAD_LAYER_STREAMING && + sd_ctx->sd->diffusion_model && sd_ctx->sd->diffusion_model->is_layer_streaming_enabled()) { + sd_ctx->sd->diffusion_model->offload_streaming_layers(); + } + LOG_INFO("decoding %zu latents", final_latents.size()); std::vector> decoded_images; int64_t t0 = ggml_time_ms(); @@ -3369,6 +3922,16 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s sd_ctx->sd->rng->manual_seed(request.seed); sd_ctx->sd->sampler_rng->manual_seed(request.seed); sd_ctx->sd->set_flow_shift(sd_img_gen_params->sample_params.flow_shift); + + // When offload mode is enabled and we have LoRAs, offload cond_stage first to free VRAM + if (sd_ctx->sd->offload_config.mode != SD_OFFLOAD_NONE && + sd_ctx->sd->offload_config.offload_cond_stage && + sd_img_gen_params->lora_count > 0 && + sd_ctx->sd->cond_stage_model && sd_ctx->sd->cond_stage_model->is_params_on_gpu()) { + LOG_WARN("Offloading cond_stage before LoRA application to free VRAM"); + sd_ctx->sd->offload_conditioners(); + } + sd_ctx->sd->apply_loras(sd_img_gen_params->loras, sd_img_gen_params->lora_count); ImageVaeAxesGuard axes_guard(sd_ctx, sd_img_gen_params, request); @@ -3393,6 +3956,16 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s } ImageGenerationEmbeds embeds = std::move(*embeds_opt); + // Ensure diffusion model is on GPU before sampling (may have been offloaded for cond_stage) + // Skip for layer_streaming - streaming engine loads layers individually + if (sd_ctx->sd->offload_config.mode != SD_OFFLOAD_NONE && + sd_ctx->sd->offload_config.mode != SD_OFFLOAD_LAYER_STREAMING) { + if (!sd_ctx->sd->reload_diffusion_model()) { + LOG_ERROR("Failed to reload diffusion model to GPU for sampling"); + return nullptr; + } + } + std::vector> final_latents; int64_t denoise_start = ggml_time_ms(); for (int b = 0; b < request.batch_count; b++) { @@ -3438,6 +4011,18 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s b + 1, request.batch_count, (sampling_end - sampling_start) * 1.0f / 1000); + // Mid-stream failures (e.g. compute-buffer cudaMalloc OOM at layer N) + // leave the streaming engine's resident layers + warm cache GPU-resident + // — the success path's offload_streaming_layers() at the end of + // sampling never runs. Without this eviction, the next job starts on a + // GPU that's already 8-9 GB full from the previous failed run and + // typically hits the same OOM. The swap is cheap (each layer's CPU + // pinned twin already exists) so freeing them is just pointer swaps. 
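+            // Expected post-condition (inferred from the streaming engine's
+            // contract, stated here as an assumption): after this eviction
+            // the registry reports no GPU-resident "transformer_blocks.*",
+            // so the next run's VRAM budget analysis starts from the same
+            // baseline as a fresh process.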
+ if (sd_ctx->sd->offload_config.mode == SD_OFFLOAD_LAYER_STREAMING && + sd_ctx->sd->diffusion_model && + sd_ctx->sd->diffusion_model->is_layer_streaming_enabled()) { + sd_ctx->sd->diffusion_model->offload_streaming_layers(); + } if (sd_ctx->sd->free_params_immediately) { sd_ctx->sd->diffusion_model->free_params_buffer(); } @@ -3451,6 +4036,12 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s final_latents.size(), (denoise_end - denoise_start) * 1.0f / 1000); + // Smart offload: move diffusion to CPU if VRAM is tight for VAE decode + if (!sd_ctx->sd->free_params_immediately && + sd_ctx->sd->should_offload_diffusion_for_vae(request.width, request.height)) { + sd_ctx->sd->offload_diffusion_model(); + } + if (request.hires.enabled && request.hires.target_width > 0) { LOG_INFO("hires fix: upscaling to %dx%d", request.hires.target_width, request.hires.target_height); @@ -3566,6 +4157,11 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s b + 1, (int)final_latents.size(), (hires_sample_end - hires_sample_start) * 1.0f / 1000); + if (sd_ctx->sd->offload_config.mode == SD_OFFLOAD_LAYER_STREAMING && + sd_ctx->sd->diffusion_model && + sd_ctx->sd->diffusion_model->is_layer_streaming_enabled()) { + sd_ctx->sd->diffusion_model->offload_streaming_layers(); + } if (sd_ctx->sd->free_params_immediately) { sd_ctx->sd->diffusion_model->free_params_buffer(); } @@ -3587,6 +4183,9 @@ SD_API sd_image_t* generate_image(sd_ctx_t* sd_ctx, const sd_img_gen_params_t* s sd_ctx->sd->lora_stat(); + // Post-generation reload of models to GPU + sd_ctx->sd->post_generation_reload(); + int64_t t1 = ggml_time_ms(); LOG_INFO("generate_image completed in %.2fs", (t1 - t0) * 1.0f / 1000); return result; @@ -3656,6 +4255,9 @@ static std::optional prepare_video_generation_latents(sd sd::ops::slice_assign(&image, 2, request->frames - 1, request->frames, end_image.unsqueeze(2)); } + // Smart offload before VAE encode to free VRAM + sd_ctx->sd->smart_offload_for_vae_encode(request->width, request->height); + auto concat_latent = sd_ctx->sd->encode_first_stage(image); // [b, c, t, h/vae_scale_factor, w/vae_scale_factor] if (concat_latent.empty()) { LOG_ERROR("failed to encode video conditioning frames"); @@ -3705,6 +4307,9 @@ static std::optional prepare_video_generation_latents(sd int64_t t1 = ggml_time_ms(); sd::Tensor ref_image_latent; if (!start_image.empty()) { + // Smart offload before VAE encode to free VRAM + sd_ctx->sd->smart_offload_for_vae_encode(request->width, request->height); + auto ref_img = start_image.reshape({start_image.shape()[0], start_image.shape()[1], 1, start_image.shape()[2], 1}); auto encoded_ref = sd_ctx->sd->encode_first_stage(ref_img); // [b, c, 1, h/vae_scale_factor, w/vae_scale_factor] if (encoded_ref.empty()) { @@ -3727,6 +4332,9 @@ static std::optional prepare_video_generation_latents(sd sd::Tensor inactive = control_video * (1.0f - mask) + 0.5f; sd::Tensor reactive = control_video * mask + 0.5f; + // Smart offload before VAE encode to free VRAM + sd_ctx->sd->smart_offload_for_vae_encode(request->width, request->height); + inactive = sd_ctx->sd->encode_first_stage(inactive); // [b, c, t, h/vae_scale_factor, w/vae_scale_factor] if (inactive.empty()) { LOG_ERROR("failed to encode VACE inactive context"); @@ -3786,6 +4394,14 @@ static ImageGenerationEmbeds prepare_video_generation_embeds(sd_ctx_t* sd_ctx, const sd_vid_gen_params_t* sd_vid_gen_params, const GenerationRequest& request, const ImageGenerationLatents& latents) { + // On-demand 
GPU reload for cond_stage before conditioning + if (sd_ctx->sd->offload_config.mode != SD_OFFLOAD_NONE && + sd_ctx->sd->offload_config.offload_cond_stage && + !sd_ctx->sd->free_params_immediately && + !sd_ctx->sd->cond_stage_on_cpu_only) { + sd_ctx->sd->reload_cond_stage_model(); + } + ImageGenerationEmbeds embeds; ConditionerParams condition_params; condition_params.clip_skip = request.clip_skip; @@ -3811,6 +4427,13 @@ static ImageGenerationEmbeds prepare_video_generation_embeds(sd_ctx_t* sd_ctx, if (sd_ctx->sd->free_params_immediately) { sd_ctx->sd->cond_stage_model->free_params_buffer(); } + + // Smart offload: move cond_stage to CPU if VRAM is tight for diffusion sampling + if (!sd_ctx->sd->free_params_immediately && + sd_ctx->sd->should_offload_cond_stage_for_diffusion(request.width, request.height)) { + sd_ctx->sd->offload_conditioners(); + } + return embeds; } @@ -3821,6 +4444,16 @@ static sd_image_t* decode_video_outputs(sd_ctx_t* sd_ctx, LOG_ERROR("no latent video to decode"); return nullptr; } + + // Smart offload before VAE decode + sd_ctx->sd->smart_offload_for_vae(0, 0, true); + + // For layer_streaming mode: offload streaming layers before VAE decode + if (sd_ctx->sd->offload_config.mode == SD_OFFLOAD_LAYER_STREAMING && + sd_ctx->sd->diffusion_model && sd_ctx->sd->diffusion_model->is_layer_streaming_enabled()) { + sd_ctx->sd->diffusion_model->offload_streaming_layers(); + } + int64_t t4 = ggml_time_ms(); sd::Tensor vid = sd_ctx->sd->decode_first_stage(final_latent, true); int64_t t5 = ggml_time_ms(); @@ -3919,6 +4552,11 @@ SD_API sd_image_t* generate_video(sd_ctx_t* sd_ctx, const sd_vid_gen_params_t* s int64_t sampling_end = ggml_time_ms(); if (x_t_sampled.empty()) { LOG_ERROR("sampling(high noise) failed after %.2fs", (sampling_end - sampling_start) * 1.0f / 1000); + if (sd_ctx->sd->offload_config.mode == SD_OFFLOAD_LAYER_STREAMING && + sd_ctx->sd->high_noise_diffusion_model && + sd_ctx->sd->high_noise_diffusion_model->is_layer_streaming_enabled()) { + sd_ctx->sd->high_noise_diffusion_model->offload_streaming_layers(); + } if (sd_ctx->sd->free_params_immediately) { sd_ctx->sd->high_noise_diffusion_model->free_params_buffer(); } @@ -3965,10 +4603,21 @@ SD_API sd_image_t* generate_video(sd_ctx_t* sd_ctx, const sd_vid_gen_params_t* s } if (final_latent.empty()) { LOG_ERROR("sampling failed after %.2fs", (sampling_end - sampling_start) * 1.0f / 1000); + if (sd_ctx->sd->offload_config.mode == SD_OFFLOAD_LAYER_STREAMING && + sd_ctx->sd->diffusion_model && + sd_ctx->sd->diffusion_model->is_layer_streaming_enabled()) { + sd_ctx->sd->diffusion_model->offload_streaming_layers(); + } return nullptr; } LOG_INFO("sampling completed, taking %.2fs", (sampling_end - sampling_start) * 1.0f / 1000); + // Smart offload: move diffusion to CPU if VRAM is tight for VAE decode + if (!sd_ctx->sd->free_params_immediately && + sd_ctx->sd->should_offload_diffusion_for_vae(request.width, request.height)) { + sd_ctx->sd->offload_diffusion_model(); + } + if (latents.ref_image_num > 0) { final_latent = sd::ops::slice(final_latent, 2, latents.ref_image_num, final_latent.shape()[2]); } @@ -3983,7 +4632,336 @@ SD_API sd_image_t* generate_video(sd_ctx_t* sd_ctx, const sd_vid_gen_params_t* s sd_ctx->sd->lora_stat(); + // Post-generation reload of models to GPU + sd_ctx->sd->post_generation_reload(); + int64_t t1 = ggml_time_ms(); LOG_INFO("generate_video completed in %.2fs", (t1 - t0) * 1.0f / 1000); return result; } + +/*================================================ Dynamic Tensor Offloading API 
================================================*/ + +static const char* component_names[] = { + "cond_stage", // SD_COMPONENT_COND_STAGE + "clip_vision", // SD_COMPONENT_CLIP_VISION + "diffusion", // SD_COMPONENT_DIFFUSION + "vae", // SD_COMPONENT_VAE + "control_net", // SD_COMPONENT_CONTROL_NET + "pmid", // SD_COMPONENT_PMID +}; + +const char* sd_component_name(sd_component_t component) { + if (component >= 0 && component < SD_COMPONENT_COUNT) { + return component_names[component]; + } + return "unknown"; +} + +bool sd_offload_to_cpu(sd_ctx_t* sd_ctx, sd_component_t component) { + if (sd_ctx == nullptr || sd_ctx->sd == nullptr) { + return false; + } + + bool success = false; + switch (component) { + case SD_COMPONENT_COND_STAGE: + if (sd_ctx->sd->cond_stage_model) { + success = sd_ctx->sd->cond_stage_model->move_params_to_cpu(); + if (success) { + LOG_INFO("Offloaded %s to CPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_CLIP_VISION: + if (sd_ctx->sd->clip_vision) { + success = sd_ctx->sd->clip_vision->move_params_to_cpu(); + if (success) { + LOG_INFO("Offloaded %s to CPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_DIFFUSION: + if (sd_ctx->sd->diffusion_model) { + success = sd_ctx->sd->diffusion_model->move_params_to_cpu(); + if (success) { + LOG_INFO("Offloaded %s to CPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_VAE: + if (sd_ctx->sd->first_stage_model) { + success = sd_ctx->sd->first_stage_model->move_params_to_cpu(); + if (success) { + LOG_INFO("Offloaded %s to CPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_CONTROL_NET: + if (sd_ctx->sd->control_net) { + success = sd_ctx->sd->control_net->move_params_to_cpu(); + if (success) { + LOG_INFO("Offloaded %s to CPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_PMID: + if (sd_ctx->sd->pmid_model) { + success = sd_ctx->sd->pmid_model->move_params_to_cpu(); + if (success) { + LOG_INFO("Offloaded %s to CPU", sd_component_name(component)); + } + } + break; + default: + LOG_WARN("Unknown component: %d", component); + break; + } + return success; +} + +bool sd_reload_to_gpu(sd_ctx_t* sd_ctx, sd_component_t component) { + if (sd_ctx == nullptr || sd_ctx->sd == nullptr) { + return false; + } + + bool success = false; + switch (component) { + case SD_COMPONENT_COND_STAGE: + if (sd_ctx->sd->cond_stage_model) { + success = sd_ctx->sd->cond_stage_model->move_params_to_gpu(); + if (success) { + LOG_INFO("Reloaded %s to GPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_CLIP_VISION: + if (sd_ctx->sd->clip_vision) { + success = sd_ctx->sd->clip_vision->move_params_to_gpu(); + if (success) { + LOG_INFO("Reloaded %s to GPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_DIFFUSION: + if (sd_ctx->sd->diffusion_model) { + success = sd_ctx->sd->diffusion_model->move_params_to_gpu(); + if (success) { + LOG_INFO("Reloaded %s to GPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_VAE: + if (sd_ctx->sd->first_stage_model) { + success = sd_ctx->sd->first_stage_model->move_params_to_gpu(); + if (success) { + LOG_INFO("Reloaded %s to GPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_CONTROL_NET: + if (sd_ctx->sd->control_net) { + success = sd_ctx->sd->control_net->move_params_to_gpu(); + if (success) { + LOG_INFO("Reloaded %s to GPU", sd_component_name(component)); + } + } + break; + case SD_COMPONENT_PMID: + if (sd_ctx->sd->pmid_model) { + success = 
sd_ctx->sd->pmid_model->move_params_to_gpu(); + if (success) { + LOG_INFO("Reloaded %s to GPU", sd_component_name(component)); + } + } + break; + default: + LOG_WARN("Unknown component: %d", component); + break; + } + return success; +} + +bool sd_is_on_gpu(sd_ctx_t* sd_ctx, sd_component_t component) { + if (sd_ctx == nullptr || sd_ctx->sd == nullptr) { + return false; + } + + switch (component) { + case SD_COMPONENT_COND_STAGE: + if (sd_ctx->sd->cond_stage_model) { + return sd_ctx->sd->cond_stage_model->is_params_on_gpu(); + } + break; + case SD_COMPONENT_CLIP_VISION: + if (sd_ctx->sd->clip_vision) { + return sd_ctx->sd->clip_vision->is_params_on_gpu(); + } + break; + case SD_COMPONENT_DIFFUSION: + if (sd_ctx->sd->diffusion_model) { + return sd_ctx->sd->diffusion_model->is_params_on_gpu(); + } + break; + case SD_COMPONENT_VAE: + if (sd_ctx->sd->first_stage_model) { + return sd_ctx->sd->first_stage_model->is_params_on_gpu(); + } + break; + case SD_COMPONENT_CONTROL_NET: + if (sd_ctx->sd->control_net) { + return sd_ctx->sd->control_net->is_params_on_gpu(); + } + break; + case SD_COMPONENT_PMID: + if (sd_ctx->sd->pmid_model) { + return sd_ctx->sd->pmid_model->is_params_on_gpu(); + } + break; + default: + break; + } + return false; +} + +size_t sd_get_component_vram(sd_ctx_t* sd_ctx, sd_component_t component) { + if (sd_ctx == nullptr || sd_ctx->sd == nullptr) { + return 0; + } + + switch (component) { + case SD_COMPONENT_COND_STAGE: + if (sd_ctx->sd->cond_stage_model) { + return sd_ctx->sd->cond_stage_model->get_params_vram_size(); + } + break; + case SD_COMPONENT_CLIP_VISION: + if (sd_ctx->sd->clip_vision) { + return sd_ctx->sd->clip_vision->get_params_vram_size(); + } + break; + case SD_COMPONENT_DIFFUSION: + if (sd_ctx->sd->diffusion_model) { + return sd_ctx->sd->diffusion_model->get_params_vram_size(); + } + break; + case SD_COMPONENT_VAE: + if (sd_ctx->sd->first_stage_model) { + return sd_ctx->sd->first_stage_model->get_params_vram_size(); + } + break; + case SD_COMPONENT_CONTROL_NET: + if (sd_ctx->sd->control_net) { + return sd_ctx->sd->control_net->get_params_vram_size(); + } + break; + case SD_COMPONENT_PMID: + if (sd_ctx->sd->pmid_model) { + return sd_ctx->sd->pmid_model->get_params_vram_size(); + } + break; + default: + break; + } + return 0; +} + +void sd_free_gpu_resources(sd_ctx_t* sd_ctx) { + if (sd_ctx == nullptr || sd_ctx->sd == nullptr) { + return; + } + + LOG_WARN("[Cleanup] Freeing all GPU resources before unload"); + + size_t total_freed = 0; + + // Helper macro to free component GPU memory + #define FREE_COMPONENT_GPU(model_ptr, name) do { \ + auto* model = (model_ptr); \ + if (model) { \ + size_t size = model->get_params_vram_size(); \ + if (size == 0) size = model->get_params_buffer_size(); \ + if (size > 0) { \ + if (!model->move_params_to_cpu()) { \ + model->free_params_buffer(); \ + LOG_WARN("[Cleanup] %s freed GPU buffer (%.2f MB) - no offload backend", name, size / (1024.0f * 1024.0f)); \ + } else { \ + LOG_WARN("[Cleanup] %s offloaded to CPU (%.2f MB)", name, size / (1024.0f * 1024.0f)); \ + } \ + total_freed += size; \ + } \ + } \ + } while(0) + + // Free all model components + FREE_COMPONENT_GPU(sd_ctx->sd->cond_stage_model.get(), "cond_stage"); + FREE_COMPONENT_GPU(sd_ctx->sd->diffusion_model.get(), "diffusion"); + FREE_COMPONENT_GPU(sd_ctx->sd->high_noise_diffusion_model.get(), "high_noise_diffusion"); + FREE_COMPONENT_GPU(sd_ctx->sd->first_stage_model.get(), "VAE"); + FREE_COMPONENT_GPU(sd_ctx->sd->control_net.get(), "ControlNet"); + 
FREE_COMPONENT_GPU(sd_ctx->sd->clip_vision.get(), "CLIP_Vision"); + FREE_COMPONENT_GPU(sd_ctx->sd->pmid_model.get(), "PhotoMaker"); + + #undef FREE_COMPONENT_GPU + + // Clear LoRA models to free their GPU buffers + size_t lora_freed = 0; + for (auto& lora : sd_ctx->sd->cond_stage_lora_models) { + if (lora) { + size_t size = lora->get_params_buffer_size(); + if (size > 0) { + if (!lora->move_params_to_cpu()) { + lora->free_params_buffer(); + } + lora_freed += size; + } + } + } + for (auto& lora : sd_ctx->sd->diffusion_lora_models) { + if (lora) { + size_t size = lora->get_params_buffer_size(); + if (size > 0) { + if (!lora->move_params_to_cpu()) { + lora->free_params_buffer(); + } + lora_freed += size; + } + } + } + for (auto& lora : sd_ctx->sd->first_stage_lora_models) { + if (lora) { + size_t size = lora->get_params_buffer_size(); + if (size > 0) { + if (!lora->move_params_to_cpu()) { + lora->free_params_buffer(); + } + lora_freed += size; + } + } + } + if (sd_ctx->sd->pmid_lora) { + size_t size = sd_ctx->sd->pmid_lora->get_params_buffer_size(); + if (size > 0) { + if (!sd_ctx->sd->pmid_lora->move_params_to_cpu()) { + sd_ctx->sd->pmid_lora->free_params_buffer(); + } + lora_freed += size; + } + } + if (lora_freed > 0) { + total_freed += lora_freed; + LOG_WARN("[Cleanup] LoRAs freed (%.2f MB)", lora_freed / (1024.0f * 1024.0f)); + } + + // Clear LoRA vectors entirely to trigger destructor cleanup + sd_ctx->sd->cond_stage_lora_models.clear(); + sd_ctx->sd->diffusion_lora_models.clear(); + sd_ctx->sd->first_stage_lora_models.clear(); + + // Synchronize CUDA to ensure all deallocations complete +#ifdef SD_USE_CUDA + cudaDeviceSynchronize(); +#endif + + LOG_WARN("[Cleanup] GPU resources freed, total: %.2f MB", total_freed / (1024.0f * 1024.0f)); +} diff --git a/src/tensor_registry.hpp b/src/tensor_registry.hpp new file mode 100644 index 000000000..cde9513fd --- /dev/null +++ b/src/tensor_registry.hpp @@ -0,0 +1,631 @@ +#ifndef __TENSOR_REGISTRY_HPP__ +#define __TENSOR_REGISTRY_HPP__ + +#include +#include +#include +#include +#include +#include + +#include "ggml-alloc.h" +#include "ggml-backend.h" +#include "ggml.h" + +#include "util.h" + +namespace LayerStreaming { + +struct TensorInfo { + ggml_tensor* gpu_tensor = nullptr; + ggml_tensor* cpu_tensor = nullptr; + size_t size_bytes = 0; + bool on_gpu = false; + int layer_index = -1; + std::string layer_name; + uint64_t last_access = 0; +}; + +struct LayerInfo { + std::string name; + int index = -1; + std::vector tensor_names; + size_t total_size_bytes = 0; + bool on_gpu = false; + ggml_backend_buffer_t gpu_buffer = nullptr; +}; + +// Tracks in-flight async transfers +struct AsyncLoadState { + struct CopyInfo { + std::string name; + ggml_tensor* cpu_tensor; + ggml_tensor* gpu_tensor; + }; + + ggml_context* temp_ctx = nullptr; + ggml_backend_buffer_t gpu_buffer = nullptr; + std::vector copy_list; + int64_t start_time = 0; +}; + +class TensorRegistry { +public: + TensorRegistry(ggml_backend_t gpu_backend, ggml_backend_t cpu_backend) + : gpu_backend_(gpu_backend), cpu_backend_(cpu_backend) {} + + ~TensorRegistry() { + clear(); + } + + void register_tensor(const std::string& name, + ggml_tensor* cpu_tensor, + const std::string& layer_name, + int layer_index) { + TensorInfo info; + info.cpu_tensor = cpu_tensor; + info.gpu_tensor = nullptr; + info.size_bytes = ggml_nbytes(cpu_tensor); + info.on_gpu = false; + info.layer_index = layer_index; + info.layer_name = layer_name; + info.last_access = 0; + + tensors_[name] = info; + + if 
(layers_.find(layer_name) == layers_.end()) { + LayerInfo layer_info; + layer_info.name = layer_name; + layer_info.index = layer_index; + layer_info.total_size_bytes = 0; + layer_info.on_gpu = false; + layer_info.gpu_buffer = nullptr; + layers_[layer_name] = layer_info; + } + layers_[layer_name].tensor_names.push_back(name); + layers_[layer_name].total_size_bytes += info.size_bytes; + } + + // Only works if tensor names are set with ggml_set_name() + void register_from_context(ggml_context* ctx, + const std::string& prefix, + std::function(const std::string&)> layer_pattern_fn) { + for (ggml_tensor* t = ggml_get_first_tensor(ctx); t != nullptr; t = ggml_get_next_tensor(ctx, t)) { + std::string name = ggml_get_name(t); + auto [layer_name, layer_index] = layer_pattern_fn(name); + register_tensor(name, t, layer_name, layer_index); + } + } + + // Preferred method: tensor names are properly preserved in the map keys + void register_from_map(const std::map& tensors, + std::function(const std::string&)> layer_pattern_fn) { + for (const auto& [name, tensor] : tensors) { + auto [layer_name, layer_index] = layer_pattern_fn(name); + register_tensor(name, tensor, layer_name, layer_index); + } + } + + bool move_layer_to_gpu(const std::string& layer_name) { + auto it = layers_.find(layer_name); + if (it == layers_.end()) { + LOG_ERROR("layer '%s' not found", layer_name.c_str()); + return false; + } + + LayerInfo& layer = it->second; + if (layer.on_gpu) { + return true; + } + + int64_t t0 = ggml_time_ms(); + + size_t ctx_size = layer.tensor_names.size() * ggml_tensor_overhead() + 1024; + struct ggml_init_params ctx_params = { + ctx_size, + nullptr, + true, + }; + ggml_context* temp_ctx = ggml_init(ctx_params); + if (temp_ctx == nullptr) { + LOG_ERROR("failed to create temp context for layer '%s'", layer_name.c_str()); + return false; + } + + // Can't rely on ggml_get_name() because GGMLBlock doesn't call ggml_set_name() + struct CopyInfo { + std::string name; + ggml_tensor* cpu_tensor; + ggml_tensor* gpu_tensor; + }; + std::vector copy_list; + + for (const auto& tensor_name : layer.tensor_names) { + TensorInfo& info = tensors_[tensor_name]; + if (info.on_gpu) { + continue; + } + + ggml_tensor* gpu_tensor = ggml_dup_tensor(temp_ctx, info.cpu_tensor); + ggml_set_name(gpu_tensor, tensor_name.c_str()); + copy_list.push_back({tensor_name, info.cpu_tensor, gpu_tensor}); + } + + if (copy_list.empty()) { + ggml_free(temp_ctx); + layer.on_gpu = true; + return true; + } + + layer.gpu_buffer = ggml_backend_alloc_ctx_tensors(temp_ctx, gpu_backend_); + if (layer.gpu_buffer == nullptr) { + LOG_ERROR("failed to allocate GPU buffer for layer '%s'", layer_name.c_str()); + ggml_free(temp_ctx); + return false; + } + + for (auto& item : copy_list) { + ggml_backend_tensor_copy(item.cpu_tensor, item.gpu_tensor); + } + ggml_backend_synchronize(gpu_backend_); + + for (auto& item : copy_list) { + TensorInfo& info = tensors_[item.name]; + info.gpu_tensor = item.gpu_tensor; + info.on_gpu = true; + info.last_access = access_counter_++; + + // Swap pointers so the original tensor now points to GPU memory + std::swap(item.cpu_tensor->buffer, item.gpu_tensor->buffer); + std::swap(item.cpu_tensor->data, item.gpu_tensor->data); + std::swap(item.cpu_tensor->extra, item.gpu_tensor->extra); + } + + layer.on_gpu = true; + current_gpu_usage_ += layer.total_size_bytes; + layer_contexts_[layer_name] = temp_ctx; + + return true; + } + + void move_layer_to_cpu(const std::string& layer_name) { + auto it = layers_.find(layer_name); + if (it == 
layers_.end()) { + return; + } + + LayerInfo& layer = it->second; + if (!layer.on_gpu) { + return; + } + + for (const auto& tensor_name : layer.tensor_names) { + TensorInfo& info = tensors_[tensor_name]; + if (!info.on_gpu || info.gpu_tensor == nullptr) { + continue; + } + + std::swap(info.cpu_tensor->buffer, info.gpu_tensor->buffer); + std::swap(info.cpu_tensor->data, info.gpu_tensor->data); + std::swap(info.cpu_tensor->extra, info.gpu_tensor->extra); + + info.gpu_tensor = nullptr; + info.on_gpu = false; + } + + if (layer.gpu_buffer != nullptr) { + ggml_backend_buffer_free(layer.gpu_buffer); + layer.gpu_buffer = nullptr; + } + + auto ctx_it = layer_contexts_.find(layer_name); + if (ctx_it != layer_contexts_.end()) { + ggml_free(ctx_it->second); + layer_contexts_.erase(ctx_it); + } + + current_gpu_usage_ -= layer.total_size_bytes; + layer.on_gpu = false; + } + + bool is_layer_on_gpu(const std::string& layer_name) const { + auto it = layers_.find(layer_name); + if (it == layers_.end()) { + return false; + } + return it->second.on_gpu; + } + + size_t get_layer_size(const std::string& layer_name) const { + auto it = layers_.find(layer_name); + if (it == layers_.end()) { + return 0; + } + return it->second.total_size_bytes; + } + + size_t get_gpu_usage() const { + return current_gpu_usage_; + } + + std::vector get_layer_names_sorted() const { + std::vector> indexed_layers; + for (const auto& [name, info] : layers_) { + indexed_layers.push_back({info.index, name}); + } + std::sort(indexed_layers.begin(), indexed_layers.end()); + + std::vector result; + for (const auto& [idx, name] : indexed_layers) { + result.push_back(name); + } + return result; + } + + std::vector get_layers_on_gpu() const { + std::vector result; + for (const auto& [name, info] : layers_) { + if (info.on_gpu) { + result.push_back(name); + } + } + return result; + } + + size_t get_layer_count() const { + return layers_.size(); + } + + // Initiates transfer without waiting; call complete_async_layer_load() to finalize + bool start_async_layer_load(const std::string& layer_name, + ggml_backend_t gpu_backend, + ggml_backend_t cpu_backend) { + auto it = layers_.find(layer_name); + if (it == layers_.end()) { + LOG_ERROR("layer '%s' not found for async load", layer_name.c_str()); + return false; + } + + LayerInfo& layer = it->second; + if (layer.on_gpu) { + return true; + } + + if (async_loading_layers_.find(layer_name) != async_loading_layers_.end()) { + return true; + } + + int64_t t0 = ggml_time_ms(); + + size_t ctx_size = layer.tensor_names.size() * ggml_tensor_overhead() + 1024; + struct ggml_init_params ctx_params = { + ctx_size, + nullptr, + true, + }; + ggml_context* temp_ctx = ggml_init(ctx_params); + if (temp_ctx == nullptr) { + LOG_ERROR("failed to create temp context for async load of layer '%s'", layer_name.c_str()); + return false; + } + + std::vector copy_list; + + for (const auto& tensor_name : layer.tensor_names) { + TensorInfo& info = tensors_[tensor_name]; + if (info.on_gpu) { + continue; + } + + ggml_tensor* gpu_tensor = ggml_dup_tensor(temp_ctx, info.cpu_tensor); + ggml_set_name(gpu_tensor, tensor_name.c_str()); + copy_list.push_back({tensor_name, info.cpu_tensor, gpu_tensor}); + } + + if (copy_list.empty()) { + ggml_free(temp_ctx); + layer.on_gpu = true; + return true; + } + + ggml_backend_buffer_t buffer = ggml_backend_alloc_ctx_tensors(temp_ctx, gpu_backend); + if (buffer == nullptr) { + LOG_ERROR("failed to allocate GPU buffer for async load of layer '%s'", layer_name.c_str()); + ggml_free(temp_ctx); + 
return false; + } + + // May fall back to sync for CPU->CUDA + for (auto& item : copy_list) { + ggml_backend_tensor_copy_async(cpu_backend, gpu_backend, item.cpu_tensor, item.gpu_tensor); + } + + AsyncLoadState state; + state.temp_ctx = temp_ctx; + state.gpu_buffer = buffer; + state.copy_list = std::move(copy_list); + state.start_time = t0; + + async_loading_layers_[layer_name] = std::move(state); + + return true; + } + + // Waits for pending async transfers and finalizes the layer state + bool complete_async_layer_load(const std::string& layer_name, + ggml_backend_t gpu_backend) { + auto async_it = async_loading_layers_.find(layer_name); + if (async_it == async_loading_layers_.end()) { + // Not in async loading - check if already on GPU + auto layer_it = layers_.find(layer_name); + if (layer_it != layers_.end() && layer_it->second.on_gpu) { + return true; + } + return false; + } + + AsyncLoadState& state = async_it->second; + auto layer_it = layers_.find(layer_name); + if (layer_it == layers_.end()) { + ggml_backend_buffer_free(state.gpu_buffer); + ggml_free(state.temp_ctx); + async_loading_layers_.erase(async_it); + return false; + } + + LayerInfo& layer = layer_it->second; + + ggml_backend_synchronize(gpu_backend); + + for (auto& item : state.copy_list) { + TensorInfo& info = tensors_[item.name]; + info.gpu_tensor = item.gpu_tensor; + info.on_gpu = true; + info.last_access = access_counter_++; + + std::swap(item.cpu_tensor->buffer, item.gpu_tensor->buffer); + std::swap(item.cpu_tensor->data, item.gpu_tensor->data); + std::swap(item.cpu_tensor->extra, item.gpu_tensor->extra); + } + + layer.on_gpu = true; + layer.gpu_buffer = state.gpu_buffer; + current_gpu_usage_ += layer.total_size_bytes; + layer_contexts_[layer_name] = state.temp_ctx; + + async_loading_layers_.erase(async_it); + return true; + } + + bool is_layer_async_loading(const std::string& layer_name) const { + return async_loading_layers_.find(layer_name) != async_loading_layers_.end(); + } + + void clear() { + for (auto& [name, state] : async_loading_layers_) { + if (state.gpu_buffer) { + ggml_backend_buffer_free(state.gpu_buffer); + } + if (state.temp_ctx) { + ggml_free(state.temp_ctx); + } + } + async_loading_layers_.clear(); + + for (auto& [name, layer] : layers_) { + if (layer.on_gpu) { + move_layer_to_cpu(name); + } + } + + for (auto& [name, ctx] : layer_contexts_) { + ggml_free(ctx); + } + + tensors_.clear(); + layers_.clear(); + layer_contexts_.clear(); + current_gpu_usage_ = 0; + } + +private: + ggml_backend_t gpu_backend_; + ggml_backend_t cpu_backend_; + + std::unordered_map tensors_; + std::unordered_map layers_; + std::unordered_map layer_contexts_; + std::unordered_map async_loading_layers_; + + size_t current_gpu_usage_ = 0; + uint64_t access_counter_ = 0; +}; + +// Extract Flux layer info: double_blocks.N, single_blocks.N, or _global +inline std::pair flux_layer_pattern(const std::string& tensor_name) { + size_t db_pos = tensor_name.find("double_blocks."); + if (db_pos != std::string::npos) { + size_t num_start = db_pos + 14; // Length of "double_blocks." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"double_blocks." + num_str, block_idx}; + } + + size_t sb_pos = tensor_name.find("single_blocks."); + if (sb_pos != std::string::npos) { + size_t num_start = sb_pos + 14; // Length of "single_blocks." 
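+        // Worked example (hypothetical tensor name): "single_blocks.5.linear1.weight"
+        // resolves to layer "single_blocks.5" with global index 19 + 5 = 24, so when
+        // this pattern function is handed to TensorRegistry::register_from_map(),
+        // single blocks sort after the 19 double blocks in get_layer_names_sorted().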
+ size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + // Offset past 19 double_blocks + return {"single_blocks." + num_str, 19 + block_idx}; + } + + return {"_global", -1}; +} + +// Extract UNet layer info: input_blocks.N, middle_block, output_blocks.N, or _global +inline std::pair unet_layer_pattern(const std::string& tensor_name) { + size_t ib_pos = tensor_name.find("input_blocks."); + if (ib_pos != std::string::npos) { + size_t num_start = ib_pos + 13; // Length of "input_blocks." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"input_blocks." + num_str, block_idx}; + } + + if (tensor_name.find("middle_block") != std::string::npos) { + return {"middle_block", 100}; + } + + size_t ob_pos = tensor_name.find("output_blocks."); + if (ob_pos != std::string::npos) { + size_t num_start = ob_pos + 14; // Length of "output_blocks." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"output_blocks." + num_str, 200 + block_idx}; + } + + return {"_global", -1}; +} + +// Extract MMDiT layer info: joint_blocks.N, or _global +inline std::pair mmdit_layer_pattern(const std::string& tensor_name) { + size_t jb_pos = tensor_name.find("joint_blocks."); + if (jb_pos != std::string::npos) { + size_t num_start = jb_pos + 13; // Length of "joint_blocks." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"joint_blocks." + num_str, block_idx}; + } + + return {"_global", -1}; +} + +// Extract WAN layer info: blocks.N, vace_blocks.N, or _global +inline std::pair wan_layer_pattern(const std::string& tensor_name) { + size_t b_pos = tensor_name.find("blocks."); + // Exclude "vace_blocks" matches + if (b_pos != std::string::npos && (b_pos == 0 || tensor_name[b_pos - 1] != '_')) { + size_t num_start = b_pos + 7; // Length of "blocks." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"blocks." + num_str, block_idx}; + } + + size_t vb_pos = tensor_name.find("vace_blocks."); + if (vb_pos != std::string::npos) { + size_t num_start = vb_pos + 12; // Length of "vace_blocks." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"vace_blocks." 
+ num_str, 100 + block_idx}; + } + + return {"_global", -1}; +} + +// Extract QwenImage layer info: transformer_blocks.N, or _global +inline std::pair qwen_image_layer_pattern(const std::string& tensor_name) { + size_t tb_pos = tensor_name.find("transformer_blocks."); + if (tb_pos != std::string::npos) { + size_t num_start = tb_pos + 19; // Length of "transformer_blocks." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"transformer_blocks." + num_str, block_idx}; + } + + return {"_global", -1}; +} + +// Extract ZImage layer info: context_refiner.N, noise_refiner.N, layers.N, or _global +inline std::pair zimage_layer_pattern(const std::string& tensor_name) { + size_t cr_pos = tensor_name.find("context_refiner."); + if (cr_pos != std::string::npos) { + size_t num_start = cr_pos + 16; // Length of "context_refiner." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"context_refiner." + num_str, block_idx}; + } + + size_t nr_pos = tensor_name.find("noise_refiner."); + if (nr_pos != std::string::npos) { + size_t num_start = nr_pos + 14; // Length of "noise_refiner." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"noise_refiner." + num_str, 10 + block_idx}; + } + + size_t l_pos = tensor_name.find("layers."); + if (l_pos != std::string::npos) { + size_t num_start = l_pos + 7; // Length of "layers." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"layers." + num_str, 100 + block_idx}; + } + + return {"_global", -1}; +} + +// Extract Anima layer info: blocks.N (from net.blocks.N), or _global +inline std::pair anima_layer_pattern(const std::string& tensor_name) { + size_t nb_pos = tensor_name.find("net.blocks."); + if (nb_pos != std::string::npos) { + size_t num_start = nb_pos + 11; // Length of "net.blocks." + size_t num_end = tensor_name.find('.', num_start); + if (num_end == std::string::npos) { + num_end = tensor_name.length(); + } + std::string num_str = tensor_name.substr(num_start, num_end - num_start); + int block_idx = std::stoi(num_str); + return {"blocks." 
+ num_str, block_idx}; + } + + return {"_global", -1}; +} + +} // namespace LayerStreaming + +#endif // __TENSOR_REGISTRY_HPP__ diff --git a/src/unet.hpp b/src/unet.hpp index d7ea8c3fa..008d2f2b2 100644 --- a/src/unet.hpp +++ b/src/unet.hpp @@ -2,6 +2,7 @@ #define __UNET_HPP__ #include "common_block.hpp" +#include "layer_streaming.hpp" #include "model.h" /*==================================================== UnetModel =====================================================*/ @@ -597,6 +598,160 @@ class UnetModelBlock : public GGMLBlock { ggml_set_name(h, "bench-end"); return h; // [N, out_channels, h, w] } + + ggml_tensor* forward_embedding_stage(GGMLRunnerContext* ctx, + struct ggml_tensor* timesteps, + struct ggml_tensor* label) { + auto time_embed_0 = std::dynamic_pointer_cast(blocks["time_embed.0"]); + auto time_embed_2 = std::dynamic_pointer_cast(blocks["time_embed.2"]); + + auto emb = ggml_ext_timestep_embedding(ctx->ggml_ctx, timesteps, model_channels); + emb = time_embed_0->forward(ctx, emb); + emb = ggml_silu_inplace(ctx->ggml_ctx, emb); + emb = time_embed_2->forward(ctx, emb); + + if (label != nullptr && adm_in_channels != -1) { + auto label_embed_0 = std::dynamic_pointer_cast(blocks["label_emb.0.0"]); + auto label_embed_2 = std::dynamic_pointer_cast(blocks["label_emb.0.2"]); + + auto label_emb = label_embed_0->forward(ctx, label); + label_emb = ggml_silu_inplace(ctx->ggml_ctx, label_emb); + label_emb = label_embed_2->forward(ctx, label_emb); + + emb = ggml_add(ctx->ggml_ctx, emb, label_emb); + } + + return emb; + } + + ggml_tensor* forward_initial_conv(GGMLRunnerContext* ctx, struct ggml_tensor* x) { + auto input_blocks_0_0 = std::dynamic_pointer_cast(blocks["input_blocks.0.0"]); + return input_blocks_0_0->forward(ctx, x); + } + + ggml_tensor* forward_input_block(GGMLRunnerContext* ctx, + int block_idx, + struct ggml_tensor* h, + struct ggml_tensor* emb, + struct ggml_tensor* context, + int num_video_frames) { + // input_blocks.X.0 is either a ResBlock or a DownSampleBlock — + // SDXL/SD1.x put the per-stage downsample at indices 3 and 6. The + // non-streaming forward() differentiates these inline; the streaming + // path does the same here. + std::string slot0_name = "input_blocks." + std::to_string(block_idx) + ".0"; + auto slot0_it = blocks.find(slot0_name); + if (slot0_it != blocks.end()) { + if (auto downsample = std::dynamic_pointer_cast(slot0_it->second)) { + h = downsample->forward(ctx, h); + } else { + h = resblock_forward(slot0_name, ctx, h, emb, num_video_frames); + } + } + + // input_blocks.X.1 is a SpatialTransformer when attention applies at this resolution. + std::string attn_name = "input_blocks." 
+ std::to_string(block_idx) + ".1"; + auto attn_block = blocks.find(attn_name); + if (attn_block != blocks.end()) { + h = attention_layer_forward(attn_name, ctx, h, context, num_video_frames); + } + + return h; + } + + ggml_tensor* forward_middle_block(GGMLRunnerContext* ctx, + struct ggml_tensor* h, + struct ggml_tensor* emb, + struct ggml_tensor* context, + int num_video_frames) { + h = resblock_forward("middle_block.0", ctx, h, emb, num_video_frames); + if (version == VERSION_SD1 || version == VERSION_SD2 || version == VERSION_SVD) { + h = attention_layer_forward("middle_block.1", ctx, h, context, num_video_frames); + h = resblock_forward("middle_block.2", ctx, h, emb, num_video_frames); + } + return h; + } + + ggml_tensor* forward_output_block(GGMLRunnerContext* ctx, + int block_idx, + struct ggml_tensor* h, + struct ggml_tensor* skip, + struct ggml_tensor* emb, + struct ggml_tensor* context, + int num_video_frames) { + h = ggml_concat(ctx->ggml_ctx, h, skip, 2); + + std::string res_name = "output_blocks." + std::to_string(block_idx) + ".0"; + h = resblock_forward(res_name, ctx, h, emb, num_video_frames); + + // output_blocks.X.1/.2 may be SpatialTransformer (attention), UpSampleBlock, + // or both: when the resolution has attention, slot .1 = transformer and + // slot .2 = upsample; without attention, slot .1 = upsample. Dispatch + // by actual block type so SD1.x's deepest output block (no attention) + // doesn't end up casting an UpSampleBlock to a SpatialTransformer. + for (int i = 1; i <= 2; i++) { + std::string slot_name = "output_blocks." + std::to_string(block_idx) + "." + std::to_string(i); + auto slot_it = blocks.find(slot_name); + if (slot_it == blocks.end()) { + continue; + } + if (auto upsample = std::dynamic_pointer_cast(slot_it->second)) { + h = upsample->forward(ctx, h); + } else { + h = attention_layer_forward(slot_name, ctx, h, context, num_video_frames); + } + } + + return h; + } + + ggml_tensor* forward_output_stage(GGMLRunnerContext* ctx, struct ggml_tensor* h) { + auto out_0 = std::dynamic_pointer_cast(blocks["out.0"]); + auto out_2 = std::dynamic_pointer_cast(blocks["out.2"]); + + h = out_0->forward(ctx, h); + h = ggml_silu_inplace(ctx->ggml_ctx, h); + h = out_2->forward(ctx, h); + + return h; + } + + // Walk the blocks map to find the largest "input_blocks.N.0" index that + // actually exists, then return N+1 so callers can iterate [0, count). + // SDXL ends at 8 (9 total), SD1/SD2 at 11 (12 total), tiny_unet has gaps + // — the streaming loop treats missing indices as "skip" via blocks.find(). + int get_num_input_blocks() const { + return count_blocks_with_prefix("input_blocks."); + } + int get_num_output_blocks() const { + return count_blocks_with_prefix("output_blocks."); + } + +private: + int count_blocks_with_prefix(const std::string& prefix) const { + int max_idx = -1; + for (const auto& kv : blocks) { + const std::string& name = kv.first; + if (name.compare(0, prefix.size(), prefix) != 0) { + continue; + } + // name looks like "input_blocks.N.M"; extract N + size_t i_start = prefix.size(); + size_t i_end = name.find('.', i_start); + if (i_end == std::string::npos) { + continue; + } + try { + int idx = std::stoi(name.substr(i_start, i_end - i_start)); + if (idx > max_idx) max_idx = idx; + } catch (...) 
{ + continue; + } + } + return max_idx + 1; + } + +public: }; struct UNetModelRunner : public GGMLRunner { @@ -615,6 +770,399 @@ struct UNetModelRunner : public GGMLRunner { return "unet"; } + // UNet needs keep_layers_behind=12 for skip connections + void enable_layer_streaming(const LayerStreaming::StreamingConfig& config = {}) { + LayerStreaming::StreamingConfig cfg = config; + cfg.keep_layers_behind = 12; + std::map tensor_map; + unet.get_param_tensors(tensor_map, "model.diffusion_model"); + init_streaming(cfg, tensor_map, LayerStreaming::unet_layer_pattern); + LOG_INFO("%s layer streaming enabled (coarse-stage mode)", get_desc().c_str()); + } + + bool compute_streaming(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* c_concat, + ggml_tensor* y, + int num_video_frames = -1, + std::vector controls = {}, + float control_strength = 0.f, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr) { + if (!is_streaming_enabled()) { + LOG_WARN("%s streaming not enabled, falling back to regular compute", get_desc().c_str()); + return compute(n_threads, x, timesteps, context, c_concat, y, + num_video_frames, controls, control_strength, output, output_ctx); + } + + int64_t t0 = ggml_time_ms(); + auto analysis = analyze_vram_budget(); + + if (analysis.fits_in_vram) { + LOG_INFO("%s model fits in VRAM, using coarse-stage streaming", get_desc().c_str()); + load_all_layers_coarse(); + bool result = compute(n_threads, x, timesteps, context, c_concat, y, + num_video_frames, controls, control_strength, output, output_ctx, + /*skip_param_offload=*/true); + int64_t t1 = ggml_time_ms(); + LOG_INFO("%s coarse-stage streaming completed in %.2fs", get_desc().c_str(), (t1 - t0) / 1000.0); + free_compute_buffer(); + return result; + } + + LOG_INFO("%s remaining %.2f GB exceeds available %.2f GB, using per-layer streaming", + get_desc().c_str(), + analysis.remaining_to_load / (1024.0 * 1024.0 * 1024.0), + analysis.available_vram / (1024.0 * 1024.0 * 1024.0)); + + return compute_streaming_true(n_threads, x, timesteps, context, c_concat, y, + num_video_frames, controls, control_strength, output, output_ctx); + } + + bool compute_streaming_true(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* c_concat = nullptr, + ggml_tensor* y = nullptr, + int num_video_frames = -1, + std::vector controls = {}, + float control_strength = 0.f, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr) { + auto& registry = streaming_engine_->get_registry(); + int64_t t_start = ggml_time_ms(); + + const int num_input_blocks = unet.get_num_input_blocks(); + const int num_output_blocks = unet.get_num_output_blocks(); + + LOG_INFO("TRUE per-layer streaming - %d input, 1 middle, %d output blocks", + num_input_blocks, num_output_blocks); + + // Load global layers + if (!registry.move_layer_to_gpu("_global")) { + LOG_ERROR("Failed to load _global to GPU"); + return false; + } + + // Skip connections storage - stores each input block's output + std::vector> skip_connections(num_input_blocks); + std::vector> skip_ne(num_input_blocks); + + // Persistent storage for current h and emb + std::vector persistent_h; + std::vector persistent_emb; + int64_t h_ne[4], emb_ne[4]; + + // Handle c_concat + ggml_tensor* actual_x = x; + if (c_concat != nullptr) { + // For now, handle c_concat in input stage + } + + LOG_DEBUG("Computing embeddings"); + { + ggml_tensor* emb_output = nullptr; + + auto get_emb_graph = [&]() -> 
ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(UNET_GRAPH_SIZE / 8); + auto runner_ctx = get_context(); + + ggml_tensor* timesteps_b = to_backend(timesteps); + ggml_tensor* y_b = y ? to_backend(y) : nullptr; + + emb_output = unet.forward_embedding_stage(&runner_ctx, timesteps_b, y_b); + ggml_build_forward_expand(gf, emb_output); + + return gf; + }; + + if (!GGMLRunner::compute(get_emb_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Embedding stage failed"); + return false; + } + + // Extract emb + size_t emb_size = ggml_nelements(emb_output); + persistent_emb.resize(emb_size); + ggml_backend_tensor_get(emb_output, persistent_emb.data(), 0, emb_size * sizeof(float)); + for (int i = 0; i < 4; i++) emb_ne[i] = emb_output->ne[i]; + + free_compute_buffer(); + } + + LOG_DEBUG("Processing input blocks"); + { + ggml_tensor* h_output = nullptr; + + // Initial conv + auto get_init_graph = [&]() -> ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(UNET_GRAPH_SIZE / 8); + auto runner_ctx = get_context(); + + ggml_tensor* x_b = to_backend(x); + if (c_concat != nullptr) { + ggml_tensor* c_b = to_backend(c_concat); + x_b = ggml_concat(compute_ctx, x_b, c_b, 2); + } + + h_output = unet.forward_initial_conv(&runner_ctx, x_b); + ggml_build_forward_expand(gf, h_output); + + return gf; + }; + + if (!GGMLRunner::compute(get_init_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Initial conv failed"); + return false; + } + + // Save skip connection 0 + size_t h_size = ggml_nelements(h_output); + skip_connections[0].resize(h_size); + ggml_backend_tensor_get(h_output, skip_connections[0].data(), 0, h_size * sizeof(float)); + for (int i = 0; i < 4; i++) { + skip_ne[0][i] = h_output->ne[i]; + h_ne[i] = h_output->ne[i]; + } + persistent_h.resize(h_size); + ggml_backend_tensor_get(h_output, persistent_h.data(), 0, h_size * sizeof(float)); + + free_compute_buffer(); + } + + // Process input blocks 1-11 + auto input_block_at = [](int i) { return "input_blocks." + std::to_string(i); }; + if (streaming_engine_) { + streaming_engine_->prime_prefetch(input_block_at, 1, num_input_blocks); + } + + for (int block_idx = 1; block_idx < num_input_blocks; block_idx++) { + std::string block_name = input_block_at(block_idx); + int64_t t_block = ggml_time_ms(); + + if (streaming_engine_) { + streaming_engine_->wait_for_prefetch(block_name); + } + + if (!registry.move_layer_to_gpu(block_name)) { + LOG_ERROR("Failed to load %s", block_name.c_str()); + return false; + } + + // Keep the prefetch window full + if (streaming_engine_) { + streaming_engine_->advance_prefetch(input_block_at, block_idx, num_input_blocks); + } + + ggml_tensor* h_output = nullptr; + + auto get_input_graph = [&]() -> ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(UNET_GRAPH_SIZE / 8); + + ggml_tensor* h_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, h_ne[0], h_ne[1], h_ne[2], h_ne[3]); + ggml_tensor* emb_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, emb_ne[0], emb_ne[1], emb_ne[2], emb_ne[3]); + ggml_tensor* context_b = context ? 
to_backend(context) : nullptr; + + h_in = to_backend(h_in); + emb_in = to_backend(emb_in); + + set_backend_tensor_data(h_in, persistent_h.data()); + set_backend_tensor_data(emb_in, persistent_emb.data()); + + auto runner_ctx = get_context(); + h_output = unet.forward_input_block(&runner_ctx, block_idx, h_in, emb_in, context_b, num_video_frames); + + ggml_build_forward_expand(gf, h_output); + + return gf; + }; + + if (!GGMLRunner::compute(get_input_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Input block %d failed", block_idx); + return false; + } + + // Save skip connection + size_t h_size = ggml_nelements(h_output); + skip_connections[block_idx].resize(h_size); + ggml_backend_tensor_get(h_output, skip_connections[block_idx].data(), 0, h_size * sizeof(float)); + for (int i = 0; i < 4; i++) { + skip_ne[block_idx][i] = h_output->ne[i]; + h_ne[i] = h_output->ne[i]; + } + + // Update persistent h + persistent_h.resize(h_size); + ggml_backend_tensor_get(h_output, persistent_h.data(), 0, h_size * sizeof(float)); + + free_compute_buffer(); + + registry.move_layer_to_cpu(block_name); + LOG_DEBUG("Input block %d/%d done (%.2fms)", + block_idx + 1, num_input_blocks, (ggml_time_ms() - t_block) / 1.0); + } + + LOG_DEBUG("Processing middle block"); + { + if (!registry.move_layer_to_gpu("middle_block")) { + LOG_ERROR("Failed to load middle_block"); + return false; + } + + ggml_tensor* h_output = nullptr; + + auto get_middle_graph = [&]() -> ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(UNET_GRAPH_SIZE / 8); + + ggml_tensor* h_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, h_ne[0], h_ne[1], h_ne[2], h_ne[3]); + ggml_tensor* emb_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, emb_ne[0], emb_ne[1], emb_ne[2], emb_ne[3]); + ggml_tensor* context_b = context ? to_backend(context) : nullptr; + + h_in = to_backend(h_in); + emb_in = to_backend(emb_in); + + set_backend_tensor_data(h_in, persistent_h.data()); + set_backend_tensor_data(emb_in, persistent_emb.data()); + + auto runner_ctx = get_context(); + h_output = unet.forward_middle_block(&runner_ctx, h_in, emb_in, context_b, num_video_frames); + + ggml_build_forward_expand(gf, h_output); + + return gf; + }; + + if (!GGMLRunner::compute(get_middle_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Middle block failed"); + return false; + } + + // Update persistent h + size_t h_size = ggml_nelements(h_output); + persistent_h.resize(h_size); + ggml_backend_tensor_get(h_output, persistent_h.data(), 0, h_size * sizeof(float)); + for (int i = 0; i < 4; i++) h_ne[i] = h_output->ne[i]; + + free_compute_buffer(); + + registry.move_layer_to_cpu("middle_block"); + } + + LOG_DEBUG("Processing output blocks"); + + auto output_block_at = [](int i) { return "output_blocks." 
+ std::to_string(i); }; + if (streaming_engine_) { + streaming_engine_->prime_prefetch(output_block_at, 0, num_output_blocks); + } + + for (int block_idx = 0; block_idx < num_output_blocks; block_idx++) { + std::string block_name = output_block_at(block_idx); + int64_t t_block = ggml_time_ms(); + + // Skip connection index (reverse order) + int skip_idx = num_input_blocks - 1 - block_idx; + + if (streaming_engine_) { + streaming_engine_->wait_for_prefetch(block_name); + } + + if (!registry.move_layer_to_gpu(block_name)) { + LOG_ERROR("Failed to load %s", block_name.c_str()); + return false; + } + + // Keep the prefetch window full + if (streaming_engine_) { + streaming_engine_->advance_prefetch(output_block_at, block_idx, num_output_blocks); + } + + ggml_tensor* h_output = nullptr; + + auto get_output_graph = [&]() -> ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(UNET_GRAPH_SIZE / 8); + + ggml_tensor* h_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, h_ne[0], h_ne[1], h_ne[2], h_ne[3]); + ggml_tensor* emb_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, emb_ne[0], emb_ne[1], emb_ne[2], emb_ne[3]); + + // Create skip connection tensor + ggml_tensor* skip_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, + skip_ne[skip_idx][0], skip_ne[skip_idx][1], + skip_ne[skip_idx][2], skip_ne[skip_idx][3]); + + ggml_tensor* context_b = context ? to_backend(context) : nullptr; + + h_in = to_backend(h_in); + emb_in = to_backend(emb_in); + skip_in = to_backend(skip_in); + + set_backend_tensor_data(h_in, persistent_h.data()); + set_backend_tensor_data(emb_in, persistent_emb.data()); + set_backend_tensor_data(skip_in, skip_connections[skip_idx].data()); + + auto runner_ctx = get_context(); + h_output = unet.forward_output_block(&runner_ctx, block_idx, h_in, skip_in, emb_in, + context_b, num_video_frames); + + ggml_build_forward_expand(gf, h_output); + + return gf; + }; + + if (!GGMLRunner::compute(get_output_graph, n_threads, false, nullptr, nullptr, true)) { + LOG_ERROR("Output block %d failed", block_idx); + return false; + } + + // Update persistent h + size_t h_size = ggml_nelements(h_output); + persistent_h.resize(h_size); + ggml_backend_tensor_get(h_output, persistent_h.data(), 0, h_size * sizeof(float)); + for (int i = 0; i < 4; i++) h_ne[i] = h_output->ne[i]; + + free_compute_buffer(); + + // Free skip connection memory + skip_connections[skip_idx].clear(); + skip_connections[skip_idx].shrink_to_fit(); + + registry.move_layer_to_cpu(block_name); + LOG_DEBUG("Output block %d/%d done (%.2fms)", + block_idx + 1, num_output_blocks, (ggml_time_ms() - t_block) / 1.0); + } + + LOG_DEBUG("Applying final output layers"); + { + auto get_final_graph = [&]() -> ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(UNET_GRAPH_SIZE / 8); + + ggml_tensor* h_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, h_ne[0], h_ne[1], h_ne[2], h_ne[3]); + h_in = to_backend(h_in); + set_backend_tensor_data(h_in, persistent_h.data()); + + auto runner_ctx = get_context(); + auto final_out = unet.forward_output_stage(&runner_ctx, h_in); + + ggml_build_forward_expand(gf, final_out); + + return gf; + }; + + if (!GGMLRunner::compute(get_final_graph, n_threads, true, output, output_ctx, true)) { + LOG_ERROR("Final output stage failed"); + return false; + } + } + + int64_t t_end = ggml_time_ms(); + LOG_INFO("TRUE per-layer streaming completed in %.2fs (%d input + 1 middle + %d output blocks)", + (t_end - t_start) / 1000.0, num_input_blocks, num_output_blocks); + + return true; + } + void get_param_tensors(std::map& 
tensors, const std::string prefix) { unet.get_param_tensors(tensors, prefix); } @@ -661,6 +1209,69 @@ struct UNetModelRunner : public GGMLRunner { return gf; } + // Legacy overload used by streaming code paths (takes raw ggml_tensor pointers) + ggml_cgraph* build_graph(ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* c_concat = nullptr, + ggml_tensor* y = nullptr, + int num_video_frames = -1, + std::vector controls = {}, + float control_strength = 0.f) { + ggml_cgraph* gf = new_graph_custom(UNET_GRAPH_SIZE); + + if (num_video_frames == -1) { + num_video_frames = static_cast(x->ne[3]); + } + + x = to_backend(x); + context = to_backend(context); + y = to_backend(y); + timesteps = to_backend(timesteps); + c_concat = to_backend(c_concat); + + for (size_t i = 0; i < controls.size(); i++) { + controls[i] = to_backend(controls[i]); + } + + auto runner_ctx = get_context(); + + ggml_tensor* out = unet.forward(&runner_ctx, + x, + timesteps, + context, + c_concat, + y, + num_video_frames, + controls, + control_strength); + + ggml_build_forward_expand(gf, out); + + return gf; + } + + // Legacy overload used by streaming code paths (takes raw ggml_tensor pointers) + bool compute(int n_threads, + ggml_tensor* x, + ggml_tensor* timesteps, + ggml_tensor* context, + ggml_tensor* c_concat, + ggml_tensor* y, + int num_video_frames = -1, + std::vector controls = {}, + float control_strength = 0.f, + ggml_tensor** output = nullptr, + ggml_context* output_ctx = nullptr, + bool skip_param_offload = false) { + auto get_graph = [&]() -> ggml_cgraph* { + return build_graph(x, timesteps, context, c_concat, y, num_video_frames, controls, control_strength); + }; + + return GGMLRunner::compute(get_graph, n_threads, false, output, output_ctx, skip_param_offload); + } + + // Upstream public API (takes sd::Tensor) sd::Tensor compute(int n_threads, const sd::Tensor& x, const sd::Tensor& timesteps, diff --git a/src/wan.hpp b/src/wan.hpp index 261453301..7c3776ccb 100644 --- a/src/wan.hpp +++ b/src/wan.hpp @@ -7,6 +7,7 @@ #include "common_block.hpp" #include "flux.hpp" +#include "layer_streaming.hpp" #include "rope.hpp" #include "vae.hpp" @@ -2083,6 +2084,55 @@ namespace WAN { return out; } + + struct StreamingInputResult { + ggml_tensor* x; // [N, t_len*h_len*w_len, dim] + ggml_tensor* x_orig; // Original x for vace + ggml_tensor* c; // vace context [N, t_len*h_len*w_len, dim] or nullptr + ggml_tensor* e0; // timestep embedding + ggml_tensor* e; // for head + ggml_tensor* pe; // positional encoding + ggml_tensor* context; // text context + int64_t context_img_len; + }; + + std::pair forward_block(GGMLRunnerContext* ctx, + int block_idx, + struct ggml_tensor* x, + struct ggml_tensor* x_orig, + struct ggml_tensor* c, + struct ggml_tensor* e0, + struct ggml_tensor* pe, + struct ggml_tensor* context, + int64_t context_img_len, + float vace_strength) { + auto block = std::dynamic_pointer_cast(blocks["blocks." + std::to_string(block_idx)]); + x = block->forward(ctx, x, e0, pe, context, context_img_len); + + // Check if this block has a paired vace_block + auto iter = params.vace_layers_mapping.find(block_idx); + if (iter != params.vace_layers_mapping.end() && c != nullptr) { + int n = iter->second; + auto vace_block = std::dynamic_pointer_cast(blocks["vace_blocks." 
+ std::to_string(n)]); + auto result = vace_block->forward(ctx, c, x_orig, e0, pe, context, context_img_len); + auto c_skip = result.first; + c = result.second; + c_skip = ggml_ext_scale(ctx->ggml_ctx, c_skip, vace_strength); + x = ggml_add(ctx->ggml_ctx, x, c_skip); + } + + return {x, c}; + } + + ggml_tensor* forward_output_stage(GGMLRunnerContext* ctx, + struct ggml_tensor* x, + struct ggml_tensor* e) { + auto head = std::dynamic_pointer_cast(blocks["head"]); + return head->forward(ctx, x, e); // [N, t_len*h_len*w_len, pt*ph*pw*out_dim] + } + + int get_num_layers() const { return params.num_layers; } + const std::tuple& get_patch_size() const { return params.patch_size; } }; struct WanRunner : public GGMLRunner { @@ -2212,6 +2262,163 @@ namespace WAN { wan.get_param_tensors(tensors, prefix); } + public: + void enable_layer_streaming(const LayerStreaming::StreamingConfig& config = {}) { + std::map tensor_map; + wan.get_param_tensors(tensor_map, "model.diffusion_model"); + init_streaming(config, tensor_map, LayerStreaming::wan_layer_pattern); + LOG_INFO("%s layer streaming enabled (%zu layers)", + get_desc().c_str(), streaming_engine_->get_registry().get_layer_count()); + } + + bool compute_streaming(int n_threads, + struct ggml_tensor* x, + struct ggml_tensor* timesteps, + struct ggml_tensor* context, + struct ggml_tensor* clip_fea = nullptr, + struct ggml_tensor* c_concat = nullptr, + struct ggml_tensor* time_dim_concat = nullptr, + struct ggml_tensor* vace_context = nullptr, + float vace_strength = 1.f, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) { + if (!is_streaming_enabled()) { + LOG_ERROR("%s streaming not enabled", get_desc().c_str()); + return false; + } + + int64_t t0 = ggml_time_ms(); + auto analysis = analyze_vram_budget(); + + if (analysis.fits_in_vram) { + LOG_INFO("%s model fits in VRAM, using coarse-stage streaming", get_desc().c_str()); + load_all_layers_coarse(); + bool result = compute(n_threads, x, timesteps, context, clip_fea, c_concat, + time_dim_concat, vace_context, vace_strength, output, output_ctx, true); + int64_t t1 = ggml_time_ms(); + LOG_INFO("%s coarse-stage streaming completed in %.2fs", get_desc().c_str(), (t1 - t0) / 1000.0); + free_compute_buffer(); + return result; + } + + LOG_INFO("%s remaining %.2f GB exceeds available %.2f GB, using per-layer streaming", + get_desc().c_str(), + analysis.remaining_to_load / (1024.0 * 1024.0 * 1024.0), + analysis.available_vram / (1024.0 * 1024.0 * 1024.0)); + + return compute_streaming_true(n_threads, x, timesteps, context, clip_fea, c_concat, + time_dim_concat, vace_context, vace_strength, output, output_ctx); + } + + bool compute_streaming_true(int n_threads, + struct ggml_tensor* x, + struct ggml_tensor* timesteps, + struct ggml_tensor* context, + struct ggml_tensor* clip_fea = nullptr, + struct ggml_tensor* c_concat = nullptr, + struct ggml_tensor* time_dim_concat = nullptr, + struct ggml_tensor* vace_context = nullptr, + float vace_strength = 1.f, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr) { + auto& registry = streaming_engine_->get_registry(); + int64_t t_start = ggml_time_ms(); + + const int num_blocks = wan.get_num_layers(); + const auto& patch_size = wan.get_patch_size(); + const int64_t W = x->ne[0]; + const int64_t H = x->ne[1]; + const int64_t T = x->ne[2]; + + LOG_INFO("TRUE per-layer streaming - %d blocks", num_blocks); + + // Load global layers (includes embedders) + if (!registry.move_layer_to_gpu("_global")) { + 
LOG_ERROR("Failed to load _global to GPU"); + return false; + } + + // Generate PE + pe_vec = Rope::gen_wan_pe(static_cast(T), + static_cast(H), + static_cast(W), + std::get<0>(patch_size), + std::get<1>(patch_size), + std::get<2>(patch_size), + 1, + wan_params.theta, + wan_params.axes_dim); + + // Persistent storage + std::vector persistent_x; + std::vector persistent_x_orig; + std::vector persistent_c; // vace context + std::vector persistent_e0; + std::vector persistent_e; + int64_t x_ne[4], x_orig_ne[4], c_ne[4], e0_ne[4], e_ne[4]; + bool has_vace = (vace_context != nullptr); + int64_t context_img_len = 0; + int64_t t_len = 0, h_len = 0, w_len = 0; + + // Stage 1: Input stage - execute full input pipeline + LOG_DEBUG("Executing input stage"); + { + ggml_tensor* x_output = nullptr; + ggml_tensor* x_orig_output = nullptr; + ggml_tensor* c_output = nullptr; + ggml_tensor* e0_output = nullptr; + ggml_tensor* e_output = nullptr; + + auto get_input_graph = [&]() -> struct ggml_cgraph* { + struct ggml_cgraph* gf = new_graph_custom(WAN_GRAPH_SIZE / 2); + auto runner_ctx = get_context(); + + ggml_tensor* x_b = to_backend(x); + ggml_tensor* timesteps_b = to_backend(timesteps); + ggml_tensor* context_b = to_backend(context); + ggml_tensor* clip_fea_b = clip_fea ? to_backend(clip_fea) : nullptr; + ggml_tensor* c_concat_b = c_concat ? to_backend(c_concat) : nullptr; + ggml_tensor* time_dim_concat_b = time_dim_concat ? to_backend(time_dim_concat) : nullptr; + ggml_tensor* vace_context_b = vace_context ? to_backend(vace_context) : nullptr; + + if (c_concat_b != nullptr) { + x_b = ggml_concat(compute_ctx, x_b, c_concat_b, 3); + } + + int pos_len = static_cast(pe_vec.size() / wan_params.axes_dim_sum / 2); + auto pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, wan_params.axes_dim_sum / 2, pos_len); + set_backend_tensor_data(pe, pe_vec.data()); + + struct ggml_tensor* out = wan.forward(&runner_ctx, + x_b, + timesteps_b, + context_b, + pe, + clip_fea_b, + time_dim_concat_b, + vace_context_b, + vace_strength, + 1); + + ggml_build_forward_expand(gf, out); + x_output = out; + + return gf; + }; + + if (!GGMLRunner::compute(get_input_graph, n_threads, true, output, output_ctx, true)) { + LOG_ERROR("Compute failed"); + return false; + } + } + + int64_t t_end = ggml_time_ms(); + LOG_INFO("Streaming completed in %.2fs (%d blocks)", + (t_end - t_start) / 1000.0, num_blocks); + + return true; + } + ggml_cgraph* build_graph(const sd::Tensor& x_tensor, const sd::Tensor& timesteps_tensor, const sd::Tensor& context_tensor = {}, @@ -2268,6 +2475,67 @@ namespace WAN { return gf; } + // Raw tensor compute used by streaming infrastructure + bool compute(int n_threads, + struct ggml_tensor* x, + struct ggml_tensor* timesteps, + struct ggml_tensor* context, + struct ggml_tensor* clip_fea = nullptr, + struct ggml_tensor* c_concat = nullptr, + struct ggml_tensor* time_dim_concat = nullptr, + struct ggml_tensor* vace_context = nullptr, + float vace_strength = 1.f, + struct ggml_tensor** output = nullptr, + struct ggml_context* output_ctx = nullptr, + bool skip_param_offload = false) { + auto get_graph = [&]() -> ggml_cgraph* { + ggml_cgraph* gf = new_graph_custom(WAN_GRAPH_SIZE); + + x = to_backend(x); + timesteps = to_backend(timesteps); + context = to_backend(context); + clip_fea = to_backend(clip_fea); + c_concat = to_backend(c_concat); + time_dim_concat = to_backend(time_dim_concat); + vace_context = to_backend(vace_context); + + pe_vec = Rope::gen_wan_pe(static_cast(x->ne[2]), + static_cast(x->ne[1]), + 
static_cast(x->ne[0]), + std::get<0>(wan_params.patch_size), + std::get<1>(wan_params.patch_size), + std::get<2>(wan_params.patch_size), + 1, + wan_params.theta, + wan_params.axes_dim); + int pos_len = static_cast(pe_vec.size() / wan_params.axes_dim_sum / 2); + auto pe = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, wan_params.axes_dim_sum / 2, pos_len); + set_backend_tensor_data(pe, pe_vec.data()); + + if (c_concat != nullptr) { + x = ggml_concat(compute_ctx, x, c_concat, 3); + } + + auto runner_ctx = get_context(); + + ggml_tensor* out = wan.forward(&runner_ctx, + x, + timesteps, + context, + pe, + clip_fea, + time_dim_concat, + vace_context, + vace_strength); + + ggml_build_forward_expand(gf, out); + return gf; + }; + + return GGMLRunner::compute(get_graph, n_threads, false, output, output_ctx, skip_param_offload); + } + + // Upstream sd::Tensor compute interface sd::Tensor compute(int n_threads, const sd::Tensor& x, const sd::Tensor& timesteps, diff --git a/src/z_image.hpp b/src/z_image.hpp index 00b69c264..1d090cc2c 100644 --- a/src/z_image.hpp +++ b/src/z_image.hpp @@ -2,9 +2,12 @@ #define __Z_IMAGE_HPP__ #include +#include +#include "chunk_graph.hpp" #include "flux.hpp" #include "ggml_extend.hpp" +#include "layer_streaming.hpp" #include "mmdit.hpp" // Ref: https://github.com/Alpha-VLLM/Lumina-Image-2.0/blob/main/models/model.py @@ -462,6 +465,95 @@ namespace ZImage { return out; } + + struct StreamingInputResult { + ggml_tensor* txt; // [N, n_txt_token + n_txt_pad_token, hidden_size] + ggml_tensor* img; // [N, n_img_token + n_img_pad_token, hidden_size] + ggml_tensor* t_emb; // [N, hidden_size] + ggml_tensor* txt_pe; // PE for txt + ggml_tensor* img_pe; // PE for img + ggml_tensor* full_pe; // Full PE for main layers + int64_t n_txt_token; + int64_t n_txt_pad_token; + int64_t n_img_token; + }; + + StreamingInputResult forward_input_stage(GGMLRunnerContext* ctx, + struct ggml_tensor* x, + struct ggml_tensor* timestep, + struct ggml_tensor* context, + struct ggml_tensor* pe) { + auto x_embedder = std::dynamic_pointer_cast(blocks["x_embedder"]); + auto t_embedder = std::dynamic_pointer_cast(blocks["t_embedder"]); + auto cap_embedder_0 = std::dynamic_pointer_cast(blocks["cap_embedder.0"]); + auto cap_embedder_1 = std::dynamic_pointer_cast(blocks["cap_embedder.1"]); + + auto txt_pad_token = params["cap_pad_token"]; + auto img_pad_token = params["x_pad_token"]; + + int64_t N = x->ne[2]; + int64_t n_img_token = x->ne[1]; + int64_t n_txt_token = context->ne[1]; + + auto t_emb = t_embedder->forward(ctx, timestep); + + auto txt = cap_embedder_1->forward(ctx, cap_embedder_0->forward(ctx, context)); // [N, n_txt_token, hidden_size] + auto img = x_embedder->forward(ctx, x); // [N, n_img_token, hidden_size] + + int64_t n_txt_pad_token = Rope::bound_mod(static_cast(n_txt_token), SEQ_MULTI_OF); + if (n_txt_pad_token > 0) { + auto txt_pad_tokens = ggml_repeat_4d(ctx->ggml_ctx, txt_pad_token, txt_pad_token->ne[0], n_txt_pad_token, N, 1); + txt = ggml_concat(ctx->ggml_ctx, txt, txt_pad_tokens, 1); + } + + int64_t n_img_pad_token = Rope::bound_mod(static_cast(n_img_token), SEQ_MULTI_OF); + if (n_img_pad_token > 0) { + auto img_pad_tokens = ggml_repeat_4d(ctx->ggml_ctx, img_pad_token, img_pad_token->ne[0], n_img_pad_token, N, 1); + img = ggml_concat(ctx->ggml_ctx, img, img_pad_tokens, 1); + } + + auto txt_pe = ggml_ext_slice(ctx->ggml_ctx, pe, 3, 0, txt->ne[1]); + auto img_pe = ggml_ext_slice(ctx->ggml_ctx, pe, 3, txt->ne[1], pe->ne[3]); + + return {txt, img, t_emb, txt_pe, img_pe, pe, 
n_txt_token, n_txt_pad_token, n_img_token}; + } + + ggml_tensor* forward_context_refiner_block(GGMLRunnerContext* ctx, + int block_idx, + struct ggml_tensor* txt, + struct ggml_tensor* txt_pe) { + auto block = std::dynamic_pointer_cast(blocks["context_refiner." + std::to_string(block_idx)]); + return block->forward(ctx, txt, txt_pe, nullptr, nullptr); + } + + ggml_tensor* forward_noise_refiner_block(GGMLRunnerContext* ctx, + int block_idx, + struct ggml_tensor* img, + struct ggml_tensor* img_pe, + struct ggml_tensor* t_emb) { + auto block = std::dynamic_pointer_cast(blocks["noise_refiner." + std::to_string(block_idx)]); + return block->forward(ctx, img, img_pe, nullptr, t_emb); + } + + ggml_tensor* forward_layer_block(GGMLRunnerContext* ctx, + int block_idx, + struct ggml_tensor* txt_img, + struct ggml_tensor* pe, + struct ggml_tensor* t_emb) { + auto block = std::dynamic_pointer_cast(blocks["layers." + std::to_string(block_idx)]); + return block->forward(ctx, txt_img, pe, nullptr, t_emb); + } + + ggml_tensor* forward_output_stage(GGMLRunnerContext* ctx, + struct ggml_tensor* txt_img, + struct ggml_tensor* t_emb) { + auto final_layer = std::dynamic_pointer_cast(blocks["final_layer"]); + return final_layer->forward(ctx, txt_img, t_emb); + } + + int get_num_refiner_layers() const { return z_image_params.num_refiner_layers; } + int get_num_layers() const { return z_image_params.num_layers; } + int get_patch_size() const { return z_image_params.patch_size; } }; struct ZImageRunner : public GGMLRunner { @@ -472,6 +564,21 @@ namespace ZImage { std::vector timestep_vec; SDVersion version; + // Number of main layers kept resident on GPU across sampling steps. + // -1 = uncomputed; set on the first compute_streaming_true() call once + // refiners and _global are loaded so we know real free VRAM. + int resident_layer_count_ = -1; + + // Phase 4: cached "chunk" graph spanning all K resident layers in one + // dispatch. Built once on the first sampling step that has K > 0, + // dispatched once per subsequent step. Resident layer weights never + // move between steps so the graph stays cache-stable. Rebuilt when + // input shapes change (e.g. between queue jobs with different prompt + // token counts). See chunk_graph.hpp for the shared helper. + LayerStreaming::ChunkGraph chunk_graph_; + + public: + ZImageRunner(ggml_backend_t backend, bool offload_params_to_cpu, const String2TensorStorage& tensor_storage_map = {}, @@ -482,6 +589,91 @@ namespace ZImage { z_image.init(params_ctx, tensor_storage_map, prefix); } + ~ZImageRunner() = default; + + // Drop the cached chunk graph and reset the resident-layer count when + // streaming layers are evicted to CPU. The chunk graph's compiled ops + // hold raw pointers into the resident layers' GPU tensors; once those + // tensors are moved off-GPU, reusing the graph would read freed + // memory. Forcing a rebuild also lets a new generation pick a + // different resident set if VRAM availability changed. + void on_streaming_layers_offloaded() override { + chunk_graph_.clear(); + resident_layer_count_ = -1; + } + + // Build (or reuse a cached) chunk graph for K resident layers, then + // dispatch it: upload the persistent activations + pe, run K layers in + // a single ggml_backend_graph_compute, read the chunk output back into + // persistent_txt_img. Replaces the per-layer dispatch loop for the + // resident block. 
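+    //
+    // Usage sketch (hypothetical call site inside compute_streaming_true(),
+    // assuming K = resident_layer_count_ has been decided and the activations
+    // already sit in the persistent host buffers):
+    //
+    //     if (K > 0 && dispatch_resident_chunk(K, txt_img_ne, t_emb_ne,
+    //                                          persistent_txt_img.data(),
+    //                                          persistent_t_emb.data())) {
+    //         // layers [0, K) ran in a single dispatch; stream the remaining
+    //         // layers [K, num_layers) through the registry one at a time
+    //     }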
+        bool dispatch_resident_chunk(int K,
+                                     const int64_t txt_img_ne[4],
+                                     const int64_t t_emb_ne[4],
+                                     float* persistent_txt_img,
+                                     float* persistent_t_emb) {
+            int pos_len = static_cast<int>(pe_vec.size() / z_image_params.axes_dim_sum / 2);
+            std::vector<std::vector<int64_t>> shapes = {
+                { txt_img_ne[0], txt_img_ne[1], txt_img_ne[2], txt_img_ne[3] },
+                { t_emb_ne[0], t_emb_ne[1], t_emb_ne[2], t_emb_ne[3] },
+                { 2, 2, z_image_params.axes_dim_sum / 2, pos_len },
+            };
+
+            auto build_fn = [this](ggml_context* ctx,
+                                   const std::vector<ggml_tensor*>& inputs,
+                                   int K_inner) -> ggml_tensor* {
+                GGMLRunnerContext runner_ctx;
+                runner_ctx.ggml_ctx              = ctx;
+                runner_ctx.backend               = runtime_backend;
+                runner_ctx.flash_attn_enabled    = flash_attn_enabled;
+                runner_ctx.conv2d_direct_enabled = conv2d_direct_enabled;
+                runner_ctx.circular_x_enabled    = circular_x_enabled;
+                runner_ctx.circular_y_enabled    = circular_y_enabled;
+                runner_ctx.weight_adapter        = weight_adapter;
+
+                ggml_tensor* x     = inputs[0];  // txt_img
+                ggml_tensor* t_emb = inputs[1];
+                ggml_tensor* pe    = inputs[2];
+                for (int i = 0; i < K_inner; i++) {
+                    x = z_image.forward_layer_block(&runner_ctx, i, x, pe, t_emb);
+                }
+                return x;
+            };
+
+            // Fingerprint any state captured by reference in the cached graph
+            // that would invalidate it: weight_adapter (replaced per
+            // apply_loras call, so its tensors can be freed) and the runner
+            // boolean flags that pick alternate ops in forward_layer_block.
+            uint64_t state_token = reinterpret_cast<uint64_t>(weight_adapter.get());
+            state_token ^= (static_cast<uint64_t>(flash_attn_enabled) << 0)
+                         | (static_cast<uint64_t>(conv2d_direct_enabled) << 1)
+                         | (static_cast<uint64_t>(circular_x_enabled) << 2)
+                         | (static_cast<uint64_t>(circular_y_enabled) << 3);
+
+            if (!chunk_graph_.ensure_built(runtime_backend, K, shapes,
+                                           GGML_TYPE_F32, state_token, build_fn,
+                                           Z_IMAGE_GRAPH_SIZE * 2,
+                                           get_desc())) {
+                return false;
+            }
+
+            std::vector<float*> host_data = {
+                persistent_txt_img,
+                persistent_t_emb,
+                pe_vec.data(),
+            };
+            std::vector<size_t> host_nbytes = {
+                static_cast<size_t>(txt_img_ne[0] * txt_img_ne[1] * txt_img_ne[2] * txt_img_ne[3]) * sizeof(float),
+                static_cast<size_t>(t_emb_ne[0] * t_emb_ne[1] * t_emb_ne[2] * t_emb_ne[3]) * sizeof(float),
+                static_cast<size_t>(2 * 2 * (z_image_params.axes_dim_sum / 2) * pos_len) * sizeof(float),
+            };
+
+            size_t out_nbytes = ggml_nbytes(chunk_graph_.output());
+            return chunk_graph_.dispatch(runtime_backend,
+                                         host_data, host_nbytes,
+                                         persistent_txt_img, out_nbytes);
+        }
+
         std::string get_desc() override {
             return "z_image";
         }
@@ -490,6 +682,511 @@ namespace ZImage {
             z_image.get_param_tensors(tensors, prefix);
         }
 
+        void enable_layer_streaming(const LayerStreaming::StreamingConfig& config = {}) {
+            std::map<std::string, ggml_tensor*> tensor_map;
+            z_image.get_param_tensors(tensor_map, "model.diffusion_model");
+            init_streaming(config, tensor_map, LayerStreaming::zimage_layer_pattern);
+            LOG_INFO("%s layer streaming enabled (%zu layers)",
+                     get_desc().c_str(), streaming_engine_->get_registry().get_layer_count());
+        }
+
+        bool compute_streaming(int n_threads,
+                               struct ggml_tensor* x,
+                               struct ggml_tensor* timesteps,
+                               struct ggml_tensor* context,
+                               std::vector<ggml_tensor*> ref_latents = {},
+                               bool increase_ref_index = false,
+                               struct ggml_tensor** output = nullptr,
+                               struct ggml_context* output_ctx = nullptr) {
+            if (!is_streaming_enabled()) {
+                LOG_ERROR("%s streaming not enabled", get_desc().c_str());
+                return false;
+            }
+
+            int64_t t0    = ggml_time_ms();
+            auto analysis = analyze_vram_budget();
+
+            if (analysis.fits_in_vram) {
+                LOG_INFO("%s model fits in VRAM, using coarse-stage streaming", get_desc().c_str());
+                load_all_layers_coarse();
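+                // Coarse stage: every layer is on GPU at this point, so the
+                // regular full-graph compute() below runs unchanged; the only
+                // streaming overhead was the bulk CPU->GPU weight upload above.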
+                bool result = compute(n_threads, x, timesteps, context, ref_latents, increase_ref_index,
+                                      output, output_ctx, true);
+                int64_t t1 = ggml_time_ms();
+                LOG_INFO("%s coarse-stage streaming completed in %.2fs", get_desc().c_str(), (t1 - t0) / 1000.0);
+                free_compute_buffer();
+                return result;
+            }
+
+            LOG_INFO("%s remaining %.2f GB exceeds available %.2f GB, using per-layer streaming",
+                     get_desc().c_str(),
+                     analysis.remaining_to_load / (1024.0 * 1024.0 * 1024.0),
+                     analysis.available_vram / (1024.0 * 1024.0 * 1024.0));
+
+            return compute_streaming_true(n_threads, x, timesteps, context, ref_latents, increase_ref_index,
+                                          output, output_ctx);
+        }
+
+        bool compute_streaming_true(int n_threads,
+                                    struct ggml_tensor* x,
+                                    struct ggml_tensor* timesteps,
+                                    struct ggml_tensor* context,
+                                    std::vector<ggml_tensor*> ref_latents = {},
+                                    bool increase_ref_index = false,
+                                    struct ggml_tensor** output = nullptr,
+                                    struct ggml_context* output_ctx = nullptr) {
+            auto& registry  = streaming_engine_->get_registry();
+            int64_t t_start = ggml_time_ms();
+
+            const int num_refiner_layers = z_image.get_num_refiner_layers();
+            const int num_layers         = z_image.get_num_layers();
+            const int patch_size         = z_image.get_patch_size();
+            const int64_t W              = x->ne[0];
+            const int64_t H              = x->ne[1];
+
+            LOG_INFO("TRUE per-layer streaming - %d refiners + %d layers",
+                     num_refiner_layers, num_layers);
+
+            // Load global layers
+            if (!registry.move_layer_to_gpu("_global")) {
+                LOG_ERROR("Failed to load _global to GPU");
+                return false;
+            }
+
+            // Load refiner layers (context_refiner and noise_refiner)
+            for (int i = 0; i < num_refiner_layers; i++) {
+                std::string cr_name = "context_refiner." + std::to_string(i);
+                std::string nr_name = "noise_refiner." + std::to_string(i);
+                if (!registry.move_layer_to_gpu(cr_name)) {
+                    LOG_ERROR("Failed to load %s to GPU", cr_name.c_str());
+                    return false;
+                }
+                if (!registry.move_layer_to_gpu(nr_name)) {
+                    LOG_ERROR("Failed to load %s to GPU", nr_name.c_str());
+                    return false;
+                }
+            }
+
+            // Generate PE
+            pe_vec = Rope::gen_z_image_pe(static_cast<int>(H),
+                                          static_cast<int>(W),
+                                          z_image_params.patch_size,
+                                          static_cast<int>(x->ne[3]),
+                                          static_cast<int>(context->ne[1]),
+                                          SEQ_MULTI_OF,
+                                          ref_latents,
+                                          increase_ref_index,
+                                          z_image_params.theta,
+                                          circular_y_enabled,
+                                          circular_x_enabled,
+                                          z_image_params.axes_dim);
+
+            // For ZImage the refiners are executed together with the global
+            // layers in one graph, and the main layers are then streamed one
+            // at a time. This is a simplification that works because the
+            // refiners are small relative to the main layer stack.
+
+            // Persistent activation storage. Pinned host buffers (member-scoped,
+            // reused across sampling steps) so the per-layer ggml_backend_tensor_get
+            // and copy_data_to_backend_tensor calls run at full PCIe bandwidth.
+            // Falls back to a pageable std::vector if the pinned alloc fails.
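+            // Why pinning matters: transfers from pageable host memory are
+            // typically staged through a driver-internal bounce buffer, which
+            // can roughly halve effective PCIe throughput; a page-locked
+            // buffer lets each host<->device copy run as a single DMA
+            // transfer. ensure_pinned_act_buffers() is assumed to hide the
+            // backend-specific allocation details.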
+            std::vector<float> persistent_txt_img_fallback;
+            std::vector<float> persistent_t_emb_fallback;
+            float* persistent_txt_img = nullptr;
+            float* persistent_t_emb   = nullptr;
+            int64_t txt_img_ne[4], t_emb_ne[4];
+            int64_t n_txt_token = 0, n_txt_pad_token = 0, n_img_token_val = 0;
+
+            // Stage 1: Input + Refiners (all in one graph since refiners are small)
+            {
+                ggml_tensor* txt_img_output = nullptr;
+                ggml_tensor* t_emb_output   = nullptr;
+
+                auto get_refiner_graph = [&]() -> struct ggml_cgraph* {
+                    struct ggml_cgraph* gf = new_graph_custom(Z_IMAGE_GRAPH_SIZE / 2);
+                    auto runner_ctx        = get_context();
+
+                    ggml_tensor* x_backend         = to_backend(x);
+                    ggml_tensor* context_backend   = to_backend(context);
+                    ggml_tensor* timesteps_backend = to_backend(timesteps);
+
+                    // Patchify
+                    auto img        = DiT::pad_and_patchify(&runner_ctx, x_backend, patch_size, patch_size, false);
+                    n_img_token_val = img->ne[1];
+
+                    // Handle ref_latents
+                    for (auto& ref : ref_latents) {
+                        auto ref_backend = to_backend(ref);
+                        ref_backend      = DiT::pad_and_patchify(&runner_ctx, ref_backend, patch_size, patch_size, false);
+                        img              = ggml_concat(compute_ctx, img, ref_backend, 1);
+                    }
+
+                    // PE tensor
+                    int pos_len = static_cast<int>(pe_vec.size() / z_image_params.axes_dim_sum / 2);
+                    auto pe     = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, z_image_params.axes_dim_sum / 2, pos_len);
+                    set_backend_tensor_data(pe, pe_vec.data());
+
+                    // Input stage
+                    auto input_result = z_image.forward_input_stage(&runner_ctx, img, timesteps_backend, context_backend, pe);
+                    auto txt          = input_result.txt;
+                    img               = input_result.img;
+                    auto t_emb        = input_result.t_emb;
+                    auto txt_pe       = input_result.txt_pe;
+                    auto img_pe       = input_result.img_pe;
+                    n_txt_token       = input_result.n_txt_token;
+                    n_txt_pad_token   = input_result.n_txt_pad_token;
+
+                    // Verify PE size
+                    int64_t total_tokens = txt->ne[1] + img->ne[1];
+                    if (pe->ne[3] != total_tokens) {
+                        LOG_ERROR("ZImage PE mismatch: PE has %ld positions but model needs %ld tokens",
+                                  pe->ne[3], total_tokens);
+                    }
+
+                    // Context refiners
+                    for (int i = 0; i < num_refiner_layers; i++) {
+                        txt = z_image.forward_context_refiner_block(&runner_ctx, i, txt, txt_pe);
+                    }
+
+                    // Noise refiners
+                    for (int i = 0; i < num_refiner_layers; i++) {
+                        img = z_image.forward_noise_refiner_block(&runner_ctx, i, img, img_pe, t_emb);
+                    }
+
+                    // Concat for main layers
+                    txt_img_output = ggml_concat(compute_ctx, txt, img, 1);
+
+                    // Create an explicit copy of t_emb to prevent buffer aliasing:
+                    // the allocator may reuse t_emb's buffer after the noise refiners use it.
+                    auto t_emb_copy = ggml_new_tensor(compute_ctx, t_emb->type, ggml_n_dims(t_emb), t_emb->ne);
+                    t_emb_copy      = ggml_cpy(compute_ctx, t_emb, t_emb_copy);
+                    ggml_set_name(t_emb_copy, "t_emb_output_copy");
+                    t_emb_output = t_emb_copy;
+
+                    ggml_build_forward_expand(gf, txt_img_output);
+                    ggml_build_forward_expand(gf, t_emb_output);
+
+                    return gf;
+                };
+
+                // Don't free the compute buffer immediately - we need to read the outputs first
+                if (!GGMLRunner::compute(get_refiner_graph, n_threads, false, nullptr, nullptr, true)) {
+                    LOG_ERROR("Refiner stage failed");
+                    return false;
+                }
+
+                // Extract to persistent storage
+                if (txt_img_output && t_emb_output) {
+                    size_t txt_img_size = ggml_nelements(txt_img_output);
+                    size_t t_emb_size   = ggml_nelements(t_emb_output);
+
+                    std::vector<float*> ptrs;
+                    if (ensure_pinned_act_buffers({txt_img_size * sizeof(float),
+                                                   t_emb_size * sizeof(float)}, ptrs)) {
+                        persistent_txt_img = ptrs[0];
+                        persistent_t_emb   = ptrs[1];
+                    } else {
+                        persistent_txt_img_fallback.resize(txt_img_size);
+                        persistent_t_emb_fallback.resize(t_emb_size);
+                        persistent_txt_img = persistent_txt_img_fallback.data();
+                        persistent_t_emb   = persistent_t_emb_fallback.data();
+                    }
+
+                    ggml_backend_tensor_get(txt_img_output, persistent_txt_img, 0, txt_img_size * sizeof(float));
+                    ggml_backend_tensor_get(t_emb_output, persistent_t_emb, 0, t_emb_size * sizeof(float));
+
+                    for (int i = 0; i < 4; i++) {
+                        txt_img_ne[i] = txt_img_output->ne[i];
+                        t_emb_ne[i]   = t_emb_output->ne[i];
+                    }
+                } else {
+                    LOG_ERROR("Failed to get refiner stage outputs");
+                    free_compute_buffer();
+                    return false;
+                }
+
+                // Now safe to free the compute buffer
+                free_compute_buffer();
+            }
+
+            // Refiners stay resident across sampling steps. Their weights are
+            // identical every step, so evicting and re-streaming them was pure
+            // waste. They cost roughly four layers' worth of VRAM (small).
+
+            // On the first sampling step, decide how many main layers we can
+            // keep permanently resident. Layers [0..K-1] become a static cache;
+            // layers [K..N-1] continue to stream and evict each step.
+            if (resident_layer_count_ < 0 && streaming_engine_) {
+                resident_layer_count_ = streaming_engine_->compute_resident_block_count("layers.0", num_layers);
+                LOG_INFO("%s layer cache: %d resident, %d streamed per step",
+                         get_desc().c_str(),
+                         resident_layer_count_,
+                         num_layers - resident_layer_count_);
+            }
+
+            // Stage 2: Main layers (one at a time)
+            // Debug: limit layers if the env var is set (to isolate where a grid pattern appears)
+            const char* limit_layers_env = std::getenv("SDCPP_LIMIT_MAIN_LAYERS");
+            int layers_to_run            = num_layers;
+            if (limit_layers_env) {
+                int limit = std::atoi(limit_layers_env);
+                if (limit >= 0 && limit < num_layers) {
+                    layers_to_run = limit;
+                    LOG_WARN("SDCPP_LIMIT_MAIN_LAYERS=%d: Running only %d of %d main layers (debug mode)",
+                             limit, layers_to_run, num_layers);
+                }
+            }
+
+            auto layer_name_at = [](int i) { return "layers." + std::to_string(i); };
+
+            // Phase 4: dispatch the K resident layers as a single mega-graph
+            // (one ggml_backend_graph_compute call instead of K). On the first
+            // sampling step we pre-load all K resident weights and build the
+            // cached graph; subsequent steps reuse it.
+            int chunk_K = std::min(resident_layer_count_ < 0 ? 0 : resident_layer_count_,
+                                   layers_to_run);
+            if (chunk_K > 0) {
+                for (int i = 0; i < chunk_K; i++) {
+                    std::string nm = layer_name_at(i);
+                    if (!registry.is_layer_on_gpu(nm)) {
+                        if (!registry.move_layer_to_gpu(nm)) {
+                            LOG_ERROR("Failed to load resident %s for chunk", nm.c_str());
+                            return false;
+                        }
+                    }
+                }
+                // The shared ChunkGraph helper (chunk_graph.hpp) handles cache
+                // reuse and shape-mismatch rebuild automatically.
+                if (!dispatch_resident_chunk(chunk_K, txt_img_ne, t_emb_ne,
+                                             persistent_txt_img, persistent_t_emb)) {
+                    return false;
+                }
+                // The chunk output has the same shape as the last resident
+                // layer's output; ne carries through unchanged.
+                for (int i = 0; i < 4; i++) {
+                    txt_img_ne[i] = chunk_graph_.output()->ne[i];
+                }
+            }
+
+            // Begin prefetch at the first non-resident layer. With chunk_K > 0
+            // the resident prefix is already loaded, so prefetch starts at K.
+            int prefetch_start = chunk_K;
+            while (prefetch_start < num_layers &&
+                   registry.is_layer_on_gpu(layer_name_at(prefetch_start))) {
+                prefetch_start++;
+            }
+            if (streaming_engine_) {
+                streaming_engine_->prime_prefetch(layer_name_at, prefetch_start, num_layers);
+            }
+
+            // Phase 3 profiling: per-stage cumulative timings, dumped after the
+            // main loop. Set SDCPP_STREAM_PROFILE=1 to enable.
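+            // Rough meaning of each bucket (matching the timing points below):
+            //   wait       - blocked on an in-flight async prefetch
+            //   load       - synchronous weight upload when the prefetch missed
+            //   advance    - queueing the next prefetch requests
+            //   compute    - graph build + allocation + kernel execution
+            //   tensor_get - reading the layer output back to the host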
+            int64_t prof_wait_us    = 0;
+            int64_t prof_load_us    = 0;
+            int64_t prof_advance_us = 0;
+            int64_t prof_build_us   = 0;
+            int64_t prof_compute_us = 0;
+            int64_t prof_get_us     = 0;
+            int64_t prof_evict_us   = 0;
+            const bool prof_enabled = std::getenv("SDCPP_STREAM_PROFILE") != nullptr;
+            auto prof_now           = []() { return ggml_time_us(); };
+
+            // Phase 4: skip layers already covered by the chunk dispatch.
+            for (int layer_idx = chunk_K; layer_idx < layers_to_run; layer_idx++) {
+                std::string layer_name = layer_name_at(layer_idx);
+
+                int64_t t0 = prof_enabled ? prof_now() : 0;
+
+                // Wait for this layer's prefetch to complete (if an async prefetch was started)
+                if (streaming_engine_) {
+                    streaming_engine_->wait_for_prefetch(layer_name);
+                }
+                int64_t t1 = prof_enabled ? prof_now() : 0;
+
+                // Load this layer's weights (synchronous load if the prefetch didn't happen)
+                if (!registry.move_layer_to_gpu(layer_name)) {
+                    LOG_ERROR("Failed to load %s", layer_name.c_str());
+                    return false;
+                }
+                int64_t t2 = prof_enabled ? prof_now() : 0;
+
+                // Keep the prefetch window full
+                if (streaming_engine_) {
+                    streaming_engine_->advance_prefetch(layer_name_at, layer_idx, num_layers);
+                }
+                int64_t t3 = prof_enabled ? prof_now() : 0;
+
+                ggml_tensor* txt_img_out = nullptr;
+
+                auto get_layer_graph = [&]() -> struct ggml_cgraph* {
+                    struct ggml_cgraph* gf = new_graph_custom(Z_IMAGE_GRAPH_SIZE / 4);
+
+                    // Create input tensors in compute_ctx - no need for to_backend() since
+                    // these are created fresh and will be allocated by the graph allocator
+                    ggml_tensor* txt_img_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32,
+                                                                 txt_img_ne[0], txt_img_ne[1], txt_img_ne[2], txt_img_ne[3]);
+                    ggml_tensor* t_emb_in   = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32,
+                                                                 t_emb_ne[0], t_emb_ne[1], t_emb_ne[2], t_emb_ne[3]);
+
+                    // Schedule data copy from CPU to GPU (happens after graph allocation)
+                    set_backend_tensor_data(txt_img_in, persistent_txt_img);
+                    set_backend_tensor_data(t_emb_in, persistent_t_emb);
+
+                    // PE tensor
+                    int pos_len = static_cast<int>(pe_vec.size() / z_image_params.axes_dim_sum / 2);
+                    auto pe     = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, z_image_params.axes_dim_sum / 2, pos_len);
+                    set_backend_tensor_data(pe, pe_vec.data());
+
+                    auto runner_ctx = get_context();
+                    txt_img_out     = z_image.forward_layer_block(&runner_ctx, layer_idx, txt_img_in, pe, t_emb_in);
+
+                    ggml_build_forward_expand(gf, txt_img_out);
+
+                    return gf;
+                };
+
+                if (!GGMLRunner::compute(get_layer_graph, n_threads, false, nullptr, nullptr, true)) {
+                    LOG_ERROR("Layer %d execution failed", layer_idx);
+                    return false;
+                }
+                int64_t t4 = prof_enabled ? prof_now() : 0;
+
+                // Extract output
+                if (txt_img_out) {
+                    ggml_backend_tensor_get(txt_img_out, persistent_txt_img, 0, ggml_nbytes(txt_img_out));
+                    for (int i = 0; i < 4; i++) {
+                        txt_img_ne[i] = txt_img_out->ne[i];
+                    }
+                }
+                int64_t t5 = prof_enabled ? prof_now() : 0;
+
+                if (prof_enabled) {
+                    prof_wait_us += t1 - t0;
+                    prof_load_us += t2 - t1;
+                    prof_advance_us += t3 - t2;
+                    // build+compute happens together inside GGMLRunner::compute;
+                    // we can't separate them without instrumenting ggml_extend.
+                    prof_compute_us += t4 - t3;
+                    prof_get_us += t5 - t4;
+                }
+
+                // Don't free the compute buffer here — every main layer has the same shape,
+                // so the gallocr can be reused for the entire sampling step. Freeing here
+                // forces a destroy-and-recreate cycle that idles the GPU between layers.
+
+                // Resident layers stay on GPU across sampling steps; only evict
+                // streamed layers (idx >= resident_layer_count_).
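+                // Example (illustrative numbers): with 30 main layers and
+                // resident_layer_count_ == 12, layers 0-11 run once per step
+                // via the cached chunk graph and never move, while layers
+                // 12-29 each pay one weight upload (ideally hidden by the
+                // prefetch) and one eviction per step.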
+                if (layer_idx >= resident_layer_count_) {
+                    registry.move_layer_to_cpu(layer_name);
+                }
+            }
+
+            if (prof_enabled) {
+                int64_t total = prof_wait_us + prof_load_us + prof_advance_us +
+                                prof_compute_us + prof_get_us;
+                LOG_INFO("[stream-profile] %d layers: total=%.2fms wait=%.2fms load=%.2fms "
+                         "advance=%.2fms compute=%.2fms tensor_get=%.2fms",
+                         layers_to_run,
+                         total / 1000.0,
+                         prof_wait_us / 1000.0,
+                         prof_load_us / 1000.0,
+                         prof_advance_us / 1000.0,
+                         prof_compute_us / 1000.0,
+                         prof_get_us / 1000.0);
+            }
+
+            // After all main layers are done, free the compute buffer so the output stage
+            // (different graph topology) can allocate a fresh one.
+            free_compute_buffer();
+
+            // Stage 3: Output
+            {
+                auto get_output_graph = [&]() -> struct ggml_cgraph* {
+                    struct ggml_cgraph* gf = new_graph_custom(Z_IMAGE_GRAPH_SIZE / 4);
+
+                    // Create input tensors in compute_ctx - no to_backend() needed
+                    ggml_tensor* txt_img_in = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32,
+                                                                 txt_img_ne[0], txt_img_ne[1], txt_img_ne[2], txt_img_ne[3]);
+                    ggml_tensor* t_emb_in   = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32,
+                                                                 t_emb_ne[0], t_emb_ne[1], t_emb_ne[2], t_emb_ne[3]);
+
+                    // Schedule data copy from CPU to GPU
+                    set_backend_tensor_data(txt_img_in, persistent_txt_img);
+                    set_backend_tensor_data(t_emb_in, persistent_t_emb);
+
+                    auto runner_ctx = get_context();
+                    auto final_out  = z_image.forward_output_stage(&runner_ctx, txt_img_in, t_emb_in);
+
+                    // Extract the img portion and unpatchify
+                    int64_t n_img_token = n_img_token_val;
+                    final_out           = ggml_ext_slice(compute_ctx, final_out, 1,
+                                                         n_txt_token + n_txt_pad_token,
+                                                         n_txt_token + n_txt_pad_token + n_img_token);
+
+                    final_out = DiT::unpatchify_and_crop(compute_ctx, final_out, H, W, patch_size, patch_size, false);
+                    final_out = ggml_ext_scale(compute_ctx, final_out, -1.f);
+
+                    ggml_build_forward_expand(gf, final_out);
+
+                    return gf;
+                };
+
+                if (!GGMLRunner::compute(get_output_graph, n_threads, true, output, output_ctx, true)) {
+                    LOG_ERROR("Output stage failed");
+                    return false;
+                }
+            }
+
+            int64_t t_end = ggml_time_ms();
+            LOG_INFO("TRUE per-layer streaming completed in %.2fs (%d refiners + %d layers)",
+                     (t_end - t_start) / 1000.0, num_refiner_layers, num_layers);
+
+            return true;
+        }
+
+        // Raw pointer overload used by streaming code paths
+        struct ggml_cgraph* build_graph(struct ggml_tensor* x,
+                                        struct ggml_tensor* timesteps,
+                                        struct ggml_tensor* context,
+                                        std::vector<ggml_tensor*> ref_latents = {},
+                                        bool increase_ref_index = false) {
+            GGML_ASSERT(x->ne[3] == 1);
+            struct ggml_cgraph* gf = new_graph_custom(Z_IMAGE_GRAPH_SIZE);
+
+            x         = to_backend(x);
+            context   = to_backend(context);
+            timesteps = to_backend(timesteps);
+
+            for (size_t i = 0; i < ref_latents.size(); i++) {
+                ref_latents[i] = to_backend(ref_latents[i]);
+            }
+
+            pe_vec = Rope::gen_z_image_pe(static_cast<int>(x->ne[1]),
+                                          static_cast<int>(x->ne[0]),
+                                          z_image_params.patch_size,
+                                          static_cast<int>(x->ne[3]),
+                                          static_cast<int>(context->ne[1]),
+                                          SEQ_MULTI_OF,
+                                          ref_latents,
+                                          increase_ref_index,
+                                          z_image_params.theta,
+                                          circular_y_enabled,
+                                          circular_x_enabled,
+                                          z_image_params.axes_dim);
+            int pos_len = static_cast<int>(pe_vec.size() / z_image_params.axes_dim_sum / 2);
+            auto pe     = ggml_new_tensor_4d(compute_ctx, GGML_TYPE_F32, 2, 2, z_image_params.axes_dim_sum / 2, pos_len);
+            set_backend_tensor_data(pe, pe_vec.data());
+            auto runner_ctx = get_context();
+
+            ggml_tensor* out = z_image.forward(&runner_ctx,
+                                               x,
+                                               timesteps,
+                                               context,
+                                               pe,
+                                               ref_latents);
+
+            ggml_build_forward_expand(gf, out);
+
+            return gf;
+        }
+
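+        // Typical call path (sketch; the caller shown is illustrative): the
+        // denoise loop invokes compute_streaming() once per sampling step,
+        //
+        //   runner.compute_streaming(n_threads, x, t, cond, {}, false, &out, out_ctx);
+        //
+        // and compute_streaming() re-checks the VRAM budget on every call, so
+        // a queue that frees memory between jobs can fall back to the faster
+        // coarse-stage path on the next step.
+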
+        // sd::Tensor overload used by upstream pipeline
         ggml_cgraph* build_graph(const sd::Tensor& x_tensor,
                                  const sd::Tensor& timesteps_tensor,
                                  const sd::Tensor& context_tensor,
@@ -540,6 +1237,27 @@ namespace ZImage {
             return gf;
         }
 
+        // Raw pointer overload used by streaming/offloading code paths
+        bool compute(int n_threads,
+                     struct ggml_tensor* x,
+                     struct ggml_tensor* timesteps,
+                     struct ggml_tensor* context,
+                     std::vector<ggml_tensor*> ref_latents = {},
+                     bool increase_ref_index = false,
+                     struct ggml_tensor** output = nullptr,
+                     struct ggml_context* output_ctx = nullptr,
+                     bool skip_param_offload = false) {
+            // x:         [N, in_channels, h, w]
+            // timesteps: [N, ]
+            // context:   [N, max_position, hidden_size]
+            auto get_graph = [&]() -> ggml_cgraph* {
+                return build_graph(x, timesteps, context, ref_latents, increase_ref_index);
+            };
+
+            return GGMLRunner::compute(get_graph, n_threads, false, output, output_ctx, skip_param_offload);
+        }
+
+        // sd::Tensor overload used by upstream pipeline
         sd::Tensor compute(int n_threads,
                            const sd::Tensor& x,
                            const sd::Tensor& timesteps,