Conversation

@stduhpf (Contributor) commented Nov 3, 2025

https://github.com/madebyollin/taehv

Model weights:

.\bin\Release\sd.exe --diffusion-model ..\..\ComfyUI\models\diffusion_models\qwen-image-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\qwen_image_vae.safetensors --qwen2vl ..\..\ComfyUI\models\text_encoders\Qwen2.5-VL-7B-Instruct-Q8_0.gguf -p '一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔面相镜头微笑。她身后的玻璃板上手写体写着 “一、Qwen-Image的技术路线: 探索视觉生成基础模型的极限,开创理解与生成一体化的未来。二、Qwen-Image的模型特色:1、复杂文字渲染。支持中英渲染、自动布局; 2、精准图像编辑。支持文字编辑、物体增减、风格变换。三、Qwen-Image的未来愿景:赋能专业内容创作、助力生成式AI发展。”' --cfg-scale 2.5 --sampling-method euler -v --offload-to-cpu -H 1024 -W 1024 --diffusion-fa --flow-shift 3 --tae ..\ComfyUI\models\vae_approx\taew2_1.pth --vae-conv-direct

output

.\bin\Release\sd-cli.exe -M vid_gen --diffusion-model '..\..\ComfyUI\models\unet\Wan2.2-TI2V-5B-Q8_0.gguf' --t5xxl ..\..\ComfyUI\models\clip\t5\umt5-xxl-encoder-Q8_0.gguf --tae ..\..\ComfyUI\models\vae_approx\taew2_2.pth -p "The woman drops the marker, and then she starts laughing a bit" -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" --cfg-scale 5.0 --sampling-method euler -v -W 768 -H 768 --color --video-frames 49 -i .\image.png --vae-conv-direct --scheduler smoothstep --steps 50 --fps 24 --diffusion-fa

output.mp4

The speedup and memory savings aren't that impressive yet; maybe this can be improved further?

@stduhpf (Contributor Author) commented Nov 3, 2025

Sorry for the unrelated whitespace changes and the debug spam; I'll fix them later.

@stduhpf (Contributor Author) commented Nov 3, 2025

Oh, a new version of the taew2.1 weights just came out, coincidentally.

Old weights: (image "output - Copy (112)")
New weights: (image "output")

@stduhpf (Contributor Author) commented Nov 3, 2025

TAE decoding of the outputs of Wan2.1 models (and Wan2.2 A14B) now works in txt2img mode.

Video decoding runs as well, but the results are obviously incorrect (flashing-lights warning).

If someone can see what I'm doing wrong when decoding videos, let me know.

@madebyollin commented Dec 11, 2025

After fixing the three bugs mentioned in review, image results look correct (tested on GH200 with -DSD_CUDA=ON). I didn't check video.
image

diffs
diff --git a/tae.hpp b/tae.hpp
index ad0bd37..6a7951f 100644
--- a/tae.hpp
+++ b/tae.hpp
@@ -224,7 +224,7 @@ public:
         h      = conv1->forward(ctx, h);
         h      = ggml_relu_inplace(ctx->ggml_ctx, h);
         h      = conv2->forward(ctx, h);
-        h      = ggml_relu_inplace(ctx->ggml_ctx, h);
+        // h      = ggml_relu_inplace(ctx->ggml_ctx, h);
 
         auto skip = x;
         if (has_skip_conv) {
@@ -323,7 +323,7 @@ public:
         for (int i = 0; i < num_layers; i++) {
             for (int j = 0; j < num_blocks; j++) {
                 auto block = std::dynamic_pointer_cast<MemBlock>(blocks[std::to_string(index++)]);
-                auto mem   = ggml_pad(ctx->ggml_ctx, h, 0, 0, 0, 1);
+                auto mem   = ggml_pad_ext(ctx->ggml_ctx, h, 0, 0, 0, 0, 0, 0, 1, 0);
                 mem        = ggml_view_4d(ctx->ggml_ctx, mem, h->ne[0], h->ne[1], h->ne[2], h->ne[3], h->nb[1], h->nb[2], h->nb[3], 0);
                 h          = block->forward(ctx, h, mem);
             }
@@ -341,7 +341,7 @@ public:
         h              = last_conv->forward(ctx, h);
 
         // shape(W, H, 3, T+3) => shape(W, H, 3, T)
-        h = ggml_view_4d(ctx->ggml_ctx, h, h->ne[0], h->ne[1], h->ne[2], h->ne[3] - 3, h->nb[1], h->nb[2], h->nb[3], 0);
+        h = ggml_view_4d(ctx->ggml_ctx, h, h->ne[0], h->ne[1], h->ne[2], h->ne[3] - 3, h->nb[1], h->nb[2], h->nb[3], 3*h->nb[3]);
         return h;

@stduhpf (Contributor Author) commented Dec 11, 2025

Video is still completely broken, but image decoding works very well now.

stduhpf marked this pull request as ready for review December 13, 2025 21:10
@stduhpf (Contributor Author) commented Dec 13, 2025

Results for taew2.2 are quite interesting for now.

output.mp4

@madebyollin:

Wan 2.2 and Hunyuan 1.5 have a 2x2 pixelshuffle on the input/output.

@stduhpf (Contributor Author) commented Dec 13, 2025

@madebyollin Yes, I saw that when looking at the VAE code in wan.hpp; I'm on it.

@stduhpf (Contributor Author) commented Dec 13, 2025

output.mp4

@stduhpf (Contributor Author) commented Dec 13, 2025

output.mp4

@madebyollin:

The Wan 2.2 TI2V results still look broken. There's a scaling issue around L3600, where sd_ctx->sd->process_latent_out(init_latent); and sd_ctx->sd->process_latent_in(init_latent); are incorrectly called even when using TAEW2.2. After fixing that, the initial frame looks correctly scaled, but the video deteriorates into gray mush:

output_with_disabled_process_latent.mp4

This gray-mush issue also happens with the default VAE on 8f05f5bc6ee9d6aba9d1ff2be7739a5a3cf1586d (before this PR), so fixing it is likely out of scope for this PR.

output_with_official_vae_on_8f05f5bc6ee9d6aba9d1ff2be7739a5a3cf1586d.mp4

@stduhpf (Contributor Author) commented Dec 15, 2025

@madebyollin Yes, I figured it was probably something like that after noticing how much worse the img2vid results were compared to txt2vid. I get no "gray mush" on my end with this fix, though.

@leejet (Owner) commented Dec 15, 2025

@stduhpf I used taehv and got results very close to those of the Wan VAE. Maybe this PR can be merged now?

@stduhpf (Contributor Author) commented Dec 15, 2025

I think so too. I haven't tested every possible use case, though (for example, VACE).

@spiderolsen:

20251216_005330

I tested I2V with this image and successfully enabled the TAE feature on an AMD RX 7800 XT using my ROCm 6.4.2 build. Thanks for implementing it! ❤️

832x480, 5-second clip with Wan2.2-TI2V-5B-Q4_0.gguf

20251216_005604.mp4

@leejet (Owner) commented Dec 16, 2025

Thank you for your contributions.

leejet merged commit 9fa7f41 into leejet:master, Dec 16, 2025 (9 checks passed)
@CarlGao4 (Contributor):

How big is the VRAM requirement reduction?

@stduhpf (Contributor Author) commented Dec 16, 2025

"How big is the VRAM requirement reduction?"

Results on ROCm 6.2, Windows, with an AMD Radeon RX 6800 (gfx1030, 16 GB):

Without --vae-conv-direct

1024x1024x1

  • wan2.1_vae vs taew2_1:
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - wan_vae compute buffer size: 7493.50 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 4.37s
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - taehv compute buffer size: 6913.00 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 1.90s
  • wan2.2_vae vs taew2_2:
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - wan_vae compute buffer size: 9996.77 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 4.71s
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - taehv compute buffer size: 1728.75 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 0.31s

512x512x33

  • wan2.1_vae vs taew2_1:
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - wan_vae compute buffer size: 12900.63 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 28.24s
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - taehv compute buffer size: 15554.25 MB(VRAM)
H:/stable-diffusion.cpp/ggml/src/ggml-cuda/cpy.cu:359: GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) failed
  • wan2.2_vae vs taew2_2:
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - wan_vae compute buffer size: 13718.14 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 31.91s
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - taehv compute buffer size: 3889.69 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 0.56s

With --vae-conv-direct

1024x1024x1

  • wan2.1_vae vs taew2_1:
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1705 - wan_vae compute buffer size: 7493.50 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2332 - computing vae decode graph completed, taking 4.56s
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1705 - taehv compute buffer size: 2817.00 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2332 - computing vae decode graph completed, taking 3.13s
  • wan2.2_vae vs taew2_2:
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - wan_vae compute buffer size: 9996.77 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 4.75s
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - taehv compute buffer size: 704.75 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 0.76s

512x512x33

  • wan2.1_vae vs taew2_1:
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - wan_vae compute buffer size: 12900.63 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 28.19s
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - taehv compute buffer size: 6338.25 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 6.67s
  • wan2.2_vae vs taew2_2:
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - wan_vae compute buffer size: 13718.14 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 32.01s
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - taehv compute buffer size: 1585.69 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 1.70s

@CarlGao4 (Contributor):

Thanks! I'm using Wan2.2 5B, and the default VAE requires 30 GiB of VRAM, so generating video with 16 GB of RAM now becomes possible.

@stduhpf (Contributor Author) commented Dec 16, 2025

@madebyollin Is it expected for taew2_1 to be slower and to save less VRAM than taew2_2?

@CarlGao4 (Contributor):

It seems that this implementation has a maximum supported tensor size. When I tried a longer video (960x720x81), an error occurred:

[INFO ] ggml_extend.hpp:1799 - taehv offload params ( 21.79 MB, 128 tensors) to runtime backend (CUDA0), taking 0.01s
[DEBUG] ggml_extend.hpp:1699 - taehv compute buffer size: 15143.45 MB(VRAM)
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\cpy.cu:359: GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) failed

@stduhpf (Contributor Author) commented Dec 17, 2025

@CarlGao4 Was this with --vae-conv-direct? If not, try it. If you're already using it and still hitting this assert, I coincidentally just found a potential solution, though I don't yet know how safe it is: just remove the assert and try again.

@madebyollin:

@stduhpf For a given image size, TAEW2.2 should be ~1/4 the cost of TAEW2.1, because TAEW2.2 uses 2x2 patchification which halves the height and width of all intermediate tensors. The cost of 2x2 patchification is a reduced ability to represent fine details. I pushed a new checkpoint today that should improve TAEW2.2 quality slightly, but it's still less sharp/stable than TAEW2.1 in my testing.

@CarlGao4 (Contributor):

@stduhpf With --vae-conv-direct enabled, the assertion does not fail. Resolution: 1024x768x81.

@Green-Sky (Contributor) commented Dec 17, 2025

I pushed a new checkpoint today that should improve TAEW2.2 quality slightly, but it's still less sharp/stable than TAEW2.1 in my testing.

Hard (moving) edges are now noticeably less pixelated/squiggly, especially on every non-first frame.

Second frame:

before: (image)
after: (image)

@CarlGao4 (Contributor):

@madebyollin Can you update the safetensors as well?
Also, maybe it's time to update the support list: https://github.com/madebyollin/taehv#where-can-i-get-taehv

@madebyollin:

@CarlGao4 Good point. Updated the safetensors and added stable-diffusion.cpp to the support list, with credit to this PR.
