Conversation

@stduhpf (Contributor) commented Nov 3, 2025

https://github.com/madebyollin/taehv

Model weights:

.\bin\Release\sd.exe --diffusion-model ..\..\ComfyUI\models\diffusion_models\qwen-image-Q8_0.gguf --vae ..\..\ComfyUI\models\vae\qwen_image_vae.safetensors --qwen2vl ..\..\ComfyUI\models\text_encoders\Qwen2.5-VL-7B-Instruct-Q8_0.gguf -p '一个穿着"QWEN"标志的T恤的中国美女正拿着黑色的马克笔面相镜头微笑。她身后的玻璃板上手写体写着 “一、Qwen-Image的技术路线: 探索视觉生成基础模型的极限,开创理解与生成一体化的未来。二、Qwen-Image的模型特色:1、复杂文字渲染。支持中英渲染、自动布局; 2、精准图像编辑。支持文字编辑、物体增减、风格变换。三、Qwen-Image的未来愿景:赋能专业内容创作、助力生成式AI发展。”' --cfg-scale 2.5 --sampling-method euler -v --offload-to-cpu -H 1024 -W 1024 --diffusion-fa --flow-shift 3 --tae ..\ComfyUI\models\vae_approx\taew2_1.pth --vae-conv-direct

output

.\bin\Release\sd-cli.exe -M vid_gen --diffusion-model '..\..\ComfyUI\models\unet\Wan2.2-TI2V-5B-Q8_0.gguf' --t5xxl ..\..\ComfyUI\models\clip\t5\umt5-xxl-encoder-Q8_0.gguf --tae ..\..\ComfyUI\models\vae_approx\taew2_2.pth -p "The woman drops the marker, and then she starts laughing a bit" -n "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" --cfg-scale 5.0 --sampling-method euler -v -W 768 -H 768 --color --video-frames 49 -i .\image.png --vae-conv-direct --scheduler smoothstep --steps 50 --fps 24 --diffusion-fa

output.mp4

The speedup and memory savings aren't that impressive yet; maybe this can be improved further?

@stduhpf (Contributor Author) commented Nov 3, 2025

Sorry for the unrelated whitespace changes and the debug spam; I'll fix them later.

@stduhpf (Contributor Author) commented Nov 3, 2025

Oh, a new version of the taew2.1 weights just came out, coincidentally.

Old weights: (image "output - Copy (112)")
New weights: (image "output")

@stduhpf (Contributor Author) commented Nov 3, 2025

TAE decoding of the outputs of Wan2.1 models (and Wan2.2 A14B) now works in txt2img mode.

Video decoding runs as well, but the results are obviously incorrect (flashing-lights warning).

If someone can see what I'm doing wrong when decoding videos, let me know.

@madebyollin commented Dec 11, 2025

After fixing the three bugs mentioned in review, image results look correct (tested on GH200 with -DSD_CUDA=ON). I didn't check video.
image

diffs
diff --git a/tae.hpp b/tae.hpp
index ad0bd37..6a7951f 100644
--- a/tae.hpp
+++ b/tae.hpp
@@ -224,7 +224,7 @@ public:
         h      = conv1->forward(ctx, h);
         h      = ggml_relu_inplace(ctx->ggml_ctx, h);
         h      = conv2->forward(ctx, h);
-        h      = ggml_relu_inplace(ctx->ggml_ctx, h);
+        // h      = ggml_relu_inplace(ctx->ggml_ctx, h);
 
         auto skip = x;
         if (has_skip_conv) {
@@ -323,7 +323,7 @@ public:
         for (int i = 0; i < num_layers; i++) {
             for (int j = 0; j < num_blocks; j++) {
                 auto block = std::dynamic_pointer_cast<MemBlock>(blocks[std::to_string(index++)]);
-                auto mem   = ggml_pad(ctx->ggml_ctx, h, 0, 0, 0, 1);
+                auto mem   = ggml_pad_ext(ctx->ggml_ctx, h, 0, 0, 0, 0, 0, 0, 1, 0);
                 mem        = ggml_view_4d(ctx->ggml_ctx, mem, h->ne[0], h->ne[1], h->ne[2], h->ne[3], h->nb[1], h->nb[2], h->nb[3], 0);
                 h          = block->forward(ctx, h, mem);
             }
@@ -341,7 +341,7 @@ public:
         h              = last_conv->forward(ctx, h);
 
         // shape(W, H, 3, T+3) => shape(W, H, 3, T)
-        h = ggml_view_4d(ctx->ggml_ctx, h, h->ne[0], h->ne[1], h->ne[2], h->ne[3] - 3, h->nb[1], h->nb[2], h->nb[3], 0);
+        h = ggml_view_4d(ctx->ggml_ctx, h, h->ne[0], h->ne[1], h->ne[2], h->ne[3] - 3, h->nb[1], h->nb[2], h->nb[3], 3*h->nb[3]);
         return h;

@stduhpf (Contributor Author) commented Dec 11, 2025

Video is still completely broken, but image decoding works very well now.

stduhpf marked this pull request as ready for review December 13, 2025 21:10
@stduhpf (Contributor Author) commented Dec 13, 2025

Results for taew2.2 are quite interesting for now.

output.mp4

@madebyollin:

Wan 2.2 and Hunyuan 1.5 have a 2x2 pixelshuffle on the input/output.

@stduhpf (Contributor Author) commented Dec 13, 2025

@madebyollin Yes, I saw that when looking at the VAE code in wan.hpp; I'm on it.

@stduhpf (Contributor Author) commented Dec 13, 2025

output.mp4

@stduhpf (Contributor Author) commented Dec 13, 2025

output.mp4

@madebyollin:

The Wan 2.2 TI2V results still look broken. There's a scaling issue around L3600, where sd_ctx->sd->process_latent_out(init_latent); and sd_ctx->sd->process_latent_in(init_latent); are incorrectly called even when using TAEW2.2. After fixing that, the initial frame looks correctly scaled, but the video deteriorates into gray mush:

output_with_disabled_process_latent.mp4

This gray-mush issue also happens with the default VAE on 8f05f5bc6ee9d6aba9d1ff2be7739a5a3cf1586d (before this PR), so fixing it is likely out of scope for this PR.

output_with_official_vae_on_8f05f5bc6ee9d6aba9d1ff2be7739a5a3cf1586d.mp4

@stduhpf (Contributor Author) commented Dec 15, 2025

@madebyollin Yes, I figured it was probably something like that after noticing how much worse the img2vid results were compared to txt2vid. I get no "gray mush" on my end with this fix, though.

@leejet (Owner) commented Dec 15, 2025

@stduhpf I used taehv and got results very close to those of the Wan VAE. Maybe this PR can be merged now?

@stduhpf (Contributor Author) commented Dec 15, 2025

I think so too. I haven't tested every possible use case, though (for example, VACE).

@spiderolsen:

20251216_005330

I tested I2V with this image and successfully enabled the TAE feature on an AMD RX 7800 XT using my ROCm 6.4.2 build. Thanks for implementing it! ❤️

832x480, 5-second clip with Wan2.2-TI2V-5B-Q4_0.gguf

20251216_005604.mp4

@leejet (Owner) commented Dec 16, 2025

Thank you for your contributions.

leejet merged commit 9fa7f41 into leejet:master, Dec 16, 2025 (9 checks passed)
@CarlGao4 (Contributor):

How big is the VRAM requirement reduction?

@stduhpf (Contributor Author) commented Dec 16, 2025

"How big is the VRAM requirement reduction?"

Results on ROCm 6.2, Windows, with an AMD Radeon RX 6800 (gfx1030, 16 GB):

Without --vae-conv-direct

1024x1024x1

  • wan2.1_vae vs taew2_1:
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - wan_vae compute buffer size: 7493.50 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 4.37s
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - taehv compute buffer size: 6913.00 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 1.90s
  • wan2.2_vae vs taew2_2:
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - wan_vae compute buffer size: 9996.77 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 4.71s
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - taehv compute buffer size: 1728.75 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 0.31s

512x512x33

  • wan2.1_vae vs taew2_1:
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - wan_vae compute buffer size: 12900.63 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 28.24s
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - taehv compute buffer size: 15554.25 MB(VRAM)
H:/stable-diffusion.cpp/ggml/src/ggml-cuda/cpy.cu:359: GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) failed
  • wan2.2_vae vs taew2_2:
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - wan_vae compute buffer size: 13718.14 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 31.91s
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - taehv compute buffer size: 3889.69 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 0.56s

With --vae-conv-direct

1024x1024x1

  • wan2.1_vae vs taew2_1:
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1705 - wan_vae compute buffer size: 7493.50 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2332 - computing vae decode graph completed, taking 4.56s
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1705 - taehv compute buffer size: 2817.00 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2332 - computing vae decode graph completed, taking 3.13s
  • wan2.2_vae vs taew2_2:
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - wan_vae compute buffer size: 9996.77 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 4.75s
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - taehv compute buffer size: 704.75 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 0.76s

512x512x33

  • wan2.1_vae vs taew2_1:
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - wan_vae compute buffer size: 12900.63 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 28.19s
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - taehv compute buffer size: 6338.25 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 6.67s
  • wan2.2_vae vs taew2_2:
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - wan_vae compute buffer size: 13718.14 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 32.01s
[DEBUG] stable-diffusion.cpp\ggml_extend.hpp:1699 - taehv compute buffer size: 1585.69 MB(VRAM)
[DEBUG] stable-diffusion.cpp:2333 - computing vae decode graph completed, taking 1.70s

@CarlGao4 (Contributor):

Thanks! I'm using Wan2.2 5B, and the default VAE requires 30 GiB of VRAM, so generating video with 16 GB of RAM now becomes possible.

@stduhpf (Contributor Author) commented Dec 16, 2025

@madebyollin Is it expected for taew2_1 to be slower and to save less VRAM than taew2_2?

@CarlGao4 (Contributor):

It seems that this implementation has a maximum supported tensor size. When I tried a longer video (960x720x81), an error occurred:

[INFO ] ggml_extend.hpp:1799 - taehv offload params ( 21.79 MB, 128 tensors) to runtime backend (CUDA0), taking 0.01s
[DEBUG] ggml_extend.hpp:1699 - taehv compute buffer size: 15143.45 MB(VRAM)
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-cuda\cpy.cu:359: GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) failed

@stduhpf (Contributor Author) commented Dec 17, 2025

@CarlGao4 Was this with --vae-conv-direct? If not, try it. If you're already using it and still hitting this assert, I coincidentally just found a potential solution, though I don't yet know how safe it is: just remove the assert and try again.

@madebyollin:

@stduhpf For a given image size, TAEW2.2 should be ~1/4 the cost of TAEW2.1, because TAEW2.2 uses 2x2 patchification which halves the height and width of all intermediate tensors. The cost of 2x2 patchification is a reduced ability to represent fine details. I pushed a new checkpoint today that should improve TAEW2.2 quality slightly, but it's still less sharp/stable than TAEW2.1 in my testing.

@CarlGao4 (Contributor):

@stduhpf With --vae-conv-direct enabled, the assertion does not fail. Resolution: 1024x768x81.

@Green-Sky (Contributor) commented Dec 17, 2025

I pushed a new checkpoint today that should improve TAEW2.2 quality slightly, but it's still less sharp/stable than TAEW2.1 in my testing.

Hard (moving) edges are now noticeably less pixelated/squiggly, especially on every non-first frame.

Second frame:

before: (image)
after: (image)

@CarlGao4 (Contributor):

@madebyollin Can you update the safetensors as well?
Also, maybe it's time to update the support list: https://github.com/madebyollin/taehv#where-can-i-get-taehv

@madebyollin:

@CarlGao4 Good point. Updated the safetensors and added stable-diffusion.cpp to the support list, with credit to this PR.
