Description
Thanks for releasing this awesome project and the codebase!
I would like to ask about some details of the model.
- Did you drop or randomly select conditions during training? For example, when training with cond 1, cond 2, and cond 3, is cond 1 dropped at a certain ratio?
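To make the question concrete, here is a minimal sketch of the kind of condition dropout I am asking about (PyTorch; the helper name, the `drop_cond_prob` ratio, and the zero null embedding are my assumptions, not taken from this codebase):

```python
import torch

def maybe_drop_conditions(conds, drop_cond_prob=0.1):
    """Hypothetical helper: independently replace each condition latent
    with a null (zero) embedding at a fixed ratio, CFG-style."""
    out = []
    for cond in conds:
        if torch.rand(()) < drop_cond_prob:
            out.append(torch.zeros_like(cond))  # condition dropped this step
        else:
            out.append(cond)
    return out

# e.g. conds = [cond1, cond2, cond3]; during training each condition
# would be dropped independently with probability drop_cond_prob
```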
- During training, are all conditions compressed by the same VAE as the video, and are they concatenated with the video tokens along the sequence dimension for self-attention? If so, the attention cost grows quadratically with the total sequence length, i.e. with the number of conditions. Have any corresponding GPU memory optimizations been implemented?
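For reference, this is the setup I have in mind: every stream goes through the same VAE, and the latent tokens are concatenated along the sequence axis before self-attention, so the score matrix scales with the square of the total length. A minimal sketch with illustrative shapes (`F.scaled_dot_product_attention` stands in for a memory-efficient kernel such as FlashAttention; nothing here is from this codebase):

```python
import torch
import torch.nn.functional as F

B, L, D = 1, 256, 64                              # illustrative sizes
video = torch.randn(B, L, D)                      # VAE latent tokens of the video
conds = [torch.randn(B, L, D) for _ in range(3)]  # 3 conditions, same VAE

x = torch.cat([video, *conds], dim=1)             # (B, 4L, D): one joint sequence
# Naive self-attention materializes a (4L x 4L) score matrix, so cost grows
# quadratically as conditions are added; a fused FlashAttention-style kernel
# avoids storing the full matrix.
out = F.scaled_dot_product_attention(x, x, x)
print(out.shape)                                  # torch.Size([1, 1024, 64])
```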
- In each DiT block, are cross-attention and the FFN performed separately per stream, for example with distinct modules like cross_attention_video, cross_attention_cond1, and so on?
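As a concrete version of this last question, the structure I am imagining looks like the following (a hypothetical block I wrote for illustration; the `cross_attns` list plays the role of distinct `cross_attention_cond*` modules):

```python
import torch
import torch.nn as nn

class DiTBlockSketch(nn.Module):
    """Hypothetical DiT block with a separate cross-attention per condition."""
    def __init__(self, dim, num_heads, num_conds=3):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One module per condition stream, analogous to
        # cross_attention_cond1, cross_attention_cond2, ...
        self.cross_attns = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_conds)
        )
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, conds):
        x = x + self.self_attn(x, x, x, need_weights=False)[0]
        for attn, c in zip(self.cross_attns, conds):
            x = x + attn(x, c, c, need_weights=False)[0]  # distinct module per condition
        return x + self.ffn(x)
```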