Hi, thanks for the great work!
I have a question regarding the logic in calculate_dimensions.
Currently, the image height and width are constrained to be multiples of 32.
From my understanding:
The VAE has a downsampling factor of 8, so the input dimensions need to be multiples of 8 for the latent spatial size to be an integer.
Before entering the DiT, the latent is passed through a Patch Embedding layer with patch_size = 2.
That would further imply a total factor of 8 × 2 = 16.
Based on this, it seems that constraining the image dimensions to be multiples of 16 should already be sufficient.
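To make the arithmetic concrete, here is a minimal sketch of my reasoning. The constant and function names are hypothetical, and the factors 8 and 2 are my assumptions from reading the architecture, not values copied from the repository's code:

```python
VAE_DOWNSAMPLE = 8  # assumed spatial downsampling factor of the VAE
PATCH_SIZE = 2      # assumed patch size of the patch embedding layer


def token_grid(height: int, width: int) -> tuple[int, int]:
    """Return the (rows, cols) token grid the DiT would see for an image.

    Raises ValueError if the image does not divide evenly by the
    combined factor VAE_DOWNSAMPLE * PATCH_SIZE.
    """
    factor = VAE_DOWNSAMPLE * PATCH_SIZE  # 8 * 2 = 16
    if height % factor or width % factor:
        raise ValueError(f"dimensions must be multiples of {factor}")
    return height // factor, width // factor


# 1008 is a multiple of 16 but not of 32, yet it still yields an integer grid.
print(token_grid(1024, 1008))  # (64, 63)
```

Under these assumptions, a size such as 1024 × 1008 divides cleanly at every stage, which is why the multiple-of-32 constraint looks stricter than necessary to me.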
Could you clarify why a multiple of 32 is required here?
Is there an additional downsampling stage, architectural constraint, or implementation detail that I might be missing?
Thanks in advance for the clarification!