[Wan] Optimize time & memory #12780
Fabrice-TIERCELIN wants to merge 8 commits into huggingface:main
|
Hey, it would be interesting to know the rough before-and-after metrics. It would be great if this does reduce memory, as Wan's memory usage really shoots up as resolution and duration increase. |
|
I have implemented this code to benchmark:
import time
...
start = time.time()
x1 = hidden_states[..., 0::2]
x2 = hidden_states[..., 1::2]
end = time.time()
print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! BENCHMARK !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
print(end - start)
This code runs 160 times at startup. Timings before the change, after the change, and the difference (in seconds):
So the time is reduced by |
c484e2d to 1724a40
|
Are you waiting for me to do something? |
|
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
|
Please approve my MR 🥺 A message threatens to archive it. |
|
@sayakpaul or @delmalih, please approve 🥺 |
sayakpaul left a comment
Thanks for this!
Can you provide a script that measures the latency and memory consumption with and without this PR?
|
Here are the two modifications to compare.
Without improvement, replace this:
x1, x2 = hidden_states.unflatten(-1, (-1, 2)).unbind(-1)
... by that:
start = time.time()
x1, x2 = hidden_states.unflatten(-1, (-1, 2)).unbind(-1)
end = time.time()
print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! BENCHMARK !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
print(end - start)
With improvement, replace this:
x1, x2 = hidden_states.unflatten(-1, (-1, 2)).unbind(-1)
... by that:
start = time.time()
x1 = hidden_states[..., 0::2]
x2 = hidden_states[..., 1::2]
end = time.time()
print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! BENCHMARK !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
print(end - start)
The duration will be logged in both cases. You already have the results above. |
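Before timing anything, it may help to confirm that the two forms select the same elements. This is a minimal sketch, assuming PyTorch is installed; the small random tensor is only a stand-in for `hidden_states`, and its shape is arbitrary, not the one Wan uses:

```python
import torch

# Stand-in for hidden_states; the shape is an arbitrary assumption.
hidden_states = torch.randn(2, 4, 8)

# Original form: group the last dimension into pairs, then split each pair.
a1, a2 = hidden_states.unflatten(-1, (-1, 2)).unbind(-1)

# Proposed form: strided slices over the even and odd indices.
b1 = hidden_states[..., 0::2]
b2 = hidden_states[..., 1::2]

# Element-wise equality of both halves.
same_values = torch.equal(a1, b1) and torch.equal(a2, b2)

# Shape and stride comparison shows whether the results are laid out alike.
same_layout = a1.shape == b1.shape and a1.stride() == b1.stride()

print(same_values, same_layout)
```

If both checks pass on the real tensor as well, the two forms are interchangeable with respect to the output, and only their cost remains to be compared.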
|
Those results are not conclusive enough, IMO. We should benchmark the end-to-end latency involved in generating a reasonable video clip. |
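One way such a measurement could be structured is sketched below. It is not the benchmark used in this PR: `measure` is a hypothetical helper, the tensor is an arbitrary stand-in, and a real end-to-end comparison would wrap a full pipeline call instead of the two small lambdas shown here. It assumes PyTorch is installed and reports peak CUDA memory only when a GPU is available:

```python
import time

import torch


def measure(fn, repeats=100):
    """Return (mean seconds per call, peak CUDA bytes or None on CPU)."""
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        # Finish pending GPU work and reset the peak-memory counter.
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if use_cuda:
        # GPU kernels are asynchronous; wait for them before stopping the clock.
        torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / repeats
    peak = torch.cuda.max_memory_allocated() if use_cuda else None
    return elapsed, peak


# Arbitrary stand-in tensor; not the shape used by Wan.
x = torch.randn(4, 16, 64)

t_old, mem_old = measure(lambda: x.unflatten(-1, (-1, 2)).unbind(-1))
t_new, mem_new = measure(lambda: (x[..., 0::2], x[..., 1::2]))

print(f"unflatten/unbind: {t_old:.2e} s per call, peak memory: {mem_old}")
print(f"strided slicing:  {t_new:.2e} s per call, peak memory: {mem_new}")
```

Averaging over many calls and using `time.perf_counter` reduces the noise of a single `time.time()` reading; passing a function that runs the whole generation would give the end-to-end figure.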
|
All this code is run at launch time, not generation time. It optimizes the startup. So the video length is not important. |
|
@sayakpaul, here is a video without: without.mp4 and a video with: with.mp4 |
What does this PR do?
This PR reduces the time and memory used when running Wan. I have successfully tested the performance improvement and done a crash test (I put an error in place of my code and confirmed that the error was raised). The output remains the same.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.