Tuner rewrite #330
Bench: 12044152
Looking much better
So, we are using Array of Structures for
We might be missing About the Node Design; We absolutely cannot change Also since
If it's a lot more efficient, hopefully we won't need microbatching.
Good catch. Implemented (though with a different scheme) in the new commit; a decent speedup was recorded (see the PR comment).
It is optimizable, sure, but that's a problem only for the first batch of the first epoch. Furthermore, we don't really know exactly how much space we need on the tape, so I would leave it as is. I added a bit of reserve anyway, just to help.
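To make the "bit of reserve" idea concrete, here is a minimal sketch of a tape that pre-allocates with some headroom and keeps its capacity between batches. `Tape`, `TapeNode`, and the 25% headroom are hypothetical names and values chosen for illustration; the actual tape in this PR is not shown in the thread and may differ.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical node type; the real tuner's Node layout is not shown here.
struct TapeNode {
    double value = 0.0;
    double grad = 0.0;
};

// Hypothetical tape: a growable buffer that is reused across batches.
class Tape {
public:
    // Reserve the expected node count plus some headroom (assumed 25%),
    // since the exact space needed is not known up front.
    explicit Tape(std::size_t expected_nodes, double headroom = 0.25) {
        const auto extra = static_cast<std::size_t>(expected_nodes * headroom);
        nodes_.reserve(expected_nodes + extra);
    }

    // Append a node and return its index on the tape.
    std::size_t push(double value) {
        nodes_.push_back(TapeNode{value, 0.0});
        return nodes_.size() - 1;
    }

    // clear() drops the elements but keeps the allocation, so any growth
    // cost is paid only the first time the tape fills up.
    void reset() { nodes_.clear(); }

    std::size_t size() const { return nodes_.size(); }
    std::size_t capacity() const { return nodes_.capacity(); }

private:
    std::vector<TapeNode> nodes_;
};
```

With this shape, reallocation can only happen while the tape is still growing to its working size, which matches the point that the cost is confined to the first batch of the first epoch.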
Absolutely. It also needs a redesign to support generic-sized input operations.
It doesn't?
Yes.
So much better 😄
JonathanHallstrom
left a comment
Looks good, just a few small things
Bench: 11856625
This rewrite of the tuning system brings a huge speedup by:
It's probably still very optimizable, and during the rewrite I removed two things that will definitely need to be reimplemented:
The rewrite also aims to flush out any stray bugs along the way.
The Node design probably needs to be redone for better cache performance and alignment.
Feedback welcome and needed.
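As a starting point for the Node discussion, here is a hedged sketch of what a cache-friendly, 16-byte-aligned node could look like. `PackedNode` and its fields are assumptions made for illustration, not the layout used in this PR; the point is only to show how `alignas(16)` and `static_assert` can pin down size and alignment at compile time.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout: two floats plus two operand indices, 16 bytes total.
struct alignas(16) PackedNode {
    float value;        // forward value
    float grad;         // accumulated gradient
    std::uint32_t lhs;  // tape index of the first operand
    std::uint32_t rhs;  // tape index of the second operand
};

// Fail the build if someone adds a field that changes the layout.
static_assert(sizeof(PackedNode) == 16, "PackedNode must stay 16 bytes");
static_assert(alignof(PackedNode) == 16, "PackedNode must be 16-byte aligned");

// Number of nodes per cache line, assuming a 64-byte line by default.
constexpr std::size_t nodes_per_cache_line(std::size_t line_bytes = 64) {
    return line_bytes / sizeof(PackedNode);
}

static_assert(64 % sizeof(PackedNode) == 0,
              "a node should never straddle two 64-byte cache lines");
```

Under these assumptions four nodes fit in one 64-byte line, and an aligned node never straddles two lines.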
=================================
🚀 Performance Tracking
Machine: Ryzen 7 5800X
Dataset: v2.1 + v2.2 + v3 + dfrcv0 + dfrcv1
Metric: Average epoch runtime over 8 epochs
Baseline
Base: 83.5055 s/epoch
📈 Speedup Progression
Node, +alignas(16), std::unreachable
🏁 Current Best
6.8646 s/epoch (12.16× faster than baseline)
Bench: 12044152