@TheRealGioviok TheRealGioviok commented Nov 21, 2025

This rewrite of the tuning system brings a huge speedup by:

  • Not using std::function for the backward pass
  • Not using smart pointers
  • Switching to an arena-allocated tape model for the nodes (see the sketch below)
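
A minimal sketch of the general shape, as an illustration only: the names (Op, Node, Tape) and the op set are assumptions for the example, not the actual Clockwork code.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch: a flat tape of tagged-union nodes indexed by integers,
// replacing std::function callbacks and smart-pointer-linked graph nodes.
enum class Op : std::uint8_t { Leaf, Add, Mul, MulConst };

struct BinInputs    { std::uint32_t lhs, rhs; };
struct ScalarInputs { std::uint32_t lhs; double k; };

struct Node {
    Op op;                    // tag
    union {                   // payload, interpreted according to op
        BinInputs    bin;
        ScalarInputs scal;
    };
};

struct Tape {
    std::vector<Node>   nodes;   // arena: push_back only, cleared per batch
    std::vector<double> value;   // forward values, parallel to nodes
    std::vector<double> grad;    // adjoints, parallel to nodes

    std::uint32_t push(const Node& n, double v) {
        nodes.push_back(n);
        value.push_back(v);
        grad.push_back(0.0);
        return static_cast<std::uint32_t>(nodes.size() - 1);
    }

    // Reverse sweep: one switch per node instead of a virtual call or
    // std::function invocation per node.
    void backward(std::uint32_t root) {
        grad[root] = 1.0;
        for (std::uint32_t i = root + 1; i-- > 0;) {
            const Node&  n = nodes[i];
            const double g = grad[i];
            switch (n.op) {
                case Op::Add:
                    grad[n.bin.lhs] += g;
                    grad[n.bin.rhs] += g;
                    break;
                case Op::Mul:
                    grad[n.bin.lhs] += g * value[n.bin.rhs];
                    grad[n.bin.rhs] += g * value[n.bin.lhs];
                    break;
                case Op::MulConst:
                    grad[n.scal.lhs] += g * n.scal.k;
                    break;
                case Op::Leaf:
                    break;
            }
        }
    }
};
```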

It's still probably very optimizable, and during the rewrite I removed two things that will definitely need to be reimplemented:

  • Removed microbatching (should be easy to reimplement)
  • Removed the efficient ::sum (probably a problem, since it also implemented Kahan summation; see the sketch after this list)
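
For reference, the standard Kahan compensated-summation loop looks like this. This is the generic algorithm, not the removed Clockwork ::sum itself.

```cpp
#include <vector>

// Minimal sketch of Kahan (compensated) summation.
double kahan_sum(const std::vector<double>& xs) {
    double sum = 0.0;
    double c   = 0.0;          // running compensation for lost low-order bits
    for (double x : xs) {
        double y = x - c;      // subtract the previously lost part
        double t = sum + y;    // low-order bits of y may be lost here...
        c = (t - sum) - y;     // ...and are recovered into c
        sum = t;
    }
    return sum;
}
```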

The rewrite will hopefully also shake out some stray bugs along the way.

The Node design probably needs to be redone for better cache performance and alignment.

Feedback welcome and needed.

=================================

🚀 Performance Tracking

Machine: Ryzen 7 5800X
Dataset: v2.1 + v2.2 + v3 + dfrcv0 + dfrcv1
Metric: Average epoch runtime over 8 epochs

Baseline

Base: 83.5055 s/epoch


📈 Speedup Progression

| Step | Change Introduced | Runtime (s/epoch) | Speedup vs Base | Speedup vs Previous |
|-----:|-------------------|------------------:|----------------:|--------------------:|
| 1 | Base | 83.5055 | | |
| 2 | Arena + tagged unions | 9.2513 | 9.02× | |
| 3 | SoA + raw pointer in hot backward loop | 7.9926 | 10.45× | 1.13× |
| 4 | Lazy-node addition, f64x2 storage, inline arena allocs | 7.8862 | 10.58× | 1.01× |
| 5 | Smaller Node + alignas(16) | 7.2979 | 11.44× | 1.08× |
| 6 | Bugfix + std::unreachable | 6.8646 | 12.16× | 1.06× |

🏁 Current Best

6.8646 s/epoch (12.16× faster than baseline)

Bench: 12044152

jw1912 commented Nov 21, 2025

Looking much better

Bench: 12044152

Aethdv commented Nov 21, 2025

So, we are using Array of Structures for ValueData. Why not split this into two separate arenas: Arena<f64> m_values and Arena<f64> m_gradients?
On the forward pass you never touch the gradient.
Loading a ValueData pulls 16 bytes into the cache line, but effectively wastes 50% of that bandwidth (the gradient double).
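
Roughly the split I mean, as a sketch (member names are placeholders, and I'm using std::vector here as suggested below rather than the actual arena type):

```cpp
#include <vector>

// Sketch of the structure-of-arrays split being suggested; names are
// placeholders, not the actual Clockwork types.
struct GraphStorage {
    // AoS: the forward pass drags the unused gradient into cache with every value.
    // struct ValueData { double value; double gradient; };
    // std::vector<ValueData> m_data;

    // SoA: the forward pass only touches m_values; the backward pass touches both.
    std::vector<double> m_values;
    std::vector<double> m_gradients;
};
```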

std::vector is probably the best choice here, provided we handle growth.

We might be missing reserve(). You clear() the tape and arenas in cleanup(), and clear() keeps capacity, but the initial allocation in the constructor only reserves space for the parameters.
So as the graph grows during the first few iterations, std::vector will reallocate (copy/move) multiple times.
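
Something along these lines would avoid the repeated growth. A sketch with hypothetical names; expected_nodes would have to come from the previous batch or a heuristic:

```cpp
#include <cstddef>
#include <vector>

// Sketch: clear() keeps capacity, and an up-front reserve() avoids repeated
// reallocation during the first few batches.
void reset_tape(std::vector<double>& values,
                std::vector<double>& gradients,
                std::size_t expected_nodes) {
    values.clear();                    // size -> 0, capacity kept
    gradients.clear();
    values.reserve(expected_nodes);    // no-op once capacity is large enough
    gradients.reserve(expected_nodes);
}
```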

About the Node design: f64 requires 8-byte alignment, so it seems like the compiler will pad it to 32 bytes (or maybe 24 bytes if packed tightly, though 24 is probably awkward). Ops are either Binary or Scalar; I will leave that to @Ravenslofty.
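
To illustrate the padding concern, two hypothetical layouts (NodeLoose/NodePacked are made-up names, and the sizes assume a typical 64-bit ABI, not the real Node):

```cpp
#include <cstdint>

// Hypothetical layouts to show how the padding plays out.
struct NodeLoose {
    double        scalar;   // 8 bytes, forces 8-byte struct alignment
    std::uint32_t lhs;      // 4
    std::uint32_t rhs;      // 4
    std::uint8_t  op;       // 1 + 7 bytes tail padding
};

struct alignas(16) NodePacked {
    std::uint32_t lhs;
    std::uint32_t rhs;
    std::uint32_t scalar_index; // scalar payload moved out-of-line (one possible trick)
    std::uint8_t  op;           // 1 + 3 bytes tail padding
};

// On a typical 64-bit ABI:
static_assert(sizeof(NodeLoose)  == 24);
static_assert(sizeof(NodePacked) == 16);
```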

We absolutely cannot change ValueHandle to hold pointers; we've got to keep indices (which is what we want, I believe).

Also since Graph is thread_local, I assume each thread runs its own isolated graph and we aggregate gradients globally later?
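
If so, I imagine the reduction looking roughly like this. Sketch only; every name here is invented for illustration:

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Sketch of per-thread graphs with a global gradient reduction.
struct Graph {
    std::vector<double> param_grads;   // gradients for the shared parameters
    // ... tape, values, etc.
};

thread_local Graph g_graph;            // each thread builds and backprops its own graph

std::mutex          g_accum_mutex;
std::vector<double> g_global_grads;    // one slot per tunable parameter

void accumulate_thread_gradients() {
    std::lock_guard<std::mutex> lock(g_accum_mutex);
    if (g_global_grads.size() < g_graph.param_grads.size())
        g_global_grads.resize(g_graph.param_grads.size(), 0.0);
    for (std::size_t i = 0; i < g_graph.param_grads.size(); ++i)
        g_global_grads[i] += g_graph.param_grads[i];
}
```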

@87flowers

If it's a lot more efficient, hopefully we won't need microbatching.

TheRealGioviok commented Nov 22, 2025

So, we are using Array of Structures for ValueData. Why not split this into two separate arenas: Arena<f64> m_values and Arena<f64> m_gradients? On the forward pass you never touch the gradient. Loading a ValueData pulls 16 bytes into the cache line, but effectively wastes 50% of that bandwidth (the gradient double).

Good catch, implemented (though with a different scheme) in the new commit; a decent speedup was recorded (see the PR comment).

We might be missing reserve(). You clear() the tape and arenas in cleanup(), and clear() keeps capacity, but the initial allocation in the constructor only reserves space for the parameters. So as the graph grows during the first few iterations, std::vector will reallocate (copy/move) multiple times.

It is optimizable, sure, but that's a problem only for the first batch of the first epoch. Furthermore, we don't really know exactly how much space we need on the tape, so I would leave it as is. I added a bit of reserve just to help anyway.

About the Node design: f64 requires 8-byte alignment, so it seems like the compiler will pad it to 32 bytes (or maybe 24 bytes if packed tightly, though 24 is probably awkward). Ops are either Binary or Scalar; I will leave that to @Ravenslofty.

Absolutely. It also needs a redesign to support operations with a generic number of inputs.
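
For example, one option is an offset plus count into a shared input-index arena instead of fixed lhs/rhs fields. Just a sketch; all names here are hypothetical:

```cpp
#include <cstdint>
#include <vector>

// Sketch of a node supporting an arbitrary number of inputs.
struct NodeN {
    std::uint8_t  op;
    std::uint32_t first_input;  // offset into input_indices
    std::uint32_t input_count;  // number of inputs for this node
};

struct TapeN {
    std::vector<NodeN>         nodes;
    std::vector<std::uint32_t> input_indices;  // flattened input lists

    std::uint32_t push(std::uint8_t op, const std::vector<std::uint32_t>& inputs) {
        NodeN n{op,
                static_cast<std::uint32_t>(input_indices.size()),
                static_cast<std::uint32_t>(inputs.size())};
        input_indices.insert(input_indices.end(), inputs.begin(), inputs.end());
        nodes.push_back(n);
        return static_cast<std::uint32_t>(nodes.size() - 1);
    }
};
```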

We absolutely cannot change ValueHandle to hold pointers; we've got to keep indices (which is what we want, I believe).

It doesn't?

Also since Graph is thread_local, I assume each thread runs its own isolated graph and we aggregate gradients globally later?

Yes.

Aethdv commented Nov 23, 2025

So much better 😄

TheRealGioviok marked this pull request as ready for review on November 23, 2025 17:32

JonathanHallstrom left a comment

Looks good, just a few small things

TheRealGioviok merged commit cde2ced into official-clockwork:main on Dec 3, 2025
28 checks passed