@TheRealGioviok TheRealGioviok commented Nov 21, 2025

This rewrite of the tuning system brings a huge speedup by:

  • Not using std::function for the backward pass
  • Not using smart pointers
  • Switching to an arena-allocated tape model for the nodes (see the sketch below)
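
A minimal sketch of the general shape, as an illustration only: the names (Op, Node, Tape) and the op set are assumptions for the example, not the actual Clockwork code.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch: a flat tape of tagged-union nodes indexed by integers,
// replacing std::function callbacks and smart-pointer-linked graph nodes.
enum class Op : std::uint8_t { Leaf, Add, Mul, MulConst };

struct BinInputs    { std::uint32_t lhs, rhs; };
struct ScalarInputs { std::uint32_t lhs; double k; };

struct Node {
    Op op;                    // tag
    union {                   // payload, interpreted according to op
        BinInputs    bin;
        ScalarInputs scal;
    };
};

struct Tape {
    std::vector<Node>   nodes;   // arena: push_back only, cleared per batch
    std::vector<double> value;   // forward values, parallel to nodes
    std::vector<double> grad;    // adjoints, parallel to nodes

    std::uint32_t push(const Node& n, double v) {
        nodes.push_back(n);
        value.push_back(v);
        grad.push_back(0.0);
        return static_cast<std::uint32_t>(nodes.size() - 1);
    }

    // Reverse sweep: one switch per node instead of a virtual call or
    // std::function invocation per node.
    void backward(std::uint32_t root) {
        grad[root] = 1.0;
        for (std::uint32_t i = root + 1; i-- > 0;) {
            const Node&  n = nodes[i];
            const double g = grad[i];
            switch (n.op) {
                case Op::Add:
                    grad[n.bin.lhs] += g;
                    grad[n.bin.rhs] += g;
                    break;
                case Op::Mul:
                    grad[n.bin.lhs] += g * value[n.bin.rhs];
                    grad[n.bin.rhs] += g * value[n.bin.lhs];
                    break;
                case Op::MulConst:
                    grad[n.scal.lhs] += g * n.scal.k;
                    break;
                case Op::Leaf:
                    break;
            }
        }
    }
};
```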

It's still probably very optimizable, and during the rewrite I removed two things that will definitely need to be reimplemented:

  • Removed microbatching (should be easy to reimplement)
  • Removed the efficient ::sum (probably a problem, since it also implemented Kahan summation; see the sketch after this list)
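
For reference, the standard Kahan compensated-summation loop looks like this. This is the generic algorithm, not the removed Clockwork ::sum itself.

```cpp
#include <vector>

// Minimal sketch of Kahan (compensated) summation.
double kahan_sum(const std::vector<double>& xs) {
    double sum = 0.0;
    double c   = 0.0;          // running compensation for lost low-order bits
    for (double x : xs) {
        double y = x - c;      // subtract the previously lost part
        double t = sum + y;    // low-order bits of y may be lost here...
        c = (t - sum) - y;     // ...and are recovered into c
        sum = t;
    }
    return sum;
}
```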

The rewrite will hopefully also shake out some stray bugs along the way.

The Node design probably needs to be redone for better cache performance and alignment.

Feedback welcome and needed.

=================================

🚀 Performance Tracking

Machine: Ryzen 7 5800X
Dataset: v2.1 + v2.2 + v3 + dfrcv0 + dfrcv1
Metric: Average epoch runtime over 8 epochs

Baseline

Base: 83.5055 s/epoch


📈 Speedup Progression

| Step | Change Introduced | Runtime (s/epoch) | Speedup vs Base | Speedup vs Previous |
|-----:|-------------------|------------------:|----------------:|--------------------:|
| 1 | Base | 83.5055 | | |
| 2 | Arena + tagged unions | 9.2513 | 9.02× | |
| 3 | SoA + raw pointer in hot backward loop | 7.9926 | 10.45× | 1.13× |
| 4 | Lazy-node addition, f64x2 storage, inline arena allocs | 7.8862 | 10.58× | 1.01× |
| 5 | Smaller Node + alignas(16) | 7.2979 | 11.44× | 1.08× |
| 6 | Bugfix + std::unreachable | 6.8646 | 12.16× | 1.06× |

🏁 Current Best

6.8646 s/epoch (12.16× faster than baseline)

Bench: 12044152

jw1912 commented Nov 21, 2025

Looking much better

Bench: 12044152

Aethdv commented Nov 21, 2025

So, we are using Array of Structures for ValueData. Why not split this into two separate arenas: Arena<f64> m_values and Arena<f64> m_gradients?
On the forward pass you never touch the gradient.
Loading a ValueData pulls 16 bytes into the cache line, but effectively wastes 50% of that bandwidth (the gradient double).
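
Roughly the split I mean, as a sketch (member names are placeholders, and I'm using std::vector here as suggested below rather than the actual arena type):

```cpp
#include <vector>

// Sketch of the structure-of-arrays split being suggested; names are
// placeholders, not the actual Clockwork types.
struct GraphStorage {
    // AoS: the forward pass drags the unused gradient into cache with every value.
    // struct ValueData { double value; double gradient; };
    // std::vector<ValueData> m_data;

    // SoA: the forward pass only touches m_values; the backward pass touches both.
    std::vector<double> m_values;
    std::vector<double> m_gradients;
};
```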

std::vector is probably the best choice here, provided we handle growth.

We might be missing reserve(). You clear() the tape and arenas in cleanup(), and clear() keeps capacity, but the initial allocation in the constructor only reserves space for the parameters.
So as the graph grows during the first few iterations, std::vector will reallocate (copy/move) multiple times.
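
Something along these lines would avoid the repeated growth. A sketch with hypothetical names; expected_nodes would have to come from the previous batch or a heuristic:

```cpp
#include <cstddef>
#include <vector>

// Sketch: clear() keeps capacity, and an up-front reserve() avoids repeated
// reallocation during the first few batches.
void reset_tape(std::vector<double>& values,
                std::vector<double>& gradients,
                std::size_t expected_nodes) {
    values.clear();                    // size -> 0, capacity kept
    gradients.clear();
    values.reserve(expected_nodes);    // no-op once capacity is large enough
    gradients.reserve(expected_nodes);
}
```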

About the Node design: f64 requires 8-byte alignment, so it seems like the compiler will pad it to 32 bytes (or maybe 24 bytes if packed tightly, though 24 is probably awkward). Ops are either Binary or Scalar; I will leave that to @Ravenslofty.
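
To illustrate the padding concern, two hypothetical layouts (NodeLoose/NodePacked are made-up names, and the sizes assume a typical 64-bit ABI, not the real Node):

```cpp
#include <cstdint>

// Hypothetical layouts to show how the padding plays out.
struct NodeLoose {
    double        scalar;   // 8 bytes, forces 8-byte struct alignment
    std::uint32_t lhs;      // 4
    std::uint32_t rhs;      // 4
    std::uint8_t  op;       // 1 + 7 bytes tail padding
};

struct alignas(16) NodePacked {
    std::uint32_t lhs;
    std::uint32_t rhs;
    std::uint32_t scalar_index; // scalar payload moved out-of-line (one possible trick)
    std::uint8_t  op;           // 1 + 3 bytes tail padding
};

// On a typical 64-bit ABI:
static_assert(sizeof(NodeLoose)  == 24);
static_assert(sizeof(NodePacked) == 16);
```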

We absolutely cannot change ValueHandle to hold pointers; we've got to keep indices (which is what we want, I believe).

Also since Graph is thread_local, I assume each thread runs its own isolated graph and we aggregate gradients globally later?
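
If so, I imagine the reduction looking roughly like this. Sketch only; every name here is invented for illustration:

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Sketch of per-thread graphs with a global gradient reduction.
struct Graph {
    std::vector<double> param_grads;   // gradients for the shared parameters
    // ... tape, values, etc.
};

thread_local Graph g_graph;            // each thread builds and backprops its own graph

std::mutex          g_accum_mutex;
std::vector<double> g_global_grads;    // one slot per tunable parameter

void accumulate_thread_gradients() {
    std::lock_guard<std::mutex> lock(g_accum_mutex);
    if (g_global_grads.size() < g_graph.param_grads.size())
        g_global_grads.resize(g_graph.param_grads.size(), 0.0);
    for (std::size_t i = 0; i < g_graph.param_grads.size(); ++i)
        g_global_grads[i] += g_graph.param_grads[i];
}
```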

@87flowers

If it's a lot more efficient, hopefully we won't need microbatching.

TheRealGioviok commented Nov 22, 2025

So, we are using Array of Structures for ValueData. Why not split this into two separate arenas: Arena<f64> m_values and Arena<f64> m_gradients? On the forward pass you never touch the gradient. Loading a ValueData pulls 16 bytes into the cache line, but effectively wastes 50% of that bandwidth (the gradient double).

Good catch, implemented (though with a different scheme) in the new commit; a decent speedup was recorded (see the PR comment).

We might be missing reserve(). You clear() the tape and arenas in cleanup(), and clear() keeps capacity, but the initial allocation in the constructor only reserves space for the parameters. So as the graph grows during the first few iterations, std::vector will reallocate (copy/move) multiple times.

It is optimizable, sure, but that's a problem only for the first batch of the first epoch. Furthermore, we don't really know exactly how much space we need on the tape, so I would leave it as is. I added a bit of reserve just to help anyway.

About the Node design: f64 requires 8-byte alignment, so it seems like the compiler will pad it to 32 bytes (or maybe 24 bytes if packed tightly, though 24 is probably awkward). Ops are either Binary or Scalar; I will leave that to @Ravenslofty.

Absolutely. It also needs a redesign to support operations with a generic number of inputs.
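
For example, one option is an offset plus count into a shared input-index arena instead of fixed lhs/rhs fields. Just a sketch; all names here are hypothetical:

```cpp
#include <cstdint>
#include <vector>

// Sketch of a node supporting an arbitrary number of inputs.
struct NodeN {
    std::uint8_t  op;
    std::uint32_t first_input;  // offset into input_indices
    std::uint32_t input_count;  // number of inputs for this node
};

struct TapeN {
    std::vector<NodeN>         nodes;
    std::vector<std::uint32_t> input_indices;  // flattened input lists

    std::uint32_t push(std::uint8_t op, const std::vector<std::uint32_t>& inputs) {
        NodeN n{op,
                static_cast<std::uint32_t>(input_indices.size()),
                static_cast<std::uint32_t>(inputs.size())};
        input_indices.insert(input_indices.end(), inputs.begin(), inputs.end());
        nodes.push_back(n);
        return static_cast<std::uint32_t>(nodes.size() - 1);
    }
};
```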

We absolutely cannot change ValueHandle to hold pointers; we've got to keep indices (which is what we want, I believe).

It doesn't?

Also since Graph is thread_local, I assume each thread runs its own isolated graph and we aggregate gradients globally later?

Yes.

Aethdv commented Nov 23, 2025

So much better 😄

TheRealGioviok marked this pull request as ready for review on November 23, 2025 17:32

JonathanHallstrom left a comment

Looks good, just a few small things

TheRealGioviok merged commit cde2ced into official-clockwork:main on Dec 3, 2025
28 checks passed