Update JIT stage timing heuristic #3143
adamsitnik
left a comment
Hello @timcassell !
Thank you for addressing my feedback. I haven’t reviewed the original changes in #2806, so I’d like to ask a few questions first to better understand the current design and behavior.
- Is the new tiering heuristic based on deterministic signals such as JIT events emitted by the runtime or disassembly data, or is it intended to approximate the CLR’s current heuristic?
- How is this handled across different runtimes? For example, is the new heuristic enabled for AOT and .NET Framework, and how does it behave on older versus newer Mono? Also, do we respect all environment variables that allow users to disable tiered JIT, such as DOTNET_TieredCompilation and COMPlus_TieredCompilation?
- How does it affect the total time it takes to run a typical micro benchmark?
- How does this interact with other stages, especially Warmup? For example, if the JITting phase runs long enough to satisfy the warmup heuristic, do we then skip the warmup phase?
- How can users configure the JITting phase? Do we expose new Job APIs, attributes, and command-line arguments?
Thanks,
Adam
It does not listen to JIT events; it simply queries the values the JIT uses for the number of invokes at each stage, with an approximate wait time between JIT stages. You can see how those values are queried in JitInfo.cs.
It only handles JIT stages in CoreCLR, and when tiering is enabled. In all other runtimes
It depends on how fast the benchmark is. By default, it invokes the benchmark 30x per JIT stage promotion, with a 250ms wait time between stages. On the latest runtime there are 3 calculated stage promotions (it should be 2, but there is a runtime bug that we account for).
Other stages were left unchanged. Warmup stage has existed since even before tiered JIT was a thing and I didn't want to mess with it. We could definitely improve things there, though it's tracked by #1993.
Via environment variables or MSBuild properties that the runtime exposes. No new APIs were added to BDN for it.
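As a rough illustration of the default JIT-stage cost described above (30 invokes per promotion, a 250ms wait, 3 promotions), the sketch below estimates the extra time for a benchmark with a given per-invocation duration. The function name and the model are assumptions made for illustration: it assumes one wait per promotion and ignores any bail-out or timeout logic, so it is not how BDN computes anything internally.

```python
def estimate_jit_stage_seconds(invoke_seconds,
                               invokes_per_promotion=30,
                               promotions=3,
                               wait_seconds=0.25):
    """Rough upper-bound estimate of time spent in the JIT stage.

    Assumption: every promotion costs `invokes_per_promotion` invocations
    plus one fixed wait. Bail-outs for long-running benchmarks are ignored.
    """
    per_promotion = invokes_per_promotion * invoke_seconds + wait_seconds
    return promotions * per_promotion


if __name__ == "__main__":
    # Hypothetical per-invocation durations: 1 microsecond, 1 ms, 100 ms.
    for duration in (1e-6, 1e-3, 0.1):
        total = estimate_jit_stage_seconds(duration)
        print(f"{duration:g} s per invoke -> about {total:.2f} s in the JIT stage")
```

Under this model the fixed waits dominate for very fast benchmarks, while for benchmarks in the 100ms range the repeated invocations dominate.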
The CoreCLR detection was broken, fixed in #3150.
Great, thanks for fixing it!
So for benchmarks that are neither very short nor long-running, like:
    [Benchmark]
    public void Sleep() => Thread.Sleep(100);
we are going to spend a lot of time trying to promote them. In this case it was 9+ seconds, while the whole benchmarking run took 25s. 16s -> 25s is quite a lot. Do you see any way to reduce the time it takes to run it?
I think it would be nice to offer an ability to disable (or hardcode) it. So for example if I am having plenty of benchmarks similar to the
    config.WithJittingCount(1);
    [JittingCount(1)]
The only way to reduce the time is to reduce the CallCountThreshold. Originally I had set it to 1 via TC_AggressiveTiering, but Egor explained in that PR why that was a bad idea, so I went with a timeout instead. We already have config options to skip the JIT stage ( Generally users are most interested in measuring tier1 performance, which is the default
What is that supposed to do? Just run a single JIT iteration? I guess it would kinda make sense to have it get to tier0 before the pilot stage runs. But probably it should just be a bool value for whether to go through all the tiers or not, because I don't know what
It seems that with those settings you were actually measuring tier0 performance in the first iterations (unless the pilot stage invoked it enough times to bump it up), and possibly higher tiers in later iterations, depending on how fast the benchmark is. So each iteration was not measuring the same thing. But the perf repo is on a version of BDN that has the new JIT stage, so it should always be measuring tier1 now. I don't have access to the data, so I don't know what effect that had on the total runtime and measurement stability.
Also, if the runtime would fix the duplicated tier0 with OSR enabled we could reduce the time by 1/3, but that's out of our control.
They are, but Tiered JIT was introduced 7(?) years ago and they have figured out various ways of dealing with this problem. Most of them were lucky enough to just run micro-benchmarks that were getting promoted to Tier 1 with the default Pilot/Warmup settings. Others increased the warmup count or just disabled Tiered JIT.
In the case of dotnet/performance, we went back and forth about Tiered JIT when the feature was introduced (dotnet/performance#247, dotnet/performance#320). What we actually ended up doing (dotnet/performance#1536, dotnet/performance#666) was setting an explicit min/max warmup count for just a couple of benchmarks that were not getting promoted to Tier 1 with our custom settings. And what is very important in the case of dotnet/performance is the time it takes to run all the benchmarks. Assuming that we run only nano-benchmarks (which is not true) and the new logic takes 1s per benchmark, our 5k benchmarks will need an additional 83 minutes to run. And we are aiming at running the benchmarks multiple times per day, for multiple architectures. So in our (.NET Team) case, it really matters not to prolong the time it takes to run all the benchmarks.
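The 83-minute figure above follows from simple arithmetic, sketched below. The function name is hypothetical, and the 1s-per-benchmark overhead is the simplifying assumption stated in the comment, not a measured value.

```python
def extra_suite_minutes(benchmark_count, overhead_seconds_per_benchmark):
    """Additional wall-clock minutes if every benchmark pays a fixed overhead."""
    return benchmark_count * overhead_seconds_per_benchmark / 60


if __name__ == "__main__":
    # 5k benchmarks at an assumed 1 s of extra JIT-stage time each.
    minutes = extra_suite_minutes(5000, 1.0)
    print(f"about {minutes:.0f} extra minutes per full run")
```

Running the suite several times per day on several architectures multiplies this cost accordingly.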
Bool is enough for our needs.
@EgorBo @AndyAyersMS Is this something that can be easily fixed? Do we have any plans to do it?
Thank you for providing the link. Ideally all of this should be combined, and the total time it takes to get the perfect invocation count, get the code promoted to Tier 1, and get it warmed up should be minimal. Another thing to consider is the "worst case" benchmarks here: benchmarks that take less than one iteration time to execute (so we are not going to skip the extra warmup), but for which running them 3x30 times takes a lot of time. Most of them just delegate the work to smaller methods, and these small, hot path methods are usually executed multiple times even for a single invocation. And they usually get promoted to Tier 1 with the default settings. If my memory serves me well, this was the case for all the C# Computer Game benchmarks. What I am trying to say is that based on the evidence we had in the past, very few benchmarks required the extra warmup.
Right, but that involved extra knowledge of the JIT tiering strategy and how the benchmarks were being invoked. New users run into the same problem. Now that's no longer a concern.
I think now the recommended config can be updated to set warmup count to 0, and remove all the specific benchmark warmup settings since tiering up is automatically handled now.
Dotnet/performance updated to the new JIT stage in dotnet/performance#5073, do you have the numbers from before/after?
I think that option will probably not be needed. With the JIT stage already promoting to tier1, the pilot stage should be able to reach a stable state more quickly, and with the warmup count set to 0 there is less time spent "guessing" at tiering up. So I would think the total time would be not much more than before (except for the duplicated tier0, of course).
I agree. @AndreyAkinshin mentioned combining the pilot and warmup stages in #2787 (also #1210), which I think would also help here, but I'm not sure about concrete plans for how to go about it.
I mean that's what this PR is for, to skip the tiering-up for long-running benchmarks. We can adjust the heuristic for what we consider to be the optimal cutoff value. I don't know what that cutoff value should be, though, so I just started with copying the pilot stage's heuristic.
If I understand it correctly, it's dotnet/runtime#76402; we should probably address it for .NET 11.0 (I am not promising, but will try).
I don't think it's the same issue. It was described in dotnet/runtime#117787 (comment), I'm not sure if there's a dedicated issue for it. [Edit] Also later comment in that issue dotnet/runtime#76402 (comment).
Also, while I was working with Claude, it found that we can listen to JIT events in-process, so I'm going to try building a prototype to do that for deterministic tiering-up instead of waiting 250ms.
39adb6f to bb136af
Bails out after the second invocation if it detects a long-running benchmark using the same calculation as the pilot stage, instead of continuing invokes for 10 seconds.
Fixes #3114