Fix SampleType field in ThreadSample event from SampleProfiler#124019
Fix SampleType field in ThreadSample event from SampleProfiler#124019steveisok merged 6 commits intodotnet:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a regression introduced in .NET 9 where the SampleType field in ThreadSample events from the Microsoft-DotNETCore-SampleProfiler was always reporting EP_SAMPLE_PROFILER_SAMPLE_TYPE_EXTERNAL, even when threads were executing managed code. The regression occurred when SuspendAllThreads was ported from NativeAOT to CoreCLR.
Changes:
- Restores the call to
SaveGCModeOnSuspension()for all threads after EE suspension completes inSuspendAllThreads() - This saves each thread's GC mode (cooperative vs preemptive) so the sample profiler can correctly determine if threads were executing managed code
|
Looping over all the threads may not be the 'right' fix here. Is there a better place to mark them? |
|
You need to save the state before suspension begins (or at least before the m_fPreemptiveGCDisabled field is changed during the suspension process). |
|
Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag |
7b71b30 to
49df4a5
Compare
I moved it before where I think that happens. Let me know if that's what you had in mind. |
The SampleType field was always reporting EP_SAMPLE_PROFILER_SAMPLE_TYPE_EXTERNAL even when threads were executing managed code. This regressed in .NET 9 when SuspendAllThreads was ported from NativeAOT to CoreCLR (commit 00a8973). Instead of iterating all threads during SuspendAllThreads (which scales poorly on many-core machines), each thread now saves its own GC mode when it voluntarily suspends: - In RareDisablePreemptiveGC when hitting the IsTrappingThreadsForSuspension check - In RedirectedHandledJITCase when a hijacked thread self-suspends This is pay-for-play: only threads that were actually executing managed code participate, and only during suspension. Fixes dotnet#123996
1422735 to
77f6bbf
Compare
GCX_PREEMP_NO_DTOR_END() re-enters cooperative mode via DisablePreemptiveGC, which hits RareDisablePreemptiveGC where SaveGCModeOnSuspension is already called.
…sion Per feedback, replace the m_gcModeOnSuspension field with a ThreadState bit: - Add TS_SuspensionTrapped (0x00000002) to indicate thread is trapped for suspension - Set the bit when entering the IsTrappingThreadsForSuspension scope - Reset the bit when exiting that scope - Sample profiler checks this bit to determine if thread was in managed code - Delete m_gcModeOnSuspension field and its accessor methods This is cleaner because it directly indicates the thread state rather than redundantly storing the GC mode.
|
Just curious, so if we set the state when moving back from preemptive to cooperate suspend mode to indicate that a thread is running managed code, would that mean we would potentially report more code as being in managed depending on the time it takes to iterate over all "suspended" threads and capture their managed callstacks, checking this new flag and external code get enough time to complete and transition back into rutnime and getting trapped by Looking through the CoreCLR suspension code this is my understanding: For threads already in preemptive mode, leave them running, they will eventually suspend if they are trying to re-enter runtime as part of When we do CPU samples in EventPipe we first suspend all threads, then iterate over the thread list and capture managed stack traces for each thread, part of that process is also classifying if a thread was running managed or external code when suspended. But since threads already being in preemptive code will continue running, that means they might end up trying to re-enter cooperate suspend mode and getting blocked by I guess this is less of a concern since the sample profiler is just sampling and maybe previous solution didn't see difference between a thread trapped when running in cooperate mode vs trying to re-enter runtime from preemptive mode. I guess this is also a definition on when a thread sample actually occurs, in CoreCLR it means the thread sample is not the state when all threads suspended, but when the actual thread is processed by sample profilers In Mono we handle this a little different, when runtime uses the full cooperate suspend model, it will do the suspension of threads moving from external code back to managed code as part of stub frame that stack walker identifies as external code, so even if the thread is trying to transition back into managed code, it will be trapped in the stub frame and stack-walker will classify that as still being in external code. |
|
Yes, the change in the PR will tend to report more threads as running in managed code - if they reached the managed code boundary while the sample is processed. The previous solution saved the current thread state immediately, before the thread got actually suspended. It was possible for it to report thread as running in managed code and the stacktrace pointing to unmanaged code (PInvoke transition). If we think that the previous behavior is better, it is fine with me to keep it as long as it is pay-for-play. The loop that iterates over all threads to save their current thread state immediately should only run for event pipe samples. (We may still want to keep the current change that gets rid of the full long and just uses a bit in a ThreadState.) |
We would add a new SUSPEND_REASON for event pipe sample and run this extra loop only for this suspend reason. |
|
I do like the current change where each thread is noting its state. If reporting more is an edge case or in the end doesn't really matter, I'd say keep the current change as is. @jkotas @lateralusX you guys are the experts, so whatever you prefer... |
|
@valco1994 Do you care about when this state is captured during the sampling event exactly? |
An exact moment of capturing the state is not critical for my scenarios. So, both approaches would be fine for me. |
I am fine with the fix in this PR. The event pipe sampler is not fair. It has a by-design bias for sampling at GC safe points. I do not see problem with the bias that comes with this fix. |
…23996' into fix-sample-profiler-sampletype-123996
|
/ba-g Unclear error in tvOS leg. Unrelated to PR. |
|
/backport to release/10.0 |
|
Started backporting to |
|
@steveisok backporting to git am output$ git am --3way --empty=keep --ignore-whitespace --keep-non-patch changes.patch
Applying: Fix SampleType field in ThreadSample event from SampleProfiler
Applying: Remove redundant SaveGCModeOnSuspension call in RedirectedHandledJITCase
Applying: Use TS_SuspensionTrapped thread state bit instead of m_gcModeOnSuspension
Applying: Fix OFFSETOF__Thread__m_pInterpThreadContext after m_gcModeOnSuspension removal
Using index info to reconstruct a base tree...
M src/coreclr/vm/amd64/asmconstants.h
M src/coreclr/vm/arm/asmconstants.h
M src/coreclr/vm/arm64/asmconstants.h
M src/coreclr/vm/riscv64/asmconstants.h
Falling back to patching base and 3-way merge...
Auto-merging src/coreclr/vm/amd64/asmconstants.h
CONFLICT (content): Merge conflict in src/coreclr/vm/amd64/asmconstants.h
Auto-merging src/coreclr/vm/arm/asmconstants.h
CONFLICT (content): Merge conflict in src/coreclr/vm/arm/asmconstants.h
Auto-merging src/coreclr/vm/arm64/asmconstants.h
CONFLICT (content): Merge conflict in src/coreclr/vm/arm64/asmconstants.h
Auto-merging src/coreclr/vm/riscv64/asmconstants.h
CONFLICT (content): Merge conflict in src/coreclr/vm/riscv64/asmconstants.h
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0004 Fix OFFSETOF__Thread__m_pInterpThreadContext after m_gcModeOnSuspension removal
Error: The process '/usr/bin/git' failed with exit code 128 |
…Profiler Backport of dotnet#124019 to release/10.0 ## Customer Impact - [X] Customer reported - [ ] Found internally The SampleType field in ThreadSample events from SampleProfiler was always reporting EP_SAMPLE_PROFILER_SAMPLE_TYPE_EXTERNAL even when threads were executing managed code. This causes incorrect profiling data for customers using diagnostics tools that rely on this event. Fixes dotnet#123996 ## Regression - [X] Yes - [ ] No This regressed in .NET 9 when SuspendAllThreads was ported from NativeAOT to CoreCLR (commit 00a8973). ## Testing The fix was verified by building CoreCLR with Debug/Checked configuration which validates the asmconstants offsets via compile-time assertions. ## Risk Low The change is minimal and self-contained: - Adds a new thread state flag (TS_SuspensionTrapped) using a previously unused bit - Sets/clears the flag in RareDisablePreemptiveGC when threads voluntarily suspend - Uses the flag instead of the removed m_gcModeOnSuspension field for sample type detection - The change is pay-for-play: only threads that were actually executing managed code participate
… Profiler (#124084) Backport of #124019 to release/10.0 ## Customer Impact - [X] Customer reported - [ ] Found internally The SampleType field in ThreadSample events from SampleProfiler was always reporting EP_SAMPLE_PROFILER_SAMPLE_TYPE_EXTERNAL even when threads were executing managed code. This causes incorrect profiling data for customers using diagnostics tools that rely on this event. Fixes #123996 ## Regression - [X] Yes - [ ] No This regressed in .NET 9 when SuspendAllThreads was ported from NativeAOT to CoreCLR (commit 00a8973). ## Testing Manual ## Risk Low The change is minimal and self-contained: - Adds a new thread state flag (TS_SuspensionTrapped) using a previously unused bit - Sets/clears the flag in RareDisablePreemptiveGC when threads voluntarily suspend - Uses the flag instead of the removed m_gcModeOnSuspension field for sample type detection - The change is pay-for-play: only threads that were actually executing managed code participate
Fix SampleType field in ThreadSample event from SampleProfiler
The SampleType field was always reporting EP_SAMPLE_PROFILER_SAMPLE_TYPE_EXTERNAL
even when threads were executing managed code. This regressed in .NET 9 when
SuspendAllThreads was ported from NativeAOT to CoreCLR (commit 00a8973).
Instead of iterating all threads during SuspendAllThreads (which scales poorly
on many-core machines), each thread now saves its own GC mode when it
voluntarily suspends:
This is pay-for-play: only threads that were actually executing managed code
participate, and only during suspension.
Fixes #123996