Skip to content

Fix SampleType field in ThreadSample event from SampleProfiler#124019

Merged
steveisok merged 6 commits intodotnet:mainfrom
steveisok:fix-sample-profiler-sampletype-123996
Feb 6, 2026
Merged

Fix SampleType field in ThreadSample event from SampleProfiler#124019
steveisok merged 6 commits intodotnet:mainfrom
steveisok:fix-sample-profiler-sampletype-123996

Conversation

@steveisok
Copy link
Member

@steveisok steveisok commented Feb 4, 2026

Fix SampleType field in ThreadSample event from SampleProfiler

The SampleType field was always reporting EP_SAMPLE_PROFILER_SAMPLE_TYPE_EXTERNAL
even when threads were executing managed code. This regressed in .NET 9 when
SuspendAllThreads was ported from NativeAOT to CoreCLR (commit 00a8973).

Instead of iterating all threads during SuspendAllThreads (which scales poorly
on many-core machines), each thread now saves its own GC mode when it
voluntarily suspends:

  • In RareDisablePreemptiveGC when hitting the IsTrappingThreadsForSuspension check

This is pay-for-play: only threads that were actually executing managed code
participate, and only during suspension.

Fixes #123996

@steveisok steveisok requested review from VSadov and Copilot February 4, 2026 23:19
@github-actions github-actions bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Feb 4, 2026
@steveisok steveisok requested a review from a team February 4, 2026 23:20
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a regression introduced in .NET 9 where the SampleType field in ThreadSample events from the Microsoft-DotNETCore-SampleProfiler was always reporting EP_SAMPLE_PROFILER_SAMPLE_TYPE_EXTERNAL, even when threads were executing managed code. The regression occurred when SuspendAllThreads was ported from NativeAOT to CoreCLR.

Changes:

  • Restores the call to SaveGCModeOnSuspension() for all threads after EE suspension completes in SuspendAllThreads()
  • This saves each thread's GC mode (cooperative vs preemptive) so the sample profiler can correctly determine if threads were executing managed code

@steveisok
Copy link
Member Author

Looping over all the threads may not be the 'right' fix here. Is there a better place to mark them?

@noahfalk
Copy link
Member

noahfalk commented Feb 5, 2026

You need to save the state before suspension begins (or at least before the m_fPreemptiveGCDisabled field is changed during the suspension process).

@jkotas jkotas added area-Diagnostics-coreclr and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Feb 5, 2026
@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag
See info in area-owners.md if you want to be subscribed.

@steveisok steveisok force-pushed the fix-sample-profiler-sampletype-123996 branch from 7b71b30 to 49df4a5 Compare February 5, 2026 01:51
@steveisok
Copy link
Member Author

You need to save the state before suspension begins (or at least before the m_fPreemptiveGCDisabled field is changed during the suspension process).

I moved it before where I think that happens. Let me know if that's what you had in mind.

The SampleType field was always reporting EP_SAMPLE_PROFILER_SAMPLE_TYPE_EXTERNAL
even when threads were executing managed code. This regressed in .NET 9 when
SuspendAllThreads was ported from NativeAOT to CoreCLR (commit 00a8973).

Instead of iterating all threads during SuspendAllThreads (which scales poorly
on many-core machines), each thread now saves its own GC mode when it
voluntarily suspends:
- In RareDisablePreemptiveGC when hitting the IsTrappingThreadsForSuspension check
- In RedirectedHandledJITCase when a hijacked thread self-suspends

This is pay-for-play: only threads that were actually executing managed code
participate, and only during suspension.

Fixes dotnet#123996
@steveisok steveisok force-pushed the fix-sample-profiler-sampletype-123996 branch from 1422735 to 77f6bbf Compare February 5, 2026 04:23
Copilot AI review requested due to automatic review settings February 5, 2026 04:23
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

GCX_PREEMP_NO_DTOR_END() re-enters cooperative mode via DisablePreemptiveGC,
which hits RareDisablePreemptiveGC where SaveGCModeOnSuspension is already called.
…sion

Per feedback, replace the m_gcModeOnSuspension field with a ThreadState bit:
- Add TS_SuspensionTrapped (0x00000002) to indicate thread is trapped for suspension
- Set the bit when entering the IsTrappingThreadsForSuspension scope
- Reset the bit when exiting that scope
- Sample profiler checks this bit to determine if thread was in managed code
- Delete m_gcModeOnSuspension field and its accessor methods

This is cleaner because it directly indicates the thread state rather than
redundantly storing the GC mode.
Copilot AI review requested due to automatic review settings February 5, 2026 04:56
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@lateralusX
Copy link
Member

lateralusX commented Feb 5, 2026

Just curious, so if we set the state when moving back from preemptive to cooperate suspend mode to indicate that a thread is running managed code, would that mean we would potentially report more code as being in managed depending on the time it takes to iterate over all "suspended" threads and capture their managed callstacks, checking this new flag and external code get enough time to complete and transition back into rutnime and getting trapped by RareDisablePreemptiveGC?

Looking through the CoreCLR suspension code this is my understanding:

For threads already in preemptive mode, leave them running, they will eventually suspend if they are trying to re-enter runtime as part of RareDisablePreemptiveGC. For threads in cooperate mode, use different techniques to let them toggle GC mode, cooperate->preemptive->cooperate trapped by RareDisablePreemptiveGC since we are trapping threads returning to cooperate mode.

When we do CPU samples in EventPipe we first suspend all threads, then iterate over the thread list and capture managed stack traces for each thread, part of that process is also classifying if a thread was running managed or external code when suspended. But since threads already being in preemptive code will continue running, that means they might end up trying to re-enter cooperate suspend mode and getting blocked by RareDisablePreemptiveGC before sample profiler have got around to that thread in walk_managed_stack_for_threads, reporting the thread as being in managed when it actually was running external code at point of suspension.

I guess this is less of a concern since the sample profiler is just sampling and maybe previous solution didn't see difference between a thread trapped when running in cooperate mode vs trying to re-enter runtime from preemptive mode. I guess this is also a definition on when a thread sample actually occurs, in CoreCLR it means the thread sample is not the state when all threads suspended, but when the actual thread is processed by sample profilers walk_managed_stack_for_threads loop.

In Mono we handle this a little different, when runtime uses the full cooperate suspend model, it will do the suspension of threads moving from external code back to managed code as part of stub frame that stack walker identifies as external code, so even if the thread is trying to transition back into managed code, it will be trapped in the stub frame and stack-walker will classify that as still being in external code.

@jkotas
Copy link
Member

jkotas commented Feb 5, 2026

Yes, the change in the PR will tend to report more threads as running in managed code - if they reached the managed code boundary while the sample is processed.

The previous solution saved the current thread state immediately, before the thread got actually suspended. It was possible for it to report thread as running in managed code and the stacktrace pointing to unmanaged code (PInvoke transition).

If we think that the previous behavior is better, it is fine with me to keep it as long as it is pay-for-play. The loop that iterates over all threads to save their current thread state immediately should only run for event pipe samples. (We may still want to keep the current change that gets rid of the full long and just uses a bit in a ThreadState.)

@jkotas
Copy link
Member

jkotas commented Feb 5, 2026

The loop that iterates over all threads to save their current thread state immediately should only run for event pipe samples

We would add a new SUSPEND_REASON for event pipe sample and run this extra loop only for this suspend reason.

@steveisok
Copy link
Member Author

I do like the current change where each thread is noting its state. If reporting more is an edge case or in the end doesn't really matter, I'd say keep the current change as is.

@jkotas @lateralusX you guys are the experts, so whatever you prefer...

@jkotas
Copy link
Member

jkotas commented Feb 5, 2026

@valco1994 Do you care about when this state is captured during the sampling event exactly?

@valco1994
Copy link
Contributor

@valco1994 Do you care about when this state is captured during the sampling event exactly?

An exact moment of capturing the state is not critical for my scenarios. So, both approaches would be fine for me.

@jkotas
Copy link
Member

jkotas commented Feb 5, 2026

you guys are the experts, so whatever you prefer...

I am fine with the fix in this PR. The event pipe sampler is not fair. It has a by-design bias for sampling at GC safe points. I do not see problem with the bias that comes with this fix.

Copilot AI review requested due to automatic review settings February 6, 2026 00:57
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

@steveisok steveisok enabled auto-merge (squash) February 6, 2026 04:48
@steveisok
Copy link
Member Author

/ba-g Unclear error in tvOS leg. Unrelated to PR.

@steveisok steveisok merged commit b226ba1 into dotnet:main Feb 6, 2026
104 of 106 checks passed
@steveisok
Copy link
Member Author

/backport to release/10.0

@github-actions
Copy link
Contributor

github-actions bot commented Feb 6, 2026

Started backporting to release/10.0 (link to workflow run)

@steveisok steveisok deleted the fix-sample-profiler-sampletype-123996 branch February 6, 2026 12:07
@github-actions
Copy link
Contributor

github-actions bot commented Feb 6, 2026

@steveisok backporting to release/10.0 failed, the patch most likely resulted in conflicts. Please backport manually!

git am output
$ git am --3way --empty=keep --ignore-whitespace --keep-non-patch changes.patch

Applying: Fix SampleType field in ThreadSample event from SampleProfiler
Applying: Remove redundant SaveGCModeOnSuspension call in RedirectedHandledJITCase
Applying: Use TS_SuspensionTrapped thread state bit instead of m_gcModeOnSuspension
Applying: Fix OFFSETOF__Thread__m_pInterpThreadContext after m_gcModeOnSuspension removal
Using index info to reconstruct a base tree...
M	src/coreclr/vm/amd64/asmconstants.h
M	src/coreclr/vm/arm/asmconstants.h
M	src/coreclr/vm/arm64/asmconstants.h
M	src/coreclr/vm/riscv64/asmconstants.h
Falling back to patching base and 3-way merge...
Auto-merging src/coreclr/vm/amd64/asmconstants.h
CONFLICT (content): Merge conflict in src/coreclr/vm/amd64/asmconstants.h
Auto-merging src/coreclr/vm/arm/asmconstants.h
CONFLICT (content): Merge conflict in src/coreclr/vm/arm/asmconstants.h
Auto-merging src/coreclr/vm/arm64/asmconstants.h
CONFLICT (content): Merge conflict in src/coreclr/vm/arm64/asmconstants.h
Auto-merging src/coreclr/vm/riscv64/asmconstants.h
CONFLICT (content): Merge conflict in src/coreclr/vm/riscv64/asmconstants.h
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0004 Fix OFFSETOF__Thread__m_pInterpThreadContext after m_gcModeOnSuspension removal
Error: The process '/usr/bin/git' failed with exit code 128

Link to workflow output

steveisok added a commit to steveisok/runtime that referenced this pull request Feb 6, 2026
…Profiler

Backport of dotnet#124019 to release/10.0

## Customer Impact

- [X] Customer reported
- [ ] Found internally

The SampleType field in ThreadSample events from SampleProfiler was always
reporting EP_SAMPLE_PROFILER_SAMPLE_TYPE_EXTERNAL even when threads were
executing managed code. This causes incorrect profiling data for customers
using diagnostics tools that rely on this event.

Fixes dotnet#123996

## Regression

- [X] Yes
- [ ] No

This regressed in .NET 9 when SuspendAllThreads was ported from NativeAOT
to CoreCLR (commit 00a8973).

## Testing

The fix was verified by building CoreCLR with Debug/Checked configuration
which validates the asmconstants offsets via compile-time assertions.

## Risk

Low

The change is minimal and self-contained:
- Adds a new thread state flag (TS_SuspensionTrapped) using a previously unused bit
- Sets/clears the flag in RareDisablePreemptiveGC when threads voluntarily suspend
- Uses the flag instead of the removed m_gcModeOnSuspension field for sample type detection
- The change is pay-for-play: only threads that were actually executing managed code participate
steveisok added a commit that referenced this pull request Feb 6, 2026
… Profiler (#124084)

Backport of #124019 to release/10.0

## Customer Impact

- [X] Customer reported
- [ ] Found internally

The SampleType field in ThreadSample events from SampleProfiler was
always reporting EP_SAMPLE_PROFILER_SAMPLE_TYPE_EXTERNAL even when
threads were executing managed code. This causes incorrect profiling
data for customers using diagnostics tools that rely on this event.

Fixes #123996

## Regression

- [X] Yes
- [ ] No

This regressed in .NET 9 when SuspendAllThreads was ported from
NativeAOT to CoreCLR (commit 00a8973).

## Testing
Manual

## Risk
Low

The change is minimal and self-contained:
- Adds a new thread state flag (TS_SuspensionTrapped) using a previously
unused bit
- Sets/clears the flag in RareDisablePreemptiveGC when threads
voluntarily suspend
- Uses the flag instead of the removed m_gcModeOnSuspension field for
sample type detection
- The change is pay-for-play: only threads that were actually executing
managed code participate
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SampleType field in ThreadSample event from Microsoft-DotNETCore-SampleProfiler is reported incorrectly since .NET 9

5 participants