Skip to content

feat(subinterpreter): add opt-in TLS-cached thread state mode#6073

Open
ymwang78 wants to merge 4 commits into
pybind:masterfrom
ymwang78:claude/angry-euclid-4dc4a0
Open

feat(subinterpreter): add opt-in TLS-cached thread state mode#6073
ymwang78 wants to merge 4 commits into
pybind:masterfrom
ymwang78:claude/angry-euclid-4dc4a0

Conversation

@ymwang78
Copy link
Copy Markdown

@ymwang78 ymwang78 commented May 19, 2026

subinterpreter_scoped_activate previously created and destroyed a fresh PyThreadState on every activation when the calling OS thread was not already running the target interpreter. Workloads that repeatedly re-enter the same sub-interpreter from the same thread therefore churn thread states and lose per-thread interpreter state between activations (see #6040).

Add an opt-in subinterpreter_thread_state::cached policy: on first use a PyThreadState is created and stored in OS-thread-local storage keyed by the target interpreter; subsequent activations on that thread only swap it in/out and never destroy it. The default stays transient, so existing behavior is unchanged.

Since pybind11 does not control thread lifetime, cleanup is explicit: subinterpreter::release_cached_thread_state() releases the calling thread's cached state for one interpreter, and the static release_all_cached_thread_states() releases all of the calling thread's cached states as an end-of-thread hook. The TLS map's destructor only frees its own nodes and never touches the Python C API, so an unreleased state leaks rather than crashing at thread exit.

Includes test coverage and embedding docs.

Description

Suggested changelog entry:

  • Placeholder.

📚 Documentation preview 📚: https://pybind11--6073.org.readthedocs.build/

subinterpreter_scoped_activate previously created and destroyed a fresh
PyThreadState on every activation when the calling OS thread was not
already running the target interpreter. Workloads that repeatedly
re-enter the same sub-interpreter from the same thread therefore churn
thread states and lose per-thread interpreter state between activations
(see pybind#6040).

Add an opt-in subinterpreter_thread_state::cached policy: on first use a
PyThreadState is created and stored in OS-thread-local storage keyed by
the target interpreter; subsequent activations on that thread only swap
it in/out and never destroy it. The default stays transient, so existing
behavior is unchanged.

Since pybind11 does not control thread lifetime, cleanup is explicit:
subinterpreter::release_cached_thread_state() releases the calling
thread's cached state for one interpreter, and the static
release_all_cached_thread_states() releases all of the calling thread's
cached states as an end-of-thread hook. The TLS map's destructor only
frees its own nodes and never touches the Python C API, so an
unreleased state leaks rather than crashing at thread exit.

Includes test coverage and embedding docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 19, 2026 05:22
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an opt-in “cached” subinterpreter activation mode that reuses a per-OS-thread PyThreadState for the same interpreter, plus APIs/docs/tests for explicitly releasing cached thread states to avoid leaks.

Changes:

  • Introduce subinterpreter_thread_state policy (transient vs cached) and thread-local cache plumbing.
  • Add subinterpreter::release_cached_thread_state() and release_all_cached_thread_states() for explicit cleanup.
  • Add documentation and a Catch2 test validating reuse and cleanup behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
tests/test_with_catch/test_subinterpreter.cpp Adds coverage for cached thread-state reuse across activations and threads, and for cleanup APIs.
include/pybind11/subinterpreter.h Implements cached activation policy + new release APIs backed by a thread-local unordered_map.
docs/advanced/embedding.rst Documents cached activation behavior, intended use cases, and required cleanup responsibilities.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread include/pybind11/subinterpreter.h Outdated
Comment thread include/pybind11/subinterpreter.h Outdated
Comment thread include/pybind11/subinterpreter.h Outdated
Comment on lines +386 to +404
// Make the cached state current (acquiring this interpreter's GIL) so it can be cleared and
// destroyed on the OS thread that created it, then restore whatever was active before.
PyThreadState *prev = PyThreadState_Swap(cached);
PyThreadState_Clear(cached);
PyThreadState_DeleteCurrent();
PyThreadState_Swap(prev);
}

inline void subinterpreter::release_all_cached_thread_states() {
auto &cache = detail::subinterpreter_thread_state_cache();
for (auto const &entry : cache) {
PyThreadState *cached = entry.second;
// prev is the state active before this swap; it is restored after each deletion, so it is
// never one of the cached states being destroyed here.
PyThreadState *prev = PyThreadState_Swap(cached);
PyThreadState_Clear(cached);
PyThreadState_DeleteCurrent();
PyThreadState_Swap(prev);
}
Copy link
Copy Markdown
Collaborator

@b-pass b-pass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good but a couple fixes would make it easier to use (harder to make a mistake).

Comment thread include/pybind11/subinterpreter.h Outdated
Comment on lines +261 to +269
/// Destroy the PyThreadState (if any) that subinterpreter_thread_state::cached created for
/// THIS interpreter on the CURRENT OS thread, and drop it from that thread's cache.
///
/// Call this on the same OS thread that activated the interpreter, while this subinterpreter
/// is still alive, and while no subinterpreter_scoped_activate scope for it is active on this
/// thread. It is a no-op if this thread has no cached state for this interpreter. The caller
/// need not hold any GIL: the cached state is briefly swapped in (acquiring this interpreter's
/// GIL) to be cleared and deleted, then whatever was active before is restored.
void release_cached_thread_state() const;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little counter-intuitive since normally one would expect the subinterpreter to need to be active before messing with it.

It seems to me that it should be harmless to call this while the interpreter is active ... even if the cached state is currently active when this is called, the release_cached_thread_state function should handle that case.

Comment thread include/pybind11/subinterpreter.h Outdated
Comment thread include/pybind11/subinterpreter.h Outdated
Comment on lines +277 to +278
/// undefined behavior). Must be called on the OS thread that owns the cache, with no
/// subinterpreter_scoped_activate scope using a cached state active on this thread.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, this should be made safe to call while the cached thread state is active for one of the interpreters.

@@ -279,13 +358,51 @@ inline subinterpreter_scoped_activate::~subinterpreter_scoped_activate() {
}
#endif
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this check is the only thing preventing someone from calling release_cached_thread_state while the cached state is active.

Comment thread include/pybind11/subinterpreter.h Outdated
PyThreadState *prev = PyThreadState_Swap(cached);
PyThreadState_Clear(cached);
PyThreadState_DeleteCurrent();
PyThreadState_Swap(prev);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be safe to call while the cached state was active if you just make sure you aren't going to call swap again on the state that was just deleted. (When prev is the same as cached.) In that case, just don't Swap at all... effectively leaving no interpreter active.

Comment thread include/pybind11/subinterpreter.h Outdated
PyThreadState *prev = PyThreadState_Swap(cached);
PyThreadState_Clear(cached);
PyThreadState_DeleteCurrent();
PyThreadState_Swap(prev);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above, checking to make sure prev != cached before calling Swap here would make this safe to call while a subinterpreter is active (but would deactivate it).

Comment thread include/pybind11/subinterpreter.h Outdated
Comment on lines +271 to +273
/// Destroy every cached PyThreadState that was created on the CURRENT OS thread (for any
/// interpreter) and clear this thread's cache. Intended as an end-of-thread cleanup hook for
/// embedder worker threads.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this wouldn't work for other loaded pybind11 modules' subinterpreters. The cache is effectively module local. I think it would require storing the cache on the pybind11 internals of the main interpreter (which would be an ABI increment) to ensure that all subinterpreters created by other modules also were in the same cache.

This function would be nice and convenient, but it is complicated to provide a guarantee for "all subinterpreters across all modules".

Comment thread include/pybind11/subinterpreter.h Outdated
/// this is called (deleting a PyThreadState whose interpreter was already finalized is
/// undefined behavior). Must be called on the OS thread that owns the cache, with no
/// subinterpreter_scoped_activate scope using a cached state active on this thread.
static void release_all_cached_thread_states();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should also note that this will activate/acquire the GIL for each subinterpreter in sequence.

…_thread_state RAII

Address review feedback on the original "cached" mode by switching to an
explicit two-RAII design suggested by @b-pass:

  "Create a class ... to RAII-manage the PyThreadState but start its
   lifetime in an already released state. You could create another
   class (or modify scoped_activate) to scoped/RAII activate the
   inactive threadstate."

Removed
  - enum subinterpreter_thread_state { transient, cached } and the
    defaulted ctor parameter on subinterpreter_scoped_activate.
  - detail::subinterpreter_thread_state_cache thread_local map.
  - subinterpreter::release_cached_thread_state() and
    subinterpreter::release_all_cached_thread_states().

This eliminates: the hidden per-thread map, the "release_all" footgun
across pybind11 modules (the cache was module-local), and the implicit
"must not be active when called" contract on the release functions.

Added
  - Public class subinterpreter_thread_state that owns one PyThreadState
    for a given subinterpreter on its constructing OS thread, created in
    a released state (not current, no GIL). Non-copyable, non-movable
    (PyThreadState is bound to its creating OS thread).
  - subinterpreter_scoped_activate(subinterpreter_thread_state &)
    overload: swaps the owned PyThreadState in on entry, swaps it out
    on exit, does not touch its lifetime.

Behavior
  - The existing subinterpreter_scoped_activate(subinterpreter const &)
    overload is unchanged (still transient: New on entry, Delete on
    exit). All previously-working code keeps working.
  - With subinterpreter_thread_state, one OS thread can alternate
    between multiple subinterpreters and each PyThreadState is preserved
    across activations -- the use case that gil_scoped_release/acquire
    + a long-lived scoped_activate cannot solve alone (the per-thread
    internals.tstate slot holds only one inactive tstate).
  - The dtor of subinterpreter_thread_state guards against the
    "destroyed-while-active" contract violation: if Swap reveals the
    cached tstate was current, do not Swap back to a now-deleted
    pointer (the safe-when-active fix b-pass requested for the old
    release_* functions, applied at the natural location instead).

Lifetime contract is enforced by ordinary C++ scope: typical placement
is `thread_local`. No new release/cleanup APIs are required.

Tests cover (a) tstate identity preserved across activations on a
thread, (b) transient and reusing modes do not share state, (c)
different OS threads get distinct PyThreadStates, and (d) the
multi-subinterpreter alternation case.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ymwang78 ymwang78 requested review from b-pass and Copilot May 20, 2026 04:34
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment on lines +350 to +375
inline subinterpreter_scoped_activate::subinterpreter_scoped_activate(
subinterpreter_thread_state &ts) {
if (ts.tstate_ == nullptr) {
pybind11_fail("subinterpreter_scoped_activate: empty subinterpreter_thread_state");
}

if (detail::get_interpreter_state_unchecked() == ts.istate_) {
// We are already on this interpreter -- e.g. nested activation, or a different
// PyThreadState for the same interpreter is already current on this thread. Match the
// fast path of the (subinterpreter const&) overload: just ensure the GIL is held. The
// `ts` argument's PyThreadState is intentionally NOT swapped to here; the already-current
// tstate keeps being used until the outer scope exits.
simple_gil_ = true;
gil_state_ = PyGILState_Ensure();
return;
}

tstate_ = ts.tstate_;
borrowed_ = true;

// make the interpreter active and acquire the GIL
old_tstate_ = PyThreadState_Swap(tstate_);

// save this in internals for scoped_gil calls (see also: PR #5870)
detail::get_internals().tstate = tstate_;
}
Comment on lines +404 to +433
inline subinterpreter_thread_state::subinterpreter_thread_state(subinterpreter const &si) {
if (!si.istate_) {
pybind11_fail("subinterpreter_thread_state: null subinterpreter");
}
istate_ = si.istate_;
// PyThreadState_New does not require holding any GIL and does not make the new state current.
tstate_ = PyThreadState_New(istate_);
if (tstate_ == nullptr) {
pybind11_fail("subinterpreter_thread_state: PyThreadState_New returned null");
}
}

inline subinterpreter_thread_state::~subinterpreter_thread_state() {
if (tstate_ == nullptr) {
return;
}
// The PyThreadState must be made current to be cleared and deleted on the owning OS thread.
// Swap it in (which acquires the subinterpreter's GIL), clear+delete, then restore whatever
// was active before.
PyThreadState *prev = PyThreadState_Swap(tstate_);
PyThreadState_Clear(tstate_);
PyThreadState_DeleteCurrent();
// If `prev` is tstate_ itself, the user destroyed this object while it was active via a
// subinterpreter_scoped_activate -- a contract violation, but be defensive: do NOT swap back
// to a now-deleted pointer. Leaving the thread with no current interpreter is consistent
// with the cached state having just been destroyed.
if (prev != nullptr && prev != tstate_) {
PyThreadState_Swap(prev);
}
}
Comment on lines +285 to +311
class subinterpreter_thread_state {
public:
/// Create a PyThreadState for `si` on the calling OS thread. The new state is left in a
/// released state (not current, no GIL acquired).
explicit subinterpreter_thread_state(subinterpreter const &si);

/// Destroy the owned PyThreadState. Must run on the same OS thread that constructed this
/// object, while the owning subinterpreter is still alive, and while no
/// subinterpreter_scoped_activate referring to this object is alive.
~subinterpreter_thread_state();

subinterpreter_thread_state(subinterpreter_thread_state const &) = delete;
subinterpreter_thread_state(subinterpreter_thread_state &&) = delete;
subinterpreter_thread_state &operator=(subinterpreter_thread_state const &) = delete;
subinterpreter_thread_state &operator=(subinterpreter_thread_state &&) = delete;

/// The interpreter this thread state belongs to.
PyInterpreterState *interpreter_state() const { return istate_; }

/// The owned PyThreadState pointer; valid for the lifetime of this object.
PyThreadState *raw_thread_state() const { return tstate_; }

private:
friend class subinterpreter_scoped_activate;
PyThreadState *tstate_ = nullptr;
PyInterpreterState *istate_ = nullptr;
};
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants