feat(subinterpreter): add opt-in TLS-cached thread state mode#6073
feat(subinterpreter): add opt-in TLS-cached thread state mode#6073ymwang78 wants to merge 4 commits into
Conversation
subinterpreter_scoped_activate previously created and destroyed a fresh PyThreadState on every activation when the calling OS thread was not already running the target interpreter. Workloads that repeatedly re-enter the same sub-interpreter from the same thread therefore churn thread states and lose per-thread interpreter state between activations (see pybind#6040). Add an opt-in subinterpreter_thread_state::cached policy: on first use a PyThreadState is created and stored in OS-thread-local storage keyed by the target interpreter; subsequent activations on that thread only swap it in/out and never destroy it. The default stays transient, so existing behavior is unchanged. Since pybind11 does not control thread lifetime, cleanup is explicit: subinterpreter::release_cached_thread_state() releases the calling thread's cached state for one interpreter, and the static release_all_cached_thread_states() releases all of the calling thread's cached states as an end-of-thread hook. The TLS map's destructor only frees its own nodes and never touches the Python C API, so an unreleased state leaks rather than crashing at thread exit. Includes test coverage and embedding docs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an opt-in “cached” subinterpreter activation mode that reuses a per-OS-thread PyThreadState for the same interpreter, plus APIs/docs/tests for explicitly releasing cached thread states to avoid leaks.
Changes:
- Introduce
subinterpreter_thread_statepolicy (transientvscached) and thread-local cache plumbing. - Add
subinterpreter::release_cached_thread_state()andrelease_all_cached_thread_states()for explicit cleanup. - Add documentation and a Catch2 test validating reuse and cleanup behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_with_catch/test_subinterpreter.cpp | Adds coverage for cached thread-state reuse across activations and threads, and for cleanup APIs. |
| include/pybind11/subinterpreter.h | Implements cached activation policy + new release APIs backed by a thread-local unordered_map. |
| docs/advanced/embedding.rst | Documents cached activation behavior, intended use cases, and required cleanup responsibilities. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Make the cached state current (acquiring this interpreter's GIL) so it can be cleared and | ||
| // destroyed on the OS thread that created it, then restore whatever was active before. | ||
| PyThreadState *prev = PyThreadState_Swap(cached); | ||
| PyThreadState_Clear(cached); | ||
| PyThreadState_DeleteCurrent(); | ||
| PyThreadState_Swap(prev); | ||
| } | ||
|
|
||
| inline void subinterpreter::release_all_cached_thread_states() { | ||
| auto &cache = detail::subinterpreter_thread_state_cache(); | ||
| for (auto const &entry : cache) { | ||
| PyThreadState *cached = entry.second; | ||
| // prev is the state active before this swap; it is restored after each deletion, so it is | ||
| // never one of the cached states being destroyed here. | ||
| PyThreadState *prev = PyThreadState_Swap(cached); | ||
| PyThreadState_Clear(cached); | ||
| PyThreadState_DeleteCurrent(); | ||
| PyThreadState_Swap(prev); | ||
| } |
b-pass
left a comment
There was a problem hiding this comment.
Looks pretty good but a couple fixes would make it easier to use (harder to make a mistake).
| /// Destroy the PyThreadState (if any) that subinterpreter_thread_state::cached created for | ||
| /// THIS interpreter on the CURRENT OS thread, and drop it from that thread's cache. | ||
| /// | ||
| /// Call this on the same OS thread that activated the interpreter, while this subinterpreter | ||
| /// is still alive, and while no subinterpreter_scoped_activate scope for it is active on this | ||
| /// thread. It is a no-op if this thread has no cached state for this interpreter. The caller | ||
| /// need not hold any GIL: the cached state is briefly swapped in (acquiring this interpreter's | ||
| /// GIL) to be cleared and deleted, then whatever was active before is restored. | ||
| void release_cached_thread_state() const; |
There was a problem hiding this comment.
This is a little counter-intuitive since normally one would expect the subinterpreter to need to be active before messing with it.
It seems to me that it should be harmless to call this while the interpreter is active ... even if the cached state is currently active when this is called, the release_cached_thread_state function should handle that case.
| /// undefined behavior). Must be called on the OS thread that owns the cache, with no | ||
| /// subinterpreter_scoped_activate scope using a cached state active on this thread. |
There was a problem hiding this comment.
Similarly, this should be made safe to call while the cached thread state is active for one of the interpreters.
| @@ -279,13 +358,51 @@ inline subinterpreter_scoped_activate::~subinterpreter_scoped_activate() { | |||
| } | |||
| #endif | |||
There was a problem hiding this comment.
Seems like this check is the only thing preventing someone from calling release_cached_thread_state while the cached state is active.
| PyThreadState *prev = PyThreadState_Swap(cached); | ||
| PyThreadState_Clear(cached); | ||
| PyThreadState_DeleteCurrent(); | ||
| PyThreadState_Swap(prev); |
There was a problem hiding this comment.
I think this would be safe to call while the cached state was active if you just make sure you aren't going to call swap again on the state that was just deleted. (When prev is the same as cached.) In that case, just don't Swap at all... effectively leaving no interpreter active.
| PyThreadState *prev = PyThreadState_Swap(cached); | ||
| PyThreadState_Clear(cached); | ||
| PyThreadState_DeleteCurrent(); | ||
| PyThreadState_Swap(prev); |
There was a problem hiding this comment.
Similar to above, checking to make sure prev != cached before calling Swap here would make this safe to call while a subinterpreter is active (but would deactivate it).
| /// Destroy every cached PyThreadState that was created on the CURRENT OS thread (for any | ||
| /// interpreter) and clear this thread's cache. Intended as an end-of-thread cleanup hook for | ||
| /// embedder worker threads. |
There was a problem hiding this comment.
I believe this wouldn't work for other loaded pybind11 modules' subinterpreters. The cache is effectively module local. I think it would require storing the cache on the pybind11 internals of the main interpreter (which would be an ABI increment) to ensure that all subinterpreters created by other modules also were in the same cache.
This function would be nice and convenient, but it is complicated to provide a guarantee for "all subinterpreters across all modules".
| /// this is called (deleting a PyThreadState whose interpreter was already finalized is | ||
| /// undefined behavior). Must be called on the OS thread that owns the cache, with no | ||
| /// subinterpreter_scoped_activate scope using a cached state active on this thread. | ||
| static void release_all_cached_thread_states(); |
There was a problem hiding this comment.
I think this should also note that this will activate/acquire the GIL for each subinterpreter in sequence.
…_thread_state RAII Address review feedback on the original "cached" mode by switching to an explicit two-RAII design suggested by @b-pass: "Create a class ... to RAII-manage the PyThreadState but start its lifetime in an already released state. You could create another class (or modify scoped_activate) to scoped/RAII activate the inactive threadstate." Removed - enum subinterpreter_thread_state { transient, cached } and the defaulted ctor parameter on subinterpreter_scoped_activate. - detail::subinterpreter_thread_state_cache thread_local map. - subinterpreter::release_cached_thread_state() and subinterpreter::release_all_cached_thread_states(). This eliminates: the hidden per-thread map, the "release_all" footgun across pybind11 modules (the cache was module-local), and the implicit "must not be active when called" contract on the release functions. Added - Public class subinterpreter_thread_state that owns one PyThreadState for a given subinterpreter on its constructing OS thread, created in a released state (not current, no GIL). Non-copyable, non-movable (PyThreadState is bound to its creating OS thread). - subinterpreter_scoped_activate(subinterpreter_thread_state &) overload: swaps the owned PyThreadState in on entry, swaps it out on exit, does not touch its lifetime. Behavior - The existing subinterpreter_scoped_activate(subinterpreter const &) overload is unchanged (still transient: New on entry, Delete on exit). All previously-working code keeps working. - With subinterpreter_thread_state, one OS thread can alternate between multiple subinterpreters and each PyThreadState is preserved across activations -- the use case that gil_scoped_release/acquire + a long-lived scoped_activate cannot solve alone (the per-thread internals.tstate slot holds only one inactive tstate). - The dtor of subinterpreter_thread_state guards against the "destroyed-while-active" contract violation: if Swap reveals the cached tstate was current, do not Swap back to a now-deleted pointer (the safe-when-active fix b-pass requested for the old release_* functions, applied at the natural location instead). Lifetime contract is enforced by ordinary C++ scope: typical placement is `thread_local`. No new release/cleanup APIs are required. Tests cover (a) tstate identity preserved across activations on a thread, (b) transient and reusing modes do not share state, (c) different OS threads get distinct PyThreadStates, and (d) the multi-subinterpreter alternation case. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
| inline subinterpreter_scoped_activate::subinterpreter_scoped_activate( | ||
| subinterpreter_thread_state &ts) { | ||
| if (ts.tstate_ == nullptr) { | ||
| pybind11_fail("subinterpreter_scoped_activate: empty subinterpreter_thread_state"); | ||
| } | ||
|
|
||
| if (detail::get_interpreter_state_unchecked() == ts.istate_) { | ||
| // We are already on this interpreter -- e.g. nested activation, or a different | ||
| // PyThreadState for the same interpreter is already current on this thread. Match the | ||
| // fast path of the (subinterpreter const&) overload: just ensure the GIL is held. The | ||
| // `ts` argument's PyThreadState is intentionally NOT swapped to here; the already-current | ||
| // tstate keeps being used until the outer scope exits. | ||
| simple_gil_ = true; | ||
| gil_state_ = PyGILState_Ensure(); | ||
| return; | ||
| } | ||
|
|
||
| tstate_ = ts.tstate_; | ||
| borrowed_ = true; | ||
|
|
||
| // make the interpreter active and acquire the GIL | ||
| old_tstate_ = PyThreadState_Swap(tstate_); | ||
|
|
||
| // save this in internals for scoped_gil calls (see also: PR #5870) | ||
| detail::get_internals().tstate = tstate_; | ||
| } |
| inline subinterpreter_thread_state::subinterpreter_thread_state(subinterpreter const &si) { | ||
| if (!si.istate_) { | ||
| pybind11_fail("subinterpreter_thread_state: null subinterpreter"); | ||
| } | ||
| istate_ = si.istate_; | ||
| // PyThreadState_New does not require holding any GIL and does not make the new state current. | ||
| tstate_ = PyThreadState_New(istate_); | ||
| if (tstate_ == nullptr) { | ||
| pybind11_fail("subinterpreter_thread_state: PyThreadState_New returned null"); | ||
| } | ||
| } | ||
|
|
||
| inline subinterpreter_thread_state::~subinterpreter_thread_state() { | ||
| if (tstate_ == nullptr) { | ||
| return; | ||
| } | ||
| // The PyThreadState must be made current to be cleared and deleted on the owning OS thread. | ||
| // Swap it in (which acquires the subinterpreter's GIL), clear+delete, then restore whatever | ||
| // was active before. | ||
| PyThreadState *prev = PyThreadState_Swap(tstate_); | ||
| PyThreadState_Clear(tstate_); | ||
| PyThreadState_DeleteCurrent(); | ||
| // If `prev` is tstate_ itself, the user destroyed this object while it was active via a | ||
| // subinterpreter_scoped_activate -- a contract violation, but be defensive: do NOT swap back | ||
| // to a now-deleted pointer. Leaving the thread with no current interpreter is consistent | ||
| // with the cached state having just been destroyed. | ||
| if (prev != nullptr && prev != tstate_) { | ||
| PyThreadState_Swap(prev); | ||
| } | ||
| } |
| class subinterpreter_thread_state { | ||
| public: | ||
| /// Create a PyThreadState for `si` on the calling OS thread. The new state is left in a | ||
| /// released state (not current, no GIL acquired). | ||
| explicit subinterpreter_thread_state(subinterpreter const &si); | ||
|
|
||
| /// Destroy the owned PyThreadState. Must run on the same OS thread that constructed this | ||
| /// object, while the owning subinterpreter is still alive, and while no | ||
| /// subinterpreter_scoped_activate referring to this object is alive. | ||
| ~subinterpreter_thread_state(); | ||
|
|
||
| subinterpreter_thread_state(subinterpreter_thread_state const &) = delete; | ||
| subinterpreter_thread_state(subinterpreter_thread_state &&) = delete; | ||
| subinterpreter_thread_state &operator=(subinterpreter_thread_state const &) = delete; | ||
| subinterpreter_thread_state &operator=(subinterpreter_thread_state &&) = delete; | ||
|
|
||
| /// The interpreter this thread state belongs to. | ||
| PyInterpreterState *interpreter_state() const { return istate_; } | ||
|
|
||
| /// The owned PyThreadState pointer; valid for the lifetime of this object. | ||
| PyThreadState *raw_thread_state() const { return tstate_; } | ||
|
|
||
| private: | ||
| friend class subinterpreter_scoped_activate; | ||
| PyThreadState *tstate_ = nullptr; | ||
| PyInterpreterState *istate_ = nullptr; | ||
| }; |
subinterpreter_scoped_activate previously created and destroyed a fresh PyThreadState on every activation when the calling OS thread was not already running the target interpreter. Workloads that repeatedly re-enter the same sub-interpreter from the same thread therefore churn thread states and lose per-thread interpreter state between activations (see #6040).
Add an opt-in subinterpreter_thread_state::cached policy: on first use a PyThreadState is created and stored in OS-thread-local storage keyed by the target interpreter; subsequent activations on that thread only swap it in/out and never destroy it. The default stays transient, so existing behavior is unchanged.
Since pybind11 does not control thread lifetime, cleanup is explicit: subinterpreter::release_cached_thread_state() releases the calling thread's cached state for one interpreter, and the static release_all_cached_thread_states() releases all of the calling thread's cached states as an end-of-thread hook. The TLS map's destructor only frees its own nodes and never touches the Python C API, so an unreleased state leaks rather than crashing at thread exit.
Includes test coverage and embedding docs.
Description
Suggested changelog entry:
📚 Documentation preview 📚: https://pybind11--6073.org.readthedocs.build/