
fix: Fix dispose() stalling because of Dispatcher #3704

Open

mrousavy wants to merge 3 commits into Shopify:main from mrousavy:fix/fix-dispose-leak

Conversation

@mrousavy
Contributor

@mrousavy mrousavy commented Feb 11, 2026

Disclaimer

I am creating this as a PoC PR to discuss next steps. I can only assume what your implementation is meant to do - if there is a specific reason the Dispatcher absolutely has to be there, then we need to discuss how to proceed.

Journey

In my tests, my Camera pipeline stalled after rendering a few Frames with Skia.
I diagnosed this top-down - Camera frames are properly freed, NativeBuffers are released, and so on - until I hit Skia's TextureHolder, where I added logs to the constructor and destructor:

TextureHolder::TextureHolder(CVMetalTextureRef texture) : _texture(texture) {}

TextureHolder::~TextureHolder() {
  // ARC will now automatically release _texture.
  CFRelease(_texture);
}
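The logging itself was nothing special - something along these lines (a sketch; printf stands in for whatever logger is actually used, and only the CFRelease call is from the snippet above):

#include <cstdio>  // printf stands in for the project's logger

TextureHolder::TextureHolder(CVMetalTextureRef texture) : _texture(texture) {
  std::printf("TextureHolder created   %p\n", (void *)_texture);
}

TextureHolder::~TextureHolder() {
  std::printf("TextureHolder destroyed %p\n", (void *)_texture);
  CFRelease(_texture);
}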

And to my surprise, there were more allocations than deallocations:

[Two screenshots, 2026-02-11: allocation counts vs. deallocation counts for TextureHolder]

It's exactly 6 images that are not released before the pipeline stalls - this is just how AVFoundation works: it has a maximum queue of 6 buffers, and once that limit is exceeded, it stalls:

 [...]
 LOG  render
 LOG  render
 LOG  render
 WARN  Frame Dropped! Reason: out-of-buffers
 WARN  Frame Dropped! Reason: out-of-buffers
 WARN  Frame Dropped! Reason: out-of-buffers

Now I got curious - why are some TextureHolders not deallocated?

The reason for that is the JsiSkImage's destructor - which schedules the deletion on a Dispatcher:

void releaseResources() override {
  // Queue deletion on the creation thread if needed
  auto image = getObjectUnchecked();
  if (image && _dispatcher) {
    _dispatcher->run([image]() {
      // Image will be deleted when this lambda is destroyed
    });
  }
  // Clear the object
  JsiSkWrappingSkPtrHostObject<SkImage>::releaseResources();
}
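The subtle part is that the lambda captures image (an sk_sp<SkImage>) by value, so the SkImage stays alive for exactly as long as the queued job object does. A standalone sketch of that mechanism (generic names, not Skia code):

#include <cstdio>
#include <functional>
#include <memory>
#include <vector>

// Stand-in for a GPU-backed SkImage; prints when it is finally destroyed.
struct Resource {
  ~Resource() { std::printf("Resource destroyed\n"); }
};

int main() {
  // Stands in for the Dispatcher's pending-job queue.
  std::vector<std::function<void()>> queue;
  {
    auto res = std::make_shared<Resource>();
    // Capturing `res` by value bumps the ref count: the Resource now lives
    // exactly as long as the queued lambda does, even though `res` goes out
    // of scope right after this line.
    queue.push_back([res]() { /* deletion happens when this lambda dies */ });
  }
  std::printf("queue still holds the resource\n");
  queue.clear();  // only now are the lambda and the Resource destroyed
  return 0;
}

If nobody ever drains the queue, the Resource (or SkImage) is never destroyed - which is exactly what ends up happening below.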

...and the Dispatcher's queued jobs are only ever executed when you create a new JsiSkImage, as it calls processQueue() in its constructor:

JsiSkImage(std::shared_ptr<RNSkPlatformContext> context,
           const sk_sp<SkImage> image)
    : JsiSkWrappingSkPtrHostObject<SkImage>(std::move(context),
                                            std::move(image)) {
  // Get the dispatcher for the current thread
  _dispatcher = Dispatcher::getDispatcher();
  // Process any pending operations
  _dispatcher->processQueue();
}

Note

This is the first problem. Why is deletion scheduled? This goes against a few principles in OOP, and languages like Swift even warn you if you run lambdas in a destructor. It's not really a safe way to destroy/cleanup state.

Then I found out that while new JsiSkImages are indeed created afterwards, they use a different Dispatcher - because your Dispatcher is cached per C++ Thread, see the thread_local here:

// Thread-local storage definition
thread_local std::shared_ptr<Dispatcher::DispatcherData>
    Dispatcher::_threadDispatcher;

This is a problem for modern (mobile) architectures, because a Thread is not the same thing as a Queue.

Both iOS and Android use Thread Pools for their Queues - e.g. on Android a Coroutine uses a Thread Pool, and on iOS GCD (DispatchQueue, which is the default for anything AVFoundation related) uses Thread Pools.
So thread_local is not a safe option to cache things.

Note

This also bit me when I tried to cache SkSurface - that apparently also needs to be invalidated when it is used from a new Thread. I currently work around this by invalidating the cache whenever a thread_local counter I increment differs from the previous value - but that's a topic for another day.
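Roughly, that workaround looks like this (a sketch with generic placeholder types and names, not the actual code):

#include <atomic>
#include <cstddef>
#include <memory>

struct Surface { int width; int height; };       // stands in for SkSurface

static std::shared_ptr<Surface> gCachedSurface;  // the cached surface
static std::size_t gCachedSurfaceThread = 0;     // marker of the thread that created it

// Same idea as the thread-marker helper below: a cheap per-thread id.
static std::size_t currentThreadMarker() {
  static std::atomic_size_t counter{1};
  static thread_local std::size_t id{0};
  if (id == 0) {
    id = counter.fetch_add(1);
  }
  return id;
}

// Returns the cached surface, but drops the cache whenever the call arrives
// on a different pool thread than the one that created the surface.
std::shared_ptr<Surface> getSurface(int width, int height) {
  const std::size_t marker = currentThreadMarker();
  if (gCachedSurface == nullptr || marker != gCachedSurfaceThread) {
    gCachedSurface = std::make_shared<Surface>(Surface{width, height});
    gCachedSurfaceThread = marker;
  }
  return gCachedSurface;
}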

To demonstrate this, I created a thread_local counter in C++:

double HybridThreadHelper::getCurrentThreadMarker() {
  static std::atomic_size_t threadCounter{1};
  static thread_local size_t thisThreadId{0};
  if (thisThreadId == 0) {
    thisThreadId = threadCounter.fetch_add(1);
  }
  return static_cast<double>(thisThreadId);
}

When I call this in my JS Frame Processor, I get these results:

1
1
2
2
2
2
2
3
1
3
2
2
3
3

...and it grows all the way up to 12, possibly even more. Again, this is because GCD uses a Thread Pool.
Kotlin Coroutines do the same.

The stalling problem then has a logical explanation: each time the GCD pool picks a different Thread, the next JsiSkImage's constructor calls Dispatcher::processQueue() on that new Thread's (empty) queue - the previously enqueued deletion is left behind on the old Thread's Dispatcher, which might not be touched again for a few more calls. If there are 5 Thread hops, there are 4 stale deletions. Once there are 6 stale deletions, my Camera pipeline stalls.

So thread_local is not a safe way to assume the same caller context - and in your case you rely on JsiSkImage grabbing the Dispatcher for the current Thread, which at the first and second call might be the same, but at the third call it's a different one, then at the 10th call it might be yet another one, and so on.
In total, you'd end up with around 6 different Dispatchers, each holding stale deletions that nothing ever drains.

Solution

Just remove the Dispatcher. The image/picture will be deleted immediately in the destructor/dispose function. No need to schedule anything, no need to lock a mutex, no need to queue remaining deletions on creation of a new Image.
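With the Dispatcher gone, releaseResources and the constructor collapse to roughly this (a sketch of the direction, not the exact diff in this PR):

void releaseResources() override {
  // No Dispatcher and no queued lambda: dropping the sk_sp<SkImage> here
  // destroys the image immediately, on the calling thread.
  JsiSkWrappingSkPtrHostObject<SkImage>::releaseResources();
}

JsiSkImage(std::shared_ptr<RNSkPlatformContext> context,
           const sk_sp<SkImage> image)
    : JsiSkWrappingSkPtrHostObject<SkImage>(std::move(context),
                                            std::move(image)) {
  // No Dispatcher lookup and no processQueue() - there is nothing to drain.
}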

I removed it (see this PR), and it works fine - no more crashes, no more pipeline stalls.

Now, memory usage is stable, nothing stalls, no Images become stale - even at 60 FPS rendering/SkImage creation throughput:

[Screenshot, 2026-02-11: stable memory usage while rendering at 60 FPS]

Original motivation

I can see that you introduced this change in #3407 in an attempt to fix #3390.

My app has been running for minutes and Hades GC did run, but the images still get destroyed fine.

We can try to reproduce the issue again (as you mentioned you were able to reproduce), and provide a real fix if the issue actually occurs again.

@mrousavy
Contributor Author

mrousavy commented Feb 11, 2026

Update:

I understand that this issue only happens if Hades (Hermes' GC) destroys the SkImage reference - because then the destructor is called on a separate Thread, which seemed to crash for @jfhamlin. In my case, it isn't Hades GC destroying it; I am destroying it manually (via dispose()), which is probably why I don't see the crash he reported.

I quickly checked the Skia docs and I think their assumption is:

  • A raster/CPU resource can be destroyed on any Thread
  • A GPU resource should be destroyed on the same Thread it was created on

So I guess in the case of a GPU resource, we might actually need to jump to the original Thread to run the deletion, if the destructor is being called on another Thread.

If the crash can actually be reproduced again with my PR, then I suggest the following step by step plan:

  1. dispose() should always delete instantly - not schedule deletion (this would already unblock me)
  2. Scheduling to a different Thread in the destructor should only happen if we are really on another Thread, otherwise we can also just delete immediately (this prevents objects floating around until someone calls processQueue(), which might not happen for a while)
  3. Scheduling to a different Thread should actually do scheduling - we need an actual Dispatcher for this, not a queue that only gets picked up later. Then the deletion call is always executed promptly, and we don't depend on a new SkImage calling processQueue() (see the sketch after this list).
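To illustrate points 2 and 3: a real dispatcher owns its target - a dedicated worker thread in this sketch, though in practice it would be the creation thread's run loop or GCD queue - and executes posted jobs as soon as they arrive, while the same-thread case runs inline. A minimal sketch with hypothetical names, not a proposal for the actual implementation:

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// A minimal "actual" dispatcher: posted jobs are executed by a dedicated
// worker thread as soon as they arrive, instead of sitting in a queue until
// some future caller happens to drain it.
class ActiveDispatcher {
public:
  ActiveDispatcher() : _worker([this] { loop(); }) {}

  ~ActiveDispatcher() {
    {
      std::lock_guard<std::mutex> lock(_mutex);
      _running = false;
    }
    _cv.notify_one();
    _worker.join();
  }

  void run(std::function<void()> job) {
    // Point 2: if we are already on the dispatcher's thread, run the job
    // (i.e. delete the resource) inline instead of scheduling it.
    if (std::this_thread::get_id() == _worker.get_id()) {
      job();
      return;
    }
    // Point 3: otherwise actually schedule - the worker wakes up immediately.
    {
      std::lock_guard<std::mutex> lock(_mutex);
      _jobs.push_back(std::move(job));
    }
    _cv.notify_one();
  }

private:
  void loop() {
    std::unique_lock<std::mutex> lock(_mutex);
    while (_running) {
      _cv.wait(lock, [this] { return !_jobs.empty() || !_running; });
      while (!_jobs.empty()) {
        auto job = std::move(_jobs.front());
        _jobs.pop_front();
        lock.unlock();
        job(); // run outside the lock so a job can post another job
        lock.lock();
      }
    }
  }

  std::mutex _mutex;
  std::condition_variable _cv;
  std::deque<std::function<void()>> _jobs;
  bool _running = true;
  std::thread _worker; // declared last so it starts after the other members
};

With something like this, a GPU-backed SkImage whose destructor runs on a foreign Thread would have its deletion executed promptly on the creation context, instead of lingering until the next JsiSkImage happens to call processQueue().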

wdyt @wcandillon?

@mrousavy
Contributor Author

I created a PR to implement point 1 in my comment above here: #3705

@wcandillon
Contributor

@mrousavy I agree with all three points above. I am leaning toward removing the Dispatcher altogether. Explicit resource management of GPU-bound objects would be symmetric with WebGPU. I could review our internal APIs to see if this model would fit and implement it.

To your point above, if we still need a Dispatcher for some reason, it should 1) not prevent immediate explicit deletion; 2) work as a real dispatcher (right now we play tricks to drain the dispatch queue).

@wcandillon
Contributor

The PR makes sense, I just need to review/test everything. I already know of some corner cases, but I think these can be handled elegantly with careful thought.

@wcandillon
Contributor

The change in this PR is in the right direction; we just need to test internally that we have explicit GPU resource management working nicely, and make sure that we don't crash if the object gets GC'd because the user forgot to (or didn't yet) invoke dispose(). So there's just a bit more to do on this.

@mrousavy
Contributor Author

Agreed! Let me know how I can help. Both Terry and I can do some extensive real-world testing.



Development

Successfully merging this pull request may close these issues.

Crash: GPU-backed texture freed from GC thread while UI thread is flushing, when using usePictureAsTexture + <Image image={texture} />
