fix: Fix dispose() stalling because of Dispatcher#3704
fix: Fix dispose() stalling because of Dispatcher#3704mrousavy wants to merge 3 commits intoShopify:mainfrom
dispose() stalling because of Dispatcher#3704Conversation
|
Update: I understand that this issue only happens if Hades (Hermes' GC) destroys the I quickly checked the Skia docs and I think their assumption is;
So I guess in the case of a GPU resource, we might actually need to jump to the original Thread to run the deletion, if the destructor is being called on another Thread. If the crash can actually be reproduced again with my PR, then I suggest the following step by step plan:
wdyt @wcandillon? |
|
I created a PR to implement point 1 in my comment above here: #3705 |
|
@mrousavy I agree with all three points above. I am leaning toward removing the Dispatcher altogether. Explicit resource management of GPU-bound objects would be symmetric with WebGPU. I could review our internal APIs to see if this model would fit and implement it. To your point above, if we still need a Dispatcher for some reason, it should 1) not prevent immediate explicit deletion; 2) work as a real dispatcher (right now we play tricks to drain the dispatch queue). |
|
The PR makes sense, I just need to review/test everything. I know already of some corner cases but I think these can be handled elegantly with careful thoughts. |
|
The change in this PR is in the right direction, we just need to test that internally with have explicit GPU resource management working nicely as well as making sure that we don't crash if the object is being GC'd if the user forgot/didn't invoke dipose yet. So just a bit more one to do on this. |
|
agreed! let me know how I can help. Both Terry and I can do some extensive real world testing |
Disclaimer
I am creating this as a PoC PR to discuss further steps - I can only assume what your implementation does, but if you have any specific reason for this to absolutely be there, then we need to discuss next steps.
Journey
In my tests, my Camera pipeline stalled after rendering a few Frames with Skia.
I diagnosed this from top-down - Camera frames are properly freed,
NativeBuffers are released, and so on - until I hitTextureHolderfrom Skia - I added logs in the constructor and destructor:react-native-skia/packages/skia/apple/SkiaCVPixelBufferUtils.mm
Lines 47 to 51 in 0081cf0
And to my surprise, there were more allocations than deallocations:
It's exactly 6 images not being released before the pipeline stalls - this is just how AVFoundation works, it has a maximum queue of 6 buffers, and once that is exceeded, it stalls:
Now I got curious - why are some
TextureHolders not deallocated?The reason for that is the
JsiSkImage's destructor - which schedules the deletion on aDispatcher:react-native-skia/packages/skia/cpp/api/JsiSkImage.h
Lines 297 to 307 in 0081cf0
..and the
Dispatcher's job is only ever called when you create a newJsiSkImage, as it callsprocessQueue()in it's constructor:react-native-skia/packages/skia/cpp/api/JsiSkImage.h
Lines 286 to 294 in 0081cf0
Note
This is the first problem. Why is deletion scheduled? This goes against a few principles in OOP, and languages like Swift even warn you if you run lambdas in a destructor. It's not really a safe way to destroy/cleanup state.
Then I found out that while there are indeed
JsiSkImages being created next, they use a differentDispatcher- because yourDispatcheris cached per C++ Thread - seethread_localhere:react-native-skia/packages/skia/cpp/api/JsiSkDispatcher.cpp
Lines 5 to 7 in 0081cf0
This is a problem for modern (mobile) architectures, as a Thread =/= Queue.
Both iOS and Android use Thread Pools for their Queues - e.g. on Android a Coroutine uses a Thread Pool, and on iOS GCD (
DispatchQueue, which is the default for anything AVFoundation related) uses Thread Pools.So
thread_localis not a safe option to cache things.Note
This also bit me when I tried to cache
SkSurface- that apparently also needs to be invalidated if called on a new Thread - I currently work around this issue by invalidating the cache if athread_localcounter I increment is not the same as the previous one - but that's a topic for another day.To demonstrate this, I created a
thread_localcounter in C++:When I call this in my JS Frame Processor, I get these results:
..and it grows all the way 'til 12, or possibly even more. Again, this is because of GCD using a Thread Pool.
Kotlin (Coroutine) also does this.
The stalling problem is then logically explainable: Each time the GCD pool uses a different Thread, the previous
JsiSkImage's constructor callsDispatcher::processQueue()on an empty queue - the previously enqueued deletion is then left on the old Thread - which might not be touched again for a few more calls. If there are 5 Thread hops, there's 4 stale deletions. Once it's 6 stale deletions, my Camera pipeline stalls.So
thread_localis not a safe way to assume same caller context - and in your case you rely onJsiSkImagegrabbing theDispatcherfor the current Thread, which at first call and second call might be the same, but on third call it's a different one, then on the 10th call it might be another one, and so on.In total, you'd end up with around 6 different
Dispatchers, andSolution
Just remove the
Dispatcher. Theimage/picturewill be deleted immediately in the destructor/dispose function. No need to schedule something, no need to lock a mutex, no need to queue remaining deletions on creation of a new Image.I removed it (see this PR), and it works fine - no more crashes, no more pipeline stalls.
Now, memory usage is stable, nothing stalls, no Images become stale - even at 60 FPS rendering/
SkImagecreation throughput:Motivation before
I can see that you introduced this change in #3407 in attempt to fix #3390.
My app has been running for minutes and Hades GC did run, but the images still get destroyed fine.
We can try to reproduce the issue again (as you mentioned you were able to reproduce), and provide a real fix if the issue actually occurs again.