Here is the latest Firefox profile, captured while running this stack:
JS WebGPU -> wgpu -> gfx-backend-vulkan -> inplace_it
https://share.firefox.dev/2RmndRr
What I find peculiar is that inplace_or_alloc_from_iter takes only half the time of indirect.

What else is indirect doing? Can we reduce this overhead?