2 changes: 1 addition & 1 deletion vk_video_decoder/demos/vk-video-dec/Main.cpp
@@ -115,7 +115,7 @@ int main(int argc, const char **argv)
vkDevCtxt.CreateVulkanDevice(numDecodeQueues, // numDecodeQueues
0, // num encode queues
videoCodecOperation, // videoCodecs
- false, // createTransferQueue
+ ((vkDevCtxt.GetVideoDecodeQueueFlag() & VK_QUEUE_TRANSFER_BIT) == 0), // createTransferQueue
true, // createGraphicsQueue
true, // createDisplayQueue
requestVideoComputeQueueMask != 0 // createComputeQueue
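The new `createTransferQueue` argument asks for a separate transfer queue only when the decode queue family does not itself advertise `VK_QUEUE_TRANSFER_BIT`. A minimal sketch of that bit test follows. The constants are local stand-ins that mirror the Vulkan `VkQueueFlagBits` values so the snippet builds without the Vulkan headers, and `NeedsSeparateTransferQueue` is a hypothetical helper name, not something from this repository.

```cpp
#include <cstdint>

// Local stand-ins mirroring VkQueueFlagBits (assumption: used here only so
// the sketch compiles without <vulkan/vulkan.h>).
constexpr uint32_t kQueueTransferBit    = 0x00000004;  // VK_QUEUE_TRANSFER_BIT
constexpr uint32_t kQueueVideoDecodeBit = 0x00000020;  // VK_QUEUE_VIDEO_DECODE_BIT_KHR

// Hypothetical helper: returns true when the decode queue family cannot
// perform transfer operations on its own, so a separate transfer queue
// would have to be created.
constexpr bool NeedsSeparateTransferQueue(uint32_t decodeQueueFlags) {
    return (decodeQueueFlags & kQueueTransferBit) == 0;
}
```

With this test, a decode-only family yields `true` and a family that reports both decode and transfer yields `false`, which matches the expression passed to `CreateVulkanDevice` in the hunk above.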
43 changes: 42 additions & 1 deletion vk_video_decoder/libs/VkVideoDecoder/VkVideoDecoder.cpp
@@ -1132,7 +1132,7 @@ int VkVideoDecoder::DecodePictureWithParameters(VkParserPerFrameDecodeParameters
VkVideoEndCodingInfoKHR decodeEndInfo = { VK_STRUCTURE_TYPE_VIDEO_END_CODING_INFO_KHR };
m_vkDevCtx->CmdEndVideoCodingKHR(frameDataSlot.commandBuffer, &decodeEndInfo);

- if (m_useTransferOperation == VK_TRUE) {
+ if (m_useTransferOperation == VK_TRUE && m_transferCommandPool == VK_NULL_HANDLE) {
Contributor

Have you tried using the filter?

--enablePostProcessFilter 0

This should work on any implementation.

Contributor Author

@dabrain34, May 6, 2025

I confirm this is working. Thanks for your recommendation.
It was not obvious from the command line help.
So I changed the documentation to explain this param and its default value.

Which is more efficient: the transfer on the decode queue or the compute filter? I enabled the compute filter YCBCRCOPY (1) by default in order to support the Mesa driver.

Contributor

Why would we prefer the compute based copy rather than the transfer based which uses vkCmdCopyImage?

If an implementation has dedicated transfer HW it would be more efficient to use the transfer queue.

Contributor Author

So vkCmdCopyImage with a dedicated transfer queue will be more efficient than the compute-based copy?

Contributor

Yes, since it might use a DMA-based copy, which leaves the compute units free to do other work or even powered off to conserve power.

Contributor Author

So I should keep the transfer queue usage, seeing that it is not used by NVIDIA.

Contributor Author

I brought back the use of this transfer queue, as it does no harm if decode and transfer are on the same queue.

Contributor Author

@dabrain34, May 7, 2025

I compared the transfer queue against the post-process filter, and the results are telling:

With Intel mesa driver:

  • --noPresent --postProcessFilterType 1 : Frame 301, FPS: 1197.6
  • --noPresent --postProcessFilterType 0 : Frame 301, FPS: 2461.32

With nvidia:

  • --noPresent --postProcessFilterType 1 : Frame 301, FPS: 1534.82
  • --noPresent --postProcessFilterType 0 : Frame 301, FPS: 3040.68
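Expressed as ratios, `--postProcessFilterType 0` runs at roughly twice the frame rate of `--postProcessFilterType 1` on both drivers. The short sketch below only redoes that arithmetic from the figures quoted above; `FpsRatio` and the constant names are hypothetical, and it says nothing about why the two settings differ.

```cpp
// Hypothetical helper for this note: ratio between two measured frame rates.
constexpr double FpsRatio(double fasterFps, double slowerFps) {
    return fasterFps / slowerFps;
}

// FPS figures copied from the comment above (frame 301, --noPresent).
constexpr double kIntelMesaType1 = 1197.6;
constexpr double kIntelMesaType0 = 2461.32;
constexpr double kNvidiaType1    = 1534.82;
constexpr double kNvidiaType0    = 3040.68;

// Type 0 relative to type 1 on each driver.
constexpr double kIntelMesaRatio = FpsRatio(kIntelMesaType0, kIntelMesaType1);
constexpr double kNvidiaRatio    = FpsRatio(kNvidiaType0, kNvidiaType1);
```

Both ratios land close to 2, so the gap between the two settings is about the same size on the Intel Mesa and NVIDIA runs.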

Contributor

> Why would we prefer the compute based copy rather than the transfer based which uses vkCmdCopyImage?
> If an implementation has dedicated transfer HW it would be more efficient to use the transfer queue.

I'm not saying that compute would be more efficient here on all HW. But if we are going to allocate and free resources on each frame and stall the pipeline with fences, then the compute filter would be much more efficient.

The filter interface is a generic class. Instead of using the compute implementation, one can derive a transfer-based filter from it. This class provides pre-allocated command buffers, fences, and semaphores.

So, when you allocate the filter object, just create an instance of the transfer class rather than the compute one. The rest of the code would work the same. The filter uses a semaphore to synchronize with the video queues without stalling the pipeline.
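The suggestion above can be pictured with a small sketch: one abstract filter interface, two implementations, and a single allocation site that decides which one to create. Every name here (`FrameFilter`, `ComputeCopyFilter`, `TransferCopyFilter`, `CreateFilter`, `CopyBackend`) is hypothetical and chosen for illustration. This is not the repository's `VulkanFilter` API, and the methods return a description string where a real filter would record commands into its pre-allocated command buffer.

```cpp
#include <memory>
#include <string>

// Hypothetical generic interface; a real one would own the pre-allocated
// command buffers, fences, and semaphores mentioned in the comment above.
class FrameFilter {
public:
    virtual ~FrameFilter() = default;
    // Stand-in for recording the copy; returns a description for this sketch.
    virtual std::string RecordCopy() = 0;
};

// Compute-based copy (the existing kind of implementation).
class ComputeCopyFilter : public FrameFilter {
public:
    std::string RecordCopy() override { return "compute dispatch"; }
};

// Transfer-based copy, which in Vulkan would use vkCmdCopyImage.
class TransferCopyFilter : public FrameFilter {
public:
    std::string RecordCopy() override { return "vkCmdCopyImage"; }
};

enum class CopyBackend { Compute, Transfer };

// The only place that knows which implementation is in use.
std::unique_ptr<FrameFilter> CreateFilter(CopyBackend backend) {
    if (backend == CopyBackend::Transfer) {
        return std::make_unique<TransferCopyFilter>();
    }
    return std::make_unique<ComputeCopyFilter>();
}
```

Callers hold a `std::unique_ptr<FrameFilter>` and call `RecordCopy()` without caring which backend was chosen, which is the "rest of the code would work the same" property described above.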


assert((pOutputPictureResource != nullptr) && (pOutputPictureResourceInfo != nullptr));

@@ -1270,6 +1270,47 @@ int VkVideoDecoder::DecodePictureWithParameters(VkParserPerFrameDecodeParameters
}
}

if (m_useTransferOperation == VK_TRUE && m_transferCommandPool != VK_NULL_HANDLE) {
VkFence transferCompleteFence = VkFence();
VkFenceCreateInfo fence_info = { VK_STRUCTURE_TYPE_FENCE_CREATE_INFO };
assert(m_vkDevCtx->CreateFence(*m_vkDevCtx, &fence_info, nullptr, &transferCompleteFence) == VK_SUCCESS);
const VkPipelineStageFlags waitDstStageMask = VK_PIPELINE_STAGE_ALL_COMMANDS_BIT;
assert((pOutputPictureResource != nullptr) && (pOutputPictureResourceInfo != nullptr));
VkCommandBufferBeginInfo beginInfo = { VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
beginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
beginInfo.pInheritanceInfo = nullptr;

m_vkDevCtx->BeginCommandBuffer(m_transferCommandBuffers[0], &beginInfo);

CopyOptimalToLinearImage(m_transferCommandBuffers[0],
*pOutputPictureResource,
*pOutputPictureResourceInfo,
*pFrameFilterOutResource,
*pFrameFilterOutResourceInfo,
&frameSynchronizationInfo);
m_vkDevCtx->EndCommandBuffer(m_transferCommandBuffers[0]);

const VkSubmitInfo transferSubmitInfo{
VK_STRUCTURE_TYPE_SUBMIT_INFO, // VkStructureType sType;
nullptr, // const void* pNext;
1u, // uint32_t waitSemaphoreCount;
&videoDecodeCompleteSemaphore, // const VkSemaphore* pWaitSemaphores;
&waitDstStageMask, // const VkPipelineStageFlags* pWaitDstStageMask;
1u, // uint32_t commandBufferCount;
&m_transferCommandBuffers[0], // const VkCommandBuffer* pCommandBuffers;
1u, // uint32_t signalSemaphoreCount;
&videoDecodeCompleteSemaphore, // const VkSemaphore* pSignalSemaphores;
};
assert(VK_NOT_READY == m_vkDevCtx->GetFenceStatus(*m_vkDevCtx, transferCompleteFence));
VkResult result = m_vkDevCtx->MultiThreadedQueueSubmit(VulkanDeviceContext::TRANSFER, m_vkDevCtx->GetTransferQueueFamilyIdx(),
1, &transferSubmitInfo, transferCompleteFence);
result = m_vkDevCtx->WaitForFences(*m_vkDevCtx, 1, &transferCompleteFence, true, gFenceTimeout);
assert(result == VK_SUCCESS);
result = m_vkDevCtx->GetFenceStatus(*m_vkDevCtx, transferCompleteFence);
assert(result == VK_SUCCESS);
}

if (m_dumpDecodeData && (m_hwLoadBalancingTimelineSemaphore != VK_NULL_HANDLE)) { // For TL semaphore debug
uint64_t currSemValue = 0;
VkResult semResult = m_vkDevCtx->GetSemaphoreCounterValue(*m_vkDevCtx, m_hwLoadBalancingTimelineSemaphore, &currSemValue);
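One detail in the transfer block of this hunk is worth a note: the `CreateFence` call sits inside `assert(...)`. When `NDEBUG` is defined, `assert` expands to nothing and its argument is not evaluated, so in such a build the fence would never be created. A minimal sketch of the usual alternative follows: make the call unconditionally and keep only the check inside `assert`. The types and functions here (`Status`, `FenceHandle`, `CreateFenceStub`, `CreateFenceChecked`) are hypothetical stand-ins so the snippet builds without Vulkan.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-ins for VkResult / VkFence.
enum class Status { Success, Failure };
using FenceHandle = int;
constexpr FenceHandle kNullFence = 0;

// Pretend driver call: writes a non-null handle and reports success.
inline Status CreateFenceStub(FenceHandle* outFence) {
    *outFence = 1;
    return Status::Success;
}

// The creation call runs in every build; only the check is debug-only.
inline bool CreateFenceChecked(FenceHandle* outFence) {
    const Status status = CreateFenceStub(outFence);
    assert(status == Status::Success);
    if (status != Status::Success) {
        std::fprintf(stderr, "ERROR: fence creation failed\n");
        return false;
    }
    return true;
}
```

The same shape applies to any call with side effects: store the result first, then assert on the stored value, so release and debug builds execute the same calls.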
29 changes: 28 additions & 1 deletion vk_video_decoder/libs/VkVideoDecoder/VkVideoDecoder.h
@@ -237,8 +237,8 @@ class VkVideoDecoder : public IVulkanVideoDecoderHandler {
, m_numBitstreamBuffersToPreallocate(numBitstreamBuffersToPreallocate)
, m_maxStreamBufferSize()
, m_filterType(filterType)
, m_transferCommandPool()
{

assert(m_vkDevCtx->GetVideoDecodeQueueFamilyIdx() != -1);
assert(m_vkDevCtx->GetVideoDecodeNumQueues() > 0);

@@ -279,6 +279,30 @@ class VkVideoDecoder : public IVulkanVideoDecoderHandler {
<< m_vkDevCtx->GetVideoDecodeNumQueues() << " queues" << std::endl;
}

if (m_vkDevCtx->GetTransferQueue() != VkQueue()) {
VkCommandPoolCreateInfo cmdPoolInfo = {};
cmdPoolInfo.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
cmdPoolInfo.flags = VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT;
cmdPoolInfo.queueFamilyIndex = m_vkDevCtx->GetTransferQueueFamilyIdx();
VkResult result = m_vkDevCtx->CreateCommandPool(*m_vkDevCtx, &cmdPoolInfo, nullptr, &m_transferCommandPool);
assert(result == VK_SUCCESS);
if (result != VK_SUCCESS) {
fprintf(stderr, "\nERROR: CreateCommandPool() result: 0x%x\n", result);
}

VkCommandBufferAllocateInfo cmdInfo = {};
cmdInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
cmdInfo.commandBufferCount = 1;
cmdInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
cmdInfo.commandPool = m_transferCommandPool;

m_transferCommandBuffers.resize(1);
result = m_vkDevCtx->AllocateCommandBuffers(*m_vkDevCtx, &cmdInfo, &m_transferCommandBuffers[0]);
if (result != VK_SUCCESS) {
fprintf(stderr, "\nERROR: AllocateCommandBuffers() result: 0x%x\n", result);
}
}

}

virtual ~VkVideoDecoder();
@@ -339,4 +363,7 @@ class VkVideoDecoder : public IVulkanVideoDecoderHandler {
VkDeviceSize m_maxStreamBufferSize;
VulkanFilterYuvCompute::FilterType m_filterType;
VkSharedBaseObj<VulkanFilter> m_yuvFilter;
VkCommandPool m_transferCommandPool;
std::vector<VkCommandBuffer> m_transferCommandBuffers;

};
4 changes: 2 additions & 2 deletions vk_video_decoder/test/vulkan-video-dec/Main.cpp
@@ -90,7 +90,7 @@ int main(int argc, const char** argv)
requestVideoComputeQueueMask = VK_QUEUE_COMPUTE_BIT;
}

- VkVideoCodecOperationFlagsKHR videoCodec = decoderConfig.forceParserType != VK_VIDEO_CODEC_OPERATION_NONE_KHR ?
+ VkVideoCodecOperationFlagsKHR videoCodec = decoderConfig.forceParserType != VK_VIDEO_CODEC_OPERATION_NONE_KHR ?
decoderConfig.forceParserType :
videoStreamDemuxer->GetVideoCodec();

@@ -126,7 +126,7 @@ int main(int argc, const char** argv)
vkDevCtxt.CreateVulkanDevice(numDecodeQueues,
0, // num encode queues
videoCodec,
- false, // createTransferQueue
+ ((vkDevCtxt.GetVideoDecodeQueueFlag() & VK_QUEUE_TRANSFER_BIT) == 0), // createTransferQueue
true, // createGraphicsQueue
true, // createDisplayQueue
requestVideoComputeQueueMask != 0 // createComputeQueue