NanoVDB PointsToGrid::countNodes: Use segmented radix sort for higher tile counts#2170
swahtz wants to merge 3 commits into AcademySoftwareFoundation:master

Conversation
… key computation. Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
```cpp
// Binary search in prefix-sum offsets to find tile index for this point
uint32_t lo = 0, hi = numTiles + 1;
while (lo < hi) {
    uint32_t mid = (lo + hi) / 2;
    if (d_tile_offsets[mid] <= uint32_t(tid)) {
        lo = mid + 1;
    } else {
        hi = mid;
    }
}
```
💡 suggestion: Use thrust::lower_bound here with a thrust::seq execution policy to do the same thing more literately and robustly.
Good idea, thanks. I actually think it's `upper_bound`, but the same gist.
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Pull request overview
This PR optimizes nanovdb::tools::cuda::PointsToGrid::countNodes by adding a bulk segmented radix-sort path for voxel-key sorting when the number of tiles is high, while keeping the existing per-tile sorting path for low tile counts to avoid overhead regressions.
Changes:
- Added a bulk voxel-key generation kernel (`BulkVoxelKeyFunctor`) to compute voxel keys for all points in one launch when tile counts are high.
- Switched sorting to `cub::DeviceSegmentedRadixSort::SortPairs` for high tile counts, using computed per-tile segment offsets, with a threshold-based fallback to the original per-tile loop.
- Simplified `setVerbose` to only update the local verbosity level.
harrism left a comment
Looks like a great optimization. Well done.
```cpp
       uint64_t(NanoLeaf<BuildT>::CoordToOffset(ijk)); // voxel offset: 8^3 = 2^9, i.e. first 9 bits
}; // voxelKey lambda functor
// Find tile index for this point via upper_bound in prefix-sum offsets
const uint64_t tileID = thrust::upper_bound(thrust::seq, d_tile_offsets, d_tile_offsets + numTiles + 1, uint32_t(tid)) - d_tile_offsets - 1;
```
Is the difference in absolute time for small numbers of tiles large enough to warrant keeping the older path?
I was unsure whether we should keep the older path too. For small numbers of tiles, the overhead of the segmented sort makes it more expensive. For 100k points and 4 tiles, the new segmented sort takes 0.371ms versus 0.278ms for the old serial per-tile sort, a 34% regression (running on my Ada RTX 6000). For small tile counts, do you think it's reasonable to trade that off for code complexity?
Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
This pull request introduces a performance optimization to the voxel key sorting process in the `PointsToGrid` implementation. The main improvement is the addition of a bulk segmented sort path for cases with many tiles, replacing a serial loop of per-tile kernel launches, which significantly speeds up sorting on large datasets. Creating a grid for the Stanford dragon at a voxel size that produced 200 tiles, the speedup to sorting was 19x; at a voxel size producing 6,000 tiles, the speedup was 73x. The end-to-end `PointsToGrid` improvement was 17% for the 6,000-tile case. For low tile counts, the segmented radix sort performed worse than the original, so a fallback to the original path is included when tile counts are low.

Performance and algorithm improvements:
- Added a `BulkVoxelKeyFunctor` struct and associated kernel launch to efficiently compute voxel keys for all points in a single pass (instead of multiple kernel launches) when the number of tiles exceeds a threshold (`SEGMENTED_SORT_TILE_THRESHOLD`). This enables the bulk segmented sort path for large tile counts, improving performance for large datasets.
- Bulk sort is used for large tile counts, while the original per-tile sort is retained for small tile counts, where it was faster.

Minor fixes:
- Updated the `setVerbose` method to only set the local verbosity variable, removing the flag manipulation for clarity and correctness.