
NanoVDB PointsToGrid::countNodes: Use segmented radix sort for higher tile counts#2170

Open
swahtz wants to merge 3 commits into AcademySoftwareFoundation:master from swahtz:pointstogrid_segmented_sort

Conversation

Contributor

@swahtz swahtz commented Feb 17, 2026

This pull request introduces a performance optimization to the voxel-key sorting step in the PointsToGrid implementation. The main improvement is a bulk segmented sort path for cases with many tiles, replacing the serial loop of per-tile kernel launches, which significantly speeds up sorting on large datasets. Creating a grid for the Stanford dragon at a voxel size that produced 200 tiles, sorting sped up 19x; at a voxel size producing 6,000 tiles, the speedup was 73x. End-to-end, PointsToGrid improved by 17% in the 6,000-tile case. For low tile counts, the segmented radix sort performed worse than the original, so I include a fallback to the per-tile path when the tile count is low.

Performance and algorithm improvements:

  • Added a new BulkVoxelKeyFunctor struct and associated kernel launch to efficiently compute voxel keys for all points in a single pass (instead of multiple kernel launches) when the number of tiles exceeds a threshold. This enables a bulk segmented sort path for large tile counts, improving performance for large datasets.
  • Modified the sorting logic to choose between the bulk segmented sort and the serial per-tile sort based on the number of tiles (SEGMENTED_SORT_TILE_THRESHOLD): the bulk path is used for large tile counts, while the original per-tile sort is retained for small tile counts, where it remains faster.
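The dispatch described above can be sketched on the host side. SEGMENTED_SORT_TILE_THRESHOLD is named in the PR, but the value used here (16) is a placeholder for illustration, not the PR's actual constant:

```cpp
#include <cstdint>

// Placeholder value; the PR's real SEGMENTED_SORT_TILE_THRESHOLD lives in the
// NanoVDB sources and may differ.
constexpr uint32_t SEGMENTED_SORT_TILE_THRESHOLD = 16;

// Dispatch sketch: take the bulk segmented radix-sort path for high tile
// counts, fall back to the original serial per-tile sort otherwise.
inline bool useBulkSegmentedSort(uint32_t numTiles) {
    return numTiles > SEGMENTED_SORT_TILE_THRESHOLD;
}
```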

Minor fixes:

  • Fixed the setVerbose method to only set the local verbosity variable, removing the flag manipulation for clarity and correctness.

… key computation.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
@swahtz swahtz requested a review from kmuseth as a code owner February 17, 2026 03:37
@swahtz swahtz added the nanovdb label Feb 17, 2026
Comment on lines 575 to 584
// Binary search in prefix-sum offsets to find tile index for this point
uint32_t lo = 0, hi = numTiles + 1;
while (lo < hi) {
    uint32_t mid = (lo + hi) / 2;
    if (d_tile_offsets[mid] <= uint32_t(tid)) {
        lo = mid + 1;
    } else {
        hi = mid;
    }
}
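For reference, the loop above is a hand-rolled upper bound over the prefix-sum offsets; a host-side sketch (with a hypothetical offsets array) shows what it computes:

```cpp
#include <cstdint>
#include <vector>

// Host-side model of the device loop: offsets has numTiles + 1 entries
// (prefix sums of per-tile point counts). The loop finds the first index
// whose offset exceeds tid, so the owning tile is lo - 1.
uint32_t tileForPoint(const std::vector<uint32_t>& offsets, uint32_t tid) {
    uint32_t lo = 0, hi = uint32_t(offsets.size()); // == numTiles + 1
    while (lo < hi) {
        uint32_t mid = (lo + hi) / 2;
        if (offsets[mid] <= tid) lo = mid + 1;
        else                     hi = mid;
    }
    return lo - 1; // index of the last offset <= tid
}
```

With offsets {0, 3, 5, 9}, points 0-2 map to tile 0, points 3-4 to tile 1, and points 5-8 to tile 2.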
Contributor
💡 suggestion: Use thrust::lower_bound here with a thrust::seq execution policy to do the same thing more literately and robustly.

Contributor Author

Good idea, thanks. I actually think it's upper_bound, but same gist.

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>
Copilot AI left a comment
Pull request overview

This PR optimizes nanovdb::tools::cuda::PointsToGrid::countNodes by adding a bulk segmented radix-sort path for voxel-key sorting when the number of tiles is high, while keeping the existing per-tile sorting path for low tile counts to avoid overhead regressions.

Changes:

  • Added a bulk voxel-key generation kernel (BulkVoxelKeyFunctor) to compute voxel keys for all points in one launch when tile counts are high.
  • Switched sorting to cub::DeviceSegmentedRadixSort::SortPairs for high tile counts using computed per-tile segment offsets, with a threshold-based fallback to the original per-tile loop.
  • Simplified setVerbose to only update the local verbosity level.
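The semantics of the segmented sort can be modeled on the host with a per-segment std::stable_sort. This is only a CPU sketch of what cub::DeviceSegmentedRadixSort::SortPairs computes (the real call sorts all segments in a single GPU pass); the data layout and names here are hypothetical:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// CPU model of cub::DeviceSegmentedRadixSort::SortPairs: segment s spans
// [offsets[s], offsets[s+1]); keys are sorted within each segment and the
// paired values follow their keys. The real kernel handles every segment
// in one launch, which is what the bulk path exploits.
void segmentedSortPairs(std::vector<uint64_t>& keys,
                        std::vector<uint32_t>& vals,
                        const std::vector<uint32_t>& offsets) {
    for (size_t s = 0; s + 1 < offsets.size(); ++s) {
        const uint32_t b = offsets[s], e = offsets[s + 1];
        std::vector<std::pair<uint64_t, uint32_t>> seg;
        for (uint32_t i = b; i < e; ++i) seg.emplace_back(keys[i], vals[i]);
        std::stable_sort(seg.begin(), seg.end(),
                         [](const auto& a, const auto& c) { return a.first < c.first; });
        for (uint32_t i = b; i < e; ++i) {
            keys[i] = seg[i - b].first;
            vals[i] = seg[i - b].second;
        }
    }
}
```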


Contributor

@harrism harrism left a comment

Looks like a great optimization. Well done.

uint64_t(NanoLeaf<BuildT>::CoordToOffset(ijk)); // voxel offset: 8^3 = 2^9, i.e. first 9 bits
};// voxelKey lambda functor
// Find tile index for this point via upper_bound in prefix-sum offsets
const uint64_t tileID = thrust::upper_bound(thrust::seq, d_tile_offsets, d_tile_offsets + numTiles + 1, uint32_t(tid)) - d_tile_offsets - 1;
Contributor
👏 praise: Nice!
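The committed one-liner can be mirrored on the host: std::upper_bound stands in for thrust::upper_bound under the sequential (thrust::seq) policy. The offsets array here is hypothetical, and tileIDFor is an illustrative name, not the PR's code:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Host analogue of the committed device line: the tile owning point `tid`
// is the index of the last prefix-sum offset <= tid, i.e. upper_bound
// minus one.
uint64_t tileIDFor(const std::vector<uint32_t>& offsets, uint32_t tid) {
    return uint64_t(std::upper_bound(offsets.begin(), offsets.end(), tid)
                    - offsets.begin()) - 1;
}
```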

@matthewdcong
Contributor

Is the difference in absolute time for small numbers of tiles large enough to warrant keeping the older path?

@swahtz
Contributor Author

swahtz commented Feb 18, 2026

Is the difference in absolute time for small numbers of tiles large enough to warrant keeping the older path?

I was unsure whether we should keep the older path too. For small numbers of tiles, the overhead of the segmented sort makes it more expensive: for 100k points and 4 tiles, the new segmented sort takes 0.371 ms versus 0.278 ms for the old serial per-tile sort, a 34% regression (running on my Ada RTX 6000). For small tile counts, do you think it's reasonable to accept that regression in exchange for lower code complexity?

Signed-off-by: Jonathan Swartz <jonathan@jswartz.info>