@dzzz2001 dzzz2001 commented Dec 26, 2025

Summary

This PR adds CUDA GPU acceleration for the snap_psibeta_half function in RT-TDDFT calculations, achieving up to a ~46x speedup over the v3.9.0.20 baseline.

Changes

New Files

  • source/source_lcao/module_rt/kernels/cuda/snap_psibeta_gpu.cu - Main CUDA implementation
  • source/source_lcao/module_rt/kernels/cuda/snap_psibeta_kernel.cu - CUDA kernel implementations
  • source/source_lcao/module_rt/kernels/cuda/snap_psibeta_kernel.cuh - Kernel headers and device functions
  • source/source_lcao/module_rt/kernels/snap_psibeta_gpu.h - Public interface header

Modified Files

  • source/source_lcao/module_operator_lcao/td_nonlocal_lcao.cpp - Integration with GPU path
  • source/source_lcao/module_rt/CMakeLists.txt - Build configuration for CUDA files

Key Optimizations

  1. Atom-batch GPU kernel: Processes atoms in batches to maximize GPU utilization
  2. Constant memory grids: Uses CUDA constant memory for Gauss-Legendre integration grids
  3. Warp shuffle reduction: Efficient parallel reduction using warp primitives
  4. Optimized spherical harmonics: GPU-optimized implementation with atan2 and sincos
  5. Template-based kernel dispatch: Compile-time optimization paths

Performance Results

Test Environment

  • Platform: Polaris (北极星)
  • CPU: Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz
  • GPU: NVIDIA A800 80GB PCIe (2x GPUs used)
  • Test command: OMP_NUM_THREADS=12 mpirun -n 2 abacus

Benchmark Results

| Version | snap_psibeta Time | contribute_HR Time | snap Ratio |
|---|---|---|---|
| v3.9.0.20 (baseline) | 306.46s | 404.38s | 48.32% |
| v3.9.0.21 (CPU optimized) | 158.09s | 236.51s | 33.57% |
| This PR (GPU) | 6.59s | 19.90s | 2.58% |

Speedup

  • vs v3.9.0.20: ~46x faster for snap_psibeta
  • vs v3.9.0.21: ~24x faster for snap_psibeta
  • Overall contribute_HR: ~20x faster than v3.9.0.20, ~12x faster than v3.9.0.21

…ory grids

- Implement GPU-accelerated snap_psibeta_neighbor_batch_kernel
- Use constant memory for Lebedev and Gauss-Legendre integration grids
- Add multi-GPU support via set_device_by_rank
- Initialize/finalize GPU resources in each calculate_HR call
- Remove static global variables for cleaner resource management
- CPU fallback when GPU processing fails
…ructure

- Add ModuleBase::timer for snap_psibeta_atom_batch_gpu function
- Remove the GPU fallback-to-CPU design (returning true/false from a void function)
- Replace fallback returns with error messages and proper early exits
- Ensure timer is properly called on all exit paths
- Simplify code structure for better readability
…uring

- Move ylm0 computation outside radial loop (saves 140x redundant calculations)
- Hoist A_dot_leb and dR calculations outside inner loop
- Add #pragma unroll hints for radial and m0 loops

Achieves 23.3% speedup on snap_psibeta_gpu (19.27s -> 14.78s).
Numerical correctness verified: energy matches baseline (-756.053 Ry).
- Replace conditional atan branches with single atan2 call
- Use sincos() instead of separate sin/cos calls

Achieves 8.4% additional speedup (14.78s -> 13.56s)
Combined with loop restructuring: 29.6% total from baseline
Numerical correctness verified: -756.053 Ry
- Convert compute_ylm_gpu to templated version with L as template param
- Use linear array for Legendre polynomials (reduces from 25 to 15 doubles)
- Add DISPATCH_YLM macro for runtime-to-template dispatch
- Add MAX_M0_SIZE constant for result array sizing
- Replace C++17 constexpr if with regular if for C++14 compatibility
- Enable compiler loop unrolling with #pragma unroll

Performance: snap_psibeta_gpu improved from 13.27s to 9.83s (1.35x speedup)
- Replace shared memory tree reduction with warp shuffle reduction
- Use warp_reduce_sum for intra-warp reduction (faster shuffle ops)
- Reduce shared memory from BLOCK_SIZE (2KB) to NUM_WARPS (64 bytes)
- Cross-warp reduction done by first warp reading from shared memory

Reduces register usage from 94 to 88, shared memory from 2KB to 64 bytes.
…umentation

- Add comprehensive file headers explaining purpose and key features
- Organize code into logical sections with clear separators
- Add doxygen-style documentation for all functions, structs, and constants
- Fix inaccurate comments (BLOCK_SIZE requirement, direction vector normalization)
- Remove unused variables (dR, distance01)
- Remove finalize_gpu_resources() as it's not needed for constant memory
- Improve inline comments explaining algorithms and optimizations
…ction

- Add use_gpu runtime flag that checks both __CUDA macro and PARAM.inp.device
- GPU path is now only enabled when __CUDA is defined AND device == "gpu"
- Makes the conditional logic clearer with if/else instead of nested #ifdef
- Move CUDA_CHECK macro to shared header snap_psibeta_kernel.cuh
- Remove duplicate CUDA_CHECK definition from snap_psibeta_gpu.cu
- Remove CUDA_CHECK_KERNEL macro and replace all usages with CUDA_CHECK
- Reduces code duplication and improves consistency
- Replace local PI, FOUR_PI, SQRT2 definitions with ModuleBase:: versions
- Add include for source_base/constants.h
- Replace fprintf(stderr, ...) with ModuleBase::WARNING_QUIT
- Update CUDA_CHECK macro to use WARNING_QUIT instead of fprintf
- Add includes for tool_quit.h and string header
- Consistent error handling with ABACUS codebase conventions
@mohanchen mohanchen left a comment

LGTM, great job.

@mohanchen mohanchen added the Refactor, Performance, GPU & DCU & HPC, and Useful Information labels Dec 28, 2025
@mohanchen mohanchen changed the title perf(TDDFT): Add CUDA acceleration for snap_psibeta_half function perf(TDDFT): Add CUDA acceleration for snap_psibeta_half function (Useful information about largely improves the snap_psibeta_half function) Dec 28, 2025
@mohanchen mohanchen merged commit b188f21 into deepmodeling:develop Dec 28, 2025
14 checks passed