@dzzz2001 dzzz2001 commented Dec 26, 2025

Summary

This PR adds CUDA GPU acceleration for the snap_psibeta_half function in RT-TDDFT calculations, achieving up to a ~46x speedup over the v3.9.0.20 baseline.

Changes

New Files

  • source/source_lcao/module_rt/kernels/cuda/snap_psibeta_gpu.cu - Main CUDA implementation
  • source/source_lcao/module_rt/kernels/cuda/snap_psibeta_kernel.cu - CUDA kernel implementations
  • source/source_lcao/module_rt/kernels/cuda/snap_psibeta_kernel.cuh - Kernel headers and device functions
  • source/source_lcao/module_rt/kernels/snap_psibeta_gpu.h - Public interface header

Modified Files

  • source/source_lcao/module_operator_lcao/td_nonlocal_lcao.cpp - Integration with GPU path
  • source/source_lcao/module_rt/CMakeLists.txt - Build configuration for CUDA files

Key Optimizations

  1. Atom-batch GPU kernel: Processes atoms in batches to maximize GPU utilization
  2. Constant memory grids: Uses CUDA constant memory for Gauss-Legendre integration grids
  3. Warp shuffle reduction: Efficient parallel reduction using warp primitives
  4. Optimized spherical harmonics: GPU-optimized implementation with atan2 and sincos
  5. Template-based kernel dispatch: Compile-time optimization paths

Performance Results

Test Environment

  • Platform: Polaris (北极星)
  • CPU: Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz
  • GPU: NVIDIA A800 80GB PCIe (2x GPUs used)
  • Test command: OMP_NUM_THREADS=12 mpirun -n 2 abacus

Benchmark Results

| Version | snap_psibeta Time | contribute_HR Time | snap Ratio |
|---|---|---|---|
| v3.9.0.20 (baseline) | 306.46s | 404.38s | 48.32% |
| v3.9.0.21 (CPU optimized) | 158.09s | 236.51s | 33.57% |
| This PR (GPU) | 6.59s | 19.90s | 2.58% |

Speedup

  • vs v3.9.0.20: ~46x faster for snap_psibeta
  • vs v3.9.0.21: ~24x faster for snap_psibeta
  • Overall contribute_HR: ~20x faster than v3.9.0.20, ~12x faster than v3.9.0.21

…ory grids

- Implement GPU-accelerated snap_psibeta_neighbor_batch_kernel
- Use constant memory for Lebedev and Gauss-Legendre integration grids
- Add multi-GPU support via set_device_by_rank
- Initialize/finalize GPU resources in each calculate_HR call
- Remove static global variables for cleaner resource management
- CPU fallback when GPU processing fails
…ructure

- Add ModuleBase::timer for snap_psibeta_atom_batch_gpu function
- Remove the GPU fallback-to-CPU design (returning true/false from a void function)
- Replace fallback returns with error messages and proper early exits
- Ensure timer is properly called on all exit paths
- Simplify code structure for better readability
…uring

- Move ylm0 computation outside radial loop (saves 140x redundant calculations)
- Hoist A_dot_leb and dR calculations outside inner loop
- Add #pragma unroll hints for radial and m0 loops

Achieves 23.3% speedup on snap_psibeta_gpu (19.27s -> 14.78s).
Numerical correctness verified: energy matches baseline (-756.053 Ry).
- Replace conditional atan branches with single atan2 call
- Use sincos() instead of separate sin/cos calls

Achieves 8.4% additional speedup (14.78s -> 13.56s)
Combined with loop restructuring: 29.6% total from baseline
Numerical correctness verified: -756.053 Ry
- Convert compute_ylm_gpu to templated version with L as template param
- Use linear array for Legendre polynomials (reduces from 25 to 15 doubles)
- Add DISPATCH_YLM macro for runtime-to-template dispatch
- Add MAX_M0_SIZE constant for result array sizing
- Replace C++17 constexpr if with regular if for C++14 compatibility
- Enable compiler loop unrolling with #pragma unroll

Performance: snap_psibeta_gpu improved from 13.27s to 9.83s (1.35x speedup)
- Replace shared memory tree reduction with warp shuffle reduction
- Use warp_reduce_sum for intra-warp reduction (faster shuffle ops)
- Reduce shared memory from BLOCK_SIZE (2KB) to NUM_WARPS (64 bytes)
- Cross-warp reduction done by first warp reading from shared memory

Reduces register usage from 94 to 88, shared memory from 2KB to 64 bytes.
…umentation

- Add comprehensive file headers explaining purpose and key features
- Organize code into logical sections with clear separators
- Add doxygen-style documentation for all functions, structs, and constants
- Fix inaccurate comments (BLOCK_SIZE requirement, direction vector normalization)
- Remove unused variables (dR, distance01)
- Remove finalize_gpu_resources() as it's not needed for constant memory
- Improve inline comments explaining algorithms and optimizations
…ction

- Add use_gpu runtime flag that checks both __CUDA macro and PARAM.inp.device
- GPU path is now only enabled when __CUDA is defined AND device == "gpu"
- Makes the conditional logic clearer with if/else instead of nested #ifdef
- Move CUDA_CHECK macro to shared header snap_psibeta_kernel.cuh
- Remove duplicate CUDA_CHECK definition from snap_psibeta_gpu.cu
- Remove CUDA_CHECK_KERNEL macro and replace all usages with CUDA_CHECK
- Reduces code duplication and improves consistency
- Replace local PI, FOUR_PI, SQRT2 definitions with ModuleBase:: versions
- Add include for source_base/constants.h
- Replace fprintf(stderr, ...) with ModuleBase::WARNING_QUIT
- Update CUDA_CHECK macro to use WARNING_QUIT instead of fprintf
- Add includes for tool_quit.h and string header
- Consistent error handling with ABACUS codebase conventions
@mohanchen mohanchen left a comment

LGTM, great job.

@mohanchen mohanchen added the Refactor, Performance, GPU & DCU & HPC, and Useful Information labels Dec 28, 2025
@mohanchen mohanchen changed the title perf(TDDFT): Add CUDA acceleration for snap_psibeta_half function perf(TDDFT): Add CUDA acceleration for snap_psibeta_half function (Useful information about largely improves the snap_psibeta_half function) Dec 28, 2025
@mohanchen mohanchen merged commit b188f21 into deepmodeling:develop Dec 28, 2025
14 checks passed