Topology optimization for linear elastic minimum compliance problems with a volume constraint on Cartesian grids in 3D. Implemented with OpenMP target offloading for multi-GPU acceleration.
The code in this repository is a direct continuation of the OpenMP code from *Simple and efficient GPU accelerated topology optimisation: Codes and applications*.
The code solves the cantilever beam test problem described in *Parallel framework for topology optimization using the method of moving asymptotes*.
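For reference, the underlying problem is the standard density-based minimum compliance formulation with a volume constraint:

$$
\begin{aligned}
\min_{\boldsymbol{x}} \quad & c(\boldsymbol{x}) = \boldsymbol{f}^\top \boldsymbol{u} \\
\text{s.t.} \quad & \boldsymbol{K}(\boldsymbol{x})\,\boldsymbol{u} = \boldsymbol{f}, \\
& \textstyle\sum_e v_e x_e \le V_{\max}, \\
& 0 \le x_e \le 1 \quad \forall e,
\end{aligned}
$$

where $x_e$ are the element densities, $\boldsymbol{K}$ the assembled stiffness matrix, $\boldsymbol{u}$ the displacements, $\boldsymbol{f}$ the load vector, $v_e$ the element volumes, and $V_{\max}$ the allowed material volume.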
| One Design Iteration | Five Design Iterations | 20 Design Iterations |
|---|---|---|
| ![]() | ![]() | ![]() |
The code has been implemented and tested with the following versions of SuiteSparse, OpenBLAS, NVC, GCC, and Clang in mind.
| Package | Version | Installation |
|---|---|---|
| SuiteSparse/CHOLMOD | 5.1.2 | See GitHub release notes |
| OpenBLAS | 0.2.20 | See GitHub release notes |
| nvhpc | 21.9 | NVIDIA HPC SDK |
| CUDA | 11.1 | CUDA Toolkit 11.1.0 |
| GCC | 13.0 | See the GCC offloading page |
| Clang | 16.0 | See the LLVM offloading page |
The NVIDIA nvc compiler is straightforward to install and can be downloaded as part of the NVIDIA HPC SDK.
Installing an offloading-enabled version of Clang or GCC is slightly more involved. Scripts for installing GCC 13 and Clang 16 with NVPTX, AMD-GCN, and AMD-HSA backends can be found in ./compilers.
To set PATH and LD_LIBRARY_PATH, you may source the scripts that export these variables in paths/gbar/ or paths/lumi/. For instance, to compile with NVC on gbar.dtu.dk, load

```sh
source paths/gbar/nvc.sh
```
The config directory contains makefiles for NVC and GCC, and config/gpu contains makefiles for specific offloading targets. To compile the code for an NVIDIA Tesla V100 with GCC, type

```sh
make COMPILER=gcc GPU=V100
```
A number of compile-time definitions are necessary to achieve good performance on a specific architecture. The following settings can be adjusted.
| Variable | Options | Description |
|---|---|---|
| `USE_CHOLMOD` | 1 (on) or 0 (off) | Use a direct solver on the coarsest multigrid level |
| `SIMD` | 1 (on) or 0 (off) | Add explicit `#pragma omp simd` directives in target regions |
| `STENCIL_SIZE_Y` | 2^k, k ≥ 1 | Block size for SIMD operations; must be at least two |
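With `USE_CHOLMOD=1`, the coarsest-level system is handed to a sparse direct solver. As a rough sketch of what such a coarse solve looks like with CHOLMOD's standard API (the function signature and variable names below are illustrative assumptions, not the repository's actual code):

```c
#include <cholmod.h>

/* Sketch: direct solve of the coarsest-level system K_c * u = f.
 * A_coarse, rhs, and the cholmod_common handle are hypothetical
 * placeholders for the assembled coarse stiffness matrix and
 * right-hand side. */
cholmod_dense *coarse_solve(cholmod_sparse *A_coarse,
                            cholmod_dense *rhs,
                            cholmod_common *c)
{
    /* Symbolic analysis (fill-reducing ordering) ... */
    cholmod_factor *L = cholmod_analyze(A_coarse, c);
    /* ... numeric Cholesky factorization ... */
    cholmod_factorize(A_coarse, L, c);
    /* ... and triangular solves for the solution. */
    cholmod_dense *u = cholmod_solve(CHOLMOD_A, L, rhs, c);
    cholmod_free_factor(&L, c);
    return u; /* caller frees with cholmod_free_dense */
}
```

In practice the factorization would typically be computed once and reused for every V-cycle within a design iteration, rather than recomputed per solve.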
For instance, compile with

```sh
make COMPILER=nvc GPU=A100 USE_CHOLMOD=1 SIMD=0 STENCIL_SIZE_Y=4
```

to target an NVIDIA Tesla A100 GPU with NVC, using a direct solver on the coarsest level, no explicit SIMD pragmas, and blocks of size four.
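To illustrate what the `SIMD` and `STENCIL_SIZE_Y` definitions control, here is a minimal sketch of a blocked, offloaded loop nest in the spirit of the stencil kernels; the kernel and all names are illustrative assumptions, not code from this repository:

```c
/* Illustrative blocked stencil update (not the repository's kernel).
 * Assumes ny is divisible by STENCIL_SIZE_Y. */
#ifndef STENCIL_SIZE_Y
#define STENCIL_SIZE_Y 4 /* power of two, at least two */
#endif

void smooth(const int nx, const int ny, const double *in, double *out)
{
    #pragma omp target teams distribute parallel for collapse(2) \
        map(to: in[0:nx*ny]) map(tofrom: out[0:nx*ny])
    for (int i = 1; i < nx - 1; i++) {
        for (int jb = 0; jb < ny / STENCIL_SIZE_Y; jb++) {
#if SIMD == 1
            /* Explicit vectorization of the innermost y-block */
            #pragma omp simd
#endif
            for (int j = 0; j < STENCIL_SIZE_Y; j++) {
                const int idx = i * ny + jb * STENCIL_SIZE_Y + j;
                out[idx] = 0.5 * (in[idx - ny] + in[idx + ny]);
            }
        }
    }
}
```

The inner loop has a fixed trip count of `STENCIL_SIZE_Y`, which gives the compiler a vectorizable block of known size; the explicit `#pragma omp simd` is only emitted when compiled with `SIMD=1`.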
Remember that the domain is partitioned along the x-axis. Hence, the length in x divided by the number of GPUs, divided again by the number of multigrid levels, should be at least 1. For example, try

```sh
OMP_NUM_THREADS=8 CUDA_VISIBLE_DEVICES=0,1 ./top3d 512 256 256 1 0 3 5
```

to run three design iterations in verbose mode, without saving the result, using 5 levels in the multigrid preconditioner. If you run

```sh
./top3d
```

without any input arguments, the executable will print usage instructions.
The following figure illustrates how voxels and lattices are partitioned amongst the GPUs.
In the example, the XY-plane consists of $12 \times 4$ voxels, so each GPU owns $4 \times 4$ voxels. The hollow circles indicate halo lattices, and the filled points indicate interior lattices.


