2025-11-30
OS: Windows 10
RAM: 32 GB
GPU: NVIDIA GeForce GTX 1080
CPU: AMD Ryzen 5 5600X
This benchmark was performed using test_performance_benchmark.cpp from the tests folder.
Configuration:
Processing: Resampling(CUBIC) + Windowing + Dispersion + BG-Removal + Log-Scale
Iterations per test: 200
Backends: CPU, CUDA, OpenCL, Vulkan
Bit depth: 16-bit unsigned integer
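Signal is the number of samples per raw A-scan.
AScans is the number of A-scans per B-scan.
BScans is the number of B-scans per buffer.
A buffer is the block of data processed in a single pass on the GPU.
Speedup is the CPU time divided by the backend time for the same Signal and AScans size.
MB/s is the raw input data rate; the values are consistent with 1 MB = 1024 x 1024 bytes at 2 bytes per sample.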
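The derived columns of the table below can be reproduced from Time(ms) and the buffer dimensions. The following is a minimal sketch, not code from test_performance_benchmark.cpp: the names `Throughput`, `derive_throughput`, and `speedup` are hypothetical, and it assumes that Time(ms) is the time for one buffer and that 1 MB means 1024 x 1024 bytes.

```cpp
#include <cstddef>

// Throughput figures derived from one timed buffer.
struct Throughput {
    double bscans_per_s;
    double ascans_per_s;
    double mb_per_s;  // assumption: 1 MB = 1024 * 1024 bytes
};

// Hypothetical helper, not part of the benchmark source.
// time_ms is assumed to be the time needed to process one buffer.
inline Throughput derive_throughput(double time_ms, std::size_t signal,
                                    std::size_t ascans, std::size_t bscans,
                                    std::size_t bytes_per_sample = 2) {
    const double buffers_per_s = 1000.0 / time_ms;
    // Raw input size of one buffer: samples * A-scans * B-scans * sample width.
    const double bytes_per_buffer =
        static_cast<double>(signal) * ascans * bscans * bytes_per_sample;

    Throughput t;
    t.bscans_per_s = buffers_per_s * bscans;
    t.ascans_per_s = t.bscans_per_s * ascans;
    t.mb_per_s = buffers_per_s * bytes_per_buffer / (1024.0 * 1024.0);
    return t;
}

// Speedup of a backend relative to the CPU time for the same buffer size.
inline double speedup(double cpu_time_ms, double backend_time_ms) {
    return cpu_time_ms / backend_time_ms;
}
```

For the first CPU row (1.904 ms, 512 x 256 x 1) this yields roughly 525 BScans/s and 131.3 MB/s, and `speedup(1.904, 0.189)` gives about 10.07 for CUDA. Small deviations from the printed values are expected because the table shows the time rounded to three decimals.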
| Signal | AScans | BScans | Backend | Time(ms) | BScans/s | AScans/s | MB/s | Speedup |
|---|---|---|---|---|---|---|---|---|
| 512 | 256 | 1 | CPU | 1.904 | 525 | 134,481 | 131.33 | - |
| 512 | 256 | 1 | CUDA | 0.189 | 5,289 | 1,353,888 | 1322.16 | 10.07x |
| 512 | 256 | 1 | OpenCL | 0.287 | 3,483 | 891,598 | 870.70 | 6.63x |
| 512 | 256 | 1 | Vulkan | 0.102 | 9,809 | 2,511,035 | 2452.18 | 18.67x |
| 512 | 512 | 1 | CPU | 3.735 | 268 | 137,081 | 133.87 | - |
| 512 | 512 | 1 | CUDA | 0.173 | 5,775 | 2,956,632 | 2887.34 | 21.57x |
| 512 | 512 | 1 | OpenCL | 0.363 | 2,754 | 1,410,138 | 1377.09 | 10.29x |
| 512 | 512 | 1 | Vulkan | 0.173 | 5,782 | 2,960,222 | 2890.84 | 21.59x |
| 512 | 1024 | 1 | CPU | 7.464 | 134 | 137,195 | 133.98 | - |
| 512 | 1024 | 1 | CUDA | 0.336 | 2,976 | 3,047,664 | 2976.23 | 22.21x |
| 512 | 1024 | 1 | OpenCL | 0.684 | 1,462 | 1,496,802 | 1461.72 | 10.91x |
| 512 | 1024 | 1 | Vulkan | 0.310 | 3,229 | 3,306,212 | 3228.72 | 24.10x |
| 1024 | 256 | 1 | CPU | 3.639 | 275 | 70,340 | 137.38 | - |
| 1024 | 256 | 1 | CUDA | 0.240 | 4,170 | 1,067,490 | 2084.94 | 15.18x |
| 1024 | 256 | 1 | OpenCL | 0.389 | 2,569 | 657,692 | 1284.55 | 9.35x |
| 1024 | 256 | 1 | Vulkan | 0.174 | 5,737 | 1,468,732 | 2868.62 | 20.88x |
| 1024 | 512 | 1 | CPU | 7.321 | 137 | 69,940 | 136.60 | - |
| 1024 | 512 | 1 | CUDA | 0.343 | 2,914 | 1,491,928 | 2913.92 | 21.33x |
| 1024 | 512 | 1 | OpenCL | 0.679 | 1,472 | 753,734 | 1472.14 | 10.78x |
| 1024 | 512 | 1 | Vulkan | 0.313 | 3,193 | 1,634,634 | 3192.64 | 23.37x |
| 1024 | 1024 | 1 | CPU | 14.916 | 67 | 68,649 | 134.08 | - |
| 1024 | 1024 | 1 | CUDA | 0.502 | 1,992 | 2,039,556 | 3983.51 | 29.71x |
| 1024 | 1024 | 1 | OpenCL | 1.108 | 902 | 923,971 | 1804.63 | 13.46x |
| 1024 | 1024 | 1 | Vulkan | 0.581 | 1,722 | 1,763,708 | 3444.74 | 25.69x |
| 2048 | 256 | 1 | CPU | 7.463 | 134 | 34,301 | 133.99 | - |
| 2048 | 256 | 1 | CUDA | 0.308 | 3,243 | 830,279 | 3243.28 | 24.21x |
| 2048 | 256 | 1 | OpenCL | 0.686 | 1,457 | 373,020 | 1457.11 | 10.87x |
| 2048 | 256 | 1 | Vulkan | 0.354 | 2,824 | 723,011 | 2824.26 | 21.08x |
| 2048 | 512 | 1 | CPU | 14.970 | 67 | 34,201 | 133.60 | - |
| 2048 | 512 | 1 | CUDA | 0.475 | 2,104 | 1,077,067 | 4207.29 | 31.49x |
| 2048 | 512 | 1 | OpenCL | 1.146 | 872 | 446,580 | 1744.45 | 13.06x |
| 2048 | 512 | 1 | Vulkan | 0.693 | 1,443 | 739,041 | 2886.88 | 21.61x |
| 2048 | 1024 | 1 | CPU | 30.994 | 32 | 33,038 | 129.06 | - |
| 2048 | 1024 | 1 | CUDA | 0.798 | 1,252 | 1,282,413 | 5009.42 | 38.82x |
| 2048 | 1024 | 1 | OpenCL | 2.131 | 469 | 480,565 | 1877.21 | 14.55x |
| 2048 | 1024 | 1 | Vulkan | 1.345 | 743 | 761,140 | 2973.20 | 23.04x |
2025-11-30
Device: NVIDIA Jetson Orin Nano 8GB
GPU: NVIDIA Ampere architecture (1024 CUDA cores, Compute Capability 8.7)
RAM: 8 GB LPDDR5
CPU: 6-core ARM Cortex-A78AE
This benchmark was performed using test_performance_benchmark.cpp from the tests folder.
Configuration:
Processing: Resampling(CUBIC) + Windowing + Dispersion + DC-Removal + Log-Scale
Iterations per test: 20000
Backends: CUDA
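Bit depth: 16-bit unsigned integer
Signal is the number of samples per raw A-scan.
AScans is the number of A-scans per B-scan.
BScans is the number of B-scans per buffer.
A buffer is the block of data processed in a single pass on the GPU.
Speedup is shown as - because only the CUDA backend was measured on this device, so there is no CPU baseline.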
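The "Iterations per test" setting suggests that each Time(ms) value is averaged over many repeated runs. The sketch below shows one common way to take such a measurement. It is an assumption about the general approach, not the code in test_performance_benchmark.cpp: `mean_time_ms` and the `process_buffer` callable are hypothetical names, and the single warm-up call is an illustrative choice.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: mean wall-clock time per call, in milliseconds.
// process_buffer must return only after the work is complete.
template <typename Fn>
double mean_time_ms(Fn&& process_buffer, std::size_t iterations) {
    using clock = std::chrono::steady_clock;

    process_buffer();  // warm-up call, excluded from the timing (assumption)

    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        process_buffer();
    }
    const auto stop = clock::now();

    // Convert the elapsed time to fractional milliseconds and average it.
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

With an asynchronous GPU backend, the callable would have to wait for the device to finish before returning (for CUDA, for example, by calling `cudaDeviceSynchronize`), otherwise only the launch overhead is measured.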
| Signal | AScans | BScans | Backend | Time(ms) | BScans/s | AScans/s | MB/s | Speedup |
|---|---|---|---|---|---|---|---|---|
| 512 | 256 | 1 | CUDA | 0.448 | 2,232 | 571,503 | 558.11 | - |
| 512 | 512 | 1 | CUDA | 0.493 | 2,029 | 1,038,764 | 1014.42 | - |
| 512 | 1024 | 1 | CUDA | 0.686 | 1,458 | 1,492,818 | 1457.83 | - |
| 1024 | 256 | 1 | CUDA | 0.488 | 2,048 | 524,194 | 1023.82 | - |
| 1024 | 512 | 1 | CUDA | 0.697 | 1,436 | 735,040 | 1435.62 | - |
| 1024 | 1024 | 1 | CUDA | 1.462 | 684 | 700,247 | 1367.67 | - |
| 2048 | 256 | 1 | CUDA | 0.703 | 1,423 | 364,359 | 1423.28 | - |
| 2048 | 512 | 1 | CUDA | 1.425 | 702 | 359,256 | 1403.34 | - |
| 2048 | 1024 | 1 | CUDA | 2.802 | 357 | 365,451 | 1427.54 | - |