This performance evaluation measures the GPU performance of the FESOM ice dwarf port. It was carried out at the end of February 2023 and presented to the AWI developers during a meeting on 2023/03/08.
## Setup
Two binaries were compared, built for CPU and for GPU from a single version of the code:
- dwarf_ice_porting:
  - Kernel porting plus data-movement optimizations
  - Very minor code changes
  - In the plots suggestively referred to as CPU (when built for CPU) or GPU (when built for GPU)
- Data collection was done with MPI barriers and `MPI_Wtime` calls before and after each kernel (see the sketch after this list). Initial and final data movement, as well as I/O, are not accounted for, since they will be diluted during an actual full-model execution.
- STORM mesh, with data and config files obtained from Nikolay’s Levante folders.
- Data collection done on Levante.
- 128 MPI processes employed for the CPU execution (1 full Levante compute node).
- 4 GPU-bound MPI processes employed for the GPU execution (1 full Levante GPU node, with the remaining CPU cores idle).
- Execution of 10 time steps inside the dwarf.
- Dwarf executed 5 times.
- 50 data points (10 time steps × 5 executions) collected for each measured region; a simple average is used for the plot values.
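As a rough illustration of the measurement pattern, here is a minimal C sketch (the dwarf itself is Fortran; `run_kernel` is a hypothetical stand-in for one of the measured regions): each region is bracketed by an `MPI_Barrier` followed by an `MPI_Wtime` call, both before and after the kernel.

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical stand-in for one of the measured regions
   (e.g. stress_tensor); the real kernels live in the dwarf. */
static void run_kernel(void) { /* ... ported kernel ... */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Synchronize all ranks so the measurement starts together,
       then bracket the kernel with MPI_Wtime. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    run_kernel();

    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("region time: %f s (one data point)\n", t1 - t0);

    MPI_Finalize();
    return 0;
}
```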
The first and most important metric is the execution time, shown in the following plot. The stress_tensor, stress2rhs and evp_loop regions are executed inside the EVP main loop, which for the current experiment configuration runs for 120 iterations; their total accumulated wall time is therefore 120× larger than shown. Here their times are given per single EVP iteration to allow a single-launch comparison between kernels/parallel regions. The ice_fem_fct kernel is likewise executed 3 times, once for each of 3 different arrays, but the time shown is aggregated over the 3 calls.
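As a worked example of this normalization (the numbers below are placeholders, not measured values): an EVP-subloop region’s accumulated time is divided by the 120 iterations, while ice_fem_fct keeps its 3 calls aggregated.

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical accumulated wall times in seconds; the real
       values come from the MPI_Wtime instrumentation above. */
    double evp_loop_total = 6.0;  /* accumulated over 120 EVP iterations */
    double ice_fem_fct    = 0.9;  /* aggregated over its 3 calls, kept as-is */

    const int evp_iterations = 120;

    /* Per-launch value plotted for the EVP-subloop regions. */
    double evp_loop_single = evp_loop_total / evp_iterations;

    printf("evp_loop per iteration: %f s\n", evp_loop_single);
    printf("ice_fem_fct (3 calls aggregated): %f s\n", ice_fem_fct);
    return 0;
}
```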
The following plot shows the achieved speedup for each kernel, as well as the overall combined compute speedup across all kernels. For this latter value, the sum of the times of all compute regions on the CPU is divided by the sum of the times of all kernels on the GPU.
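A minimal sketch of how these speedups are derived; the region names match the report, but the timings are hypothetical placeholders, not measured values.

```c
#include <stdio.h>

#define N_REGIONS 3

int main(void)
{
    /* Hypothetical per-region times in seconds; placeholders only. */
    const char *name[N_REGIONS] = {"evp_pre_loop", "stress_tensor", "evp_loop"};
    double cpu_t[N_REGIONS] = {0.40, 0.05, 1.20};
    double gpu_t[N_REGIONS] = {0.02, 0.004, 0.30};

    double cpu_sum = 0.0, gpu_sum = 0.0;
    for (int i = 0; i < N_REGIONS; ++i) {
        /* Per-kernel speedup: CPU time over GPU time. */
        printf("%-14s speedup: %.1fx\n", name[i], cpu_t[i] / gpu_t[i]);
        cpu_sum += cpu_t[i];
        gpu_sum += gpu_t[i];
    }

    /* Overall combined compute speedup: sum of all CPU region
       times divided by sum of all GPU kernel times. */
    printf("overall compute speedup: %.1fx\n", cpu_sum / gpu_sum);
    return 0;
}
```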
From the plot, although evp_pre_loop and stress_tensor show the highest speedups, they contribute little to the overall compute speedup because they are relatively small kernels: a region that accounts for only a few percent of the total CPU compute time cannot reduce the combined time by more than those few percent, no matter how large its individual speedup.