# #4: Tracer Advection Dwarf - STORM
Second tracer advection dwarf performance evaluation, presented on 2023/03/08 in a meeting with AWI developers. The goal of this evaluation was to assess the GPU speedup of the ported dwarf on a higher-resolution mesh (STORM) and compare it to the results obtained in the first evaluation with the lower-resolution CORE2 mesh.
## Setup
Four binaries were compared, built for CPU and GPU from two versions of the code:
- acc_porting:
  - Kernel porting + data movement optimizations
  - Very minor code changes (removal of maxval in the FCT subroutine)
  - Referred to as CPU (when built for CPU) or GPU (when built for GPU)
- tra_adv_acc_optim:
  - Kernel porting + data movement optimizations + loop collapsing + initialization fusing (see the sketch after this list)
  - Larger code changes
  - Referred to as CPU + loop collapsing (when built for CPU) or GPU + loop collapsing (when built for GPU)
- Data collection done with MPI barriers and MPI_Wtime calls before and after each kernel (see the sketch after this list). Initial and final data movement as well as IO are not accounted for, since they will be diluted during an actual full model execution.
- STORM mesh, with data and config files obtained from Nikolay's Levante folders.
- Data collection done on Levante.
- 128 MPI processes employed for the CPU execution (1 full Levante compute node).
- 4 GPU-bound MPI processes employed for the GPU execution (1 full Levante GPU node, with the remaining CPU cores idle).
- Execution of 10 time steps for the dwarf.
- Dwarf executed 5 times.
- 50 data points employed (5 runs × 10 time steps); a simple average is used for the plot values.
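The timing pattern described above (MPI barriers plus Wtime around each kernel) looks roughly like the following self-contained sketch; the kernel call is elided and all names are illustrative only, not the dwarf's actual instrumentation:

```fortran
program time_kernel_sketch
   use mpi
   implicit none
   integer :: ierr, mype
   real(8) :: t_start, t_end

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, mype, ierr)

   call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! align all ranks before timing
   t_start = MPI_Wtime()

   ! ... kernel under measurement, e.g. a call to adv_tra_hor_upw1 ...

   call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! wait until all ranks have finished
   t_end = MPI_Wtime()

   if (mype == 0) write(*,*) 'kernel time (ms): ', 1.0d3*(t_end - t_start)

   call MPI_Finalize(ierr)
end program time_kernel_sketch
```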
Times collected for the most expensive parallel regions:
| Abbreviated name | Subroutine name | File |
| --- | --- | --- |
| UPWH | adv_tra_hor_upw1 | oce_adv_tra_hor.F90 |
| UPWV | adv_tra_ver_upw1 | oce_adv_tra_ver.F90 |
| MFCT | adv_tra_hor_mfct | oce_adv_tra_hor.F90 |
| QR4C | adv_tra_ver_qr4c | oce_adv_tra_ver.F90 |
| FCT | oce_tra_adv_fct | oce_adv_tra_fct.F90 |
## Results
The first metric is the execution time of each of the parallel regions, shown in the following plot. All times are in milliseconds.
![mean_execution_times](uploads/6db53a7ed007bd2d0b1cd58c97317632/mean_execution_times.png)
When built for the CPU, the version with collapsed loops shows a reduction in total time for 3 of the 5 parallel regions compared with the baseline CPU build. This contrasts with the CORE2 tracer dwarf performance evaluation, where most kernels showed an increase in CPU time when moving from the baseline to the collapsed-loop version. A similar pattern of improvement and degradation from loop collapsing (improvement for UPWH and UPWV, degradation for MFCT and QR4C) was observed for both tools in the “OpenMP vs OpenACC” performance evaluation.
### Speedups
The following plot shows the achieved speedup for each kernel/parallel region, for the overall compute time (defined as the sum of all kernel times), and for the total dwarf time (which additionally includes MPI communication time).
![speedups](uploads/cfe3da21055f386ef811167d06f0bc62/speedups.png)
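For reference, the plotted speedups are presumably simple ratios of the mean CPU and GPU times:

```math
S_k = \frac{\bar{t}^{\,\mathrm{CPU}}_k}{\bar{t}^{\,\mathrm{GPU}}_k},
\qquad
S_{\mathrm{compute}} = \frac{\sum_k \bar{t}^{\,\mathrm{CPU}}_k}{\sum_k \bar{t}^{\,\mathrm{GPU}}_k}
```

where k runs over the five kernels (UPWH, UPWV, MFCT, QR4C, FCT); the total dwarf speedup is the same ratio taken over the full execution time, including MPI communication.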