This page hosts all the information gathered during the FESOM GPU porting sprint.
---
# Overall information
- **Goals:**
- Investigate the most suitable porting strategies and tools (OpenMP, OpenACC, etc.)
- Start porting FESOM 2.1 to GPUs with Levante and the JUWELS booster as main targets.
- Evaluate possible speedups of ported parts of the code.
- Integrate the ported versions into the main upstream FESOM repository.
- **Planned start and end date:** 2022.11.08 to 2023.05.08
- **Planned Duration:** 6 months
- **End date:** 2023.04.20
- **Main contacts:** Dmitry Sidorenko, Nikolay Koldunov
- **Assigned RSE:** Wilton Loch
---
# Timeline
The following timeline lists the major events and meetings held during the sprint. Some dates of development milestones are approximate, as such events were not precisely recorded.
| Date | Description |
| --- | --- |
| 2022.11.08 | Kick-off meeting with AWI developers |
| 2022.12.10 | Finished initial ported version of the tracer advection dwarf |
| 2022.12.14 | Meeting with AWI developers to discuss current porting status and next steps for the sprint |
| 2023.01.05 | Finished loop collapsing on all kernels of the tracer dwarf |
| 2023.01.13 | Meeting with AWI developers to showcase the preliminary performance results for the tracer advection dwarf ported versions with low-resolution mesh |
| 2023.01.25 | Finished ice dwarf initial porting and data movement optimization |
| 2023.03.08 | Meeting with AWI developers to present the performance results of the porting for the tracer advection and ice dwarf with high-resolution mesh |
| 2023.04.19 | Finished the porting and data movement optimization of selected dynamics subroutines |
| 2023.04.20 | FESOM sprint closing meeting |
The initial porting effort was focused on parts of the code which had a dwarf available. A dwarf is a granular application which comprises and executes only a specific functionality of the full model, allowing better isolation, easier testing and increased development speed. At the beginning of the sprint two dwarfs were available: tracer advection and sea-ice dynamics.
The work then started with the tracer advection dwarf, as this part has the largest share of the time consumption in the full model execution. By the end of the sprint, two ported versions of the tracer advection were available. The first, containing the OpenACC decorations and only very minor code modifications, was merged into the FESOM main development branch. The second, containing kernel loop collapsing and other initial optimizations, is awaiting discussion and approval by the AWI scientists before being merged. Both versions had single-node performance evaluations conducted during the sprint, whose results can be seen in the performance evaluations section.
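To give a flavour of the decorations involved, below is a minimal sketch of an OpenACC-decorated node/level loop with the kind of collapse applied in the second version. All names, bounds and the update formula are illustrative placeholders, not verbatim FESOM code:

```fortran
! Sketch only: array names, bounds and the update are illustrative placeholders.
subroutine tracer_update_acc(nl, n2D, dt, ttf, rhs, area, ttf_new)
   implicit none
   integer, intent(in)  :: nl, n2D
   real(8), intent(in)  :: dt
   real(8), intent(in)  :: ttf(nl, n2D), rhs(nl, n2D), area(nl, n2D)
   real(8), intent(out) :: ttf_new(nl, n2D)
   integer :: n, nz

   ! The first ported version adds the directive without COLLAPSE; the second
   ! version collapses both loops so that node and level iterations are
   ! distributed over the GPU together.
   !$acc parallel loop collapse(2) copyin(ttf, rhs, area) copyout(ttf_new)
   do n = 1, n2D
      do nz = 1, nl
         ttf_new(nz, n) = ttf(nz, n) + dt * rhs(nz, n) / area(nz, n)
      end do
   end do
   !$acc end parallel loop
end subroutine tracer_update_acc
```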
The next step in the porting process was the sea-ice dynamics dwarf; at the end of the sprint one ported version was available. It comprised the OpenACC decorations and only minor changes in the application code, and was merged into the FESOM main development branch. This version also had single-node performance evaluations conducted during the sprint, whose results can be seen in the performance evaluations section.
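The data movement optimization noted in the timeline for the ice dwarf essentially means keeping fields resident on the GPU between kernel launches instead of copying them in and out around every call. A minimal sketch of that pattern follows; the field names and the update are placeholders, not the actual sea-ice variables:

```fortran
! Sketch only: field names and the update are stand-ins for the real sea-ice fields.
subroutine ice_step_driver(nsteps, n, u_ice, v_ice, a_ice)
   implicit none
   integer, intent(in)    :: nsteps, n
   real(8), intent(inout) :: u_ice(n), v_ice(n)
   real(8), intent(in)    :: a_ice(n)
   integer :: step, i

   ! Copy the fields to the GPU once, keep them resident across all iterations,
   ! and copy the results back a single time, instead of transferring data
   ! around every kernel launch.
   !$acc enter data copyin(u_ice, v_ice, a_ice)
   do step = 1, nsteps
      !$acc parallel loop default(present)
      do i = 1, n
         u_ice(i) = u_ice(i) + 0.1d0 * a_ice(i)   ! stand-in for the real update
         v_ice(i) = v_ice(i) - 0.1d0 * a_ice(i)
      end do
   end do
   !$acc exit data copyout(u_ice, v_ice) delete(a_ice)
end subroutine ice_step_driver
```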
After discussions with the AWI scientists it was decided that some selected subroutines of the dynamics, especially pressure calculations, should also be ported. One ported version of a set of subroutines was available at the end of the sprint and was merged into the FESOM main development branch.
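As an illustration of the structure of such pressure calculations, here is a hedged sketch of a hydrostatic pressure integration: parallel over horizontal nodes, with a sequential vertical loop because each level depends on the one above. Names and the formula are placeholders, not the merged FESOM routines:

```fortran
! Sketch only: an illustrative hydrostatic pressure integration, not the merged code.
subroutine pressure_acc(nl, n2D, g, rho, dz, p)
   implicit none
   integer, intent(in)  :: nl, n2D
   real(8), intent(in)  :: g
   real(8), intent(in)  :: rho(nl, n2D), dz(nl, n2D)
   real(8), intent(out) :: p(nl, n2D)
   integer :: n, nz

   ! Parallelism is over the horizontal nodes; the vertical integration carries
   ! a dependence from one level to the next, so it stays a sequential loop.
   !$acc parallel loop copyin(rho, dz) copyout(p)
   do n = 1, n2D
      p(1, n) = 0.5d0 * g * rho(1, n) * dz(1, n)
      !$acc loop seq
      do nz = 2, nl
         p(nz, n) = p(nz-1, n) + 0.5d0 * g * (rho(nz-1, n)*dz(nz-1, n) + rho(nz, n)*dz(nz, n))
      end do
   end do
end subroutine pressure_acc
```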
---
# Performance evaluations
A number of performance evaluations were conducted to assess the speedups achieved by the ported versions. Further details for each one are available below.
---
## Lessons Learned
- The OpenACC multicore target build has performance similar to the OpenMP multicore target for the majority of the kernels. Locks and atomic updates on the OpenMP multicore target perform worse than OpenACC atomics (see the sketch after this list).
- For bit-identical results between CPU and GPU executions, compiler optimizations should be set to the minimum.
- Extend the OpenACC porting guide (perhaps as a standalone document).
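To make the atomics comparison concrete, below is a minimal sketch of a generic edge-to-node scatter where atomic updates are needed because several edges may update the same node concurrently; the names are illustrative and not taken from the FESOM kernels:

```fortran
! Sketch only: a generic edge-to-node scatter; several edges can update the
! same node concurrently, so the updates must be atomic.
subroutine scatter_edges_acc(nedges, nnodes, edge_nodes, edge_flux, node_sum)
   implicit none
   integer, intent(in)    :: nedges, nnodes
   integer, intent(in)    :: edge_nodes(2, nedges)
   real(8), intent(in)    :: edge_flux(nedges)
   real(8), intent(inout) :: node_sum(nnodes)
   integer :: e

   !$acc parallel loop copyin(edge_nodes, edge_flux) copy(node_sum)
   do e = 1, nedges
      !$acc atomic update
      node_sum(edge_nodes(1, e)) = node_sum(edge_nodes(1, e)) + edge_flux(e)
      !$acc atomic update
      node_sum(edge_nodes(2, e)) = node_sum(edge_nodes(2, e)) - edge_flux(e)
   end do
end subroutine scatter_edges_acc
```

The OpenMP multicore counterpart replaces the directives with `!$omp parallel do` and `!$omp atomic update` on the same statements; this is the pairing in which the OpenACC atomics performed better.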
## Optimizations
### OCE_FCT
- Conceptual mismatch: AUX is defined as an auxiliary array to save space; however, the values written to it are meaningful for further calculations done in other parts of the code.
- AUX is a pointer to edge_up_dn_grad, whose first dimension has size 4. Inside fct only the first two positions of the first dimension are used, which creates non-contiguous memory accesses between different threads (see the sketch after this list).
- Merging the last two kernels into one: loop over the number of nodes and vertical levels, and for each node perform the calculation of the original second kernel for #edges/#nodes edges.
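As a rough illustration of the stride issue described above (the leading dimension of 4 comes from the notes; the loop body itself is schematic and, unlike FESOM, accesses the array directly rather than through the AUX pointer):

```fortran
! Schematic only: the loop body is a stand-in for the actual FCT kernel.
subroutine fct_grad_diff(nl, nedges, edge_up_dn_grad, work)
   implicit none
   integer, intent(in)  :: nl, nedges
   real(8), intent(in)  :: edge_up_dn_grad(4, nl, nedges)
   real(8), intent(out) :: work(nl, nedges)
   integer :: edge, nz

   !$acc parallel loop collapse(2) copyin(edge_up_dn_grad) copyout(work)
   do edge = 1, nedges
      do nz = 1, nl
         ! Only entries 1 and 2 of the leading dimension (size 4) are read, so
         ! neighbouring iterations touch memory with a stride of four reals
         ! instead of accessing it contiguously.
         work(nz, edge) = edge_up_dn_grad(1, nz, edge) - edge_up_dn_grad(2, nz, edge)
      end do
   end do
end subroutine fct_grad_diff
```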
## Ideas
# Remaining work
As a result of the sprint, a number of open topics remain for future work. Below is a list of the gathered points:
- **MPI time investigation and improvement:** During the single-node performance evaluations it was noticed that MPI performance degrades significantly when executing on the GPU compared to the CPU. This is not partition dependent: a pure CPU execution on the Levante GPU partition with the same number of processes does not suffer from the degradation and stays within the same order of magnitude as the CPU execution on the CPU partition. The degradation can make communication 100 to 1000 times slower. Currently, the main hypothesis is that it comes from the use of MPI derived data types in the exchanges, for which similar performance degradation has been observed in other contexts when using GPU-to-GPU communication, as is the case for the ported versions. Given the magnitude of the degradation, this is a pressing point which should be investigated before full model GPU execution and certainly before any full model GPU performance evaluations.
- **LUMI ported version execution:** As one of the main targets for future FESOM execution is LUMI, which has AMD GPUs, testing the ported versions of the code on that infrastructure is a necessary next step to verify both execution and performance portability.
- **Scaling performance evaluation:** During the sprint only single-node performance evaluations were performed. Multi-node scaling tests are therefore also needed to verify the scaling behavior of the ported versions and to define a minimum-load threshold for effective GPU execution of the model.
- **Dynamics loop collapsing:** The last part of the work done during the sprint was the porting of a number of subroutines in the dynamics section, especially for density and pressure calculations. For these sections, as opposed to the ice dwarf, there are many three-dimensional loops which can be collapsed for increased performance when executing on GPUs. Another front of future work is therefore performing these first-level optimizations on the kernels of these selected subroutines (see the sketch after this list).
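A minimal sketch of what such a collapse could look like, assuming a generic tightly nested triple loop; the names, shapes and update are illustrative placeholders rather than the actual dynamics kernels:

```fortran
! Sketch only: a generic tightly nested triple loop standing in for the
! three-dimensional loops found in the dynamics subroutines.
subroutine dynamics_collapse_sketch(ni, nj, nk, a, b)
   implicit none
   integer, intent(in)  :: ni, nj, nk
   real(8), intent(in)  :: a(nk, nj, ni)
   real(8), intent(out) :: b(nk, nj, ni)
   integer :: i, j, k

   ! Without COLLAPSE only the outermost loop provides GPU parallelism;
   ! collapsing all three levels exposes ni*nj*nk iterations to the device.
   !$acc parallel loop collapse(3) copyin(a) copyout(b)
   do i = 1, ni
      do j = 1, nj
         do k = 1, nk
            b(k, j, i) = 2.0d0 * a(k, j, i)
         end do
      end do
   end do
end subroutine dynamics_collapse_sketch
```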