Add NEC-specific compiler option to mo_lib_divrot
- Dec 20, 2024
-
-
Daniel Reinert authored
add NEC-specific compiler option to mo_lib_divrot in order to ensure complete unrolling of dot products
-
The least squares polynomial reconstruction routines recon_lsq_cell_XY
have been extracted from the main ICON code and moved to the library libiconmath
. Testing on NEC@DWD revealed that some of these routines did not vectorize anymore after moving them to libiconmath
. To illustrate the issue, the compile listing of the loop body is given below for recon_lsq_cell_c_lib
1175: |+-----> DO jk = slev, elev
1176: ||
1177: || !$ACC LOOP VECTOR PRIVATE(z_qt_times_d)
1178: || !NEC$ ivdep
1179: ||+----> DO jc = i_startidx, i_endidx
1180: |||
1181: ||| ! calculate matrix vector product Q^T d (transposed of Q times LHS)
1182: ||| ! (intrinsic function matmul not applied, due to massive
1183: ||| ! performance penalty on the NEC. Instead the intrinsic dot product
1184: ||| ! function is applied
1185: ||| !TODO: these should be nine scalars, since they should reside in registers
1186: |||V===> z_qt_times_d(1) = DOT_PRODUCT(lsq_qtmat_c(jc, 1, 1:9, jb), z_d(1:9, jc, jk))
1187: |||V===> z_qt_times_d(2) = DOT_PRODUCT(lsq_qtmat_c(jc, 2, 1:9, jb), z_d(1:9, jc, jk))
1188: |||V===> z_qt_times_d(3) = DOT_PRODUCT(lsq_qtmat_c(jc, 3, 1:9, jb), z_d(1:9, jc, jk))
1189: |||V===> z_qt_times_d(4) = DOT_PRODUCT(lsq_qtmat_c(jc, 4, 1:9, jb), z_d(1:9, jc, jk))
For the library-variant of recon_lsq_cell_c
we see that the DOT_PRODUCT gets vectorized rather than the horizontal jc
loop, which leads to a significant performance penalty. Testing revealed, that the desired vectorization can be achieved by ensuring that a complete unrolling of the dot product is performed by the compiler. Unrolling is achieved by changing the compile option floop-unroll-completely=m
from m=8
to m=10
. This makes sense, since the loop count for the dot product in the example given is m=9
.
In order to minimize the possibility of unexpected side-effects, the updated compiler option is applied only locally to the module mo_lib_divrot
, by adding the directive
!NEC$ options "-floop-unroll-completely=10"
to the top of the module. The resulting compile listing with corrected vectorization is given below:
1180: |+-----> DO jk = slev, elev
1181: ||
1182: || !$ACC LOOP VECTOR PRIVATE(z_qt_times_d)
1183: || !NEC$ ivdep
1184: ||V----> DO jc = i_startidx, i_endidx
1185: |||
1186: ||| ! calculate matrix vector product Q^T d (transposed of Q times LHS)
1187: ||| ! (intrinsic function matmul not applied, due to massive
1188: ||| ! performance penalty on the NEC. Instead the intrinsic dot product
1189: ||| ! function is applied
1190: ||| !TODO: these should be nine scalars, since they should reside in registers
1191: |||*===> z_qt_times_d(1) = DOT_PRODUCT(lsq_qtmat_c(jc, 1, 1:9, jb), z_d(1:9, jc, jk))
1192: |||*===> z_qt_times_d(2) = DOT_PRODUCT(lsq_qtmat_c(jc, 2, 1:9, jb), z_d(1:9, jc, jk))
1193: |||*===> z_qt_times_d(3) = DOT_PRODUCT(lsq_qtmat_c(jc, 3, 1:9, jb), z_d(1:9, jc, jk))
1194: |||*===> z_qt_times_d(4) = DOT_PRODUCT(lsq_qtmat_c(jc, 4, 1:9, jb), z_d(1:9, jc, jk))
1195: |||*===> z_qt_times_d(5) = DOT_PRODUCT(lsq_qtmat_c(jc, 5, 1:9, jb), z_d(1:9, jc, jk))
1196: |||*===> z_qt_times_d(6) = DOT_PRODUCT(lsq_qtmat_c(jc, 6, 1:9, jb), z_d(1:9, jc, jk))
make format
for linting)
Draft:
You are not supposed to merge this request by yourself, the maintainers of libiconmath take care of this action!
add NEC-specific compiler option to mo_lib_divrot in order to ensure complete unrolling of dot products