Add NEC-specific compiler option to mo_lib_divrot
What is the bug
The least squares polynomial reconstruction routines recon_lsq_cell_XY
have been extracted from the main ICON code and moved to the library libiconmath
. Testing on NEC@DWD revealed that some of these routines did not vectorize anymore after moving them to libiconmath
. To illustrate the issue, the compile listing of the loop body is given below for recon_lsq_cell_c_lib
1175: |+-----> DO jk = slev, elev
1176: ||
1177: || !$ACC LOOP VECTOR PRIVATE(z_qt_times_d)
1178: || !NEC$ ivdep
1179: ||+----> DO jc = i_startidx, i_endidx
1180: |||
1181: ||| ! calculate matrix vector product Q^T d (transposed of Q times LHS)
1182: ||| ! (intrinsic function matmul not applied, due to massive
1183: ||| ! performance penalty on the NEC. Instead the intrinsic dot product
1184: ||| ! function is applied
1185: ||| !TODO: these should be nine scalars, since they should reside in registers
1186: |||V===> z_qt_times_d(1) = DOT_PRODUCT(lsq_qtmat_c(jc, 1, 1:9, jb), z_d(1:9, jc, jk))
1187: |||V===> z_qt_times_d(2) = DOT_PRODUCT(lsq_qtmat_c(jc, 2, 1:9, jb), z_d(1:9, jc, jk))
1188: |||V===> z_qt_times_d(3) = DOT_PRODUCT(lsq_qtmat_c(jc, 3, 1:9, jb), z_d(1:9, jc, jk))
1189: |||V===> z_qt_times_d(4) = DOT_PRODUCT(lsq_qtmat_c(jc, 4, 1:9, jb), z_d(1:9, jc, jk))
For the library-variant of recon_lsq_cell_c
we see that the DOT_PRODUCT gets vectorized rather than the horizontal jc
loop, which leads to a significant performance penalty. Testing revealed, that the desired vectorization can be achieved by ensuring that a complete unrolling of the dot product is performed by the compiler. Unrolling is achieved by changing the compile option floop-unroll-completely=m
from m=8
to m=10
. This makes sense, since the loop count for the dot product in the example given is m=9
.
How do you fix it
In order to minimize the possibility of unexpected side-effects, the updated compiler option is applied only locally to the module mo_lib_divrot
, by adding the directive
!NEC$ options "-floop-unroll-completely=10"
to the top of the module. The resulting compile listing with corrected vectorization is given below:
1180: |+-----> DO jk = slev, elev
1181: ||
1182: || !$ACC LOOP VECTOR PRIVATE(z_qt_times_d)
1183: || !NEC$ ivdep
1184: ||V----> DO jc = i_startidx, i_endidx
1185: |||
1186: ||| ! calculate matrix vector product Q^T d (transposed of Q times LHS)
1187: ||| ! (intrinsic function matmul not applied, due to massive
1188: ||| ! performance penalty on the NEC. Instead the intrinsic dot product
1189: ||| ! function is applied
1190: ||| !TODO: these should be nine scalars, since they should reside in registers
1191: |||*===> z_qt_times_d(1) = DOT_PRODUCT(lsq_qtmat_c(jc, 1, 1:9, jb), z_d(1:9, jc, jk))
1192: |||*===> z_qt_times_d(2) = DOT_PRODUCT(lsq_qtmat_c(jc, 2, 1:9, jb), z_d(1:9, jc, jk))
1193: |||*===> z_qt_times_d(3) = DOT_PRODUCT(lsq_qtmat_c(jc, 3, 1:9, jb), z_d(1:9, jc, jk))
1194: |||*===> z_qt_times_d(4) = DOT_PRODUCT(lsq_qtmat_c(jc, 4, 1:9, jb), z_d(1:9, jc, jk))
1195: |||*===> z_qt_times_d(5) = DOT_PRODUCT(lsq_qtmat_c(jc, 5, 1:9, jb), z_d(1:9, jc, jk))
1196: |||*===> z_qt_times_d(6) = DOT_PRODUCT(lsq_qtmat_c(jc, 6, 1:9, jb), z_d(1:9, jc, jk))
How urgent is the bugfix
-
I need it as soon as possible -
I can wait for a couple of days -
None of my current codes is directly affected
Mandatory steps before review
-
Gitlab CI passes (Hint: use make format
for linting) -
Bugfix is covered by additional unit tests -
Mark the merge request as ready by removing Draft:
Mandatory steps before merge
-
Reviewed by a maintainer -
Incorporate review suggestions -
Remember to edit the commit message and select the proper changelog category (feature/bugfix/other) -
Prior to merging, please remove any boilerplate from the MR description, retaining only the What is the bug and How do you fix it section to maintain
You are not supposed to merge this request by yourself, the maintainers of libiconmath take care of this action!