Add NEC-specific compiler option to mo_lib_divrot

Daniel Reinert requested to merge libiconmath_nec_hotfix into main

What is the bug

The least-squares polynomial reconstruction routines recon_lsq_cell_XY have been extracted from the main ICON code and moved to the library libiconmath. Testing on NEC@DWD revealed that some of these routines no longer vectorize as intended after the move to libiconmath. To illustrate the issue, the compile listing of the loop body of recon_lsq_cell_c_lib is given below:

  1175: |+----->        DO jk = slev, elev
  1176: ||        
  1177: ||                !$ACC LOOP VECTOR PRIVATE(z_qt_times_d)
  1178: ||        !NEC$ ivdep
  1179: ||+---->          DO jc = i_startidx, i_endidx
  1180: |||       
  1181: |||                 ! calculate matrix vector product Q^T d (transposed of Q times LHS)
  1182: |||                 ! (intrinsic function matmul not applied, due to massive
  1183: |||                 ! performance penalty on the NEC. Instead the intrinsic dot product
  1184: |||                 ! function is applied
  1185: |||       !TODO:  these should be nine scalars, since they should reside in registers
  1186: |||V===>            z_qt_times_d(1) = DOT_PRODUCT(lsq_qtmat_c(jc, 1, 1:9, jb), z_d(1:9, jc, jk))
  1187: |||V===>            z_qt_times_d(2) = DOT_PRODUCT(lsq_qtmat_c(jc, 2, 1:9, jb), z_d(1:9, jc, jk))
  1188: |||V===>            z_qt_times_d(3) = DOT_PRODUCT(lsq_qtmat_c(jc, 3, 1:9, jb), z_d(1:9, jc, jk))
  1189: |||V===>            z_qt_times_d(4) = DOT_PRODUCT(lsq_qtmat_c(jc, 4, 1:9, jb), z_d(1:9, jc, jk))

For the library variant of recon_lsq_cell_c, the listing shows that the DOT_PRODUCT statements get vectorized (marker V) rather than the horizontal jc loop (marker +), which leads to a significant performance penalty. Testing revealed that the desired vectorization of the jc loop can be achieved by ensuring that the compiler completely unrolls the dot product. Complete unrolling is achieved by raising the compile option -floop-unroll-completely=m from m=8 to m=10. This makes sense, since the dot product in the example has a loop count of 9, which exceeds the previous limit of 8 but lies within the new limit of 10.
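To make the effect of complete unrolling concrete, the following minimal sketch compares the DOT_PRODUCT form with a hand-unrolled equivalent in which the jc loop is the only remaining loop. It is an illustration only, not the code in mo_lib_divrot: the program name, the arrays qt and d, and the block length nproma = 8 are assumptions chosen for the example, and the array shapes are simplified stand-ins for lsq_qtmat_c and z_d.

```fortran
program unrolled_dot_sketch
  implicit none
  integer, parameter :: wp = selected_real_kind(12, 307)
  integer, parameter :: nproma = 8   ! hypothetical block length (extent of jc)
  integer, parameter :: nrow = 9     ! loop count of the dot product
  real(wp) :: qt(nproma, nrow)       ! stand-in for lsq_qtmat_c(jc, 1, 1:9, jb)
  real(wp) :: d(nrow, nproma)        ! stand-in for z_d(1:9, jc, jk)
  real(wp) :: res_dot(nproma), res_unrolled(nproma)
  integer :: jc, i

  ! Fill the arrays with small integer values so that all sums are exact.
  do jc = 1, nproma
    do i = 1, nrow
      qt(jc, i) = real(jc + i, wp)
      d(i, jc) = real(jc * i, wp)
    end do
  end do

  ! Form used in the listing: an inner loop of length 9 hidden in DOT_PRODUCT.
  do jc = 1, nproma
    res_dot(jc) = dot_product(qt(jc, 1:nrow), d(1:nrow, jc))
  end do

  ! Hand-unrolled equivalent: jc is the only loop left to vectorize.
  do jc = 1, nproma
    res_unrolled(jc) = qt(jc, 1) * d(1, jc) + qt(jc, 2) * d(2, jc) &
      &              + qt(jc, 3) * d(3, jc) + qt(jc, 4) * d(4, jc) &
      &              + qt(jc, 5) * d(5, jc) + qt(jc, 6) * d(6, jc) &
      &              + qt(jc, 7) * d(7, jc) + qt(jc, 8) * d(8, jc) &
      &              + qt(jc, 9) * d(9, jc)
  end do

  ! Both forms must agree; stop with an error otherwise.
  if (maxval(abs(res_dot - res_unrolled)) > 0.0_wp) error stop "mismatch"
  print *, "max abs difference:", maxval(abs(res_dot - res_unrolled))
end program unrolled_dot_sketch
```

With the raised unroll limit, the compiler is expected to arrive at the second form on its own, so the source can keep the more readable DOT_PRODUCT calls.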

How do you fix it

To minimize the risk of unexpected side effects, the updated compiler option is applied only locally to the module mo_lib_divrot, by adding the directive

!NEC$ options "-floop-unroll-completely=10"

to the top of the module. The resulting compile listing, in which the jc loop is now vectorized (marker V), is given below:

  1180: |+----->        DO jk = slev, elev
  1181: ||        
  1182: ||                !$ACC LOOP VECTOR PRIVATE(z_qt_times_d)
  1183: ||        !NEC$ ivdep
  1184: ||V---->          DO jc = i_startidx, i_endidx
  1185: |||       
  1186: |||                 ! calculate matrix vector product Q^T d (transposed of Q times LHS)
  1187: |||                 ! (intrinsic function matmul not applied, due to massive
  1188: |||                 ! performance penalty on the NEC. Instead the intrinsic dot product
  1189: |||                 ! function is applied
  1190: |||       !TODO:  these should be nine scalars, since they should reside in registers
  1191: |||*===>            z_qt_times_d(1) = DOT_PRODUCT(lsq_qtmat_c(jc, 1, 1:9, jb), z_d(1:9, jc, jk))
  1192: |||*===>            z_qt_times_d(2) = DOT_PRODUCT(lsq_qtmat_c(jc, 2, 1:9, jb), z_d(1:9, jc, jk))
  1193: |||*===>            z_qt_times_d(3) = DOT_PRODUCT(lsq_qtmat_c(jc, 3, 1:9, jb), z_d(1:9, jc, jk))
  1194: |||*===>            z_qt_times_d(4) = DOT_PRODUCT(lsq_qtmat_c(jc, 4, 1:9, jb), z_d(1:9, jc, jk))
  1195: |||*===>            z_qt_times_d(5) = DOT_PRODUCT(lsq_qtmat_c(jc, 5, 1:9, jb), z_d(1:9, jc, jk))
  1196: |||*===>            z_qt_times_d(6) = DOT_PRODUCT(lsq_qtmat_c(jc, 6, 1:9, jb), z_d(1:9, jc, jk))
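
As a rough sketch of the placement described above, the directive could sit at the top of the module source as follows. The exact position relative to the MODULE statement and the elided module body are assumptions made for illustration; only the directive text itself is taken from this merge request.

```fortran
! Assumed placement: first line of the file, before the MODULE statement.
!NEC$ options "-floop-unroll-completely=10"
module mo_lib_divrot
  implicit none
  private
  ! ... module contents (recon_lsq_cell_* routines etc.) elided ...
end module mo_lib_divrot
```

Because the directive line starts with !, compilers other than the NEC compiler treat it as an ordinary comment, so the change stays local to NEC builds of this module.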

How urgent is the bugfix

  • I need it as soon as possible
  • I can wait for a couple of days
  • None of my current codes is directly affected

Mandatory steps before review

  • GitLab CI passes (Hint: use make format for linting)
  • Bugfix is covered by additional unit tests
  • Mark the merge request as ready by removing Draft:

Mandatory steps before merge

  • Reviewed by a maintainer
  • Incorporate review suggestions
  • Remember to edit the commit message and select the proper changelog category (feature/bugfix/other)
  • Prior to merging, please remove any boilerplate from the MR description, retaining only the What is the bug and How do you fix it section to maintain

You are not supposed to merge this request by yourself; the maintainers of libiconmath take care of this action!

Edited by Pradipta Samanta
