An error occurred while fetching folder content.

!20)
Daniel Reinert
authored
## What is the bug The least squares polynomial reconstruction routines `recon_lsq_cell_XY` have been extracted from the main ICON code and moved to the library `libiconmath`. Testing on NEC@DWD revealed that some of these routines did not vectorize anymore after moving them to `libiconmath`. To illustrate the issue, the compile listing of the loop body is given below for `recon_lsq_cell_c_lib` ``` 1175: |+-----> DO jk = slev, elev 1176: || 1177: || !$ACC LOOP VECTOR PRIVATE(z_qt_times_d) 1178: || !NEC$ ivdep 1179: ||+----> DO jc = i_startidx, i_endidx 1180: ||| 1181: ||| ! calculate matrix vector product Q^T d (transposed of Q times LHS) 1182: ||| ! (intrinsic function matmul not applied, due to massive 1183: ||| ! performance penalty on the NEC. Instead the intrinsic dot product 1184: ||| ! function is applied 1185: ||| !TODO: these should be nine scalars, since they should reside in registers 1186: |||V===> z_qt_times_d(1) = DOT_PRODUCT(lsq_qtmat_c(jc, 1, 1:9, jb), z_d(1:9, jc, jk)) 1187: |||V===> z_qt_times_d(2) = DOT_PRODUCT(lsq_qtmat_c(jc, 2, 1:9, jb), z_d(1:9, jc, jk)) 1188: |||V===> z_qt_times_d(3) = DOT_PRODUCT(lsq_qtmat_c(jc, 3, 1:9, jb), z_d(1:9, jc, jk)) 1189: |||V===> z_qt_times_d(4) = DOT_PRODUCT(lsq_qtmat_c(jc, 4, 1:9, jb), z_d(1:9, jc, jk)) ``` For the library-variant of `recon_lsq_cell_c` we see that the DOT_PRODUCT gets vectorized rather than the horizontal `jc` loop, which leads to a significant performance penalty. Testing revealed, that the desired vectorization can be achieved by ensuring that a complete unrolling of the dot product is performed by the compiler. Unrolling is achieved by changing the compile option `floop-unroll-completely=m` from `m=8` to `m=10`. This makes sense, since the loop count for the dot product in the example given is `m=9`. ## How do you fix it In order to minimize the possibility of unexpected side-effects, the updated compiler option is applied only locally to the module `mo_lib_divrot`, by adding the directive ``` !NEC$ options "-floop-unroll-completely=10" ``` to the top of the module. The resulting compile listing with corrected vectorization is given below: ``` 1180: |+-----> DO jk = slev, elev 1181: || 1182: || !$ACC LOOP VECTOR PRIVATE(z_qt_times_d) 1183: || !NEC$ ivdep 1184: ||V----> DO jc = i_startidx, i_endidx 1185: ||| 1186: ||| ! calculate matrix vector product Q^T d (transposed of Q times LHS) 1187: ||| ! (intrinsic function matmul not applied, due to massive 1188: ||| ! performance penalty on the NEC. Instead the intrinsic dot product 1189: ||| ! function is applied 1190: ||| !TODO: these should be nine scalars, since they should reside in registers 1191: |||*===> z_qt_times_d(1) = DOT_PRODUCT(lsq_qtmat_c(jc, 1, 1:9, jb), z_d(1:9, jc, jk)) 1192: |||*===> z_qt_times_d(2) = DOT_PRODUCT(lsq_qtmat_c(jc, 2, 1:9, jb), z_d(1:9, jc, jk)) 1193: |||*===> z_qt_times_d(3) = DOT_PRODUCT(lsq_qtmat_c(jc, 3, 1:9, jb), z_d(1:9, jc, jk)) 1194: |||*===> z_qt_times_d(4) = DOT_PRODUCT(lsq_qtmat_c(jc, 4, 1:9, jb), z_d(1:9, jc, jk)) 1195: |||*===> z_qt_times_d(5) = DOT_PRODUCT(lsq_qtmat_c(jc, 5, 1:9, jb), z_d(1:9, jc, jk)) 1196: |||*===> z_qt_times_d(6) = DOT_PRODUCT(lsq_qtmat_c(jc, 6, 1:9, jb), z_d(1:9, jc, jk)) ``` Approved-by:Pradipta Samanta <samanta@dkrz.de> Merged-by:
Pradipta Samanta <samanta@dkrz.de> Changelog: default
Name | Last commit | Last update |
---|