Skip to content
Snippets Groups Projects
Daniel Reinert's avatar
Daniel Reinert authored
## What is the bug

The least squares polynomial reconstruction routines `recon_lsq_cell_XY` have been extracted from the main ICON code and moved to the library `libiconmath`. Testing on NEC@DWD revealed that some of these routines did not vectorize anymore after moving them to `libiconmath`. To illustrate the issue, the compile listing of the loop body is given below for `recon_lsq_cell_c_lib`
```
 1175: |+----->        DO jk = slev, elev
  1176: ||        
  1177: ||                !$ACC LOOP VECTOR PRIVATE(z_qt_times_d)
  1178: ||        !NEC$ ivdep
  1179: ||+---->          DO jc = i_startidx, i_endidx
  1180: |||       
  1181: |||                 ! calculate matrix vector product Q^T d (transposed of Q times LHS)
  1182: |||                 ! (intrinsic function matmul not applied, due to massive
  1183: |||                 ! performance penalty on the NEC. Instead the intrinsic dot product
  1184: |||                 ! function is applied
  1185: |||       !TODO:  these should be nine scalars, since they should reside in registers
  1186: |||V===>            z_qt_times_d(1) = DOT_PRODUCT(lsq_qtmat_c(jc, 1, 1:9, jb), z_d(1:9, jc, jk))
  1187: |||V===>            z_qt_times_d(2) = DOT_PRODUCT(lsq_qtmat_c(jc, 2, 1:9, jb), z_d(1:9, jc, jk))
  1188: |||V===>            z_qt_times_d(3) = DOT_PRODUCT(lsq_qtmat_c(jc, 3, 1:9, jb), z_d(1:9, jc, jk))
  1189: |||V===>            z_qt_times_d(4) = DOT_PRODUCT(lsq_qtmat_c(jc, 4, 1:9, jb), z_d(1:9, jc, jk))
```
For the library-variant of `recon_lsq_cell_c` we see that the DOT_PRODUCT gets vectorized rather than the horizontal `jc` loop, which leads to a significant performance penalty. Testing revealed, that the desired vectorization can be achieved by ensuring that a complete unrolling of the dot product is performed by the compiler. Unrolling is achieved by changing the compile option `floop-unroll-completely=m` from `m=8` to `m=10`. This makes sense, since the loop count for the dot product in the example given is `m=9`. 

## How do you fix it

In order to minimize the possibility of unexpected side-effects, the updated compiler option is applied only locally to the module  `mo_lib_divrot`, by adding the directive
```
!NEC$ options "-floop-unroll-completely=10"
```
to the top of the module. The resulting compile listing with corrected vectorization is given below:
```
  1180: |+----->        DO jk = slev, elev
  1181: ||        
  1182: ||                !$ACC LOOP VECTOR PRIVATE(z_qt_times_d)
  1183: ||        !NEC$ ivdep
  1184: ||V---->          DO jc = i_startidx, i_endidx
  1185: |||       
  1186: |||                 ! calculate matrix vector product Q^T d (transposed of Q times LHS)
  1187: |||                 ! (intrinsic function matmul not applied, due to massive
  1188: |||                 ! performance penalty on the NEC. Instead the intrinsic dot product
  1189: |||                 ! function is applied
  1190: |||       !TODO:  these should be nine scalars, since they should reside in registers
  1191: |||*===>            z_qt_times_d(1) = DOT_PRODUCT(lsq_qtmat_c(jc, 1, 1:9, jb), z_d(1:9, jc, jk))
  1192: |||*===>            z_qt_times_d(2) = DOT_PRODUCT(lsq_qtmat_c(jc, 2, 1:9, jb), z_d(1:9, jc, jk))
  1193: |||*===>            z_qt_times_d(3) = DOT_PRODUCT(lsq_qtmat_c(jc, 3, 1:9, jb), z_d(1:9, jc, jk))
  1194: |||*===>            z_qt_times_d(4) = DOT_PRODUCT(lsq_qtmat_c(jc, 4, 1:9, jb), z_d(1:9, jc, jk))
  1195: |||*===>            z_qt_times_d(5) = DOT_PRODUCT(lsq_qtmat_c(jc, 5, 1:9, jb), z_d(1:9, jc, jk))
  1196: |||*===>            z_qt_times_d(6) = DOT_PRODUCT(lsq_qtmat_c(jc, 6, 1:9, jb), z_d(1:9, jc, jk))
```

Approved-by: default avatarPradipta Samanta <samanta@dkrz.de>
Merged-by: default avatarPradipta Samanta <samanta@dkrz.de>
Changelog: default
9b808d6d
History
Name Last commit Last update