Skip to content
Snippets Groups Projects
  1. Dec 20, 2024
    • Daniel Reinert's avatar
      Add NEC-specific compiler option to mo_lib_divrot (!20) · 9b808d6d
      Daniel Reinert authored and Pradipta Samanta's avatar Pradipta Samanta committed
      
      ## What is the bug
      
      The least squares polynomial reconstruction routines `recon_lsq_cell_XY` have been extracted from the main ICON code and moved to the library `libiconmath`. Testing on NEC@DWD revealed that some of these routines did not vectorize anymore after moving them to `libiconmath`. To illustrate the issue, the compile listing of the loop body is given below for `recon_lsq_cell_c_lib`
      ```
       1175: |+----->        DO jk = slev, elev
        1176: ||        
        1177: ||                !$ACC LOOP VECTOR PRIVATE(z_qt_times_d)
        1178: ||        !NEC$ ivdep
        1179: ||+---->          DO jc = i_startidx, i_endidx
        1180: |||       
        1181: |||                 ! calculate matrix vector product Q^T d (transposed of Q times LHS)
        1182: |||                 ! (intrinsic function matmul not applied, due to massive
        1183: |||                 ! performance penalty on the NEC. Instead the intrinsic dot product
        1184: |||                 ! function is applied
        1185: |||       !TODO:  these should be nine scalars, since they should reside in registers
        1186: |||V===>            z_qt_times_d(1) = DOT_PRODUCT(lsq_qtmat_c(jc, 1, 1:9, jb), z_d(1:9, jc, jk))
        1187: |||V===>            z_qt_times_d(2) = DOT_PRODUCT(lsq_qtmat_c(jc, 2, 1:9, jb), z_d(1:9, jc, jk))
        1188: |||V===>            z_qt_times_d(3) = DOT_PRODUCT(lsq_qtmat_c(jc, 3, 1:9, jb), z_d(1:9, jc, jk))
        1189: |||V===>            z_qt_times_d(4) = DOT_PRODUCT(lsq_qtmat_c(jc, 4, 1:9, jb), z_d(1:9, jc, jk))
      ```
      For the library-variant of `recon_lsq_cell_c` we see that the DOT_PRODUCT gets vectorized rather than the horizontal `jc` loop, which leads to a significant performance penalty. Testing revealed, that the desired vectorization can be achieved by ensuring that a complete unrolling of the dot product is performed by the compiler. Unrolling is achieved by changing the compile option `floop-unroll-completely=m` from `m=8` to `m=10`. This makes sense, since the loop count for the dot product in the example given is `m=9`. 
      
      ## How do you fix it
      
      In order to minimize the possibility of unexpected side-effects, the updated compiler option is applied only locally to the module  `mo_lib_divrot`, by adding the directive
      ```
      !NEC$ options "-floop-unroll-completely=10"
      ```
      to the top of the module. The resulting compile listing with corrected vectorization is given below:
      ```
        1180: |+----->        DO jk = slev, elev
        1181: ||        
        1182: ||                !$ACC LOOP VECTOR PRIVATE(z_qt_times_d)
        1183: ||        !NEC$ ivdep
        1184: ||V---->          DO jc = i_startidx, i_endidx
        1185: |||       
        1186: |||                 ! calculate matrix vector product Q^T d (transposed of Q times LHS)
        1187: |||                 ! (intrinsic function matmul not applied, due to massive
        1188: |||                 ! performance penalty on the NEC. Instead the intrinsic dot product
        1189: |||                 ! function is applied
        1190: |||       !TODO:  these should be nine scalars, since they should reside in registers
        1191: |||*===>            z_qt_times_d(1) = DOT_PRODUCT(lsq_qtmat_c(jc, 1, 1:9, jb), z_d(1:9, jc, jk))
        1192: |||*===>            z_qt_times_d(2) = DOT_PRODUCT(lsq_qtmat_c(jc, 2, 1:9, jb), z_d(1:9, jc, jk))
        1193: |||*===>            z_qt_times_d(3) = DOT_PRODUCT(lsq_qtmat_c(jc, 3, 1:9, jb), z_d(1:9, jc, jk))
        1194: |||*===>            z_qt_times_d(4) = DOT_PRODUCT(lsq_qtmat_c(jc, 4, 1:9, jb), z_d(1:9, jc, jk))
        1195: |||*===>            z_qt_times_d(5) = DOT_PRODUCT(lsq_qtmat_c(jc, 5, 1:9, jb), z_d(1:9, jc, jk))
        1196: |||*===>            z_qt_times_d(6) = DOT_PRODUCT(lsq_qtmat_c(jc, 6, 1:9, jb), z_d(1:9, jc, jk))
      ```
      
      Approved-by: default avatarPradipta Samanta <samanta@dkrz.de>
      Merged-by: default avatarPradipta Samanta <samanta@dkrz.de>
      Changelog: default
  2. Sep 24, 2024
  3. Sep 23, 2024
  4. Aug 27, 2024
  5. Aug 23, 2024
  6. Aug 21, 2024
  7. Aug 16, 2024
  8. Aug 15, 2024
  9. Aug 13, 2024
  10. Aug 12, 2024
  11. Aug 08, 2024
  12. Aug 07, 2024
Loading