Add NEC-specific compiler option to mo_lib_divrot


Merged Daniel Reinert requested to merge libiconmath_nec_hotfix into main 4 months ago


What is the bug

The least squares polynomial reconstruction routines recon_lsq_cell_XY have been extracted from the main ICON code and moved to the library libiconmath. Testing on NEC@DWD revealed that some of these routines no longer vectorize as intended after the move. To illustrate the issue, the compile listing of the loop body of recon_lsq_cell_c_lib is given below:

  1175: |+----->        DO jk = slev, elev
  1176: ||        
  1177: ||                !$ACC LOOP VECTOR PRIVATE(z_qt_times_d)
  1178: ||        !NEC$ ivdep
  1179: ||+---->          DO jc = i_startidx, i_endidx
  1180: |||       
  1181: |||                 ! calculate matrix vector product Q^T d (transposed of Q times LHS)
  1182: |||                 ! (intrinsic function matmul not applied, due to massive
  1183: |||                 ! performance penalty on the NEC. Instead the intrinsic dot product
  1184: |||                 ! function is applied
  1185: |||       !TODO:  these should be nine scalars, since they should reside in registers
  1186: |||V===>            z_qt_times_d(1) = DOT_PRODUCT(lsq_qtmat_c(jc, 1, 1:9, jb), z_d(1:9, jc, jk))
  1187: |||V===>            z_qt_times_d(2) = DOT_PRODUCT(lsq_qtmat_c(jc, 2, 1:9, jb), z_d(1:9, jc, jk))
  1188: |||V===>            z_qt_times_d(3) = DOT_PRODUCT(lsq_qtmat_c(jc, 3, 1:9, jb), z_d(1:9, jc, jk))
  1189: |||V===>            z_qt_times_d(4) = DOT_PRODUCT(lsq_qtmat_c(jc, 4, 1:9, jb), z_d(1:9, jc, jk))

For the library variant of recon_lsq_cell_c, the listing shows that the DOT_PRODUCT statements get vectorized (marked V) rather than the horizontal jc loop (marked +), which leads to a significant performance penalty. Testing revealed that the desired vectorization of the jc loop can be achieved by ensuring that the compiler unrolls the dot product completely. Complete unrolling is obtained by changing the compile option -floop-unroll-completely=m from m=8 to m=10. This is plausible, since the dot product in the example has a loop count of 9, which lies above the previous limit of 8 but within the new limit of 10.
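To make "complete unrolling" concrete, the following is a minimal sketch of what the compiler is expected to do with one of the dot products: the nine-term inner loop disappears, leaving the jc loop as the only loop to vectorize. The program, the array names qt and d, and the block length nproma are hypothetical and chosen for illustration; they are not taken from libiconmath.

```fortran
! Sketch only: names, shapes and nproma are hypothetical, not libiconmath code.
program unroll_sketch
  implicit none
  integer, parameter :: wp = selected_real_kind(12, 307)  ! double precision
  integer, parameter :: nproma = 8                        ! hypothetical block length
  real(wp) :: qt(nproma, 9)   ! one row of Q^T per cell, jc is the fast index
  real(wp) :: d(9, nproma)    ! right-hand side vector per cell
  real(wp) :: r_dot(nproma), r_unrolled(nproma)
  integer  :: jc, i

  ! Fill with small integer values so both summation orders are exact.
  do jc = 1, nproma
    do i = 1, 9
      qt(jc, i) = real(jc + i, wp)
      d(i, jc)  = real(jc - i, wp)
    end do
  end do

  ! Form 1: intrinsic DOT_PRODUCT, i.e. an implicit inner loop of length 9.
  ! If this inner loop is vectorized, the jc loop is left unvectorized.
  do jc = 1, nproma
    r_dot(jc) = dot_product(qt(jc, 1:9), d(1:9, jc))
  end do

  ! Form 2: the same dot product written out by hand. No inner loop remains,
  ! so jc is the only candidate for vectorization.
  do jc = 1, nproma
    r_unrolled(jc) = qt(jc, 1) * d(1, jc) + qt(jc, 2) * d(2, jc) &
      &            + qt(jc, 3) * d(3, jc) + qt(jc, 4) * d(4, jc) &
      &            + qt(jc, 5) * d(5, jc) + qt(jc, 6) * d(6, jc) &
      &            + qt(jc, 7) * d(7, jc) + qt(jc, 8) * d(8, jc) &
      &            + qt(jc, 9) * d(9, jc)
  end do

  ! Both forms must give identical results for these exact inputs.
  if (any(r_dot /= r_unrolled)) error stop "unrolled form differs from DOT_PRODUCT"
  print *, "results agree"
end program unroll_sketch
```

The hand-written form is only meant to show the transformation; the merge request itself relies on the compiler option rather than on rewriting the source.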

How do you fix it

To minimize the risk of unexpected side effects, the updated compiler option is applied only to the module mo_lib_divrot, by adding the directive

!NEC$ options "-floop-unroll-completely=10"

to the top of the module. The resulting compile listing with corrected vectorization, in which the jc loop is now marked V and the DOT_PRODUCT statements are marked *, is given below:

  1180: |+----->        DO jk = slev, elev
  1181: ||        
  1182: ||                !$ACC LOOP VECTOR PRIVATE(z_qt_times_d)
  1183: ||        !NEC$ ivdep
  1184: ||V---->          DO jc = i_startidx, i_endidx
  1185: |||       
  1186: |||                 ! calculate matrix vector product Q^T d (transposed of Q times LHS)
  1187: |||                 ! (intrinsic function matmul not applied, due to massive
  1188: |||                 ! performance penalty on the NEC. Instead the intrinsic dot product
  1189: |||                 ! function is applied
  1190: |||       !TODO:  these should be nine scalars, since they should reside in registers
  1191: |||*===>            z_qt_times_d(1) = DOT_PRODUCT(lsq_qtmat_c(jc, 1, 1:9, jb), z_d(1:9, jc, jk))
  1192: |||*===>            z_qt_times_d(2) = DOT_PRODUCT(lsq_qtmat_c(jc, 2, 1:9, jb), z_d(1:9, jc, jk))
  1193: |||*===>            z_qt_times_d(3) = DOT_PRODUCT(lsq_qtmat_c(jc, 3, 1:9, jb), z_d(1:9, jc, jk))
  1194: |||*===>            z_qt_times_d(4) = DOT_PRODUCT(lsq_qtmat_c(jc, 4, 1:9, jb), z_d(1:9, jc, jk))
  1195: |||*===>            z_qt_times_d(5) = DOT_PRODUCT(lsq_qtmat_c(jc, 5, 1:9, jb), z_d(1:9, jc, jk))
  1196: |||*===>            z_qt_times_d(6) = DOT_PRODUCT(lsq_qtmat_c(jc, 6, 1:9, jb), z_d(1:9, jc, jk))
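As a rough illustration of where the directive described above might sit, here is a skeleton of the module file. It assumes that the directive precedes the MODULE statement; the module body shown is a placeholder and not the actual content of mo_lib_divrot. Compilers other than the NEC one treat the line as an ordinary comment, since it starts with "!".

```fortran
!NEC$ options "-floop-unroll-completely=10"
! Hypothetical skeleton: only the directive text above is taken from this
! merge request; its exact position relative to MODULE is an assumption.
MODULE mo_lib_divrot

  IMPLICIT NONE
  PRIVATE

  ! ... reconstruction routines such as recon_lsq_cell_c_lib would follow here ...

END MODULE mo_lib_divrot
```

Because the option is set per file, other modules of the library keep the unrolling limit given on the compiler command line.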


Edited 4 months ago by Pradipta Samanta


Labels

ready for review


Dec 20, 2024
- add NEC-specific compiler option to mo_lib_divrot in order to ensure complete unrolling of dot products · af61c651
  Daniel Reinert authored 4 months ago