When built for the CPU, the version with the loops collapsed reduces total time in 3 of the 5 parallel regions compared with the baseline built for the same CPU target. This contrasts with the data from the CORE2 tracer dwarf performance evaluation, where most of the kernels saw an increase in CPU time when moving from the baseline to the version with the loops collapsed. The “OpenMP vs OpenACC” performance evaluation shows a similar pattern of improvement and degradation from collapsing the loops (improvement for UPWH and UPWV, degradation for MFCT and QR4C) for both tools.
...
...
@@ -55,4 +55,4 @@ The version with the loops collapsed, when built for the CPU, presents in 3 of t
The following plot shows the achieved speedups for each kernel/parallel region, for the overall combined compute time of all the kernels, and for the total dwarf time (including MPI communication time). Compute time is defined as the sum of all kernel times.
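
The three quantities in the plot can be sketched as follows. This is a minimal illustration, not the evaluation's actual tooling: it assumes speedup is the ratio of baseline time to collapsed-loop time (so values above 1 mean the collapsed version is faster), and the function names `speedup`, `compute_time` and `summarize` are hypothetical.

```python
def speedup(baseline_s, candidate_s):
    """Assumed definition: baseline time divided by candidate time."""
    if candidate_s <= 0:
        raise ValueError("candidate time must be positive")
    return baseline_s / candidate_s


def compute_time(kernel_times):
    """Compute time as defined in the text: the sum of all kernel times."""
    return sum(kernel_times.values())


def summarize(baseline_kernels, candidate_kernels, baseline_total, candidate_total):
    """Return per-kernel, combined-compute and total-dwarf speedups.

    baseline_kernels / candidate_kernels map kernel name -> time in seconds.
    baseline_total / candidate_total are total dwarf times, which also
    include MPI communication time and so are passed in separately.
    """
    per_kernel = {
        name: speedup(baseline_kernels[name], candidate_kernels[name])
        for name in baseline_kernels
    }
    return {
        "per_kernel": per_kernel,
        "compute": speedup(
            compute_time(baseline_kernels), compute_time(candidate_kernels)
        ),
        "total": speedup(baseline_total, candidate_total),
    }
```

Because compute time is a sum, the combined compute speedup is dominated by the kernels with the largest absolute times. A degradation in a small kernel can therefore coexist with an overall compute speedup above 1, and the reverse.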