Refactor backtrace

Jonas Jucker requested to merge backtrace into master

What happens when finish is called:

  1. call backtrace passed as callback
  2. raise SIGSEGV in traceback function
  3. signal handler aborts MPI-procs

What the code implies happens when finish is called:

  1. call backtrace passed as callback
  2. call model_abort passed as callback
  3. model_abort calls MPI_ABORT
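
The intended flow above can be sketched in C as follows. This is only an illustration under stated assumptions: the real finish is a Fortran routine in mo_exception.f90, and every name here (finish_hook, register_finish_hooks, finish_sketch, backtrace_stub, model_abort_stub) is hypothetical. The stubs only record the call order; a real model_abort would call MPI_Abort.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical callback type; the real finish is a Fortran routine. */
typedef void (*finish_hook)(void);

static finish_hook backtrace_hook = NULL;
static finish_hook abort_hook = NULL;

/* Records the order in which the hooks ran (1 = backtrace, 2 = abort). */
static int call_order[2];
static int n_calls = 0;

/* Store the two callbacks that finish_sketch will invoke. */
void register_finish_hooks(finish_hook on_backtrace, finish_hook on_abort) {
    assert(on_backtrace != NULL && on_abort != NULL);
    backtrace_hook = on_backtrace;
    abort_hook = on_abort;
}

/* Stand-in for the backtrace callback: only records that it ran. */
void backtrace_stub(void) {
    if (n_calls < 2) call_order[n_calls++] = 1;
}

/* Stand-in for model_abort: a real one would call MPI_Abort here. */
void model_abort_stub(void) {
    if (n_calls < 2) call_order[n_calls++] = 2;
}

/* Step 1: print the backtrace. Step 2: hand over to the abort callback.
 * No signal is raised anywhere on this path. */
void finish_sketch(const char *routine, const char *message) {
    fprintf(stderr, "FINISH in %s: %s\n", routine, message);
    if (backtrace_hook) backtrace_hook();
    if (abort_hook) abort_hook();
}
```

Calling register_finish_hooks(backtrace_stub, model_abort_stub) and then finish_sketch(...) runs the backtrace hook first and the abort hook second, which is the order the code is meant to follow.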

This MR moves the code path back to what should happen, with the following changes:

  • remove every raise(SIGSEGV) from the traceback function, except on NEC, which offers no better traceback option
  • add a fallback with a primitive backtrace if unwind.h and pthreads.h are not available
[util_backtrace]: /hpc/sw/buildbot/home/data0/DWD_nec/build/bin/icon() [0x6000061e7c98]
[util_backtrace]: /hpc/sw/buildbot/home/data0/DWD_nec/build/bin/icon() [0x6000061b1d28]
[util_backtrace]: /hpc/sw/buildbot/home/data0/DWD_nec/build/bin/icon() [0x600006168570]
[util_backtrace]: /hpc/sw/buildbot/home/data0/DWD_nec/build/bin/icon() [0x600003cbedc0]
[util_backtrace]: /hpc/sw/buildbot/home/data0/DWD_nec/build/bin/icon() [0x600002499fc0]
[util_backtrace]: /hpc/sw/buildbot/home/data0/DWD_nec/build/bin/icon() [0x600000fe73b0]
[util_backtrace]: /hpc/sw/buildbot/home/data0/DWD_nec/build/bin/icon() [0x600000fd0eb8]
[util_backtrace]: /hpc/sw/buildbot/home/data0/DWD_nec/build/bin/icon() [0x600000fc4280]
[util_backtrace]: /hpc/sw/buildbot/home/data0/DWD_nec/build/bin/icon() [0x600000387d18]
[util_backtrace]: /hpc/sw/buildbot/home/data0/DWD_nec/build/bin/icon() [0x60000014d1b0]
[util_backtrace]: /hpc/sw/buildbot/home/data0/DWD_nec/build/bin/icon() [0x600000033cc0]
[util_backtrace]: /hpc/sw/buildbot/home/data0/DWD_nec/build/bin/icon() [0x6000092a4b08]
[util_backtrace]: /opt/nec/ve/lib/libc.so.6(__libc_start_main+0x3a8) [0x600c02040818]
[util_backtrace]: /hpc/sw/buildbot/home/data0/DWD_nec/build/bin/icon() [0x600000014108]
[util_backtrace]: Use addr2line for addresses to line number conversion.
  • fallback for (__APPLE__)
No backtrace for APPLE available
  • fallback for (__NEC__)
no backtrace for NEC available, raise SIGSEGV instead 
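
The fallback selection listed above might look roughly like the following sketch. It is not the code from util_backtrace.c: the function name primitive_backtrace and the MAX_FRAMES limit are hypothetical, the macro spellings __APPLE__ and __NEC__ are assumed, and the generic branch assumes a glibc-style execinfo.h providing backtrace() and backtrace_symbols().

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

#if !defined(__APPLE__) && !defined(__NEC__)
#include <execinfo.h> /* assumption: glibc-style backtrace() is available */
#endif

#define MAX_FRAMES 64 /* hypothetical limit on the number of printed frames */

/* Print a primitive backtrace to stderr and return the number of frames
 * printed (0 on platforms where no backtrace is available). */
int primitive_backtrace(void) {
#if defined(__APPLE__)
    fprintf(stderr, "No backtrace for APPLE available\n");
    return 0;
#elif defined(__NEC__)
    /* The one place where raising SIGSEGV is kept. */
    fprintf(stderr, "no backtrace for NEC available, raise SIGSEGV instead\n");
    raise(SIGSEGV);
    return 0;
#else
    void *frames[MAX_FRAMES];
    int depth = backtrace(frames, MAX_FRAMES);
    assert(depth >= 0 && depth <= MAX_FRAMES);

    /* backtrace_symbols allocates one block that the caller must free. */
    char **symbols = backtrace_symbols(frames, depth);
    if (symbols == NULL) return 0;

    for (int i = 0; i < depth; i++)
        fprintf(stderr, "[util_backtrace]: %s\n", symbols[i]);
    fprintf(stderr, "[util_backtrace]: Use addr2line for addresses to line number conversion.\n");

    free(symbols);
    return depth;
#endif
}
```

In the generic branch the output contains only object names and raw addresses, which is why the last line points the reader to addr2line.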

levante_gcc

 227: #0 0x1d1b0f8 - stacktrace_print in /work/mh0287/buildbot/levante5/levante_gcc/build/externals/fortran-support/src/util_backtrace.c:177
227: #1 0x1cf9299 - __mo_exception_MOD_finish in /work/mh0287/buildbot/levante5/levante_gcc/build/externals/fortran-support/src/mo_exception.f90:238
227: #2 0x16591e5 - __mo_nwp_gscp_interface_MOD_nwp_microphysics in /work/mh0287/buildbot/levante5/levante_gcc/build/src/atm_phy_nwp/mo_nwp_gscp_interface.f90:652
227: #3 0x134df36 - __mo_nh_interface_nwp_MOD_nwp_nh_interface in /work/mh0287/buildbot/levante5/levante_gcc/build/src/atm_phy_nwp/mo_nh_interface_nwp.f90:803
227: #4 0xec705f - __mo_nh_stepping_MOD_integrate_nh in /work/mh0287/buildbot/levante5/levante_gcc/build/src/atm_dyn_iconam/mo_nh_stepping.f90:2115
227: #5 0xecd287 - allocate_nh_stepping in /work/mh0287/buildbot/levante5/levante_gcc/build/src/atm_dyn_iconam/mo_nh_stepping.f90:1187
227: #6 0x710410 - __mo_atmo_nonhydrostatic_MOD_atmo_nonhydrostatic in /work/mh0287/buildbot/levante5/levante_gcc/build/src/drivers/mo_atmo_nonhydrostatic.f90:245
227: #7 0x44cb8c - __mo_atmo_model_MOD_atmo_model in /work/mh0287/buildbot/levante5/levante_gcc/build/src/drivers/mo_atmo_model.f90:209
227: #8 0x40f0f1 - MAIN__ in icon.f90:0

levante_intel

 64: #0 0x2661c41 - mo_exception_mp_finish_ in /work/mh0287/buildbot/levante5/levante_intel/build/externals/fortran-support/src/mo_exception.f90:238
 64: #1 0x1e060d1 - mo_nwp_gscp_interface_mp_nwp_microphysics_ in /work/mh0287/buildbot/levante5/levante_intel/build/src/atm_phy_nwp/mo_nwp_gscp_interface.f90:652
 64: #2 0x19f1f03 - mo_nh_interface_nwp_mp_nwp_nh_interface_ in /work/mh0287/buildbot/levante5/levante_intel/build/src/atm_phy_nwp/mo_nh_interface_nwp.f90:789
 64: #3 0x130a31f - mo_nh_stepping_mp_integrate_nh_ in /work/mh0287/buildbot/levante5/levante_intel/build/src/atm_dyn_iconam/mo_nh_stepping.f90:2091
 64: #4 0x12f8fa1 - mo_nh_stepping_mp_perform_nh_timeloop_ in /work/mh0287/buildbot/levante5/levante_intel/build/src/atm_dyn_iconam/mo_nh_stepping.f90:1166
 64: #5 0x12f3df9 - deallocate_nh_stepping in /work/mh0287/buildbot/levante5/levante_intel/build/src/atm_dyn_iconam/mo_nh_stepping.f90:3326
 64: #6 0x8e44db - mo_atmo_nonhydrostatic_mp_atmo_nonhydrostatic_ in /work/mh0287/buildbot/levante5/levante_intel/build/src/drivers/mo_atmo_nonhydrostatic.f90:245
 64: #7 0x494698 - mo_atmo_model_mp_atmo_model_ in /work/mh0287/buildbot/levante5/levante_intel/build/src/drivers/mo_atmo_model.f90:202
 64: #8 0x41e959 - MAIN__ in /work/mh0287/buildbot/levante5/levante_intel/build/src/drivers/icon.f90:227
 64: #9 0x41e362 - main in /work/mh0287/buildbot/levante5/levante_intel/build/bin/icon:0
 64: #10 0x7fff7c791cf3 - ?? in /usr/lib64/libc-2.28.so:0
 64: #11 0x41e26e - _start in /work/mh0287/buildbot/levante5/levante_intel/build/bin/icon:0

levante_nag

 87: #0 0xa2055d8 - stacktrace_print in /work/mh0287/buildbot/levante1/levante_nag/build/externals/fortran-support/src/util_backtrace.c:177
 87: #1 0xa1b728e - mo_util_backtrace_MP_ftn_util_backtrace in /work/mh0287/buildbot/levante1/levante_nag/build/externals/fortran-support/src/mo_util_backtrace.f90:25
 87: #2 0xa147e35 - mo_exception_MP_finish in /work/mh0287/buildbot/levante1/levante_nag/build/externals/fortran-support/src/mo_exception.f90:238
 87: #3 0x82be13c - mo_nwp_gscp_interface_MP_nwp_microphysics in /work/mh0287/buildbot/levante1/levante_nag/build/src/atm_phy_nwp/mo_nwp_gscp_interface.f90:652
 87: #4 0x6d8e0ee - mo_nh_interface_nwp_MP_nwp_nh_interface in /work/mh0287/buildbot/levante1/levante_nag/build/src/atm_phy_nwp/mo_nh_interface_nwp.f90:789
 87: #5 0x462e945 - mo_nh_stepping_MP_integrate_nh in /work/mh0287/buildbot/levante1/levante_nag/build/src/atm_dyn_iconam/mo_nh_stepping.f90:2091
 87: #6 0x4654f3a - mo_nh_stepping_MP_perform_nh_timeloop in /work/mh0287/buildbot/levante1/levante_nag/build/src/atm_dyn_iconam/mo_nh_stepping.f90:1166
 87: #7 0x4676a41 - mo_nh_stepping_MP_perform_nh_stepping in /work/mh0287/buildbot/levante1/levante_nag/build/src/atm_dyn_iconam/mo_nh_stepping.f90:703
 87: #8 0x1af4401 - mo_atmo_nonhydrostatic_MP_atmo_nonhydrostatic in /work/mh0287/buildbot/levante1/levante_nag/build/src/drivers/mo_atmo_nonhydrostatic.f90:243
 87: #9 0x61bde0 - mo_atmo_model_MP_atmo_model in /work/mh0287/buildbot/levante1/levante_nag/build/src/drivers/mo_atmo_model.f90:204
 87: #10 0x426c01 - icon_ in /work/mh0287/buildbot/levante1/levante_nag/build/src/drivers/icon.f90:227
 87: #11 0x424222 - main in /work/mh0287/buildbot/levante1/levante_nag/build/src/drivers/icon.f90:16
 87: #12 0x7fff7c5dfcf3 - ?? in /usr/lib64/libc-2.28.so:0

levante_cpu_nvhpc

 0: #0 0x1808e88 - mo_util_backtrace_ftn_util_backtrace_ in /work/mh0287/buildbot/levante4/levante_cpu_nvhpc/build/externals/fortran-support/src/mo_util_backtrace.f90:25
 0: #1 0x17f3f49 - mo_exception_finish_ in /work/mh0287/buildbot/levante4/levante_cpu_nvhpc/build/externals/fortran-support/src/mo_exception.f90:238
 0: #2 0xd4e411 - mo_nh_stepping_perform_nh_stepping_ in /work/mh0287/buildbot/levante4/levante_cpu_nvhpc/build/src/atm_dyn_iconam/mo_nh_stepping.f90:297
 0: #3 0x720120 - mo_atmo_nonhydrostatic_atmo_nonhydrostatic_ in /work/mh0287/buildbot/levante4/levante_cpu_nvhpc/build/src/drivers/mo_atmo_nonhydrostatic.f90:238
 0: #4 0x463ae7 - mo_atmo_model_atmo_model_ in /work/mh0287/buildbot/levante4/levante_cpu_nvhpc/build/src/drivers/mo_atmo_model.f90:209
 0: #5 0x410483 - MAIN_ in /work/mh0287/buildbot/levante4/levante_cpu_nvhpc/build/src/drivers/icon.f90:265
 0: #6 0x40fcf3 - main in /work/mh0287/buildbot/levante4/levante_cpu_nvhpc/build/bin/icon:0
 0: #7 0x7fff7a83bcf3 - ?? in /usr/lib64/libc-2.28.so:0
 0: #8 0x40fbee - _start in /work/mh0287/buildbot/levante4/levante_cpu_nvhpc/build/bin/icon:0
 0: #9 (nil) - ?? in ??:

levante_gpu_nvhpc

0: #0 0x245ed87 - mo_exception_finish_ in /work/mh0287/buildbot/levante5/levante_gpu_nvhpc/build/externals/fortran-support/src/mo_exception.f90:238
0: #1 0x102c46f - mo_nh_stepping_perform_nh_stepping_ in /work/mh0287/buildbot/levante5/levante_gpu_nvhpc/build/src/atm_dyn_iconam/mo_nh_stepping.f90:297
0: #2 0x791089 - mo_atmo_nonhydrostatic_atmo_nonhydrostatic_ in /work/mh0287/buildbot/levante5/levante_gpu_nvhpc/build/src/drivers/mo_atmo_nonhydrostatic.f90:238
0: #3 0x475777 - mo_atmo_model_atmo_model_ in /work/mh0287/buildbot/levante5/levante_gpu_nvhpc/build/src/drivers/mo_atmo_model.f90:209
0: #4 0x412ea4 - MAIN_ in /work/mh0287/buildbot/levante5/levante_gpu_nvhpc/build/src/drivers/icon.f90:265
0: #5 0x412833 - main in /work/mh0287/buildbot/levante5/levante_gpu_nvhpc/build/bin/icon:0
0: #6 0x7fff79de5cf3 - ?? in /usr/lib64/libc-2.28.so:0
0: #7 0x41270e - _start in /work/mh0287/buildbot/levante5/levante_gpu_nvhpc/build/bin/icon:0
0: #8 (nil) - ?? in ??

daint_cpu_cce

#0 0x2dbccda - resize_column$mo_util_table_ in /scratch/snx3000/icontest/buildbot/icon-new/daint103/DAINT_CPU_cce/build/bin/icon:0
#1 0x2d8db7b - stack_op_plus_eval_2d$mo_expression_ in /scratch/snx3000/icontest/buildbot/icon-new/daint103/DAINT_CPU_cce/build/bin/icon:0
#2 0x19adab1 - perform_nh_timeloop$mo_nh_stepping_ in /scratch/snx3000/icontest/buildbot/icon-new/daint103/DAINT_CPU_cce/build/bin/icon:0
#3 0xc83361 - allocate_grf_state$mo_grf_intp_state_ in /scratch/snx3000/icontest/buildbot/icon-new/daint103/DAINT_CPU_cce/build/bin/icon:0
#4 0x482f88 - write_geometry_info$mo_grid_geometry_info_ in /scratch/snx3000/icontest/buildbot/icon-new/daint103/DAINT_CPU_cce/build/bin/icon:0
#5 0x419cf9 - p_recv_int_3d$mo_mpi_ in /scratch/snx3000/icontest/buildbot/icon-new/daint103/DAINT_CPU_cce/build/bin/icon:0
#6 0x15554cd5b3ea - ?? in /lib64/libc-2.26.so:0

Question

Could we also use this traceback in util_signal.c for better feedback from the signal trap?

Edited by Jonas Jucker
