Cleanup for next release

Thomas Jahns requested to merge pre-release-cleanup into master
  1. This branch collects several fixes to make the recent features work in the various possible build configurations. The branch should now be valgrind-clean.

  2. The work-arounds for the various recent bad MPICH releases are not at all what I want them to be. I tried various more ambitious approaches but came up empty-handed because, unfortunately, libmpich.so is quite hostile to interpositioning work-arounds: the internal routines of MPICH are not accessible in any way, and I couldn't build a successful work-around without access to this "plumbing".

  3. I created a merge request to discuss what might still need fixing or might be missing from the current master branch.

  4. @k202077 Can you comment on the following issue I see when building with pgcc and OpenACC active but testing on a CPU-only node:

     (gdb) c
     Continuing.
     [New Thread 0x7ffff0a1e700 (LWP 3594590)]
     [New Thread 0x7fffebfff700 (LWP 3594591)]
     [New Thread 0x7fffea28e700 (LWP 3594592)]
     -----------------------------------------------------------------------
     WARNING: yaxt was compiled with CUDA-support, but the library could not
              be loaded. CUDA-support will be deactivated. Try setting
              LD_LIBRARY_PATH to the location of libcuda.so.1 or set RPATH
              accordingly.
              To suppress this message set the
              XT_CUDA_WARN_ON_MISSING_LIBCUDA environment variable to "0"
     -----------------------------------------------------------------------
     Failing in Thread:0
     call to cuInit returned error 34: Other
     
     
     Thread 1 "test_ddt" hit Breakpoint 1, 0x00007ffff518cc20 in exit () from /usr/lib64/libc.so.6
     Missing separate debuginfos, use: yum debuginfo-install libibverbs-55mlnx37-1.55103.x86_64 librdmacm-55mlnx37-1.55103.x86_64
     (gdb) bt
     #0  0x00007ffff518cc20 in exit () from /usr/lib64/libc.so.6
     #1  0x00007ffff744c95a in __pgi_uacc_exit (s=0x7b4bd0 "call to cuInit returned error 34: Other\n") at ../../src/init.c:777
     #2  0x00007ffff6ed7d36 in __pgi_uacc_cuda_error_handler ()
        from /sw/spack-levante/nvhpc-22.5-v4oky3/Linux_x86_64/22.5/compilers/lib/libaccdevice.so
     #3  0x00007ffff6ee7f1f in __pgi_uacc_cuda_init () at ../../src/cuda_init.c:448
     #4  0x00007ffff744bd4d in __pgi_uacc_enumerate () at ../../src/init.c:553
     #5  0x00007ffff744c16c in __pgi_uacc_initialize () at ../../src/init.c:638
     #6  0x00007ffff744c18f in __pgi_uacc_initialize_all () at ../../src/init.c:652
     #7  0x00007ffff7440a3b in __pgi_uacc_dataenterstart2 (
         filename=0x415c60 <.F0003.7263> "/home/k/k202069/Documents/work/dkrz/build/yaxt-x64-linux-nvidia-22.5.0-openmpi-4.1.2/tests/../../../src/yaxt/tests/test_ddt.c", funcname=0x415ce0 <.F0004.7265> "check_xt_ddt_off", lineno=1047, endlineno=1097, 
         funcstartlineno=1005, funcendlineno=1108, async=-1, pdevid=0x7fffffff54c4, psavedevid=0x7fffffff5444, 
         parentconstruct=acc_construct_data) at ../../src/dataenterstart.c:57
     #8  0x000000000040572c in check_xt_ddt_off (mpi_ddt=<optimized out>, in_data_=<optimized out>, 
          in_data_offset=<optimized out>, in_data_size=4, ref_pack_size=4) at ../../../src/yaxt/tests/test_ddt.c:1047
     #9  0x0000000000402d66 in main (argc=1, argv=0x7fffffffac38) at ../../../src/yaxt/tests/test_ddt.c:1114
     (gdb)

    It seems to me that the OpenACC pragmas are missing some conditional on whether initialization of a GPU context succeeded.

  5. Please don't just merge this MR; as stated above, I'd like to settle some discussion first.
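
To make the question in item 4 concrete, here is a minimal sketch of what such a conditional could look like, using the OpenACC `if` clause on both the data and the compute construct. The function `scale_by_two` and the flag `use_device` are hypothetical and not taken from test_ddt.c; how the flag would be set (e.g. from the result of yaxt's own CUDA library loading) is left open. Whether the NVHPC runtime actually skips device initialization when the condition is false would still need to be verified on a CPU-only node.

```c
#include <stddef.h>

/* Hypothetical helper, not part of yaxt: doubles each element in place.
 * use_device is an assumed flag that the caller sets to non-zero only
 * if a GPU context was initialized successfully. */
void scale_by_two(double *data, size_t n, int use_device)
{
  /* With if(use_device) evaluating to false, the data construct does not
   * allocate or copy device memory, and the parallel loop is executed
   * on the host instead of being offloaded. */
#pragma acc data copy(data[0:n]) if(use_device)
  {
#pragma acc parallel loop if(use_device)
    for (size_t i = 0; i < n; ++i)
      data[i] *= 2.0;
  }
}
```

When compiled without OpenACC support the pragmas are ignored and the loop runs on the host regardless of the flag, so the same source would cover both build configurations.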

Edited by Thomas Jahns