Cleanup for next release
-
This branch collects several fixes that make the recent features work in the various possible build configurations. The code should now be valgrind-clean.
-
The work-arounds for the various recent bad MPICH releases are not at all what I want them to be. I tried several more ambitious approaches but came up empty-handed, because libmpich.so is unfortunately quite hostile to interposition-based work-arounds: the internal routines of MPICH are not accessible in any way, and I couldn't build a successful work-around without access to this "plumbing".
-
I created a merge request to discuss what might still need fixing or might be missing from the current master branch.
-
@k202077 Can you comment on the following issue I see when building with pgcc and OpenACC active but testing on a CPU-only node:
    (gdb) c
    Continuing.
    [New Thread 0x7ffff0a1e700 (LWP 3594590)]
    [New Thread 0x7fffebfff700 (LWP 3594591)]
    [New Thread 0x7fffea28e700 (LWP 3594592)]
    -----------------------------------------------------------------------
    WARNING: yaxt was compiled with CUDA-support, but the library could not be loaded.
    CUDA-support will be deactivated. Try setting LD_LIBRARY_PATH to the location of libcuda.so.1 or set RPATH accordingly.
    To suppress this message set the XT_CUDA_WARN_ON_MISSING_LIBCUDA environment variable to "0"
    -----------------------------------------------------------------------
    Failing in Thread:0
    call to cuInit returned error 34: Other

    Thread 1 "test_ddt" hit Breakpoint 1, 0x00007ffff518cc20 in exit () from /usr/lib64/libc.so.6
    Missing separate debuginfos, use: yum debuginfo-install libibverbs-55mlnx37-1.55103.x86_64 librdmacm-55mlnx37-1.55103.x86_64
    (gdb) bt
    #0 0x00007ffff518cc20 in exit () from /usr/lib64/libc.so.6
    #1 0x00007ffff744c95a in __pgi_uacc_exit (s=0x7b4bd0 "call to cuInit returned error 34: Other\n") at ../../src/init.c:777
    #2 0x00007ffff6ed7d36 in __pgi_uacc_cuda_error_handler () from /sw/spack-levante/nvhpc-22.5-v4oky3/Linux_x86_64/22.5/compilers/lib/libaccdevice.so
    #3 0x00007ffff6ee7f1f in __pgi_uacc_cuda_init () at ../../src/cuda_init.c:448
    #4 0x00007ffff744bd4d in __pgi_uacc_enumerate () at ../../src/init.c:553
    #5 0x00007ffff744c16c in __pgi_uacc_initialize () at ../../src/init.c:638
    #6 0x00007ffff744c18f in __pgi_uacc_initialize_all () at ../../src/init.c:652
    #7 0x00007ffff7440a3b in __pgi_uacc_dataenterstart2 (
        filename=0x415c60 <.F0003.7263> "/home/k/k202069/Documents/work/dkrz/build/yaxt-x64-linux-nvidia-22.5.0-openmpi-4.1.2/tests/../../../src/yaxt/tests/test_ddt.c",
        funcname=0x415ce0 <.F0004.7265> "check_xt_ddt_off", lineno=1047, endlineno=1097, funcstartlineno=1005, funcendlineno=1108,
        async=-1, pdevid=0x7fffffff54c4, psavedevid=0x7fffffff5444, parentconstruct=acc_construct_data) at ../../src/dataenterstart.c:57
    #8 0x000000000040572c in check_xt_ddt_off (mpi_ddt=<optimized out>, in_data_=<optimized out>, in_data_offset=<optimized out>, in_data_size=4, ref_pack_size=4) at ../../../src/yaxt/tests/test_ddt.c:1047
    #9 0x0000000000402d66 in main (argc=1, argv=0x7fffffffac38) at ../../../src/yaxt/tests/test_ddt.c:1114
    (gdb)
It seems to me that the OpenACC pragmas are missing some conditional on whether initialization of a GPU context succeeded or not?
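To make the question concrete, here is a rough sketch of the kind of guard I have in mind, using the standard OpenACC `if` clause on the data and compute constructs. The function `sum_ints` and its `use_gpu` parameter are made up for illustration and are not taken from test_ddt.c; `use_gpu` stands in for whatever flag records that a GPU context could be set up. Whether the nvhpc runtime really skips device initialization when the clause evaluates to false is an assumption that would still need to be checked on a CPU-only node.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example, not code from yaxt: sum n ints, offloading only
 * when use_gpu is non-zero. With use_gpu == 0 the if clauses disable the
 * data region and run the loop on the host. Without OpenACC support the
 * pragmas are ignored and the loop is plain host code. */
long sum_ints(const int *data, size_t n, int use_gpu)
{
  assert(data != NULL || n == 0);
  long sum = 0;
#pragma acc data copyin(data[0:n]) if(use_gpu)
  {
#pragma acc parallel loop reduction(+:sum) if(use_gpu)
    for (size_t i = 0; i < n; ++i)
      sum += data[i];
  }
  return sum;
}
```

In the test, the flag would presumably have to come from the same check that currently prints the libcuda warning, so that the data region at test_ddt.c:1047 is only active when that check passed.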
-
Please don't just merge this MR; as stated above, I'd like to settle some discussion first.