Bug in internal device context management
Since we reduced the number of device contexts per GPU to one (plus exactly one context for the HOST), we rely on the invariant context_num = gpu_num + 1. The internal context management in the C-bindings (see here), however, creates GPU contexts within range(0, max_gpu), which uses the configured maximum number of GPUs, not the actual count of usable GPUs. This violates the invariant, and the test assert(ctxs.size() == num_gpus+1); fails whenever the two counts disagree.
@m212005 The problem was noticed when using the Mmgr for tmx on GPUs.
(Annoying) Workaround: configure the memory manager with the same GPU count that is actually used in the experiment.
The fix seems to be easy: Limit the internal context creation to the actual number of GPUs. @b382190 @m300488 Could we make a hotfix MR for this?
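To make the mismatch concrete, here is a minimal sketch of the current behaviour next to the proposed one. It is not the actual C-binding code: the Context struct and both create_contexts_* functions are hypothetical stand-ins, and only max_gpu, num_gpus and the size check come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a device context; not the project's real type.
struct Context {
    bool is_host;
    int device_id;  // -1 marks the HOST context
};

// Assumed shape of the current behaviour: one HOST context plus one GPU
// context per *configured* GPU, i.e. for every id in range(0, max_gpu).
std::vector<Context> create_contexts_buggy(int max_gpu) {
    std::vector<Context> ctxs;
    ctxs.push_back({true, -1});
    for (int id = 0; id < max_gpu; ++id) {
        ctxs.push_back({false, id});
    }
    return ctxs;
}

// Proposed behaviour: one HOST context plus one GPU context per GPU that
// is actually usable, so ctxs.size() == num_gpus + 1 holds by construction.
std::vector<Context> create_contexts_fixed(int num_gpus) {
    std::vector<Context> ctxs;
    ctxs.push_back({true, -1});
    for (int id = 0; id < num_gpus; ++id) {
        ctxs.push_back({false, id});
    }
    return ctxs;
}
```

With, say, max_gpu = 4 configured but only num_gpus = 2 usable, the first variant yields 5 contexts and the size check fails, while the second yields 3 and the check passes.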