MPICH 4.x fails the tests
I was looking for a version/configuration of MPICH that would pass YAXT tests.
I have tested the following full matrix:
- MPICH versions: `main` (a.k.a. `4.1.x`), `4.0.x`, `4.0.2`, `3.4.x`, `3.4.3`;
- datatype engines: `yaksa`, `dataloop`;
- YAXT versions: `master`, `0.9.3.1`.
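For reference, the matrix above can be enumerated with a few nested loops. This is only a sketch of the combinations, not the attached script: the function name `list_matrix` and the variable names are made up for illustration, and the actual build and test steps are left out.

```shell
#!/bin/sh
# Values taken from the matrix above; the names below are illustrative only.
mpich_versions="main 4.0.x 4.0.2 3.4.x 3.4.3"
datatype_engines="yaksa dataloop"
yaxt_versions="master 0.9.3.1"

# Print one line per configuration: "<mpich> <engine> <yaxt>".
list_matrix() {
    for mpich in $mpich_versions; do
        for engine in $datatype_engines; do
            for yaxt in $yaxt_versions; do
                echo "$mpich $engine $yaxt"
            done
        done
    done
}

list_matrix
```

That gives 5 x 2 x 2 = 20 configurations, each of which would need its own MPICH build and YAXT test run.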
In all cases, I used the system installation of GCC 8.4.1 (i.e. the one from `/usr/bin`) on Levante.
The result was that all tests passed except for:
- `test_redist_collection_parallel_run` fails in all configurations. It looks like a compiler bug to me: this condition is true although all values in the arrays are equal. The test does not fail with DKRZ-provided GCC 11.2.0 and on my machine with Debian-provided GCC 10.2.1.
- `test_ddt_run` (was added after the YAXT release `0.9.3.1`) fails with all `4.x` versions of MPICH and datatype engine `yaksa` (`dataloop` is fine). I haven't looked deep into this one. The only thing I can tell is that the failure looks like: `YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Floating point exception (signal 8)`.
- `test_exchanger_parallel_run` fails with all `4.x` versions of MPICH, regardless of the datatype engine. More specifically, the following command fails: `$MPI_LAUNCH -n 3 ./test_exchanger_parallel -m neigh_alltoall`. It looks like there is an assertion that fails here (must be a different line in the case of `dataloop`). That happens because the variable `outcount` (here) has a very strange value `8101236137501347664`, which comes from this line as the value of `(rreq->dev.ch4.am).count`. And here I stopped digging.
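When rerunning the failing commands by hand, it can help to tell a signal (as in the `test_ddt_run` case) from an ordinary non-zero exit. A minimal sketch, assuming the launcher passes the failure on using the usual shell convention of 128 + signal number, which is not guaranteed for every MPI launcher; `describe_status` is a hypothetical helper, not part of YAXT or MPICH.

```shell
# Hypothetical helper: translate a shell exit status into a readable description.
# Assumes the common convention that a status above 128 means "128 + signal number".
describe_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "ok"
    elif [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "exit code $status"
    fi
}

# Usage (assumes MPI_LAUNCH is set and the test binary is in the current directory):
#   $MPI_LAUNCH -n 3 ./test_exchanger_parallel -m neigh_alltoall
#   describe_status $?
```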
I am attaching `test_yaxt_mpich.sh`, which reproduces all of the above on Levante (testing with GCC 10+ requires configuring MPICH with two additional arguments: `FFLAGS=-fallow-argument-mismatch FCFLAGS=-fallow-argument-mismatch`).
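The two additional arguments could be added conditionally, depending on the compiler version. A sketch under the assumption that the major version of `gfortran` on `PATH` is what matters; `mpich_extra_fflags` is a hypothetical helper and is not taken from the attached script.

```shell
# Hypothetical helper: print the extra MPICH configure arguments needed for a
# given GCC major version (nothing for versions older than 10).
mpich_extra_fflags() {
    gcc_major=$1
    if [ "$gcc_major" -ge 10 ]; then
        echo "FFLAGS=-fallow-argument-mismatch FCFLAGS=-fallow-argument-mismatch"
    fi
}

# Usage (assumes gfortran is on PATH and the MPICH source tree is the current directory):
#   gcc_major=$(gfortran -dumpversion | cut -d. -f1)
#   ./configure $(mpich_extra_fflags "$gcc_major") <other arguments>
```

With the system GCC 8.4.1 the helper prints nothing, so the configure line stays unchanged.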
It would be nice if someone took a deeper look at the second problem, came up with a minimal reproducer of the third one and submitted bug reports for both to the MPICH developers. It might also make sense to introduce a workaround for the first issue for the buggy compiler (not the first one, not the last one, I guess).