1. How it works

Running in parallel on several processes (using MPI) requires special handling when an error occurs.

Without it, an error that does not occur on all processors at the same point (i.e. between the same two communications) stops one processor while the others wait for it indefinitely at the next communication call. They wait until the CPU time limit is reached, and the entire calculation is lost.

The special handling consists in checking, before any global communication, that all processors are present and what state they are in (whether or not they have sent an error message).

If they are all there and none has encountered an error, we continue with the scheduled communication.
If they are all there but at least one processor has issued an error message, all processors are asked to shut down as usual (by raising an exception).

The behavior is then the same as in a sequential run: an <S> error (exception) is raised, and the database files are therefore saved.

If at least one processor is not there within an agreed period of time, then that processor is either stuck on a task that takes much longer than on the other processors, caught in an infinite loop, hit by a programming error, or has quit abruptly.

In this case, the execution of the remaining processors must be interrupted abruptly. The database does not have to be saved.
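The three cases above can be summarized as a small decision function. This is only an illustrative sketch, not the actual implementation: the names `Status`, `Action` and `decide` are hypothetical, and the way each processor's status is collected (including the timeout) is assumed to happen elsewhere.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical state of one processor at the checkpoint."""
    OK = "ok"          # present, no error reported
    ERROR = "error"    # present, but has sent an error message
    ABSENT = "absent"  # did not answer within the agreed period of time


class Action(Enum):
    """Hypothetical outcome of the check."""
    CONTINUE = "continue"  # go on with the scheduled communication
    RAISE = "raise"        # all processors raise an exception (database saved)
    ABORT = "abort"        # remaining processors are interrupted abruptly


def decide(statuses):
    """Return the action to take given the status of every processor."""
    # A missing processor takes precedence: nothing can be synchronized.
    if any(s is Status.ABSENT for s in statuses):
        return Action.ABORT
    # Everyone is there, but at least one processor reported an error.
    if any(s is Status.ERROR for s in statuses):
        return Action.RAISE
    # Everyone is there and no error was reported.
    return Action.CONTINUE
```

For example, `decide([Status.OK, Status.ERROR, Status.OK])` returns `Action.RAISE`, whereas a single `Status.ABSENT` entry always leads to `Action.ABORT`, whatever the other processors report.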

In addition, a call is made in FIN to retrieve the number of alarms emitted (and not ignored) by each processor. For the diagnosis, done on processor #0, an alarm is emitted that simply gives the number of alarms emitted per processor.

This prevents you from "missing" an alarm that would only have occurred on one processor.
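As an illustration of this diagnosis, the sketch below builds a per-processor summary from a list of alarm counts. The function name `alarm_summary` and the message format are assumptions made for the example. In a real parallel run, the counts would first have to be gathered on processor #0, which is not shown here.

```python
def alarm_summary(counts):
    """Build a summary message from per-processor alarm counts.

    counts[i] is assumed to be the number of alarms emitted (and not
    ignored) by processor #i, already gathered on processor #0.
    """
    lines = [
        f"processor #{rank}: {count} alarm(s)"
        for rank, count in enumerate(counts)
    ]
    return "\n".join(lines)
```

With `counts = [0, 2]`, the summary shows that the two alarms were emitted on processor #1 only, which would go unnoticed if only the output of processor #0 were read.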