2. Implementation details#

Note

This section consists of development notes and should allow an outside reader to understand how this was implemented.

2.1. Overall status of implementation#

It is necessary to store the state of the processors to try to stop as many calculations as possible cleanly.

We have to store: ok/error status, distinguishing proc #0 from the others.

Necessary functions:

  • say that everything is ok everywhere.

  • say that an error was seen on proc #0 or others.

  • know if everything is ok.

  • find out if there was an error on proc #0 or others.

The state is stored in a COMMON, and two routines exist to query the state (GTSTAT, for *get status*) and to assign it (STSTAT, for *set status*). Constants are used to simplify reading (see aster_constant.h).

Content of aster_constant.h:

    #define ST_OK     0
    #define ST_AL_PR0 1   /* alarm on processor #0 */
    #define ST_AL_OTH 2   /* alarm on another processor */
    #define ST_ER_PR0 4   /* error on processor #0 */
    #define ST_ER_OTH 8   /* error on another processor */
    #define ST_UN_OTH 16  /* undefined status for another processor */

Bitwise logic operations are used to store the states and to test whether a given state is set.
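As an illustration, a minimal C sketch of how such bitwise storage and querying could look (the real STSTAT/GTSTAT are Fortran routines working on a COMMON; the C names and the static variable below are stand-ins):

```c
#include <assert.h>

/* Constants from aster_constant.h (values from the document). */
#define ST_OK     0   /* everything is fine */
#define ST_AL_PR0 1   /* alarm on processor #0 */
#define ST_AL_OTH 2   /* alarm on another processor */
#define ST_ER_PR0 4   /* error on processor #0 */
#define ST_ER_OTH 8   /* error on another processor */
#define ST_UN_OTH 16  /* undefined status for another processor */

/* The real state lives in a Fortran COMMON; a static int stands in here. */
static int status = ST_OK;

/* Sketch of STSTAT: set a state bit (bitwise OR accumulates flags). */
void ststat(int flag) { status |= flag; }

/* Sketch of GTSTAT: test whether a given state bit is set.
 * "Everything is ok" means no bit is set at all. */
int gtstat(int flag) {
    if (flag == ST_OK)
        return status == ST_OK;
    return (status & flag) != 0;
}
```

Because each state is a distinct power of two, several states can be set simultaneously and queried independently.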

2.2. Non-blocking communications#

The key to detecting that some processors are not responding at the rendezvous points is to use non-blocking MPI communications.

The starting point of the specific processing is in u2mesg, when the error message is emitted.

2.2.1. U2mesg#

In case of error, proc #0 is notified by calling mpicmw(). The same is done in Utmess.py by calling aster_core.mpi_warn().

2.2.2. Mpisst#

Send ST_OK or ST_ER to proc #0 (MPI_ISEND with tag CHK, non-blocking send) and wait for the answer from proc #0 (MPI_IRECV with tag CNT, non-blocking receive). A timeout is set so as not to wait indefinitely. If proc #0 does not answer within the time limit, MPI_Abort is called via mpistp(1).

If we sent ST_OK, we want to know whether we should continue or not.

If we sent ST_ER, we just want to know whether proc #0 is responding (in which case we stop properly); otherwise we have to abort the execution.

If no timeout occurred, we return the answer from proc #0: ST_OK (everything is fine) or ST_ER (perform a clean stop).
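The wait-with-timeout part of mpisst can be sketched generically, independently of MPI: poll a completion test (MPI_Test on the IRECV request in the real code) until it succeeds or the deadline expires. The function and predicate names below are hypothetical stand-ins:

```c
#include <assert.h>
#include <time.h>

/* Hypothetical predicates standing in for MPI_Test on a request. */
static int polls_left;
static int fake_test(void *req)  { (void)req; return --polls_left <= 0; }
static int never_done(void *req) { (void)req; return 0; }

/* Generic sketch of the mpisst wait loop: poll a completion test until
 * it reports completion or the timeout expires.  Returns 1 on completion,
 * 0 on timeout (where the real code calls mpistp(1), i.e. MPI_Abort). */
int wait_with_timeout(int (*test)(void *), void *req, double timeout_sec) {
    clock_t start = clock();
    for (;;) {
        if (test(req))
            return 1;
        if ((double)(clock() - start) / CLOCKS_PER_SEC > timeout_sec)
            return 0;
    }
}
```

The real routine additionally interprets the received value (ST_OK/ST_ER) once the request completes.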

2.2.3. Mpicmw#

Alerts proc #0 that we have encountered a problem.

  • On proc != 0, we set ST_ER_OTH (error on a processor other than #0) and send ST_ER to proc #0 with mpisst(ST_ER).

The answer from proc #0 is ST_ER; we then continue as in the sequential case (exception, closing of the database, or abort).

  • On proc #0, we set ST_ER_PR0 (error specific to proc #0) and call mpichk().

2.2.4. Mpichk#

Called before performing a global communication, to check that everything is fine and, if not, to act accordingly.

  • On proc != 0, we send ST_OK to proc #0 with mpisst(ST_OK) and wait for the answer from proc #0 to know whether we should continue or interrupt.

If proc #0 replies that the execution must be stopped, we call mpistp(2).

  • On proc #0, we wait for the answer from all the other processors (MPI_IRECV with tag CHK, non-blocking receive), with a timeout so as not to wait indefinitely.

  • If a proc encountered an error (and therefore sent ST_ER): message "error on proc #i" + STSTAT(ST_ER_OTH).

  • If one of the procs does not answer within the deadline: message "proc #0 waited too long" and an "E" error "processor #i did not respond" + STSTAT(ST_UN_OTH).

  • To the processors present at the rendezvous, we answer "continue" or "interrupt" (MPI_SEND with tag CNT, blocking send). In case of an error on proc #0, we send "interrupt". To interrupt, we call mpistp(2).

  • If one of the processors did not make it to the rendezvous, proc #0 stops the execution with MPI_Abort: we call mpistp(1).

mpichk returns a code: 0 = ok, 1 = not ok.
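Proc #0's bookkeeping can be sketched as follows (names and status encoding are illustrative, not the actual Fortran implementation):

```c
#include <assert.h>

/* Per-processor report as seen by proc #0; values are illustrative
 * (the real code exchanges ST_OK/ST_ER over MPI). */
enum report { STAT_OK, STAT_ER, STAT_NONE };

/* Sketch of proc #0's decision in mpichk: scan the reports of procs
 * #1..nproc-1 and return the document's codes: 0 = ok ("continue"),
 * 1 = not ok ("interrupt").  *abort_needed is set when a processor
 * missed the rendezvous, in which case mpistp(1) does MPI_Abort. */
int mpichk_decide(const enum report *rep, int nproc, int *abort_needed) {
    int rc = 0;
    *abort_needed = 0;
    for (int i = 1; i < nproc; i++) {
        if (rep[i] == STAT_ER)
            rc = 1;                 /* error elsewhere: clean stop */
        if (rep[i] == STAT_NONE) {  /* no answer within the deadline */
            rc = 1;
            *abort_needed = 1;
        }
    }
    return rc;
}
```

The real routine additionally prints the messages and calls STSTAT(ST_ER_OTH) or STSTAT(ST_UN_OTH) as described above.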

2.2.5. Mpistp#

Used to stop execution.

  • mpistp(2): all processors have communicated their status, so the execution can be properly interrupted with u2mess("M", "APPELMPI_95"). If an exception has already been raised by the previous u2mess ("F" or "S"), we avoid recursion and do not raise another exception. If no error has been issued yet, the behavior is that of an ordinary "F" error.

  • mpistp(1): at least one processor did not respond (maybe proc #0); we must interrupt everyone, including the one that is not responding. We issue a u2mess("D", "APPELMPI_99"), which prints the message as an "F" (for the diagnosis) but does not raise an exception (an exception would unwind the stack, and the next step would therefore not be executed); then we call JEFINI("ERREUR") to trigger MPI_Abort.

  • If ERREUR_F = "ABORT", mpistp(2) behaves like mpistp(1).

  • No instruction should be executed after a call to mpistp(2): when calling mpichk(), do a GOTO to the end of the routine.

2.2.6. mpicm1/mpicm2#

Before starting a communication, we call mpichk() to check that no problem occurred. The return code must be taken into account: if it is non-zero, stop without making the call!
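A minimal sketch of this calling pattern, with stand-in stubs for mpichk and the collective call (all names below are illustrative):

```c
#include <assert.h>

/* Hypothetical stand-ins for the real routines. */
static int chk_rc;                 /* what mpichk would return */
static int comm_done;              /* whether the collective actually ran */

static int  mpichk_stub(void)    { return chk_rc; }  /* 0 = ok, 1 = not ok */
static void collective_stub(void) { comm_done = 1; } /* the MPI exchange   */

/* Sketch of the mpicm1/mpicm2 pattern: check first, and if anything is
 * wrong skip the collective call entirely ("GOTO end of routine"). */
int mpicm_sketch(void) {
    int ier = mpichk_stub();
    if (ier != 0)
        goto end;                  /* do NOT start the communication */
    collective_stub();
end:
    return ier;
}
```

Skipping the call matters: a processor entering a collective operation while others have bailed out would block forever.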

2.2.7. JEFINI/MPI_Abort#

Instead of stopping with ABORT(), we call ASABRT(6) (6 corresponds to SIGABRT), which calls MPI_Abort.

It is essential to call MPI_Abort to be able to stop everyone, including the blocked processors. However, MPI_Abort implies the end of the script launched by mpirun, so the copying of the results from proc #0's directory to the global directory cannot take place (although this may depend on the MPI implementation).

So an "error in MPI" results in "no database saved", and in case of error the diagnosis may not be very detailed (depending on the MPI implementation, the fort.8/fort.9 files may or may not be copied into the global working directory). The diagnosis is likely to be <F>_ABNORMAL_ABORT instead of <F>_ERROR.

2.2.8. Additional notes, precautions#

Problem encountered: MPI_Abort did not stop the execution.

In MPI, all processors have to go through MPI_Finalize before exiting.

However, the Python interpreter exits through sys.exit(), which presumably calls the system function exit(), so we cannot add a call to MPI_Finalize just before exiting. That is why we register, via atexit, a function that executes MPI_Finalize.

The problem is that this function is also called after an MPI_Abort; the execution then blocks without interrupting all the processors. We therefore define a function ASABRT that performs abort() (or MPI_Abort in parallel) and sets a flag so that the terminate function does not go through MPI_Finalize (cf. aster_error.c/h).
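A minimal sketch of this flag mechanism, with stubs in place of the real MPI calls (the function bodies are illustrative; only the names ASABRT and terminate come from the document):

```c
#include <assert.h>

static int aborted = 0;          /* set by ASABRT, read by terminate */
static int finalize_called = 0;  /* for demonstration only */

static void mpi_abort_stub(void)    { /* MPI_Abort in the real code */ }
static void mpi_finalize_stub(void) { finalize_called = 1; }

/* Sketch of ASABRT: remember that we are aborting, then abort. */
void asabrt_sketch(void) {
    aborted = 1;
    mpi_abort_stub();   /* would not return in the real code */
}

/* Sketch of the function registered via atexit(): on a normal exit,
 * every processor goes through MPI_Finalize; after an MPI_Abort the
 * flag makes us skip it, so the exit does not block. */
void terminate_sketch(void) {
    if (!aborted)
        mpi_finalize_stub();
}
```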

Precaution for Fortran calls from C

Since we call Fortran routines from C, and almost all routines are likely to emit u2mess and therefore raise exceptions, it is imperative that the C extension module (aster or aster_core) provide a try/except (in C) to handle this exception (and return NULL in case of error).

Indeed, the exception unwinds the execution. If there is no try, control may not resume where you expect. A programming-error alarm is raised if no try has been set up higher in the call stack.

Example:

    static PyObject *aster_mpi_warn(PyObject *self, PyObject *args)
    {
        try {
            CALL_MPICMW();
        }
        ExceptAll {
            raiseException();
        }
        endTry();
        Py_INCREF(Py_None);
        return Py_None;
    }

2.3. Delay values#

These are the deadlines granted to latecomers during non-blocking communications.

Difference between two processors:

    #0 =====|t0| .......... |ti|
    #i ================|ti| ...

\(\mathit{ti}-\mathit{t0}\): delay granted by #0 to the #i processors. Therefore, if #1 arrives before #0, it should grant #0 the same delay: \(\mathit{t0}-\mathit{t1}=\mathit{ti}-\mathit{t0}\).

The extreme case is:

    #0 ==========|t0| .......... |ti|==========
    #1 =====|t1| ................. |tf|========
    #i ==========================|ti| ~ |tf|===

  • \(\mathit{t1}\): arrival of the first processor, #1

  • \(\mathit{t0}\): arrival of #0; #0 receives CHK from #1

  • \(\mathit{t0}+\mathit{dt}\): deadline for #1, which is awaiting the answer from #0

  • \(\mathit{ti}\): arrival of #i; #0 receives CHK from #i and sends CNT to #1 and #i

  • \(\mathit{ti}+\mathit{dt}=\mathit{tf}\): #1 and #i receive CNT from #0

So we only need \(\mathit{tf}-\mathit{t0}>\mathit{ti}-\mathit{t0}\). We limit the time allowed for receiving the answer from #0 to \(1.2\times(\mathit{ti}-\mathit{t0})\).

The timeout value is set to 20% of the remaining CPU time.
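Assuming the two rules above combine as stated (the 20% of remaining CPU time being the delay \(\mathit{ti}-\mathit{t0}\) granted to latecomers, and the 1.2 factor the margin for receiving #0's answer), the computation could be sketched as follows; both function names are hypothetical:

```c
#include <assert.h>

/* Delay granted to latecomers: 20% of the remaining CPU time
 * (interpretation of the rule stated in the document). */
double latecomer_delay(double remaining_cpu_sec) {
    return 0.20 * remaining_cpu_sec;
}

/* Time limit for receiving the answer from proc #0:
 * 1.2 x the latecomer delay. */
double answer_timeout(double remaining_cpu_sec) {
    return 1.2 * latecomer_delay(remaining_cpu_sec);
}
```

Tying the delay to the remaining CPU time keeps the wait short near the end of the allotted time, so a blocked run still stops before the job scheduler kills it.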