Parallelism and intensive computing
===================================

Integrated operators
--------------------

**The operators** CALC_MODES **and** INFO_MODE **can benefit from a first level of parallelism: the one intrinsic to the linear solver MUMPS.** However, the parallel efficiency of MUMPS in a modal calculation is more limited than for other types of analyses. In general, a parallel time efficiency of the order of 0.2 to 0.3 is observed over a small range of processors, from 2 to 16. Beyond that, no further gain is obtained. This can be explained in particular by:

* the virtual uniqueness of certain dynamic work matrices,
* a very unfavorable ratio of the number of forward/backward substitutions to the number of factorizations,
* the significant cost of the analysis phase compared to that of the factorization,
* a very unfavorable ratio of the "time/memory cost of the linear solver" to the "time/memory cost of the modal solver".

For more technical and functional information, you can consult the documentation [:external:ref:`R6.02.03 `], [:external:ref:`U4.50.01 `] and [:external:ref:`U2.08.03/06 `].

To improve performance, it is proposed to break down the initial calculation into more efficient and more accurate sub-calculations: this is the purpose of one of the functionalities of the CALC_MODES operator with OPTION='BANDE' and a list of n>2 values given under CALC_FREQ=_F(FREQ), detailed in the following paragraph. In addition, this algorithmic rewriting of the problem exposes two levels of parallelism that are more relevant and more effective for **"boosting" Code_Aster modal calculations**.
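As an illustration only (not an excerpt from the reference documentation), here is a minimal command-file sketch of this first level of parallelism: MUMPS is simply selected as the linear solver of the modal operators. The assembled matrix names ``rigidite`` and ``masse`` are hypothetical, and the keyword spellings should be checked against [U4.52.01/02].

.. code-block:: python

   # Minimal sketch, assuming assembled stiffness and mass matrices named
   # 'rigidite' and 'masse' built beforehand (hypothetical names).
   # The number of MPI processes is fixed at launch time (e.g. mpi_nbcpu
   # in the export file), not in the command file.

   # Sturm-type counting of the eigenvalues expected in the band [0, 200] Hz.
   nb_modes = INFO_MODE(MATR_RIGI=rigidite,
                        MATR_MASS=masse,
                        FREQ=(0.0, 200.0))

   # Effective modal calculation on the same band; MUMPS performs the
   # factorizations and forward/backward substitutions required by the
   # modal solver, in parallel over the MPI processes.
   modes = CALC_MODES(MATR_RIGI=rigidite,
                      MATR_MASS=masse,
                      OPTION='BANDE',
                      CALC_FREQ=_F(FREQ=(0.0, 200.0)),
                      SOLVEUR=_F(METHODE='MUMPS'))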
Operator CALC_MODES, option 'BANDE' split into sub-bands
---------------------------------------------------------

**To deal effectively with large modal problems** (in terms of mesh size and/or number of modes sought), it is recommended to use the **CALC_MODES** operator with the 'BANDE' option divided into sub-bands. It breaks down the modal calculation of a standard GEP (symmetric and real) into a succession of independent, less expensive, more robust and more accurate sub-calculations. Even purely sequentially, the **gains can be notable**: factors of 2 to 5 in time, 2 or 3 in peak RAM and 10 to :math:`{10}^{4}` on the average error of the modes. In addition, **its multi-level parallelism,** by reserving around sixty processors, **can provide additional gains** of the order of 20 in time and 2 in peak RAM (cf. tables 10-1), without loss of precision, without restriction of scope and with the same numerical behavior.

.. csv-table::
   :header: "**perf016a test case** (N=4M, 50 modes) splitting into **8 sub-bands**", "**Time elapsed**", "**Memory peak RAM**"

   "1 processor", "5524s", "16.9GB"
   "8 processors", "1002s", "19.5GB"
   "32 processors", "643s", "13.4GB"
   "division into **4 sub-bands**", "", ""
   "1 processor", "3569s", "17.2GB"
   "4 processors", "1121s", "19.5GB"
   "16 processors", "663s", "12.9GB"

.. image:: images/100002000000006100000093173F01BE9043012A.png
   :width: 1.4571in
   :height: 1.7909in

.. _RefImage_100002000000006100000093173F01BE9043012A.png:

.. image:: images/1000020000000070000000638B5ED7ADADB0857C.png
   :width: 1.628in
   :height: 1.5126in

.. _RefImage_1000020000000070000000638B5ED7ADADB0857C.png:

.. csv-table::
   :header: "**Seismic study** (N=0.7M, 450 modes) splitting into **20 sub-bands**", "**Time elapsed**", "**Memory peak RAM**"

   "1 processor", "5200s", "10.5GB"
   "20 processors", "407s", "12.1GB"
   "80 processors", "270s", "9.4GB"
   "division into **5 sub-bands**", "", ""
   "1 processor", "4660s", "8.2GB"
   "5 processors", "1097s", "11.8GB"
   "20 processors", "925s", "9.5GB"

*Figures-Tables 10-1a/b. Some test results of parallel CALC_MODES with the default settings (SOLVEUR='MUMPS', IN_CORE memory management and RENUM='QAMD'). Code_Aster v11.3.11 on the IVANOE machine (1 or 2 MPI processes per node).*

**The principle of CALC_MODES** [:external:ref:`U4.52.02 `] with the 'BANDE' option divided into sub-bands is based on the fact that the calculation and memory costs of modal algorithms grow more than linearly with the number of modes sought. So, as with domain decomposition [R6.01.03], the search for hundreds of modes is broken down into packages of more reasonable size. A package of about forty modes seems to be the empirical optimum in sequential mode; in parallel, performance can be improved further by going down to about fifteen. The example in Figure 10-2 thus illustrates a global CALC_MODES calculation in the band :math:`[{\mathit{freq}}_{\mathit{min}},{\mathit{freq}}_{\mathit{max}}]`, which is often advantageously replaced by ten CALC_MODES calculations targeted at equivalent contiguous sub-bands :math:`[{\mathit{freq}}_{1}={\mathit{freq}}_{\mathit{min}},{\mathit{freq}}_{2}],[{\mathit{freq}}_{2},{\mathit{freq}}_{3}],\dots,[{\mathit{freq}}_{10},{\mathit{freq}}_{11}={\mathit{freq}}_{\mathit{max}}]`. Moreover, this type of decomposition makes it possible to:

* reduce robustness problems,
* improve and homogenize modal errors.

In practice, **this modal operator dedicated to HPC** is broken down into four main steps (a command-file sketch of the corresponding call is given below, after the description of the parallelism levels):

1. **Modal pre-calibration** (*via* INFO_MODE) of the sub-bands configured by the user: potentially a loop of nb_freq independent calculations, one per modal frequency position (cf. [:external:ref:`R5.01.04 `]).
2. **Effective modal calculation** (*via* CALC_MODES + OPTION='BANDE' + CALC_FREQ=_F(TABLE_FREQ)) of the modes contained in each non-empty sub-band (reusing the modal calibrations from step 1): potentially a loop of nb_sbande_nonempty < nb_freq independent calculations.
3. **Post-verification** with a Sturm test on the extreme bounds of the calculated modes (*via* INFO_MODE): potentially a loop of 2 independent calculations.
4. **Post-processing** of all the modes obtained: normalization (*via* NORM_MODE) and filtering (*via* EXTR_MODE).

.. image:: images/1000020000000250000001A42E56421D95A93CE4.png
   :width: 5.8772in
   :height: 3.7311in

.. _RefImage_1000020000000250000001A42E56421D95A93CE4.png:

*Figure 10-2. Principle of the decomposition of the calculations of CALC_MODES with the option 'BANDE' divided into sub-bands.*

In parallel, **each of these calculation steps exposes at least one level of parallelism:**

* the first two, by distributing the calculations for each sub-band over the same number of processor packages;
* the third, by distributing the modal positions of the bounds of the verification interval over two processor packages;
* the fourth step, which is inexpensive, remains sequential.
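The command-file sketch announced above: giving a list of more than two frequencies under CALC_FREQ is what triggers the split into sub-bands, the four steps (pre-calibration, per-band calculation, Sturm post-verification, normalization/filtering) then being chained automatically by the macro-command. Matrix names remain hypothetical, as in the previous sketch.

.. code-block:: python

   # Sketch of the 'BANDE' option split into sub-bands, with the same
   # hypothetical assembled matrices as above. Giving n > 2 frequency
   # values under CALC_FREQ/FREQ triggers the macro behaviour: INFO_MODE
   # pre-calibration, per-band computation, Sturm post-verification and
   # NORM_MODE/EXTR_MODE post-processing are chained internally
   # (steps 1 to 4 above).
   modes = CALC_MODES(MATR_RIGI=rigidite,
                      MATR_MASS=masse,
                      OPTION='BANDE',
                      # 5 frequency bounds -> 4 contiguous search sub-bands.
                      CALC_FREQ=_F(FREQ=(0.0, 50.0, 100.0, 150.0, 200.0)),
                      SOLVEUR=_F(METHODE='MUMPS'))

The bounds are chosen by the user (if need be with the help of a preliminary INFO_MODE) so that each sub-band contains a package of modes close to the empirical optimum mentioned above.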
If the number of processors and the settings allow it (in particular, if the linear solver MUMPS is used), a second level of parallelism can be exploited. Figure 10-3 illustrates a modal calculation seeking to take advantage of 40 processors by breaking down the initial calculation into ten search sub-bands; each sub-band benefits from the support of 4 MUMPS occurrences for the inversions of the linear systems intensively required by the modal solvers (a hedged command-file sketch of such a "10x4" setup is given after the notes below). For an exhaustive presentation of this multi-level parallelism, its challenges and some technical and functional details, you can consult the documentation [:external:ref:`R5.01.04 `], [:external:ref:`U4.52.01/02 `] and [:external:ref:`U2.08.06 `].

.. image:: images/100002000000023C000001A38B0F01FF7F09E47B.png
   :width: 5.1181in
   :height: 3.7362in

.. _RefImage_100002000000023C000001A38B0F01FF7F09E47B.png:

*Figure 10-3. Example of two levels of parallelism, in the pre-processing INFO_MODE and in the loop over the sub-bands of CALC_MODES. Distribution on nb_proc=40 processors with a division into 10 sub-bands (parallelism called "10x4"). Here the linear solver MUMPS and the default parallelism setting ('COMPLET') are used.*

Notes:

* In MPI parallelism, the main issues concern the distribution of the tasks and their communications. For CALC_MODES with the 'BANDE' option divided into sub-bands, the distribution is done in the Python macro-command as well as in the Fortran. The two layers communicate through a hidden keyword, PARALLELISME_MACRO, but all MPI calls remain restricted to the F77/F90 layers.
* The global communications of the first level, those of the eigenvalues and eigenvectors, are carried out at the end of the modal calculation on each sub-band. This is an intermediate choice, between the simple communication of linear algebra results (of the type of what is done around MUMPS/PETSc) and the communication of Aster data structures in Python after filtering (optimal in terms of performance but much more complicated to implement).
* The ideal would have been to be able to balance the frequency sub-bands empirically, so as to limit the load imbalances linked to the distribution of modes by sub-bands and those linked to the modal calculation itself. It would thus have been possible to assign 2, 4 or 8 sub-band calculations per processor, which would also have made it possible to benefit from the gains of the decomposition of the macro-calculation even on a few processors. Unfortunately, implementation constraints in handling potentially empty user concepts did not make it possible to validate this more ambitious scenario.
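The "10x4" sketch announced above: 10 sub-bands declared in the command file and 40 MPI processes requested at launch time, so that each sub-band is served by a package of 4 MUMPS processes. The keyword name NIVEAU_PARALLELISME and its default value 'COMPLET' are taken from [U4.52.02] and shown here for clarity only; matrix names and the launch-time setting are assumptions.

.. code-block:: python

   # Hedged sketch of the "10x4" two-level parallelism of figure 10-3.
   # Level 1: the 10 sub-bands are distributed over 10 packages of processes.
   # Level 2: inside each package, 4 MPI processes are given to MUMPS.
   # The job is assumed to be launched with 40 MPI processes (e.g.
   # mpi_nbcpu=40 in the export file); this is not set in the command file.
   l_freq = [50.0 * i for i in range(11)]   # 11 bounds -> 10 sub-bands

   modes = CALC_MODES(MATR_RIGI=rigidite,
                      MATR_MASS=masse,
                      OPTION='BANDE',
                      CALC_FREQ=_F(FREQ=l_freq),
                      SOLVEUR=_F(METHODE='MUMPS'),
                      NIVEAU_PARALLELISME='COMPLET')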