Parallelism and intensive computing
===================================

Integrated operators
--------------------

**The operators** CALC_MODES **and** INFO_MODE **can benefit from a first level of parallelism: the one intrinsic to the linear solver MUMPS.** However, the parallel efficiency of MUMPS in a modal calculation is more limited than for other types of analyses. In general, a parallel time efficiency of the order of 0.2 to 0.3 is observed over a small range of processors, from 2 to 16. Beyond that, no further gain is obtained. This can be explained in particular by:

* the virtual uniqueness of certain dynamic work matrices,
* a very unfavorable ratio of the number of forward/backward substitutions to the number of factorizations,
* the significant cost of the analysis phase compared to that of the factorization,
* a very unfavorable ratio of the "time/memory cost of the linear solver" to the "time/memory cost of the modal solver".

For more technical and functional information, you can consult the documentation [:external:ref:`R6.02.03 `], [:external:ref:`U4.50.01 `] and [:external:ref:`U2.08.03/06 `].

To improve performance, it is proposed to break down the initial calculation into more efficient and more accurate sub-calculations: this is the purpose of one of the functionalities of the CALC_MODES operator with OPTION='BANDE' and a list of n>2 values given under CALC_FREQ=_F(FREQ), detailed in the following paragraph. In addition, this algorithmic rewriting of the problem exposes two levels of parallelism that are more relevant and more effective for **"boosting" Code_Aster modal calculations**.
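As an illustration only (not an excerpt from the reference documentation), here is a minimal command-file sketch of this first level of parallelism: MUMPS is simply selected as the linear solver of the modal operators. The assembled matrix names ``rigidite`` and ``masse`` are hypothetical, and the keyword spellings should be checked against [U4.52.01/02].

.. code-block:: python

   # Minimal sketch, assuming assembled stiffness and mass matrices named
   # 'rigidite' and 'masse' built beforehand (hypothetical names).
   # The number of MPI processes is fixed at launch time (e.g. mpi_nbcpu
   # in the export file), not in the command file.

   # Sturm-type counting of the eigenvalues expected in the band [0, 200] Hz.
   nb_modes = INFO_MODE(MATR_RIGI=rigidite,
                        MATR_MASS=masse,
                        FREQ=(0.0, 200.0))

   # Effective modal calculation on the same band; MUMPS performs the
   # factorizations and forward/backward substitutions required by the
   # modal solver, in parallel over the MPI processes.
   modes = CALC_MODES(MATR_RIGI=rigidite,
                      MATR_MASS=masse,
                      OPTION='BANDE',
                      CALC_FREQ=_F(FREQ=(0.0, 200.0)),
                      SOLVEUR=_F(METHODE='MUMPS'))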
Operator CALC_MODES, option 'BANDE' split into sub-bands
---------------------------------------------------------

**To deal effectively with large modal problems** (in terms of mesh size and/or number of modes sought), it is recommended to use the **CALC_MODES** operator with the 'BANDE' option divided into sub-bands. It breaks down the modal calculation of a standard GEP (symmetric and real) into a succession of independent, less expensive, more robust and more accurate sub-calculations. Even purely sequentially, the **gains can be notable**: factors of 2 to 5 in time, 2 or 3 in peak RAM and 10 to :math:`{10}^{4}` on the average error of the modes. In addition, **its multi-level parallelism,** by reserving around sixty processors, **can provide additional gains** of the order of 20 in time and 2 in peak RAM (cf. tables 10-1), without loss of precision, without restriction of scope and with the same numerical behavior.

.. csv-table::
   :header: "**perf016a test case** (N=4M, 50 modes) splitting into **8 sub-bands**", "**Time elapsed**", "**Memory peak RAM**"

   "1 processor", "5524s", "16.9GB"
   "8 processors", "1002s", "19.5GB"
   "32 processors", "643s", "13.4GB"
   "division into **4 sub-bands**", "", ""
   "1 processor", "3569s", "17.2GB"
   "4 processors", "1121s", "19.5GB"
   "16 processors", "663s", "12.9GB"

.. image:: images/100002000000006100000093173F01BE9043012A.png
   :width: 1.4571in
   :height: 1.7909in

.. _RefImage_100002000000006100000093173F01BE9043012A.png:

.. image:: images/1000020000000070000000638B5ED7ADADB0857C.png
   :width: 1.628in
   :height: 1.5126in

.. _RefImage_1000020000000070000000638B5ED7ADADB0857C.png:

.. csv-table::
   :header: "**Seismic study** (N=0.7M, 450 modes) splitting into **20 sub-bands**", "**Time elapsed**", "**Memory peak RAM**"

   "1 processor", "5200s", "10.5GB"
   "20 processors", "407s", "12.1GB"
   "80 processors", "270s", "9.4GB"
   "division into **5 sub-bands**", "", ""
   "1 processor", "4660s", "8.2GB"
   "5 processors", "1097s", "11.8GB"
   "20 processors", "925s", "9.5GB"

*Figures-Tables 10-1a/b. Some test results of parallel CALC_MODES with the default settings (SOLVEUR='MUMPS', IN_CORE memory management and RENUM='QAMD'). Code_Aster v11.3.11 on the IVANOE machine (1 or 2 MPI processes per node).*

**The principle of CALC_MODES** [:external:ref:`U4.52.02 `] with the 'BANDE' option divided into sub-bands is based on the fact that the calculation and memory costs of modal algorithms grow more than linearly with the number of modes sought. So, as with domain decomposition [R6.01.03], the search for hundreds of modes is broken down into packages of more reasonable size. A package of about forty modes seems to be the empirical optimum in sequential mode; in parallel, performance can be improved further by going down to about fifteen. The example in Figure 10-2 thus illustrates a global CALC_MODES calculation in the band :math:`[{\mathit{freq}}_{\mathit{min}},{\mathit{freq}}_{\mathit{max}}]`, which is often advantageously replaced by ten CALC_MODES calculations targeted at equivalent contiguous sub-bands :math:`[{\mathit{freq}}_{1}={\mathit{freq}}_{\mathit{min}},{\mathit{freq}}_{2}],[{\mathit{freq}}_{2},{\mathit{freq}}_{3}],\dots,[{\mathit{freq}}_{10},{\mathit{freq}}_{11}={\mathit{freq}}_{\mathit{max}}]`. Moreover, this type of decomposition makes it possible to:

* reduce robustness problems,
* improve and homogenize modal errors.

In practice, **this modal operator dedicated to HPC** is broken down into four main steps (a command-file sketch of the corresponding call is given below, after the description of the parallelism levels):

1. **Modal pre-calibration** (*via* INFO_MODE) of the sub-bands configured by the user: potentially a loop of nb_freq independent calculations, one per modal frequency position (cf. [:external:ref:`R5.01.04 `]).
2. **Effective modal calculation** (*via* CALC_MODES + OPTION='BANDE' + CALC_FREQ=_F(TABLE_FREQ)) of the modes contained in each non-empty sub-band (reusing the modal calibrations from step 1): potentially a loop of nb_sbande_nonempty < nb_freq independent calculations.
3. **Post-verification** with a Sturm test on the extreme bounds of the calculated modes (*via* INFO_MODE): potentially a loop of 2 independent calculations.
4. **Post-processing** of all the modes obtained: normalization (*via* NORM_MODE) and filtering (*via* EXTR_MODE).

.. image:: images/1000020000000250000001A42E56421D95A93CE4.png
   :width: 5.8772in
   :height: 3.7311in

.. _RefImage_1000020000000250000001A42E56421D95A93CE4.png:

*Figure 10-2. Principle of the decomposition of the calculations of CALC_MODES with the option 'BANDE' divided into sub-bands.*

In parallel, **each of these calculation steps exposes at least one level of parallelism:**

* the first two, by distributing the calculations for each sub-band over the same number of processor packages;
* the third, by distributing the modal positions of the bounds of the verification interval over two processor packages;
* the fourth step, which is inexpensive, remains sequential.
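The command-file sketch announced above: giving a list of more than two frequencies under CALC_FREQ is what triggers the split into sub-bands, the four steps (pre-calibration, per-band calculation, Sturm post-verification, normalization/filtering) then being chained automatically by the macro-command. Matrix names remain hypothetical, as in the previous sketch.

.. code-block:: python

   # Sketch of the 'BANDE' option split into sub-bands, with the same
   # hypothetical assembled matrices as above. Giving n > 2 frequency
   # values under CALC_FREQ/FREQ triggers the macro behaviour: INFO_MODE
   # pre-calibration, per-band computation, Sturm post-verification and
   # NORM_MODE/EXTR_MODE post-processing are chained internally
   # (steps 1 to 4 above).
   modes = CALC_MODES(MATR_RIGI=rigidite,
                      MATR_MASS=masse,
                      OPTION='BANDE',
                      # 5 frequency bounds -> 4 contiguous search sub-bands.
                      CALC_FREQ=_F(FREQ=(0.0, 50.0, 100.0, 150.0, 200.0)),
                      SOLVEUR=_F(METHODE='MUMPS'))

The bounds are chosen by the user (if need be with the help of a preliminary INFO_MODE) so that each sub-band contains a package of modes close to the empirical optimum mentioned above.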
If the number of processors and the settings allow it (in particular, if the linear solver MUMPS is used), a second level of parallelism can be exploited. Figure 10-3 illustrates a modal calculation seeking to take advantage of 40 processors by breaking down the initial calculation into ten search sub-bands; each sub-band benefits from the support of 4 MUMPS occurrences for the inversions of the linear systems intensively required by the modal solvers (a hedged command-file sketch of such a "10x4" setup is given after the notes below). For an exhaustive presentation of this multi-level parallelism, its challenges and some technical and functional details, you can consult the documentation [:external:ref:`R5.01.04 `], [:external:ref:`U4.52.01/02 `] and [:external:ref:`U2.08.06 `].

.. image:: images/100002000000023C000001A38B0F01FF7F09E47B.png
   :width: 5.1181in
   :height: 3.7362in

.. _RefImage_100002000000023C000001A38B0F01FF7F09E47B.png:

*Figure 10-3. Example of two levels of parallelism, in the pre-processing INFO_MODE and in the loop over the sub-bands of CALC_MODES. Distribution on nb_proc=40 processors with a division into 10 sub-bands (parallelism called "10x4"). Here the linear solver MUMPS and the default parallelism setting ('COMPLET') are used.*

Notes:

* In MPI parallelism, the main issues concern the distribution of the tasks and their communications. For CALC_MODES with the 'BANDE' option divided into sub-bands, the distribution is done in the Python macro-command as well as in the Fortran. The two layers communicate through a hidden keyword, PARALLELISME_MACRO, but all MPI calls remain restricted to the F77/F90 layers.
* The global communications of the first level, those of the eigenvalues and eigenvectors, are carried out at the end of the modal calculation on each sub-band. This is an intermediate choice, between the simple communication of linear algebra results (of the type of what is done around MUMPS/PETSc) and the communication of Aster data structures in Python after filtering (optimal in terms of performance but much more complicated to implement).
* The ideal would have been to be able to balance the frequency sub-bands empirically, so as to limit the load imbalances linked to the distribution of modes by sub-bands and those linked to the modal calculation itself. It would thus have been possible to assign 2, 4 or 8 sub-band calculations per processor, which would also have made it possible to benefit from the gains of the decomposition of the macro-calculation even on a few processors. Unfortunately, implementation constraints in handling potentially empty user concepts did not make it possible to validate this more ambitious scenario.
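The "10x4" sketch announced above: 10 sub-bands declared in the command file and 40 MPI processes requested at launch time, so that each sub-band is served by a package of 4 MUMPS processes. The keyword name NIVEAU_PARALLELISME and its default value 'COMPLET' are taken from [U4.52.02] and shown here for clarity only; matrix names and the launch-time setting are assumptions.

.. code-block:: python

   # Hedged sketch of the "10x4" two-level parallelism of figure 10-3.
   # Level 1: the 10 sub-bands are distributed over 10 packages of processes.
   # Level 2: inside each package, 4 MPI processes are given to MUMPS.
   # The job is assumed to be launched with 40 MPI processes (e.g.
   # mpi_nbcpu=40 in the export file); this is not set in the command file.
   l_freq = [50.0 * i for i in range(11)]   # 11 bounds -> 10 sub-bands

   modes = CALC_MODES(MATR_RIGI=rigidite,
                      MATR_MASS=masse,
                      OPTION='BANDE',
                      CALC_FREQ=_F(FREQ=l_freq),
                      SOLVEUR=_F(METHODE='MUMPS'),
                      NIVEAU_PARALLELISME='COMPLET')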