10. Parallelism and intensive computing#

10.1. Integrated operators#

The operators CALC_MODES and INFO_MODE can benefit from a first level of parallelism: that intrinsic to the linear solver MUMPS.

However, the parallel efficiency of MUMPS in a modal calculation is more limited than for other types of analysis. In general, a parallel time efficiency of the order of 0.2 to 0.3 is observed over a small range of processors, from 2 to 16; beyond that, nothing more is gained.
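For reference, and assuming the usual definition of parallel time efficiency (the reference documents may use a slightly different metric),

\(E(p)=\frac{T_{1}}{p\,T_{p}}\),

an efficiency of 0.25 on \(p=8\) processors therefore means that the elapsed time is only divided by about 2.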

This limited efficiency can be explained in particular by:

  • the fact that certain dynamic working matrices are virtually unique,

  • a very unfavorable ratio between the number of forward/backward substitutions and the number of factorizations,

  • the significant cost of the analysis phase compared to that of factorization,

  • a very unfavorable ratio between the time/memory cost of the linear solver and the time/memory cost of the modal solver.

For more technical and functional information, you can consult the documentation [R6.02.03], [U4.50.01] and [U2.08.03/06].

To improve performance, it is proposed to break the initial calculation down into more efficient and more accurate sub-calculations. This is the purpose of one of the functionalities of the CALC_MODES operator with OPTION='BANDE' and a list of n>2 values given under CALC_FREQ=_F(FREQ), detailed in the next paragraph. Moreover, this algorithmic rewriting of the problem exposes two levels of parallelism that are more relevant and more effective for « boosting » Code_Aster modal calculations.
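By way of illustration, here is a minimal command-file sketch (Code_Aster Python syntax). The assembled matrices rigi and mass and the frequency values are placeholders; only OPTION='BANDE', the list of frequencies under CALC_FREQ=_F(FREQ) and the MUMPS solver come from the text above, and the exact keywords may differ between versions:

```python
# Hypothetical example: search over [0., 200.] Hz split into n-1 = 4
# contiguous sub-bands by giving a list of n = 5 frequencies.
modes = CALC_MODES(
    MATR_RIGI=rigi,                # assembled stiffness matrix (placeholder)
    MATR_MASS=mass,                # assembled mass matrix (placeholder)
    OPTION='BANDE',
    CALC_FREQ=_F(FREQ=(0.0, 50.0, 100.0, 150.0, 200.0)),
    SOLVEUR=_F(METHODE='MUMPS'),   # first level of parallelism (MUMPS)
)
```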

10.2. Operator CALC_MODES, option “BANDE” split into sub-bands#

To deal effectively with large modal problems (in terms of mesh size and/or number of modes sought), it is recommended to use the CALC_MODES operator with the “BANDE” option divided into sub-bands. It breaks the modal calculation of a standard GEP (symmetric and real generalized eigenvalue problem) down into a succession of independent sub-calculations that are less expensive, more robust and more accurate.

Even in sequential mode, the gains can be notable: factors of 2 to 5 in time, 2 to 3 in peak RAM, and 10 to \(10^{4}\) on the average error of the modes.

In addition, its multi-level parallelism, by reserving around sixty processors, can provide additional gains of the order of 20 in time and 2 in peak RAM (cf. tables 10-1), without loss of precision, without restriction of scope, and with the same numerical behavior.

perf016a test case (N=4M, 50 modes)

| Configuration | Time elapsed | Peak RAM memory |
|---|---|---|
| Splitting into 8 sub-bands, 1 processor | 5524 s | 16.9 GB |
| Splitting into 8 sub-bands, 8 processors | 1002 s | 19.5 GB |
| Splitting into 8 sub-bands, 32 processors | 643 s | 13.4 GB |
| Splitting into 4 sub-bands, 1 processor | 3569 s | 17.2 GB |
| Splitting into 4 sub-bands, 4 processors | 1121 s | 19.5 GB |
| Splitting into 4 sub-bands, 16 processors | 663 s | 12.9 GB |

_images/100002000000006100000093173F01BE9043012A.png _images/1000020000000070000000638B5ED7ADADB0857C.png

Seismic study (N=0.7M, 450 modes)

| Configuration | Time elapsed | Peak RAM memory |
|---|---|---|
| Splitting into 20 sub-bands, 1 processor | 5200 s | 10.5 GB |
| Splitting into 20 sub-bands, 20 processors | 407 s | 12.1 GB |
| Splitting into 20 sub-bands, 80 processors | 270 s | 9.4 GB |
| Splitting into 5 sub-bands, 1 processor | 4660 s | 8.2 GB |
| Splitting into 5 sub-bands, 5 processors | 1097 s | 11.8 GB |
| Splitting into 5 sub-bands, 20 processors | 925 s | 9.5 GB |

Figures-Tables 10-1a/b. Some test results of CALC_MODES in parallel with the default settings (SOLVEUR='MUMPS' in IN_CORE and RENUM='QAMD').

Code_Aster v11.3.11 on the IVANOE machine (1 or 2 MPI processes per node).
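These orders of magnitude can be checked directly by simple ratios of the elapsed times in tables 10-1a/b (a reading aid only, not an additional measurement): for the seismic study, \(5200\,\mathrm{s}/270\,\mathrm{s}\approx 19\) on 80 processors; for perf016a, \(5524\,\mathrm{s}/643\,\mathrm{s}\approx 8.6\) on 32 processors.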

The principle of CALC_MODES [U4.52.02] with the “BANDE” option divided into sub-bands is based on the fact that the computational and memory costs of modal algorithms grow more than linearly with the number of modes sought. So, as with domain decomposition [R6.01.03], the search for hundreds of modes is broken down into packages of more reasonable size.

A package of about forty modes seems to be the empirical optimum in sequential mode. In parallel, performance can be improved further by going down to about fifteen modes per package.

The example in Figure 10-2 thus illustrates a global CALC_MODES calculation over the band \([\mathit{freq}_{\mathit{min}},\mathit{freq}_{\mathit{max}}]\), which is often advantageously replaced by ten CALC_MODES calculations targeted at equivalent contiguous sub-bands \([\mathit{freq}_{1}=\mathit{freq}_{\mathit{min}},\mathit{freq}_{2}],[\mathit{freq}_{2},\mathit{freq}_{3}],\dots,[\mathit{freq}_{10},\mathit{freq}_{11}=\mathit{freq}_{\mathit{max}}]\).

In addition, this type of decomposition makes it possible to:

  • reduce robustness problems,

  • improve and homogenize modal errors.

In practice, this modal operator dedicated to HPC is broken down into four main steps:

  1. Modal pre-calibration (via INFO_MODE) of the sub-bands configured by the user:

So potentially, a loop of nb_freq independent calculations, one per modal frequency position (cf. [R5.01.04]).

  2. Effective modal calculation (via CALC_MODES + OPTION='BANDE' + CALC_FREQ=_F(TABLE_FREQ)) of the modes contained in each non-empty sub-band (reusing the modal calibrations from step 1):

So potentially, a loop of nb_sbande_nonempty < nb_freq independent calculations.

  3. Post-verification with a Sturm test on the extreme bounds of the calculated modes (via INFO_MODE):

So potentially, a loop of 2 independent calculations.

  4. Post-processing of all the modes obtained: normalization (via NORM_MODE) and filtering (via EXTR_MODE); see the command-file sketch below.

_images/1000020000000250000001A42E56421D95A93CE4.png

Figure 10-2. Principle of the decomposition of the calculations of CALC_MODES with the “BANDE” option divided into sub-bands.
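To make these four steps concrete, here is a hedged command-file sketch of the sequence that the macro-command chains internally (Code_Aster Python syntax). The matrices rigi and mass, the frequency list, the norm and the filtering criterion are placeholders chosen for the illustration; only INFO_MODE, CALC_MODES with OPTION='BANDE' and CALC_FREQ=_F(TABLE_FREQ), NORM_MODE and EXTR_MODE are taken from the text, and the exact keywords may vary from one version to another:

```python
# 1. Modal pre-calibration: Sturm counts of the eigenvalues in each
#    user-defined sub-band (placeholder frequency list).
tab_freq = INFO_MODE(MATR_RIGI=rigi, MATR_MASS=mass,
                     FREQ=(0.0, 50.0, 100.0, 150.0, 200.0))

# 2. Effective modal calculation on each non-empty sub-band, reusing
#    the calibrations of step 1 through the TABLE_FREQ table.
modes = CALC_MODES(MATR_RIGI=rigi, MATR_MASS=mass,
                   OPTION='BANDE',
                   CALC_FREQ=_F(TABLE_FREQ=tab_freq),
                   SOLVEUR=_F(METHODE='MUMPS'))

# 3. The post-verification (Sturm test on the extreme bounds of the
#    calculated modes) is handled by the macro-command itself.

# 4. Post-processing: normalization then filtering of the modes
#    (the norm and the filtering criterion below are only examples).
modes = NORM_MODE(reuse=modes, MODE=modes, NORME='TRAN_ROTA')
modes_f = EXTR_MODE(FILTRE_MODE=_F(MODE=modes,
                                   CRIT_EXTR='MASS_EFFE_UN',
                                   SEUIL=1.e-3))
```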

In parallel, each of these calculation steps exhibits at least one level of parallelism:

  • The first two, by distributing the calculations of the different sub-bands over as many processor packages.

  • The third, by distributing the modal positions of the two bounds of the verification interval over two processor packages.

  • The fourth step, which is inexpensive, remains sequential.

If the number of processors and the settings allow it (in particular, if we use the linear solver MUMPS), we can use a second level of parallelism.

Figure 10-3 illustrates a modal calculation seeking to take advantage of 40 processors by breaking the initial calculation down into ten search sub-bands. Each sub-band benefits from 4 MUMPS occurrences for the inversions of the linear systems intensively required by the modal solvers.

For an exhaustive presentation of this multi-level parallelism, its challenges and some technical and functional details, you can consult the documentation [R5.01.04], [U4.52.01/02] and [U2.08.06].

_images/100002000000023C000001A38B0F01FF7F09E47B.png

Figure 10-3. Example of the two levels of parallelism in the pre-processing INFO_MODE and in the loop over the sub-bands of CALC_MODES.

Distribution over nb_proc=4x10 processors with a division into 10 sub-bands (parallelism referred to as « 10x4 »).

Here the linear solver MUMPS is used with the default parallelism setting (“COMPLET”).
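The « 10x4 » naming is simply the bookkeeping nb_proc = nb_sub_bands x nb_proc_per_band. The following sketch is purely illustrative (plain Python, not Code_Aster code; the variable names are invented for the illustration):

```python
# Illustrative only: how 40 MPI processes map onto the two levels.
nb_proc = 40                                  # processes reserved for the study
nb_sub_bands = 10                             # frequency sub-bands (first level)
nb_proc_per_band = nb_proc // nb_sub_bands    # 4 MUMPS processes per sub-band (second level)

for rank in range(nb_proc):
    sub_band = rank // nb_proc_per_band       # sub-band handled by this process
    local_id = rank % nb_proc_per_band        # its position within the MUMPS occurrence
```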

Notes:

  • In MPI parallelism, the main steps concern the distribution of the tasks and their communications. For CALC_MODES with the “BANDE” option divided into sub-bands, the distribution is done both in the Python macro-command and in the Fortran; the two communicate through a hidden keyword, PARALLELISME_MACRO. But all MPI calls are confined to the F77/F90 layers.

  • The global communications of the first level, those of the eigenvalues and eigenvectors, are carried out at the end of the modal calculation on each sub-band. This is an intermediate level between the simple communication of linear-algebra results (of the kind done around MUMPS/PETSc) and the communication of Aster data structures in Python after filtering (optimal in terms of performance but much more complicated to implement).

  • Ideally, the frequency sub-bands would have been balanced empirically so as to limit the load imbalances linked to the distribution of the modes between sub-bands and those linked to the modal calculation itself. It would then have been possible to assign 2, 4 or 8 sub-band calculations per processor, which would also have made it possible to benefit from the gains of decomposing the macro calculation even on a few processors. Unfortunately, software constraints in handling potentially empty user concepts did not allow this more ambitious scenario to be validated.